Data Wrangling with Pandas: Quick Reference Guide

Data Wrangling with pandas Cheat Sheet

IMG_7789.jpeg

Creating DataFrames

  • Creating DataFrames by specifying values for each column:

    df = pd.DataFrame(
    {
    "a": [4, 5, 6],
    "b": [7, 8, 9],
    "c": [10, 11, 12]
    },
    index=[1, 2, 3]
    )

    This DataFrame has indexed rows and specified values for columns a, b, and c.

  • Creating DataFrames by specifying values for each row:

    df = pd.DataFrame(
    [
    [4, 7, 10],
    [5, 8, 11],
    [6, 9, 12]
    ],
    index=[1, 2, 3],
    columns=['a', 'b', 'c']
    )

    This DataFrame uses a list of lists to define each row, indexing the rows and specifying columns.

  • Creating DataFrame with MultiIndex:

    df = pd.DataFrame(
    {
    "a": [4, 5, 6],
    "b": [7, 8, 9],
    "c": [10, 11, 12]
    },
    index=pd.MultiIndex.from_tuples(
    [
    ('d', 1),
    ('d', 2),
    ('e', 2)
    ],
    names=['n', 'v']
    )
    )

    MultiIndex allows hierarchical indexing, providing multiple levels of indexing.

Method Chaining

  • Chaining methods for cleaner code:
    df = (pd.melt(df)
    .rename(columns={
    'variable': 'var',
    'value': 'val'
    })
    .query('val >= 200'))
    This snippet demonstrates the chaining of data manipulation methods for improved readability.

Tidy Data

  • Foundation for wrangling in pandas:
    • Each variable is saved in its own column.
    • Each observation is saved in its own row.

Reshaping Data

  • Change layout, sorting, reindexing, renaming:
    • Melt to gather columns into rows:
      pd.melt(df)
    • Pivot to spread rows into columns:
      df.pivot(columns='var', values='val')
    • Concatenate DataFrames vertically or horizontally:
      pd.concat([df1, df2])
      pd.concat([df1, df2], axis=1)
    • Sort values in DataFrame:
      df.sort_values('mpg')
      df.sort_values('mpg', ascending=False)
    • Rename columns:
      df.rename(columns={'y': 'year'})
    • Set or reset index:
      df.set_index()
      df.reset_index()
    • Drop columns:
      df.drop(columns=['Length', 'Height'])

Subset Observations - Rows

  • Extract rows that meet logical criteria:
    df[df.Length > 7]
  • Remove duplicate rows:
    df.drop_duplicates()
  • Random sampling:
    df.sample(frac=0.5)
    df.sample(n=200)
  • ISNull/NotNull:
    df.isnull().sum()
    2
    df.dropna()

Subset Variables - Columns

  • Select columns:
    df[['width', 'length', 'species']]
  • Select columns with specific names:
    df.width
  • Select columns whose name matches regex:
    df.filter(regex='regex')

Subsets - Rows and Columns

  • .loc and .iloc for selection:
    Use df.loc[] and df.iloc[] to select rows, columns, or both.
    Use df.at[] and df.iat[] to access a single value by row and column.
  • .iloc examples:
    df.iloc[10:20]
    df.iloc[:, 1:2, 5]
  • .loc examples:
    df.loc[:, 'x2':'x1']
    df.loc[:, 'a':'c']
    df.loc[df['a'] > 10]
  • Access single value using index:
    df.iat[1, 2]
    df.at[4, 'A']

Using query

  • Boolean expressions for filtering:
    df.query('Length > 7')
    df.query('Length > 7 and Width < 8')
    df.query('index.str.startswith("abc")', engine="python")

Logic in Python (and pandas)

  • Symbols and their meaning:
    • Less than: <
    • Greater than: >
    • Less than or equal to: <=
    • Greater than or equal to: >=
    • Not equal to: !=
    • Equality: ==
    • isin(values): df.column.isin(values)
    • notin(values): ~df.column.isin(values)
    • isnull(obj): pd.isnull(obj)
    • notnull(obj): pd.notnull(obj)
    • any(): df.any(1)
    • all(): df.all(1)

Regex (Regular Expressions) Examples

  • Matching patterns:
    • r'^abc$': Matches strings containing a period .
    • r'^[a-zA-Z]+$': Matches strings starting with word
    • r'^\d{1,9}s': Matches strings starting with word
    • r'\b(Spal)': Matches strings beginning with word
    • df.columns.regex.r\1,2,3,4,5$: Matches strings ending with word
    • r'\baW\b': Matches Nan.
    • `r'.*()': Matches strings with 'Species'
    • `\baW\b': Matches strings with specific names.

Note: This summary provides an overview of key functionalities and examples, ensuring pandas users can quickly reference and understand the essential operations for effective data manipulation.

Reference:

pandas.pydata.org
[PDF] Creating DataFrames Method Chaining Data Wrangling - Pandas
www.studocu.com
Pandas Cheat Sheet - Creating DataFrames Reshaping Data
www.datacamp.com
Pandas Cheat Sheet: Data Wrangling in Python - DataCamp