Data Wrangling with Pandas: Quick Reference Guide

Data Wrangling with Pandas: Quick Reference Guide

Data Wrangling with pandas Cheat Sheet

IMG_7789.jpeg

Creating DataFrames

  • Creating DataFrames by specifying values for each column:

    df = pd.DataFrame(
        {
            "a": [4, 5, 6],
            "b": [7, 8, 9],
            "c": [10, 11, 12]
        },
        index=[1, 2, 3]
    )
    

    This DataFrame has indexed rows and specified values for columns a, b, and c.

  • Creating DataFrames by specifying values for each row:

    df = pd.DataFrame(
        [
            [4, 7, 10],
            [5, 8, 11],
            [6, 9, 12]
        ],
        index=[1, 2, 3],
        columns=['a', 'b', 'c']
    )
    

    This DataFrame uses a list of lists to define each row, indexing the rows and specifying columns.

  • Creating DataFrame with MultiIndex:

    df = pd.DataFrame(
        {
            "a": [4, 5, 6],
            "b": [7, 8, 9],
            "c": [10, 11, 12]
        },
        index=pd.MultiIndex.from_tuples(
            [
                ('d', 1),
                ('d', 2),
                ('e', 2)
            ],
            names=['n', 'v']
        )
    )
    

    MultiIndex allows hierarchical indexing, providing multiple levels of indexing.

Method Chaining

  • Chaining methods for cleaner code:
    df = (pd.melt(df)
          .rename(columns={
              'variable': 'var',
              'value': 'val'
          })
          .query('val >= 200'))
    
    This snippet demonstrates the chaining of data manipulation methods for improved readability.

Tidy Data

  • Foundation for wrangling in pandas:
    • Each variable is saved in its own column.
    • Each observation is saved in its own row.

Reshaping Data

  • Change layout, sorting, reindexing, renaming:
    • Melt to gather columns into rows:
      pd.melt(df)
      
    • Pivot to spread rows into columns:
      df.pivot(columns='var', values='val')
      
    • Concatenate DataFrames vertically or horizontally:
      pd.concat([df1, df2])
      pd.concat([df1, df2], axis=1)
      
    • Sort values in DataFrame:
      df.sort_values('mpg')
      df.sort_values('mpg', ascending=False)
      
    • Rename columns:
      df.rename(columns={'y': 'year'})
      
    • Set or reset index:
      df.set_index()
      df.reset_index()
      
    • Drop columns:
      df.drop(columns=['Length', 'Height'])
      

Subset Observations - Rows

  • Extract rows that meet logical criteria:
    df[df.Length > 7]
    
  • Remove duplicate rows:
    df.drop_duplicates()
    
  • Random sampling:
    df.sample(frac=0.5)
    df.sample(n=200)
    
  • ISNull/NotNull:
    df.isnull().sum()
    2
    df.dropna()
    

Subset Variables - Columns

  • Select columns:
    df[['width', 'length', 'species']]
    
  • Select columns with specific names:
    df.width
    
  • Select columns whose name matches regex:
    df.filter(regex='regex')
    

Subsets - Rows and Columns

  • .loc and .iloc for selection:
    Use df.loc[] and df.iloc[] to select rows, columns, or both.
    Use df.at[] and df.iat[] to access a single value by row and column.
    
  • .iloc examples:
    df.iloc[10:20]
    df.iloc[:, 1:2, 5]
    
  • .loc examples:
    df.loc[:, 'x2':'x1']
    df.loc[:, 'a':'c']
    df.loc[df['a'] > 10]
    
  • Access single value using index:
    df.iat[1, 2]
    df.at[4, 'A']
    

Using query

  • Boolean expressions for filtering:
    df.query('Length > 7')
    df.query('Length > 7 and Width < 8')
    df.query('index.str.startswith("abc")', engine="python")
    

Logic in Python (and pandas)

  • Symbols and their meaning:
    • Less than: <
    • Greater than: >
    • Less than or equal to: <=
    • Greater than or equal to: >=
    • Not equal to: !=
    • Equality: ==
    • isin(values): df.column.isin(values)
    • notin(values): ~df.column.isin(values)
    • isnull(obj): pd.isnull(obj)
    • notnull(obj): pd.notnull(obj)
    • any(): df.any(1)
    • all(): df.all(1)

Regex (Regular Expressions) Examples

  • Matching patterns:
    • r'^abc$': Matches strings containing a period .
    • r'^[a-zA-Z]+$': Matches strings starting with word
    • r'^\d{1,9}s': Matches strings starting with word
    • r'\b(Spal)': Matches strings beginning with word
    • df.columns.regex.r\1,2,3,4,5$: Matches strings ending with word
    • r'\baW\b': Matches Nan.
    • `r'.*()': Matches strings with 'Species'
    • `\baW\b': Matches strings with specific names.

Note: This summary provides an overview of key functionalities and examples, ensuring pandas users can quickly reference and understand the essential operations for effective data manipulation.

Reference: