Data Wrangling with Pandas: Quick Reference Guide

Data Wrangling with pandas Cheat Sheet

Creating DataFrames

Creating DataFrames by specifying values for each column:

df = pd.DataFrame(
    {
        "a": [4, 5, 6],
        "b": [7, 8, 9],
        "c": [10, 11, 12]
    },
    index=[1, 2, 3]
)

This DataFrame has indexed rows and specified values for columns a, b, and c.

Creating DataFrames by specifying values for each row:

df = pd.DataFrame(
    [
        [4, 7, 10],
        [5, 8, 11],
        [6, 9, 12]
    ],
    index=[1, 2, 3],
    columns=['a', 'b', 'c']
)

This DataFrame uses a list of lists to define each row, indexing the rows and specifying columns.

Creating DataFrame with MultiIndex:

df = pd.DataFrame(
    {
        "a": [4, 5, 6],
        "b": [7, 8, 9],
        "c": [10, 11, 12]
    },
    index=pd.MultiIndex.from_tuples(
        [
            ('d', 1),
            ('d', 2),
            ('e', 2)
        ],
        names=['n', 'v']
    )
)

MultiIndex allows hierarchical indexing, providing multiple levels of indexing.

Method Chaining

Chaining methods for cleaner code:

df = (pd.melt(df)
      .rename(columns={
          'variable': 'var',
          'value': 'val'
      })
      .query('val >= 200'))

This snippet demonstrates the chaining of data manipulation methods for improved readability.

Tidy Data

Foundation for wrangling in pandas:
- Each variable is saved in its own column.
- Each observation is saved in its own row.

Reshaping Data

Change layout, sorting, reindexing, renaming:

Melt to gather columns into rows:
```
pd.melt(df)
```

Pivot to spread rows into columns:

df.pivot(columns='var', values='val')

Concatenate DataFrames vertically or horizontally:

pd.concat([df1, df2])
pd.concat([df1, df2], axis=1)

Sort values in DataFrame:

df.sort_values('mpg')
df.sort_values('mpg', ascending=False)

Rename columns:

df.rename(columns={'y': 'year'})

Set or reset index:

df.set_index()
df.reset_index()

Drop columns:

df.drop(columns=['Length', 'Height'])

Subset Observations - Rows

Extract rows that meet logical criteria:
```
df[df.Length > 7]
```

Remove duplicate rows:

df.drop_duplicates()

Random sampling:

df.sample(frac=0.5)
df.sample(n=200)

ISNull/NotNull:

df.isnull().sum()
2
df.dropna()

Subset Variables - Columns

Select columns:

df[['width', 'length', 'species']]

Select columns with specific names:
```
df.width
```

Select columns whose name matches regex:

df.filter(regex='regex')

Subsets - Rows and Columns

.loc and .iloc for selection:

Use df.loc[] and df.iloc[] to select rows, columns, or both.
Use df.at[] and df.iat[] to access a single value by row and column.

.iloc examples:

df.iloc[10:20]
df.iloc[:, 1:2, 5]

.loc examples:

df.loc[:, 'x2':'x1']
df.loc[:, 'a':'c']
df.loc[df['a'] > 10]

Access single value using index:

df.iat[1, 2]
df.at[4, 'A']

Using query

Boolean expressions for filtering:

df.query('Length > 7')
df.query('Length > 7 and Width < 8')
df.query('index.str.startswith("abc")', engine="python")

Logic in Python (and pandas)

Symbols and their meaning:
- Less than: <
- Greater than: >
- Less than or equal to: <=
- Greater than or equal to: >=
- Not equal to: !=
- Equality: ==
- isin(values): df.column.isin(values)
- notin(values): ~df.column.isin(values)
- isnull(obj): pd.isnull(obj)
- notnull(obj): pd.notnull(obj)
- any(): df.any(1)
- all(): df.all(1)

Regex (Regular Expressions) Examples

Matching patterns:
- r'^abc$': Matches strings containing a period .
- r'^[a-zA-Z]+$': Matches strings starting with word
- r'^\d{1,9}s': Matches strings starting with word
- r'\b(Spal)': Matches strings beginning with word
- df.columns.regex.r\1,2,3,4,5$: Matches strings ending with word
- r'\baW\b': Matches Nan.
- `r'.*()': Matches strings with 'Species'
- `\baW\b': Matches strings with specific names.

Note: This summary provides an overview of key functionalities and examples, ensuring pandas users can quickly reference and understand the essential operations for effective data manipulation.

Reference:

pandas.pydata.org

[PDF] Creating DataFrames Method Chaining Data Wrangling - Pandas

www.studocu.com

Pandas Cheat Sheet - Creating DataFrames Reshaping Data

www.datacamp.com

Pandas Cheat Sheet: Data Wrangling in Python - DataCamp