Data Wrangling with Pandas: Quick Reference Guide
Data Wrangling with pandas Cheat Sheet
Creating DataFrames
-
Creating DataFrames by specifying values for each column:
df = pd.DataFrame( { "a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12] }, index=[1, 2, 3] )
This DataFrame has indexed rows and specified values for columns a, b, and c.
-
Creating DataFrames by specifying values for each row:
df = pd.DataFrame( [ [4, 7, 10], [5, 8, 11], [6, 9, 12] ], index=[1, 2, 3], columns=['a', 'b', 'c'] )
This DataFrame uses a list of lists to define each row, indexing the rows and specifying columns.
-
Creating DataFrame with MultiIndex:
df = pd.DataFrame( { "a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12] }, index=pd.MultiIndex.from_tuples( [ ('d', 1), ('d', 2), ('e', 2) ], names=['n', 'v'] ) )
MultiIndex allows hierarchical indexing, providing multiple levels of indexing.
Method Chaining
- Chaining methods for cleaner code:
This snippet demonstrates the chaining of data manipulation methods for improved readability.df = (pd.melt(df) .rename(columns={ 'variable': 'var', 'value': 'val' }) .query('val >= 200'))
Tidy Data
- Foundation for wrangling in pandas:
- Each variable is saved in its own column.
- Each observation is saved in its own row.
Reshaping Data
- Change layout, sorting, reindexing, renaming:
- Melt to gather columns into rows:
pd.melt(df)
- Pivot to spread rows into columns:
df.pivot(columns='var', values='val')
- Concatenate DataFrames vertically or horizontally:
pd.concat([df1, df2]) pd.concat([df1, df2], axis=1)
- Sort values in DataFrame:
df.sort_values('mpg') df.sort_values('mpg', ascending=False)
- Rename columns:
df.rename(columns={'y': 'year'})
- Set or reset index:
df.set_index() df.reset_index()
- Drop columns:
df.drop(columns=['Length', 'Height'])
- Melt to gather columns into rows:
Subset Observations - Rows
- Extract rows that meet logical criteria:
df[df.Length > 7]
- Remove duplicate rows:
df.drop_duplicates()
- Random sampling:
df.sample(frac=0.5) df.sample(n=200)
- ISNull/NotNull:
df.isnull().sum() 2 df.dropna()
Subset Variables - Columns
- Select columns:
df[['width', 'length', 'species']]
- Select columns with specific names:
df.width
- Select columns whose name matches regex:
df.filter(regex='regex')
Subsets - Rows and Columns
- .loc and .iloc for selection:
Use df.loc[] and df.iloc[] to select rows, columns, or both. Use df.at[] and df.iat[] to access a single value by row and column.
- .iloc examples:
df.iloc[10:20] df.iloc[:, 1:2, 5]
- .loc examples:
df.loc[:, 'x2':'x1'] df.loc[:, 'a':'c'] df.loc[df['a'] > 10]
- Access single value using index:
df.iat[1, 2] df.at[4, 'A']
Using query
- Boolean expressions for filtering:
df.query('Length > 7') df.query('Length > 7 and Width < 8') df.query('index.str.startswith("abc")', engine="python")
Logic in Python (and pandas)
- Symbols and their meaning:
- Less than:
<
- Greater than:
>
- Less than or equal to:
<=
- Greater than or equal to:
>=
- Not equal to:
!=
- Equality:
==
- isin(values):
df.column.isin(values)
- notin(values):
~df.column.isin(values)
- isnull(obj):
pd.isnull(obj)
- notnull(obj):
pd.notnull(obj)
- any():
df.any(1)
- all():
df.all(1)
- Less than:
Regex (Regular Expressions) Examples
- Matching patterns:
r'^abc$'
: Matches strings containing a period.
r'^[a-zA-Z]+$'
: Matches strings starting withword
r'^\d{1,9}s'
: Matches strings starting withword
r'\b(Spal)'
: Matches strings beginning withword
df.columns.regex.r\1,2,3,4,5$
: Matches strings ending withword
r'\baW\b'
: Matches Nan.- `r'.*()': Matches strings with 'Species'
- `\baW\b': Matches strings with specific names.
Note: This summary provides an overview of key functionalities and examples, ensuring pandas users can quickly reference and understand the essential operations for effective data manipulation.
Reference: