Data Wrangling with Pandas: Quick Reference Guide
Data Wrangling with pandas Cheat Sheet
Creating DataFrames
-
Creating DataFrames by specifying values for each column:
df = pd.DataFrame({"a": [4, 5, 6],"b": [7, 8, 9],"c": [10, 11, 12]},index=[1, 2, 3])This DataFrame has indexed rows and specified values for columns a, b, and c.
-
Creating DataFrames by specifying values for each row:
df = pd.DataFrame([[4, 7, 10],[5, 8, 11],[6, 9, 12]],index=[1, 2, 3],columns=['a', 'b', 'c'])This DataFrame uses a list of lists to define each row, indexing the rows and specifying columns.
-
Creating DataFrame with MultiIndex:
df = pd.DataFrame({"a": [4, 5, 6],"b": [7, 8, 9],"c": [10, 11, 12]},index=pd.MultiIndex.from_tuples([('d', 1),('d', 2),('e', 2)],names=['n', 'v']))MultiIndex allows hierarchical indexing, providing multiple levels of indexing.
Method Chaining
- Chaining methods for cleaner code:
This snippet demonstrates the chaining of data manipulation methods for improved readability.df = (pd.melt(df).rename(columns={'variable': 'var','value': 'val'}).query('val >= 200'))
Tidy Data
- Foundation for wrangling in pandas:
- Each variable is saved in its own column.
- Each observation is saved in its own row.
Reshaping Data
- Change layout, sorting, reindexing, renaming:
- Melt to gather columns into rows:
pd.melt(df)
- Pivot to spread rows into columns:
df.pivot(columns='var', values='val')
- Concatenate DataFrames vertically or horizontally:
pd.concat([df1, df2])pd.concat([df1, df2], axis=1)
- Sort values in DataFrame:
df.sort_values('mpg')df.sort_values('mpg', ascending=False)
- Rename columns:
df.rename(columns={'y': 'year'})
- Set or reset index:
df.set_index()df.reset_index()
- Drop columns:
df.drop(columns=['Length', 'Height'])
- Melt to gather columns into rows:
Subset Observations - Rows
- Extract rows that meet logical criteria:
df[df.Length > 7]
- Remove duplicate rows:
df.drop_duplicates()
- Random sampling:
df.sample(frac=0.5)df.sample(n=200)
- ISNull/NotNull:
df.isnull().sum()2df.dropna()
Subset Variables - Columns
- Select columns:
df[['width', 'length', 'species']]
- Select columns with specific names:
df.width
- Select columns whose name matches regex:
df.filter(regex='regex')
Subsets - Rows and Columns
- .loc and .iloc for selection:
Use df.loc[] and df.iloc[] to select rows, columns, or both.Use df.at[] and df.iat[] to access a single value by row and column.
- .iloc examples:
df.iloc[10:20]df.iloc[:, 1:2, 5]
- .loc examples:
df.loc[:, 'x2':'x1']df.loc[:, 'a':'c']df.loc[df['a'] > 10]
- Access single value using index:
df.iat[1, 2]df.at[4, 'A']
Using query
- Boolean expressions for filtering:
df.query('Length > 7')df.query('Length > 7 and Width < 8')df.query('index.str.startswith("abc")', engine="python")
Logic in Python (and pandas)
- Symbols and their meaning:
- Less than:
<
- Greater than:
>
- Less than or equal to:
<=
- Greater than or equal to:
>=
- Not equal to:
!=
- Equality:
==
- isin(values):
df.column.isin(values)
- notin(values):
~df.column.isin(values)
- isnull(obj):
pd.isnull(obj)
- notnull(obj):
pd.notnull(obj)
- any():
df.any(1)
- all():
df.all(1)
- Less than:
Regex (Regular Expressions) Examples
- Matching patterns:
r'^abc$'
: Matches strings containing a period.
r'^[a-zA-Z]+$'
: Matches strings starting withword
r'^\d{1,9}s'
: Matches strings starting withword
r'\b(Spal)'
: Matches strings beginning withword
df.columns.regex.r\1,2,3,4,5$
: Matches strings ending withword
r'\baW\b'
: Matches Nan.- `r'.*()': Matches strings with 'Species'
- `\baW\b': Matches strings with specific names.
Note: This summary provides an overview of key functionalities and examples, ensuring pandas users can quickly reference and understand the essential operations for effective data manipulation.
Reference:
pandas.pydata.org
[PDF] Creating DataFrames Method Chaining Data Wrangling - Pandas
www.studocu.com
Pandas Cheat Sheet - Creating DataFrames Reshaping Data
www.datacamp.com
Pandas Cheat Sheet: Data Wrangling in Python - DataCamp