How AlmaBetter created an
IMPACT!Arunav Goswami
Data Science Consultant at almaBetter
Explore the essential Pandas cheat sheet for Python, covering DataFrames, data manipulation, cleaning, aggregation, and visualization, tailored for data science
Pandas is an essential data manipulation library in Python, built on top of NumPy, allowing users to work with labeled data in Python. It’s especially useful for handling structured data, and its DataFrame and Series structures allow easy manipulation and analysis. This cheat sheet covers the basics of Pandas, focusing on DataFrames, data manipulation, and commonly used functions to boost data science workflows in Python.
DataFrames are the primary data structure in Pandas, representing a table-like collection of data.
import pandas as pd # From dictionary data = {'A': [1, 2, 3], 'B': [4, 5, 6]} df = pd.DataFrame(data) # From list of lists data = [[1, 4], [2, 5], [3, 6]] df = pd.DataFrame(data, columns=['A', 'B']) |
df.head() # First 5 rows df.tail() # Last 5 rows df.shape # Dimensions (rows, columns) df.info() # Summary of data df.describe() # Summary statistics df.columns # Column names df.dtypes # Data types of columns df.isnull().sum() # Count missing values per column |
# Select column df['column_name'] df.column_name # Select multiple columns df[['col1', 'col2']] # Select rows by index df.iloc[0] # First row df.iloc[0:5] # First 5 rows # Select rows by label df.loc[0] # Row with label 0 df.loc[0:5, 'col1'] # Rows 0-5 for col1 |
# Filter by a single condition filtered_df = df[df['A'] > 2] # Filter by multiple conditions filtered_df = df[(df['A'] > 1) & (df['B'] < 6)] |
# Add Column
|
# Sort by column values df.sort_values(by='col1', ascending=False) # Sort by index df.sort_index() |
Data cleaning is critical for preparing data for analysis. Pandas offers many methods to clean data effectively.
# Check for Missing Data:
df.fillna(0, inplace=True)
|
# Drop duplicate rows
|
# Convert column data types df['Column'] = df['Column'].astype('int') |
# Replace specific values df['Column'].replace({0: 'Zero', 1: 'One'}, inplace=True) |
# Rename columns df.rename(columns={'old_name': 'new_name'}, inplace=True) |
Aggregation functions allow for summarizing and gaining insights into data patterns.
# Group by a column and aggregate df.groupby('col1').sum() df.groupby('Col').count() df.groupby('col1')['col2'].mean() # Mean of col2 grouped by col1 # Multiple aggregations df.groupby('col1').agg({'col2': 'sum', 'col3': 'mean'}) |
df['col'].mean() # Mean df['col'].median() # Median df['col'].mode() # Mode df['col'].std() # Standard deviation df['col'].var() # Variance df['col'].min() # Minimum df['col'].max() # Maximum df['col'].quantile(0.75) # 75th percentile |
# Apply function to each column df.apply(lambda x: x*2) # Apply function to each row df.apply(lambda row: row.sum(), axis=1) # Apply function to a single column df['col'] = df['col'].apply(lambda x: x*2) |
Advanced operations in Pandas enhance the ability to manage and reshape data for complex data science tasks.
# Merge two DataFrames on a key pd.merge(df1, df2, on='key') # Left, Right, Outer Join pd.merge(df1, df2, on='key', how='left') pd.merge(df1, df2, on='key', how='right') pd.merge(df1, df2, on='key', how='outer') # Join two DataFrames by index joined_df = df1.join(df2, lsuffix='_left', rsuffix='_right') # Concatenate along rows or columns pd.concat([df1, df2], axis=0) # Rows pd.concat([df1, df2], axis=1) # Columns |
# Pivot table df.pivot_table(values='col', index='col1', columns='col2', aggfunc='mean') # Crosstab pd.crosstab(df['col1'], df['col2'] |
# Melt (Unpivot) df.melt(id_vars=['A'], value_vars=['B', 'C']) # Pivot (Opposite of melt) df.pivot(index='A', columns='B', values='C') |
# Convert to datetime df['date'] = pd.to_datetime(df['date']) # Extract date parts df['year'] = df['date'].dt.year df['month'] = df['date'].dt.month df['day'] = df['date'].dt.day |
# Convert to lowercase df['col'] = df['col'].str.lower() # Check if a string contains a pattern df[df['col'].str.contains('pattern')] # Replace string patterns df['col'] = df['col'].str.replace('old', 'new') |
Pandas simplifies data import and export, making it versatile for reading and saving various file formats.
# CSV df = pd.read_csv('file.csv') # Excel df = pd.read_excel('file.xlsx') # JSON df = pd.read_json('file.json') |
# CSV df.to_csv('output.csv', index=False) # Excel df.to_excel('output.xlsx', index=False) # JSON df.to_json('file.json', orient='records') |
For quick visualization, Pandas provides basic plotting capabilities based on Matplotlib.
import matplotlib.pyplot as plt # Line plot df.plot(kind='line') plt.show() # Bar plot df.plot(kind='bar') plt.show() # Histogram df['Column'].plot(kind='hist') plt.show() |
# Customize plot style and labels df.plot(kind='line', color='green', linewidth=2, title='Line Plot') plt.xlabel('X-axis Label') plt.ylabel('Y-axis Label') plt.show() |
# Scatter plot between two columns df.plot.scatter(x='Column1', y='Column2') plt.show() |
This Python Pandas cheat sheet provides an essential overview of functions, from DataFrame manipulation to data cleaning, aggregation, and visualization. By mastering these core features, users can efficiently manage and analyze data in Python, making it an indispensable tool in data science workflows.
More Cheat Sheets and Top Picks