
Machine Learning Cheat Sheet (Basics to Advanced)

Last Updated: 24th January, 2025

Arunav Goswami

Data Science Consultant at almaBetter

Explore this comprehensive machine learning cheat sheet covering algorithms, metrics, libraries, and concepts. Ideal for interviews and practical ML applications.

Machine learning (ML) is a pivotal aspect of artificial intelligence (AI) that equips systems with the ability to learn and improve from experience without being explicitly programmed. This cheat sheet compiles essential machine learning concepts, algorithms, metrics, and models to serve as a quick reference, especially for interviews and practical applications.

Key Steps in an ML Pipeline

Define Problem: Understand the business problem and formulate it for ML.
Data Collection: Gather relevant data.
Data Preprocessing:
- Handle missing values.
- Normalize or scale features.
- Encode categorical variables.
Feature Engineering: Create meaningful features and scale data as necessary.
Model Selection: Evaluate algorithms based on the problem type and data.
Cross-Validation: Use techniques like k-fold to estimate how well the model generalizes and to detect overfitting.
Hyperparameter Tuning: Optimize model performance using Grid Search or Random Search.
Interpretability: Use tools like SHAP or LIME to explain model predictions.
Deployment: Deploy the model into production.
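The cross-validation step in the pipeline above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `k_fold_indices` is hypothetical, and the folds are contiguous with no shuffling (in practice the data is usually shuffled first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        # Each fold is held out once; the rest is used for training.
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Example: 10 samples split into 3 folds of sizes 4, 3, and 3.
folds = list(k_fold_indices(n_samples=10, k=3))
```

Each sample lands in exactly one test fold, so averaging a metric over the folds uses every sample for evaluation once.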

ML Concepts Cheat Sheet

Supervised Learning
- Uses labeled data for training.
- Goals include classification (e.g., spam detection) and regression (e.g., house price prediction).
Unsupervised Learning
- Utilizes unlabeled data to uncover hidden patterns.
- Common tasks: clustering (e.g., customer segmentation) and dimensionality reduction (e.g., PCA).
Semi-Supervised Learning
- Combines a small amount of labeled data with a large unlabeled dataset.
- Useful in scenarios with limited labeled data.
Reinforcement Learning
- Agents learn optimal actions by interacting with an environment.
- Example: AlphaGo mastering Go by self-play.

Machine Learning Algorithms Cheat Sheet

Supervised Learning Algorithms

Linear Regression
- Predicts continuous values.
- Equation: y = mx + c, where m is the slope and c is the intercept.
Logistic Regression
- For binary classification problems.
- Outputs a probability between 0 and 1, which is thresholded to assign a class.
Decision Trees
- Splits data hierarchically based on features.
- Suitable for both regression and classification.
Random Forest
- Ensemble method using multiple decision trees.
- Reduces overfitting and typically improves accuracy compared with a single decision tree.
Support Vector Machines (SVM)
- Effective for high-dimensional spaces.
- Finds the hyperplane that separates the classes with the largest margin.
k-Nearest Neighbors (k-NN)
- Classifies data based on proximity to its k nearest neighbors.
- Non-parametric and lazy learning algorithm.
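As a small illustration of the k-NN entry above, the sketch below classifies a query point by majority vote among its k closest training points. It assumes Euclidean distance and uses made-up toy data; the function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point (Euclidean, an assumption).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # There is no training step: all work happens at prediction time ("lazy").
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
prediction = knn_predict(points, labels, (1.1, 1.0), k=3)
```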

Unsupervised Learning Algorithms

K-Means Clustering
- Groups data into k clusters.
- Iterative approach with centroid adjustments.
Hierarchical Clustering
- Builds nested clusters using dendrograms.
- Works in both agglomerative and divisive modes.
Principal Component Analysis (PCA)
- Reduces data dimensionality while retaining as much variance as possible.
- Useful for visualization and speeding up algorithms.
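The "iterative approach with centroid adjustments" in K-Means can be sketched as alternating assignment and update steps. This is a minimal version under stated assumptions: Euclidean distance, caller-supplied starting centroids, and a hypothetical function name `kmeans`. Real libraries add smarter initialization and multiple restarts.

```python
import math


def kmeans(points, initial_centroids, max_iters=100):
    """Return (centroids, clusters) after alternating assign/update steps."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


# Toy data, invented for illustration.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(data, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
```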

Reinforcement Learning Algorithms

Q-Learning: Model-free learning using a Q-table.
Deep Q-Networks (DQN): Combines Q-learning with deep neural networks.
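For the Q-table idea above, a single tabular Q-learning update can be sketched as follows. The learning rate `alpha`, discount factor `gamma`, and the state and action names are illustrative assumptions, not values from this cheat sheet.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update and return the new Q(state, action)."""
    # Best value reachable from the next state under the current table.
    best_next = max(q_table[(next_state, a)] for a in actions)
    # Move the current estimate a fraction alpha toward the TD target.
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# The Q-table maps (state, action) pairs to values; unseen pairs start at 0.
q = defaultdict(float)
actions = ["left", "right"]
first = q_update(q, state=0, action="right", reward=1.0, next_state=1, actions=actions)
second = q_update(q, state=0, action="right", reward=1.0, next_state=1, actions=actions)
```

A DQN replaces the table lookup with a neural network that estimates the same values.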

Machine Learning Metrics Cheat Sheet

Classification Metrics

Accuracy: Correct Predictions/Total Predictions
Formula = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)
Precision: Fraction of predicted positives that are actually positive.
Formula = True Positives / (True Positives + False Positives)
Recall: Ratio of true positives to actual positives.
Formula = True Positives / (True Positives + False Negatives)
F1-Score: Harmonic mean of precision and recall.
Formula = 2 * (Precision * Recall) / (Precision + Recall)
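The four formulas above can be computed directly from the confusion-matrix counts. The sketch below assumes binary labels with `1` as the positive class and returns 0.0 when a denominator is zero (a convention chosen here for illustration); the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for illustration: TP=3, FN=1, FP=1, TN=5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```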

Regression Metrics

Mean Squared Error (MSE): Average squared difference between predictions and actuals.
Root Mean Squared Error (RMSE): Square root of MSE.
R-squared (R²): Proportion of variance explained by the model.
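A short sketch of the three regression metrics above, using only the standard library. R-squared is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares; the function name and the toy data are illustrative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and R-squared for paired actual/predicted values."""
    n = len(y_true)
    # Residual sum of squares: squared gaps between actuals and predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mse = ss_res / n
    rmse = math.sqrt(mse)
    # Total sum of squares: spread of the actuals around their mean.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # undefined if all actuals are identical
    return {"mse": mse, "rmse": rmse, "r2": r2}


# Toy values, invented for illustration.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
scores = regression_metrics(actual, predicted)
```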

Clustering Metrics

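Silhouette Score: Measures how close each point is to its own cluster compared with the nearest other cluster. Ranges from -1 to 1; higher values indicate better-separated clusters.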
Davies-Bouldin Index: Lower values indicate better clustering.
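To make the silhouette score concrete, here is a minimal sketch assuming Euclidean distance and at least two clusters. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the lowest mean distance to any other cluster; the point's score is (b - a) / max(a, b). Singleton clusters score 0 by convention. The function name is hypothetical, and the Davies-Bouldin index is not covered here.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other points in the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 1-D data, invented for illustration.
pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(pts, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette_score(pts, [0, 1, 0, 1])   # clusters mixed together
```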

Key Machine Learning Concepts

Overfitting vs. Underfitting:
- Overfitting: Model too complex, captures noise.
- Underfitting: Model too simple, misses patterns.
Bias-Variance Tradeoff:
- High Bias → Underfitting.
- High Variance → Overfitting.
Regularization:
- L1 (Lasso): Adds λ∑|β| to the loss.
- L2 (Ridge): Adds λ∑β² to the loss.
Cross-Validation: Split data to evaluate model generalization (e.g., K-Fold CV).
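The L1 and L2 penalty terms listed under Regularization can be worked through numerically. In the sketch below, the coefficient values and λ = 0.1 are made up for illustration, and the function names are hypothetical.

```python
def l1_penalty(coefs, lam):
    """Lasso penalty: lambda times the sum of absolute coefficients."""
    return lam * sum(abs(b) for b in coefs)


def l2_penalty(coefs, lam):
    """Ridge penalty: lambda times the sum of squared coefficients."""
    return lam * sum(b * b for b in coefs)


# Illustrative coefficients; the penalty is added to the training loss.
betas = [0.5, -2.0, 0.0]
lasso_term = l1_penalty(betas, lam=0.1)  # 0.1 * (0.5 + 2.0 + 0.0)
ridge_term = l2_penalty(betas, lam=0.1)  # 0.1 * (0.25 + 4.0 + 0.0)
```

The L2 term penalizes the large coefficient (-2.0) far more heavily than the small one, while the L1 term grows linearly with each coefficient's magnitude.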

Feature Scaling

Standardization (Z-score Normalization): z = (x - μ) / σ
where μ = mean of the feature values, σ = standard deviation of the feature values
Normalization (Min-Max Scaling): x_scaled = (x - min(x)) / (max(x) - min(x))
where min(x) = minimum value of the feature, max(x) = maximum value of the feature
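Both scaling formulas above translate directly into code. This sketch uses the population standard deviation for σ (an assumption, since the text does not say which variant is meant) and does not guard against constant features, where the denominators would be zero.

```python
import statistics


def standardize(values):
    """Z-score each value: (x - mean) / standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population std; an assumption here
    return [(x - mu) / sigma for x in values]


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


# Toy feature column, invented for illustration.
feature = [2.0, 4.0, 6.0, 8.0]
z_scores = standardize(feature)    # mean 0, standard deviation 1
scaled = min_max_scale(feature)    # smallest value -> 0.0, largest -> 1.0
```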

Machine Learning Models Cheat Sheet

Linear Models

Ideal for datasets with linear relationships.
Algorithms: Linear Regression, Logistic Regression.
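For the simplest linear model, the line y = mx + c, the least-squares slope and intercept have a closed form. The sketch below assumes a single input feature and ordinary least squares; the function name `fit_line` and the data are illustrative.

```python
def fit_line(xs, ys):
    """Return (m, c) minimizing squared error for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    c = mean_y - m * mean_x
    return m, c


# Toy data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```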

Tree-Based Models

Handle non-linear relationships effectively.
Algorithms: Decision Trees, Random Forests, Gradient Boosting Machines (GBMs).

Neural Networks

Loosely inspired by the human brain: layers of interconnected artificial neurons.
Suitable for image recognition, NLP, etc.

Ensemble Models

Combine multiple models to improve performance.
Examples: Bagging (Random Forest), Boosting (XGBoost, AdaBoost).

Deep Learning Essentials

Neural Networks:
- Forward Propagation: Compute output.
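- Backpropagation: Compute the gradient of the loss with respect to each weight; an optimizer such as gradient descent then uses these gradients to update the weights.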
Activation Functions:
- Sigmoid (Logistic): f(x) = 1 / (1 + e^(-x))
- Tanh (Hyperbolic Tangent): f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
- ReLU (Rectified Linear Unit): f(x) = max(0, x)
- Softmax (for a vector x = [x_1, x_2, ..., x_n]): f_i(x) = e^(x_i) / ( Σ over j of e^(x_j) )
Optimization: SGD, Adam, RMSProp.
Loss Functions:
- Regression: MSE, MAE.
- Classification: Cross-Entropy Loss.
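The activation functions and the cross-entropy loss listed above can be written out in a few lines. This is a sketch with two implementation choices that are not in the formulas themselves: the sigmoid is split by sign and the softmax subtracts the maximum input, both to avoid overflow for large values. The cross-entropy here assumes a single example with one true class index.

```python
import math


def sigmoid(x):
    """1 / (1 + e^(-x)), evaluated in an overflow-safe form."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)


def relu(x):
    """max(0, x)."""
    return max(0.0, x)


def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(xs)  # subtracting the max leaves the result unchanged
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[true_index])


# Illustrative scores for a 3-class problem where class 2 is correct.
probs = softmax([1.0, 2.0, 3.0])
loss = cross_entropy(probs, true_index=2)
tanh_zero = math.tanh(0.0)  # tanh is available directly in the math module
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1.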

ML Tools and Libraries

Scikit-learn: Comprehensive Python library for ML.
TensorFlow: Open-source framework for building neural networks.
PyTorch: Flexible framework for deep learning.
XGBoost: Optimized gradient boosting library.
Keras: High-level API for neural networks, integrated with TensorFlow.

ML Cheat Sheet for Interview

Review foundational concepts, such as bias-variance tradeoff and overfitting.
Practice common algorithms and know their computational complexities.
Be prepared to explain ML workflows: data preprocessing, training, evaluation.
Discuss metrics selection for given scenarios.
Practice coding ML algorithms from scratch.

Quick Reference Table

Category	Key Algorithms	Metrics
Supervised Learning	Linear Regression, Random Forest	Accuracy, F1-Score (classification); RMSE, R² (regression)
Unsupervised Learning	K-Means, PCA	Silhouette Score
Reinforcement Learning	Q-Learning, DQN	Cumulative Rewards

Conclusion

This ML cheat sheet provides an overview of the core concepts, algorithms, metrics, and tools essential for mastering ML and excelling in interviews. Whether tackling supervised tasks, exploring unsupervised methods, or diving into reinforcement learning, this reference ensures readiness for both theoretical and practical challenges.
