
Normalization in Machine Learning

Last Updated: 17th August, 2024

Data normalization is a vital pre-processing step in Machine Learning (ML) that ensures all input features are scaled to a common range. It is a procedure used to improve the accuracy and efficiency of ML algorithms by transforming the data into a consistent, comparable form. Now let's understand what normalization of data means in detail.

What is Data Normalization?

Data normalization in ML is a process that transforms data into a common format so that it can be used in analytics and machine learning algorithms. It is typically used to convert raw data into a more useful form for ML algorithms such as linear regression, logistic regression, and neural networks. Data normalization can be applied to both numerical and categorical data, and it can help reduce the complexity of the data.

For example, it can help reduce the number of features by combining related features or removing redundant ones. It can also be used to standardize the data so that all input parameters fall within the same range. In addition, it can help reduce the impact of outliers and lower the chances of overfitting.

Types of Data

  • Nominal data is a type of data that is not ordered or ranked. It is usually qualitative in nature and can be used to categorize items. For example, hair color (blonde, brunette, black, etc.) is a type of nominal data. When normalizing nominal data, it's important to use techniques that don't rely on the numeric values of the data, such as one-hot encoding.
  • Ordinal data is a type of data that is ordered or ranked. It is usually qualitative in nature and can be used to rank items. For example, a survey scale that asks respondents to rate something on a scale of 1 to 5 is an example of ordinal data. When normalizing ordinal data, it's important to use techniques that preserve the order of the data, such as min-max normalization.
  • Interval data is a type of data that is ordered and measured on a scale with equal intervals between values, but it has no true zero point. It is usually quantitative in nature and can be used to measure the difference between two items. For example, temperature in degrees Celsius is an example of interval data. When normalizing interval data, it's important to use techniques that preserve the intervals between the data points, such as z-score normalization.
  • Ratio data is a type of data that is ordered and has a true zero point, so meaningful ratios between values can be computed. It is usually quantitative in nature. For example, length is an example of ratio data. When normalizing ratio data, it's important to use techniques that preserve the relationships between the data points, such as log normalization. (A short sketch mapping each data type to a suitable technique follows this list.)
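
To make the mapping from data type to technique concrete, here is a minimal sketch using pandas and NumPy on an invented toy table; the column names and values are purely illustrative.

import numpy as np
import pandas as pd

# Invented toy table with one column per data type
df = pd.DataFrame({
    "hair_color": ["blonde", "black", "brunette", "black"],  # nominal
    "survey_rating": [1, 3, 5, 4],                           # ordinal (1-5 scale)
    "temperature_c": [21.0, 19.5, 25.0, 23.0],               # interval
    "length_cm": [4.0, 12.0, 150.0, 30.0],                   # ratio
})

# Nominal: one-hot encoding, since the categories have no numeric order
hair_encoded = pd.get_dummies(df["hair_color"], prefix="hair")

# Ordinal: min-max normalization keeps the ranking and maps it into [0, 1]
rating = df["survey_rating"]
rating_scaled = (rating - rating.min()) / (rating.max() - rating.min())

# Interval: z-score standardization preserves the differences between points
temp = df["temperature_c"]
temp_standardized = (temp - temp.mean()) / temp.std()

# Ratio: log transformation compresses the wide range of positive values
length_log = np.log(df["length_cm"])

print(pd.concat([hair_encoded, rating_scaled, temp_standardized, length_log], axis=1))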

Why do We use Normalization Techniques?

  1. To remove the impact of scale: Data can have vastly different scales and ranges, which can create issues in analysis and modeling. Normalization helps to remove the impact of scale and put all features on the same scale.
  2. To improve the performance of models: Many machine learning algorithms work better when the input data is normalized. Normalizing the data can lead to faster training and better performance of the model.
  3. To address skewness in the data: Normalization in machine learning can help to address skewness in the data, which can be caused by outliers or by the data being distributed in a non-normal way. By transforming the data into a more normal distribution, it can be easier to analyze and model.
  4. To improve the interpretability of the data: Normalization can make the data more interpretable and easier to understand. By putting all features on the same scale, it can be easier to see the relationships between different variables and make meaningful comparisons.

Normalization Techniques in Machine Learning

Min-Max Scaling

This normalization method converts data into a range between 0 and 1 by subtracting the minimum value from each data point and then dividing by the difference between the maximum and minimum values. It is useful when all features need to lie on a common, bounded scale; note, however, that it is sensitive to outliers, because a single extreme value stretches the range and compresses the remaining scaled values.

X_scaled = (X - X_min) / (X_max - X_min)

Where:

  • X is the original value.
  • X_min is the minimum value of the feature.

  • X_max is the maximum value of the feature.
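
As a rough illustration, the sketch below applies min-max scaling to a small, made-up feature, both by hand and with scikit-learn's MinMaxScaler; note how a single outlier compresses the other scaled values.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative feature with one obvious outlier (200)
X = np.array([[10.0], [20.0], [30.0], [40.0], [200.0]])

# Manual min-max scaling: (X - X_min) / (X_max - X_min)
X_scaled_manual = (X - X.min()) / (X.max() - X.min())

# Equivalent scaling with scikit-learn (default feature_range is (0, 1))
X_scaled = MinMaxScaler().fit_transform(X)

print(X_scaled_manual.ravel())
print(X_scaled.ravel())  # the outlier lands at 1.0 and squeezes the other values near 0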

Z-Score Normalization (Standardization)

This normalization technique converts data into a standard normal distribution by subtracting the mean from each data point and then dividing by the standard deviation. It is useful when the data is approximately normally distributed, because it makes the data easier to interpret and compare.

X_standardized = (X - μ) / σ

Where:

  • X is the original value.
  • μ is the mean of the feature.

  • σ is the standard deviation of the feature.
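
A minimal sketch of z-score standardization on illustrative values, computed both manually and with scikit-learn's StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative feature values
X = np.array([[50.0], [60.0], [70.0], [80.0], [90.0]])

# Manual standardization: (X - mean) / standard deviation
X_standardized_manual = (X - X.mean()) / X.std()

# Equivalent transformation with scikit-learn
X_standardized = StandardScaler().fit_transform(X)

print(X_standardized_manual.ravel())
print(X_standardized.ravel())  # mean is 0 and standard deviation is 1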

Decimal Scaling

This normalization technique scales data by moving the decimal point: each value is divided by 10 raised to the power j, where j is the smallest integer that brings the largest absolute scaled value below 1. It is useful when dealing with very large values, because it reduces the data to a manageable range while preserving the relative differences between points.

X_scaled = X / 10^j

Where:

  • X is the original value.
  • j is the smallest integer such that max(|X_scaled|) < 1.
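
Scikit-learn does not ship a decimal-scaling transformer, so the sketch below implements the formula directly with NumPy on made-up values; the helper function decimal_scale is hypothetical and shown only for illustration.

import numpy as np

def decimal_scale(X):
    # j is the smallest integer such that max(|X / 10^j|) < 1
    j = int(np.ceil(np.log10(np.max(np.abs(X)) + 1e-12)))
    return X / (10 ** j), j

# Illustrative feature values
X = np.array([345.0, -72.0, 918.0, 4.0])

X_scaled, j = decimal_scale(X)
print(j)         # 3, because max(|X|) = 918 needs division by 10^3
print(X_scaled)  # [ 0.345 -0.072  0.918  0.004]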

Log Transformation

This method converts data onto a logarithmic scale by taking the log of each data point. It is useful when the data spans a wide range of values, because it compresses that range and reduces the variation in the data. It is also helpful when the data contains large positive outliers, because it reduces their impact.

X_transformed = log(X + 1)

Where:

  • X is the original value.
  • The +1 is added so that zero values remain valid (log(0) is undefined); negative values cannot be handled by a simple log transform and require a different shift or transformation.
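
A small illustrative sketch of the log transformation using NumPy's log1p, which computes log(X + 1):

import numpy as np

# Illustrative skewed feature spanning several orders of magnitude
X = np.array([0.0, 1.0, 10.0, 100.0, 1000.0, 10000.0])

# np.log1p computes log(X + 1), so zero values stay valid
X_transformed = np.log1p(X)

print(X_transformed)  # the spread between points is strongly compressed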

Advantages and Disadvantages of Normalization in Machine Learning

Advantages

  1. Normalization helps to reduce data redundancy and improve data integrity. By dividing large tables into smaller, related tables, normalization decreases the amount of duplicated information stored in each table. This makes the data easier to access and modify, and reduces the amount of storage space required.
  2. Normalization also improves data consistency. By organizing data into multiple tables, you can ensure that the same data is not stored in different locations, so when changes are made they are reflected in all related tables.
  3. Normalization also helps to reduce the complexity of queries. By dividing large tables into smaller, related tables, queries can access only the data they need and do not have to process unnecessary data.

Disadvantages

  1. Normalization can cause performance issues. Joining numerous tables together can make the database run slower, particularly when a large number of rows are involved.
  2. Normalization can also make it more difficult to query the data. For example, if you need to retrieve information from multiple tables, you may need to write complex SQL queries.

Normalization techniques should be used when data redundancy and data integrity are a concern. They are also useful when data consistency is important and when queries need to be simplified. Finally, normalization can help reduce the complexity of data structures.

Importance of Normalization in Machine Learning

  • Normalization plays an important role in the accuracy of machine learning algorithms. It scales features so that the data falls within a certain range, usually between 0 and 1. This ensures that all features contribute equally to the analysis; otherwise the model may be biased towards features with larger magnitudes.
  • Normalization also helps increase the convergence rate of machine learning algorithms such as clustering, neural networks, and regression. These algorithms work better when the data points are close to each other and within the same range. With normalization, the data points are more homogeneous and the algorithm can learn and make more accurate predictions.
  • Additionally, normalization helps reduce the amount of noise in the data. When the data is centered around a mean of zero, it is easier to identify the important patterns and relationships, which leads to better results and more accurate predictions.

Implementing Normalization in Machine Learning

Let's consider the A_Z Handwritten Data.csv dataset to demonstrate data normalization in Python.

Link: https://www.kaggle.com/datasets/sachinpatel21/az-handwritten-alphabets-in-csv-format

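The original code block is not embedded here, so the following is a minimal sketch of the workflow described in the next paragraph, assuming the Kaggle file has been downloaded locally as "A_Z Handwritten Data.csv" and that its first column holds the class label while the remaining columns hold pixel values.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the handwritten-alphabet dataset (assumed to be saved locally)
data = pd.read_csv("A_Z Handwritten Data.csv")

# Assumption: the first column is the class label, the rest are pixel features
y = data.iloc[:, 0]
X = data.iloc[:, 1:]

# Standardize every pixel feature to mean 0 and standard deviation 1
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)

# Convert back to a DataFrame to verify the result
X_normalized_df = pd.DataFrame(X_normalized, columns=X.columns)
print(X_normalized_df.head())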

This code demonstrates how to implement normalization using Python libraries such as Pandas, NumPy, and Scikit-Learn. The code reads in a handwritten recognition dataset, separates the features and labels, and normalizes the features using the StandardScaler from Scikit-Learn. The normalized features are then converted to a DataFrame which can be printed to verify that the normalization has been applied correctly.

Difference between Normalization and Standardization

Normalization

  • Definition: Normalization is a data preprocessing technique that rescales the values of a feature to a specific range, typically [0, 1]. This process ensures that all features contribute equally to the model by constraining the data within a consistent and predictable range.
  • Formula: The formula for normalization is: X_scaled = (X - X_min) / (X_max - X_min), where X is the original value, X_min is the minimum value of the feature, and X_max is the maximum value of the feature. This formula ensures that the transformed values lie within the specified range, usually between 0 and 1.
  • Range of Transformed Data: After normalization, the values of the transformed data will typically fall within the range of [0, 1]. In some cases, normalization might be adjusted to a range of [-1, 1] depending on specific requirements or to handle features with negative values. This range helps to standardize the scale of features, which is especially useful in algorithms sensitive to the magnitude of input data.
  • Sensitivity to Outliers: Normalization is highly sensitive to outliers because it relies on the minimum and maximum values of the data. An extreme value can significantly affect the range and therefore distort the normalized data, making the feature less representative of the underlying distribution.
  • Use Case: Normalization is particularly useful when the data needs to be constrained within a specific range, such as in algorithms like neural networks or K-nearest neighbors, where the distance between data points is critical. It is also beneficial when dealing with features that have different units or scales, ensuring that they contribute equally to the model.
  • Effect on Data Distribution: Normalization does not change the shape of the data distribution; it only rescales the values to a different range. For example, if the original data was skewed, the normalized data will still be skewed, but within the new range. This is important when preserving the distribution shape is crucial for the model's interpretation.
  • Impact on Euclidean Distance: Normalization directly influences Euclidean distance, as it rescales all features to the same range. This is crucial for algorithms like K-means clustering or KNN, where distance calculations are a key part of the algorithm. Without normalization, features with larger scales could dominate the distance metric, leading to biased results.
  • Interpretability: Normalized features are easier to interpret within the context of the specified range. For example, a value of 0.8 in a normalized feature clearly indicates a higher position relative to other values. This makes it straightforward to understand the relative magnitude of different data points.
  • Algorithm Preference: Normalization is often the preferred technique for algorithms like K-means clustering, K-nearest neighbors (KNN), and neural networks, where the distance between data points or the scale of input data directly affects the model's performance. These algorithms benefit from features being within the same range, ensuring that no single feature disproportionately influences the results.
  • Computational Complexity: Normalization is generally simpler and faster to compute, as it only requires determining the minimum and maximum values of the feature and applying a straightforward rescaling formula. This makes it computationally efficient and easy to implement, especially in large datasets.

Standardization

  • Definition: Standardization is a data preprocessing technique that rescales the values of a feature so that the resulting distribution has a mean of 0 and a standard deviation of 1. This process helps in centering the data and making it follow a standard normal distribution, which can be beneficial for algorithms that assume normally distributed input data.
  • Formula: The formula for standardization is: X_standardized = (X - μ) / σ, where X is the original value, μ is the mean of the feature, and σ is the standard deviation of the feature. This transformation shifts the data so that it is centered around the mean and scales it to have a consistent spread, defined by the standard deviation.
  • Range of Transformed Data: The values after standardization are not bounded to a specific range. However, they are rescaled so that the feature has a mean of 0 and a standard deviation of 1. This means that the transformed data will be distributed around 0, with most values falling between -3 and 3 if the data is normally distributed.
  • Sensitivity to Outliers: Standardization is less sensitive to outliers compared to normalization because it relies on the mean and standard deviation, which are more robust against extreme values. However, outliers can still impact the calculation of the mean and standard deviation, leading to potential distortions in the standardized data.
  • Use Case: Standardization is preferred in scenarios where the data is expected to follow a normal distribution or when the algorithm assumes normally distributed input data. It is commonly used in linear regression, logistic regression, and principal component analysis (PCA), where centering and scaling the data helps in better model performance and more meaningful comparisons between features.
  • Effect on Data Distribution: Standardization can affect the shape of the data distribution, especially if the original data is not normally distributed. By centering and scaling the data, standardization can bring non-normally distributed data closer to a normal distribution, which can be advantageous for certain statistical analyses and machine learning models.
  • Impact on Euclidean Distance: Standardization does not have as direct an impact on Euclidean distance as normalization, but it still ensures that all features contribute equally by centering them around the mean and scaling them to unit variance. This can be particularly beneficial when the algorithm assumes that all features are on a similar scale, such as in PCA or SVM.
  • Interpretability: Standardized features are centered around 0, making them less directly interpretable in terms of magnitude. However, standardization allows for easier comparison between features, as they are all measured on the same scale. This is useful when understanding the relative importance of different features in a model.
  • Algorithm Preference: Standardization is preferred for algorithms that assume or benefit from normally distributed data, such as linear regression, logistic regression, and principal component analysis (PCA). In these cases, standardization helps the model converge faster and perform better by ensuring that all features are on a comparable scale and centered around zero.
  • Computational Complexity: Standardization involves calculating the mean and standard deviation of each feature, which can be slightly more complex and computationally intensive than normalization. However, the additional computational effort is often justified by the improved model performance, especially in cases where the model relies on normally distributed data. A short side-by-side sketch of the two transformations follows this list.
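
To see the contrast side by side, here is a small illustrative sketch that applies both transformations to the same made-up feature containing one outlier:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative feature with one large outlier
X = np.array([[2.0], [4.0], [6.0], [8.0], [100.0]])

# Normalization: bounded to [0, 1]; the outlier compresses the remaining values
normalized = MinMaxScaler().fit_transform(X)

# Standardization: unbounded, centered on 0 with unit variance
standardized = StandardScaler().fit_transform(X)

print(normalized.ravel())    # values near 0 except the outlier at 1.0
print(standardized.ravel())  # values roughly between -0.6 and 2.0, mean 0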

Real-World Examples of Normalization in Data Science

Customer Segmentation

An online retail company is using customer segmentation to better understand its customer base and target potential new customers. The company collects data on customer demographics, purchasing habits, and product preferences. In order to segment their customers, they normalize this data by taking into account a variety of factors such as age, location, income, gender, and purchase frequency. This normalization of data allows the company to group customers into similar segments, which they can then use to target marketing campaigns or create personalized product recommendations.

Image Recognition

An artificial intelligence company is developing a system that can identify and classify different types of images. In order to do this, they first normalize the images by adjusting the brightness, contrast, and colour saturation to ensure that the images are consistent and can be accurately classified. Then, they use machine learning algorithms to identify objects within the images and classify them accordingly. This process of normalizing the images and using machine learning algorithms allows the system to accurately identify and classify different types of images.

Fraud Detection

A financial institution is using data normalization to detect fraudulent transactions. They collect data on customer transaction patterns, such as frequency and amount of transactions. By normalizing this data, they can identify outliers or suspicious patterns that may indicate fraudulent activity. The normalized data can then be used to trigger automated alerts or further investigation into potentially fraudulent transactions. This process of normalizing data allows the financial institution to detect and prevent fraud.

Conclusion

In each of these examples, once the data was normalized, the organization was able to use it to gain meaningful insights and make better-informed decisions about its marketing, operations, and risk strategies. Normalized data also made it possible to build more accurate and powerful algorithms to drive analytics and machine learning processes.

Key Takeaways

  1. Use a standard normalization technique, such as min-max or z-score, to transform your data into a common range.
  2. Ensure your data is in a proper format to begin with by using an appropriate data type.
  3. Use built-in Python modules such as Numpy and Pandas for efficient data normalization.
  4. Avoid data leakage by splitting your data into training and test sets before normalizing it (see the sketch after this list).
  5. Choose the right normalization technique based on the data distribution and the desired outcome.
  6. Re-check your data after normalization to ensure it is within the desired range.
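
A minimal sketch of takeaway 4, avoiding data leakage by fitting the scaler on the training split only; the data here is invented purely for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Invented feature matrix and labels, purely for illustration
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Split first, so the test set never influences the scaling parameters
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training data only...
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ...then apply the same, already-fitted transformation to the test data
X_test_scaled = scaler.transform(X_test)

Calling transform (rather than fit_transform) on the test set ensures the test data is scaled with statistics learned from the training split alone.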

Quiz

  1. What is the main purpose of normalization in machine learning?
    a. To reduce the complexity of data
    b. To make data more interpretable
    c. To reduce the variance of data
    d. To reduce the risk of overfitting

Answer: c. To reduce the variance of data

  2. What is the most common type of normalization used in machine learning?
    a. Z-Score Normalization
    b. Min-Max Normalization
    c. Decimal Scaling
    d. Feature Scaling

Answer: b. Min-Max Normalization

  3. What is the range of values for data normalized using min-max normalization?
    a. 0 to 1
    b. -1 to 1
    c. 0 to 100
    d. -100 to 100

Answer: a. 0 to 1

  4. What is a benefit of using data normalization?
    a. It can speed up the training process
    b. It can help make data more interpretable
    c. It can reduce the complexity of data
    d. It can reduce the risk of overfitting

Answer: a. It can speed up the training process

