Overview
Data cleaning is the process of preparing a dataset for machine learning algorithms. It includes assessing data quality, handling missing values, handling outliers, transforming data, merging and deduplicating data, and handling categorical variables. This step is required to ensure that the data is ready for machine learning algorithms: it reduces the risk of errors and improves the accuracy of the models.
Data quality assessment:
Data quality assessment is the process of inspecting a dataset before any cleaning is applied. Typical checks include the shape of the dataset, the number of missing values, duplicate rows, outliers, and class imbalance. The results show which of the cleaning steps in the following sections the data actually needs.
Example
Let's consider the iris dataset:
The code example assesses data quality by checking the shape of the dataset and the number of missing values, duplicates, and outliers, and by checking for class imbalance.
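The checks listed above can be sketched as follows. This is a minimal illustration, not the tutorial's original snippet. It assumes the iris data is loaded through scikit-learn's `load_iris(as_frame=True)`, whose `.frame` attribute holds the four features plus a `target` column. Outlier checks are left to the "Handling outliers" section.

```python
from sklearn.datasets import load_iris

# Assumption: iris is loaded from scikit-learn as a pandas DataFrame.
iris_df = load_iris(as_frame=True).frame

# Shape: number of rows and columns.
print("Shape:", iris_df.shape)

# Missing values per column.
missing_per_column = iris_df.isnull().sum()
print("Missing values per column:")
print(missing_per_column)

# Number of rows that exactly repeat an earlier row.
duplicate_rows = iris_df.duplicated().sum()
print("Duplicate rows:", duplicate_rows)

# Class balance: number of samples per target class.
class_counts = iris_df["target"].value_counts()
print("Samples per class:")
print(class_counts)
```

Equal class counts indicate a balanced dataset. A strongly skewed count would suggest that imbalance needs to be handled before training.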
Handling missing values:
Handling missing values in machine learning is an important preprocessing step that is essential for building accurate and reliable models. Missing values can occur for various reasons, such as data entry errors, sensor failures, or simply because certain data points were not collected.
Here are some common strategies for handling missing values in machine learning:
Ultimately, the choice of how to handle missing values depends on the context and on the nature of the missing values. It is important to weigh the advantages and drawbacks of each approach and to select the one most suitable for the problem at hand.
Example
This code handles missing values by first counting them, then replacing each missing value with the median of its feature, and finally verifying that no missing data remains.
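A minimal sketch of median imputation along these lines is shown below. The iris dataset ships without missing values, so the sketch blanks out a few cells first. Those injected gaps and the chosen row indices are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_df = load_iris(as_frame=True).frame

# Assumption for illustration: inject a few missing values,
# because the iris dataset has none of its own.
iris_df.loc[[0, 10, 20], "sepal length (cm)"] = np.nan

# Step 1: count missing values per column.
missing_before = iris_df.isnull().sum()
print(missing_before)

# Step 2: replace each missing entry with the median of its own column.
feature_cols = iris_df.columns.drop("target")
iris_df[feature_cols] = iris_df[feature_cols].fillna(iris_df[feature_cols].median())

# Step 3: verify that nothing is missing any more.
missing_after = int(iris_df.isnull().sum().sum())
print("Missing values after imputation:", missing_after)
```

The median is less sensitive to extreme values than the mean, which is one reason it is often preferred for skewed features.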
Handling outliers:
Outliers are data points that differ significantly from the rest of the data. Handling outliers in machine learning is the process of identifying and treating them. Outliers can be either dropped or transformed. Dropping outliers means removing the affected data points from the dataset. Transforming outliers means changing their values to make them more consistent with the rest of the data. But how do we check for outliers?
We can check for outliers with methods such as box plots and z-scores.
Example
This code creates a box plot for each variable (sepal length, sepal width, petal length, and petal width) in the iris dataset. Outliers can be identified as individual data points that fall outside the whiskers of the box plot. You can visually inspect the box plots to identify any outliers.
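Visual inspection can be complemented by a numeric check. The sketch below is not the tutorial's plotting code. It applies the rule that box plot whiskers commonly follow, assuming the usual whisker length of 1.5 times the interquartile range (IQR), and counts the points that would fall outside the whiskers for each feature.

```python
from sklearn.datasets import load_iris

iris_df = load_iris(as_frame=True).frame
features = iris_df.drop(columns="target")

# Quartiles and interquartile range for each feature.
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1

# Assumption: whiskers reach at most 1.5 * IQR beyond the box,
# so anything outside these bounds would be drawn as an individual point.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (features < lower) | (features > upper)

# Number of flagged points per feature.
outlier_counts = is_outlier.sum()
print(outlier_counts)
```

A non-zero count for a feature corresponds to individual points drawn beyond the whiskers in that feature's box plot.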
This code handles outliers by calculating the z-score of each value in the dataset and then removing any data points whose absolute z-score is greater than 3. This drops the points that the z-score rule flags as outliers.
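A minimal sketch of z-score filtering is given below. It computes the z-scores directly with pandas rather than through a statistics library, which is a choice made here for self-containment and not necessarily what the original snippet did. The threshold of 3 comes from the description above.

```python
from sklearn.datasets import load_iris

iris_df = load_iris(as_frame=True).frame
features = iris_df.drop(columns="target")

# z-score: how many standard deviations a value lies from its column mean.
# ddof=0 uses the population standard deviation.
z_scores = (features - features.mean()) / features.std(ddof=0)

# Keep only rows where every feature is within 3 standard deviations.
keep = (z_scores.abs() < 3).all(axis=1)
iris_df_no_outliers = iris_df[keep]

print("Rows before:", len(iris_df))
print("Rows after:", len(iris_df_no_outliers))
```

Taking the absolute value matters: without it, unusually small values (large negative z-scores) would never be flagged.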
Data transformation:
Example
This code example performs data transformation using the StandardScaler from the scikit-learn library. StandardScaler rescales each feature to have a mean of 0 and a standard deviation of 1. The result is stored in the variable iris_df_z_transformed.
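The transformation described here might look like the following sketch. The variable name `iris_df_z_transformed` is taken from the text. Loading the data through `load_iris(as_frame=True)` and leaving the `target` column unscaled are assumptions.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris_df = load_iris(as_frame=True).frame

# Assumption: only the feature columns are scaled, not the class label.
feature_cols = iris_df.columns.drop("target")

# Fit the scaler on the features and transform them in one step.
scaler = StandardScaler()
scaled_values = scaler.fit_transform(iris_df[feature_cols])

# Wrap the resulting NumPy array back into a DataFrame with the original labels.
iris_df_z_transformed = pd.DataFrame(
    scaled_values, columns=feature_cols, index=iris_df.index
)

# Each column should now have mean ~0 and (population) standard deviation ~1.
print(iris_df_z_transformed.mean().round(6))
print(iris_df_z_transformed.std(ddof=0).round(6))
```

When a model is evaluated on held-out data, the scaler is usually fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.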
Data merging and deduplication:
Data merging and deduplication in machine learning is the process of combining two or more datasets into one and removing any duplicate data points. This is done to ensure that the data used to build the machine learning model is accurate and complete. Data merging involves combining datasets in a way that preserves the integrity of the data, while deduplication involves identifying and removing any duplicate data points from the dataset.
Example
This code is used to perform data merging and deduplication on two datasets, 'iris_data1' and 'iris_data2'. The datasets are first merged into a single dataframe, 'iris_df', using the concat() method. Then, the duplicates are removed from the merged dataset using the drop_duplicates() method. Finally, the shape of the dataset is verified to ensure that the duplicates have been successfully removed.
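The steps above can be sketched as follows. The names `iris_data1`, `iris_data2`, and `iris_df` come from the description. How the two datasets were originally built is not shown, so the overlapping slices used here are an assumption chosen to guarantee that the merge produces duplicates.

```python
import pandas as pd
from sklearn.datasets import load_iris

full_df = load_iris(as_frame=True).frame

# Assumption: two overlapping slices stand in for the two source datasets.
iris_data1 = full_df.iloc[:100]
iris_data2 = full_df.iloc[50:]

# Merge: stack the two datasets into a single DataFrame.
iris_df = pd.concat([iris_data1, iris_data2], ignore_index=True)
rows_after_merge = len(iris_df)
print("Shape after concat:", iris_df.shape)

# Deduplicate: drop rows that exactly repeat an earlier row.
iris_df = iris_df.drop_duplicates().reset_index(drop=True)
print("Shape after drop_duplicates:", iris_df.shape)
```

Note that `drop_duplicates()` removes every exact repeat, including rows that were already identical within a single source dataset. If such rows are legitimate separate observations, deduplicating on a unique identifier column is the safer choice.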
Handling categorical variables:
Handling categorical variables means converting categorical data into numerical data. This makes the data more suitable for machine learning, since most machine learning algorithms work with numerical inputs. Common methods include one-hot encoding, label encoding, and binary encoding.
Example
This code is used to handle categorical variables in a dataset. The target column is converted from a numerical data type to a categorical data type and then recoded to 0 and 1. This is useful for machine learning algorithms that require categorical data to be represented as numerical values.
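As a separate illustration of two of the encodings named above, the sketch below applies label encoding and one-hot encoding to the iris species. It is not a reconstruction of the snippet described here: the `species`, `species_label`, and `species_onehot` names are hypothetical, and all three classes are kept rather than recoded to 0 and 1.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
iris_df = iris.frame

# Build a categorical column of species names from the numeric target.
iris_df["species"] = pd.Categorical.from_codes(
    iris_df["target"], categories=iris.target_names
)

# Label encoding: each category becomes a single integer code.
iris_df["species_label"] = iris_df["species"].cat.codes

# One-hot encoding: each category becomes its own indicator column.
species_onehot = pd.get_dummies(iris_df["species"], prefix="species")

print(iris_df[["species", "species_label"]].drop_duplicates())
print(species_onehot.head())
```

Label encoding implies an ordering between categories, which some models may misinterpret. One-hot encoding avoids that at the cost of extra columns.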
Best practices and guidelines for data cleaning:
Conclusion
Data cleaning is a critical step in the machine learning process. It includes assessing data quality, handling missing values, handling outliers, transforming data, merging and deduplicating data, and handling categorical variables. By following these best practices and guidelines, we can ensure that our dataset is clean and ready for machine learning algorithms.
Key takeaways
Quiz
Answer: a. To identify and remove errors in data
Answer: c. Replace the missing values with the median
Answer: d. Remove the outliers using z-score
Answer: c. Add more data