Introduction
Data is at the heart of machine learning, so the performance of every machine learning algorithm is directly affected by the quality of its input data. The saying "garbage in, garbage out" holds in machine learning as well: poor-quality data can mislead the training process and result in inaccurate models, longer training times and, ultimately, poor results. On the other hand, algorithms trained on accurate, clean, and well-labelled data can identify the patterns hidden in the data and produce models that make highly accurate predictions. For this reason, it is very important to understand the input and to detect and address any issues affecting its quality before feeding it to the machine learning algorithm.
In the rest of this article, we discuss the criteria you can use to assess data quality and ways to address any issues you detect.
Data quality evaluation
There are many dimensions of data quality that one can consider when evaluating the data at hand. Some of the most common aspects examined in the data quality assessment process are the following (a short pandas sketch illustrating how several of them can be checked is given after the list):
- Number of missing values. Most real-world datasets contain missing values, i.e., feature entries with no data value stored. As many machine learning algorithms do not support missing values, detecting them and handling them properly can have a significant impact.
- Existence of duplicate values. Duplicate values can take various forms, such as multiple entries of the same data point, multiple instances of an entire column, and repetition of the same value in an ID variable. While duplicate instances might be valid in some datasets, they often arise from errors in the data extraction and integration processes. Hence, it is important to detect any duplicate values and decide whether they are invalid (true duplicates) or form a valid part of the dataset.
- Existence of outliers/anomalies. Outliers are data points that differ substantially from the rest of the data, and they may arise from the natural diversity of the dataset or from errors. As machine learning algorithms are sensitive to the range and distribution of attribute values, identifying the outliers and their nature is important for assessing the quality of the dataset.
- Existence of invalid/badly formatted values. Datasets often contain inconsistent values, such as variables with different units across data points and variables with incorrect data types. For example, it is often the case that special numerical variables, such as percentages and fractions, are mistakenly stored as strings, and one should detect and transform such cases so that the machine learning algorithm can work with the actual numbers.
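As a rough illustration, several of these checks can be scripted with pandas. The snippet below is a minimal sketch, assuming the data has already been loaded into a DataFrame called dataset (a hypothetical variable name); it is one possible way to inspect the data, not a definitive recipe.

# minimal data quality checks on a hypothetical pandas DataFrame called `dataset`
# number of missing values per column
print(dataset.isnull().sum())
# number of fully duplicated rows
print(dataset.duplicated().sum())
# simple outlier check: count numeric entries more than 3 standard deviations from the column mean
numeric = dataset.select_dtypes(include='number')
z_scores = (numeric - numeric.mean()) / numeric.std()
print((z_scores.abs() > 3).sum())
# data types of each column, useful for spotting numbers stored as strings
print(dataset.dtypes)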
Improving data quality
After exploring the data to assess its quality and gain an in-depth understanding of the dataset, it is important to resolve any detected issues before proceeding to the next stages of the machine learning pipeline. Below, we give some of the most common ways to address such issues.
- Handling missing values. There are different ways of dealing with missing data based on their number and their data type:
- Removing the missing data. If the number of data points containing missing values is small and the dataset is large enough, you may remove those data points. Likewise, if a variable contains a very large number of missing values, it may be removed entirely.
# removing rows with missing values from dataset: pandas.DataFrame
dataset.dropna(inplace=True)
# removing columns with ratio of missing values greater than a threshold
dataset = dataset[dataset.columns[dataset.isnull().mean() <= THRESHOLD]]
- Imputation. If the missing values are too many to simply remove, but do not form such a large proportion of a variable's entries that the variable itself should be dropped, you can impute them: replace missing values in a numerical variable with the mean or median of the non-missing entries, and missing values in a categorical variable with the mode, i.e., the most frequent entry of the variable.
# imputing missing values in each column with the mean of the corresponding
# column using scikit-learn
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
# transform the dataset
imputed_dataset = imputer.fit_transform(dataset)
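For categorical variables, the same SimpleImputer can be configured to fill missing entries with the most frequent value. The following is a minimal sketch, assuming the categorical columns have already been selected into a DataFrame called categorical_dataset (a hypothetical variable name).

# imputing missing values in categorical columns with the most frequent value (mode)
# `categorical_dataset` is a hypothetical DataFrame containing only categorical columns
from sklearn.impute import SimpleImputer
mode_imputer = SimpleImputer(strategy='most_frequent')
imputed_categorical = mode_imputer.fit_transform(categorical_dataset)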
- Dealing with duplicate values. True duplicates, i.e., repeated instances of the same data point, are usually removed. Removing them prevents these points from carrying extra sample weight during training and reduces the risk of artificially inflated performance metrics.
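As a sketch, duplicates can be dropped with pandas; the snippet again assumes a hypothetical DataFrame called dataset, and the column name used in the second example is purely illustrative.

# removing exact duplicate rows, keeping the first occurrence
dataset = dataset.drop_duplicates()
# treating rows as duplicates based only on a subset of columns,
# e.g. repeated values in an ID variable ('customer_id' is a hypothetical column name)
dataset = dataset.drop_duplicates(subset=['customer_id'])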
- Dealing with outliers. As with missing values, common ways of handling detected outliers include removing them or imputing new values. However, depending on the context of the dataset and the number of outliers, keeping them unchanged may be the most suitable course of action; for example, when the outliers are not few, they may be essential to understanding the dataset correctly.
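One common (though by no means the only) way to filter outliers is the interquartile-range rule. The sketch below assumes a hypothetical DataFrame called dataset with a hypothetical numeric column named 'value'.

# removing rows whose 'value' entry lies more than 1.5 * IQR outside the quartiles
# ('value' is a hypothetical numeric column name)
q1 = dataset['value'].quantile(0.25)
q3 = dataset['value'].quantile(0.75)
iqr = q3 - q1
dataset = dataset[dataset['value'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]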
- Converting badly formatted values. All malformed values should be converted and stored with the correct data type. For example, numerical variables that are stored as strings are converted to the corresponding numbers, and strings that represent dates are stored as date objects. It is also important to ensure that all entries of a variable are expressed in the same unit, as otherwise comparisons between entries will not reflect the true relationships.
# converting a string column: pandas.Series to the corresponding numeric column
# with NaN values for any invalid entry
import pandas as pd
numeric_column = pd.to_numeric(column, errors='coerce')
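Date strings can be handled in a similar way; the following minimal sketch assumes a pandas Series of date strings called column (as in the snippet above) and marks any unparseable entry as NaT.

# converting a string column of dates to datetime objects, with NaT for any invalid entry
import pandas as pd
date_column = pd.to_datetime(column, errors='coerce')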
Conclusion
As we have seen, understanding the quality of the input data and preparing the dataset so that any issues are resolved are necessary steps for machine learning algorithms to produce accurate predictions. Although these steps can be labour-intensive, it is very important to include them in the machine learning pipeline, as failing to do so can lead to unreliable decisions.
We at TurinTech AI, as data scientists and researchers ourselves, understand that data cleaning and data quality assessment can be very time-consuming and frustrating. Thus, when building EvoML, an end-to-end AI optimisation platform, we made sure to incorporate a feature that lets data scientists automatically check the quality of their data and apply the relevant techniques to make it AI-ready. As you can see in Figure 1, EvoML automatically assesses the quality of the input data, provides data quality reports using easy-to-understand tags and statistics, and addresses any detected issues. With EvoML, data preparation can be faster and easier, enabling you to spend more time on understanding and transforming data for better model performance.
About the Author
Dr Chrystalla Pavlou | TurinTech Research Team
Computer Science PhD graduate with a Masters in Theoretical Computer Science and a degree in Electrical and Computer Engineering. Loves reading and hiking.