Welcome to the second part of my complete guide on data cleaning.
- #1: Tidying Messy Data
- #2: Dealing with Missing Data ← You are here
- #3.1: A discussion on the Nature of Outliers
- #3.2: The Origin of Outliers & Detection Techniques
- #4.1: Where does Data Duplication come from?
- #4.2: A Practical Tutorial for Data Deduplication
When you are missing someone, time seems to move slower, and when I’m falling in love with someone, time seems to be moving faster.
— Taylor Swift
Once you realize that your dataset has missing values, you will appreciate just how much time you may end up spending dealing with them, and how slowly that time passes.
In the previous tutorial, I introduced what a messy dataset looks like, the consequences of dirty data, how to explore & get to know a dataset, the concept of tidy data, and some fundamental data processing techniques.
In this article, I will walk you through one of the two most common and troublesome data problems: Missing Data.
3. Missing Data Handling
a. Type of Missing Data
You have doubtlessly heard about the term “missing value” and definitely confronted datasets where a few or thousands of records are null, regardless of whether you are a novice or an experienced analyst. As a wise man once said, "… in this world nothing can be said to be certain, except death and taxes". I would like to interject and include “and missing data as well as outliers”. Look at the following examples, which hopefully recall some painful memories within you ( ´ ꒳ ` ).
Public Github Data (here)
Public Firebase Tracking Dataset (here), indeed these may not necessarily be actual missing values
Public Google Analytics Tracking Dataset (here)
According to Tony Fischetti, there are three types of missing data which you may encounter while working.
- Missing Completely At Random (MCAR): the data’s missingness is unrelated to the data itself. This would occur, for example, if rows were deleted from a database at random, or if a gust of wind took a random sample of a surveyor’s survey forms off into the horizon.
- Missing At Random (MAR): unfortunately, this name is somewhat misleading. Despite what it suggests, MAR means there is a systematic relationship between the missingness of an outcome variable and other observed variables, but not the outcome variable itself. Take climate data as an example, in which we collect humidity, temperature, and CO2. MAR happens when the humidity and CO2 data are absent below -10 degrees Celsius (recorded in the temperature variable) because those sensors freeze and stop functioning properly, while the temperature sensor keeps recording accurately even well below that threshold.
- Missing Not At Random (MNAR): conversely, MNAR occurs when the absence of a variable is correlated with the variable itself. For instance, consider a scenario where a weighing scale can only measure up to 3,700 pounds and, as a result, the eight cars that exceeded this threshold were recorded as NULLs.
b. Dealing with Missing Data
Analysis of missing values
First of all, you need to summarize how many NULL values are in your dataset. This can be easily achieved in common programming languages through well-developed methods like .describe() in pandas or summary() in R. In a SQL-like environment such as BigQuery, you need to work around this a bit.
Analysis of missing values in the given dataset
Upon initial inspection, it appears that you cannot incorporate the master_branch and integrate_branch columns in your work due to their significant proportion of missing data. Once you acquire a general understanding of the missingness in your dataset, you can employ one of the following techniques to address the issue.
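If you happen to have the same data loaded in Python rather than BigQuery, the null summary is a short pandas expression. The DataFrame below is a made-up miniature of the Github dataset (the values and most column names are only illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Github dataset; values are invented.
df = pd.DataFrame({
    "repo_name":     ["a", "b", "c", "d", "e"],
    "size":          [10.0, 25.0, np.nan, 40.0, 55.0],
    "forks":         [1.0, np.nan, 3.0, np.nan, 5.0],
    "master_branch": [np.nan, np.nan, np.nan, np.nan, "main"],
})

# Absolute and relative null counts per column.
null_summary = pd.DataFrame({
    "null_count": df.isna().sum(),
    "null_pct":   df.isna().mean().round(2),
})
print(null_summary)
```

Here master_branch comes out 80% null, mirroring the kind of column you would exclude after this inspection.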
Method 1: Completely Ignore Nulls
Frankly speaking, you just remove all records containing missing values in any of the extracted variables or columns. This method is generally known as complete case analysis, a.k.a. list-wise deletion, as you only keep fully non-null records in your analysis. Well, sometimes less is more, I guess.
Return dataset after list-wise deletion
By implementing this procedure, you are essentially creating a subset of the provided dataset. In other words, you are drawing a sample from the (possible) population of the observations of interest. Undoubtedly, this comes at a cost, which varies depending on the type of missingness you are tackling, as summarized below.
Summary of list-wise deletion method’s statistical consequences
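In pandas terms (a sketch on a toy frame, not the article’s BigQuery query), list-wise deletion is simply dropna():

```python
import numpy as np
import pandas as pd

# Toy frame with nulls scattered across both columns.
df = pd.DataFrame({
    "size":  [10.0, 25.0, np.nan, 40.0, 55.0],
    "forks": [1.0, np.nan, 3.0, np.nan, 5.0],
})

# Complete case analysis: keep only rows with no null in ANY column.
complete = df.dropna()
print(f"{len(df)} rows -> {len(complete)} rows")  # 5 rows -> 2 rows
```

Note how aggressive this is: three of the five rows each miss only one value, yet all three are discarded.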
Method 2: Partially Ignore Nulls
This methodology is frequently employed when estimating covariance or correlation matrices, where each cell only requires the corresponding pair of numerical variables to be non-null. That is, for each pair of variables you retain the records where both are non-null, rather than requiring all variables to be non-null simultaneously. This technique is commonly referred to as available-case analysis, a.k.a. pairwise deletion. As far as I know, there is also a range of analysis of variance (ANOVA) techniques that only require the variables concerned to be non-null.
It is noteworthy that pairwise deletion may result in nonsensical correlation coefficients (above 1), because the number of inputs for the calculation varies across cells of the covariance/correlation matrix.
Return dataset after pairwise deletion
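pandas illustrates the idea directly: DataFrame.corr() uses pairwise-complete observations by default, so each cell of the matrix is computed from a different subset of rows. The frame below is fabricated climate-style data:

```python
import numpy as np
import pandas as pd

# Fabricated example data with nulls in different rows per column.
df = pd.DataFrame({
    "temperature": [1.0, 2.0, 3.0, 4.0, np.nan],
    "humidity":    [2.0, np.nan, 6.0, 8.0, 10.0],
    "co2":         [np.nan, 5.0, 7.0, 10.0, 11.0],
})

# Pairwise deletion: each correlation uses every row where BOTH
# variables are present, so cells rest on different sample sizes.
pairwise_corr = df.corr()
print(pairwise_corr)

# List-wise deletion for comparison: only rows complete everywhere.
listwise_corr = df.dropna().corr()
```

The temperature/humidity cell uses rows 0, 2, and 3; the humidity/co2 cell uses rows 2, 3, and 4, which is exactly why the assembled matrix can become internally inconsistent.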
Method 3: Mean Substitution
This approach is relatively straightforward, where all NULL values are substituted with the mean of the variables of interest. In certain instances, median or mode can be utilized as substitutions for missing data.
Return dataset after substitution
As you can see, this method appears to address the sample size reduction of the previously mentioned techniques, thereby preserving statistical power. Nevertheless, mean substitution yields downward-biased estimates of the variance even in the MCAR case, because it disregards the dispersion that would have been present had the data not been missing. Consequently, despite its introduction here, I would not recommend this technique under any circumstances. The statistical implications are summarized below.
Changes in statistics after implementation of mean substitution
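A quick pandas sketch on a made-up series makes the variance shrinkage visible: the mean is untouched, but every imputed point sits exactly on the mean and contributes no dispersion.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, np.nan, 40.0, np.nan, 60.0])

# Mean substitution: replace every null with the observed mean.
filled = s.fillna(s.mean())

print(s.mean(), filled.mean())  # identical means: 32.5 and 32.5
print(s.var(), filled.var())    # variance shrinks after imputation
```

The variance drops because the sum of squared deviations is unchanged while the divisor (n - 1) grows.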
Method 4: Hot Deck Imputation
Briefly speaking, this procedure replaces missing values with existing values taken from other records in the dataset. The basic approach involves using a random value, of course, non-null, from the same column to replace a missing value. In more sophisticated approaches, the replaced value is sourced from a row that bears similarity to the row with the missing data.
Indeed, I am introducing this method to familiarize you with the subsequent method — regression imputation.
Return dataset after imputation
Note: please be aware that a single row may remain null, since we use the LAST_VALUE() function with the RANGE BETWEEN UNBOUNDED PRECEDING frame, which returns a NULL value if the first row is null.
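Outside SQL, the basic variant (a random non-null donor from the same column) takes only a few lines of pandas/NumPy; the series below is invented and the generator is seeded purely for reproducibility.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded only for reproducibility

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

# Basic hot deck: fill each null with a random observed (donor)
# value drawn from the same column.
donors = s.dropna().to_numpy()
mask = s.isna()
filled = s.copy()
filled[mask] = rng.choice(donors, size=mask.sum())
print(filled.tolist())
```

Unlike mean substitution, every imputed value is a real observed value, so the marginal distribution of the column is better preserved.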
Method 5: Regression Imputation
Although it may be challenging to apply this technique in SQL-like databases, I would like to introduce the general concept here nonetheless, as it may prove useful in circumstances where a programming language like Python can be utilized. As its name suggests, this method employs a regression model to predict replacement values for the missing items, using other existing columns as independent variables. Let's consider our beloved Github Dataset (available here) as an example: the missing values of the forks column can be estimated from the size and watchers columns. Sounds promising, doesn't it?
By utilizing information from other columns and patterns in the dataset, this technique generates better-informed replacement values. As long as the relevant predictors are included, the method produces unbiased estimates of the mean and regression coefficients in the MCAR & MAR cases. However, the predicted values lack the natural residuals around the regression line, which leads to underestimation of the variability of the missing values.
Summary of statistics after applying regression imputation
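As a Python sketch of the idea (with fabricated numbers and only size as predictor, rather than the size and watchers pair mentioned above), ordinary least squares on the complete rows supplies the replacement values:

```python
import numpy as np
import pandas as pd

# Fabricated stand-in for the Github data: forks depends on size.
df = pd.DataFrame({
    "size":  [10.0, 20.0, 30.0, 40.0, 50.0],
    "forks": [12.0, 22.0, np.nan, 42.0, np.nan],
})

# Fit forks ~ size on the complete rows (ordinary least squares).
obs = df.dropna()
slope, intercept = np.polyfit(obs["size"], obs["forks"], deg=1)

# Impute each missing value with its regression prediction.
missing = df["forks"].isna()
df.loc[missing, "forks"] = intercept + slope * df.loc[missing, "size"]
print(df)
```

The observed points here lie exactly on forks = size + 2, so both imputed values land exactly on the fitted line (32 and 52), which is precisely the missing-residual problem: the imputed points show no scatter at all.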
The following chart illustrates the behaviour of different imputation procedures.
Demonstration of different imputation techniques (ref)
As implied by the above diagram, a more effective approach to regression imputation involves incorporating conditional noise into the prediction of the missing values, known as Stochastic Regression Imputation. Additionally, there is an even more appealing and sophisticated approach called Multiple Imputation, whose results are outstanding. However, I cannot introduce these two methods in full here due to the limitations of SQL itself.
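For completeness, here is my own rough Python-only sketch of the stochastic variant on invented data, not code from this series: the prediction from the fitted line plus Gaussian noise scaled by the residual spread.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only for reproducibility

df = pd.DataFrame({
    "size":  [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "forks": [11.0, 24.0, np.nan, 43.0, 48.0, np.nan],
})

# Fit forks ~ size on the complete rows.
obs = df.dropna()
slope, intercept = np.polyfit(obs["size"], obs["forks"], deg=1)

# Residual spread of the fit (ddof=2: two estimated parameters).
residuals = obs["forks"] - (intercept + slope * obs["size"])
sigma = residuals.std(ddof=2)

# Stochastic regression imputation: prediction plus noise drawn from
# the residual distribution, restoring part of the lost variability.
missing = df["forks"].isna()
preds = intercept + slope * df.loc[missing, "size"]
df.loc[missing, "forks"] = preds + rng.normal(0.0, sigma, size=missing.sum())
print(df)
```

The added noise is what keeps the imputed values off the regression line, counteracting the variance underestimation of plain regression imputation.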
Well, there is one last important section regarding Outlier Removal, which will be covered in the next part of this series. I hope this article gives you an overview of missing data problems and some simple handling methods. Cheers! 🎉🎉🎉
References
- Missing Data: Two Big Problems with Mean Imputation: https://www.theanalysisfactor.com/mean-imputation/
- Mean substitution — skewness and kurtosis: https://stats.stackexchange.com/questions/364711/why-and-how-does-adding-an-interaction-term-affects-the-confidence-interval-of-a
- Dealing with missing data in ANOVA models: https://www.r-bloggers.com/2018/06/dealing-with-missing-data-in-anova-models/
- Standard error: https://en.wikipedia.org/wiki/Standard_error
- Hot deck imputation preserves the distribution of the item values: https://stats.stackexchange.com/questions/307339/hot-deck-imputation-it-preserves-the-distribution-of-the-item-values-how-c