I am glad to have you back for the second part of my complete guide on data cleaning.
- #1: Tidying Messy Data
- #2: Dealing with Missing Data ← You are here
- #3.1: A discussion on the Nature of Outliers
- #3.2: The Origin of Outliers & Detection Techniques
- #4.1: Where does Data Duplication come from?
- #4.2: A Practical Tutorial for Data Deduplication
When you are missing someone, time seems to move slower, and when I’m falling in love with someone, time seems to be moving faster.
— Taylor Swift
Once you realize that your dataset has missing values, you will understand just how much time you may spend dealing with them, and how slowly that time may pass.
In the previous tutorial, I introduced what a messy dataset looks like, the consequences of dirty data, how to explore & get to know a dataset, the concept of tidy data, and some fundamental data processing techniques.
In this article, I will walk you through one of the two most common and troublesome data problems: Missing Data.
3. Missing Data Handling
a. Types of Missing Data
You have doubtlessly heard the term “missing value” and have definitely confronted datasets where a few, or thousands of, records are null, regardless of whether you are a novice or an experienced analyst. As a wise man once said, "… in this world nothing can be said to be certain, except death and taxes". I would like to interject and add “and missing data, as well as outliers”. Look at the following examples, which hopefully recall some painful memories within you ( ´ ꒳ ` ).
According to Tony Fischetti, there are three types of missing data which you may encounter while working.
- Missing Completely At Random (MCAR): MCAR occurs when the data’s missingness is unrelated to the data itself. This would occur, for example, if rows were deleted from a database at random, or if a gust of wind took a random sample of a surveyor’s survey forms off into the horizon.
- Missing At Random (MAR): despite its name, Missing At Random is somewhat misleading. It implies that there is a systematic relationship between the missingness of a variable and other observed variables, but not the value of the missing variable itself. Let’s take climate data as an example, in which we collect humidity, temperature, and CO2. MAR happens when the `humidity` and `CO2` data are absent below -10 degrees Celsius (recorded in the `temperature` variable), because those sensors freeze and stop functioning properly. Despite the extreme conditions, the temperature sensor can still accurately detect variations even at sub-zero temperatures below -100 degrees Celsius.
- Missing Not At Random (MNAR): MNAR, conversely, occurs when the absence of a variable is correlated with the value of that variable itself. For instance, consider a scenario where a weighing scale can only measure up to 3,700 pounds and, as a result, the eight cars that exceeded this threshold were recorded as `NULL`s.
b. Dealing with Missing Data
Analysis of missing values
First of all, you need to summarize how many `NULL` values are in your dataset. This can be easily achieved in common programming languages through well-developed methods like `.describe()` or `summary()`. In an SQL-like language such as BigQuery, you have to work around this a bit.
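For example, a quick way to profile missingness in BigQuery could look like the sketch below; the table name `project.dataset.github_repos` is a placeholder for illustration:

```sql
-- Profile missingness: count rows and NULLs per column of interest.
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(master_branch IS NULL) AS null_master_branch,
  COUNTIF(integrate_branch IS NULL) AS null_integrate_branch,
  ROUND(100 * COUNTIF(master_branch IS NULL) / COUNT(*), 2) AS pct_null_master_branch,
  ROUND(100 * COUNTIF(integrate_branch IS NULL) / COUNT(*), 2) AS pct_null_integrate_branch
FROM `project.dataset.github_repos`;
```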
Upon initial inspection, it appears that you cannot incorporate `master_branch` and `integrate_branch` in your work due to a significant proportion of missing data. Once you acquire a general understanding of the missingness in your dataset, you can employ one of the following techniques to address it.
Method 1: Completely Ignore Nulls
Frankly speaking, you just remove all records containing missing values in any of the extracted variables or columns. This method is generally known as complete case analysis, a.k.a. list-wise deletion, as you keep only fully non-null records in your analysis. Well, sometimes less is more, I guess.
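A minimal sketch of list-wise deletion in BigQuery might look as follows (again, the table name is a placeholder):

```sql
-- Complete case analysis: keep only records where every variable of interest is non-null.
SELECT
  forks,
  size,
  watchers
FROM `project.dataset.github_repos`
WHERE forks IS NOT NULL
  AND size IS NOT NULL
  AND watchers IS NOT NULL;
```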
Upon implementing this procedure, you are essentially creating a subset of the provided dataset. In other words, you are drawing a sample from the (possible) population of the observations of interest. Undoubtedly, this comes at a cost, which varies depending on the type of missingness you are tackling.
Method 2: Partially Ignore Nulls
This methodology is frequently employed when estimating covariance or correlation matrices, where each entry only requires a single pair of numerical variables to be non-null. That is, for any given cell of the matrix, you only need to retain records where the two variables involved are both non-null, rather than requiring all variables to be non-null simultaneously. This technique is commonly referred to as available-case analysis or pairwise deletion. As far as I know, a wide range of analysis of variance (ANOVA) techniques similarly require only the variables concerned to be non-null.
It is noteworthy that pairwise deletion may result in nonsensical correlation coefficients (e.g., above 1), because the number of inputs used for the calculation varies across cells of the covariance/correlation matrix.
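In BigQuery, the aggregate `CORR()` effectively performs pairwise deletion on its own, since it skips any row where either input is NULL. A sketch with our placeholder table:

```sql
-- Each coefficient is computed from the rows where BOTH of its inputs are non-null,
-- so different cells of the matrix may be based on different subsets of records.
SELECT
  CORR(forks, size)     AS corr_forks_size,
  CORR(forks, watchers) AS corr_forks_watchers,
  CORR(size, watchers)  AS corr_size_watchers
FROM `project.dataset.github_repos`;
```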
Method 3: Mean Substitution
This approach is relatively straightforward: all `NULL` values are substituted with the mean of the variable of interest. In certain instances, the median or mode can be utilized as the substitute for missing data.
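A minimal mean-substitution sketch in BigQuery, using an analytic `AVG()` over the whole table (which ignores NULLs by design); names remain placeholders:

```sql
-- Replace NULLs in forks with the column mean computed from the non-null rows.
SELECT
  COALESCE(forks, AVG(forks) OVER ()) AS forks_imputed,
  size,
  watchers
FROM `project.dataset.github_repos`;
```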
As you may see, this method appears to address the sample-size reduction of the previously mentioned techniques, thereby preserving statistical power. Nevertheless, mean substitution yields downward-biased estimates of the variance, even in the case of MCAR, because it disregards the dispersion that would have been present had the data not been missing. Consequently, despite introducing it here, I would not recommend this technique under any circumstances.
Method 4: Hot Deck Imputation
Briefly speaking, this procedure replaces missing values with existing values taken from other records in the dataset. The basic approach uses a random (of course, non-null) value from the same column to replace a missing value. In more sophisticated approaches, the replacement value is sourced from a row that bears similarity to the row with the missing data.
Indeed, I am introducing this method to familiarize you with the subsequent method — regression imputation.
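A simple hot-deck sketch in BigQuery, consistent with the note below, could use `LAST_VALUE(... IGNORE NULLS)` to pull the most recent non-null value from a preceding row; here, ordering by `size` as a crude similarity proxy is purely an assumption for illustration:

```sql
-- Fill each NULL in forks with the last non-null forks value among preceding rows
-- (ordered by size); rows before the first non-null value remain NULL.
SELECT
  LAST_VALUE(forks IGNORE NULLS) OVER (
    ORDER BY size
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS forks_imputed,
  size,
  watchers
FROM `project.dataset.github_repos`;
```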
Note: please be aware that a row may remain null, since we use the `LAST_VALUE()` function with the `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` setup, which returns a null value if the first row is null.
Method 5: Regression Imputation
Although it may be challenging to apply this technique in SQL-like databases, I would like to introduce the general concept here nonetheless, as it may prove useful in circumstances where a programming language like Python can be utilized. As its name suggests, this method employs a regression model to predict replacement values for missing items, using other existing columns as independent variables. Let’s consider our beloved Github Dataset (available here) as an example: the missing values of the `forks` column can be estimated based on the `size` and `watchers` columns. Sounds promising, does it not?
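That said, a rough single-predictor version is feasible in plain BigQuery by computing the ordinary least squares slope and intercept in closed form (slope = covariance / variance) on the complete cases. This is only a sketch with placeholder names; the full multi-predictor variant is better left to a language like Python:

```sql
-- Fit forks ~ size on the complete cases, then predict forks where it is missing.
WITH fit AS (
  SELECT
    COVAR_SAMP(size, forks) / VAR_SAMP(size) AS slope,
    AVG(forks) - COVAR_SAMP(size, forks) / VAR_SAMP(size) * AVG(size) AS intercept
  FROM `project.dataset.github_repos`
  WHERE forks IS NOT NULL AND size IS NOT NULL
)
SELECT
  COALESCE(t.forks, fit.intercept + fit.slope * t.size) AS forks_imputed,
  t.size,
  t.watchers
FROM `project.dataset.github_repos` AS t
CROSS JOIN fit;
```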
By utilizing information from other columns and patterns in the dataset, this technique generates better-informed replacement values. As long as the relevant predictors are included, the method produces unbiased estimates of the mean and regression coefficients in the case of MCAR & MAR. However, the predicted values lack the natural residuals around the regression line, which leads to underestimating the variability of the missing values.
The following chart illustrates the behaviour of different imputation procedures.
As implied by the above diagram, a more effective approach to regression imputation involves incorporating conditional noise into the prediction of the missing values, known as Stochastic Regression Imputation. Additionally, there is an even more appealing and sophisticated approach called Multiple Imputation, whose results are outstanding. However, I cannot introduce these two methods here due to the limitations of SQL itself.
Well, there is one last important topic, Outlier Removal, which will be covered in the next part of this series. I hope this article gives you an overview of missing data problems and some simple handling methods. Cheers! 🎉🎉🎉
References
- Missing Data: Two Big Problems with Mean Imputation: https://www.theanalysisfactor.com/mean-imputation/
- Mean substitution — skewness and kurtosis: https://stats.stackexchange.com/questions/364711/why-and-how-does-adding-an-interaction-term-affects-the-confidence-interval-of-a
- Dealing with missing data in ANOVA models: https://www.r-bloggers.com/2018/06/dealing-with-missing-data-in-anova-models/
- Standard error: https://en.wikipedia.org/wiki/Standard_error
- Hot deck imputation preserves the distribution of the item values: https://stats.stackexchange.com/questions/307339/hot-deck-imputation-it-preserves-the-distribution-of-the-item-values-how-c