Data cleaning is often said to account for 80% of a data scientist’s work, and in my experience, that’s true.
Although I always recommend cleaning up missing values, it is not always necessary. There is, however, one case where it is essential:
During the creation of statistical models
And guess what, there’s more than one way to go about it. So here are 3 ways to work with missing values.
I. Remove Missing Values
The first of these methods is to remove the rows or columns holding those missing values.
First, you’ll have to ask yourself why those values are missing.
Most of the time, dropping data from your dataset will lead to a biased model (the same is true of imputing data points).
There isn’t a universal best way to work with missing data, which is why you’ll have to explore different options to determine what’s best for your situation.
For instance, missing data can sometimes help you obtain better forecasts. Imagine a survey of individuals: removing the missing data could bias the results of our model. Instead, we could use the fact that a value is missing as information to enrich our understanding and, in turn, our model.
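One way to keep that information is to record which values were missing before dropping or imputing anything. The snippet below is a minimal sketch with pandas. The survey data and the column names (age, income, income_missing) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; the column names are illustrative only.
survey = pd.DataFrame({
    "age": [25, 41, 37, 52],
    "income": [32000.0, np.nan, 58000.0, np.nan],
})

# Flag the respondents who left the income question blank,
# so that later cleaning steps do not hide that fact.
survey["income_missing"] = survey["income"].isna()

print(survey)
```

The new boolean column can then be passed to a model as a feature of its own, alongside whatever you decide to do with the original column.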
Data entry errors, mechanical errors, or missing data that isn’t useful for our question of interest are acceptable cases for dropping missing values.
A. Drop any row with a missing value.
B. Drop only the rows where all values are missing.
C. Drop only the rows with missing values in column 3.
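The three cases above could be written with pandas roughly as follows. This is a sketch on a small made-up DataFrame, where "col3" stands in for column 3.

```python
import numpy as np
import pandas as pd

# Toy data: row 0 is complete, rows 1 and 2 are partly missing,
# and row 3 is entirely missing. Column names are placeholders.
df = pd.DataFrame({
    "col1": [1.0, np.nan, 3.0, np.nan],
    "col2": [10.0, 20.0, np.nan, np.nan],
    "col3": [100.0, 200.0, np.nan, np.nan],
})

# A. Drop any row with a missing value (how="any" is the default).
drop_any = df.dropna()

# B. Drop only the rows where every value is missing.
drop_all = df.dropna(how="all")

# C. Drop only the rows with a missing value in col3.
drop_col3 = df.dropna(subset=["col3"])

print(len(drop_any), len(drop_all), len(drop_col3))
```

To drop columns instead of rows, dropna also accepts axis=1.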
II. Imputing Values
Imputing values into a dataset is certainly the most common way professionals work with missing data.
You commonly fill the missing values with the mean, the median, or the mode.
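As a rough sketch, each of these fills is a single fillna call in pandas. The data and the column names (price, city) below are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "price" contains one outlier (95.0), "city" is categorical.
df = pd.DataFrame({
    "price": [10.0, 12.0, np.nan, 11.0, 95.0],
    "city": ["Paris", "Lyon", "Paris", None, "Paris"],
})

# Mean imputation: the fill value is pulled upward by the outlier.
price_mean = df["price"].fillna(df["price"].mean())

# Median imputation: less sensitive to the outlier.
price_median = df["price"].fillna(df["price"].median())

# Mode imputation: the most frequent category.
# mode() returns a Series because ties are possible, so take the first entry.
city_mode = df["city"].fillna(df["city"].mode()[0])

print(price_mean[2], price_median[2], city_mode[3])
```

Comparing the mean fill with the median fill on the same column is a quick way to see how much a single extreme value can move the imputed number.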
The advantage is that you are not directly removing the rows or columns associated with missing values.
The drawback is that you are diluting the predictive power of your features by reducing their variability.
Whether we remove or impute missing values, we should be very cautious about the impact this will have on our model.
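The loss of variability is easy to check on a toy example. The sketch below uses made-up numbers and compares the standard deviation of a column before and after mean imputation.

```python
import numpy as np
import pandas as pd

# Made-up feature with two missing values.
s = pd.Series([2.0, 4.0, np.nan, 8.0, np.nan, 10.0])

# Fill the gaps with the mean of the observed values.
filled = s.fillna(s.mean())

# pandas skips NaN by default, so s.std() uses only the observed values.
print("std before imputation:", s.std())
print("std after imputation: ", filled.std())
```

The mean stays the same, but the standard deviation drops, because every imputed point sits exactly at the center of the distribution.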
It is very common to impute in the following ways:
- Impute the mean of a column.
- If you are working with categorical data, use the mode of the column. For a numeric variable with outliers, the median is usually a safer choice than the mean.
- Impute 0, a very small number, or a very large number to differentiate missing values from other values.
- Use k-nearest neighbors (KNN) to impute values based on the rows that are most similar across the other features.
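The last two options in the list could look like the sketch below. It assumes scikit-learn is installed and uses its KNNImputer. The data, the sentinel value -1, and n_neighbors=2 are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric data; column names are illustrative only.
df = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 180.0, 170.0],
    "weight": [50.0, 60.0, 70.0, np.nan, 65.0],
    "age": [20.0, 30.0, 40.0, 50.0, 35.0],
})

# Sentinel imputation: fill with a value that cannot occur naturally,
# so that missing entries stay distinguishable from real ones.
sentinel_filled = df.fillna(-1)

# KNN imputation: each missing entry is replaced by the average of that
# feature over the 2 nearest rows, measured on the features both rows have.
imputer = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)

print(sentinel_filled)
print(knn_filled)
```

KNN imputation relies on distances, so in practice the features are usually put on comparable scales first. A sentinel value, on the other hand, tends to suit tree-based models better than linear ones, since it is not a meaningful quantity.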