In this chapter, we cover the basics of fixing data quality issues, transforming data before modeling, and organizing your data for convenient modeling.
First, let’s familiarize ourselves with invalid data values and missing entries.
A missing entry is a data value that is not stored for a variable in the observation of interest. There are standard practices for dealing with missing entries.
What counts as an invalid value depends on the variable you are analyzing or the question you are trying to answer. For example, a FICO score ranges between 300 and 850, so any value below 300 or above 850 should be treated as invalid. Similarly, a negative age, weight, or income is invalid as well.
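As a quick illustration, the minimal sketch below flags out-of-range FICO scores; the fico vector here is hypothetical.
#Hypothetical example: flag FICO scores outside the valid 300-850 range
fico <- c(720, 810, 95, 640, 900)
fico < 300 | fico > 850
## [1] FALSE FALSE  TRUE FALSE  TRUE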
Let’s open custdata.RDS, a dataset based on the 2016 American Community Survey Public Use Microdata Sample, and analyze the ‘age’ variable.
my_data <- readRDS("R_data_files/custdata.RDS")
plot(sort(my_data$age))
sum(my_data$age==0)
## [1] 77
my_data$age_1 <- my_data$age
my_data$age_1[my_data$age_1 == 0] <- NA
summary(my_data$age_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 21.00 34.00 48.00 49.22 62.00 120.00 77
#Alternatively, we can use dplyr package
library(dplyr)
my_data <- my_data %>%
mutate(age_2 = na_if(age, 0))
A sorted plot of all age values shows that some ‘age’ entries are equal to 0. To be exact, there are 77 such entries. We create a new variable age_1 in which values equal to 0 become NA.
Similarly, the code below converts the 45 negative income values to NA.
sum(my_data$income<0)
## [1] 45
my_data$income_1 <- my_data$income
my_data$income_1[my_data$income_1 < 0] <- NA
summary(my_data$income_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 10700 26300 41793 51800 1257000 45
#Alternatively, we can use dplyr package
library(dplyr)
my_data <- my_data %>%
mutate(income_2 = ifelse(income < 0, NA, income))
Some variables are categorical and have to be treated as categorical, not numeric. For example, in ‘gas_usage’ the values 1, 2, and 3 represent three different categories, while the remaining values represent the monthly gas bill amount. The first three values must be converted either into three separate binary variables or into a new categorical variable. See below.
summary(my_data$gas_usage)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 3.00 10.00 41.17 60.00 570.00 1720
my_data$gas_usage_1 <- my_data$gas_usage
my_data$gas_usage_1[my_data$gas_usage_1 > 3] <- NA
my_data$gas_usage_1 <- as.factor(my_data$gas_usage_1)
summary(my_data$gas_usage_1)
## 1 2 3 NA's
## 2389 6609 24984 39280
my_data$gas_usage_2 <- my_data$gas_usage
my_data$gas_usage_2[my_data$gas_usage_2 < 4] <- NA
#Alternatively, three separate binary variables, one for each category
my_data$gas_usage_A1 <- my_data$gas_usage == 1
my_data$gas_usage_A2 <- my_data$gas_usage == 2
my_data$gas_usage_A3 <- my_data$gas_usage == 3
#Alternatively, we can use dplyr package
my_data <- my_data %>%
mutate(gas_with_rent = (gas_usage == 1),
gas_with_electricity = (gas_usage == 2),
no_gas_bill = (gas_usage == 3) ) %>%
mutate(gas_usage_new = ifelse(gas_usage < 4, NA, gas_usage))
Let’s now take a look at missing values in some columns.
sum(is.na(my_data$age_1))
## [1] 77
sum(is.na(my_data$income_1))
## [1] 45
There are two standard ways to deal with missing values: either drop rows with missing values or convert them to a meaningful value. When you run various statistical analyses in R, rows with missing values are typically dropped by default. If there are a lot of missing values, this can be problematic.
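For instance, dropping incomplete rows explicitly (here based only on age_1 and income_1) could look like the following sketch; whether dropping is acceptable depends on how many rows you lose.
#Keep only rows with no missing age_1 or income_1 values
keep_rows <- complete.cases(my_data[, c("age_1", "income_1")])
my_data_complete <- my_data[keep_rows, ]
nrow(my_data) - nrow(my_data_complete) #number of rows dropped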
When there are missing values for a categorical variable, we can create a new category level and assign all missing entries to that category (a short sketch of this appears after the income example below). When there are missing values for a numeric variable, they can be imputed. Depending on the importance of the variable and why you expect it to be missing, a different imputation is appropriate. Often, the mean value of the explanatory variable is computed and inserted into the missing entries. It is a good idea to create a binary variable that indicates that the entry is imputed. Let’s look at income in our example.
summary(my_data$income_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 10700 26300 41793 51800 1257000 45
mean_income <- mean(my_data$income_1, na.rm = TRUE)
my_data$income_1_imputed <- 0
my_data$income_1_imputed[is.na(my_data$income_1)] <- 1
my_data$income_1[is.na(my_data$income_1)] <- mean_income
summary(my_data$income_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 10700 26300 41793 51700 1257000
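The income example above imputes a numeric variable. For a categorical variable, the ‘new category level’ approach mentioned earlier might look like the following sketch; housing_type is assumed to be a character column that may contain NA entries.
#Assign missing entries of a categorical variable to a new level "missing"
my_data$housing_type_1 <- ifelse(is.na(my_data$housing_type), "missing", my_data$housing_type)
my_data$housing_type_1 <- as.factor(my_data$housing_type_1)
summary(my_data$housing_type_1)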
You should try to figure out why some values are missing: a different approach is appropriate depending on the reason. For example, income values could be missing because high earners do not feel comfortable sharing their income; in that case, imputing the average value for the missing entries is completely incorrect. If you don’t know whether the missing values are random or systematic, it is safer to assume that the missingness is systematic rather than to work hard at imputing values based on the assumption of random missingness.
There are numerous tools in R to make your life a lot easier. One package, vtreat, creates a treatment plan that records all the information needed to repeat the data treatment process. Let’s examine a simple example.
library(vtreat)
library(dplyr)
my_data_orig=readRDS("R_data_files/custdata.RDS")
my_data_orig <- my_data_orig %>%
mutate(age = na_if(age, 0))
my_data_orig <- my_data_orig %>%
mutate(income = ifelse(income < 0, NA, income))
my_data_orig <- my_data_orig %>%
mutate(gas_with_rent = (gas_usage == 1),
gas_with_electricity = (gas_usage == 2),
no_gas_bill = (gas_usage == 3) ) %>%
mutate(gas_usage = ifelse(gas_usage < 4, NA, gas_usage))
variables=setdiff(colnames(my_data_orig), c("custid", "health_ins"))
treatment_plan <- design_missingness_treatment(my_data_orig, varlist = variables)
training_prepared <- prepare(treatment_plan, my_data_orig)
summary(training_prepared)
## custid health_ins sex is_employed
## Length:73262 Mode :logical Length:73262 Min. :0.0000
## Class :character FALSE:7307 Class :character 1st Qu.:0.9505
## Mode :character TRUE :65955 Mode :character Median :1.0000
## Mean :0.9505
## 3rd Qu.:1.0000
## Max. :1.0000
## is_employed_isBAD income income_isBAD marital_status
## Min. :0.0000 Min. : 0 Min. :0.0000000 Length:73262
## 1st Qu.:0.0000 1st Qu.: 10700 1st Qu.:0.0000000 Class :character
## Median :0.0000 Median : 26300 Median :0.0000000 Mode :character
## Mean :0.3518 Mean : 41793 Mean :0.0006142
## 3rd Qu.:1.0000 3rd Qu.: 51700 3rd Qu.:0.0000000
## Max. :1.0000 Max. :1257000 Max. :1.0000000
## housing_type recent_move recent_move_isBAD num_vehicles
## Length:73262 Min. :0.0000 Min. :0.00000 Min. :0.000
## Class :character 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Mode :character Median :0.0000 Median :0.00000 Median :2.000
## Mean :0.1275 Mean :0.02349 Mean :2.066
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:3.000
## Max. :1.0000 Max. :1.00000 Max. :6.000
## num_vehicles_isBAD age age_isBAD state_of_res
## Min. :0.00000 Min. : 21.00 Min. :0.000000 Length:73262
## 1st Qu.:0.00000 1st Qu.: 34.00 1st Qu.:0.000000 Class :character
## Median :0.00000 Median : 48.00 Median :0.000000 Mode :character
## Mean :0.02348 Mean : 49.22 Mean :0.001051
## 3rd Qu.:0.00000 3rd Qu.: 62.00 3rd Qu.:0.000000
## Max. :1.00000 Max. :120.00 Max. :1.000000
## gas_usage gas_usage_isBAD gas_with_rent gas_with_rent_isBAD
## Min. : 4.00 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.: 50.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median : 76.01 Median :0.0000 Median :0.00000 Median :0.00000
## Mean : 76.01 Mean :0.4873 Mean :0.03339 Mean :0.02348
## 3rd Qu.: 76.01 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :570.00 Max. :1.0000 Max. :1.00000 Max. :1.00000
## gas_with_electricity gas_with_electricity_isBAD no_gas_bill
## Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.09238 Mean :0.02348 Mean :0.3492
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.00000 Max. :1.0000
## no_gas_bill_isBAD
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.02348
## 3rd Qu.:0.00000
## Max. :1.00000
Before going ahead with modeling, you should examine the data for further issues. Among many other things, you would want to convert the ‘sex’ variable from character to factor.
summary(training_prepared$sex)
## Length Class Mode
## 73262 character character
training_prepared$sex=as.factor(training_prepared$sex)
summary(training_prepared$sex)
## Female Male
## 37837 35425
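If you prefer to convert every character column at once rather than one at a time, a sketch using dplyr’s across() (available in dplyr 1.0.0 and later) could look like this; identifier columns such as custid can be converted back to character afterwards if needed.
library(dplyr)
#Convert all character columns to factors in one step
training_prepared <- training_prepared %>%
mutate(across(where(is.character), as.factor))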
The purpose of data transformation is to make data easier to understand, visualize, and model. Having a uniform data format standard can be very beneficial.
Let’s open ‘median_income.RDS’ and match it with the treated ‘custdata.RDS’. To match data, the dplyr package again comes in handy.
Normalization (also known as rescaling) is useful when absolute quantities are less meaningful than relative ones. For example, you may see a wage of 25. Not knowing the currency and the cost of living, this number alone is not very useful. However, seeing that it equals 0.5 relative to the highest wage tells us that the wage of 25 is half of the maximum wage in the data. Normalization can be based on the max, mean, or median, and it can be conditional on some other variable or unconditional.
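As a toy illustration of the wage example above (the numbers are hypothetical), normalizing by the maximum looks like this.
#Hypothetical wages, normalized by the maximum wage
wages <- c(25, 40, 50, 18)
wages / max(wages)
## [1] 0.50 0.80 1.00 0.36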
In our example, let’s normalize income by state. That is, the median income for each state (conditional normalization) will correspond to a value of 1. Alternatively, depending on the question and method of analysis, we could use unconditional normalization (independent of the state of residence). Similarly, we can rescale the age variable so that a customer of average age takes the value of 1.
median_income_table <- readRDS("R_data_files/median_income.RDS")
library(dplyr)
training_prepared <- training_prepared %>%
left_join(., median_income_table, by="state_of_res") %>%
mutate(income_normalized = income/median_income)
head(training_prepared[, c("income", "median_income", "income_normalized")])
## income median_income income_normalized
## 1 22000 21100 1.0426540
## 2 23200 21100 1.0995261
## 3 21000 21100 0.9952607
## 4 37770 21100 1.7900474
## 5 39000 21100 1.8483412
## 6 11100 21100 0.5260664
summary(training_prepared$income_normalized)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4049 1.0000 1.5685 1.9627 46.5556
#Unconditional normalization for income
training_prepared$income_normalized_uncond=training_prepared$income/median(training_prepared$income)
#Normalization of age variable
mean_age <- mean(training_prepared$age)
age_normalized <- training_prepared$age/mean_age
summary(age_normalized)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4267 0.6908 0.9753 1.0000 1.2597 2.4382
You can also rescale your data by using the standard deviation as a unit of distance. This is helpful because, for roughly normally distributed data, about 68% of values fall within one standard deviation of the mean. A customer who is within one standard deviation of the mean age is considered not much older or younger than typical. A customer who is more than one or two standard deviations from the mean can be considered much older or much younger. To make the relative ages even easier to understand, you can also center the data by the mean, so a customer of “typical age” has a centered age of 0.
library(dplyr)
mean_age <- mean(training_prepared$age)
sd_age <- sd(training_prepared$age)
print(mean_age + c(-sd_age, sd_age))
## [1] 31.20407 67.22886
training_prepared$scaled_age <- (training_prepared$age - mean_age) / sd_age
training_prepared %>%
filter(abs(age - mean_age) < sd_age) %>%
select(age, scaled_age) %>%
head()
## age scaled_age
## 1 67 0.9872942
## 2 54 0.2655690
## 3 61 0.6541903
## 4 64 0.8207422
## 5 57 0.4321210
## 6 55 0.3210864
training_prepared %>%
filter(abs(age - mean_age) > sd_age) %>%
select(age, scaled_age) %>%
head()
## age scaled_age
## 1 24 -1.399951
## 2 82 1.820054
## 3 31 -1.011329
## 4 93 2.430745
## 5 76 1.486950
## 6 26 -1.288916
You can rescale variables manually with a couple of lines of code, but when you have multiple numeric variables, you can simplify your life and increase efficiency by using the scale() function to center and scale several variables simultaneously.
dataf <- training_prepared[, c("age", "income", "num_vehicles", "gas_usage")]
summary(dataf)
## age income num_vehicles gas_usage
## Min. : 21.00 Min. : 0 Min. :0.000 Min. : 4.00
## 1st Qu.: 34.00 1st Qu.: 10700 1st Qu.:1.000 1st Qu.: 50.00
## Median : 48.00 Median : 26300 Median :2.000 Median : 76.01
## Mean : 49.22 Mean : 41793 Mean :2.066 Mean : 76.01
## 3rd Qu.: 62.00 3rd Qu.: 51700 3rd Qu.:3.000 3rd Qu.: 76.01
## Max. :120.00 Max. :1257000 Max. :6.000 Max. :570.00
dataf_scaled <- scale(dataf, center=TRUE, scale=TRUE)
(means <- attr(dataf_scaled, 'scaled:center'))
## age income num_vehicles gas_usage
## 49.21647 41792.51062 2.06550 76.00745
(sds <- attr(dataf_scaled, 'scaled:scale'))
## age income num_vehicles gas_usage
## 18.012397 58102.481410 1.156294 50.717778
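One benefit of keeping the stored centers and scales is that you can later rescale new data, such as a test set, with exactly the same parameters. A minimal sketch, where dataf_new stands for a hypothetical data frame with the same four columns:
#Rescale new data using the centers and scales computed above
dataf_new_scaled <- scale(dataf_new, center = means, scale = sds)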
The normal distribution fits many natural phenomena and makes many standard analyses easier to apply. However, many variables, such as monetary amounts (incomes, customer value, account values, purchase sizes, among others), often have skewed distributions. This leads us to consider data transformations that make the data look “more normal”. One such transformation is taking the log of the variable. See the example below.
par(mfrow = c(1, 2))
plot(density(training_prepared$income))
plot(density(log(training_prepared$income)))
It is also very helpful to use log transformations when data ranges over several orders of magnitude, for example populations of towns and cities, incomes, and so on. This is because many modeling techniques run into issues with very wide data ranges. Sometimes a log transformation also makes the results easier to interpret, as in linear regression: using the log of wage as the dependent variable lets coefficients be read as approximate percentage changes in wage. Remember that the log transformation is only well defined for positive values (log(0) is -Inf).
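Note that income in our data includes zeros, for which log() returns -Inf. One common workaround (a sketch, not the only option) is log1p(), which computes log(1 + x) and maps zero to zero.
#log(0) is -Inf, so use log1p(income) = log(1 + income) instead
par(mfrow = c(1, 2))
plot(density(training_prepared$income))
plot(density(log1p(training_prepared$income)))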
Sampling is the process of selecting a subset of a population to represent the whole population. Sampling remains a very useful tool in analysis and modeling, even now that we can analyze larger and larger datasets. It is easier to test and debug code on small subsamples before training the model on the whole dataset. You also need sampling to create training and test splits.
You build a model when you want to make predictions or understand relationships. You also need to check whether the model makes accurate predictions on other, new data. The first set is called the training set; you feed the training set to the model. The second set, the test or hold-out set, is used to verify the results: you use new data to check whether the model’s predictions are any good.
One easy way to create and manage random samples is to add a new column containing a number drawn uniformly at random between 0 and 1. You can then indicate how to split the data, or how big a sample you want, by choosing only those rows for which the randomly generated number is below a certain value. For example, if you want to use ten percent of the data for the test set, let rows belong to the test set if the randomly generated number is less than or equal to 0.1. You can do this in R very easily using the runif() function. You should make sure to set the random seed with set.seed(); this ensures you draw the same sample group every time.
set.seed(25643) #Reusing the same set seed number will allow you to get an identical "random" sample
training_prepared$gp <- runif(nrow(training_prepared))
data_test <- subset(training_prepared, gp <= 0.1) #Use ten percent for the test set
data_train <- subset(training_prepared, gp > 0.1) #Use the remaining data for the training set
dim(data_test)
## [1] 7463 29
dim(data_train)
## [1] 65799 29
In some cases, your data tracks not only the individual or unit itself but also the household the individual belongs to, or the group the unit belongs to. In these cases you may want to do random sampling at the group (household) level, so that all members of a household end up in the same set, test or training.
In the example below, we load household data and identify the unique household ids (using the unique() command), for each of which we create a random number between 0 and 1 (using the runif() command). In the final step, we merge the newly created random numbers back in by household id. As you can see, identical household ids have the same random number associated with them. We can now split the data into different sets at the household level, as shown after the code below.
household_data = readRDS("R_data_files/hhdata.RDS")
hh = unique(household_data$household_id)
set.seed(243674)
households = data.frame(household_id = hh, gp = runif(length(hh)), stringsAsFactors = FALSE)
household_data = dplyr::left_join(household_data,
households,
by = "household_id")
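Once the household-level gp column has been merged in, the actual split (here using roughly ten percent of households for the test set, mirroring the earlier example) could look like this.
#All members of a household share the same gp value, so whole households
#land in either the test set or the training set
hh_test <- subset(household_data, gp <= 0.1)
hh_train <- subset(household_data, gp > 0.1)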
Finally, it is a good idea to keep track of changes. One way to do this is to record the date and time of the last treatment or modification, by adding a column that shows the time of the last action. The commands Sys.time() and Sys.Date() can be used to record the time.
household_data$load_date=Sys.Date()
household_data$load_time=Sys.time()
head(household_data)
## household_id customer_id age income gp load_date load_time
## 1 000008385 000008385_01 74 45600 0.2063638 2021-05-08 2021-05-08 11:59:53
## 2 000012408 000012408_01 54 16300 0.4543296 2021-05-08 2021-05-08 11:59:53
## 3 000013288 000013288_01 59 622000 0.9931105 2021-05-08 2021-05-08 11:59:53
## 4 000013288 000013288_02 67 43000 0.9931105 2021-05-08 2021-05-08 11:59:53
## 5 000017554 000017554_01 47 98000 0.6279021 2021-05-08 2021-05-08 11:59:53
## 6 000017554 000017554_02 54 31200 0.6279021 2021-05-08 2021-05-08 11:59:53