# Required libraries
library(dplyr)    # data manipulation
library(ggplot2)  # plotting
library(glmnet)   # LASSO and ridge regression
library(mgcv)     # generalized additive models (GAMs)
library(methods)  # S4 classes and methods
library(mice)     # multiple imputation of missing values

1) (5 points) As model complexity increases, the prediction error within the training data set will generally decrease. Therefore, we should always use the most complex model possible. Is this statement true? Explain why or why not.

This statement is false. Always choosing the most complex model would lead to overfitting: there is no guarantee that the model with the lowest training MSE will also have the lowest test MSE.

This happens because many statistical methods, e.g. maximum likelihood, specifically estimate coefficients so as to minimize the training set MSE. For these methods, the training set MSE can be quite small, but the test MSE is often much larger.
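As a quick illustration (a small simulation on hypothetical data, not one of the exam data sets), the training MSE keeps falling as the polynomial degree increases, while the test MSE eventually rises again:

# Small simulation (hypothetical data): polynomial degree vs. train/test MSE
set.seed(1)
n <- 100
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)
train <- sample(n, n / 2)
for (d in c(1, 3, 10, 20)) {
  fit <- lm(y ~ poly(x, d), subset = train)
  mse_train <- mean((y[train] - predict(fit, newdata = data.frame(x = x[train])))^2)
  mse_test  <- mean((y[-train] - predict(fit, newdata = data.frame(x = x[-train])))^2)
  cat(sprintf("degree %2d: train MSE = %.3f, test MSE = %.3f\n", d, mse_train, mse_test))
}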

2) (5 points) As the model gets MORE complex, explain what happens to the bias and variance. What happens to bias and variance, as the model gets LESS complex?

The bias-variance trade off is governed by the following equation:

\[E(y_0 - \hat{f}(x_0))^2 = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)\]

In general, as we increase the complexity of the model, the variance increases and the bias decreases; at first the bias tends to fall faster than the variance rises, which is why the test error initially drops before eventually climbing again.
On the other hand, as the model gets less complex, the variance decreases and the bias increases.

3) (5 points) Name something that LASSO does that ridge regression does not do. Explain what the difference is between LASSO and ridge that allows this to happen.

LASSO can perform variable selection, while ridge regression cannot.

For LASSO regression we have:

\[\sum_{i = 1}^{n} \big(y_{i} - \beta_{0} - \sum_{j = 1}^{p} \beta_{j} x_{ij} \big)^2 + \lambda \sum_{j = 1}^{p} |\beta_{j}| = RSS + \lambda \sum_{j = 1}^{p} |\beta_{j}|\]
For Ridge regression we have:

\[\sum_{i = 1}^{n} \big(y_{i} - \beta_{0} - \sum_{j = 1}^{p} \beta_{j} x_{ij} \big)^2 + \lambda \sum_{j = 1}^{p} \beta_{j}^{2} = RSS + \lambda \sum_{j = 1}^{p} \beta_{j}^{2}\]

Both LASSO and Ridge have similar formulations. The only difference is that the \(\beta_{j}^{2}\) term in the Ridge regression penalty is replaced by \(|\beta_{j}|\) in the LASSO penalty. LASSO uses an \(\ell_{1}\) penalty instead of an \(\ell_{2}\) penalty.

For the LASSO, the \(\ell_{1}\) penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter \(\lambda\) is sufficiently large.
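As a quick illustration (simulated data, not one of the exam data sets), fitting both penalties with glmnet at the same value of \(\lambda\) shows the difference:

# Simulated example: LASSO (alpha = 1) sets some coefficients exactly to zero,
# while ridge (alpha = 0) only shrinks them toward zero.
set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
y <- drop(X[, 1:3] %*% c(3, -2, 1)) + rnorm(100)
coef(glmnet(X, y, alpha = 1, lambda = 0.5))  # several entries are exactly 0
coef(glmnet(X, y, alpha = 0, lambda = 0.5))  # all entries nonzero, but shrunk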

4) (5 points) Let’s say you encountered a data set with n = 1000 observations and p = 10000 variables, and you want to perform a regression using the data to model some continuous outcome, and you are primarily interested in prediction of the outcome. Briefly describe any one of the techniques that we have discussed so far this semester that could be appropriate to use here. Also describe one method that we have discussed that would NOT work.

To solve this problem we can use principal components regression (PCR). This approach involves constructing the first M principal components, \(Z_{1}, \ldots , Z_{M}\), and then using these components as the predictors in a linear regression model that is fit using least squares. The key idea is that often a small number of principal components suffice to explain most of the variability in the data, as well as the relationship with the response.
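A minimal sketch of this approach on simulated data with \(p > n\) (this uses the pls package, which is not loaded above, so it is only meant as an illustration):

# PCR sketch on simulated data with p > n (requires the pls package)
library(pls)
set.seed(1)
n <- 100; p <- 1000
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- drop(X[, 1:5] %*% rep(1, 5)) + rnorm(n)
p4_pcr <- pcr(y ~ X, ncomp = 20, scale = TRUE, validation = "CV")
validationplot(p4_pcr, val.type = "MSEP")  # pick the number of components M with lowest CV error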

Other methods discussed during the semester, such as linear discriminant analysis or k-nearest neighbors classification, would not work in this setting because they are classification algorithms, and in this problem we are interested in predicting a continuous outcome.

5) (15 points) Build a linear model to predict the probability of a recurrence event (i.e. the response variable here is “class”). Make sure to cross validate!

5.1) Data import and cleaning

The first step is to import the data.

# Data import
# https://archive.ics.uci.edu/ml/datasets/Breast+Cancer
breastcancer <- read.csv("https://bit.ly/2qg37Hv", header = FALSE)
names(breastcancer) <- c("class", "age", "menopause",
                         "tumor_size", "inv_nodes",
                         "node_caps", "deg_malig",
                         "breast", "breast_quad",
                         "irradiat")

class(breastcancer)
## [1] "data.frame"
str(breastcancer)
## 'data.frame':    286 obs. of  10 variables:
##  $ class      : Factor w/ 2 levels "no-recurrence-events",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ age        : Factor w/ 6 levels "20-29","30-39",..: 2 3 3 5 3 5 4 5 3 3 ...
##  $ menopause  : Factor w/ 3 levels "ge40","lt40",..: 3 3 3 1 3 1 3 1 3 3 ...
##  $ tumor_size : Factor w/ 11 levels "0-4","10-14",..: 6 4 4 3 1 3 5 4 11 4 ...
##  $ inv_nodes  : Factor w/ 7 levels "0-2","12-14",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ node_caps  : Factor w/ 3 levels "?","no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ deg_malig  : int  3 2 2 2 2 2 2 1 2 2 ...
##  $ breast     : Factor w/ 2 levels "left","right": 1 2 1 2 2 1 1 1 1 2 ...
##  $ breast_quad: Factor w/ 6 levels "?","central",..: 3 6 3 4 5 3 3 3 3 4 ...
##  $ irradiat   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
summary(breastcancer)
##                   class        age       menopause     tumor_size
##  no-recurrence-events:201   20-29: 1   ge40   :129   30-34  :60  
##  recurrence-events   : 85   30-39:36   lt40   :  7   25-29  :54  
##                             40-49:90   premeno:150   20-24  :50  
##                             50-59:96                 15-19  :30  
##                             60-69:57                 10-14  :28  
##                             70-79: 6                 40-44  :22  
##                                                      (Other):42  
##  inv_nodes   node_caps   deg_malig       breast       breast_quad 
##  0-2  :213   ?  :  8   Min.   :1.000   left :152   ?        :  1  
##  12-14:  3   no :222   1st Qu.:2.000   right:134   central  : 21  
##  15-17:  6   yes: 56   Median :2.000               left_low :110  
##  24-26:  1             Mean   :2.049               left_up  : 97  
##  3-5  : 36             3rd Qu.:3.000               right_low: 24  
##  6-8  : 17             Max.   :3.000               right_up : 33  
##  9-11 : 10                                                        
##  irradiat 
##  no :218  
##  yes: 68  
##           
##           
##           
##           
## 

We can see that the data is imported as an object of class “data.frame”. The str() function displays the structure of the data and here we can see that the variables have been imported as factors, which is correct according to the data dictionary in the UCI Machine Learning Repository. The only exception is the variable deg_malig, which is imported as numeric, when it should be a factor. Furthermore, missing values have been imported as “?” instead of NA.

Next, I am going to replace all the "?" values with NA and redefine the factor levels and variable types where necessary.

# Missing values: Replace "?" for NA
breastcancer <-  replace(breastcancer, breastcancer == "?", NA)

# node_caps, breast_quad and deg_malig
breastcancer <- breastcancer %>% 
  mutate(node_caps = factor(node_caps, levels = c("no", "yes"))) %>%
  mutate(breast_quad = factor(breast_quad, levels = c("central", "left_low", "left_up", "right_low", "right_up"))) %>% 
  mutate(deg_malig = as.factor(deg_malig))
  
str(breastcancer)  
## 'data.frame':    286 obs. of  10 variables:
##  $ class      : Factor w/ 2 levels "no-recurrence-events",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ age        : Factor w/ 6 levels "20-29","30-39",..: 2 3 3 5 3 5 4 5 3 3 ...
##  $ menopause  : Factor w/ 3 levels "ge40","lt40",..: 3 3 3 1 3 1 3 1 3 3 ...
##  $ tumor_size : Factor w/ 11 levels "0-4","10-14",..: 6 4 4 3 1 3 5 4 11 4 ...
##  $ inv_nodes  : Factor w/ 7 levels "0-2","12-14",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ node_caps  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ deg_malig  : Factor w/ 3 levels "1","2","3": 3 2 2 2 2 2 2 1 2 2 ...
##  $ breast     : Factor w/ 2 levels "left","right": 1 2 1 2 2 1 1 1 1 2 ...
##  $ breast_quad: Factor w/ 5 levels "central","left_low",..: 2 5 2 3 4 2 2 2 2 3 ...
##  $ irradiat   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
summary(breastcancer)  
##                   class        age       menopause     tumor_size
##  no-recurrence-events:201   20-29: 1   ge40   :129   30-34  :60  
##  recurrence-events   : 85   30-39:36   lt40   :  7   25-29  :54  
##                             40-49:90   premeno:150   20-24  :50  
##                             50-59:96                 15-19  :30  
##                             60-69:57                 10-14  :28  
##                             70-79: 6                 40-44  :22  
##                                                      (Other):42  
##  inv_nodes   node_caps  deg_malig   breast       breast_quad  irradiat 
##  0-2  :213   no  :222   1: 71     left :152   central  : 21   no :218  
##  12-14:  3   yes : 56   2:130     right:134   left_low :110   yes: 68  
##  15-17:  6   NA's:  8   3: 85                 left_up  : 97            
##  24-26:  1                                    right_low: 24            
##  3-5  : 36                                    right_up : 33            
##  6-8  : 17                                    NA's     :  1            
##  9-11 : 10

5.2) Imputation for missing values

From the last calls to str() and summary() we can see that there are no longer any "?" values in the data and that the factor levels of the variables node_caps and breast_quad have been redefined. Also, the variable deg_malig is now a factor.

We can also see that the ratio of "no" to "yes" for the variable node_caps is almost 4:1, meaning that for every "yes" there are approximately four "no". Taking this into account, and since there are only eight NA's in this variable, I will impute the level "no" for the missing values here.

Similarly, for the variable breast_quad, I imputed the most frequent level ("left_low") for the only missing value.

# Imputation
breastcancer$node_caps[which(is.na(breastcancer$node_caps))] <- "no"
breastcancer$breast_quad[which(is.na(breastcancer$breast_quad))] <- "left_low"

summary(breastcancer)
##                   class        age       menopause     tumor_size
##  no-recurrence-events:201   20-29: 1   ge40   :129   30-34  :60  
##  recurrence-events   : 85   30-39:36   lt40   :  7   25-29  :54  
##                             40-49:90   premeno:150   20-24  :50  
##                             50-59:96                 15-19  :30  
##                             60-69:57                 10-14  :28  
##                             70-79: 6                 40-44  :22  
##                                                      (Other):42  
##  inv_nodes   node_caps deg_malig   breast       breast_quad  irradiat 
##  0-2  :213   no :230   1: 71     left :152   central  : 21   no :218  
##  12-14:  3   yes: 56   2:130     right:134   left_low :111   yes: 68  
##  15-17:  6             3: 85                 left_up  : 97            
##  24-26:  1                                   right_low: 24            
##  3-5  : 36                                   right_up : 33            
##  6-8  : 17                                                            
##  9-11 : 10

Finally, our data set breastcancer has no missing values.

5.3) Checking variables

In this section I made a contingency table of class against every other available predictor. I am checking that there is at least one observation in every cell of the table; if that is not the case, the predict() function can fail later.

# Contingency tables  
xtabs(~ class + age, data = breastcancer) # Zero for age 20-29
##                       age
## class                  20-29 30-39 40-49 50-59 60-69 70-79
##   no-recurrence-events     1    21    63    71    40     5
##   recurrence-events        0    15    27    25    17     1
xtabs(~ class + menopause, data = breastcancer)
##                       menopause
## class                  ge40 lt40 premeno
##   no-recurrence-events   94    5     102
##   recurrence-events      35    2      48
xtabs(~ class + tumor_size, data = breastcancer) # Zero for tumor size 5-9
##                       tumor_size
## class                  0-4 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49
##   no-recurrence-events   7    27    23    34    36    35    12    16     2
##   recurrence-events      1     1     7    16    18    25     7     6     1
##                       tumor_size
## class                  5-9 50-54
##   no-recurrence-events   4     5
##   recurrence-events      0     3
xtabs(~ class + inv_nodes, data = breastcancer) # Zero for inv_node 24-26
##                       inv_nodes
## class                  0-2 12-14 15-17 24-26 3-5 6-8 9-11
##   no-recurrence-events 167     1     3     0  19   7    4
##   recurrence-events     46     2     3     1  17  10    6
xtabs(~ class + node_caps, data = breastcancer)
##                       node_caps
## class                   no yes
##   no-recurrence-events 176  25
##   recurrence-events     54  31
xtabs(~ class + deg_malig, data = breastcancer)
##                       deg_malig
## class                    1   2   3
##   no-recurrence-events  59 102  40
##   recurrence-events     12  28  45
xtabs(~ class + breast, data = breastcancer)
##                       breast
## class                  left right
##   no-recurrence-events  103    98
##   recurrence-events      49    36
xtabs(~ class + breast_quad, data = breastcancer)
##                       breast_quad
## class                  central left_low left_up right_low right_up
##   no-recurrence-events      17       75      71        18       20
##   recurrence-events          4       36      26         6       13
xtabs(~ class + irradiat, data = breastcancer)
##                       irradiat
## class                   no yes
##   no-recurrence-events 164  37
##   recurrence-events     54  31

We can see that for the variables age, tumor_size, and inv_nodes there is at least one cell of the contingency table with zero observations. For those cases, I am going to merge the sparse levels into a neighboring level and redefine the levels of the corresponding factor.

# Redefine age
breastcancer <- breastcancer %>% 
  mutate(age = factor(age, levels = c("20-29", "30-39", "20-39", "40-49", "50-59", "60-69", "70-79")))

idx_age1 <- which(breastcancer$age == "20-29")
breastcancer$age[idx_age1] <- "20-39"

idx_age2 <- which(breastcancer$age == "30-39")
breastcancer$age[idx_age2] <- "20-39"

breastcancer <- breastcancer %>% 
  mutate(age = factor(age, levels = c("20-39", "40-49", "50-59", "60-69", "70-79")))
# Redefine tumor_size
breastcancer <- breastcancer %>% 
  mutate(tumor_size = factor(tumor_size, levels = c("0-9", "0-4", "10-14","15-19", "20-24", "25-29", 
                                             "30-34", "35-39", "40-44", "45-49", "5-9", "50-54")))

idx_tumor_size1 <- which(breastcancer$tumor_size == "0-4")
breastcancer$tumor_size[idx_tumor_size1] <- "0-9"

idx_tumor_size2 <- which(breastcancer$tumor_size == "5-9")
breastcancer$tumor_size[idx_tumor_size2] <- "0-9"

breastcancer <- breastcancer %>% 
  mutate(tumor_size = factor(tumor_size, levels = c("0-9", "10-14", "15-19", "20-24", "25-29", 
                                                    "30-34", "35-39", "40-44", "45-49", "50-54")))
# Redefine inv_nodes
breastcancer <- breastcancer %>% 
  mutate(inv_nodes = factor(inv_nodes, levels = c("0-2", "3-5", "6-8", "9-11", "12-14", "15-17", "24-26", "15-26")))

idx_inv_nodes1 <- which(breastcancer$inv_nodes == "15-17")
breastcancer$inv_nodes[idx_inv_nodes1] <- "15-26"

idx_inv_nodes2 <- which(breastcancer$inv_nodes == "24-26")
breastcancer$inv_nodes[idx_inv_nodes2] <- "15-26"

breastcancer <- breastcancer %>% 
  mutate(inv_nodes = factor(inv_nodes, levels = c("0-2", "3-5", "6-8", "9-11", "12-14", "15-26")))

Below are the contingency tables for the modified factors, where we can see the new levels:

# Contingency tables after changes
xtabs(~ class + age, data = breastcancer)
##                       age
## class                  20-39 40-49 50-59 60-69 70-79
##   no-recurrence-events    22    63    71    40     5
##   recurrence-events       15    27    25    17     1
xtabs(~ class + tumor_size, data = breastcancer)
##                       tumor_size
## class                  0-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49
##   no-recurrence-events  11    27    23    34    36    35    12    16     2
##   recurrence-events      1     1     7    16    18    25     7     6     1
##                       tumor_size
## class                  50-54
##   no-recurrence-events     5
##   recurrence-events        3
xtabs(~ class + inv_nodes, data = breastcancer)
##                       inv_nodes
## class                  0-2 3-5 6-8 9-11 12-14 15-26
##   no-recurrence-events 167  19   7    4     1     3
##   recurrence-events     46  17  10    6     2     4

5.4) Train and test sets

The next step is to split the breastcancer data set into a training and a test set. To do so, I used roughly two thirds of the observations for the training set and the remaining third for the test set.

# Set seed and random index
set.seed(1234)
idx <- sample(x = nrow(breastcancer), size = (2/3)*nrow(breastcancer), replace = FALSE)

# Predictors
x_train <- breastcancer[idx, 2:10]
x_test <- breastcancer[-idx, 2:10]

# Response
y_train <- breastcancer[idx, 1]
y_test <- breastcancer[-idx, 1]

Later I will use the log loss function, defined as follows:

\[\text{log loss} = -\frac{1}{N} \sum_{i = 1}^{N} \big(y_{i} \log(p_{i}) + (1 - y_{i}) \log(1 - p_{i})\big)\]

For this function, the values of \(y_{i}\) in the test set are required to be numeric zeros and ones. So far, however, these values are factor levels (strings), as shown below:

head(y_test)
## [1] no-recurrence-events no-recurrence-events no-recurrence-events
## [4] no-recurrence-events no-recurrence-events no-recurrence-events
## Levels: no-recurrence-events recurrence-events

To solve this problem and compute the log loss, I will convert "no-recurrence-events" to zero and "recurrence-events" to one with the following code:

y_test <- ifelse(test = y_test == "no-recurrence-events", yes = 0, no = 1)
head(y_test)
## [1] 0 0 0 0 0 0
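
For reference, the log loss computation can be wrapped in a small helper function (a sketch equivalent to the inline calculations used in the following sections):

# Helper: log loss for 0/1 outcomes y and predicted probabilities p
log_loss_fn <- function(y, p) {
  -mean(y * log(p) + (1 - y) * log(1 - p))
}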

5.5) Variable selection with LASSO

To select the variables for the model, I used the LASSO with the function cv.glmnet() from the glmnet package. It is important to note that the arguments x and y must be supplied as matrices (here obtained with data.matrix()); passing a data frame makes cv.glmnet() return an error. Also, to fit the logistic regression model, we have to specify type.measure = "deviance" and family = "binomial".
To fit the model p5_lasso I used lambda.1se because, in a statistical sense, it is indistinguishable from lambda.min but results in a model with fewer parameters.

# CV to select the value of lambda
p5_lasso_cv <- cv.glmnet(x = data.matrix(x_train), y = data.matrix(y_train), 
                         type.measure = "deviance", alpha = 1, family = "binomial")

# Fit the LASSO model
p5_lasso <- glmnet(x = data.matrix(x_train), y = data.matrix(y_train), 
                   alpha = 1, family = "binomial", lambda = p5_lasso_cv$lambda.1se)

# Coefficients of LASSO model   
coef(p5_lasso)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## (Intercept) -1.82285554
## age          .         
## menopause    .         
## tumor_size   .         
## inv_nodes    0.07407418
## node_caps    0.16779584
## deg_malig    0.35353276
## breast       .         
## breast_quad  .         
## irradiat     .

The only variables whose coefficients were not set to zero are inv_nodes, node_caps, and deg_malig. Below are the predictions on the test set using the resulting model and the corresponding test log loss.

# Make predictions
p5_lasso_pred <- predict(p5_lasso, newx = data.matrix(x_test), type = "response") 

# Compute log loss
log_loss <- (-1/nrow(x_test))*(sum(y_test * log(p5_lasso_pred) + (1 - y_test) * log(1 - p5_lasso_pred)))
log_loss
## [1] 0.5511863

5.6) Logistic regression model

When I was making the predictions in section 5.5), I first tried predict(p5_lasso, newx = x_test, type = "response"), but I got an error that said: "not-yet-implemented method for <data.frame> %*%". Later I found that the error occurred because glmnet expects matrices, so the data has to be passed through data.matrix().

However, by the time I found out how to fix the problem, I had already completed this section.

In this section I fitted a logistic regression model with variables selected by LASSO: class ~ node_caps + deg_malig. Then I made predictions using the predict() function, and finally I computed the log loss.

# Redefine x_train to include the response (needed by glm); x_test keeps only the predictors
x_train <- breastcancer[idx, ]
x_test <- breastcancer[-idx, 2:10]

# Fit logistic regression model
p5 <- glm(class ~ node_caps + deg_malig, data = x_train, family = "binomial")
summary(p5)
## 
## Call:
## glm(formula = class ~ node_caps + deg_malig, family = "binomial", 
##     data = x_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6199  -0.6171  -0.6039   0.7921   1.8930  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.60944    0.38730  -4.156 3.25e-05 ***
## node_capsyes  1.08144    0.41611   2.599  0.00935 ** 
## deg_malig2    0.04742    0.49084   0.097  0.92303    
## deg_malig3    1.52633    0.49150   3.105  0.00190 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 238.51  on 189  degrees of freedom
## Residual deviance: 203.51  on 186  degrees of freedom
## AIC: 211.51
## 
## Number of Fisher Scoring iterations: 4
# Predictions
p5_predictions <- predict(p5, newdata = x_test, type = "response")   
head(p5_predictions) # Predictions are probabilities
##         1         2         5        14        16        17 
## 0.4792361 0.1733579 0.1733579 0.4792361 0.1666667 0.1733579
# Log loss
log_loss <- (-1/nrow(x_test))*(sum(y_test * log(p5_predictions) + (1 - y_test) * log(1 - p5_predictions)))
log_loss
## [1] 0.5418395

5.7) Estimation of log loss with 10-fold cross validation

To get a more accurate estimate of the log loss for the model class ~ node_caps + deg_malig, I used 10-fold cross validation on the complete breastcancer data set.

# CV to get a better estimate of log_loss
set.seed(1234)
df <- breastcancer
df$fold <- sample(1:10, 286, replace = TRUE)
cv_log_loss <- list()

for (i in 1:10){
  temp <- glm(class ~ node_caps + deg_malig, 
              data = df[!(df$fold == i), ], family = "binomial")
  # Predicted values
  y_hat <- predict(temp, newdata = df[(df$fold == i), ], type = "response")
  # Actual Values
  y_actual <- as.numeric(df$class[(df$fold == i)]) - 1
  # Log loss
  cv_log_loss[[i]] <- (-1/length(y_actual))*(sum(y_actual * log(y_hat) + (1 - y_actual) * log(1 - y_hat)))
}

# Values of log loss for different folds
unlist(cv_log_loss)
##  [1] 0.4562944 0.3675229 0.5908199 0.4733092 0.7869962 0.6372770 0.4329519
##  [8] 0.7805364 0.4631691 0.5191718
# Mean value of log loss
mean(unlist(cv_log_loss))
## [1] 0.5508049

6) (15 points) Read the following data set into R using the code below. Predict the price of a car as a function of the other variables in the data set using LASSO regression. Explain all of the steps that you used in the process of building the model.

6.0) Problems when importing the data

Here I mention some problems that I had when importing the data and how I fixed them.

# Data import (incorrect)   
# https://archive.ics.uci.edu/ml/datasets/Automobile
cars <- read.csv("https://bit.ly/2AG1IlJ", header = FALSE)
names(cars) <- c("symboling", "normalized_losses", "make",
                 "fuel_type", "aspiration", "num_of_doors",
                 "body_style", "drive_wheels", "engine_location",
                 "wheel_base", "length", "width", 
                 "height", "curb_weight", "engine_type",
                 "num_of_cylinders", "engine_size", "fuel_system",
                 "bore", "stroke", "compression_ratio",
                 "horsepower", "peak_rpm", "city_mpg",
                 "highway_mpg", "price")
str(cars)
## 'data.frame':    205 obs. of  26 variables:
##  $ symboling        : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ normalized_losses: Factor w/ 52 levels "?","101","102",..: 1 1 1 29 29 1 27 1 27 1 ...
##  $ make             : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
##  $ fuel_type        : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
##  $ aspiration       : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
##  $ num_of_doors     : Factor w/ 3 levels "?","four","two": 3 3 3 2 2 3 2 2 2 3 ...
##  $ body_style       : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
##  $ drive_wheels     : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
##  $ engine_location  : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheel_base       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ length           : num  169 169 171 177 177 ...
##  $ width            : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ height           : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curb_weight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ engine_type      : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
##  $ num_of_cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
##  $ engine_size      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuel_system      : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ bore             : Factor w/ 39 levels "?","2.54","2.68",..: 25 25 3 15 15 15 15 15 12 12 ...
##  $ stroke           : Factor w/ 37 levels "?","2.07","2.19",..: 6 6 29 26 26 26 26 26 26 26 ...
##  $ compression_ratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower       : Factor w/ 60 levels "?","100","101",..: 7 7 22 4 10 6 6 6 17 25 ...
##  $ peak_rpm         : Factor w/ 24 levels "?","4150","4200",..: 12 12 12 18 18 18 18 18 18 18 ...
##  $ city_mpg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highway_mpg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price            : Factor w/ 187 levels "?","10198","10245",..: 33 52 52 38 63 43 65 73 83 1 ...

The output of str() shows that there are some “?” in the data that should be missing values instead. So my first attempt was to replace them with the following code:

# Replace ? for NA
cars <-  replace(cars, cars == "?", NA)

# Structure of the data
str(cars)
## 'data.frame':    205 obs. of  26 variables:
##  $ symboling        : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ normalized_losses: Factor w/ 52 levels "?","101","102",..: NA NA NA 29 29 NA 27 NA 27 NA ...
##  $ make             : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
##  $ fuel_type        : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
##  $ aspiration       : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
##  $ num_of_doors     : Factor w/ 3 levels "?","four","two": 3 3 3 2 2 3 2 2 2 3 ...
##  $ body_style       : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
##  $ drive_wheels     : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
##  $ engine_location  : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheel_base       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ length           : num  169 169 171 177 177 ...
##  $ width            : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ height           : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curb_weight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ engine_type      : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
##  $ num_of_cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
##  $ engine_size      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuel_system      : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ bore             : Factor w/ 39 levels "?","2.54","2.68",..: 25 25 3 15 15 15 15 15 12 12 ...
##  $ stroke           : Factor w/ 37 levels "?","2.07","2.19",..: 6 6 29 26 26 26 26 26 26 26 ...
##  $ compression_ratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower       : Factor w/ 60 levels "?","100","101",..: 7 7 22 4 10 6 6 6 17 25 ...
##  $ peak_rpm         : Factor w/ 24 levels "?","4150","4200",..: 12 12 12 18 18 18 18 18 18 18 ...
##  $ city_mpg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highway_mpg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price            : Factor w/ 187 levels "?","10198","10245",..: 33 52 52 38 63 43 65 73 83 NA ...
# Change normalized losses from factor to numeric
cars$normalized_losses <- as.numeric(cars$normalized_losses)
summary(cars$normalized_losses)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   15.00   26.00   26.57   39.00   52.00      41

This approach replaces the "?" by NA, but it has a problem with the variable normalized_losses. This variable is imported as a factor, even though the data dictionary from the UCI Machine Learning Repository tells us that it is numeric with a range from 65 to 256.
When I convert it to numeric with as.numeric(), I get the underlying factor level codes rather than the original values, so the variable ends up outside its valid range. This is shown in summary(cars$normalized_losses), where the new (incorrect) range is from 2 to 52.
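
A small illustration of the issue with a toy vector (not the cars data): as.numeric() on a factor returns the internal level codes, so a correct conversion has to go through as.character() first.

# Toy example: as.numeric() on a factor gives level codes, not the values
x <- factor(c("65", "128", "256"))
as.numeric(x)                # 3 1 2   (alphabetical level codes)
as.numeric(as.character(x))  # 65 128 256 (the intended numeric values)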

6.1) Data import and cleaning (correct way)

In this section I implement a solution to the problem described above. It can be solved by specifying the argument na.strings = "?" inside the read.csv() function.

# Data import (correct way)
# https://archive.ics.uci.edu/ml/datasets/Automobile
cars <- read.csv("https://bit.ly/2AG1IlJ", header = FALSE, na.strings = "?")
names(cars) <- c("symboling", "normalized_losses", "make",
                 "fuel_type", "aspiration", "num_of_doors",
                 "body_style", "drive_wheels", "engine_location",
                 "wheel_base", "length", "width", 
                 "height", "curb_weight", "engine_type",
                 "num_of_cylinders", "engine_size", "fuel_system",
                 "bore", "stroke", "compression_ratio",
                 "horsepower", "peak_rpm", "city_mpg",
                 "highway_mpg", "price")

# Structure of the data
str(cars)
## 'data.frame':    205 obs. of  26 variables:
##  $ symboling        : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ normalized_losses: int  NA NA NA 164 164 NA 158 NA 158 NA ...
##  $ make             : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
##  $ fuel_type        : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
##  $ aspiration       : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
##  $ num_of_doors     : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
##  $ body_style       : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
##  $ drive_wheels     : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
##  $ engine_location  : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheel_base       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ length           : num  169 169 171 177 177 ...
##  $ width            : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ height           : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curb_weight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ engine_type      : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
##  $ num_of_cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
##  $ engine_size      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuel_system      : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ bore             : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke           : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compression_ratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower       : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peak_rpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ city_mpg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highway_mpg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price            : int  13495 16500 16500 13950 17450 15250 17710 18920 23875 NA ...

As we can see from str(), we still cannot work with the data as it is. The variable symboling was imported as an integer, but the data dictionary tells us that it should be a factor.

# Redefine variables type
cars <- cars %>% 
  mutate(symboling = as.factor(symboling))

# Structure of the data
str(cars)
## 'data.frame':    205 obs. of  26 variables:
##  $ symboling        : Factor w/ 6 levels "-2","-1","0",..: 6 6 4 5 5 5 4 4 4 3 ...
##  $ normalized_losses: int  NA NA NA 164 164 NA 158 NA 158 NA ...
##  $ make             : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
##  $ fuel_type        : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
##  $ aspiration       : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
##  $ num_of_doors     : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
##  $ body_style       : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
##  $ drive_wheels     : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
##  $ engine_location  : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wheel_base       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ length           : num  169 169 171 177 177 ...
##  $ width            : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ height           : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curb_weight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ engine_type      : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
##  $ num_of_cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
##  $ engine_size      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuel_system      : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ bore             : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke           : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compression_ratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower       : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peak_rpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ city_mpg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highway_mpg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price            : int  13495 16500 16500 13950 17450 15250 17710 18920 23875 NA ...

Now that the variables are correctly defined, I called summary() to check for the remaining missing values:

# To check that we still have missing values
summary(cars)
##  symboling normalized_losses         make      fuel_type   aspiration 
##  -2: 3     Min.   : 65       toyota    : 32   diesel: 20   std  :168  
##  -1:22     1st Qu.: 94       nissan    : 18   gas   :185   turbo: 37  
##  0 :67     Median :115       mazda     : 17                           
##  1 :54     Mean   :122       honda     : 13                           
##  2 :32     3rd Qu.:150       mitsubishi: 13                           
##  3 :27     Max.   :256       subaru    : 12                           
##            NA's   :41        (Other)   :100                           
##  num_of_doors       body_style drive_wheels engine_location
##  four:114     convertible: 6   4wd:  9      front:202      
##  two : 89     hardtop    : 8   fwd:120      rear :  3      
##  NA's:  2     hatchback  :70   rwd: 76                     
##               sedan      :96                               
##               wagon      :25                               
##                                                            
##                                                            
##    wheel_base         length          width           height     
##  Min.   : 86.60   Min.   :141.1   Min.   :60.30   Min.   :47.80  
##  1st Qu.: 94.50   1st Qu.:166.3   1st Qu.:64.10   1st Qu.:52.00  
##  Median : 97.00   Median :173.2   Median :65.50   Median :54.10  
##  Mean   : 98.76   Mean   :174.0   Mean   :65.91   Mean   :53.72  
##  3rd Qu.:102.40   3rd Qu.:183.1   3rd Qu.:66.90   3rd Qu.:55.50  
##  Max.   :120.90   Max.   :208.1   Max.   :72.30   Max.   :59.80  
##                                                                  
##   curb_weight   engine_type num_of_cylinders  engine_size     fuel_system
##  Min.   :1488   dohc : 12   eight :  5       Min.   : 61.0   mpfi   :94  
##  1st Qu.:2145   dohcv:  1   five  : 11       1st Qu.: 97.0   2bbl   :66  
##  Median :2414   l    : 12   four  :159       Median :120.0   idi    :20  
##  Mean   :2556   ohc  :148   six   : 24       Mean   :126.9   1bbl   :11  
##  3rd Qu.:2935   ohcf : 15   three :  1       3rd Qu.:141.0   spdi   : 9  
##  Max.   :4066   ohcv : 13   twelve:  1       Max.   :326.0   4bbl   : 3  
##                 rotor:  4   two   :  4                       (Other): 2  
##       bore          stroke      compression_ratio   horsepower   
##  Min.   :2.54   Min.   :2.070   Min.   : 7.00     Min.   : 48.0  
##  1st Qu.:3.15   1st Qu.:3.110   1st Qu.: 8.60     1st Qu.: 70.0  
##  Median :3.31   Median :3.290   Median : 9.00     Median : 95.0  
##  Mean   :3.33   Mean   :3.255   Mean   :10.14     Mean   :104.3  
##  3rd Qu.:3.59   3rd Qu.:3.410   3rd Qu.: 9.40     3rd Qu.:116.0  
##  Max.   :3.94   Max.   :4.170   Max.   :23.00     Max.   :288.0  
##  NA's   :4      NA's   :4                         NA's   :2      
##     peak_rpm       city_mpg      highway_mpg        price      
##  Min.   :4150   Min.   :13.00   Min.   :16.00   Min.   : 5118  
##  1st Qu.:4800   1st Qu.:19.00   1st Qu.:25.00   1st Qu.: 7775  
##  Median :5200   Median :24.00   Median :30.00   Median :10295  
##  Mean   :5125   Mean   :25.22   Mean   :30.75   Mean   :13207  
##  3rd Qu.:5500   3rd Qu.:30.00   3rd Qu.:34.00   3rd Qu.:16500  
##  Max.   :6600   Max.   :49.00   Max.   :54.00   Max.   :45400  
##  NA's   :2                                      NA's   :4

6.2) Imputation for missing values

There are several variables with NA values. I will impute them using the mice() function in the mice package.

# Imputation with mice
temp_data6 <- mice(cars, m = 5, maxit = 5, meth = "pmm", seed = 1234)
## 
##  iter imp variable
##   1   1  normalized_losses*  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   1   2  normalized_losses*  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   1   3  normalized_losses*  num_of_doors*  bore  stroke*  horsepower*  peak_rpm*  price*
##   1   4  normalized_losses*  num_of_doors*  bore*  stroke  horsepower*  peak_rpm*  price*
##   1   5  normalized_losses*  num_of_doors*  bore*  stroke  horsepower*  peak_rpm*  price*
##   2   1  normalized_losses*  num_of_doors*  bore*  stroke  horsepower*  peak_rpm*  price*
##   2   2  normalized_losses*  num_of_doors*  bore  stroke*  horsepower*  peak_rpm*  price*
##   2   3  normalized_losses*  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   2   4  normalized_losses*  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   2   5  normalized_losses*  num_of_doors*  bore*  stroke*  horsepower  peak_rpm*  price*
##   3   1  normalized_losses*  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   3   2  normalized_losses*  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   3   3  normalized_losses*  num_of_doors*  bore  stroke*  horsepower*  peak_rpm*  price*
##   3   4  normalized_losses*  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   3   5  normalized_losses*  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   4   1  normalized_losses*  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   4   2  normalized_losses*  num_of_doors*  bore*  stroke  horsepower*  peak_rpm*  price*
##   4   3  normalized_losses*  num_of_doors*  bore*  stroke  horsepower*  peak_rpm*  price*
##   4   4  normalized_losses*  num_of_doors*  bore  stroke*  horsepower*  peak_rpm*  price*
##   4   5  normalized_losses*  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   5   1  normalized_losses*  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   5   2  normalized_losses*  num_of_doors*  bore  stroke*  horsepower*  peak_rpm*  price*
##   5   3  normalized_losses*  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   5   4  normalized_losses*  num_of_doors*  bore*  stroke  horsepower*  peak_rpm*  price*
##   5   5  normalized_losses*  num_of_doors*  bore*  stroke  horsepower*  peak_rpm*  price*
##  * Please inspect the loggedEvents
## Warning: Number of logged events: 337
cars <- complete(temp_data6, 1)

# To check that we don't have missing values anymore
summary(cars)
##  symboling normalized_losses         make      fuel_type   aspiration 
##  -2: 3     Min.   : 65.0     toyota    : 32   diesel: 20   std  :168  
##  -1:22     1st Qu.: 81.0     nissan    : 18   gas   :185   turbo: 37  
##  0 :67     Median :103.0     mazda     : 17                           
##  1 :54     Mean   :112.4     honda     : 13                           
##  2 :32     3rd Qu.:137.0     mitsubishi: 13                           
##  3 :27     Max.   :256.0     subaru    : 12                           
##                              (Other)   :100                           
##  num_of_doors       body_style drive_wheels engine_location
##  four:114     convertible: 6   4wd:  9      front:202      
##  two : 91     hardtop    : 8   fwd:120      rear :  3      
##               hatchback  :70   rwd: 76                     
##               sedan      :96                               
##               wagon      :25                               
##                                                            
##                                                            
##    wheel_base         length          width           height     
##  Min.   : 86.60   Min.   :141.1   Min.   :60.30   Min.   :47.80  
##  1st Qu.: 94.50   1st Qu.:166.3   1st Qu.:64.10   1st Qu.:52.00  
##  Median : 97.00   Median :173.2   Median :65.50   Median :54.10  
##  Mean   : 98.76   Mean   :174.0   Mean   :65.91   Mean   :53.72  
##  3rd Qu.:102.40   3rd Qu.:183.1   3rd Qu.:66.90   3rd Qu.:55.50  
##  Max.   :120.90   Max.   :208.1   Max.   :72.30   Max.   :59.80  
##                                                                  
##   curb_weight   engine_type num_of_cylinders  engine_size     fuel_system
##  Min.   :1488   dohc : 12   eight :  5       Min.   : 61.0   mpfi   :94  
##  1st Qu.:2145   dohcv:  1   five  : 11       1st Qu.: 97.0   2bbl   :66  
##  Median :2414   l    : 12   four  :159       Median :120.0   idi    :20  
##  Mean   :2556   ohc  :148   six   : 24       Mean   :126.9   1bbl   :11  
##  3rd Qu.:2935   ohcf : 15   three :  1       3rd Qu.:141.0   spdi   : 9  
##  Max.   :4066   ohcv : 13   twelve:  1       Max.   :326.0   4bbl   : 3  
##                 rotor:  4   two   :  4                       (Other): 2  
##       bore          stroke      compression_ratio   horsepower   
##  Min.   :2.54   Min.   :2.070   Min.   : 7.00     Min.   : 48.0  
##  1st Qu.:3.13   1st Qu.:3.100   1st Qu.: 8.60     1st Qu.: 70.0  
##  Median :3.31   Median :3.290   Median : 9.00     Median : 95.0  
##  Mean   :3.32   Mean   :3.243   Mean   :10.14     Mean   :103.8  
##  3rd Qu.:3.58   3rd Qu.:3.410   3rd Qu.: 9.40     3rd Qu.:116.0  
##  Max.   :3.94   Max.   :4.170   Max.   :23.00     Max.   :288.0  
##                                                                  
##     peak_rpm       city_mpg      highway_mpg        price      
##  Min.   :4150   Min.   :13.00   Min.   :16.00   Min.   : 5118  
##  1st Qu.:4800   1st Qu.:19.00   1st Qu.:25.00   1st Qu.: 7788  
##  Median :5200   Median :24.00   Median :30.00   Median :10595  
##  Mean   :5116   Mean   :25.22   Mean   :30.75   Mean   :13668  
##  3rd Qu.:5500   3rd Qu.:30.00   3rd Qu.:34.00   3rd Qu.:16558  
##  Max.   :6600   Max.   :49.00   Max.   :54.00   Max.   :45400  
## 

From the last call to summary() we see that there are no missing values anymore and that, after the imputation, all the numeric variables lie within their valid ranges according to the UCI data dictionary.
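
The ranges can also be checked programmatically (a quick sketch, equivalent to reading the summary output above):

# Quick check: min and max of every numeric column after imputation
sapply(cars[sapply(cars, is.numeric)], range)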

6.3) Train and test sets

The next step is to split the cars data set into a training and a test set. To do so, I used roughly two thirds of the observations for the training set and the remaining third for the test set.

# Train and test
# Set seed and random index
set.seed(1234)
idx <- sample(x = nrow(cars), size = (2/3)*nrow(cars), replace = FALSE)

# Predictors
x_train <- cars[idx, 1:25]
x_test <- cars[-idx, 1:25]

# Response
y_train <- cars[idx, 26]
y_test <- cars[-idx, 26]

6.4) LASSO Regression

To select the variables for the model, I used LASSO with the function cv.glmnet() from package glmnet. It is important to note that the arguments x and y should be obtained using the function data.matrix(), otherwise cv.glmnet() returns an error. Also, to fit the regression model, we have to specify type.measure = "mse" and family = "gaussian".
Next, I fitted the model p6_lasso using lambda.1se because, in a statistical sense, it is indistinguishable from lambda.min but results in a model with fewer parameters.

# Set seed   
set.seed(1234)   

# CV to select the value of lambda
p6_lasso_cv <- cv.glmnet(x = data.matrix(x_train), y = data.matrix(y_train), 
                         type.measure = "mse", alpha = 1, family = "gaussian")

# Plot of MSE vs. log(lambda)   
plot(p6_lasso_cv)    

# Fit the LASSO model
p6_lasso <- glmnet(x = data.matrix(x_train), y = data.matrix(y_train), 
                   alpha = 1, family = "gaussian", lambda = p6_lasso_cv$lambda.1se)

# Coefficients of LASSO model   
coef(p6_lasso)
## 26 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## (Intercept)       -41251.436318
## symboling              .       
## normalized_losses      .       
## make                 -62.433976
## fuel_type              .       
## aspiration             .       
## num_of_doors           .       
## body_style             .       
## drive_wheels           .       
## engine_location     3057.137105
## wheel_base             .       
## length                 .       
## width                549.155503
## height                 .       
## curb_weight            1.468725
## engine_type            .       
## num_of_cylinders       .       
## engine_size           53.920240
## fuel_system            .       
## bore                   .       
## stroke                 .       
## compression_ratio      .       
## horsepower            56.989237
## peak_rpm               .       
## city_mpg               .       
## highway_mpg            .

The resulting model is price ~ make + engine_location + width + curb_weight + engine_size + horsepower; the coefficients of all other predictors were shrunk to zero.
Below are the predictions on the test set using the resulting model and the corresponding test MSE.

# Make predictions
p6_lasso_pred <- predict(p6_lasso, newx = data.matrix(x_test), s = p6_lasso_cv$lambda.1se)

# Compute test MSE
p6_MSEP <- mean((y_test - p6_lasso_pred)^2)
p6_MSEP
## [1] 21670879

7)(15 points) Repeat the previous question but instead of using LASSO use a generalized additive model (GAM).

For this problem I used the function gam() in the mgcv package. I started by fitting a GAM with the predictors selected by LASSO in problem 6), namely: make, engine_location, width, curb_weight, engine_size, and horsepower. Then I fitted different GAM models and computed the test MSE with 10-fold cross validation.

Finally, I chose the GAM model with the smallest test MSE.

7.1) GAM 1

In this model I used a smoother for the variables width, curb_weight, engine_size, and horsepower, and included drive_wheels and engine_location as linear terms. When we include a categorical variable as a linear term (as in this case), gam() fits a fixed effect for each level of the category.

# GAM model with 10 fold CV
set.seed(1234)
df7 <- cars

df7$fold <- sample(1:10, 205, replace = TRUE)
p7_MSEP <- list()

# price ~ drive_wheels + engine_location + s(width) + s(curb_weight) + s(engine_size) + s(horsepower)
for (i in 1:10){
  gam_temp <- gam(price ~ drive_wheels + engine_location + s(width)
                  + s(curb_weight) + s(engine_size) + s(horsepower),
                  data = df7[!(df7$fold == i), ], method = "REML", select = TRUE)
  # Predictions model gam_temp
  gam_temp_pred <- predict.gam(gam_temp, newdata = df7[(df7$fold == i), ])
  # MSEP model gam_temp
  p7_MSEP[[i]] <- mean((df7$price[(df7$fold == i)] - gam_temp_pred)^2)
}

# MSEP for different folds
unlist(p7_MSEP) 
##  [1]  7559961  6763730 24519172 39526409  6169915 50907957  6139796
##  [8] 11395538  5628761  5500033
# Mean MSEP
mean(unlist(p7_MSEP))
## [1] 16411127

7.2) GAM 2

This GAM model is similar to the previous one, but now I specify s(horsepower, by = engine_location) to fit a separate smooth of horsepower for each level of the categorical variable. This is called a factor-smooth interaction.

# GAM model with 10 fold CV
set.seed(1234)
df7 <- cars

df7$fold <- sample(1:10, 205, replace = TRUE)
p7_MSEP <- list()

# price ~ drive_wheels + s(width) + s(curb_weight) + s(engine_size) + s(horsepower, by = engine_location)
for (i in 1:10){
  gam_temp <- gam(price ~ drive_wheels + s(width)
                  + s(curb_weight) + s(engine_size) + s(horsepower, by = engine_location),
                  data = df7[!(df7$fold == i), ], method = "REML", select = TRUE)
  # Predictions model gam_temp
  gam_temp_pred <- predict.gam(gam_temp, newdata = df7[(df7$fold == i), ])
  # MSEP model gam_temp
  p7_MSEP[[i]] <- mean((df7$price[(df7$fold == i)] - gam_temp_pred)^2)
}

# MSEP for different folds
unlist(p7_MSEP) 
##  [1]  7509187  6780296 24186456 39510425  6529835 51048495  6144899
##  [8] 11311768  5605478  5510205
# Mean MSEP
mean(unlist(p7_MSEP))
## [1] 16413704

7.3) GAM 3

This model is a combination of the previous two. Here I model a factor-smooth interaction with s(horsepower, by = engine_location) and also include a varying intercept (the linear engine_location term). This should help if the levels of engine_location differ in their overall means in addition to the shapes of their smooths.

# GAM model with 10 fold CV
set.seed(1234)
df7 <- cars

df7$fold <- sample(1:10, 205, replace = TRUE)
p7_MSEP <- list()

# price ~ drive_wheels + engine_location + s(width) + s(curb_weight) + s(engine_size) + s(horsepower, by = engine_location)
for (i in 1:10){
  gam_temp <- gam(price ~ drive_wheels + engine_location + s(width)
                  + s(curb_weight) + s(engine_size) + s(horsepower, by = engine_location),
                  data = df7[!(df7$fold == i), ], method = "REML", select = TRUE)
  # Predictions model gam_temp
  gam_temp_pred <- predict.gam(gam_temp, newdata = df7[(df7$fold == i), ])
  # MSEP model gam_temp
  p7_MSEP[[i]] <- mean((df7$price[(df7$fold == i)] - gam_temp_pred)^2)
}

# MSEP for different folds
unlist(p7_MSEP) 
##  [1]  7561750  6764076 24519143 39526587  6170054 50908202  6139432
##  [8] 11394694  5629081  5499898
# Mean MSEP
mean(unlist(p7_MSEP))
## [1] 16411292

7.4) GAM 4

Finally, I fitted one last GAM model without linear terms, considering only smooths for the continuous predictors.

# GAM model with 10 fold CV
set.seed(1234)
df7 <- cars

df7$fold <- sample(1:10, 205, replace = TRUE)
p7_MSEP <- list()

# price ~ s(width) + s(curb_weight) + s(engine_size) + s(horsepower)
for (i in 1:10){
  gam_temp <- gam(price ~ s(width) + s(curb_weight) + s(engine_size) + s(horsepower),
                  data = df7[!(df7$fold == i), ], method = "REML", select = TRUE)
  # Predictions model gam_temp
  gam_temp_pred <- predict.gam(gam_temp, newdata = df7[(df7$fold == i), ])
  # MSEP model gam_temp
  p7_MSEP[[i]] <- mean((df7$price[(df7$fold == i)] - gam_temp_pred)^2)
}

# MSEP for different folds
unlist(p7_MSEP) 
##  [1]  9406306  9906989 22583957 52967969 11918524 51985440  5755281
##  [8] 17740862  4320213  3490733
# Mean MSEP
mean(unlist(p7_MSEP))
## [1] 19007627

The GAM model with the smallest test MSE is GAM 1, namely price ~ drive_wheels + engine_location + s(width) + s(curb_weight) + s(engine_size) + s(horsepower).
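
For completeness, the selected model could then be refit on the complete cars data set (a sketch; this refit is not part of the cross-validated comparison above):

# Refit the selected GAM (GAM 1) on the full data set
p7_final <- gam(price ~ drive_wheels + engine_location + s(width)
                + s(curb_weight) + s(engine_size) + s(horsepower),
                data = cars, method = "REML", select = TRUE)
summary(p7_final)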

8) (5 points) If you have to choose between using the LASSO or a GAM model, which one would you choose?

Using the cars data set, the test MSE for the LASSO model was 21670879, while for the best GAM model it was 16411127. Taking this into account, I prefer the GAM model for this data set.

9) (30 points) Read the following dataset into R using the code below. Use it to predict the variable “ViolentCrimesPerPop” by using any model that we have covered so far (Note though: Don’t use trees yet. That’s not on this exam.)

9.1) Data import and cleaning

The first step is to import the data.

# Data import
# https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime
crimes <- read.csv("https://bit.ly/2CJ7Wko", header = FALSE, na.strings = "?")

# Structure of data
str(crimes, list.len = ncol(crimes))
## 'data.frame':    1994 obs. of  128 variables:
##  $ V1  : int  8 53 24 34 42 6 44 6 21 29 ...
##  $ V2  : int  NA NA NA 5 95 NA 7 NA NA NA ...
##  $ V3  : int  NA NA NA 81440 6096 NA 41500 NA NA NA ...
##  $ V4  : Factor w/ 1828 levels "Aberdeencity",..: 796 1626 2 1788 142 1520 840 1462 669 288 ...
##  $ V5  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ V6  : num  0.19 0 0 0.04 0.01 0.02 0.01 0.01 0.03 0.01 ...
##  $ V7  : num  0.33 0.16 0.42 0.77 0.55 0.28 0.39 0.74 0.34 0.4 ...
##  $ V8  : num  0.02 0.12 0.49 1 0.02 0.06 0 0.03 0.2 0.06 ...
##  $ V9  : num  0.9 0.74 0.56 0.08 0.95 0.54 0.98 0.46 0.84 0.87 ...
##  $ V10 : num  0.12 0.45 0.17 0.12 0.09 1 0.06 0.2 0.02 0.3 ...
##  $ V11 : num  0.17 0.07 0.04 0.1 0.05 0.25 0.02 1 0 0.03 ...
##  $ V12 : num  0.34 0.26 0.39 0.51 0.38 0.31 0.3 0.52 0.38 0.9 ...
##  $ V13 : num  0.47 0.59 0.47 0.5 0.38 0.48 0.37 0.55 0.45 0.82 ...
##  $ V14 : num  0.29 0.35 0.28 0.34 0.23 0.27 0.23 0.36 0.28 0.8 ...
##  $ V15 : num  0.32 0.27 0.32 0.21 0.36 0.37 0.6 0.35 0.48 0.39 ...
##  $ V16 : num  0.2 0.02 0 0.06 0.02 0.04 0.02 0 0.04 0.02 ...
##  $ V17 : num  1 1 0 1 0.9 1 0.81 0 1 1 ...
##  $ V18 : num  0.37 0.31 0.3 0.58 0.5 0.52 0.42 0.16 0.17 0.54 ...
##  $ V19 : num  0.72 0.72 0.58 0.89 0.72 0.68 0.5 0.44 0.47 0.59 ...
##  $ V20 : num  0.34 0.11 0.19 0.21 0.16 0.2 0.23 1 0.36 0.22 ...
##  $ V21 : num  0.6 0.45 0.39 0.43 0.68 0.61 0.68 0.23 0.34 0.86 ...
##  $ V22 : num  0.29 0.25 0.38 0.36 0.44 0.28 0.61 0.53 0.55 0.42 ...
##  $ V23 : num  0.15 0.29 0.4 0.2 0.11 0.15 0.21 0.97 0.48 0.02 ...
##  $ V24 : num  0.43 0.39 0.84 0.82 0.71 0.25 0.54 0.41 0.43 0.31 ...
##  $ V25 : num  0.39 0.29 0.28 0.51 0.46 0.62 0.43 0.15 0.21 0.85 ...
##  $ V26 : num  0.4 0.37 0.27 0.36 0.43 0.72 0.47 0.1 0.23 0.89 ...
##  $ V27 : num  0.39 0.38 0.29 0.4 0.41 0.76 0.44 0.12 0.23 0.94 ...
##  $ V28 : num  0.32 0.33 0.27 0.39 0.28 0.77 0.4 0.08 0.19 0.11 ...
##  $ V29 : num  0.27 0.16 0.07 0.16 0 0.28 0.24 0.17 0.1 0.09 ...
##  $ V30 : num  0.27 0.3 0.29 0.25 0.74 0.52 0.86 0.27 0.26 0.33 ...
##  $ V31 : num  0.36 0.22 0.28 0.36 0.51 0.48 0.24 0.18 0.29 0.17 ...
##  $ V32 : num  0.41 0.35 0.39 0.44 0.48 0.6 0.36 0.21 0.22 0.8 ...
##  $ V33 : num  0.08 0.01 0.01 0.01 0 0.01 0.01 0.03 0.04 0 ...
##  $ V34 : num  0.19 0.24 0.27 0.1 0.06 0.12 0.11 0.64 0.45 0.11 ...
##  $ V35 : num  0.1 0.14 0.27 0.09 0.25 0.13 0.29 0.96 0.52 0.04 ...
##  $ V36 : num  0.18 0.24 0.43 0.25 0.3 0.12 0.41 0.82 0.59 0.03 ...
##  $ V37 : num  0.48 0.3 0.19 0.31 0.33 0.8 0.36 0.12 0.17 1 ...
##  $ V38 : num  0.27 0.27 0.36 0.33 0.12 0.1 0.28 1 0.55 0.11 ...
##  $ V39 : num  0.68 0.73 0.58 0.71 0.65 0.65 0.54 0.26 0.43 0.44 ...
##  $ V40 : num  0.23 0.57 0.32 0.36 0.67 0.19 0.44 0.43 0.59 0.2 ...
##  $ V41 : num  0.41 0.15 0.29 0.45 0.38 0.77 0.53 0.34 0.36 1 ...
##  $ V42 : num  0.25 0.42 0.49 0.37 0.42 0.06 0.33 0.71 0.64 0.02 ...
##  $ V43 : num  0.52 0.36 0.32 0.39 0.46 0.91 0.49 0.18 0.29 0.96 ...
##  $ V44 : num  0.68 1 0.63 0.34 0.22 0.49 0.25 0.38 0.62 0.3 ...
##  $ V45 : num  0.4 0.63 0.41 0.45 0.27 0.57 0.34 0.47 0.26 0.85 ...
##  $ V46 : num  0.75 0.91 0.71 0.49 0.2 0.61 0.28 0.59 0.66 0.39 ...
##  $ V47 : num  0.75 1 0.7 0.44 0.21 0.58 0.28 0.52 0.67 0.36 ...
##  $ V48 : num  0.35 0.29 0.45 0.75 0.51 0.44 0.42 0.78 0.37 0.31 ...
##  $ V49 : num  0.55 0.43 0.42 0.65 0.91 0.62 0.77 0.45 0.51 0.65 ...
##  $ V50 : num  0.59 0.47 0.44 0.54 0.91 0.69 0.81 0.43 0.55 0.73 ...
##  $ V51 : num  0.61 0.6 0.43 0.83 0.89 0.87 0.79 0.34 0.58 0.78 ...
##  $ V52 : num  0.56 0.39 0.43 0.65 0.85 0.53 0.74 0.34 0.47 0.67 ...
##  $ V53 : num  0.74 0.46 0.71 0.85 0.4 0.3 0.57 0.29 0.65 0.72 ...
##  $ V54 : num  0.76 0.53 0.67 0.86 0.6 0.43 0.62 0.27 0.64 0.71 ...
##  $ V55 : num  0.04 0 0.01 0.03 0 0 0 0.02 0.02 0 ...
##  $ V56 : num  0.14 0.24 0.46 0.33 0.06 0.11 0.13 0.5 0.29 0.07 ...
##  $ V57 : num  0.03 0.01 0 0.02 0 0.04 0.01 0.02 0 0.01 ...
##  $ V58 : num  0.24 0.52 0.07 0.11 0.03 0.3 0 0.5 0.12 0.41 ...
##  $ V59 : num  0.27 0.62 0.06 0.2 0.07 0.35 0.02 0.59 0.09 0.44 ...
##  $ V60 : num  0.37 0.64 0.15 0.3 0.2 0.43 0.02 0.65 0.07 0.52 ...
##  $ V61 : num  0.39 0.63 0.19 0.31 0.27 0.47 0.1 0.59 0.13 0.48 ...
##  $ V62 : num  0.07 0.25 0.02 0.05 0.01 0.5 0 0.69 0 0.22 ...
##  $ V63 : num  0.07 0.27 0.02 0.08 0.02 0.5 0.01 0.72 0 0.21 ...
##  $ V64 : num  0.08 0.25 0.04 0.11 0.04 0.56 0.01 0.71 0 0.22 ...
##  $ V65 : num  0.08 0.23 0.05 0.11 0.05 0.57 0.03 0.6 0 0.19 ...
##  $ V66 : num  0.89 0.84 0.88 0.81 0.88 0.45 0.73 0.12 0.99 0.85 ...
##  $ V67 : num  0.06 0.1 0.04 0.08 0.05 0.28 0.05 0.93 0.01 0.03 ...
##  $ V68 : num  0.14 0.16 0.2 0.56 0.16 0.25 0.12 0.74 0.12 0.09 ...
##  $ V69 : num  0.13 0.1 0.2 0.62 0.19 0.19 0.13 0.75 0.12 0.06 ...
##  $ V70 : num  0.33 0.17 0.46 0.85 0.59 0.29 0.42 0.8 0.35 0.15 ...
##  $ V71 : num  0.39 0.29 0.52 0.77 0.6 0.53 0.54 0.68 0.38 0.34 ...
##  $ V72 : num  0.28 0.17 0.43 1 0.37 0.18 0.24 0.92 0.33 0.05 ...
##  $ V73 : num  0.55 0.26 0.42 0.94 0.89 0.39 0.65 0.39 0.5 0.48 ...
##  $ V74 : num  0.09 0.2 0.15 0.12 0.02 0.26 0.03 0.89 0.1 0.03 ...
##  $ V75 : num  0.51 0.82 0.51 0.01 0.19 0.73 0.46 0.66 0.64 0.58 ...
##  $ V76 : num  0.5 0 0.5 0.5 0.5 0 0.5 0 0 0 ...
##  $ V77 : num  0.21 0.02 0.01 0.01 0.01 0.02 0.01 0.01 0.04 0.02 ...
##  $ V78 : num  0.71 0.79 0.86 0.97 0.89 0.84 0.89 0.91 0.72 0.72 ...
##  $ V79 : num  0.52 0.24 0.41 0.96 0.87 0.3 0.57 0.46 0.49 0.38 ...
##  $ V80 : num  0.05 0.02 0.29 0.6 0.04 0.16 0.09 0.22 0.05 0.07 ...
##  $ V81 : num  0.26 0.25 0.3 0.47 0.55 0.28 0.49 0.37 0.49 0.47 ...
##  $ V82 : num  0.65 0.65 0.52 0.52 0.73 0.25 0.38 0.6 0.5 0.04 ...
##  $ V83 : num  0.14 0.16 0.47 0.11 0.05 0.02 0.05 0.28 0.57 0.01 ...
##  $ V84 : num  0.06 0 0.45 0.11 0.14 0.05 0.05 0.23 0.22 0 ...
##  $ V85 : num  0.22 0.21 0.18 0.24 0.31 0.94 0.37 0.15 0.07 0.63 ...
##  $ V86 : num  0.19 0.2 0.17 0.21 0.31 1 0.38 0.13 0.07 0.71 ...
##  $ V87 : num  0.18 0.21 0.16 0.19 0.3 1 0.39 0.13 0.08 0.79 ...
##  $ V88 : num  0.36 0.42 0.27 0.75 0.4 0.67 0.26 0.21 0.14 0.44 ...
##  $ V89 : num  0.35 0.38 0.29 0.7 0.36 0.63 0.35 0.24 0.17 0.42 ...
##  $ V90 : num  0.38 0.4 0.27 0.77 0.38 0.68 0.42 0.25 0.16 0.47 ...
##  $ V91 : num  0.34 0.37 0.31 0.89 0.38 0.62 0.35 0.24 0.15 0.41 ...
##  $ V92 : num  0.38 0.29 0.48 0.63 0.22 0.47 0.46 0.64 0.38 0.23 ...
##  $ V93 : num  0.46 0.32 0.39 0.51 0.51 0.59 0.44 0.59 0.13 0.27 ...
##  $ V94 : num  0.25 0.18 0.28 0.47 0.21 0.11 0.31 0.28 0.36 0.28 ...
##  $ V95 : num  0.04 0 0 0 0 0 0 0 0.01 0 ...
##  $ V96 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ V97 : num  0.12 0.21 0.14 0.19 0.11 0.7 0.15 0.59 0.01 0.22 ...
##  $ V98 : num  0.42 0.5 0.49 0.3 0.72 0.42 0.81 0.58 0.78 0.42 ...
##  $ V99 : num  0.5 0.34 0.54 0.73 0.64 0.49 0.77 0.52 0.48 0.34 ...
##  $ V100: num  0.51 0.6 0.67 0.64 0.61 0.73 0.91 0.79 0.79 0.23 ...
##  $ V101: num  0.64 0.52 0.56 0.65 0.53 0.64 0.84 0.78 0.75 0.09 ...
##  $ V102: num  0.03 NA NA NA NA NA NA NA NA NA ...
##  $ V103: num  0.13 NA NA NA NA NA NA NA NA NA ...
##  $ V104: num  0.96 NA NA NA NA NA NA NA NA NA ...
##  $ V105: num  0.17 NA NA NA NA NA NA NA NA NA ...
##  $ V106: num  0.06 NA NA NA NA NA NA NA NA NA ...
##  $ V107: num  0.18 NA NA NA NA NA NA NA NA NA ...
##  $ V108: num  0.44 NA NA NA NA NA NA NA NA NA ...
##  $ V109: num  0.13 NA NA NA NA NA NA NA NA NA ...
##  $ V110: num  0.94 NA NA NA NA NA NA NA NA NA ...
##  $ V111: num  0.93 NA NA NA NA NA NA NA NA NA ...
##  $ V112: num  0.03 NA NA NA NA NA NA NA NA NA ...
##  $ V113: num  0.07 NA NA NA NA NA NA NA NA NA ...
##  $ V114: num  0.1 NA NA NA NA NA NA NA NA NA ...
##  $ V115: num  0.07 NA NA NA NA NA NA NA NA NA ...
##  $ V116: num  0.02 NA NA NA NA NA NA NA NA NA ...
##  $ V117: num  0.57 NA NA NA NA NA NA NA NA NA ...
##  $ V118: num  0.29 NA NA NA NA NA NA NA NA NA ...
##  $ V119: num  0.12 0.02 0.01 0.02 0.04 0.01 0.05 0.01 0.04 0 ...
##  $ V120: num  0.26 0.12 0.21 0.39 0.09 0.58 0.08 0.33 0.17 0.47 ...
##  $ V121: num  0.2 0.45 0.02 0.28 0.02 0.1 0.06 0 0.04 0.11 ...
##  $ V122: num  0.06 NA NA NA NA NA NA NA NA NA ...
##  $ V123: num  0.04 NA NA NA NA NA NA NA NA NA ...
##  $ V124: num  0.9 NA NA NA NA NA NA NA NA NA ...
##  $ V125: num  0.5 NA NA NA NA NA NA NA NA NA ...
##  $ V126: num  0.32 0 0 0 0 0 0 0 0 0 ...
##  $ V127: num  0.14 NA NA NA NA NA NA NA NA NA ...
##  $ V128: num  0.2 0.67 0.43 0.12 0.03 0.14 0.03 0.55 0.53 0.15 ...

We can see that the data is imported as an object of class “data.frame”. The output of str() displays the structure of the data and shows that several variables contain missing values.
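As a quick check on the extent of missingness, a minimal sketch like the following (base R only) counts the NA values per variable and prints the variables that contain at least one:

# Count NA values per variable and keep only the variables with missing data
na_counts <- colSums(is.na(crimes))
na_counts[na_counts > 0]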

The next step is to consult the data dictionary in the UCI Machine Learning Repository to check that the variables were imported correctly. The data dictionary also indicates that some of the variables are not predictive:

  • V1 is the state number. Not predictive.
  • V2 is the county. Not predictive.
  • V3 is the community. Not predictive.
  • V4 is the community name. Not predictive.
  • V5 is the fold number for non-random 10-fold cross-validation. Not predictive.

Hence, I removed these variables from the data set.

# Remove not predictive variables
crimes <- crimes %>% 
  select(-c(V1, V2, V3, V4, V5))
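
# Confirm the non-predictive columns were removed: 1994 rows and 123 columns expected
dim(crimes)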

9.2) Imputation for missing values

As mentioned before, several variables contain NA values. I used the mice() function from the mice package to impute these values via predictive mean matching ("pmm"), with m = 3 imputed data sets and 5 iterations.

# Imputation
temp_data <- mice(crimes, m = 3, maxit = 5, meth = "pmm", seed = 1234)
## 
##  iter imp variable
##   1   1  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
##   1   2  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
##   1   3  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
##   2   1  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
##   2   2  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
##   2   3  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
##   3   1  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
##   3   2  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
##   3   3  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
##   4   1  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
##   4   2  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
##   4   3  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
##   5   1  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
##   5   2  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
##   5   3  V31  V102  V103  V104  V105  V106  V107  V108  V110  V111  V112  V113  V114  V115  V116  V117  V118  V122  V123  V124  V125  V127
## Warning: Number of logged events: 331

Next, I used the complete() function, also from the mice package, to extract the first completed data set and stored it in a new data frame called df9.

# Complete data
df9 <- complete(temp_data, 1)

# Problem with V109
summary(df9$V109)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1300  0.1800  0.2175  0.2500  1.0000    1675

For some reason mice did not impute values for V109, possibly because the variable was dropped during imputation (note the warning about logged events above), so I imputed it manually with the median.

# Manually impute for V109
imp_V109 <- median(df9$V109, na.rm = TRUE)
df9$V109[which(is.na(df9$V109))] <- imp_V109 
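
To verify that no missing values remain after imputation, a one-line check like the following can be run on df9 (it should return 0):

# Total number of NA values left in the completed data set
sum(is.na(df9))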

9.3) Train and test sets

The next step is to split the df9 data set into a training set and a test set. I used two thirds of the observations (about 67%) for the training set and the remaining third (about 33%) for the test set.

# Train and test
# Set seed and random index
set.seed(1234)
idx <- sample(x = nrow(df9), size = (2/3)*nrow(df9), replace = FALSE)

# Predictors
x_train <- df9[idx, 1:121]
x_test <- df9[-idx, 1:121]

# Response
y_train <- df9[idx, 122]
y_test <- df9[-idx, 122]
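
As a sanity check on the split, the dimensions of the resulting objects can be inspected; this is just a quick sketch using the objects created above:

# Sizes of the training and test sets
dim(x_train)
dim(x_test)
length(y_train)
length(y_test)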

9.4) Elastic net models

Finally, I fitted several elastic net models with different values of alpha and computed test MSEP for each of them.

I started by creating a list called enet_models to store the different elastic net fits. A first for loop runs alpha from 0 to 1 in steps of 0.1 and, for each value, calls cv.glmnet() to choose lambda by cross-validation; each fitted object is stored in enet_models and reused in the second for loop, where predict() is applied.

In the second for loop, predictions on the test set are made for each value of alpha and stored in enet_pred. From these predictions the test MSEP is computed and stored in enet_MSEP. Finally, on each iteration, alpha, enet_MSEP, and enet_name are collected in a temporary data frame, temp, which is then row-bound to the results data frame.

# Set seed
set.seed(1234)

# List to store elastic net models with different alphas
enet_models <- list()   

for (i in 0:10){
  # Model name
  enet_name <- paste0("alpha", i/10)
  # CV to select lambda for each value of alpha
  enet_models[[enet_name]] <- cv.glmnet(x = data.matrix(x_train), y = data.matrix(y_train), 
                                        type.measure = "mse", alpha = i/10, family = "gaussian")
}

# Data frame to store results
results <- data.frame()

for (i in 0:10){
  # Model name
  enet_name <- paste0("alpha", i/10)
  # Predictions
  enet_pred <- predict(enet_models[[enet_name]], 
                       s = enet_models[[enet_name]]$lambda.1se, newx = data.matrix(x_test))
  # Compute test MSEP
  enet_MSEP <- mean((y_test - enet_pred)^2)
  # Store alpha, MSEP, and model name in temp data frame
  temp <- data.frame(alpha = i/10, enet_MSEP = enet_MSEP, enet_name = enet_name)
  # Row bind temp to results data frame
  results <- rbind(results, temp)
}

# Show results
results
##    alpha   enet_MSEP enet_name
## 1    0.0 0.006667070    alpha0
## 2    0.1 0.004982013  alpha0.1
## 3    0.2 0.004806970  alpha0.2
## 4    0.3 0.004834450  alpha0.3
## 5    0.4 0.004899841  alpha0.4
## 6    0.5 0.004775144  alpha0.5
## 7    0.6 0.004822186  alpha0.6
## 8    0.7 0.004956150  alpha0.7
## 9    0.8 0.004862721  alpha0.8
## 10   0.9 0.004890681  alpha0.9
## 11   1.0 0.005124086    alpha1

From results, we observe that the smallest test MSEP is obtained using \(\alpha = 0.5\).
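
The same conclusion can be obtained programmatically by extracting the row of results with the minimum test MSEP (a small sketch using the results data frame created above):

# Row of results with the smallest test MSEP
results[which.min(results$enet_MSEP), ]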

9.5) Final elastic net model

To conclude, I fit the elastic net model with alpha = 0.5 and show its coefficients. We can see that some of them have been shrunk exactly to zero and others to very small values.

# Set seed
set.seed(1234)

# CV to select the value of lambda
enet_final_cv <- cv.glmnet(x = data.matrix(x_train), y = data.matrix(y_train), 
                           type.measure = "mse", alpha = 0.5, family = "gaussian")

# Plot of MSE vs. log(lambda)   
plot(enet_final_cv)    

# Fit the elastic net model
enet_final <- glmnet(x = data.matrix(x_train), y = data.matrix(y_train), 
                     alpha = 0.5, family = "gaussian", lambda = enet_final_cv$lambda.1se)
# Coefficients
coef(enet_final)
## 122 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept)  0.8141653697
## V6           .           
## V7           0.0330382474
## V8          -0.0075820692
## V9           0.0114380959
## V10         -0.0299804066
## V11          .           
## V12          0.0006534571
## V13         -0.1028892814
## V14         -0.0259960410
## V15          .           
## V16          .           
## V17         -0.0789971992
## V18         -0.0508585112
## V19         -0.1057517101
## V20         -0.0129870582
## V21          0.0791538535
## V22         -0.2928632621
## V23          0.0338010267
## V24         -0.0157624002
## V25          0.0138503690
## V26          .           
## V27         -0.0127860069
## V28          0.0565525227
## V29          0.0376076388
## V30         -0.0350711902
## V31          0.0479656215
## V32         -0.0578237561
## V33         -0.0089708109
## V34          0.1331489492
## V35          0.1116073208
## V36         -0.2478165665
## V37         -0.2377867830
## V38         -0.1080590520
## V39         -0.2784794322
## V40          0.1545775258
## V41          .           
## V42         -0.0313782439
## V43          0.1462542675
## V44         -0.1386543156
## V45         -0.1024025528
## V46          .           
## V47          .           
## V48          0.0044462099
## V49         -0.0939670935
## V50         -0.1832024355
## V51          0.2732906519
## V52          0.0122564705
## V53          0.0634836218
## V54          0.0297050520
## V55          0.1138081240
## V56          .           
## V57         -0.1817845520
## V58         -0.2187270673
## V59          0.1515294223
## V60          .           
## V61          0.1511782354
## V62          .           
## V63          .           
## V64          .           
## V65         -0.0954286018
## V66          0.0059720858
## V67         -0.0624960597
## V68          0.0750087910
## V69          .           
## V70          .           
## V71         -0.0519289917
## V72         -0.2004792813
## V73         -0.0746416412
## V74          0.1334838645
## V75          .           
## V76          0.0009247091
## V77          0.0001367667
## V78          0.1104608401
## V79          .           
## V80          0.0237334842
## V81          0.0491419879
## V82         -0.0354619936
## V83          .           
## V84         -0.0824981675
## V85          0.0086405301
## V86         -0.0885778575
## V87          .           
## V88          0.0830537148
## V89          0.0728738963
## V90          .           
## V91          .           
## V92         -0.0494731626
## V93          0.0667032995
## V94         -0.0785663872
## V95          0.0610509958
## V96          .           
## V97          0.1468123794
## V98         -0.0294297451
## V99          0.0091647185
## V100         0.0072653867
## V101        -0.0558800323
## V102        -1.5482567069
## V103         0.5173826840
## V104        -0.3085014287
## V105         0.4112105890
## V106         0.0986388772
## V107         0.0838994683
## V108        -0.0142401745
## V109         0.0313844299
## V110         .           
## V111        -0.0276188243
## V112         .           
## V113         0.0846552709
## V114         0.0084125436
## V115        -0.0668021810
## V116        -0.1817067067
## V117         .           
## V118         0.0064450323
## V119         0.0448022873
## V120         .           
## V121        -0.0091008230
## V122         0.0198524874
## V123         1.1848998489
## V124        -0.0673155901
## V125        -0.0016069745
## V126         0.0488002643
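
Since the coefficient vector is sparse, a short sketch can count how many coefficients were shrunk exactly to zero (the “.” entries above) and how many were retained:

# Count zero and nonzero coefficients, excluding the intercept
beta <- as.matrix(coef(enet_final))[-1, 1]
sum(beta == 0)   # coefficients set exactly to zero
sum(beta != 0)   # coefficients retained in the model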