# Required libraries
library(dplyr)
library(ggplot2)
library(glmnet)
library(mgcv)
library(methods)
library(mice)
This statement is false. Doing this, we would be overfitting, and the problem with this strategy is that there is no guarantee that the model with the lowest training MSE will also have the lowest test MSE.
This happens because many statistical methods, e.g. maximum likelihood, specifically estimate coefficients so as to minimize the training set MSE. For these methods, the training set MSE can be quite small, but the test MSE is often much larger.
The bias-variance trade-off is governed by the following equation:
\[E(y_0 - \hat{f}(x_0))^2 = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)\]
In general, as we increase the complexity of the model, the bias tends to initially decrease faster than the variance increases.
On the other hand, as the model gets less complex, the variance will decrease and the bias will increase.
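As a hedged illustration of these two points (a minimal simulation with made-up data, not part of the assignment), one can fit polynomials of increasing degree to noisy data and compare training and test MSE:
# Toy simulation: training MSE keeps decreasing with flexibility,
# while the test MSE eventually starts to increase again
set.seed(1)
x_toy <- runif(100, -2, 2)
y_toy <- sin(x_toy) + rnorm(100, sd = 0.3)
x_toy_new <- runif(100, -2, 2)
y_toy_new <- sin(x_toy_new) + rnorm(100, sd = 0.3)
for (d in c(1, 3, 10, 20)) {
  fit_toy <- lm(y_toy ~ poly(x_toy, d))
  train_mse <- mean((y_toy - fitted(fit_toy))^2)
  test_mse <- mean((y_toy_new - predict(fit_toy, newdata = data.frame(x_toy = x_toy_new)))^2)
  cat("degree:", d, "train MSE:", round(train_mse, 3), "test MSE:", round(test_mse, 3), "\n")
}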
LASSO can perform variable selection, while Ridge cannot.
For LASSO regression we have:
\[\sum_{i = 1}^{n} \big(y_{i} - \beta_{0} - \sum_{j = 1}^{p} \beta_{j} x_{ij} \big)^2 + \lambda \sum_{j = 1}^{p} |\beta_{j}| = RSS + \lambda \sum_{j = 1}^{p} |\beta_{j}|\]
For Ridge regression we have:
\[\sum_{i = 1}^{n} \big(y_{i} - \beta_{0} - \sum_{j = 1}^{p} \beta_{j} x_{ij} \big)^2 + \lambda \sum_{j = 1}^{p} \beta_{j}^{2} = RSS + \lambda \sum_{j = 1}^{p} \beta_{j}^{2}\]
Both LASSO and Ridge have similar formulations. The only difference is that the \(\beta_{j}^{2}\) term in the Ridge regression penalty is replaced by \(|\beta_{j}|\) in the LASSO penalty. LASSO uses an \(\ell_{1}\) penalty instead of an \(\ell_{2}\) penalty.
For the LASSO, the \(\ell_{1}\) penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter \(\lambda\) is sufficiently large.
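A minimal sketch of this behaviour on simulated data (made-up x and y, not the assignment data; glmnet is already loaded above):
# Simulated example: the l1 penalty (alpha = 1) can set coefficients exactly to
# zero, while the l2 penalty (alpha = 0) only shrinks them towards zero
set.seed(1)
x_sim <- matrix(rnorm(100 * 10), ncol = 10)
y_sim <- x_sim[, 1] - 2 * x_sim[, 2] + rnorm(100)
ridge_fit <- glmnet(x_sim, y_sim, alpha = 0, lambda = 0.5)
lasso_fit <- glmnet(x_sim, y_sim, alpha = 1, lambda = 0.5)
sum(coef(ridge_fit)[-1] == 0)  # typically 0: ridge coefficients are small but nonzero
sum(coef(lasso_fit)[-1] == 0)  # typically > 0: some coefficients are exactly zero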
To solve this problem we can use principal components regression (PCR). This approach involves constructing the first M principal components, \(Z_{1}, \ldots , Z_{M}\), and then using these components as the predictors in a linear regression model that is fit using least squares. The key idea is that often a small number of principal components suffice to explain most of the variability in the data, as well as the relationship with the response.
Other methods discussed in the semester, like Linear Discriminant Analysis or k-nearest neighbors, would not work in this setting because they are classification algorithms, and in this problem we are interested in regression.
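To make the PCR idea above concrete, here is a minimal base-R sketch (simulated X and y, not the assignment data) that builds the components with prcomp() and then fits them by least squares:
# Principal components regression by hand: compute the scores, keep the first M,
# and regress y on them with ordinary least squares
set.seed(1)
X_pcr <- matrix(rnorm(100 * 20), ncol = 20)   # p = 20 simulated predictors
y_pcr <- X_pcr[, 1] + 0.5 * X_pcr[, 2] + rnorm(100)
M <- 3                                        # number of principal components to keep
pc <- prcomp(X_pcr, center = TRUE, scale. = TRUE)
Z <- pc$x[, 1:M]                              # component scores Z_1, ..., Z_M
pcr_fit <- lm(y_pcr ~ Z)                      # least squares on the M components
coef(pcr_fit)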
The first step is to import the data.
# Data import
# https://archive.ics.uci.edu/ml/datasets/Breast+Cancer
breastcancer <- read.csv("https://bit.ly/2qg37Hv", header = FALSE)
names(breastcancer) <- c("class", "age", "menopause",
"tumor_size", "inv_nodes",
"node_caps", "deg_malig",
"breast", "breast_quad",
"irradiat")
class(breastcancer)
## [1] "data.frame"
str(breastcancer)
## 'data.frame': 286 obs. of 10 variables:
## $ class : Factor w/ 2 levels "no-recurrence-events",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ age : Factor w/ 6 levels "20-29","30-39",..: 2 3 3 5 3 5 4 5 3 3 ...
## $ menopause : Factor w/ 3 levels "ge40","lt40",..: 3 3 3 1 3 1 3 1 3 3 ...
## $ tumor_size : Factor w/ 11 levels "0-4","10-14",..: 6 4 4 3 1 3 5 4 11 4 ...
## $ inv_nodes : Factor w/ 7 levels "0-2","12-14",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ node_caps : Factor w/ 3 levels "?","no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ deg_malig : int 3 2 2 2 2 2 2 1 2 2 ...
## $ breast : Factor w/ 2 levels "left","right": 1 2 1 2 2 1 1 1 1 2 ...
## $ breast_quad: Factor w/ 6 levels "?","central",..: 3 6 3 4 5 3 3 3 3 4 ...
## $ irradiat : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
summary(breastcancer)
## class age menopause tumor_size
## no-recurrence-events:201 20-29: 1 ge40 :129 30-34 :60
## recurrence-events : 85 30-39:36 lt40 : 7 25-29 :54
## 40-49:90 premeno:150 20-24 :50
## 50-59:96 15-19 :30
## 60-69:57 10-14 :28
## 70-79: 6 40-44 :22
## (Other):42
## inv_nodes node_caps deg_malig breast breast_quad
## 0-2 :213 ? : 8 Min. :1.000 left :152 ? : 1
## 12-14: 3 no :222 1st Qu.:2.000 right:134 central : 21
## 15-17: 6 yes: 56 Median :2.000 left_low :110
## 24-26: 1 Mean :2.049 left_up : 97
## 3-5 : 36 3rd Qu.:3.000 right_low: 24
## 6-8 : 17 Max. :3.000 right_up : 33
## 9-11 : 10
## irradiat
## no :218
## yes: 68
##
##
##
##
##
We can see that the data is imported as an object of class “data.frame”. The str() function displays the structure of the data, and here we can see that the variables have been imported as factors, which is correct according to the data dictionary in the UCI Machine Learning Repository. The only exception is the variable deg_malig, which is imported as numeric when it should be a factor. Furthermore, missing values have been imported as “?” instead of NA.
Next, I am going to replace all the “?” with NA and redefine the factor levels and variables where necessary.
# Missing values: Replace "?" for NA
breastcancer <- replace(breastcancer, breastcancer == "?", NA)
# node_factor, breast_quad and deg_malig
breastcancer <- breastcancer %>%
mutate(node_caps = factor(node_caps, levels = c("no", "yes"))) %>%
mutate(breast_quad = factor(breast_quad, levels = c("central", "left_low", "left_up", "right_low", "right_up"))) %>%
mutate(deg_malig = as.factor(deg_malig))
str(breastcancer)
## 'data.frame': 286 obs. of 10 variables:
## $ class : Factor w/ 2 levels "no-recurrence-events",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ age : Factor w/ 6 levels "20-29","30-39",..: 2 3 3 5 3 5 4 5 3 3 ...
## $ menopause : Factor w/ 3 levels "ge40","lt40",..: 3 3 3 1 3 1 3 1 3 3 ...
## $ tumor_size : Factor w/ 11 levels "0-4","10-14",..: 6 4 4 3 1 3 5 4 11 4 ...
## $ inv_nodes : Factor w/ 7 levels "0-2","12-14",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ node_caps : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ deg_malig : Factor w/ 3 levels "1","2","3": 3 2 2 2 2 2 2 1 2 2 ...
## $ breast : Factor w/ 2 levels "left","right": 1 2 1 2 2 1 1 1 1 2 ...
## $ breast_quad: Factor w/ 5 levels "central","left_low",..: 2 5 2 3 4 2 2 2 2 3 ...
## $ irradiat : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
summary(breastcancer)
## class age menopause tumor_size
## no-recurrence-events:201 20-29: 1 ge40 :129 30-34 :60
## recurrence-events : 85 30-39:36 lt40 : 7 25-29 :54
## 40-49:90 premeno:150 20-24 :50
## 50-59:96 15-19 :30
## 60-69:57 10-14 :28
## 70-79: 6 40-44 :22
## (Other):42
## inv_nodes node_caps deg_malig breast breast_quad irradiat
## 0-2 :213 no :222 1: 71 left :152 central : 21 no :218
## 12-14: 3 yes : 56 2:130 right:134 left_low :110 yes: 68
## 15-17: 6 NA's: 8 3: 85 left_up : 97
## 24-26: 1 right_low: 24
## 3-5 : 36 right_up : 33
## 6-8 : 17 NA's : 1
## 9-11 : 10
From the last calls to str() and summary() we can see that there are no longer “?” values in the data and that the factor levels for the variables node_caps and breast_quad have been redefined. Also, the variable deg_malig is now a factor.
We can also see that the ratio between “no” and “yes” for the variable node_caps is almost 4:1, meaning that for every “yes” there are approximately four “no”. Taking this into account, and since there are only eight NA’s in this variable, I will impute the level “no” to the missing values here.
Similarly, for the variable breast_quad, I imputed the most frequent level (“left_low”) to the only missing value.
# Imputation
breastcancer$node_caps[which(is.na(breastcancer$node_caps))] <- "no"
breastcancer$breast_quad[which(is.na(breastcancer$breast_quad))] <- "left_low"
summary(breastcancer)
## class age menopause tumor_size
## no-recurrence-events:201 20-29: 1 ge40 :129 30-34 :60
## recurrence-events : 85 30-39:36 lt40 : 7 25-29 :54
## 40-49:90 premeno:150 20-24 :50
## 50-59:96 15-19 :30
## 60-69:57 10-14 :28
## 70-79: 6 40-44 :22
## (Other):42
## inv_nodes node_caps deg_malig breast breast_quad irradiat
## 0-2 :213 no :230 1: 71 left :152 central : 21 no :218
## 12-14: 3 yes: 56 2:130 right:134 left_low :111 yes: 68
## 15-17: 6 3: 85 left_up : 97
## 24-26: 1 right_low: 24
## 3-5 : 36 right_up : 33
## 6-8 : 17
## 9-11 : 10
Finally, our data set breastcancer has no missing values.
In this section I made a contingency table for class and every other available predictor. I am checking that there is at least one observation in every cell of each table; if that is not the case, the predict() function will return an error later.
# Contingency tables
xtabs(~ class + age, data = breastcancer) # Zero for age 20-29
## age
## class 20-29 30-39 40-49 50-59 60-69 70-79
## no-recurrence-events 1 21 63 71 40 5
## recurrence-events 0 15 27 25 17 1
xtabs(~ class + menopause, data = breastcancer)
## menopause
## class ge40 lt40 premeno
## no-recurrence-events 94 5 102
## recurrence-events 35 2 48
xtabs(~ class + tumor_size, data = breastcancer) # Zero for tumor size 5-9
## tumor_size
## class 0-4 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49
## no-recurrence-events 7 27 23 34 36 35 12 16 2
## recurrence-events 1 1 7 16 18 25 7 6 1
## tumor_size
## class 5-9 50-54
## no-recurrence-events 4 5
## recurrence-events 0 3
xtabs(~ class + inv_nodes, data = breastcancer) # Zero for inv_node 24-26
## inv_nodes
## class 0-2 12-14 15-17 24-26 3-5 6-8 9-11
## no-recurrence-events 167 1 3 0 19 7 4
## recurrence-events 46 2 3 1 17 10 6
xtabs(~ class + node_caps, data = breastcancer)
## node_caps
## class no yes
## no-recurrence-events 176 25
## recurrence-events 54 31
xtabs(~ class + deg_malig, data = breastcancer)
## deg_malig
## class 1 2 3
## no-recurrence-events 59 102 40
## recurrence-events 12 28 45
xtabs(~ class + breast, data = breastcancer)
## breast
## class left right
## no-recurrence-events 103 98
## recurrence-events 49 36
xtabs(~ class + breast_quad, data = breastcancer)
## breast_quad
## class central left_low left_up right_low right_up
## no-recurrence-events 17 75 71 18 20
## recurrence-events 4 36 26 6 13
xtabs(~ class + irradiat, data = breastcancer)
## irradiat
## class no yes
## no-recurrence-events 164 37
## recurrence-events 54 31
We can see that for the variables age, tumor_size, and inv_nodes there are zeros in at least one cell of the contingency table. For those cases, I am going to reassign the affected observations to a merged level and redefine the levels of the corresponding factor.
# Redefine age
breastcancer <- breastcancer %>%
mutate(age = factor(age, levels = c("20-29", "30-39", "20-39", "40-49", "50-59", "60-69", "70-79")))
idx_age1 <- which(breastcancer$age == "20-29")
breastcancer$age[idx_age1] <- "20-39"
idx_age2 <- which(breastcancer$age == "30-39")
breastcancer$age[idx_age2] <- "20-39"
breastcancer <- breastcancer %>%
mutate(age = factor(age, levels = c("20-39", "40-49", "50-59", "60-69", "70-79")))
# Redefine tumor_size
breastcancer <- breastcancer %>%
mutate(tumor_size = factor(tumor_size, levels = c("0-9", "0-4", "10-14","15-19", "20-24", "25-29",
"30-34", "35-39", "40-44", "45-49", "5-9", "50-54")))
idx_tumor_size1 <- which(breastcancer$tumor_size == "0-4")
breastcancer$tumor_size[idx_tumor_size1] <- "0-9"
idx_tumor_size2 <- which(breastcancer$tumor_size == "5-9")
breastcancer$tumor_size[idx_tumor_size2] <- "0-9"
breastcancer <- breastcancer %>%
mutate(tumor_size = factor(tumor_size, levels = c("0-9", "10-14", "15-19", "20-24", "25-29",
"30-34", "35-39", "40-44", "45-49", "50-54")))
# Redefine inv_nodes
breastcancer <- breastcancer %>%
mutate(inv_nodes = factor(inv_nodes, levels = c("0-2", "3-5", "6-8", "9-11", "12-14", "15-17", "24-26", "15-26")))
idx_inv_nodes1 <- which(breastcancer$inv_nodes == "15-17")
breastcancer$inv_nodes[idx_inv_nodes1] <- "15-26"
idx_inv_nodes2 <- which(breastcancer$inv_nodes == "24-26")
breastcancer$inv_nodes[idx_inv_nodes2] <- "15-26"
breastcancer <- breastcancer %>%
mutate(inv_nodes = factor(inv_nodes, levels = c("0-2", "3-5", "6-8", "9-11", "12-14", "15-26")))
Below are the contingency tables for the modified factors, where we can see the changes in levels:
# Contingency tables after changes
xtabs(~ class + age, data = breastcancer)
## age
## class 20-39 40-49 50-59 60-69 70-79
## no-recurrence-events 22 63 71 40 5
## recurrence-events 15 27 25 17 1
xtabs(~ class + tumor_size, data = breastcancer)
## tumor_size
## class 0-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49
## no-recurrence-events 11 27 23 34 36 35 12 16 2
## recurrence-events 1 1 7 16 18 25 7 6 1
## tumor_size
## class 50-54
## no-recurrence-events 5
## recurrence-events 3
xtabs(~ class + inv_nodes, data = breastcancer)
## inv_nodes
## class 0-2 3-5 6-8 9-11 12-14 15-26
## no-recurrence-events 167 19 7 4 1 3
## recurrence-events 46 17 10 6 2 4
The next step is to split the breastcancer data set into a train and a test set. To do so, I used two thirds of the observations for the training set and the remaining third for the test set.
# Set seed and random index
set.seed(1234)
idx <- sample(x = nrow(breastcancer), size = (2/3)*nrow(breastcancer), replace = FALSE)
# Predictors
x_train <- breastcancer[idx, 2:10]
x_test <- breastcancer[-idx, 2:10]
# Response
y_train <- breastcancer[idx, 1]
y_test <- breastcancer[-idx, 1]
Later I will use the log loss function, defined as follows:
\[\text{log loss} = -\frac{1}{N} \sum_{i = 1}^{N} \big(y_{i} \log(p_{i}) + (1 - y_{i}) \log(1 - p_{i})\big)\]
For this function, the values of the factor \(y_{i}\) in the test set are required to be numeric zeros or ones. But so far these values are strings, as shown below:
head(y_test)
## [1] no-recurrence-events no-recurrence-events no-recurrence-events
## [4] no-recurrence-events no-recurrence-events no-recurrence-events
## Levels: no-recurrence-events recurrence-events
To solve this problem and compute the log loss function, I will convert “no-recurrence-events” to zero and “recurrence-events” to one with the following code:
y_test <- ifelse(test = y_test == "no-recurrence-events", yes = 0, no = 1)
head(y_test)
## [1] 0 0 0 0 0 0
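As a side note, the log loss formula above could be wrapped in a small helper function (the name log_loss_fn is my own and is not used in the rest of this report); here it is evaluated on a naive baseline that predicts the overall event rate for every observation:
# Helper: average log loss for 0/1 outcomes y and predicted probabilities p
log_loss_fn <- function(y, p) {
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
# Baseline sanity check: predict the overall event rate of the test set for everyone
log_loss_fn(y_test, rep(mean(y_test), length(y_test)))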
To select the variables for the model, I used LASSO with the function cv.glmnet() from the package glmnet. It is important to note that the arguments x and y should be obtained using the function data.matrix(), otherwise cv.glmnet() returns an error. Also, to fit the logistic regression model, we have to specify type.measure = "deviance" and family = "binomial".
To fit the model p5_lasso I am using lambda.1se because, in a statistical sense, it is indistinguishable from lambda.min but results in a model with fewer parameters.
# CV to select the value of lambda
p5_lasso_cv <- cv.glmnet(x = data.matrix(x_train), y = data.matrix(y_train),
type.measure = "deviance", alpha = 1, family = "binomial")
# Fit the LASSO model
p5_lasso <- glmnet(x = data.matrix(x_train), y = data.matrix(y_train),
alpha = 1, family = "binomial", lambda = p5_lasso_cv$lambda.1se)
# Coefficients of LASSO model
coef(p5_lasso)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -1.82285554
## age .
## menopause .
## tumor_size .
## inv_nodes 0.07407418
## node_caps 0.16779584
## deg_malig 0.35353276
## breast .
## breast_quad .
## irradiat .
The only variables whose coefficients were not set equal to zero are inv_nodes, node_caps, and deg_malig. Following are the predictions on the test set using the resulting model and the corresponding test log loss.
# Make predictions
p5_lasso_pred <- predict(p5_lasso, newx = data.matrix(x_test), type = "response")
# Compute log loss
log_loss <- (-1/nrow(x_test))*(sum(y_test * log(p5_lasso_pred) + (1 - y_test) * log(1 - p5_lasso_pred)))
log_loss
## [1] 0.5511863
When I was trying to make the predictions in section 5.5), I first used predict(p5_lasso, newx = x_test, type = "response"), but I got an error that said: “not-yet-implemented method for <data.frame> %*% …”. The fix was to use data.matrix() to provide the matrices to glmnet. However, by the time I found out how to fix the problem, I had already done this section.
In this section I fitted the model with the variables selected by LASSO, namely class ~ node_caps + deg_malig. Then I made predictions using the predict() function, and finally I computed the log loss.
# Need to redefine x_train
x_train <- breastcancer[idx, ]
x_test <- breastcancer[-idx, 2:10]
# Fit logistic regression model
p5 <- glm(class ~ node_caps + deg_malig, data = x_train, family = "binomial")
summary(p5)
##
## Call:
## glm(formula = class ~ node_caps + deg_malig, family = "binomial",
## data = x_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6199 -0.6171 -0.6039 0.7921 1.8930
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.60944 0.38730 -4.156 3.25e-05 ***
## node_capsyes 1.08144 0.41611 2.599 0.00935 **
## deg_malig2 0.04742 0.49084 0.097 0.92303
## deg_malig3 1.52633 0.49150 3.105 0.00190 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 238.51 on 189 degrees of freedom
## Residual deviance: 203.51 on 186 degrees of freedom
## AIC: 211.51
##
## Number of Fisher Scoring iterations: 4
# Predictions
p5_predictions <- predict(p5, newdata = x_test, type = "response")
head(p5_predictions) # Predictions are probabilities
## 1 2 5 14 16 17
## 0.4792361 0.1733579 0.1733579 0.4792361 0.1666667 0.1733579
# Log loss
log_loss <- (-1/nrow(x_test))*(sum(y_test * log(p5_predictions) + (1 - y_test) * log(1 - p5_predictions)))
log_loss
## [1] 0.5418395
To get a more accurate estimate of the log loss for the model class ~ node_caps + deg_malig, I used 10-fold cross validation on the complete breastcancer data set.
# CV to get a better estimate of log_loss
set.seed(1234)
df <- breastcancer
df$fold <- sample(1:10, 286, replace = TRUE)
cv_log_loss <- list()
for (i in 1:10){
temp <- glm(class ~ node_caps + deg_malig,
data = df[!(df$fold == i), ], family = "binomial")
# Predicted values
y_hat <- predict(temp, newdata = df[(df$fold == i), ], type = "response")
# Actual Values
y_actual <- as.numeric(df$class[(df$fold == i)]) - 1
# Log loss
cv_log_loss[[i]] <- (-1/length(y_actual))*(sum(y_actual * log(y_hat) + (1 - y_actual) * log(1 - y_hat)))
}
# Values of log loss for different folds
unlist(cv_log_loss)
## [1] 0.4562944 0.3675229 0.5908199 0.4733092 0.7869962 0.6372770 0.4329519
## [8] 0.7805364 0.4631691 0.5191718
# Mean value of log loss
mean(unlist(cv_log_loss))
## [1] 0.5508049
Here I mention some problems that I had when importing the data and how I fixed them.
# Data import (incorrect)
# https://archive.ics.uci.edu/ml/datasets/Automobile
cars <- read.csv("https://bit.ly/2AG1IlJ", header = FALSE)
names(cars) <- c("symboling", "normalized_losses", "make",
"fuel_type", "aspiration", "num_of_doors",
"body_style", "drive_wheels", "engine_location",
"wheel_base", "length", "width",
"height", "curb_weight", "engine_type",
"num_of_cylinders", "engine_size", "fuel_system",
"bore", "stroke", "compression_ratio",
"horsepower", "peak_rpm", "city_mpg",
"highway_mpg", "price")
str(cars)
## 'data.frame': 205 obs. of 26 variables:
## $ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
## $ normalized_losses: Factor w/ 52 levels "?","101","102",..: 1 1 1 29 29 1 27 1 27 1 ...
## $ make : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
## $ fuel_type : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
## $ aspiration : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
## $ num_of_doors : Factor w/ 3 levels "?","four","two": 3 3 3 2 2 3 2 2 2 3 ...
## $ body_style : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
## $ drive_wheels : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
## $ engine_location : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
## $ wheel_base : num 88.6 88.6 94.5 99.8 99.4 ...
## $ length : num 169 169 171 177 177 ...
## $ width : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
## $ height : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
## $ curb_weight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
## $ engine_type : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
## $ num_of_cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
## $ engine_size : int 130 130 152 109 136 136 136 136 131 131 ...
## $ fuel_system : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ bore : Factor w/ 39 levels "?","2.54","2.68",..: 25 25 3 15 15 15 15 15 12 12 ...
## $ stroke : Factor w/ 37 levels "?","2.07","2.19",..: 6 6 29 26 26 26 26 26 26 26 ...
## $ compression_ratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
## $ horsepower : Factor w/ 60 levels "?","100","101",..: 7 7 22 4 10 6 6 6 17 25 ...
## $ peak_rpm : Factor w/ 24 levels "?","4150","4200",..: 12 12 12 18 18 18 18 18 18 18 ...
## $ city_mpg : int 21 21 19 24 18 19 19 19 17 16 ...
## $ highway_mpg : int 27 27 26 30 22 25 25 25 20 22 ...
## $ price : Factor w/ 187 levels "?","10198","10245",..: 33 52 52 38 63 43 65 73 83 1 ...
The output of str() shows that there are some “?” in the data that should be missing values instead. So my first attempt was to replace them with the following code:
# Replace ? for NA
cars <- replace(cars, cars == "?", NA)
# Structure of the data
str(cars)
## 'data.frame': 205 obs. of 26 variables:
## $ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
## $ normalized_losses: Factor w/ 52 levels "?","101","102",..: NA NA NA 29 29 NA 27 NA 27 NA ...
## $ make : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
## $ fuel_type : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
## $ aspiration : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
## $ num_of_doors : Factor w/ 3 levels "?","four","two": 3 3 3 2 2 3 2 2 2 3 ...
## $ body_style : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
## $ drive_wheels : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
## $ engine_location : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
## $ wheel_base : num 88.6 88.6 94.5 99.8 99.4 ...
## $ length : num 169 169 171 177 177 ...
## $ width : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
## $ height : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
## $ curb_weight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
## $ engine_type : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
## $ num_of_cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
## $ engine_size : int 130 130 152 109 136 136 136 136 131 131 ...
## $ fuel_system : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ bore : Factor w/ 39 levels "?","2.54","2.68",..: 25 25 3 15 15 15 15 15 12 12 ...
## $ stroke : Factor w/ 37 levels "?","2.07","2.19",..: 6 6 29 26 26 26 26 26 26 26 ...
## $ compression_ratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
## $ horsepower : Factor w/ 60 levels "?","100","101",..: 7 7 22 4 10 6 6 6 17 25 ...
## $ peak_rpm : Factor w/ 24 levels "?","4150","4200",..: 12 12 12 18 18 18 18 18 18 18 ...
## $ city_mpg : int 21 21 19 24 18 19 19 19 17 16 ...
## $ highway_mpg : int 27 27 26 30 22 25 25 25 20 22 ...
## $ price : Factor w/ 187 levels "?","10198","10245",..: 33 52 52 38 63 43 65 73 83 NA ...
# Change normalized losses from factor to numeric
cars$normalized_losses <- as.numeric(cars$normalized_losses)
summary(cars$normalized_losses)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 15.00 26.00 26.57 39.00 52.00 41
This approach replaces the “?” with NA, but has a problem with the variable normalized_losses. This variable is imported as a factor, but the data dictionary from the UCI Machine Learning Repository tells us that it is a numeric variable whose range goes from 65 to 256.
When I convert it to numeric with the as.numeric() function, the values are replaced by the internal factor level codes and end up outside that range. This is shown in summary(cars$normalized_losses), where the new (incorrect) range is from 2 to 52.
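As a small illustration of why this happens (a toy factor, not the cars data): as.numeric() on a factor returns the internal level codes, while converting through as.character() first would recover the original numbers.
# Toy example of the factor-to-numeric pitfall
f <- factor(c("65", "103", "256"))
as.numeric(f)                # 3 1 2 -> the level codes, not the data
as.numeric(as.character(f))  # 65 103 256 -> the actual values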
In this section I implement a solution to the problem described above: it can be solved by specifying the argument na.strings = "?" inside the read.csv() function.
# Data import (correct)
# https://archive.ics.uci.edu/ml/datasets/Automobile
cars <- read.csv("https://bit.ly/2AG1IlJ", header = FALSE, na.strings = "?")
names(cars) <- c("symboling", "normalized_losses", "make",
"fuel_type", "aspiration", "num_of_doors",
"body_style", "drive_wheels", "engine_location",
"wheel_base", "length", "width",
"height", "curb_weight", "engine_type",
"num_of_cylinders", "engine_size", "fuel_system",
"bore", "stroke", "compression_ratio",
"horsepower", "peak_rpm", "city_mpg",
"highway_mpg", "price")
# Structure of the data
str(cars)
## 'data.frame': 205 obs. of 26 variables:
## $ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
## $ normalized_losses: int NA NA NA 164 164 NA 158 NA 158 NA ...
## $ make : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
## $ fuel_type : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
## $ aspiration : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
## $ num_of_doors : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
## $ body_style : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
## $ drive_wheels : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
## $ engine_location : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
## $ wheel_base : num 88.6 88.6 94.5 99.8 99.4 ...
## $ length : num 169 169 171 177 177 ...
## $ width : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
## $ height : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
## $ curb_weight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
## $ engine_type : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
## $ num_of_cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
## $ engine_size : int 130 130 152 109 136 136 136 136 131 131 ...
## $ fuel_system : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ bore : num 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
## $ stroke : num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
## $ compression_ratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
## $ horsepower : int 111 111 154 102 115 110 110 110 140 160 ...
## $ peak_rpm : int 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
## $ city_mpg : int 21 21 19 24 18 19 19 19 17 16 ...
## $ highway_mpg : int 27 27 26 30 22 25 25 25 20 22 ...
## $ price : int 13495 16500 16500 13950 17450 15250 17710 18920 23875 NA ...
As we can see from str(), we cannot work with the data as it is. The variable symboling was imported as an integer, while the data dictionary tells us that it should be a factor.
# Redefine variables type
cars <- cars %>%
mutate(symboling = as.factor(symboling))
# Structure of the data
str(cars)
## 'data.frame': 205 obs. of 26 variables:
## $ symboling : Factor w/ 6 levels "-2","-1","0",..: 6 6 4 5 5 5 4 4 4 3 ...
## $ normalized_losses: int NA NA NA 164 164 NA 158 NA 158 NA ...
## $ make : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
## $ fuel_type : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
## $ aspiration : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
## $ num_of_doors : Factor w/ 2 levels "four","two": 2 2 2 1 1 2 1 1 1 2 ...
## $ body_style : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
## $ drive_wheels : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
## $ engine_location : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
## $ wheel_base : num 88.6 88.6 94.5 99.8 99.4 ...
## $ length : num 169 169 171 177 177 ...
## $ width : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
## $ height : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
## $ curb_weight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
## $ engine_type : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
## $ num_of_cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
## $ engine_size : int 130 130 152 109 136 136 136 136 131 131 ...
## $ fuel_system : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ bore : num 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
## $ stroke : num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
## $ compression_ratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
## $ horsepower : int 111 111 154 102 115 110 110 110 140 160 ...
## $ peak_rpm : int 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
## $ city_mpg : int 21 21 19 24 18 19 19 19 17 16 ...
## $ highway_mpg : int 27 27 26 30 22 25 25 25 20 22 ...
## $ price : int 13495 16500 16500 13950 17450 15250 17710 18920 23875 NA ...
Now that the variables are correctly defined, I called summary() to check which variables still have missing values.
# To check that we still have missing values
summary(cars)
## symboling normalized_losses make fuel_type aspiration
## -2: 3 Min. : 65 toyota : 32 diesel: 20 std :168
## -1:22 1st Qu.: 94 nissan : 18 gas :185 turbo: 37
## 0 :67 Median :115 mazda : 17
## 1 :54 Mean :122 honda : 13
## 2 :32 3rd Qu.:150 mitsubishi: 13
## 3 :27 Max. :256 subaru : 12
## NA's :41 (Other) :100
## num_of_doors body_style drive_wheels engine_location
## four:114 convertible: 6 4wd: 9 front:202
## two : 89 hardtop : 8 fwd:120 rear : 3
## NA's: 2 hatchback :70 rwd: 76
## sedan :96
## wagon :25
##
##
## wheel_base length width height
## Min. : 86.60 Min. :141.1 Min. :60.30 Min. :47.80
## 1st Qu.: 94.50 1st Qu.:166.3 1st Qu.:64.10 1st Qu.:52.00
## Median : 97.00 Median :173.2 Median :65.50 Median :54.10
## Mean : 98.76 Mean :174.0 Mean :65.91 Mean :53.72
## 3rd Qu.:102.40 3rd Qu.:183.1 3rd Qu.:66.90 3rd Qu.:55.50
## Max. :120.90 Max. :208.1 Max. :72.30 Max. :59.80
##
## curb_weight engine_type num_of_cylinders engine_size fuel_system
## Min. :1488 dohc : 12 eight : 5 Min. : 61.0 mpfi :94
## 1st Qu.:2145 dohcv: 1 five : 11 1st Qu.: 97.0 2bbl :66
## Median :2414 l : 12 four :159 Median :120.0 idi :20
## Mean :2556 ohc :148 six : 24 Mean :126.9 1bbl :11
## 3rd Qu.:2935 ohcf : 15 three : 1 3rd Qu.:141.0 spdi : 9
## Max. :4066 ohcv : 13 twelve: 1 Max. :326.0 4bbl : 3
## rotor: 4 two : 4 (Other): 2
## bore stroke compression_ratio horsepower
## Min. :2.54 Min. :2.070 Min. : 7.00 Min. : 48.0
## 1st Qu.:3.15 1st Qu.:3.110 1st Qu.: 8.60 1st Qu.: 70.0
## Median :3.31 Median :3.290 Median : 9.00 Median : 95.0
## Mean :3.33 Mean :3.255 Mean :10.14 Mean :104.3
## 3rd Qu.:3.59 3rd Qu.:3.410 3rd Qu.: 9.40 3rd Qu.:116.0
## Max. :3.94 Max. :4.170 Max. :23.00 Max. :288.0
## NA's :4 NA's :4 NA's :2
## peak_rpm city_mpg highway_mpg price
## Min. :4150 Min. :13.00 Min. :16.00 Min. : 5118
## 1st Qu.:4800 1st Qu.:19.00 1st Qu.:25.00 1st Qu.: 7775
## Median :5200 Median :24.00 Median :30.00 Median :10295
## Mean :5125 Mean :25.22 Mean :30.75 Mean :13207
## 3rd Qu.:5500 3rd Qu.:30.00 3rd Qu.:34.00 3rd Qu.:16500
## Max. :6600 Max. :49.00 Max. :54.00 Max. :45400
## NA's :2 NA's :4
There are several variables with NA values. I will impute values using the function mice() from the mice package.
# Imputation with mice
temp_data6 <- mice(cars, m = 5, maxit = 5, meth = "pmm", seed = 1234)
##
## iter imp variable
## 1 1 normalized_losses* num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 1 2 normalized_losses* num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 1 3 normalized_losses* num_of_doors* bore stroke* horsepower* peak_rpm* price*
## 1 4 normalized_losses* num_of_doors* bore* stroke horsepower* peak_rpm* price*
## 1 5 normalized_losses* num_of_doors* bore* stroke horsepower* peak_rpm* price*
## 2 1 normalized_losses* num_of_doors* bore* stroke horsepower* peak_rpm* price*
## 2 2 normalized_losses* num_of_doors* bore stroke* horsepower* peak_rpm* price*
## 2 3 normalized_losses* num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 2 4 normalized_losses* num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 2 5 normalized_losses* num_of_doors* bore* stroke* horsepower peak_rpm* price*
## 3 1 normalized_losses* num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 3 2 normalized_losses* num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 3 3 normalized_losses* num_of_doors* bore stroke* horsepower* peak_rpm* price*
## 3 4 normalized_losses* num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 3 5 normalized_losses* num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 4 1 normalized_losses* num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 4 2 normalized_losses* num_of_doors* bore* stroke horsepower* peak_rpm* price*
## 4 3 normalized_losses* num_of_doors* bore* stroke horsepower* peak_rpm* price*
## 4 4 normalized_losses* num_of_doors* bore stroke* horsepower* peak_rpm* price*
## 4 5 normalized_losses* num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 5 1 normalized_losses* num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 5 2 normalized_losses* num_of_doors* bore stroke* horsepower* peak_rpm* price*
## 5 3 normalized_losses* num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 5 4 normalized_losses* num_of_doors* bore* stroke horsepower* peak_rpm* price*
## 5 5 normalized_losses* num_of_doors* bore* stroke horsepower* peak_rpm* price*
## * Please inspect the loggedEvents
## Warning: Number of logged events: 337
cars <- complete(temp_data6, 1)
# To check that we don't have missing values anymore
summary(cars)
## symboling normalized_losses make fuel_type aspiration
## -2: 3 Min. : 65.0 toyota : 32 diesel: 20 std :168
## -1:22 1st Qu.: 81.0 nissan : 18 gas :185 turbo: 37
## 0 :67 Median :103.0 mazda : 17
## 1 :54 Mean :112.4 honda : 13
## 2 :32 3rd Qu.:137.0 mitsubishi: 13
## 3 :27 Max. :256.0 subaru : 12
## (Other) :100
## num_of_doors body_style drive_wheels engine_location
## four:114 convertible: 6 4wd: 9 front:202
## two : 91 hardtop : 8 fwd:120 rear : 3
## hatchback :70 rwd: 76
## sedan :96
## wagon :25
##
##
## wheel_base length width height
## Min. : 86.60 Min. :141.1 Min. :60.30 Min. :47.80
## 1st Qu.: 94.50 1st Qu.:166.3 1st Qu.:64.10 1st Qu.:52.00
## Median : 97.00 Median :173.2 Median :65.50 Median :54.10
## Mean : 98.76 Mean :174.0 Mean :65.91 Mean :53.72
## 3rd Qu.:102.40 3rd Qu.:183.1 3rd Qu.:66.90 3rd Qu.:55.50
## Max. :120.90 Max. :208.1 Max. :72.30 Max. :59.80
##
## curb_weight engine_type num_of_cylinders engine_size fuel_system
## Min. :1488 dohc : 12 eight : 5 Min. : 61.0 mpfi :94
## 1st Qu.:2145 dohcv: 1 five : 11 1st Qu.: 97.0 2bbl :66
## Median :2414 l : 12 four :159 Median :120.0 idi :20
## Mean :2556 ohc :148 six : 24 Mean :126.9 1bbl :11
## 3rd Qu.:2935 ohcf : 15 three : 1 3rd Qu.:141.0 spdi : 9
## Max. :4066 ohcv : 13 twelve: 1 Max. :326.0 4bbl : 3
## rotor: 4 two : 4 (Other): 2
## bore stroke compression_ratio horsepower
## Min. :2.54 Min. :2.070 Min. : 7.00 Min. : 48.0
## 1st Qu.:3.13 1st Qu.:3.100 1st Qu.: 8.60 1st Qu.: 70.0
## Median :3.31 Median :3.290 Median : 9.00 Median : 95.0
## Mean :3.32 Mean :3.243 Mean :10.14 Mean :103.8
## 3rd Qu.:3.58 3rd Qu.:3.410 3rd Qu.: 9.40 3rd Qu.:116.0
## Max. :3.94 Max. :4.170 Max. :23.00 Max. :288.0
##
## peak_rpm city_mpg highway_mpg price
## Min. :4150 Min. :13.00 Min. :16.00 Min. : 5118
## 1st Qu.:4800 1st Qu.:19.00 1st Qu.:25.00 1st Qu.: 7788
## Median :5200 Median :24.00 Median :30.00 Median :10595
## Mean :5116 Mean :25.22 Mean :30.75 Mean :13668
## 3rd Qu.:5500 3rd Qu.:30.00 3rd Qu.:34.00 3rd Qu.:16558
## Max. :6600 Max. :49.00 Max. :54.00 Max. :45400
##
From the last call to summary() we see that there are no missing values anymore and that, after the imputation, all the numeric variables are within their corresponding ranges (according to the UCI data dictionary).
The next step is to split the cars data set into a train and a test set. To do so, I used two thirds of the observations for the training set and the remaining third for the test set.
# Train and test
# Set seed and random index
set.seed(1234)
idx <- sample(x = nrow(cars), size = (2/3)*nrow(cars), replace = FALSE)
# Predictors
x_train <- cars[idx, 1:25]
x_test <- cars[-idx, 1:25]
# Response
y_train <- cars[idx, 26]
y_test <- cars[-idx, 26]
To select the variables for the model, I used LASSO with the function cv.glmnet() from the package glmnet. It is important to note that the arguments x and y should be obtained using the function data.matrix(), otherwise cv.glmnet() returns an error. Also, to fit the regression model, we have to specify type.measure = "mse" and family = "gaussian".
Next, I fitted the model p6_lasso using lambda.1se because, in a statistical sense, it is indistinguishable from lambda.min but results in a model with fewer parameters.
# Set seed
set.seed(1234)
# CV to select the value of lambda
p6_lasso_cv <- cv.glmnet(x = data.matrix(x_train), y = data.matrix(y_train),
type.measure = "mse", alpha = 1, family = "gaussian")
# Plot of MSE vs. log(lambda)
plot(p6_lasso_cv)
# Fit the LASSO model
p6_lasso <- glmnet(x = data.matrix(x_train), y = data.matrix(y_train),
alpha = 1, family = "gaussian", lambda = p6_lasso_cv$lambda.1se)
# Coefficients of LASSO model
coef(p6_lasso)
## 26 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -41251.436318
## symboling .
## normalized_losses .
## make -62.433976
## fuel_type .
## aspiration .
## num_of_doors .
## body_style .
## drive_wheels .
## engine_location 3057.137105
## wheel_base .
## length .
## width 549.155503
## height .
## curb_weight 1.468725
## engine_type .
## num_of_cylinders .
## engine_size 53.920240
## fuel_system .
## bore .
## stroke .
## compression_ratio .
## horsepower 56.989237
## peak_rpm .
## city_mpg .
## highway_mpg .
The resulting model is price ~ make + engine_location + width + curb_weight + engine_size + horsepower; all other predictors' coefficients were shrunk to zero.
Below are the predictions on the test set using the resulting model and the corresponding test MSE.
# Make predictions
p6_lasso_pred <- predict(p6_lasso, newx = data.matrix(x_test), s = p6_lasso_cv$lambda.1se)
# Compute test MSE
p6_MSEP <- mean((y_test - p6_lasso_pred)^2)
p6_MSEP
## [1] 21670879
For this problem I used the function gam() from the package mgcv. I started by fitting a GAM model with the predictors selected by LASSO in problem 6): make, engine_location, width, curb_weight, engine_size, and horsepower. Then, I fitted different GAM models and computed the test MSE with 10-fold cross validation.
Finally, I chose the GAM model with the smallest test MSE.
In this model I used a smoother for the variables width, curb_weight, engine_size, and horsepower, and included drive_wheels and engine_location as linear terms. When we include a categorical variable as a linear term (as in this case), gam() fits a fixed effect for each level of the category.
# GAM model with 10 fold CV
set.seed(1234)
df7 <- cars
df7$fold <- sample(1:10, 205, replace = TRUE)
p7_MSEP <- list()
# price ~ drive_wheels + engine_location + s(width) + s(curb_weight) + s(engine_size) + s(horsepower)
for (i in 1:10){
gam_temp <- gam(price ~ drive_wheels + engine_location + s(width)
+ s(curb_weight) + s(engine_size) + s(horsepower),
data = df7[!(df7$fold == i), ], method = "REML", select = TRUE)
# Predictions model gam_temp
gam_temp_pred <- predict.gam(gam_temp, newdata = df7[(df7$fold == i), ])
# MSEP model gam_temp
p7_MSEP[[i]] <- mean((df7$price[(df7$fold == i)] - gam_temp_pred)^2)
}
# MSEP for different folds
unlist(p7_MSEP)
## [1] 7559961 6763730 24519172 39526409 6169915 50907957 6139796
## [8] 11395538 5628761 5500033
# Mean MSEP
mean(unlist(p7_MSEP))
## [1] 16411127
This GAM model is similar to the previous one, but now I am specifying s(horsepower, by = engine_location) to fit a different smooth for each level of the categorical variable. This is called a “smooth-factor interaction”.
# GAM model with 10 fold CV
set.seed(1234)
df7 <- cars
df7$fold <- sample(1:10, 205, replace = TRUE)
p7_MSEP <- list()
# price ~ drive_wheels + s(width) + s(curb_weight) + s(engine_size) + s(horsepower, by = engine_location)
for (i in 1:10){
gam_temp <- gam(price ~ drive_wheels + s(width)
+ s(curb_weight) + s(engine_size) + s(horsepower, by = engine_location),
data = df7[!(df7$fold == i), ], method = "REML", select = TRUE)
# Predictions model gam_temp
gam_temp_pred <- predict.gam(gam_temp, newdata = df7[(df7$fold == i), ])
# MSEP model gam_temp
p7_MSEP[[i]] <- mean((df7$price[(df7$fold == i)] - gam_temp_pred)^2)
}
# MSEP for different folds
unlist(p7_MSEP)
## [1] 7509187 6780296 24186456 39510425 6529835 51048495 6144899
## [8] 11311768 5605478 5510205
# Mean MSEP
mean(unlist(p7_MSEP))
## [1] 16413704
This model is a combination of the previous two. Here I am modeling a smooth-factor interaction with s(horsepower, by = engine_location) and also including a varying intercept (the linear engine_location term). This will result in an improvement if the categories differ in overall mean in addition to the shape of the smooths.
# GAM model with 10 fold CV
set.seed(1234)
df7 <- cars
df7$fold <- sample(1:10, 205, replace = TRUE)
p7_MSEP <- list()
# price ~ drive_wheels + engine_location + s(width) + s(curb_weight) + s(engine_size) + s(horsepower, by = engine_location)
for (i in 1:10){
gam_temp <- gam(price ~ drive_wheels + engine_location + s(width)
+ s(curb_weight) + s(engine_size) + s(horsepower, by = engine_location),
data = df7[!(df7$fold == i), ], method = "REML", select = TRUE)
# Predictions model gam_temp
gam_temp_pred <- predict.gam(gam_temp, newdata = df7[(df7$fold == i), ])
# MSEP model gam_temp
p7_MSEP[[i]] <- mean((df7$price[(df7$fold == i)] - gam_temp_pred)^2)
}
# MSEP for different folds
unlist(p7_MSEP)
## [1] 7561750 6764076 24519143 39526587 6170054 50908202 6139432
## [8] 11394694 5629081 5499898
# Mean MSEP
mean(unlist(p7_MSEP))
## [1] 16411292
Finally, I fitted a last GAM model without linear terms, considering only smooths for the continuous predictors.
# GAM model with 10 fold CV
set.seed(1234)
df7 <- cars
df7$fold <- sample(1:10, 205, replace = TRUE)
p7_MSEP <- list()
# price ~ s(width) + s(curb_weight) + s(engine_size) + s(horsepower)
for (i in 1:10){
gam_temp <- gam(price ~ s(width) + s(curb_weight) + s(engine_size) + s(horsepower),
data = df7[!(df7$fold == i), ], method = "REML", select = TRUE)
# Predictions model gam_temp
gam_temp_pred <- predict.gam(gam_temp, newdata = df7[(df7$fold == i), ])
# MSEP model gam_temp
p7_MSEP[[i]] <- mean((df7$price[(df7$fold == i)] - gam_temp_pred)^2)
}
# MSEP for different folds
unlist(p7_MSEP)
## [1] 9406306 9906989 22583957 52967969 11918524 51985440 5755281
## [8] 17740862 4320213 3490733
# Mean MSEP
mean(unlist(p7_MSEP))
## [1] 19007627
The GAM model with the smallest test MSE is GAM 1, namely price ~ drive_wheels + engine_location + s(width) + s(curb_weight) + s(engine_size) + s(horsepower).
Using the cars data set, the test MSE for the LASSO model was 21670879, while for the best GAM model it was 16411127. Taking this into account, I prefer the GAM model for this data set.
The first step is to import the data.
# Data import
# https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime
crimes <- read.csv("https://bit.ly/2CJ7Wko", header = FALSE, na.strings = "?")
# Structure of data
str(crimes, list.len = ncol(crimes))
## 'data.frame': 1994 obs. of 128 variables:
## $ V1 : int 8 53 24 34 42 6 44 6 21 29 ...
## $ V2 : int NA NA NA 5 95 NA 7 NA NA NA ...
## $ V3 : int NA NA NA 81440 6096 NA 41500 NA NA NA ...
## $ V4 : Factor w/ 1828 levels "Aberdeencity",..: 796 1626 2 1788 142 1520 840 1462 669 288 ...
## $ V5 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ V6 : num 0.19 0 0 0.04 0.01 0.02 0.01 0.01 0.03 0.01 ...
## $ V7 : num 0.33 0.16 0.42 0.77 0.55 0.28 0.39 0.74 0.34 0.4 ...
## $ V8 : num 0.02 0.12 0.49 1 0.02 0.06 0 0.03 0.2 0.06 ...
## $ V9 : num 0.9 0.74 0.56 0.08 0.95 0.54 0.98 0.46 0.84 0.87 ...
## $ V10 : num 0.12 0.45 0.17 0.12 0.09 1 0.06 0.2 0.02 0.3 ...
## $ V11 : num 0.17 0.07 0.04 0.1 0.05 0.25 0.02 1 0 0.03 ...
## $ V12 : num 0.34 0.26 0.39 0.51 0.38 0.31 0.3 0.52 0.38 0.9 ...
## $ V13 : num 0.47 0.59 0.47 0.5 0.38 0.48 0.37 0.55 0.45 0.82 ...
## $ V14 : num 0.29 0.35 0.28 0.34 0.23 0.27 0.23 0.36 0.28 0.8 ...
## $ V15 : num 0.32 0.27 0.32 0.21 0.36 0.37 0.6 0.35 0.48 0.39 ...
## $ V16 : num 0.2 0.02 0 0.06 0.02 0.04 0.02 0 0.04 0.02 ...
## $ V17 : num 1 1 0 1 0.9 1 0.81 0 1 1 ...
## $ V18 : num 0.37 0.31 0.3 0.58 0.5 0.52 0.42 0.16 0.17 0.54 ...
## $ V19 : num 0.72 0.72 0.58 0.89 0.72 0.68 0.5 0.44 0.47 0.59 ...
## $ V20 : num 0.34 0.11 0.19 0.21 0.16 0.2 0.23 1 0.36 0.22 ...
## $ V21 : num 0.6 0.45 0.39 0.43 0.68 0.61 0.68 0.23 0.34 0.86 ...
## $ V22 : num 0.29 0.25 0.38 0.36 0.44 0.28 0.61 0.53 0.55 0.42 ...
## $ V23 : num 0.15 0.29 0.4 0.2 0.11 0.15 0.21 0.97 0.48 0.02 ...
## $ V24 : num 0.43 0.39 0.84 0.82 0.71 0.25 0.54 0.41 0.43 0.31 ...
## $ V25 : num 0.39 0.29 0.28 0.51 0.46 0.62 0.43 0.15 0.21 0.85 ...
## $ V26 : num 0.4 0.37 0.27 0.36 0.43 0.72 0.47 0.1 0.23 0.89 ...
## $ V27 : num 0.39 0.38 0.29 0.4 0.41 0.76 0.44 0.12 0.23 0.94 ...
## $ V28 : num 0.32 0.33 0.27 0.39 0.28 0.77 0.4 0.08 0.19 0.11 ...
## $ V29 : num 0.27 0.16 0.07 0.16 0 0.28 0.24 0.17 0.1 0.09 ...
## $ V30 : num 0.27 0.3 0.29 0.25 0.74 0.52 0.86 0.27 0.26 0.33 ...
## $ V31 : num 0.36 0.22 0.28 0.36 0.51 0.48 0.24 0.18 0.29 0.17 ...
## $ V32 : num 0.41 0.35 0.39 0.44 0.48 0.6 0.36 0.21 0.22 0.8 ...
## $ V33 : num 0.08 0.01 0.01 0.01 0 0.01 0.01 0.03 0.04 0 ...
## $ V34 : num 0.19 0.24 0.27 0.1 0.06 0.12 0.11 0.64 0.45 0.11 ...
## $ V35 : num 0.1 0.14 0.27 0.09 0.25 0.13 0.29 0.96 0.52 0.04 ...
## $ V36 : num 0.18 0.24 0.43 0.25 0.3 0.12 0.41 0.82 0.59 0.03 ...
## $ V37 : num 0.48 0.3 0.19 0.31 0.33 0.8 0.36 0.12 0.17 1 ...
## $ V38 : num 0.27 0.27 0.36 0.33 0.12 0.1 0.28 1 0.55 0.11 ...
## $ V39 : num 0.68 0.73 0.58 0.71 0.65 0.65 0.54 0.26 0.43 0.44 ...
## $ V40 : num 0.23 0.57 0.32 0.36 0.67 0.19 0.44 0.43 0.59 0.2 ...
## $ V41 : num 0.41 0.15 0.29 0.45 0.38 0.77 0.53 0.34 0.36 1 ...
## $ V42 : num 0.25 0.42 0.49 0.37 0.42 0.06 0.33 0.71 0.64 0.02 ...
## $ V43 : num 0.52 0.36 0.32 0.39 0.46 0.91 0.49 0.18 0.29 0.96 ...
## $ V44 : num 0.68 1 0.63 0.34 0.22 0.49 0.25 0.38 0.62 0.3 ...
## $ V45 : num 0.4 0.63 0.41 0.45 0.27 0.57 0.34 0.47 0.26 0.85 ...
## $ V46 : num 0.75 0.91 0.71 0.49 0.2 0.61 0.28 0.59 0.66 0.39 ...
## $ V47 : num 0.75 1 0.7 0.44 0.21 0.58 0.28 0.52 0.67 0.36 ...
## $ V48 : num 0.35 0.29 0.45 0.75 0.51 0.44 0.42 0.78 0.37 0.31 ...
## $ V49 : num 0.55 0.43 0.42 0.65 0.91 0.62 0.77 0.45 0.51 0.65 ...
## $ V50 : num 0.59 0.47 0.44 0.54 0.91 0.69 0.81 0.43 0.55 0.73 ...
## $ V51 : num 0.61 0.6 0.43 0.83 0.89 0.87 0.79 0.34 0.58 0.78 ...
## $ V52 : num 0.56 0.39 0.43 0.65 0.85 0.53 0.74 0.34 0.47 0.67 ...
## $ V53 : num 0.74 0.46 0.71 0.85 0.4 0.3 0.57 0.29 0.65 0.72 ...
## $ V54 : num 0.76 0.53 0.67 0.86 0.6 0.43 0.62 0.27 0.64 0.71 ...
## $ V55 : num 0.04 0 0.01 0.03 0 0 0 0.02 0.02 0 ...
## $ V56 : num 0.14 0.24 0.46 0.33 0.06 0.11 0.13 0.5 0.29 0.07 ...
## $ V57 : num 0.03 0.01 0 0.02 0 0.04 0.01 0.02 0 0.01 ...
## $ V58 : num 0.24 0.52 0.07 0.11 0.03 0.3 0 0.5 0.12 0.41 ...
## $ V59 : num 0.27 0.62 0.06 0.2 0.07 0.35 0.02 0.59 0.09 0.44 ...
## $ V60 : num 0.37 0.64 0.15 0.3 0.2 0.43 0.02 0.65 0.07 0.52 ...
## $ V61 : num 0.39 0.63 0.19 0.31 0.27 0.47 0.1 0.59 0.13 0.48 ...
## $ V62 : num 0.07 0.25 0.02 0.05 0.01 0.5 0 0.69 0 0.22 ...
## $ V63 : num 0.07 0.27 0.02 0.08 0.02 0.5 0.01 0.72 0 0.21 ...
## $ V64 : num 0.08 0.25 0.04 0.11 0.04 0.56 0.01 0.71 0 0.22 ...
## $ V65 : num 0.08 0.23 0.05 0.11 0.05 0.57 0.03 0.6 0 0.19 ...
## $ V66 : num 0.89 0.84 0.88 0.81 0.88 0.45 0.73 0.12 0.99 0.85 ...
## $ V67 : num 0.06 0.1 0.04 0.08 0.05 0.28 0.05 0.93 0.01 0.03 ...
## $ V68 : num 0.14 0.16 0.2 0.56 0.16 0.25 0.12 0.74 0.12 0.09 ...
## $ V69 : num 0.13 0.1 0.2 0.62 0.19 0.19 0.13 0.75 0.12 0.06 ...
## $ V70 : num 0.33 0.17 0.46 0.85 0.59 0.29 0.42 0.8 0.35 0.15 ...
## $ V71 : num 0.39 0.29 0.52 0.77 0.6 0.53 0.54 0.68 0.38 0.34 ...
## $ V72 : num 0.28 0.17 0.43 1 0.37 0.18 0.24 0.92 0.33 0.05 ...
## $ V73 : num 0.55 0.26 0.42 0.94 0.89 0.39 0.65 0.39 0.5 0.48 ...
## $ V74 : num 0.09 0.2 0.15 0.12 0.02 0.26 0.03 0.89 0.1 0.03 ...
## $ V75 : num 0.51 0.82 0.51 0.01 0.19 0.73 0.46 0.66 0.64 0.58 ...
## $ V76 : num 0.5 0 0.5 0.5 0.5 0 0.5 0 0 0 ...
## $ V77 : num 0.21 0.02 0.01 0.01 0.01 0.02 0.01 0.01 0.04 0.02 ...
## $ V78 : num 0.71 0.79 0.86 0.97 0.89 0.84 0.89 0.91 0.72 0.72 ...
## $ V79 : num 0.52 0.24 0.41 0.96 0.87 0.3 0.57 0.46 0.49 0.38 ...
## $ V80 : num 0.05 0.02 0.29 0.6 0.04 0.16 0.09 0.22 0.05 0.07 ...
## $ V81 : num 0.26 0.25 0.3 0.47 0.55 0.28 0.49 0.37 0.49 0.47 ...
## $ V82 : num 0.65 0.65 0.52 0.52 0.73 0.25 0.38 0.6 0.5 0.04 ...
## $ V83 : num 0.14 0.16 0.47 0.11 0.05 0.02 0.05 0.28 0.57 0.01 ...
## $ V84 : num 0.06 0 0.45 0.11 0.14 0.05 0.05 0.23 0.22 0 ...
## $ V85 : num 0.22 0.21 0.18 0.24 0.31 0.94 0.37 0.15 0.07 0.63 ...
## $ V86 : num 0.19 0.2 0.17 0.21 0.31 1 0.38 0.13 0.07 0.71 ...
## $ V87 : num 0.18 0.21 0.16 0.19 0.3 1 0.39 0.13 0.08 0.79 ...
## $ V88 : num 0.36 0.42 0.27 0.75 0.4 0.67 0.26 0.21 0.14 0.44 ...
## $ V89 : num 0.35 0.38 0.29 0.7 0.36 0.63 0.35 0.24 0.17 0.42 ...
## $ V90 : num 0.38 0.4 0.27 0.77 0.38 0.68 0.42 0.25 0.16 0.47 ...
## $ V91 : num 0.34 0.37 0.31 0.89 0.38 0.62 0.35 0.24 0.15 0.41 ...
## $ V92 : num 0.38 0.29 0.48 0.63 0.22 0.47 0.46 0.64 0.38 0.23 ...
## $ V93 : num 0.46 0.32 0.39 0.51 0.51 0.59 0.44 0.59 0.13 0.27 ...
## $ V94 : num 0.25 0.18 0.28 0.47 0.21 0.11 0.31 0.28 0.36 0.28 ...
## $ V95 : num 0.04 0 0 0 0 0 0 0 0.01 0 ...
## $ V96 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ V97 : num 0.12 0.21 0.14 0.19 0.11 0.7 0.15 0.59 0.01 0.22 ...
## $ V98 : num 0.42 0.5 0.49 0.3 0.72 0.42 0.81 0.58 0.78 0.42 ...
## $ V99 : num 0.5 0.34 0.54 0.73 0.64 0.49 0.77 0.52 0.48 0.34 ...
## $ V100: num 0.51 0.6 0.67 0.64 0.61 0.73 0.91 0.79 0.79 0.23 ...
## $ V101: num 0.64 0.52 0.56 0.65 0.53 0.64 0.84 0.78 0.75 0.09 ...
## $ V102: num 0.03 NA NA NA NA NA NA NA NA NA ...
## $ V103: num 0.13 NA NA NA NA NA NA NA NA NA ...
## $ V104: num 0.96 NA NA NA NA NA NA NA NA NA ...
## $ V105: num 0.17 NA NA NA NA NA NA NA NA NA ...
## $ V106: num 0.06 NA NA NA NA NA NA NA NA NA ...
## $ V107: num 0.18 NA NA NA NA NA NA NA NA NA ...
## $ V108: num 0.44 NA NA NA NA NA NA NA NA NA ...
## $ V109: num 0.13 NA NA NA NA NA NA NA NA NA ...
## $ V110: num 0.94 NA NA NA NA NA NA NA NA NA ...
## $ V111: num 0.93 NA NA NA NA NA NA NA NA NA ...
## $ V112: num 0.03 NA NA NA NA NA NA NA NA NA ...
## $ V113: num 0.07 NA NA NA NA NA NA NA NA NA ...
## $ V114: num 0.1 NA NA NA NA NA NA NA NA NA ...
## $ V115: num 0.07 NA NA NA NA NA NA NA NA NA ...
## $ V116: num 0.02 NA NA NA NA NA NA NA NA NA ...
## $ V117: num 0.57 NA NA NA NA NA NA NA NA NA ...
## $ V118: num 0.29 NA NA NA NA NA NA NA NA NA ...
## $ V119: num 0.12 0.02 0.01 0.02 0.04 0.01 0.05 0.01 0.04 0 ...
## $ V120: num 0.26 0.12 0.21 0.39 0.09 0.58 0.08 0.33 0.17 0.47 ...
## $ V121: num 0.2 0.45 0.02 0.28 0.02 0.1 0.06 0 0.04 0.11 ...
## $ V122: num 0.06 NA NA NA NA NA NA NA NA NA ...
## $ V123: num 0.04 NA NA NA NA NA NA NA NA NA ...
## $ V124: num 0.9 NA NA NA NA NA NA NA NA NA ...
## $ V125: num 0.5 NA NA NA NA NA NA NA NA NA ...
## $ V126: num 0.32 0 0 0 0 0 0 0 0 0 ...
## $ V127: num 0.14 NA NA NA NA NA NA NA NA NA ...
## $ V128: num 0.2 0.67 0.43 0.12 0.03 0.14 0.03 0.55 0.53 0.15 ...
We can see that the data is imported as an object of class “data.frame”. The str() function displays the structure of the data, and we can see that there are missing values.
The next step is to read the data dictionary in the UCI Machine Learning Repository to check whether the variables were imported correctly. The data dictionary also tells us that some variables (here the first five columns, V1 through V5) are not predictive.
Hence, I removed these variables from the data set.
# Remove not predictive variables
crimes <- crimes %>%
select(-c(V1, V2, V3, V4, V5))
As mentioned before, there are several variables with NA values. I used the function mice() from the mice package to impute values for these variables.
# Imputation
temp_data <- mice(crimes, m = 3, maxit = 5, meth = "pmm", seed = 1234)
##
## iter imp variable
## 1 1 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## 1 2 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## 1 3 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## 2 1 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## 2 2 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## 2 3 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## 3 1 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## 3 2 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## 3 3 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## 4 1 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## 4 2 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## 4 3 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## 5 1 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## 5 2 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## 5 3 V31 V102 V103 V104 V105 V106 V107 V108 V110 V111 V112 V113 V114 V115 V116 V117 V118 V122 V123 V124 V125 V127
## Warning: Number of logged events: 331
Next, I used the function complete(), also from the mice package, to complete the data, and stored the result in a new data frame called df9.
# Complete data
df9 <- complete(temp_data, 1)
# Problem with V109
summary(df9$V109)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1300 0.1800 0.2175 0.2500 1.0000 1675
For some reason mice does not impute V109, so I imputed its median manually.
# Manually impute for V109
imp_V109 <- median(df9$V109, na.rm = TRUE)
df9$V109[which(is.na(df9$V109))] <- imp_V109
The next step is to split the df9 data set into a train and a test set. To do so, I used two thirds of the observations for the training set and the remaining third for the test set.
# Train and test
# Set seed and random index
set.seed(1234)
idx <- sample(x = nrow(df9), size = (2/3)*nrow(df9), replace = FALSE)
# Predictors
x_train <- df9[idx, 1:121]
x_test <- df9[-idx, 1:121]
# Response
y_train <- df9[idx, 122]
y_test <- df9[-idx, 122]
Finally, I fitted several elastic net models with different values of alpha and computed test MSEP for each of them.
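For reference, and in the same notation as the LASSO and Ridge formulas earlier in this report (ignoring glmnet's internal scaling of the loss), the elastic net objective mixes the two penalties through \(\alpha\):
\[\sum_{i = 1}^{n} \big(y_{i} - \beta_{0} - \sum_{j = 1}^{p} \beta_{j} x_{ij} \big)^2 + \lambda \sum_{j = 1}^{p} \Big( \alpha |\beta_{j}| + \frac{1 - \alpha}{2} \beta_{j}^{2} \Big)\]
so that \(\alpha = 1\) recovers LASSO and \(\alpha = 0\) recovers Ridge.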
I started by creating a list, called enet_models, where I am going to store the different elastic net models. Then I wrote a for loop that iterates over different values of alpha and passes each value to cv.glmnet(). The output of cv.glmnet() is stored in enet_models and is used in the next for loop, where the predict() function is called.
In the second for loop, predictions are made for each value of alpha and stored in the object enet_pred. With those values, the test MSEP is computed and stored in enet_MSEP. Finally, in every iteration of the loop, alpha, enet_MSEP, and enet_name are stored in the temporary data frame temp, which is then row-bound to the results data frame.
# Set seed
set.seed(1234)
# List to store elastic net models with different alphas
enet_models <- list()
for (i in 0:10){
# Model name
enet_name <- paste0("alpha", i/10)
# CV to select lambda for each value of alpha
enet_models[[enet_name]] <- cv.glmnet(x = data.matrix(x_train), y = data.matrix(y_train),
type.measure = "mse", alpha = i/10, family = "gaussian")
}
# Data frame to store results
results <- data.frame()
for (i in 0:10){
# Model name
enet_name <- paste0("alpha", i/10)
# Predictions
enet_pred <- predict(enet_models[[enet_name]],
s = enet_models[[enet_name]]$lambda.1se, newx = data.matrix(x_test))
# Compute test MSEP
enet_MSEP <- mean((y_test - enet_pred)^2)
# Store alpha, MSEP, and model name in temp data frame
temp <- data.frame(alpha = i/10, enet_MSEP = enet_MSEP, enet_name = enet_name)
# Row bind temp to results data frame
results <- rbind(results, temp)
}
# Show results
results
## alpha enet_MSEP enet_name
## 1 0.0 0.006667070 alpha0
## 2 0.1 0.004982013 alpha0.1
## 3 0.2 0.004806970 alpha0.2
## 4 0.3 0.004834450 alpha0.3
## 5 0.4 0.004899841 alpha0.4
## 6 0.5 0.004775144 alpha0.5
## 7 0.6 0.004822186 alpha0.6
## 8 0.7 0.004956150 alpha0.7
## 9 0.8 0.004862721 alpha0.8
## 10 0.9 0.004890681 alpha0.9
## 11 1.0 0.005124086 alpha1
From results, we observe that the smallest test MSEP is obtained using \(\alpha = 0.5\).
To conclude, I fit the elastic net model with alpha = 0.5 and show its coefficients. We can see that some of them have been shrunk to zero and others to very small values.
# Set seed
set.seed(1234)
# CV to select the value of lambda
enet_final_cv <- cv.glmnet(x = data.matrix(x_train), y = data.matrix(y_train),
type.measure = "mse", alpha = 0.5, family = "gaussian")
# Plot of MSE vs. log(lambda)
plot(enet_final_cv)
# Fit the elastic net model
enet_final <- glmnet(x = data.matrix(x_train), y = data.matrix(y_train),
alpha = 0.5, family = "gaussian", lambda = enet_final_cv$lambda.1se)
# Coefficients
coef(enet_final)
## 122 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 0.8141653697
## V6 .
## V7 0.0330382474
## V8 -0.0075820692
## V9 0.0114380959
## V10 -0.0299804066
## V11 .
## V12 0.0006534571
## V13 -0.1028892814
## V14 -0.0259960410
## V15 .
## V16 .
## V17 -0.0789971992
## V18 -0.0508585112
## V19 -0.1057517101
## V20 -0.0129870582
## V21 0.0791538535
## V22 -0.2928632621
## V23 0.0338010267
## V24 -0.0157624002
## V25 0.0138503690
## V26 .
## V27 -0.0127860069
## V28 0.0565525227
## V29 0.0376076388
## V30 -0.0350711902
## V31 0.0479656215
## V32 -0.0578237561
## V33 -0.0089708109
## V34 0.1331489492
## V35 0.1116073208
## V36 -0.2478165665
## V37 -0.2377867830
## V38 -0.1080590520
## V39 -0.2784794322
## V40 0.1545775258
## V41 .
## V42 -0.0313782439
## V43 0.1462542675
## V44 -0.1386543156
## V45 -0.1024025528
## V46 .
## V47 .
## V48 0.0044462099
## V49 -0.0939670935
## V50 -0.1832024355
## V51 0.2732906519
## V52 0.0122564705
## V53 0.0634836218
## V54 0.0297050520
## V55 0.1138081240
## V56 .
## V57 -0.1817845520
## V58 -0.2187270673
## V59 0.1515294223
## V60 .
## V61 0.1511782354
## V62 .
## V63 .
## V64 .
## V65 -0.0954286018
## V66 0.0059720858
## V67 -0.0624960597
## V68 0.0750087910
## V69 .
## V70 .
## V71 -0.0519289917
## V72 -0.2004792813
## V73 -0.0746416412
## V74 0.1334838645
## V75 .
## V76 0.0009247091
## V77 0.0001367667
## V78 0.1104608401
## V79 .
## V80 0.0237334842
## V81 0.0491419879
## V82 -0.0354619936
## V83 .
## V84 -0.0824981675
## V85 0.0086405301
## V86 -0.0885778575
## V87 .
## V88 0.0830537148
## V89 0.0728738963
## V90 .
## V91 .
## V92 -0.0494731626
## V93 0.0667032995
## V94 -0.0785663872
## V95 0.0610509958
## V96 .
## V97 0.1468123794
## V98 -0.0294297451
## V99 0.0091647185
## V100 0.0072653867
## V101 -0.0558800323
## V102 -1.5482567069
## V103 0.5173826840
## V104 -0.3085014287
## V105 0.4112105890
## V106 0.0986388772
## V107 0.0838994683
## V108 -0.0142401745
## V109 0.0313844299
## V110 .
## V111 -0.0276188243
## V112 .
## V113 0.0846552709
## V114 0.0084125436
## V115 -0.0668021810
## V116 -0.1817067067
## V117 .
## V118 0.0064450323
## V119 0.0448022873
## V120 .
## V121 -0.0091008230
## V122 0.0198524874
## V123 1.1848998489
## V124 -0.0673155901
## V125 -0.0016069745
## V126 0.0488002643