This exam has 3 questions. It is due May 5, 2018 at 11:55pm. There are 100 points available for students in both 388/488. Students enrolled in 388 will be scored out of 80 points. Don’t cheat. Seriously, don’t cheat. Have a great summer. Try not to do anything stupid.
(10 points) What is the difference between parametric and non-parametric statistics (2-3 lines should suffice for an answer)?
In parametric statistics we make assumptions about the probability distributions of the variables; the data are assumed to come from distributions characterized by a finite set of parameters (for example, a normal distribution with mean \(\mu\) and variance \(\sigma^2\)). Non-parametric statistics, on the other hand, does not make these distributional assumptions about the data (although it may make other assumptions, such as independence).
(10 points) Give an example of a hypothesis test that can be tested using both parametric and non-parametric tests. State the null and alternative hypotheses and describe (or give the names of) the parametric and nonparametric tests used to test the hypothesis that you have given as an example. (HINT: Pick an example that is easy for yourself!)
For this problem I selected two statistical procedures that test the relative locations of three or more populations: one-way ANOVA (the parametric test) and the Kruskal-Wallis test (the non-parametric test).
Null hypothesis
\(H_0: \mu_{1} = \mu_{2} = \ldots = \mu_{k}\) where \(\mu_{i}\) is the mean of group \(i\) and \(k\) is the number of groups
The null hypothesis states that the means of all the groups are equal.
Alternative hypothesis
\(H_1: \mu_{i} \neq \mu_{j}\) for some \(i, j\), where \(\mu_{i}\) is the mean of group \(i\)
The alternative states that there is at least a pair of groups with different means.
This test uses the \(F\) statistic which is computed as follows: \[F = \frac{MST}{MSE} \stackrel{H_0}{\sim} F_{k-1, n-k}\]
where \(MST = \frac{SST}{k-1}\) and \(MSE = \frac{SSE}{n-k}\), with \(SST\) the between-group (treatment) sum of squares and \(SSE\) the within-group (error) sum of squares.
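As a hedged illustration (simulated data, not taken from the exam), the parametric procedure described above is one-way ANOVA, which can be run in R with aov():
# Sketch: one-way ANOVA on simulated data (three groups of 20 observations)
set.seed(123)
dat <- data.frame(
  y     = c(rnorm(20, mean = 10), rnorm(20, mean = 10), rnorm(20, mean = 12)),
  group = factor(rep(c("A", "B", "C"), each = 20))
)
fit_aov <- aov(y ~ group, data = dat)
summary(fit_aov)   # F statistic on k - 1 = 2 and n - k = 57 degrees of freedom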
Null hypothesis
\(H_0: \tau_{1} = \tau_{2} = \ldots = \tau_{k} = 0\) where \(\tau_{i}\) is the unknown treatment effect for the i-th population
Alternative hypothesis
\(H_a: \tau_{1}, \tau_{2}, \ldots, \tau_{k}\) are not all equal to 0
The hypotheses mention \(\tau\) (the unknown treatment effect for the i-th population), so I reviewed the assumptions of the test to remember what \(\tau\) is:
The observations are independent
For each group, \(j \in \{1, \ldots, k\}\) the \(n_j\) observations are a random sample from a continuous distribution with distribution function \(F_j\)
The distribution functions \(F_1, \ldots, F_k\) are connected through the relationship \[F_j(t) = F(t -\tau_j)\]
where \(j = 1, \ldots, k\), \(F\) is a distribution function for a continuous distribution with unknown median \(\theta\), and \(\tau_j\) is the unknown treatment effect for the j-th population. Under this third assumption, the \(k\) underlying populations can differ by at most a shift in location: they cannot differ in scale parameters and must belong to the same general family.
It is important to note that the Kruskal-Wallis test statistic (H) is computed using the ranks and not the raw data. The formula for the test statistic is the following:
\[H = \frac{12}{N(N+1)} \sum^{k}_{j = 1} n_j \left(R_{.j}-\frac{N+1}{2}\right)^2\]
where \(N\) is the total number of observations, \(k\) is the number of groups, \(R_{.j}\) is the mean of the ranks in the j-th group, and \(n_j\) is the number of observations in the j-th group. The term \(\frac{12}{N(N+1)}\) is a scaling factor that allows the p-value to be calculated from a \(\chi^2\) distribution with \(k-1\) degrees of freedom.
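As a hedged illustration (again with simulated data), the non-parametric counterpart is the Kruskal-Wallis test, available in R as kruskal.test():
# Sketch: Kruskal-Wallis test on simulated data (three groups of 20 observations)
set.seed(123)
dat <- data.frame(
  y     = c(rnorm(20, mean = 10), rnorm(20, mean = 10), rnorm(20, mean = 12)),
  group = factor(rep(c("A", "B", "C"), each = 20))
)
kruskal.test(y ~ group, data = dat)   # H statistic, chi-squared approximation with k - 1 df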
Answer the following questions with the mtcars dataset:
# Required libraries
library(ggplot2)
# Loading the data
df <- mtcars
(10 points) Using the mtcars data set that is pre-loaded in R, construct a kernel density estimator of the car weights in that data set and show the density in a figure. (hint: ggplot2 makes this very easy).
I produced the density estimate with ggplot() using the code below.
# Density estimation
ggplot(data = df, aes(x = wt)) + geom_density() +
ggtitle("3) A) - Density estimation for cars weight") +
scale_x_continuous("weight (1000 lbs)")
(10 points) Using the mtcars data set that is pre-loaded in R, fit a loess regression model with qsec as a predictor and mpg as the response variable. (hint: ggplot2 makes this very easy).
For this problem I also used ggplot(). My first approach was to use the default values for the parameters.
# Loess regression with span = 0.75 (default)
ggplot(data = df, aes(x = qsec, y = mpg)) + geom_point() +
stat_smooth(method = "loess", span = 0.75) +
ggtitle("3) B) - Loess Regression of mpg vs. qsec - Span = 0.75 (default)")
I repeated the loess regression but this time I changed the span to 0.35.
# Loess regression with span = 0.35
ggplot(data = df, aes(x = qsec, y = mpg)) + geom_point() +
stat_smooth(method = "loess", span = 0.35) +
ggtitle("3) B) - Loess Regression of mpg vs. qsec - Span = 0.35 ")
We observe that the first case, where span = 0.75, produces a smoother curve. The second curve shows a sharp drop on the right-hand side, and the 95% confidence interval widens rapidly in this region. If we use this curve to predict the mpg value of an observation with a high qsec value, the prediction is likely to have a large error. For this reason, I chose the first curve, fitted with span = 0.75.
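As a hedged sketch (not part of the original answer), the same comparison can be made by fitting the two models directly with stats::loess() and inspecting the standard errors of predictions near the upper end of qsec, where the span = 0.35 fit should be much less certain:
# Sketch: refit the two loess models and compare prediction uncertainty
fit75 <- loess(mpg ~ qsec, data = df, span = 0.75)
fit35 <- loess(mpg ~ qsec, data = df, span = 0.35)
newdat <- data.frame(qsec = c(20, 22))   # near the upper end of observed qsec
predict(fit75, newdata = newdat, se = TRUE)$se.fit
predict(fit35, newdata = newdat, se = TRUE)$se.fit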
Answer the following questions with the data sets in the dslabs package. We will focus on measles. (I did some data cleaning for you. Also, assume that the 0's in the early years are correct.)
# Required libraries
library(dslabs)
library(reshape2)
# Loading the data
measles <- subset(us_contagious_diseases, disease == "Measles")
measles <- melt(measles, id = c("disease","state","year"), measure = "count")
measles <- dcast(measles, formula = year ~ state)
#Remove the space in some of the state names.
names(measles)[10]<-"DC"
names(measles)[31]<-"NH"
names(measles)[32]<-"NJ"
names(measles)[33]<-"NM"
names(measles)[34]<-"NY"
names(measles)[35]<-"NC"
names(measles)[36]<-"ND"
names(measles)[41]<-"RI"
names(measles)[42]<-"SC"
names(measles)[43]<-"SD"
names(measles)[50]<-"WV"
(10 points) Construct a CART model to predict number of measles cases in Illinois using the number of cases in other states as potential predictors. Display the best tree that you find and try to offer some interpretation.
I first removed the variable year and created a data set called measles2. Then I fit the regression tree with the default values of the function tree(), without tuning any parameters, and called the model tree1. The idea of this initial tree is to see its general structure: how many branches it has, how many leaves it has, which variable produces the largest reduction in deviance, etc.
library(tree)
set.seed(4321)
# Removing the year
measles2 <- measles[-1]
# Fitting tree1
tree1 <- tree(Illinois ~ ., data = measles2)
tree1
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 76 2.002e+10 11500
## 2) Indiana < 8771.5 62 2.595e+09 5334
## 4) WV < 1676 40 1.273e+08 1206 *
## 5) WV > 1676 22 5.460e+08 12840 *
## 3) Indiana > 8771.5 14 4.620e+09 38810
## 6) NC < 10117.5 9 6.207e+08 29070 *
## 7) NC > 10117.5 5 1.607e+09 56350 *
plot(tree1)
text(tree1)
From this tree we observe that the root node is related to the number of measles cases in Indiana. This is reasonable since these states are next to each other and one can argue that they have similar weather conditions, similar demographic characteristics, etc. Probably, the factors that produce measles also behave in a similar fashion in Illinois and Indiana. The number of measles cases in Indiana is the variable that produces the largest reduction in deviance.
The first binary split opens two branches: the left branch continues asking about the number of measles cases in WV (West Virginia), while the right branch continues asking about the number of measles cases in NC (North Carolina).
Given an observation for which we are trying to predict the measles cases in Illinois, the reasoning path in words is the following: start at the root and ask whether Indiana has fewer than 8,771.5 cases. If it does, ask whether WV has fewer than 1,676 cases; the predicted value is about 1,206 cases if it does and about 12,840 cases if it does not. If Indiana has more than 8,771.5 cases, ask whether NC has fewer than 10,117.5 cases; the predicted value is about 29,070 cases if it does and about 56,350 cases if it does not.
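As a small hedged sketch (the observation below is made up, not taken from the data), the same path can be traced with predict():
# Hypothetical example: trace the tree for a made-up observation.
# Start from an existing row so every predictor column is present, then set
# hypothetical values for the two states this path actually asks about.
newobs <- measles2[1, ]
newobs$Indiana <- 5000   # fewer than 8771.5 cases -> go to the left branch
newobs$WV <- 2000        # more than 1676 cases -> right leaf of that branch
predict(tree1, newdata = newobs)   # about 12,840 predicted Illinois cases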
In order to improve the tree I performed stepwise variable selection.
# Variable selection
fitall <- lm(Illinois ~ ., data = measles2)
fitstart <- lm(Illinois ~ 1, data = measles2)
formula(lm_Illinois <- step(fitstart, scope = formula(fitall), direction = "both", trace = FALSE))
## Illinois ~ Pennsylvania + Indiana + SC + Michigan + RI + Iowa +
## Minnesota + Utah + Maine + Tennessee + Colorado + NJ + Kansas +
## Vermont + Kentucky + Arkansas + NY + Alaska + Maryland +
## California + Mississippi + NM + Virginia + Ohio + Nevada +
## Alabama + NC + Georgia + Nebraska + Massachusetts + NH +
## Wyoming + Washington + Delaware + Texas + Florida + Oklahoma +
## Missouri + WV + Hawaii
With the selected variables, I fitted a new tree.
set.seed(4321)
# Fitting tree2
tree2 <- tree(formula = formula(lm_Illinois) , data = measles2)
tree2
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 76 2.002e+10 11500
## 2) Indiana < 8771.5 62 2.595e+09 5334
## 4) WV < 1676 40 1.273e+08 1206 *
## 5) WV > 1676 22 5.460e+08 12840 *
## 3) Indiana > 8771.5 14 4.620e+09 38810
## 6) Pennsylvania < 70039 9 6.207e+08 29070 *
## 7) Pennsylvania > 70039 5 1.607e+09 56350 *
plot(tree2)
text(tree2)
The interpretation is similar, but this time instead of NC we have Pennsylvania.
(10 points) Compute the Pearson correlation coefficient between the Illinois and Massachusetts measles cases data. Compute the standard error of this estimator using the bootstrap procedure. Also, estimate the bias of this estimator.
First I checked for missing values in the data with the following code:
# Checking for missing values
Illinois <- measles[ , "Illinois"]
Massachusetts <- measles[ , "Massachusetts"]
sum(is.na(Illinois))
## [1] 0
sum(is.na(Massachusetts))
## [1] 0
There are no missing values for Illinois and Massachusetts. Then I used the bootstrap procedure to estimate the correlation coefficient, the standard error, and the bias of the estimator.
# Theta hat for the Pearson correlation coefficient
theta_hat_cor <- cor(Illinois, Massachusetts, method = "pearson")
theta_hat_cor
## [1] 0.4809065
# Simulation parameters
set.seed(4321)
n <- length(Illinois)
nsim <- 1000
# Theta bootstrap for the Pearson correlation coefficient
theta_boots_cor <- rep(NA, nsim) # Preallocation
for (i in 1:nsim){
boots_sample <- measles[sample(1:n, n, replace = TRUE), ]
temp <- cor(boots_sample$Illinois, boots_sample$Massachusetts, method = "pearson")
theta_boots_cor[i] <- temp # Correlation coefficients of the bootstraps samples
}
The standard error of the estimate of the Pearson correlation coefficient using the bootstrap procedure is the following:
# Bootstrap estimate of standard error for the Pearson correlation coefficient
boots_se <- sd(theta_boots_cor)
boots_se
## [1] 0.1059561
The bias of the Pearson correlation coefficient estimated using the bootstrap procedure is:
# Bootstrap estimate of bias for the Pearson correlation coefficient
boots_bias <- mean(theta_boots_cor) - theta_hat_cor
boots_bias
## [1] 0.01557448
Even though it is not asked for in the directions, I also calculated the bootstrap estimates of the MSE and the variance. Since the bias calculated above is small, the MSE and the variance should be similar (recall that the MSE equals the squared bias plus the variance, which is checked below).
# Bootstrap estimate of MSE for the Pearson correlation coefficient
boots_MSE <- mean((theta_boots_cor - theta_hat_cor)^2)
boots_MSE
## [1] 0.01145803
# Bootstrap estimate of variance for the Pearson correlation coefficient
boots_var <- var(theta_boots_cor)
boots_var
## [1] 0.01122669
# Checking
boots_bias^2 + boots_var
## [1] 0.01146925
(10 points) Perform PCA on the measles data with states as variables. (Remember to exclude year as a predictor and also don't forget to scale the data before doing PCA.) How many components are needed to account for 85% of the variability? Also, try to interpret the first two principal components.
For this problem I will use the data set measles3 where I removed the variable year.
# Removing the year
measles3 <- measles[-1]
# Data scaling and PCA
measles3 <- scale(measles3, scale = TRUE, center = TRUE)
s <- cor(scale(measles3, scale = TRUE, center = TRUE), method = "pearson")
e <- eigen(s)
str(e)
## List of 2
## $ values : num [1:51] 25.72 5.71 2.99 2.43 2.24 ...
## $ vectors: num [1:51, 1:51] -0.1608 -0.0827 -0.1287 -0.1383 -0.149 ...
## - attr(*, "class")= chr "eigen"
The list e stores the eigenvalues and the eigenvectors; I will use them later by calling e$values and e$vectors respectively. It is important to notice that e$values stores the eigenvalues in decreasing order. In other words, e$values[1] is the largest eigenvalue, e$values[2] is the second largest, and so on.
# Eigenvectors associated to the two largest eigenvalues
head(e$vectors[ , 1:2])
## [,1] [,2]
## [1,] -0.16077162 -0.04273130
## [2,] -0.08266421 0.30002422
## [3,] -0.12868551 0.24769098
## [4,] -0.13829837 -0.08250811
## [5,] -0.14897156 0.06196241
## [6,] -0.16622331 -0.01145588
Here I printed the first six components of the first two eigenvectors. The objective is just to keep a record, because later I will use a function called prcomp() that returns a list with many objects, including the eigenvectors, and I want to check whether prcomp() returns the same values we are getting here.
The proportion of the total variability explained by a single principal component is given by the eigenvalue associated with that principal component divided by the sum of all eigenvalues:
\[\text{Proportion PC}_{i} = \frac{\lambda_{i}}{\sum_{j = 1}^{p}{\lambda_{j}}}\]
where p is the number of predictor variables in the original data.
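For example, the share explained by the first component can be read directly off the eigenvalues (for a correlation matrix the eigenvalues sum to p, which is 51 here):
# Proportion of the total variability explained by the first principal component
e$values[1] / sum(e$values)   # roughly 0.504, i.e. about 50.4%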
The following for loop is to determine how many principal components account for 85% of the total variability:
# Account for 85% of the variability
cummulative_prop <- 0
counter <- 0
for (i in 1:length(e$values)){
if (cummulative_prop < 0.85){
cummulative_prop <- cummulative_prop + (e$values[i] / sum(e$values))
counter <- counter + 1
}
}
counter
## [1] 9
I created two variables, cummulative_prop and counter, both initialized at zero. The for loop goes through every component of the vector e$values, and an if statement checks whether cummulative_prop is still less than 0.85. If it is, cummulative_prop is updated and counter is increased by one.
Once counter reaches nine, cummulative_prop exceeds 0.85, the condition of the if statement is no longer met, and the two variables are not updated anymore.
We need 9 principal components to account for 85% of the variability.
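As a quick cross-check (a one-liner equivalent to the loop above), the same count can be obtained with cumsum():
# Cross-check: first principal component at which the cumulative proportion reaches 85%
cum_prop <- cumsum(e$values) / sum(e$values)
which(cum_prop >= 0.85)[1]   # also returns 9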
For this section it is important to recall that the principal components form a new coordinate system in which the principal components are the new orthogonal axes. By selecting only two principal components, PC1 and PC2, we obtain a plane onto which we can project the original data.
To interpret the first two PCs I will plot the data points in this new coordinate system. To do so I need to change the coordinates from the "old" system to the "new" one, which is accomplished by multiplying the data matrix (in the old coordinates) by the normalized eigenvectors 1 and 2, corresponding to eigenvalues 1 and 2 respectively.
# Data coordinates in the "new" system
out <- as.matrix(measles3)%*%e$vectors[ , 1:2]
head(out)
## [,1] [,2]
## [1,] -4.327350 -5.1326092
## [2,] -1.340115 -2.0964258
## [3,] -2.449262 -2.1443394
## [4,] -3.103270 -3.4145583
## [5,] -1.576672 -0.8715173
## [6,] -1.289000 -2.2692750
To use ggplot() I need a data frame:
# Preparing the data to use ggplot
year <- as.character(measles[ , 1])
row.names(out) <- year
colnames(out) <- c("PC1", "PC2")
out_df <- data.frame(out)
head(out_df)
## PC1 PC2
## 1928 -4.327350 -5.1326092
## 1929 -1.340115 -2.0964258
## 1930 -2.449262 -2.1443394
## 1931 -3.103270 -3.4145583
## 1932 -1.576672 -0.8715173
## 1933 -1.289000 -2.2692750
# Plot with ggplot
ggplot(data = out_df, aes(x = PC1, y = PC2, label = rownames(out_df))) +
geom_point(position = "jitter") +
geom_text() +
ggtitle("4) C) - PCA of measles data") +
scale_x_continuous("PC1 (50.44% explained variability)") +
scale_y_continuous("PC2 (11.20% explained variability)")
From the plot we can see that each point corresponds to a single year. Most of the variability (50.44%) is explained along the X axis, which is reasonable since that axis corresponds to PC1. Only 11.20% of the variability is explained along the Y axis. These percentages come from calculations that can be found below.
In the middle of the plot are the years of the 1940s and 1950s. As we move to the right, we find the years of the 1960s. In the rightmost part of the plot there are many points close together, corresponding to the decades from the 1970s to the present. Probably in these decades there were improvements in vaccines, public health policies, etc. that led to a continuous reduction in measles cases. This might be the reason why the (centered) variability along the Y axis is almost zero for the 1970-2003 period.
The 1930s deserve special attention. Several years of this decade appear in the left part of the plot, and most of them are in the lower part. I subsetted some states in that decade to look more closely at what is happening.
measles[7:12, c("year", "California", "Texas", "NY", "Florida")]
## year California Texas NY Florida
## 7 1934 25650 29699 34208 8015
## 8 1935 28799 3923 69073 1174
## 9 1936 49050 7776 64366 302
## 10 1937 5107 14694 27404 394
## 11 1938 18527 6011 70111 9175
## 12 1939 63748 7380 44812 2895
Many questions arise from here:
* Is it possible that California reduced measles cases from 49,050 in 1936 to 5,107 in 1937?
* Is it possible that in Florida measles cases grew more than twentyfold, from 394 in 1937 to 9,175 in 1938?
Maybe there was a change in the statistical reporting procedures in those years, or a change in the medical definition of what counts as measles, etc. In conclusion, data from this decade deserves a more in-depth analysis.
I did the same work using the function prcomp() from the stats package. Even though the results are essentially the same, I found this function useful because it returns a list of class "prcomp" that contains several objects. One of them is x, which contains the rotated data (the centered, and scaled if requested, data multiplied by the rotation matrix). These are the same values that the previous procedure stored in the matrix out, except that the sign of PC2 is flipped (eigenvectors are only defined up to sign).
# PCA with prcomp()
measles2_pca <- prcomp(measles2, retx = TRUE, scale. = TRUE, center = TRUE)
# Object x: rotated data
head(measles2_pca$x[ , 1:2])
## PC1 PC2
## [1,] -4.327350 5.1326092
## [2,] -1.340115 2.0964258
## [3,] -2.449262 2.1443394
## [4,] -3.103270 3.4145583
## [5,] -1.576672 0.8715173
## [6,] -1.289000 2.2692750
Another object is rotation, which contains the eigenvectors. These values match the ones in e$vectors from the previous procedure, again up to the sign of the second column.
# Object rotation: matrix that contains the eigenvectors
head(measles2_pca$rotation[ , 1:2])
## PC1 PC2
## Alabama -0.16077162 0.04273130
## Alaska -0.08266421 -0.30002422
## Arizona -0.12868551 -0.24769098
## Arkansas -0.13829837 0.08250811
## California -0.14897156 -0.06196241
## Colorado -0.16622331 0.01145588
Furthermore, when we use plot() with the "prcomp" object, we can easily get the scree plot.
# Scree plot
plot(measles2_pca, pch = 19, type = "l",
main = "Scree plot for measles data")
Finally, if we call summary() on the "prcomp" object, we get the proportion of variance and the cumulative proportion explained by each principal component. From here we can see that nine principal components are required to account for 85% of the variability.
summary(measles2_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 5.0717 2.3906 1.73002 1.55793 1.49685 1.21572
## Proportion of Variance 0.5044 0.1120 0.05869 0.04759 0.04393 0.02898
## Cumulative Proportion 0.5044 0.6164 0.67510 0.72269 0.76662 0.79560
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 1.08633 1.06723 1.0023 0.91873 0.81014 0.80742
## Proportion of Variance 0.02314 0.02233 0.0197 0.01655 0.01287 0.01278
## Cumulative Proportion 0.81874 0.84107 0.8608 0.87732 0.89019 0.90297
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.74147 0.71233 0.65939 0.6101 0.58548 0.57551
## Proportion of Variance 0.01078 0.00995 0.00853 0.0073 0.00672 0.00649
## Cumulative Proportion 0.91375 0.92370 0.93223 0.9395 0.94625 0.95274
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.55033 0.53632 0.53413 0.47762 0.45042 0.42163
## Proportion of Variance 0.00594 0.00564 0.00559 0.00447 0.00398 0.00349
## Cumulative Proportion 0.95868 0.96432 0.96991 0.97439 0.97837 0.98185
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.39860 0.37323 0.32642 0.30369 0.28084 0.25949
## Proportion of Variance 0.00312 0.00273 0.00209 0.00181 0.00155 0.00132
## Cumulative Proportion 0.98497 0.98770 0.98979 0.99160 0.99314 0.99446
## PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 0.24514 0.23105 0.17986 0.17410 0.16067 0.13137
## Proportion of Variance 0.00118 0.00105 0.00063 0.00059 0.00051 0.00034
## Cumulative Proportion 0.99564 0.99669 0.99732 0.99792 0.99842 0.99876
## PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.11746 0.10899 0.09484 0.08418 0.07666 0.06662
## Proportion of Variance 0.00027 0.00023 0.00018 0.00014 0.00012 0.00009
## Cumulative Proportion 0.99903 0.99926 0.99944 0.99958 0.99969 0.99978
## PC43 PC44 PC45 PC46 PC47 PC48
## Standard deviation 0.06287 0.04965 0.04220 0.03468 0.03022 0.02156
## Proportion of Variance 0.00008 0.00005 0.00003 0.00002 0.00002 0.00001
## Cumulative Proportion 0.99986 0.99991 0.99994 0.99997 0.99998 0.99999
## PC49 PC50 PC51
## Standard deviation 0.01338 0.0106 0.008771
## Proportion of Variance 0.00000 0.0000 0.000000
## Cumulative Proportion 1.00000 1.0000 1.000000
(15 points) Perform hierarchical clustering on the measles data with states as observations. (Remember to exclude year data and also don't forget to scale the data.) How many clusters do there appear to be? Try to interpret each of the clusters.
For this problem I created a data set called measles4 where I removed the variable year. I also used the function t() to transpose the data and get the states as observations.
# Removing the year
measles4 <- measles[-1]
year <- as.character(measles[ , 1])
row.names(measles4) <- year
# Transpose to get states as observations
measles4 <- t(measles4)
head(measles4[ , 1:6])
## 1928 1929 1930 1931 1932 1933
## Alabama 8843 2959 4156 8934 270 1735
## Alaska 0 0 0 0 0 0
## Arizona 847 236 2024 2135 86 1261
## Arkansas 8899 1245 994 849 99 5438
## California 3698 4024 43416 27807 12618 26551
## Colorado 2099 748 11781 4787 2376 297
Then I used the function scale() to scale and center the data.
# Data scaling
measles4 <- scale(measles4, scale = TRUE, center = TRUE)
head(sqrt(diag(var(measles4))))
## 1928 1929 1930 1931 1932 1933
## 1 1 1 1 1 1
We observe that the standard deviations of the variables are equal to one. Next, I computed the pairwise Euclidean distances between states and stored them in the distance object d.
d <- as.matrix(measles4)
d <- dist(d, method = "euclidean")
head(d)
## [1] 9.915484 8.993017 7.624698 29.325578 8.486998 8.880287
str(d)
## 'dist' num [1:1275] 9.92 8.99 7.62 29.33 8.49 ...
## - attr(*, "Size")= int 51
## - attr(*, "Labels")= chr [1:51] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## - attr(*, "Diag")= logi FALSE
## - attr(*, "Upper")= logi FALSE
## - attr(*, "method")= chr "euclidean"
## - attr(*, "call")= language dist(x = d, method = "euclidean")
Then I did hierarchical clustering using hclust() with different linkages. I started with Ward's method, which according to the help file of hclust() aims at finding compact, spherical clusters.
# Hierarchical clustering - Ward linkage
hc <- hclust(d, method = "ward.D2")
plot(hc, hang = -1, cex = 0.9)
# Hierarchical clustering - Single linkage
hc <- hclust(d, method = "single")
plot(hc, hang = -1, cex = 0.9)
# Hierarchical clustering - Average linkage
hc <- hclust(d, method = "average")
plot(hc, hang = -1, cex = 0.9)
# Hierarchical clustering - Complete linkage
hc <- hclust(d, method = "complete")
plot(hc, hang = -1, cex = 0.9)
I will do my analysis based on the complete linkage dendrogram. From this dendrogram there appear to be 4 big clusters.
In the middle part of the dendrogram we can see Arkansas, Oklahoma, Nebraska, and South Dakota. Next to these states we can find Georgia, Missouri, and Mississippi. Finally, Delaware, DC, and Maryland are also close to each other.
Within each of the groups I mentioned, the states are also geographically close, so the number of measles cases may be affected by similar factors. While these groups are not especially close to one another, I would say that they all belong to the same cluster; we can see this from the height at which the branches joining the groups occur.
To the left of the first cluster, we can find Arizona, Utah, New Mexico, Colorado, and Nevada. One may argue that this is because these states are next to each other, but in the same cluster we can also find Vermont, Alaska, and Hawaii, which are definitely not close to the continental southwest states.
To the right of the first cluster, we can find Indiana, Kentucky, Tennessee, and Alabama. These states are almost at the same longitude in the map, so it is reasonable to think that they may also have some similarities regarding public health issues like measles cases.
Finally, the rightmost cluster is not as tight as the clusters described above; we can see this from the height at which the states join. In this cluster we find California, Texas, New York, Pennsylvania, and Illinois. While these states may not be close to each other, they have one important similarity: all of them are populous. Probably the health system of a populous state has a particular complexity, not seen in smaller states, that affects measles cases in a similar fashion.
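As a sketch (not part of the original analysis), the four-cluster reading can be made concrete with cutree(), assuming hc still holds the complete-linkage fit from the last hclust() call above:
# Sketch: extract and inspect the four-cluster cut of the complete-linkage tree
clusters4 <- cutree(hc, k = 4)
table(clusters4)                      # cluster sizes
split(names(clusters4), clusters4)    # states in each cluster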
(15 points) Perform k-means clustering on the measles data with states as observations. (Remember to exclude year data and also don't forget to scale the data.) How many clusters did you choose? Try to interpret each of the clusters.
For this problem I created a data set called measles5 where I removed the variable year. I also used the function t() to transpose the data and get the states as observations.
# Removing the year
measles5 <- measles[-1]
year <- as.character(measles[ , 1])
row.names(measles5) <- year
# Transpose to get states as observations
measles5 <- t(measles5)
head(measles5[ , 1:5])
## 1928 1929 1930 1931 1932
## Alabama 8843 2959 4156 8934 270
## Alaska 0 0 0 0 0
## Arizona 847 236 2024 2135 86
## Arkansas 8899 1245 994 849 99
## California 3698 4024 43416 27807 12618
## Colorado 2099 748 11781 4787 2376
Then I used the function scale() to scale and center the data. It is important to notice here that in the original data measles, the number of measles cases for the year 2003 is zero for all 51 states.
Hence, when we compute the standard deviation of that column to scale the data, its value is zero. Since scaling divides by the standard deviation, we end up dividing by zero, which introduces NaN values.
If we try to do k-means clustering with the NaN values, the algorithm returns an error. For this reason, I will only scale columns 1 through 75. Column 76 of the transposed data is the one that has all the zeros (Column 76 in the transposed data corresponds to row 76 of the original data, which is the year 2003).
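A minimal illustration of why the constant 2003 column is a problem:
# Illustration: scaling a zero-variance column divides 0 by 0 and yields NaN
scale(c(0, 0, 0))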
# Data scaling
measles5 <- scale(measles5[ , 1:75], scale = TRUE, center = TRUE)
head(sqrt(diag(var(measles5))))
## 1928 1929 1930 1931 1932 1933
## 1 1 1 1 1 1
To decide how many clusters to use, I checked the reduction in the within-cluster sum of squares versus the number of clusters. Theoretically, the within sum of squares decreases as the number of clusters increases and is minimized in the extreme case where each observation is its own cluster. We do not want this extreme case; instead, we want the number of clusters beyond which additional clusters produce only small reductions in the within sum of squares.
# How many clusters?
ss <- rep(NA, 30)
for (k in 2:31){
  # ss[k - 1] stores the total within-cluster sum of squares for k clusters
  km <- kmeans(measles5, centers = k, iter.max = 100, nstart = 20)
  ss[k - 1] <- sum(km$withinss)
}
# Plot against the actual number of clusters so the x axis really shows k
plot(2:31, ss, pch = 19,
     main = "Within ss vs. k", xlab = "k", ylab = "within ss")
abline(v = 6, lwd = 2.5, col = "red")
From the plot we observe what we expected: the within sum of squares decreases as k increases. Most of the reduction is achieved by the time k is between 6 and 9, and using more than 10 clusters only produces a small additional reduction.
For my analysis I chose k = 6, which is shown in the plot by the vertical red line.
# k-means clustering
km <- kmeans(measles5, centers = 6, iter.max = 100, nstart = 20)
km
## K-means clustering with 6 clusters of sizes 30, 1, 11, 6, 1, 2
##
## Cluster means:
## 1928 1929 1930 1931 1932 1933
## 1 -0.38188209 -0.4008368 -0.4488463 -0.4007665 -0.4327591 -0.4056946
## 2 -0.35899164 -0.2390339 3.3168369 1.3022216 0.3809492 1.5332420
## 3 0.02371122 -0.2751456 -0.3155369 -0.1141081 -0.2482860 -0.2308294
## 4 0.78983788 1.6059364 1.4078735 0.7456490 1.4305318 0.7997594
## 5 -0.24556432 -0.2314738 -0.3019712 -0.4244041 -0.2365486 1.1464039
## 6 3.53058403 2.9432973 2.7370935 3.9632357 3.4931639 3.6158796
## 1934 1935 1936 1937 1938 1939
## 1 -0.4721339 -0.4566924 -0.2930762 -0.42580847 -0.44215898 -0.3793591847
## 2 0.6835942 0.7060573 3.6753117 -0.06777218 0.09050014 4.7466750590
## 3 0.1477332 -0.1161546 -0.2913312 -0.14008393 -0.11234559 0.0008941894
## 4 0.9337646 1.4136376 0.2699282 1.30908484 1.41400171 0.4533162125
## 5 0.9265346 -0.4918852 0.1913850 0.92439280 -0.37241905 -0.0852555816
## 6 2.6631177 3.1412379 3.2553311 2.80202388 3.14923975 1.9948113528
## 1940 1941 1942 1943 1944 1945
## 1 -0.4638159 -0.47881406 -0.3572925 -0.4920818 -0.5088010 -0.3946474
## 2 2.9114208 -0.09069016 5.7429148 1.2787223 3.3234955 5.7411002
## 3 -0.1185018 -0.10720800 -0.1735362 -0.2470419 -0.1675787 -0.1529536
## 4 1.0490170 1.35741950 0.3831397 1.5422826 1.3765133 0.5522322
## 5 1.4761882 0.22743355 1.7010606 0.3908719 2.0631080 1.9931402
## 6 2.2681435 3.63122466 1.4424301 3.2783121 1.7308562 1.2371388
## 1946 1947 1948 1949 1950 1951
## 1 -0.4746854 -0.51794899 -0.5188314 -0.4818805 -0.5203480 -0.4244273
## 2 2.4058353 0.68587815 2.8046781 1.5175913 1.0617771 3.6901582
## 3 -0.2891867 0.03262851 -0.3018154 -0.1237660 -0.1612982 -0.2658836
## 4 1.2491751 1.62442528 1.5553410 0.9051364 1.8528778 0.7545752
## 5 0.9783027 0.83874824 2.4769565 3.2275905 0.9329471 4.0178357
## 6 3.2712138 1.95418892 2.1356157 2.8209200 2.1363640 1.7110463
## 1952 1953 1954 1955 1956 1957
## 1 -0.5055191 -0.3573271 -0.54355172 -0.4639970 -0.540073855 -0.48274615
## 2 1.6522822 3.2930167 2.49984432 3.6198326 1.318146149 3.63134958
## 3 -0.1736312 -0.2351682 0.09433979 -0.2410232 0.003786399 0.00798216
## 4 1.5417717 0.7544880 0.70165429 1.4659056 1.125325907 0.81697930
## 5 0.7231519 4.7531494 3.33575247 1.8451843 3.567762820 2.96367127
## 6 2.7247256 0.3667854 2.61164570 1.1553570 2.261350420 1.44884210
## 1958 1959 1960 1961 1962 1963
## 1 -0.56550431 -0.5377302 -0.50358665 -0.585773047 -0.50736897 -0.4416372
## 2 1.19829142 3.4917346 1.13036103 2.881280618 1.44179765 1.3092892
## 3 0.04250469 0.1084605 -0.06802398 -0.002828525 0.08410545 -0.0472874
## 4 1.27727694 0.7462996 1.38935780 1.565760621 0.92106911 1.6232669
## 5 3.86709012 2.9356254 3.38794562 0.817611892 4.41488823 0.9866424
## 6 1.88426735 2.0168410 1.50070489 2.255424475 1.45640432 0.8668727
## 1964 1965 1966 1967 1968 1969
## 1 -0.6385620 -0.5080511 -0.47611446 -0.35174365 -0.38617898 -0.3414986
## 2 2.4522519 1.2805846 1.77605170 1.98772263 1.31246741 0.4127892
## 3 0.5856504 0.1312060 0.08400011 0.26451588 -0.05975611 -0.1274049
## 4 0.8096763 1.3511662 1.14826845 -0.01095615 0.38196709 0.1989347
## 5 3.1316553 3.5883129 3.35112513 5.87918029 5.14024884 4.3673928
## 6 1.1363704 0.4111854 0.67132256 -0.07926563 1.74908399 2.8363113
## 1970 1971 1972 1973 1974 1975
## 1 -0.42898764 -0.5206229 -0.44564911 -0.421361201 -0.3596566 -0.2567468
## 2 0.76859028 0.8972479 2.75206759 0.338595398 0.7244987 5.1448347
## 3 -0.03352422 0.3179388 0.08217737 -0.016454721 -0.2981747 -0.2763307
## 4 0.94632553 0.7283411 1.52614545 1.861734156 2.0007478 0.8131053
## 5 5.42401297 4.7443738 1.03685732 -0.003463835 -0.2459478 -0.2020789
## 6 0.68391969 1.0548462 -0.24013764 0.658150735 0.7932912 0.4603274
## 1976 1977 1978 1979 1980 1981
## 1 -0.35441018 -0.4245726 -0.3161875 -0.33562503 -0.2799136 -0.3090609
## 2 1.18641490 4.6037796 0.1031165 1.87421885 1.5238405 1.7918509
## 3 -0.06854667 0.1997748 0.1172335 -0.02300845 -0.2873292 -0.1507364
## 4 1.13568981 0.2385661 1.1127716 0.75064261 0.6702452 -0.1421940
## 5 -0.38251187 0.6051158 0.4509942 1.02959351 -0.1422422 5.3666846
## 6 1.88413839 1.9496822 0.4826589 1.45708793 3.0774807 2.3122773
## 1982 1983 1984 1985 1986 1987
## 1 -0.23643057 -0.3309832 -0.2939556 -0.2653891 -0.3356441 -0.2366752
## 2 6.46993869 2.0438288 2.6637051 2.4120998 1.5447718 5.1404078
## 3 -0.06492493 0.4645065 -0.2427378 -0.2319585 -0.1391298 -0.3379437
## 4 -0.14780014 0.4666048 0.5385883 0.4252572 0.8571121 0.1089514
## 5 1.19431781 0.2127103 4.9522925 4.2811755 1.2667817 2.5608530
## 6 0.51481784 -0.1181221 0.3206276 0.6341986 1.8227618 1.2313347
## 1988 1989 1990 1991 1992 1993
## 1 -0.30010024 -0.33826621 -0.23622856 -0.2720894 -0.20399212 -0.20396305
## 2 5.07354861 2.67274427 6.42497073 4.4824907 0.17265365 5.51251499
## 3 -0.07465684 -0.27901375 -0.20586887 -0.2780612 -0.01376932 -0.09321162
## 4 0.21988209 1.10799940 0.03549939 0.0689494 -0.15642068 0.31145710
## 5 -0.31304999 3.99540187 2.43206193 0.1360472 6.69178931 0.54573898
## 6 1.87222059 -0.04950244 0.14069269 3.0945612 0.17265365 -0.39138856
## 1994 1995 1996 1997 1998 1999
## 1 -0.15577497 -0.06634261 0.04019927 -0.02714325 -0.05648498 -0.1980753
## 2 1.66994402 4.03194879 1.69895318 1.47753668 0.94790887 4.5631867
## 3 -0.14314648 -0.15948560 -0.17774443 -0.38279487 -0.25615968 0.1689352
## 4 0.55091939 -0.11796837 -0.25020103 0.44432313 0.61679002 -0.2342826
## 5 0.08236229 1.05353934 0.84669142 -0.08733044 -0.37656654 1.5760832
## 6 0.59501889 -0.31652900 -0.14761396 0.48444793 0.12011174 -0.3248009
## 2000 2001 2002
## 1 -0.12238801 -0.1217445 -0.1708634
## 2 5.70875668 6.6300840 -0.2687739
## 3 -0.22691557 -0.1988553 0.5323120
## 4 0.16506278 -0.1896020 -0.2687739
## 5 -0.45090321 -0.2913884 0.7103311
## 6 -0.04025921 0.3193298 0.2207786
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 3 1 1 1 2
## Colorado Connecticut Delaware DC Florida
## 1 3 1 1 3
## Georgia Hawaii Idaho Illinois Indiana
## 1 1 1 4 3
## Iowa Kansas Kentucky Louisiana Maine
## 3 1 3 1 1
## Maryland Massachusetts Michigan Minnesota Mississippi
## 1 4 4 1 1
## Missouri Montana Nebraska Nevada NH
## 1 1 1 1 1
## NJ NM NY NC ND
## 4 1 6 3 1
## Ohio Oklahoma Oregon Pennsylvania RI
## 4 1 1 6 1
## SC SD Tennessee Texas Utah
## 1 1 3 5 1
## Vermont Virginia Washington WV Wisconsin
## 1 3 3 3 4
## Wyoming
## 1
##
## Within cluster sum of squares by cluster:
## [1] 355.2395 0.0000 281.9867 419.7744 0.0000 111.9421
## (between_SS / total_SS = 68.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
We observe some similarities with the clusters from hierarchical clustering:
The populous states are in their own clusters: see Texas (cluster 5), California (cluster 2), and Pennsylvania and New York (cluster 6). Maybe the health system of a populous state has a particular complexity, not seen in smaller states, that affects measles cases in a similar fashion.
Again we can find Indiana, Kentucky, Tennessee, and Alabama in the same cluster (see cluster 3).
In cluster 4 we can find Illinois and Michigan (which are close to each other), but here we also find Massachusetts.
Finally, in cluster 1 we find many states from different regions of the US, from the northeast (like Vermont, Maine, Rhode Island, and New Hampshire) to the southwest (like Colorado, Arizona, and New Mexico).
If we analyze each subgroup, maybe this is telling us that geographical proximity between states can explain some of the variation in measles cases.
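As a final hedged sketch (not part of the original analysis), the similarity between the two clusterings can be checked with a cross-tabulation, assuming hc still holds the complete-linkage fit and km the k = 6 k-means fit from above:
# Cross-tabulate the hierarchical four-cluster cut against the k-means assignment
table(hclust4 = cutree(hc, k = 4), kmeans6 = km$cluster)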