Final

This exam has 3 questions. It is due May 5, 2018 at 11:55pm. There are 100 points available for students in both 388/488. Students enrolled in 388 will be scored out of 80 points. Don’t cheat. Seriously, don’t cheat. Have a great summer. Try not to do anything stupid.

Problem 1

(10 points) What is the difference between parametric and non-parametric statistics (2-3 lines should suffice for an answer)?

In parametric statistics we assume that the variables follow probability distributions characterized by a finite set of parameters (for example, a normal distribution with mean \(\mu\) and variance \(\sigma^2\)), and inference is carried out on those parameters. Non-parametric statistics does not make these distributional assumptions about the data (although it may make other assumptions, such as independence).

Problem 2

(10 points) Give an example of a hypothesis test that can be tested using both parametric and non-parametric tests. State the null and alternative hypotheses and describe (or give the names of) the parametric and nonparametric tests used to test the hypothesis that you have given as an example. (HINT: Pick an example that is easy for yourself!)

For this problem I selected two statistical procedures that test whether three or more populations share the same location: one-way ANOVA (which compares means) and the Kruskal-Wallis test (which compares locations through treatment effects).

Parametric test: ANOVA

Null hypothesis
\(H_0: \mu_{1} = \mu_{2} = \ldots = \mu_{k}\) where \(\mu_{i}\) is the mean of group \(i\)

The null hypothesis states that the means of all the groups are equal.

Alternative hypothesis
\(H_1: \mu_{i} \neq \mu_{j}\) for some \(i, j\), where \(\mu_{i}\) is the mean of group \(i\)

The alternative states that there is at least a pair of groups with different means.

This test uses the \(F\) statistic which is computed as follows: \[F = \frac{MST}{MSE} \stackrel{H_0}{\sim} F_{k-1, n-k}\]

Where \(MST = \frac{SST}{k-1}\) and \(MSE = \frac{SSE}{n-k}\), with \(SST\) the treatment sum of squares, \(SSE\) the error sum of squares, \(k\) the number of groups, and \(n\) the total number of observations.
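
As a quick illustration (using R's built-in PlantGrowth data rather than the exam data), this parametric test can be run with aov():

# One-way ANOVA on the built-in PlantGrowth data (purely illustrative)
fit_aov <- aov(weight ~ group, data = PlantGrowth)
summary(fit_aov)   # the F value reported here is MST / MSE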

Non-parametric test: Kruskal-Wallis

Null hypothesis
\(H_0: \tau_{1} = \tau_{2} = \ldots = \tau_{k} = 0\) where \(\tau_{i}\) is the unknown treatment effect for the i-th population

Alternative hypothesis
\(H_a: \tau_{1}, \tau_{2}, \ldots, \tau_{k}\) are not all equal to 0

The hypotheses mention \(\tau\) (the unknown treatment effect for the i-th population), so I reviewed the assumptions of the test to remember what \(\tau\) is:

  1. The observations are independent

  2. For each group \(j \in \{1, \ldots, k\}\), the \(n_j\) observations are a random sample from a continuous distribution with distribution function \(F_j\)

  3. The distribution functions \(F_1, \ldots, F_k\) are connected through the relationship \[F_j(t) = F(t -\tau_j)\]

where \(j = 1, \ldots, k\), \(F\) is the distribution function of a continuous distribution with unknown median \(\theta\), and \(\tau_j\) is the unknown treatment effect for the j-th population. Under the third assumption, the \(k\) underlying populations can differ by at most a shift in location: they cannot differ in scale and must belong to the same general family.

It is important to note that the Kruskal-Wallis test statistic (H) is computed using the ranks and not the raw data. The formula for the test statistic is the following:

\[H = \frac{12}{N(N+1)} \sum^{k}_{j = 1} n_j \left(\bar{R}_{\cdot j}-\frac{N+1}{2}\right)^2\]

Where \(N\) is the total number of observations, \(k\) is the number of groups, \(\bar{R}_{\cdot j}\) is the mean of the ranks of the j-th group, and \(n_j\) is the number of observations in the j-th group. The term \(\frac{12}{N(N+1)}\) is a scaling factor; under \(H_0\), \(H\) is approximately \(\chi^2\)-distributed with \(k-1\) degrees of freedom, which is how the p-value is computed.
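
For comparison, the non-parametric counterpart can be run on the same illustrative PlantGrowth data with kruskal.test():

# Kruskal-Wallis rank sum test on the same illustrative data
kruskal.test(weight ~ group, data = PlantGrowth)   # reports the statistic and a chi-squared p-value with k - 1 df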

Problem 3

Answer the following questions with the mtcars dataset:

# Required libraries
library(ggplot2)

# Loading the data
df <- mtcars

3) A) Density estimator

(10 points) Using the mtcars data set that is pre-loaded in R, construct a kernel density estimator of the car weights in that data set and show the density in a figure. (hint: ggplot2 makes this very easy).

I produced the density estimation with ggplot() using the code below.

# Density estimation 
ggplot(data = df, aes(x = wt)) + geom_density() + 
  ggtitle("3) A) - Density estimation for cars weight") +
  scale_x_continuous("weight (1000 lbs)")
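
As a sanity check (not required by the problem), the same estimate can be produced with base R's density(); as far as I know, its default bandwidth rule (bw.nrd0) is also the default used by geom_density().

# Base-R kernel density estimate of car weight, for comparison
d_wt <- density(df$wt)   # Gaussian kernel, bandwidth chosen by bw.nrd0
d_wt$bw                  # the bandwidth actually used
plot(d_wt, main = "Kernel density estimate of wt (1000 lbs)")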

3) B) Loess regression

(10 points) Using the mtcars data set that is pre-loaded in R, fit a loess regression model with qsec as a predictor and mpg as the response variable. (hint: ggplot2 makes this very easy).

For this problem I also used ggplot(). My first approach was to use the default values for the parameters.

# Loess regression with span = 0.75 (default)  
ggplot(data = df, aes(x = qsec, y = mpg)) + geom_point() + 
  stat_smooth(method = "loess", span = 0.75) +
  ggtitle("3) B) - Loess Regression of mpg vs. qsec - Span = 0.75 (default)")

I repeated the loess regression but this time I changed the span to 0.35.

# Loess regression with span = 0.35  
ggplot(data = df, aes(x = qsec, y = mpg)) + geom_point() + 
  stat_smooth(method = "loess", span = 0.35) +
  ggtitle("3) B) - Loess Regression of mpg vs. qsec - Span = 0.35 ")

We observe that the first fit, with span = 0.75, produces a smoother curve. The second curve shows a sharp drop on the right-hand side, and the 95% confidence band widens rapidly in that region. If we used this curve to predict the mpg of an observation with a high qsec value, the prediction would likely have a large error.
For this reason, I chose the first curve, fitted with span = 0.75.
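
The same model can also be fit outside ggplot2 with loess(), which makes it easy to pull out predictions. A minimal sketch with the chosen span (the qsec values below are arbitrary illustrative inputs):

# Loess fit with span = 0.75, outside ggplot2
fit_loess <- loess(mpg ~ qsec, data = df, span = 0.75)
summary(fit_loess)
predict(fit_loess, newdata = data.frame(qsec = c(16, 18, 20)))   # predicted mpg at three qsec values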

Problem 4

Answer the following questions with the data sets in the dslabs package. We will focus on measles. (I did some data cleaning for you. Also, assume that the 0’s in the early years are correct.)

# Required libraries
library(dslabs)
library(reshape2)

# Loading the data
measles <- subset(us_contagious_diseases, disease == "Measles")
measles <- melt(measles, id = c("disease","state","year"), measure = "count")
measles <- dcast(measles, formula = year ~ state)

#Remove the space in some of the state names.
names(measles)[10]<-"DC"
names(measles)[31]<-"NH"
names(measles)[32]<-"NJ"
names(measles)[33]<-"NM"
names(measles)[34]<-"NY"
names(measles)[35]<-"NC"
names(measles)[36]<-"ND"
names(measles)[41]<-"RI"
names(measles)[42]<-"SC"
names(measles)[43]<-"SD"
names(measles)[50]<-"WV"

4) A) CART model

(10 points) Construct a CART model to predict number of measles cases in Illinois using the number of cases in other states as potential predictors. Display the best tree that you find and try to offer some interpretation.

I first removed the variable year and created a data set called measles2.

Then I fit the regression tree with the default values of the function tree(), without tuning any parameters, and called the model tree1. The idea of this initial tree is to see its general structure: how many branches and leaves it has, which variable produces the largest reduction in deviance, etc.

library(tree)
set.seed(4321)

# Removing the year
measles2 <- measles[-1]

# Fitting tree1
tree1 <- tree(Illinois ~ ., data = measles2)
tree1
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 76 2.002e+10 11500  
##   2) Indiana < 8771.5 62 2.595e+09  5334  
##     4) WV < 1676 40 1.273e+08  1206 *
##     5) WV > 1676 22 5.460e+08 12840 *
##   3) Indiana > 8771.5 14 4.620e+09 38810  
##     6) NC < 10117.5 9 6.207e+08 29070 *
##     7) NC > 10117.5 5 1.607e+09 56350 *
plot(tree1)
text(tree1)

From this tree we observe that the root node splits on the number of measles cases in Indiana. This is reasonable since the two states are next to each other, and one can argue that they have similar weather conditions, similar demographic characteristics, etc., so the factors that drive measles cases probably behave similarly in Illinois and Indiana. The number of measles cases in Indiana is the variable that produces the largest reduction in deviance.

The first binary split opens two branches: the left branch continues asking about the number of measles cases in WV (West Virginia), while the right branch continues asking about the number of measles cases in NC (North Carolina).

Given an observation for which we are trying to predict the measles cases in Illinois, the reasoning path in words is the following (a quick predict() check on this path is sketched after the list):

    1. We ask whether the number of measles cases in Indiana is less than 8771.5.
    2. If the answer to 1) is yes, we then ask whether the number of cases in WV is less than 1676. If it is, the predicted number of measles cases for Illinois is 1206; otherwise it is 12840.
    3. If the answer to 1) is no, we then ask whether the number of cases in NC is less than 10117.5. If it is, the predicted number of measles cases for Illinois is 29070; otherwise it is 56350.
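
This path can be checked numerically with predict(); here is a quick sketch that simply feeds the first three rows of measles2 back into the tree (predict() for a tree object expects the predictors named in the formula to be present in newdata):

# Predicted Illinois cases for the first three years, following the splits above
predict(tree1, newdata = measles2[1:3, ])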

In order to improve the tree, I performed stepwise variable selection with step().

# Variable selection
fitall <- lm(Illinois ~ ., data = measles2)
fitstart <- lm(Illinois ~ 1, data = measles2)
formula(lm_Illinois <- step(fitstart, scope = formula(fitall), direction = "both", trace = FALSE))
## Illinois ~ Pennsylvania + Indiana + SC + Michigan + RI + Iowa + 
##     Minnesota + Utah + Maine + Tennessee + Colorado + NJ + Kansas + 
##     Vermont + Kentucky + Arkansas + NY + Alaska + Maryland + 
##     California + Mississippi + NM + Virginia + Ohio + Nevada + 
##     Alabama + NC + Georgia + Nebraska + Massachusetts + NH + 
##     Wyoming + Washington + Delaware + Texas + Florida + Oklahoma + 
##     Missouri + WV + Hawaii

With the selected variables, I fitted a new tree.

set.seed(4321)

# Fitting tree2
tree2 <- tree(formula = formula(lm_Illinois) , data = measles2)
tree2
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 76 2.002e+10 11500  
##   2) Indiana < 8771.5 62 2.595e+09  5334  
##     4) WV < 1676 40 1.273e+08  1206 *
##     5) WV > 1676 22 5.460e+08 12840 *
##   3) Indiana > 8771.5 14 4.620e+09 38810  
##     6) Pennsylvania < 70039 9 6.207e+08 29070 *
##     7) Pennsylvania > 70039 5 1.607e+09 56350 *
plot(tree2)
text(tree2)

The interpretation is similar, but this time instead of NC we have Pennsylvania.
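
A different way to tune the tree, not required here, is cost-complexity pruning with cross-validation. A sketch, assuming the tree package is still loaded and that the best size is read off the deviance plot:

# Cross-validated cost-complexity pruning of the original tree
set.seed(4321)
cv1 <- cv.tree(tree1)                        # 10-fold CV deviance for each subtree size
plot(cv1$size, cv1$dev, type = "b",
     xlab = "Tree size (number of leaves)", ylab = "CV deviance")
tree1_pruned <- prune.tree(tree1, best = 4)  # "4" is an assumed choice; pick the size with the lowest CV deviance
plot(tree1_pruned)
text(tree1_pruned)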

4) B) Pearson correlation, bootstrap estimation, and bias

(10 points) Compute the Pearson correlation coefficient between the Illinois and Massachusetts measles cases data. Compute the standard error of this estimator using the bootstrap procedure. Also, estimate the bias of this estimator.

First I checked for missing values in the data with the following code:

# Checking for missing values
Illinois <- measles[ , "Illinois"]
Massachusetts <- measles[ , "Massachusetts"]
sum(is.na(Illinois))
## [1] 0
sum(is.na(Massachusetts))
## [1] 0

There are no missing values for Illinois and Massachusetts. Then I used the bootstrap procedure to estimate the correlation coefficient, the standard error, and the bias of the estimator.

# Theta hat for the Pearson correlation coefficient
theta_hat_cor <- cor(Illinois, Massachusetts, method = "pearson")
theta_hat_cor
## [1] 0.4809065
# Simulation parameters
set.seed(4321)
n <- length(Illinois)
nsim <- 1000

# Theta bootstrap for the Pearson correlation coefficient
theta_boots_cor <- rep(NA, nsim)            # Preallocation                       

for (i in 1:nsim){
  boots_sample <- measles[sample(1:n, n, replace = TRUE), ]
  temp <- cor(boots_sample$Illinois, boots_sample$Massachusetts, method = "pearson")
  theta_boots_cor[i] <- temp                # Correlation coefficients of the bootstraps samples
}

The standard error of the estimate of the Pearson correlation coefficient using the bootstrap procedure is the following:

# Bootstrap estimate of standard error for the Pearson correlation coefficient
boots_se <- sd(theta_boots_cor)
boots_se
## [1] 0.1059561

The bias of the Pearson correlation coefficient estimated using the bootstrap procedure is:

# Bootstrap estimate of bias for the Pearson correlation coefficient
boots_bias <- mean(theta_boots_cor) - theta_hat_cor
boots_bias
## [1] 0.01557448

Even though it is not asked for in the directions, I also calculated the bootstrap estimates of the MSE and the variance. Since the bias calculated above is small, the MSE and the variance should be similar (MSE ≈ bias² + variance).

# Bootstrap estimate of MSE for the Pearson correlation coefficient
boots_MSE <- mean((theta_boots_cor - theta_hat_cor)^2)
boots_MSE
## [1] 0.01145803
# Bootstrap estimate of variance for the Pearson correlation coefficient
boots_var <- var(theta_boots_cor)
boots_var
## [1] 0.01122669
# Checking
boots_bias^2 + boots_var
## [1] 0.01146925
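
The same quantities can be cross-checked with the boot package; a sketch, where the helper cor_stat is defined only for this check:

# Cross-check with the boot package
library(boot)
set.seed(4321)
cor_stat <- function(data, idx) {
  cor(data$Illinois[idx], data$Massachusetts[idx], method = "pearson")
}
b <- boot(data = measles, statistic = cor_stat, R = 1000)
b                    # prints the original estimate, the bootstrap bias, and the std. error
sd(b$t)              # bootstrap standard error
mean(b$t) - b$t0     # bootstrap estimate of bias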

4) C) PCA

(10 points) Perform PCA on the measles data with states as variables. (Remember to exclude year as a predictor and also don’t forget to scale the data before doing PCA.) How many components are needed to account for 85% of the variability? Also, try to interpret the first two principal components.

For this problem I will use the data set measles3 where I removed the variable year.

# Removing the year
measles3 <- measles[-1]

# Data scaling and PCA
measles3 <- scale(measles3, scale = TRUE, center = TRUE)
s <- cor(scale(measles3, scale = TRUE, center = TRUE), method = "pearson")
e <- eigen(s)
str(e)
## List of 2
##  $ values : num [1:51] 25.72 5.71 2.99 2.43 2.24 ...
##  $ vectors: num [1:51, 1:51] -0.1608 -0.0827 -0.1287 -0.1383 -0.149 ...
##  - attr(*, "class")= chr "eigen"

The list e stores the eigenvalues and the eigenvectors; I will use them later by calling e$values and e$vectors, respectively. It is important to notice that e$values stores the eigenvalues in decreasing order: e$values[1] is the largest eigenvalue, e$values[2] the second largest, and so on.

# Eigenvectors associated to the two largest eigenvalues
head(e$vectors[ , 1:2])
##             [,1]        [,2]
## [1,] -0.16077162 -0.04273130
## [2,] -0.08266421  0.30002422
## [3,] -0.12868551  0.24769098
## [4,] -0.13829837 -0.08250811
## [5,] -0.14897156  0.06196241
## [6,] -0.16622331 -0.01145588

Here I printed the first six entries of the first two eigenvectors. The objective is just to keep a record: later I will use the function prcomp(), which returns a list with many objects (including the eigenvectors), and I want to check whether prcomp() returns the same values we obtain here.

Account for 85% of the variability

The proportion of the total variability explained by a single principal component is given by the eigenvalue associated with that principal component divided by the sum of all eigenvalues:

\[\text{Proportion PC}_{i} = \frac{\lambda_{i}}{\sum_{j = 1}^{p}{\lambda_{j}}}\]

where \(p\) is the number of variables (states) in the original data.

The following for loop is to determine how many principal components account for 85% of the total variability:

# Account for 85% of the variability
cummulative_prop <- 0
counter <- 0

for (i in 1:length(e$values)) {
  if (cummulative_prop < 0.85) {
    cummulative_prop <- cummulative_prop + (e$values[i] / sum(e$values))
    counter <- counter + 1
  }
}
counter
## [1] 9

I created two variables, cummulative_prop and counter, both initialized at zero. The for loop goes through every component of the vector e$values, and the if statement checks whether cummulative_prop is still less than 0.85. If it is, cummulative_prop is updated and counter is increased by one.

When counter reaches nine, cummulative_prop exceeds 0.85, the condition of the if statement is no longer met, and the values of cummulative_prop and counter are not updated anymore.

We need 9 principal components to account for 85% of the variability.
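
The same count can be obtained without a loop by taking the cumulative sum of the eigenvalues:

# Number of components needed to reach 85% of the variance, without a loop
min(which(cumsum(e$values) / sum(e$values) >= 0.85))   # returns 9, matching counter above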

Interpretation of the first two PC

For this section it is important to recall that the principal components form a new coordinate system, in which the principal components are the new orthogonal axes. By selecting only two principal components, PC1 and PC2, we obtain a plane onto which we can project the original data.

To interpret the first two PCs I will plot the data points in the new coordinate system. To do so I need to change coordinates from the “old” system to the “new” one. That change is accomplished by multiplying the scaled data matrix by the matrix whose columns are the (unit-norm) eigenvectors associated with the two largest eigenvalues.

# Data coordinates in the "new" system
out <- as.matrix(measles3)%*%e$vectors[ , 1:2]
head(out)
##           [,1]       [,2]
## [1,] -4.327350 -5.1326092
## [2,] -1.340115 -2.0964258
## [3,] -2.449262 -2.1443394
## [4,] -3.103270 -3.4145583
## [5,] -1.576672 -0.8715173
## [6,] -1.289000 -2.2692750

To use ggplot() I need a data frame:

# Preparing the data to use ggplot
year <- as.character(measles[ , 1])
row.names(out) <- year
colnames(out) <- c("PC1", "PC2")
out_df <- data.frame(out)
head(out_df)
##            PC1        PC2
## 1928 -4.327350 -5.1326092
## 1929 -1.340115 -2.0964258
## 1930 -2.449262 -2.1443394
## 1931 -3.103270 -3.4145583
## 1932 -1.576672 -0.8715173
## 1933 -1.289000 -2.2692750
# Plot with ggplot
ggplot(data = out_df, aes(x = PC1, y = PC2, label = rownames(out_df))) +
  geom_point(position = "jitter") +
  geom_text() +
  ggtitle("4) C) - PCA of measles data") +
  scale_x_continuous("PC1 (50.44% explained variability)") +
  scale_y_continuous("PC2 (11.20% explained variability)")

From the plot we can see that each point corresponds to a single year. Most of the variability (50.44%) is explained along the X axis, which is reasonable since that axis corresponds to PC1. Only 11.20% of the variability is explained along the Y axis. These percentages come from the calculations that can be found below.

In the middle of the plot are years from the 1940s and 1950s. Moving to the right, we find years from the 1960s. In the rightmost part of the plot many points sit close together; these correspond to the 1970s onward. Improvements in vaccines, public health policies, etc. probably led to a continuous reduction in measles cases, which might be why the (centered) variability along the Y axis is almost zero for the 1970-2003 period.

The 1930s deserve special attention. Several years of this decade appear in the left part of the plot, and most of them are in the lower part. I subsetted that decade for some states to look more closely at what is happening.

measles[7:12, c("year", "California", "Texas", "NY", "Florida")]
##    year California Texas    NY Florida
## 7  1934      25650 29699 34208    8015
## 8  1935      28799  3923 69073    1174
## 9  1936      49050  7776 64366     302
## 10 1937       5107 14694 27404     394
## 11 1938      18527  6011 70111    9175
## 12 1939      63748  7380 44812    2895

Many questions arise from here:
* Is it possible that California reduced its measles cases from 49050 in 1936 to 5107 in 1937?
* Is it possible that measles cases in Florida jumped from 394 in 1937 to 9175 in 1938?

Maybe there was a change in the statistical procedures in those years, or a change in the medical definition of what counts as measles, etc. In conclusion, data from this decade deserve a more in-depth analysis.

Using prcomp()

I did the same work using the function prcomp() from the stats package. The results are equivalent (up to the sign of PC2, discussed below), and I found this function useful because it returns a list of class “prcomp” that contains several objects.

One of them is x, which contains the rotated data (the centered, and scaled if requested, data multiplied by the rotation matrix). These are the same values stored in the matrix out from the previous procedure, except that the sign of the PC2 column is flipped; eigenvectors are only determined up to sign, so this does not change the interpretation.

# PCA with prcomp()
measles2_pca <- prcomp(measles2, retx = TRUE, scale. = TRUE, center = TRUE)

# Object x: rotated data
head(measles2_pca$x[ , 1:2])
##            PC1       PC2
## [1,] -4.327350 5.1326092
## [2,] -1.340115 2.0964258
## [3,] -2.449262 2.1443394
## [4,] -3.103270 3.4145583
## [5,] -1.576672 0.8715173
## [6,] -1.289000 2.2692750

Another object is rotation, which contains the eigenvectors. These values are the same as the ones in e$vectors from the previous procedure, again up to the flipped sign of the PC2 column.

# Object rotation: matrix that contains the eigenvectors
head(measles2_pca$rotation[ , 1:2])
##                    PC1         PC2
## Alabama    -0.16077162  0.04273130
## Alaska     -0.08266421 -0.30002422
## Arizona    -0.12868551 -0.24769098
## Arkansas   -0.13829837  0.08250811
## California -0.14897156 -0.06196241
## Colorado   -0.16622331  0.01145588

Furthermore, when we use plot() with the class “prcomp” object, we can easily get the scree plot.

# Scree plot
plot(measles2_pca, pch = 19, type = "l", 
     main = "Scree plot for measles data")

Finally, if we do summary() of the “prcomp” class object we get the proportion of variance and the cumulative proportion explained by each principal component. From here we can see that nine principal components are required to account for 85% of the variability.

summary(measles2_pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     5.0717 2.3906 1.73002 1.55793 1.49685 1.21572
## Proportion of Variance 0.5044 0.1120 0.05869 0.04759 0.04393 0.02898
## Cumulative Proportion  0.5044 0.6164 0.67510 0.72269 0.76662 0.79560
##                            PC7     PC8    PC9    PC10    PC11    PC12
## Standard deviation     1.08633 1.06723 1.0023 0.91873 0.81014 0.80742
## Proportion of Variance 0.02314 0.02233 0.0197 0.01655 0.01287 0.01278
## Cumulative Proportion  0.81874 0.84107 0.8608 0.87732 0.89019 0.90297
##                           PC13    PC14    PC15   PC16    PC17    PC18
## Standard deviation     0.74147 0.71233 0.65939 0.6101 0.58548 0.57551
## Proportion of Variance 0.01078 0.00995 0.00853 0.0073 0.00672 0.00649
## Cumulative Proportion  0.91375 0.92370 0.93223 0.9395 0.94625 0.95274
##                           PC19    PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.55033 0.53632 0.53413 0.47762 0.45042 0.42163
## Proportion of Variance 0.00594 0.00564 0.00559 0.00447 0.00398 0.00349
## Cumulative Proportion  0.95868 0.96432 0.96991 0.97439 0.97837 0.98185
##                           PC25    PC26    PC27    PC28    PC29    PC30
## Standard deviation     0.39860 0.37323 0.32642 0.30369 0.28084 0.25949
## Proportion of Variance 0.00312 0.00273 0.00209 0.00181 0.00155 0.00132
## Cumulative Proportion  0.98497 0.98770 0.98979 0.99160 0.99314 0.99446
##                           PC31    PC32    PC33    PC34    PC35    PC36
## Standard deviation     0.24514 0.23105 0.17986 0.17410 0.16067 0.13137
## Proportion of Variance 0.00118 0.00105 0.00063 0.00059 0.00051 0.00034
## Cumulative Proportion  0.99564 0.99669 0.99732 0.99792 0.99842 0.99876
##                           PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.11746 0.10899 0.09484 0.08418 0.07666 0.06662
## Proportion of Variance 0.00027 0.00023 0.00018 0.00014 0.00012 0.00009
## Cumulative Proportion  0.99903 0.99926 0.99944 0.99958 0.99969 0.99978
##                           PC43    PC44    PC45    PC46    PC47    PC48
## Standard deviation     0.06287 0.04965 0.04220 0.03468 0.03022 0.02156
## Proportion of Variance 0.00008 0.00005 0.00003 0.00002 0.00002 0.00001
## Cumulative Proportion  0.99986 0.99991 0.99994 0.99997 0.99998 0.99999
##                           PC49   PC50     PC51
## Standard deviation     0.01338 0.0106 0.008771
## Proportion of Variance 0.00000 0.0000 0.000000
## Cumulative Proportion  1.00000 1.0000 1.000000

4) D) Hierarchical clustering

(15 points) Perform hierarchical clustering on the measles data with states as observations. (Remember to exclude year data and also don’t forget to scale the data.) How many clusters do there appear to be? Try to interpret each of the clusters.

For this problem I created a data set called measles4 where I removed the variable year. I also used the function t() to transpose the data and get the states as observations.

# Removing the year
measles4 <- measles[-1]
year <- as.character(measles[ , 1])
row.names(measles4) <- year

# Transpose to get states as observations
measles4 <- t(measles4)
head(measles4[ , 1:6])
##            1928 1929  1930  1931  1932  1933
## Alabama    8843 2959  4156  8934   270  1735
## Alaska        0    0     0     0     0     0
## Arizona     847  236  2024  2135    86  1261
## Arkansas   8899 1245   994   849    99  5438
## California 3698 4024 43416 27807 12618 26551
## Colorado   2099  748 11781  4787  2376   297

Then I used the function scale() to scale and center the data.

# Data scaling 
measles4 <- scale(measles4, scale = TRUE, center = TRUE)
head(sqrt(diag(var(measles4))))
## 1928 1929 1930 1931 1932 1933 
##    1    1    1    1    1    1

We observe that the standard deviations of all the variables are equal to one. Next, I computed the Euclidean distances and stored them in the object d returned by dist().

d <- as.matrix(measles4)
d <- dist(d, method = "euclidean")
head(d)
## [1]  9.915484  8.993017  7.624698 29.325578  8.486998  8.880287
str(d)
##  'dist' num [1:1275] 9.92 8.99 7.62 29.33 8.49 ...
##  - attr(*, "Size")= int 51
##  - attr(*, "Labels")= chr [1:51] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  - attr(*, "Diag")= logi FALSE
##  - attr(*, "Upper")= logi FALSE
##  - attr(*, "method")= chr "euclidean"
##  - attr(*, "call")= language dist(x = d, method = "euclidean")

Then I did hierarchical clustering using hclust() with different linkages. I started with Ward’s method, which according to the help file of hclust() aims at finding compact, spherical clusters.

# Hierarchical clustering - Ward linkage
hc <- hclust(d, method = "ward.D2")
plot(hc, hang = -1, cex = 0.9)

# Hierarchical clustering - Single linkage
hc <- hclust(d, method = "single")
plot(hc, hang = -1, cex = 0.9)

# Hierarchical clustering - Average linkage
hc <- hclust(d, method = "average")
plot(hc, hang = -1, cex = 0.9)

# Hierarchical clustering - Complete linkage
hc <- hclust(d, method = "complete")
plot(hc, hang = -1, cex = 0.9)

I will base my analysis on the complete-linkage dendrogram. From this dendrogram there appear to be 4 big clusters (a cutree() sketch after the list below makes this split explicit).

  1. In the middle part of the dendrogram we can see Arkansas, Oklahoma, Nebraska, and South Dakota. Next to these states we find Georgia, Missouri, and Mississippi. Finally, Delaware, DC, and Maryland are also close to each other.
    Within each of these groups the states are also geographically close, so the number of measles cases may be affected by similar factors. While the groups may not be especially close to one another, I would say they all belong to the same cluster; we can see this from the height at which the branches joining the groups merge.

  2. To the left of the first cluster we find Arizona, Utah, New Mexico, Colorado, and Nevada. One may argue that this is because these states are next to each other, but in the same cluster we also find Vermont, Alaska, and Hawaii, which are definitely not close to the continental southwestern states.

  3. To the right of the first cluster, we can find Indiana, Kentucky, Tennessee, and Alabama. These states are almost at the same longitude in the map, so it is reasonable to think that they may also have some similarities regarding public health issues like measles cases.

  4. Finally, the rightmost cluster is not as internally similar as the clusters described above; we can see this from the height at which its states join. In this cluster we find California, Texas, New York, Pennsylvania, and Illinois. While these states may not be close to each other, they share one important feature: all of them are populous. The health system of a populous state probably has a particular complexity, not seen in smaller states, that affects measles cases in a similar fashion.
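
To make the four-cluster reading explicit, cutree() lists the members of each cluster and rect.hclust() boxes them on the dendrogram. A sketch, assuming hc still holds the complete-linkage fit from above:

# Cut the complete-linkage dendrogram into 4 clusters and list the members
groups <- cutree(hc, k = 4)
split(names(groups), groups)

# Redraw the dendrogram with the 4 clusters boxed
plot(hc, hang = -1, cex = 0.9)
rect.hclust(hc, k = 4, border = "red")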

4) E) K-means clustering

(15 points) Perform k-means clustering on the measles data with states as observations. (Remember to exclude year data and also don’t forget to scale the data.) How many clusters did you choose? Try to interpret each of the clusters.

For this problem I created a data set called measles5 where I removed the variable year. I also used the function t() to transpose the data and get the states as observations.

# Removing the year
measles5 <- measles[-1]
year <- as.character(measles[ , 1])
row.names(measles5) <- year

# Transpose to get states as observations
measles5 <- t(measles5)
head(measles5[ , 1:5])
##            1928 1929  1930  1931  1932
## Alabama    8843 2959  4156  8934   270
## Alaska        0    0     0     0     0
## Arizona     847  236  2024  2135    86
## Arkansas   8899 1245   994   849    99
## California 3698 4024 43416 27807 12618
## Colorado   2099  748 11781  4787  2376

Then I used the function scale() to scale and center the data. It is important to notice that in the original data measles, the number of measles cases for the year 2003 is zero for all 51 states.
Hence, the standard deviation of that column is zero; since scaling divides by the standard deviation, this introduces NaN values.
If we try to run k-means clustering with NaN values, the algorithm returns an error. For this reason, I only scale columns 1 through 75. Column 76 of the transposed data is the one with all the zeros (it corresponds to row 76 of the original data, the year 2003).

# Data scaling 
measles5 <- scale(measles5[ , 1:75], scale = TRUE, center = TRUE)
head(sqrt(diag(var(measles5))))
## 1928 1929 1930 1931 1932 1933 
##    1    1    1    1    1    1
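
Hard-coding columns 1 through 75 works here, but a more general approach is to detect and drop zero-variance columns programmatically before scaling. A sketch (m5_raw and col_sd are names introduced only for this illustration):

# Rebuild the unscaled transposed matrix and flag columns (years) with zero variance
m5_raw <- t(measles[-1])
colnames(m5_raw) <- as.character(measles$year)
col_sd <- apply(m5_raw, 2, sd)
names(which(col_sd == 0))                    # flags 2003, the all-zero year
# measles5 <- scale(m5_raw[ , col_sd > 0])   # equivalent to scaling columns 1:75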

To decide how many clusters to use, I checked the reduction in the within-cluster sum of squares versus the number of clusters. In theory, the within-cluster sum of squares decreases as the number of clusters increases and is minimized in the extreme case where each observation is its own cluster.
We do not want this extreme case. Instead, we want to choose the number of clusters beyond which adding clusters no longer produces a large reduction in the within-cluster sum of squares.

# How many clusters?
ss <- rep(NA, 30)
for (k in 2:31) {
  km <- kmeans(measles5, centers = k, iter.max = 100, nstart = 20)
  ss[k - 1] <- sum(km$withinss)
}

plot(2:31, ss, pch = 19,
     main = "Within ss vs. k", xlab = "k", ylab = "Total within-cluster ss")
abline(v = 6, lwd = 2.5, col = "red")

From the plot we observe what we expected: the within-cluster sum of squares decreases as k increases. The largest reductions are obtained for k between 6 and 9; beyond 10 clusters the additional reduction is small.
For my analysis I chose k = 6, which is marked in the plot by the vertical red line.

# k-means clustering
km <- kmeans(measles5, centers = 6, iter.max = 100, nstart = 20)
km
## K-means clustering with 6 clusters of sizes 30, 1, 11, 6, 1, 2
## 
## Cluster means:
##          1928       1929       1930       1931       1932       1933
## 1 -0.38188209 -0.4008368 -0.4488463 -0.4007665 -0.4327591 -0.4056946
## 2 -0.35899164 -0.2390339  3.3168369  1.3022216  0.3809492  1.5332420
## 3  0.02371122 -0.2751456 -0.3155369 -0.1141081 -0.2482860 -0.2308294
## 4  0.78983788  1.6059364  1.4078735  0.7456490  1.4305318  0.7997594
## 5 -0.24556432 -0.2314738 -0.3019712 -0.4244041 -0.2365486  1.1464039
## 6  3.53058403  2.9432973  2.7370935  3.9632357  3.4931639  3.6158796
##         1934       1935       1936        1937        1938          1939
## 1 -0.4721339 -0.4566924 -0.2930762 -0.42580847 -0.44215898 -0.3793591847
## 2  0.6835942  0.7060573  3.6753117 -0.06777218  0.09050014  4.7466750590
## 3  0.1477332 -0.1161546 -0.2913312 -0.14008393 -0.11234559  0.0008941894
## 4  0.9337646  1.4136376  0.2699282  1.30908484  1.41400171  0.4533162125
## 5  0.9265346 -0.4918852  0.1913850  0.92439280 -0.37241905 -0.0852555816
## 6  2.6631177  3.1412379  3.2553311  2.80202388  3.14923975  1.9948113528
##         1940        1941       1942       1943       1944       1945
## 1 -0.4638159 -0.47881406 -0.3572925 -0.4920818 -0.5088010 -0.3946474
## 2  2.9114208 -0.09069016  5.7429148  1.2787223  3.3234955  5.7411002
## 3 -0.1185018 -0.10720800 -0.1735362 -0.2470419 -0.1675787 -0.1529536
## 4  1.0490170  1.35741950  0.3831397  1.5422826  1.3765133  0.5522322
## 5  1.4761882  0.22743355  1.7010606  0.3908719  2.0631080  1.9931402
## 6  2.2681435  3.63122466  1.4424301  3.2783121  1.7308562  1.2371388
##         1946        1947       1948       1949       1950       1951
## 1 -0.4746854 -0.51794899 -0.5188314 -0.4818805 -0.5203480 -0.4244273
## 2  2.4058353  0.68587815  2.8046781  1.5175913  1.0617771  3.6901582
## 3 -0.2891867  0.03262851 -0.3018154 -0.1237660 -0.1612982 -0.2658836
## 4  1.2491751  1.62442528  1.5553410  0.9051364  1.8528778  0.7545752
## 5  0.9783027  0.83874824  2.4769565  3.2275905  0.9329471  4.0178357
## 6  3.2712138  1.95418892  2.1356157  2.8209200  2.1363640  1.7110463
##         1952       1953        1954       1955         1956        1957
## 1 -0.5055191 -0.3573271 -0.54355172 -0.4639970 -0.540073855 -0.48274615
## 2  1.6522822  3.2930167  2.49984432  3.6198326  1.318146149  3.63134958
## 3 -0.1736312 -0.2351682  0.09433979 -0.2410232  0.003786399  0.00798216
## 4  1.5417717  0.7544880  0.70165429  1.4659056  1.125325907  0.81697930
## 5  0.7231519  4.7531494  3.33575247  1.8451843  3.567762820  2.96367127
## 6  2.7247256  0.3667854  2.61164570  1.1553570  2.261350420  1.44884210
##          1958       1959        1960         1961        1962       1963
## 1 -0.56550431 -0.5377302 -0.50358665 -0.585773047 -0.50736897 -0.4416372
## 2  1.19829142  3.4917346  1.13036103  2.881280618  1.44179765  1.3092892
## 3  0.04250469  0.1084605 -0.06802398 -0.002828525  0.08410545 -0.0472874
## 4  1.27727694  0.7462996  1.38935780  1.565760621  0.92106911  1.6232669
## 5  3.86709012  2.9356254  3.38794562  0.817611892  4.41488823  0.9866424
## 6  1.88426735  2.0168410  1.50070489  2.255424475  1.45640432  0.8668727
##         1964       1965        1966        1967        1968       1969
## 1 -0.6385620 -0.5080511 -0.47611446 -0.35174365 -0.38617898 -0.3414986
## 2  2.4522519  1.2805846  1.77605170  1.98772263  1.31246741  0.4127892
## 3  0.5856504  0.1312060  0.08400011  0.26451588 -0.05975611 -0.1274049
## 4  0.8096763  1.3511662  1.14826845 -0.01095615  0.38196709  0.1989347
## 5  3.1316553  3.5883129  3.35112513  5.87918029  5.14024884  4.3673928
## 6  1.1363704  0.4111854  0.67132256 -0.07926563  1.74908399  2.8363113
##          1970       1971        1972         1973       1974       1975
## 1 -0.42898764 -0.5206229 -0.44564911 -0.421361201 -0.3596566 -0.2567468
## 2  0.76859028  0.8972479  2.75206759  0.338595398  0.7244987  5.1448347
## 3 -0.03352422  0.3179388  0.08217737 -0.016454721 -0.2981747 -0.2763307
## 4  0.94632553  0.7283411  1.52614545  1.861734156  2.0007478  0.8131053
## 5  5.42401297  4.7443738  1.03685732 -0.003463835 -0.2459478 -0.2020789
## 6  0.68391969  1.0548462 -0.24013764  0.658150735  0.7932912  0.4603274
##          1976       1977       1978        1979       1980       1981
## 1 -0.35441018 -0.4245726 -0.3161875 -0.33562503 -0.2799136 -0.3090609
## 2  1.18641490  4.6037796  0.1031165  1.87421885  1.5238405  1.7918509
## 3 -0.06854667  0.1997748  0.1172335 -0.02300845 -0.2873292 -0.1507364
## 4  1.13568981  0.2385661  1.1127716  0.75064261  0.6702452 -0.1421940
## 5 -0.38251187  0.6051158  0.4509942  1.02959351 -0.1422422  5.3666846
## 6  1.88413839  1.9496822  0.4826589  1.45708793  3.0774807  2.3122773
##          1982       1983       1984       1985       1986       1987
## 1 -0.23643057 -0.3309832 -0.2939556 -0.2653891 -0.3356441 -0.2366752
## 2  6.46993869  2.0438288  2.6637051  2.4120998  1.5447718  5.1404078
## 3 -0.06492493  0.4645065 -0.2427378 -0.2319585 -0.1391298 -0.3379437
## 4 -0.14780014  0.4666048  0.5385883  0.4252572  0.8571121  0.1089514
## 5  1.19431781  0.2127103  4.9522925  4.2811755  1.2667817  2.5608530
## 6  0.51481784 -0.1181221  0.3206276  0.6341986  1.8227618  1.2313347
##          1988        1989        1990       1991        1992        1993
## 1 -0.30010024 -0.33826621 -0.23622856 -0.2720894 -0.20399212 -0.20396305
## 2  5.07354861  2.67274427  6.42497073  4.4824907  0.17265365  5.51251499
## 3 -0.07465684 -0.27901375 -0.20586887 -0.2780612 -0.01376932 -0.09321162
## 4  0.21988209  1.10799940  0.03549939  0.0689494 -0.15642068  0.31145710
## 5 -0.31304999  3.99540187  2.43206193  0.1360472  6.69178931  0.54573898
## 6  1.87222059 -0.04950244  0.14069269  3.0945612  0.17265365 -0.39138856
##          1994        1995        1996        1997        1998       1999
## 1 -0.15577497 -0.06634261  0.04019927 -0.02714325 -0.05648498 -0.1980753
## 2  1.66994402  4.03194879  1.69895318  1.47753668  0.94790887  4.5631867
## 3 -0.14314648 -0.15948560 -0.17774443 -0.38279487 -0.25615968  0.1689352
## 4  0.55091939 -0.11796837 -0.25020103  0.44432313  0.61679002 -0.2342826
## 5  0.08236229  1.05353934  0.84669142 -0.08733044 -0.37656654  1.5760832
## 6  0.59501889 -0.31652900 -0.14761396  0.48444793  0.12011174 -0.3248009
##          2000       2001       2002
## 1 -0.12238801 -0.1217445 -0.1708634
## 2  5.70875668  6.6300840 -0.2687739
## 3 -0.22691557 -0.1988553  0.5323120
## 4  0.16506278 -0.1896020 -0.2687739
## 5 -0.45090321 -0.2913884  0.7103311
## 6 -0.04025921  0.3193298  0.2207786
## 
## Clustering vector:
##       Alabama        Alaska       Arizona      Arkansas    California 
##             3             1             1             1             2 
##      Colorado   Connecticut      Delaware            DC       Florida 
##             1             3             1             1             3 
##       Georgia        Hawaii         Idaho      Illinois       Indiana 
##             1             1             1             4             3 
##          Iowa        Kansas      Kentucky     Louisiana         Maine 
##             3             1             3             1             1 
##      Maryland Massachusetts      Michigan     Minnesota   Mississippi 
##             1             4             4             1             1 
##      Missouri       Montana      Nebraska        Nevada            NH 
##             1             1             1             1             1 
##            NJ            NM            NY            NC            ND 
##             4             1             6             3             1 
##          Ohio      Oklahoma        Oregon  Pennsylvania            RI 
##             4             1             1             6             1 
##            SC            SD     Tennessee         Texas          Utah 
##             1             1             3             5             1 
##       Vermont      Virginia    Washington            WV     Wisconsin 
##             1             3             3             3             4 
##       Wyoming 
##             1 
## 
## Within cluster sum of squares by cluster:
## [1] 355.2395   0.0000 281.9867 419.7744   0.0000 111.9421
##  (between_SS / total_SS =  68.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

We observe some similarities with the clusters from hierarchical clustering:

  • The populous states are in their own small clusters: Texas (cluster 5), California (cluster 2), and Pennsylvania and New York together (cluster 6). Maybe the health system of a populous state has a particular complexity, not seen in smaller states, that affects measles cases in a similar fashion.

  • Again we find Indiana, Kentucky, Tennessee, and Alabama in the same cluster (cluster 3).

  • In cluster 4 we find Illinois and Michigan (which are close to each other), but here we also find Massachusetts.

  • Finally, in cluster 1 we find many states from different regions of the US, from the north east (like Vermont, Maine, Rhode Island, and New Hampshire) to the south west (like Colorado, Arizona, and New Mexico).
    If we analyze each sub-group, maybe this is telling us that geographical proximity between states can explain some of the measles cases.
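
To read off the full membership of each cluster at a glance, a one-line sketch (assuming km is the 6-cluster fit above):

# States grouped by their k-means cluster assignment
split(names(km$cluster), km$cluster)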