Author: Nicholas Vogt
MSU Data Science has an open blog! For members who want to show off some cool analysis they did in class or independently, we’ll post your findings here! Build your resumes and share the URL with employers, friends, and family!
I’m Nick, and I’m going to kick us off with a quick intro to R using the iris dataset! I’ll first do some visualizations with ggplot. Then I’ll run two types of statistical analysis: ordinary least squares regression and logistic regression. Finally, I’ll examine the two models together to determine which is better!
Visualize the Data
First, we’ll attach the ggplot2 package and load the iris data into our workspace. For those unfamiliar with the iris dataset, I encourage you to follow along in R!
```r
library('ggplot2')
#> Warning: package 'ggplot2' was built under R version 3.3.1

data(iris)
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
```
Since the data is clean, we’ll go right into visualization.
```r
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length, colour = Species)) +
  geom_point() +
  ggtitle('Iris Species by Petal and Sepal Length')
```
Interesting! It looks like setosas are clearly different from the other two species. Versicolors and virginicas appear different from each other too; however, it would be difficult to classify observations near the border.
Let’s see what regression can do to classify this data using only Petal.Length and Sepal.Length as our explanatory variables. I’ll first create a dummy variable for versicolors. Then we’ll fit our model and assume any observation whose predicted probability is greater than one-half is a versicolor. Finally, we’ll examine our type 1 and type 2 errors.
```r
iris[['Is.Versicolor']] <- as.numeric(iris[['Species']] == 'versicolor')
fit.lm <- lm(Is.Versicolor ~ Petal.Length + Sepal.Length, data = iris)
summary(fit.lm)
#> 
#> Call:
#> lm(formula = Is.Versicolor ~ Petal.Length + Sepal.Length, data = iris)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.6597 -0.3667 -0.1962  0.5520  0.7929 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   1.11893    0.40693   2.750 0.006715 ** 
#> Petal.Length  0.14794    0.04328   3.419 0.000815 ***
#> Sepal.Length -0.22959    0.09226  -2.489 0.013940 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4569 on 147 degrees of freedom
#> Multiple R-squared:  0.07949, Adjusted R-squared:  0.06696 
#> F-statistic: 6.347 on 2 and 147 DF,  p-value: 0.002271

iris[['Predict.Versicolor.lm']] <- as.numeric(predict(fit.lm) > 0.5)

## Tabulate predicted vs. actual values
table(iris[, c('Is.Versicolor', 'Predict.Versicolor.lm')])
#>              Predict.Versicolor.lm
#> Is.Versicolor  0  1
#>             0 89 11
#>             1 47  3
```
Looks like all the variables were significant with p < 0.05, but the model has poor predictive power. We correctly classified 92 of the 150 observations, with only 3 true positives out of 50 versicolors. That’s pretty bad. We could refine our model, but instead, let’s attempt logistic regression.
```r
fit.logit <- glm(Is.Versicolor ~ Petal.Length + Sepal.Length, data = iris,
                 family = binomial(link = 'logit'))
summary(fit.logit)
#> 
#> Call:
#> glm(formula = Is.Versicolor ~ Petal.Length + Sepal.Length, family = binomial(link = "logit"), 
#>     data = iris)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.5493  -0.9437  -0.6451   1.2645   1.7894  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)   
#> (Intercept)    3.0440     1.9752   1.541  0.12328   
#> Petal.Length   0.7369     0.2282   3.229  0.00124 **
#> Sepal.Length  -1.1262     0.4611  -2.443  0.01459 * 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 190.95  on 149  degrees of freedom
#> Residual deviance: 178.32  on 147  degrees of freedom
#> AIC: 184.32
#> 
#> Number of Fisher Scoring iterations: 4

iris[['Predict.Versicolor.logit']] <- as.numeric(predict(fit.logit) > 0.5)
table(iris[, c('Is.Versicolor', 'Predict.Versicolor.logit')])
#>              Predict.Versicolor.logit
#> Is.Versicolor  0  1
#>             0 99  1
#>             1 50  0
```
Holy cow, our model is far too conservative. It only predicted one versicolor! Both explanatory variables are significant with p < 0.05, though the intercept is not.
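Part of that conservatism comes from the scale of the prediction: by default, predict() on a glm returns the linear predictor (log-odds), not probabilities, so a cutoff of 0.5 above really corresponds to a predicted probability of plogis(0.5), about 0.62. As a quick aside (this is my sketch, not part of the analysis above, and the exact counts will differ from the table), here is the same classification done on the probability scale:

```r
data(iris)
iris[['Is.Versicolor']] <- as.numeric(iris[['Species']] == 'versicolor')
fit.logit <- glm(Is.Versicolor ~ Petal.Length + Sepal.Length,
                 data = iris, family = binomial(link = 'logit'))

# predict() on a glm returns log-odds unless type = 'response' is given
probs <- predict(fit.logit, type = 'response')
table(Actual = iris[['Is.Versicolor']],
      Predicted = as.numeric(probs > 0.5))
```

Thresholding probabilities instead of log-odds loosens the cutoff, but as we’ll see below, it doesn’t fix the deeper problem.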
Our results were less than extraordinary, but there’s a reason for that. If you’re onto my game already, give yourself a pat on the back. If you’re confused or haven’t been paying close attention, take another look at the visualization before I explain.
Significance isn’t everything. We’ve clearly shown that here. Let’s examine where we went wrong.
Petal.Length and Sepal.Length are highly correlated. We knew this from the graph. Multicollinearity isn’t normally an issue in linear models, but it certainly is when the correlation is this strong. Collinearity, however, is not actually where we went wrong.
```r
cor(iris[, c('Petal.Length', 'Sepal.Length')])
#>              Petal.Length Sepal.Length
#> Petal.Length    1.0000000    0.8717538
#> Sepal.Length    0.8717538    1.0000000
```
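A correlation this strong inflates the variance of our coefficient estimates. One way to quantify that (a sketch of mine, not part of the original analysis) is the variance inflation factor, computed by hand for the two-predictor case: regress one predictor on the other and use VIF = 1 / (1 − R²).

```r
data(iris)
# For two predictors, R-squared of one regressed on the other
# equals their squared correlation
r2 <- summary(lm(Petal.Length ~ Sepal.Length, data = iris))$r.squared
1 / (1 - r2)
#> roughly 4.2
```

A VIF around 4 means the variance of each coefficient is roughly quadrupled relative to uncorrelated predictors — noticeable, though not catastrophic.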
We knew from the visualization that Petal.Length and Sepal.Length were important. We should not have stopped there. We should have examined other relationships that determine species and added them to our model. But, again, this is not where we went wrong.
Notice I chose to predict the classification of versicolors. Versicolor, as you can see in the visualization, sits right between setosas and virginicas. Linear models work when you can draw a single, straight line through the data - a threshold. Clearly, no single, straight line can separate versicolors from non-versicolors in this model.
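To see the contrast, consider a class that does sit at one end of the range. Setosas are separated from the other two species by petal length alone, so the same linear-model-plus-threshold recipe should do far better on them. A quick sketch (my own aside; Is.Setosa and fit.setosa are new names here, not from the analysis above):

```r
data(iris)
# Setosas sit at one extreme of Petal.Length, so a single
# linear threshold can separate them from the other species
iris[['Is.Setosa']] <- as.numeric(iris[['Species']] == 'setosa')
fit.setosa <- lm(Is.Setosa ~ Petal.Length + Sepal.Length, data = iris)
table(Actual = iris[['Is.Setosa']],
      Predicted = as.numeric(predict(fit.setosa) > 0.5))
```

Run it and compare the confusion matrix to the versicolor one above: the one-threshold approach is fine when the class lives at an extreme, and hopeless when it lives in the middle.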
If we take this one step further, we find that our estimated coefficients for both Petal.Length and Sepal.Length are totally useless. I’ll leave it to you to figure out why.
Hopefully you learned something from our first blog post in Make Data Tidy Again! If you didn’t understand some of the terms or methods, join our Slack channel and ask our members!
If you want to write a blog post, contact us! We’d love to host your content.
If you found any errors, big or small, leave a comment!