Quick Analysis in R with the Iris Dataset

Author: Nicholas Vogt

MSU Data Science has an open blog! If you're a member who wants to show off some cool analysis you did in class or independently, we'll post your findings here! Build your resume and share the URL with employers, friends, and family!

I’m Nick, and I’m going to kick us off with a quick intro to R with the iris dataset! I’ll first do some visualizations with ggplot. Then I’ll do two types of statistical analysis: ordinary least squares regression and logistic regression. Finally, I’ll examine the two models together to determine which is best!

Visualize the Data

First, we'll attach the ggplot2 package and load the iris data into our environment. For those unfamiliar with the iris dataset, I encourage you to follow along in R!

library('ggplot2')
#> Warning: package 'ggplot2' was built under R version 3.3.1
data(iris)
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

Since the data is clean, we’ll go right into visualization.

ggplot(iris, aes(x = Petal.Length, y = Sepal.Length, colour = Species)) + 
  geom_point() +
  ggtitle('Iris Species by Petal and Sepal Length')

Interesting! It looks like setosas are clearly distinct from the other two species. Versicolors and virginicas appear different too; however, it would be difficult to classify observations near the border between them.

Regression

Let's see what regression can do to classify this data using only Petal.Length and Sepal.Length as our explanatory variables. I'll first create a dummy variable for versicolors. Then we'll fit our model and assume any observation whose predicted probability is greater than one-half is a versicolor. Finally, we'll examine our Type I and Type II errors.

iris[['Is.Versicolor']] <- as.numeric(iris[['Species']] == 'versicolor')
fit.lm <- lm(Is.Versicolor ~ Petal.Length + Sepal.Length, data = iris)
summary(fit.lm)
#> 
#> Call:
#> lm(formula = Is.Versicolor ~ Petal.Length + Sepal.Length, data = iris)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.6597 -0.3667 -0.1962  0.5520  0.7929 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   1.11893    0.40693   2.750 0.006715 ** 
#> Petal.Length  0.14794    0.04328   3.419 0.000815 ***
#> Sepal.Length -0.22959    0.09226  -2.489 0.013940 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4569 on 147 degrees of freedom
#> Multiple R-squared:  0.07949,    Adjusted R-squared:  0.06696 
#> F-statistic: 6.347 on 2 and 147 DF,  p-value: 0.002271
iris[['Predict.Versicolor.lm']] <- as.numeric(predict(fit.lm) > 0.5)
## Tabulate predicted vs. actual values
table(iris[, c('Is.Versicolor', 'Predict.Versicolor.lm')])
#>              Predict.Versicolor.lm
#> Is.Versicolor  0  1
#>             0 89 11
#>             1 47  3

Looks like all the variables were significant with p < 0.05, but the model has poor predictive power. We correctly classified 92 of the 150 observations, with only 3 true positives. That's pretty bad.
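
To put numbers on that, here's a quick sketch of my own (conf.lm, accuracy.lm, and recall.lm are names I'm introducing, not part of the original analysis):

## Score the linear-model classifier from the confusion table above
conf.lm <- table(iris[, c('Is.Versicolor', 'Predict.Versicolor.lm')])
accuracy.lm <- sum(diag(conf.lm)) / sum(conf.lm)       # (89 + 3) / 150
recall.lm <- conf.lm['1', '1'] / sum(conf.lm['1', ])   # 3 / 50
c(accuracy = accuracy.lm, recall = recall.lm)

That's an accuracy of about 61%, below the two-thirds you'd get by just predicting "not versicolor" for everything, and a recall of only 6%. We could refine our model, but instead, let's attempt logistic regression.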

fit.logit <- glm(Is.Versicolor ~ Petal.Length + Sepal.Length, data = iris,
                 family = binomial(link = 'logit'))
summary(fit.logit)
#> 
#> Call:
#> glm(formula = Is.Versicolor ~ Petal.Length + Sepal.Length, family = binomial(link = "logit"), 
#>     data = iris)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.5493  -0.9437  -0.6451   1.2645   1.7894  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)   
#> (Intercept)    3.0440     1.9752   1.541  0.12328   
#> Petal.Length   0.7369     0.2282   3.229  0.00124 **
#> Sepal.Length  -1.1262     0.4611  -2.443  0.01459 * 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 190.95  on 149  degrees of freedom
#> Residual deviance: 178.32  on 147  degrees of freedom
#> AIC: 184.32
#> 
#> Number of Fisher Scoring iterations: 4
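## Note: predict() on a glm object returns log-odds (the link scale) by
## default, so the 0.5 cutoff below is really a probability cutoff of about
## 0.62, not the one-half described earlier. Use
## predict(fit.logit, type = 'response') to get probabilities instead.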
iris[['Predict.Versicolor.logit']] <- as.numeric(predict(fit.logit) > 0.5)
table(iris[, c('Is.Versicolor', 'Predict.Versicolor.logit')])
#>              Predict.Versicolor.logit
#> Is.Versicolor  0  1
#>             0 99  1
#>             1 50  0

Holy cow, our model is far too conservative. It flagged only a single observation as a versicolor, and even that one was wrong! Both explanatory variables are significant with p < 0.05; the intercept, however, is not.
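
If we classify on the probability scale instead, as the code comment above suggests, the cutoff matches the one-half we described earlier. A minimal sketch (Predict.Versicolor.logit.p is a column name I'm introducing, not from the original post):

prob.versicolor <- predict(fit.logit, type = 'response')
iris[['Predict.Versicolor.logit.p']] <- as.numeric(prob.versicolor > 0.5)
table(iris[, c('Is.Versicolor', 'Predict.Versicolor.logit.p')])

Don't expect this to rescue the fit, though; the deeper problem comes next.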

Post-Examination

Our results were less than extraordinary, but there’s a reason for that. If you’re onto my game already, give yourself a pat on the back. If you’re confused or haven’t been paying close attention, take another look at the visualization before I explain.

Significance isn’t everything. We’ve clearly shown that here. Let’s examine where we went wrong.

  1. Petal.Length and Sepal.Length are highly correlated with each other. We knew this from the graph. Multicollinearity isn't normally an issue in linear models, but it certainly is when the correlation is this strong. Collinearity, however, is not actually where we went wrong.
cor(iris[, c('Petal.Length', 'Sepal.Length')])
#>              Petal.Length Sepal.Length
#> Petal.Length    1.0000000    0.8717538
#> Sepal.Length    0.8717538    1.0000000
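
As a quick check of my own (not in the original post): with only two predictors, the variance inflation factor reduces to 1 / (1 - r^2).

r <- cor(iris[['Petal.Length']], iris[['Sepal.Length']])
1 / (1 - r^2)  # about 4.2; common rules of thumb flag VIFs above 5-10
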
  2. We knew from the visualization that Petal.Length and Sepal.Length were important. We should not have stopped there. We should have examined other relationships which determine species and added them to our model. But, again, this is not where we went wrong.

  3. Notice I chose to predict the classification of versicolors. Versicolor, as you can see in the visualization, sits right between setosas and virginicas. Linear models work when you can draw a single, straight line through the data: a threshold. Clearly, no single, straight line can separate versicolors from non-versicolors in this model (see the sketch after this list).

  4. If we take this one step further, we find that our estimated coefficients for both Petal.Length and Sepal.Length are totally useless. I'll leave that to you to figure out why.
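
To make point 3 concrete, here's a sketch of my own (Is.Setosa and fit.setosa are names I'm introducing): fit the same two-variable logistic regression for setosa, which the plot shows is linearly separable.

iris[['Is.Setosa']] <- as.numeric(iris[['Species']] == 'setosa')
fit.setosa <- glm(Is.Setosa ~ Petal.Length + Sepal.Length, data = iris,
                  family = binomial(link = 'logit'))
## Expect a 'fitted probabilities numerically 0 or 1 occurred' warning:
## that's perfect separation, i.e., a single straight line doing its job
table(iris[['Is.Setosa']],
      as.numeric(predict(fit.setosa, type = 'response') > 0.5))

Every observation should land on the correct side of that line, which is exactly what no single line can do for versicolor.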

Wrap Up

Hopefully you learned something from our first blog post in Make Data Tidy Again! If you didn't understand some of the terms or methods, join our Slack channel and ask our membership!

If you want to write a blog post, message Nicholas Vogt at msudatascience@gmail.com. We'd love to host your content!

If you found any errors, big or small, leave a comment!


