Monday, July 2, 2018

Using Ellipses to Estimate Correlation

I recently had the privilege of participating in my fifth AP Statistics reading. One activity that I especially look forward to during the reading is the Best Practices Night. This year was even more special because I was one of the presenters. Many of my colleagues had encouraged me, during informal discussions, to submit this talk. Because of the reception it received, I decided to share it here so that others may benefit from it as well.

What students need to understand going into this are the following:

  1. Correlation measures the strength of a linear association,
  2. A scatter plot whose points all fall on a straight line (one that is neither horizontal nor vertical) has a correlation of ±1, and
  3. A scatter plot with no linear association has a correlation of zero.

Take two scatter plots, one with a correlation of zero and one with a correlation of positive one, and draw an ellipse around each (see fig. 1).

fig. 1

Now, while ellipses have a major and minor axis, my students don't remember much about ellipses by the time they take AP Statistics, so I simply refer to the length and width of the ellipse.

Ask your students: how can the length and width be used to help estimate the value of r? Fairly quickly, students will say you can get the correlation of zero by simply subtracting, L - W, because the ellipse around the zero-correlation plot is as wide as it is long. That works great, but it doesn't give us the correlation of 1: for the straight-line plot the ellipse has essentially no width, so L - W is just L. With a little more thought, students will realize that dividing this quantity by L will yield the correlation of 1.

So this is what we'll use: (L - W) / L. Correlation is unitless, and the units cancel when this division is made, so it doesn't matter what is used to measure the length and width of the ellipse as long as both are measured the same way. Students can use markings on a paper, a ruler, their pencil or pen, or even their finger. The slope of the ellipse's length determines whether the correlation is positive or negative.
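
As a rough sketch, the estimate can be written as a small Python function. The name `estimate_r` and the `slope_positive` parameter are hypothetical, chosen only for illustration; the sign is passed in separately because the ratio itself only gives the strength.

```python
def estimate_r(length, width, slope_positive=True):
    """Estimate correlation from an ellipse drawn around a scatter plot.

    length and width may be measured in any unit (ruler, pencil, finger),
    as long as both use the same one.
    """
    if length <= 0 or not 0 <= width <= length:
        raise ValueError("need length > 0 and 0 <= width <= length")
    strength = (length - width) / length  # the units cancel in this division
    return strength if slope_positive else -strength


# Hypothetical measurements: the same ellipse measured in cm, then in mm.
print(estimate_r(8, 2))                        # measured in cm
print(estimate_r(80, 20))                      # measured in mm, same estimate
print(estimate_r(8, 2, slope_positive=False))  # ellipse slopes downward
```

A circle (length equal to width) gives 0, and an ellipse with no width gives 1 or -1, matching the two reference scatter plots.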

fig. 2

Once students understand how this works, you can start examining the impact outliers have on correlation or what impact adding or removing a point from the scatter plot will have on correlation.

fig. 3

In fig. 3, the new point causes the ellipse to become longer while the width remains the same. Because (L - W) / L is the same as 1 - W / L, a longer length with the same width makes W / L smaller and the estimate larger, so this point causes correlation to increase.

fig. 4

In fig. 4, the new point causes the ellipse to become wider while the length stays about the same. In this case, the numerator becomes smaller, which means that the correlation is weaker with this new point in the data set.
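
To see the longer-ellipse and wider-ellipse cases as arithmetic, here is a short sketch with made-up measurements. The numbers 10, 4, 12, and 6 are hypothetical and are not read from the figures, and `ratio_estimate` is an illustrative name.

```python
def ratio_estimate(length, width):
    # (L - W) / L, the estimate described above
    return (length - width) / length


original = ratio_estimate(10, 4)  # starting ellipse
longer = ratio_estimate(12, 4)    # new point stretches the length only
wider = ratio_estimate(10, 6)     # new point stretches the width only

print("original:", round(original, 3))
print("longer:  ", round(longer, 3))
print("wider:   ", round(wider, 3))
```

With these numbers the longer ellipse gives a larger estimate than the original, and the wider ellipse gives a smaller one.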

fig. 5

In fig. 5, the new point changes both the length and width. More importantly, the new point changes the slope of the ellipse length from positive to negative. In this situation, the outlier is causing the sign of the correlation to change.
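
If you want to confirm a sign change like this with actual numbers, the sketch below computes r directly, before and after adding one extreme point. The data are invented for illustration and are not the points shown in fig. 5; `pearson_r` is a hand-written helper, not a library function.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data with a strong positive linear association.
xs = [1, 2, 3, 4, 5]
ys = [1.2, 1.9, 3.1, 3.8, 5.0]

r_before = pearson_r(xs, ys)
# One hypothetical outlier far to the lower right of the cloud.
r_after = pearson_r(xs + [20], ys + [-20])

print("without outlier:", round(r_before, 3))
print("with outlier:   ", round(r_after, 3))
```

The single added point is enough to tip the ellipse from an upward slope to a downward one, and the computed r changes sign with it.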

You can even use ellipses to help students understand why a linear association may not be the best way to describe the relationship in the scatter plot.

fig. 6

Students, at this point, will naturally enclose the scatter plot with something that turns out to not be an ellipse. They will even start to measure the length and width of the enclosure. You have to remind them that correlation only measures linear associations and that they need to draw an ellipse.

What they will notice is that the ellipse does not provide uniform coverage of the scatter plot, while their amoeba-shaped enclosure does a better job of capturing it. This gap in the ellipse's coverage indicates that a linear association will not be the best way to describe the relationship for this data set.
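
A small numerical sketch makes the same point. The data below are hypothetical (a perfect parabola, which may not be the shape in fig. 6), and `pearson_r` is a hand-written helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: y is determined exactly by x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print("r for a perfect parabola:", r)
```

Even though every point sits exactly on the curve, the linear correlation is essentially zero, which is what a nearly round ellipse with large gaps in coverage would suggest.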

By using ellipses to estimate correlation, students are able to:
  • visualize the strength of a linear association,
  • understand the impact of outliers on correlation,
  • understand the impact of adding and removing points from a scatter plot on correlation, and
  • recognize when non-linear associations are stronger than linear associations.

I encourage you to have your students use ellipses to estimate correlation.

P.S. For those of you who are interested, there is an excellent paper on correlation entitled Thirteen Ways to Look at the Correlation Coefficient (Rodgers and Nicewander, 1988). If you look at item 10, you can see that (L - W) / L is a less precise approximation of the formula given in the article. I will confess that I found the article only after having used this method for many years; I didn't want to submit my presentation idea without a more formal confirmation of its validity.
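
For readers who want to see how close the quick estimate comes, the sketch below compares it with an ellipse-based formula of the form (L² - W²) / (L² + W²). Two assumptions are made here and should be checked against the article: that item 10 takes this form, and that both variables are standardized so the ellipse is a constant-density contour of a bivariate normal distribution, whose axes are proportional to sqrt(1 + r) and sqrt(1 - r). All function names are hypothetical.

```python
from math import sqrt


def axes_for(r):
    """Ellipse axis lengths (up to a common scale) for a given r >= 0,
    assuming standardized bivariate normal data."""
    return sqrt(1 + r), sqrt(1 - r)


def quick_estimate(length, width):
    # The classroom estimate: (L - W) / L
    return (length - width) / length


def squared_axes_formula(length, width):
    # Assumed form of the more precise formula: (L^2 - W^2) / (L^2 + W^2)
    return (length ** 2 - width ** 2) / (length ** 2 + width ** 2)


for true_r in (0.0, 0.3, 0.5, 0.7, 0.9, 1.0):
    length, width = axes_for(true_r)
    print(true_r,
          round(quick_estimate(length, width), 3),
          round(squared_axes_formula(length, width), 3))
```

Under these assumptions the quick estimate agrees at 0 and 1 and runs somewhat low in between, which is consistent with treating it as a rough classroom approximation.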