Thursday, June 13, 2013

Correlation and Simple Linear Regression

Today felt a little like a whirlwind. The class met in the computer lab. I started by showing how to access and use basic features of Minitab. After showing that data entry is similar to using an Excel spreadsheet, I gave a quiz on descriptive statistics and comparing distributions. As students finished their quizzes, I had them enter the different data sets we collected into Minitab.

After everyone was done with the quiz, I focused on developing some conceptual understanding of correlation, the least squares regression line, and R². I used a couple of applets to help, and since we were in the computer lab, everyone was able to try them out.

The first applet was Regression by Eye. I've used this applet many times. I started off having students describe the association: direction, strength, and form. I then focused on quantifying the strength of the relationship using values between -1 and 1. A value of -1 indicates a negative slope with all the points falling on a single line; a value of 1 indicates a positive slope with all the points falling on a single line.

I draw an oval around the scatter plot. Does the oval indicate the direction of the relationship? For different scatter plots, do fatter or narrower ovals show stronger associations? As students start to consider these questions, I ask the class to think about the ratio of the oval's width to its length. A perfect positive association would have an oval with no width, while a scatter plot with no linear association would have w and l nearly equal. This suggests the expression r = 1 - w/l as a way to estimate the correlation. The only piece to add is the direction: negative associations have negative r-values. Students started using this technique to estimate r and became more accurate with some practice. I told the class that a visual estimate within 0.2 of the actual value would be acceptable.
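The visual estimate can always be checked against the computed value. Here is a small Python sketch that computes Pearson's correlation from its definition; the data values are made up for illustration and are not one of the class data sets.

```python
import math

# Hypothetical scatter-plot data (not one of the class data sets)
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Pearson correlation: sum of cross-products divided by the
# square root of the product of the sums of squared deviations
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)

print(round(r, 3))  # → 0.996, a strong positive association
```

An oval drawn around this cloud of points would be long and very thin, so the 1 - w/l estimate would also come out close to 1.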

Next we focused on the line of best fit. Using the same applet, I had several students draw lines of best fit. Obviously this could continue with the entire class, and conceivably we could end up with 30 or more different lines of best fit. One way to choose the "best" line of best fit is to reduce the total error to a minimum. The next applet does just that.

The Least Squares Demonstration applet visually demonstrates the idea of least squares. This applet starts by using the mean line as an estimate. We can always use this line, but it is generally not a very good estimator. By moving the line's end points, we can pass closer to points on the scatter plot. As we do, we can see the squared error value change for each point. When the sum of these squared error values is smallest, we have the least squares regression line—the "best" line of best fit.

Students played around with this for a little bit and got a sense of how the least squares regression line works. Returning to the Regression by Eye applet, we could now look at the various lines that we drew and their mean squared error, basically the average of the squared error values. In the sense of least squares, the line with the smallest mean squared error is better than the others. Clicking the option to show the regression line, we can see how each hand-picked line compares to the least squares regression line.
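The same comparison can be done numerically. This sketch, again with made-up data, fits the least squares line from the usual formulas and compares its mean squared error to that of the mean line:

```python
# Hypothetical data standing in for a class data set
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares slope and intercept
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
        sum((xi - mean_x) ** 2 for xi in x)
intercept = mean_y - slope * mean_x

def mse(predict):
    """Mean squared error of a prediction function over the data."""
    return sum((yi - predict(xi)) ** 2 for xi, yi in zip(x, y)) / n

mse_mean = mse(lambda xi: mean_y)                  # the mean line y = y-bar
mse_fit = mse(lambda xi: intercept + slope * xi)   # least squares line

print(round(mse_mean, 3), round(mse_fit, 3))  # → 2.725 0.021
```

No hand-drawn line can beat the fitted one here: by construction, the least squares line has the smallest possible mean squared error for the data.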

The final conceptual piece to develop was the R² value. Here again I use the Least Squares Demonstration applet. I have students focus on the response values compared against the mean line. There is variation around the mean line, specifically (y - y-bar) for each point, where y-bar is the sample mean of the responses. Square these values and sum them and you get Σ(y - y-bar)². Dividing by n - 1 gives the sample variance, and taking the square root of that gives the sample standard deviation. I draw brackets on the y-axis to highlight the variation present in the original data, explaining that this is the total variation present in our responses.

We can compare our actual values against predictions made with the least squares regression line. In this case we have Σ(y - y-hat)², where y-hat is the predicted value of y. Re-orient your view to look at how much variation exists around the least squares regression line. I draw a bracket around the regression line to show that the variation has narrowed considerably.

How much variation is explained away? On the y-axis, I draw a length equivalent to the variation around the regression line. I then shade out the variation that is no longer present; this is the amount of variation explained away by the regression line. The percentage explained away is the R² value. I then point out that R² = r², where r is the correlation value.
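Both facts can be verified directly: R² computed as the explained fraction of variation matches the square of the correlation. A sketch with made-up data:

```python
import math

# Hypothetical data standing in for a class data set
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares slope and intercept
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Total variation around the mean line, and leftover variation around the fit
sst = sum((yi - mean_y) ** 2 for yi in y)
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - sse / sst  # fraction of the variation explained by the line

# The squared correlation gives the same number
r = sxy / math.sqrt(sxx * sst)
print(round(r_squared, 3), round(r ** 2, 3))  # → 0.992 0.992
```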

Now that the conceptual pieces were covered, we looked at the pass-the-buck data set. This data had a correlation of 0.993, and its scatter plot was nearly a straight line. I ran a regression in Minitab, and we looked at the equation and the R² value. The equation was time-hat = -1.65 + 0.68(people). We discussed what the values in the equation meant.
In this problem, the intercept says that if there were no people in line, it would take -1.65 seconds for the buck to be passed. In the context of this problem that value is not meaningful, and I explained that this is often the case with intercepts. The slope of 0.68 indicates that each additional person will typically add 0.68 seconds to the time it takes to pass the buck.

We can use the equation to make an estimate, such as: how long would we estimate it takes 10 people to pass the buck? Substituting values, we see that it should take approximately 5.15 seconds.
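The substitution is a one-liner; using the equation reported above:

```python
# Regression equation from the pass-the-buck data, as found in class:
# predicted time (seconds) = -1.65 + 0.68 * (number of people in line)
def predicted_time(people):
    return -1.65 + 0.68 * people

print(round(predicted_time(10), 2))  # → 5.15 seconds for 10 people
```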

What about 100 people? We could use the equation, but this is so far beyond the range of the sample that the result should be viewed with extreme caution: we have extrapolated beyond the scope of the analysis. On the other hand, the rope-and-knot data would probably hold up well even under extrapolation because of the nature of the relationship between the number of knots and rope length, assuming a taut rope.

With the last few minutes of class, I had students generate summary statistics, scatter plots, and perform a regression analysis with one of the data sets they entered. This enabled students to get a little more comfortable with the software that they will be required to use in their projects.

This was a lot of material to cover in one day, especially given that some of the concepts are difficult to grasp. Next class we'll focus on working through some of the data we generated using calculators. We'll also address the relationship of the regression slope to the correlation and the sample standard deviations. Finally, we'll take a look at residuals to see what they tell us about the quality of the regression equation.

Below is the outline of today's lesson with any comments between square brackets, [like this].

o   How can we measure the strength and direction of the association?
   §  Use general direction of the association (positive slope versus negative slope)
   §  Look at length versus breadth of the association
   §  Use Regression by Eye applet
o   What line best describes this association?
   §  Have students draw lines and discuss
   §  How can we determine a best line?
   §  Discuss what is happening
   §  In what sense is this the best line?
o   Use same applet to discuss explained variation
   §  What is the simplest model you can create?
      ·  Y = mean
   §  How much variation is explained by regression?
   §  Draw scatterplot and compare variation around mean versus variation around line
   §  Percent accounted for is R²
o   Computer lab
   §  Use drop data [used pass the buck data as I had it copied down already]
      ·  Estimates and predictions
o   Extrapolation
o   Influential points [will discuss next class]

