Saturday, September 28, 2013

Why Do You Graph Data in Statistics?

My Inferential Probability and Statistics class is now transitioning from how to gather data to how to organize and begin to analyze data. We spent some time working with categorical data, which I discussed in a previous post. We also worked through some preliminaries for quantitative data using the name rank data described in a previous post.

The classes conducted an experiment testing the melting times of chocolate chips (for a full description of the process see posts IPS Day 47 and IPS Day 48). The classes had gone through the basics of stem plots, histograms, and bin sizes of histograms using the name rank data. We also examined medians and means. It was time to look at the melting time data.

Students wrote their melting times on the board under three columns for milk chocolate, semi-sweet chocolate, and white chocolate. I asked the students to graph all of the data in one graph. Students were a bit confused as to why I would want to do this, so I explained that we had chocolate melting times and, even though they came from different types of chocolate, they were still, fundamentally, melting times. Many students used stem plots; some entered their data into a calculator to create histograms.

I also asked students to calculate the median and mean of the distribution and decide which was more representative of the center. We had already looked at many sample graphs doing this and they had a pretty good sense that skewed data needed to use the median.
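As a minimal sketch of that comparison, the Python below computes both measures for a small made-up data set. The values in `times` are hypothetical and are not the class's melting times; they are chosen only to have a long right tail.

```python
from statistics import mean, median

# Hypothetical right-skewed melting times in seconds (NOT the class data).
times = [55, 58, 60, 62, 65, 70, 90, 170, 180]

center_mean = mean(times)
center_median = median(times)

print("mean:  ", center_mean)
print("median:", center_median)

# In a positively skewed distribution the long right tail pulls the mean
# above the median, so the median sits closer to the bulk of the data.
print("mean pulled above median:", center_mean > center_median)
```

Here the two large values drag the mean well above most of the observations, while the median stays among them.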

The stem and leaf plots looked similar to the one displayed below:
Students described this graph as being skewed positively and many thought the graph was uni-modal. A few students described the graph as being multi-modal. The graph was what I expected to see; after all, we were dealing with three different types of chocolate.

I asked the class what the possibilities could be for the melting times. This is a difficult question for students who are not used to considering practical aspects of problems. If all the chocolates melt at relatively the same rate, we should see a graph that reflects this: basically a uni-modal graph with possible skew.

If two chocolates melt at approximately the same rate and the third melts at a different rate, we should expect to see two modes in the graph although the modes may be masked because of overlap and number of items of each type being graphed. (We actually saw this phenomenon when we graphed the heights of students in the class. The data should be bi-modal because of males and females but the number of females was so large compared to males that the male mode looked more like a wave than a mode since the height overlaps between the genders filled in the gap between the two modes.)

The third possibility is that all three chocolate types have different melting times, resulting in three modes. The amount of separation between the melting times could mask how distinct each mode is from the others.
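One way to explore how overlap and group size can mask a mode is to pool simulated groups and look at the combined shape. The sketch below is only an illustration under stated assumptions: it assumes each group is roughly normal, and the function names (`pooled_sample`, `show`) and every center, spread, and group size are hypothetical rather than taken from the class data.

```python
import random
from collections import Counter

def pooled_sample(groups, seed=0):
    """Draw normal samples for each (center, spread, size) group and pool them."""
    rng = random.Random(seed)
    pooled = []
    for center, spread, size in groups:
        pooled.extend(rng.gauss(center, spread) for _ in range(size))
    return pooled

def show(values, width):
    """Print a rough text histogram with bins of the given width."""
    counts = Counter(int(v // width) * width for v in values)
    for start in range(min(counts), max(counts) + width, width):
        print(f"{start:>4} {'#' * counts[start]}")

# Hypothetical groups: (center, spread, number of observations).
well_separated = [(60, 8, 30), (170, 8, 30)]
overlapping_unequal = [(60, 8, 50), (80, 8, 10)]

print("Two well-separated groups of equal size:")
show(pooled_sample(well_separated), 10)

print("Two overlapping groups of unequal size:")
show(pooled_sample(overlapping_unequal), 10)
```

Changing the distance between the centers or the relative group sizes and rerunning gives a feel for when a second mode stands out and when it blends into the tail of the larger group.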

I sketched the three possibilities out and asked students to consider which of the three their graph most resembled.

As students looked at these possibilities they realized that the data was not uni-modal. The stem plot follows a pattern of three distinct melt times. Without going back into the data, we cannot say, at this point, which melts faster and which melts slower, but the graph strongly suggests three distinct melt times, which provides a foundation for analyzing the data as three separate groups.

Students had never considered what a graph should look like and then compared their results to the possibilities. This was a real eye opener for them as they realized that graphs aren't just made to appease some requirement or make a pretty picture; graphs are made to gain understanding about the data we are about to analyze. We can learn so much about the data just by creating a graph and thinking about what the possible graph shapes can be and what our graph actually looks like.

I was able to push this further as we looked at histograms for the data. As I told the class, stem plots are great because they capture all of the raw data. They also show comparable results to histograms when working with smaller data sets. However, stem plots are limiting when making decisions about how to split apart the data, especially when you start working with larger data sets.

Below are histograms for the chocolate melting times. The only difference is in the bin width that was used: 50, 30, 15, and 5 seconds.

The 50 second bin width hides too much detail. From this graph we would describe the distribution as uni-modal with a slight positive skew.


The 30 second bin width tells a slightly different story. We now have two distinctive modes. There still appears to be slight skew to the right, but there is still too much detail hidden.


The 15 second bin width clearly shows three distinct modes. It also shows that one mode sits close to 170 seconds while the other two modes are closer together at approximately 60 seconds and 90 seconds.

The 5 second bin width shows more detail. Notice how the lower mode is definitely below 70 seconds. The green bar at approximately 80 seconds is interesting, especially considering the mode near 100 seconds. This peak at 80 seconds is what I would call the tidal effect, as the tails of two distributions push against each other and raise up values where they overlap (recall my discussion of heights).

Although the 15 second bin width is the easiest to describe, the 5 second bin width graph provides additional insight into the data. Using histograms provides the flexibility to analyze our data distribution by examining the distribution at different levels.
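The same kind of comparison can be tried outside of a calculator. The Python sketch below re-bins one data set at the four bin widths mentioned above and prints a rough text histogram for each. The values in `times` are hypothetical whole-second times, not the class data, and `bin_counts` and `text_histogram` are illustrative helper names, so the shapes it prints will not necessarily match the graphs described in this post.

```python
from collections import Counter

def bin_counts(data, width):
    """Count values per bin; each bin covers [start, start + width)."""
    counts = Counter((value // width) * width for value in data)
    first, last = min(counts), max(counts)
    # Include empty bins so gaps between clusters stay visible.
    return {start: counts[start] for start in range(first, last + width, width)}

def text_histogram(data, width):
    """Return a text histogram with one row per bin."""
    rows = []
    for start, count in bin_counts(data, width).items():
        rows.append(f"{start:>4}-{start + width - 1:<4} {'#' * count}")
    return "\n".join(rows)

# Hypothetical melting times in whole seconds (NOT the class data).
times = [52, 55, 58, 60, 61, 63, 66, 68,
         85, 88, 90, 92, 95, 98,
         160, 165, 170, 172, 175, 180]

for width in (50, 30, 15, 5):
    print(f"bin width: {width} seconds")
    print(text_histogram(times, width))
    print()
```

Wide bins lump nearby clusters together, while narrow bins separate them at the cost of a noisier picture, which is why it helps to look at more than one width.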

At this point the class started to understand that graphs were a tool to help them better understand the data they are tasked with analyzing.

Why do you graph data in statistics? Is it to make a pretty picture? Or is it to tell a story about data distributions and their implications for analysis?





