Saturday, September 28, 2013

Why Do You Graph Data in Statistics?

My Inferential Probability and Statistics class is now transitioning from how to gather data to how to organize and begin to analyze data. We spent some time working with categorical data, which I discussed in a previous post. We also worked through some preliminaries for quantitative data using the name rank data described in a previous post.

The classes conducted an experiment testing the melting times of chocolate chips (for a full description of the process, see posts IPS Day 47 and IPS Day 48). The classes had gone through the basics of stem plots, histograms, and histogram bin sizes using the name rank data. We had also examined medians and means. It was time to look at the melting time data.

Students wrote their melting times on the board under three columns for milk chocolate, semi-sweet chocolate, and white chocolate. I asked the students to graph all of the data in one graph. Students were a bit confused as to why I would want to do this, so I explained that we had chocolate melting times, and even though they came from different types of chocolate they were still, fundamentally, melting times. Many students used stem plots; some entered their data into a calculator to create histograms.

I also asked students to calculate the median and mean of the distribution and decide which was more representative of the center. We had already worked through many sample graphs this way, and they had a pretty good sense that skewed data call for the median.
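To illustrate the point, here is a minimal sketch in Python using hypothetical melting times (in seconds, not the class data); it shows how a long right tail pulls the mean above the median.

```python
# Hypothetical right-skewed melting times (seconds); not the class data.
from statistics import mean, median

melt_times = [52, 55, 58, 61, 63, 66, 70, 88, 95, 102, 160, 171, 183]

print(f"mean:   {mean(melt_times):.1f}")    # pulled upward by the long right tail
print(f"median: {median(melt_times):.1f}")  # resistant to the extreme values
```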

The stem and leaf plots looked similar to the one displayed below:
Students described this graph as being positively skewed, and many thought the graph was uni-modal. A few students described the graph as being multi-modal. The graph was what I expected to see; after all, we were dealing with three different types of chocolate.

I asked the class what the possibilities could be for the melting times. This is a difficult question for students who are not used to considering practical aspects of problems. If all the chocolates melt at roughly the same rate, we should see a graph that reflects this: basically a uni-modal graph with possible skew.

If two chocolates melt at approximately the same rate and the third melts at a different rate, we should expect to see two modes in the graph, although the modes may be masked by overlap and by the number of items of each type being graphed. (We actually saw this phenomenon when we graphed the heights of students in the class. The data should be bi-modal because of males and females, but the number of females was so large compared to males that the male mode looked more like a wave than a mode; the height overlap between the genders filled in the gap between the two modes.)

The third possibility is that all three chocolate types have different melting times, resulting in three modes. If the melting times are close together, though, the lack of separation could mask how distinct each mode is from the others.
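For readers who want to see the three shapes concretely, here is a rough Python simulation (not the class data); the group centers and spreads are invented purely to show how overlap can mask a mode.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Invented group centers (seconds) for each scenario.
scenarios = {
    "uni-modal": [80, 80, 80],    # all chocolates melt at about the same rate
    "bi-modal":  [70, 85, 160],   # two types overlap, one stands apart
    "tri-modal": [60, 110, 170],  # three distinct melting times
}

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, (label, centers) in zip(axes, scenarios.items()):
    times = np.concatenate([rng.normal(c, 12, 40) for c in centers])
    ax.hist(times, bins=20)
    ax.set_title(label)
    ax.set_xlabel("melting time (s)")
plt.tight_layout()
plt.show()
```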

I sketched the three possibilities out and asked students to consider which of the three their graph most resembled.

As students looked at these possibilities, they realized that the data were not uni-modal. The stem plot follows a pattern of three distinct melt times. Without going back into the data we cannot say, at this point, which melts faster and which melts slower, but the graph strongly suggests three distinct melt times, which provides a foundation for analyzing the data as three separate groups.

Students had never before considered what a graph should look like and then compared their results to those possibilities. This was a real eye opener for them as they realized that graphs aren't just made to satisfy some requirement or make a pretty picture; graphs are made to gain understanding about the data we are about to analyze. We can learn so much about the data just by creating a graph, thinking about what the possible graph shapes could be, and seeing what our graph actually looks like.

I was able to push this further as we looked at histograms for the data. As I told the class, stem plots are great because they capture all of the raw data, and they show results comparable to histograms when working with smaller data sets. However, stem plots are limiting when making decisions about how to split apart the data, especially as data sets grow larger.

Below are histograms for the chocolate melting times; the only difference among them is the bin width used: 50, 30, 15, and 5 seconds.

The 50 second bin width hides too much detail. From this graph we would describe the distribution as uni-modal with a slight positive skew.


The 30 second bin width tells a slightly different story. We now have two distinct modes. There still appears to be a slight skew to the right, but too much detail remains hidden.


The 15 second bin width clearly shows three distinct modes. It also shows that one mode sits close to 170 seconds while the other two modes are closer together at approximately 60 seconds and 90 seconds.

The 5 second bin width shows more detail. Notice how the lower mode is definitely below 70 seconds. The green bar at approximately 80 seconds is interesting, especially considering the mode near 100 seconds. This peak at 80 seconds is what I would call a tidal effect: the tails of two distributions push against each other and raise values where they overlap (recall my discussion of heights).

Although the 15 second bin width is the easiest to describe, the 5 second bin width provides additional insight into the data. Histograms give us the flexibility to examine the distribution at different levels of detail.
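As a rough sketch of how this works in practice, the Python snippet below draws the same data at all four bin widths; the melt_times values are invented stand-ins, not the class's actual measurements.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Invented stand-in data: three groups with different centers (seconds).
melt_times = np.concatenate([rng.normal(65, 8, 12),
                             rng.normal(95, 10, 12),
                             rng.normal(170, 15, 12)])

fig, axes = plt.subplots(2, 2, figsize=(10, 6), sharex=True)
for ax, width in zip(axes.flat, [50, 30, 15, 5]):
    bins = np.arange(0, melt_times.max() + width, width)
    ax.hist(melt_times, bins=bins)
    ax.set_title(f"bin width = {width} s")
    ax.set_xlabel("melting time (s)")
plt.tight_layout()
plt.show()
```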

At this point the class started to understand that graphs are a tool to help them better understand the data they are tasked with analyzing.

Why do you graph data in statistics? Is it to make a pretty picture? Or is it to tell a story about data distributions and their implications for analysis?






A First Look at Graphing Distributions of Quantitative Data

[This post was supposed to be published last week, but I got really busy and forgot I had put this post together. Then, I realized that I didn't have this post to refer back to as I was writing my next post. So here it is. Just realize there were several lessons that took place between this post and the next even though both posts were published on the same day.]

Up to this point, the data we've been working with in my Inferential Probability and Statistics class has been categorical data. We are beginning to transition to working with quantitative data. To that end we conducted an observational study on name popularity and conducted an experiment on the melting time of chocolate chips.

The name popularity activity comes from Making Sense of Statistical Studies. Students found the ranking of their given name for their birth year and for the most recent year for which data were available, using the Social Security Administration's Baby Name database. One of the questions asked students "to construct a graphical display that shows the distribution of the [year] ranks data."

Students either had no idea how to proceed or started to create graphs displaying categorical characteristics of the data set, such as year versus gender or high versus low ranks. When I asked my classes about histograms, only a handful indicated they knew what histograms were or how to use them. This was interesting since students had been taught histograms (note: I am not using the term learned, as it is evident that learning did not take place) since the sixth grade.

I wonder why students have so much exposure to histograms yet seem so unclear about their use. Is it possible that so much time is spent teaching students how to make "pretty" graphs without really thinking about what the graph tells you about the data distribution? Even in an AP Statistics class, it is difficult to move students away from just making a graph to considering what the graph tells you about your data.

I will be working primarily with stem plots, histograms, and box plots to analyze data sets. It will be interesting to see how well students move from "here is how to make a graph" to "here is what the graph is telling you about the data distribution."

Friday, September 20, 2013

Teaching the normal model using a z-table

We are just wrapping up working with z-scores and the normal model in my AP Statistics class. In the past, I had students use the empirical rule (68%, 95%, and 99.7% of the distribution within 1, 2, or 3 standard deviations of the mean) and then moved to using a graphing calculator.

This summer, while teaching the introductory statistics class at MSU Denver, I had to teach the normal model using a z-table since a graphing calculator was not required. What I discovered was that using the table made the conversions among raw scores, z-scores, and percentages much more transparent to students and, hence, more understandable. I decided that I needed to bring this experience to my AP classes.

I started through the z-score and normal model unit as I have in the past, including showing students how to find cumulative probabilities and how to find z-scores from percentiles. Then we started working through a ten-part problem that required using both the empirical rule and explicit values. I showed students how to read a z-table and asked a few questions to confirm they could find a percentage from a z-score or a z-score from a percentile. I then told students they could not use their calculators to work through the problem; they had to use the z-table.
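For readers who want to check their table work with software, here is a minimal sketch of the same conversions; the mean, standard deviation, and scores below are made up for illustration.

```python
from scipy.stats import norm

mu, sigma = 100, 15   # hypothetical population mean and standard deviation

# Raw score to percentage: standardize, then look up the cumulative probability
# (this is the z-table lookup).
x = 118
z = (x - mu) / sigma
print(f"z = {z:.2f}, percent below = {norm.cdf(z):.1%}")

# Percentile to raw score: reverse the table lookup, then un-standardize.
p = 0.90
z = norm.ppf(p)
print(f"z = {z:.2f}, raw score = {mu + z * sigma:.1f}")
```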

There were a few questions as students started working through the problem parts, but all of them readily picked up on how to make use of the table. More importantly, students started getting a better feel for how to move from a raw score through a z-score to a percentage, or vice versa.

For the assessment on this unit, it was interesting to see that more than half the class asked if they could use the z-table, which was, of course, fine. I regret that I did not see the value of teaching with tables earlier. If you are like me and have previously taught only with technology rather than the z-table, I highly recommend spending some time using the z-table in your instruction. You may be pleasantly surprised, as I was.

Friday, September 13, 2013

Using Contingency Tables and Segmented Bar Graphs to Determine Association

For my Inferential Probability and Statistics class, I am beginning to bounce back and forth between gathering data and organizing data. Students used sampling methods to collect information on cars in our school parking lot. The data are primarily categorical in nature. From preliminary discussions I saw that the classes were solid in creating pie graphs and bar graphs for their data. To help push their thinking about how to analyze data, I focused on making use of contingency tables and segmented bar graphs.

It is easy to get students started on contingency tables. I created a 3 x 3 table on the board, labeling the vertical boxes on the left as Male, Female, and Total (see table below). Across the top boxes I used the labels Jeans, No Jeans, and Total. I then took a quick poll of boys and girls as to whether they were wearing denim jeans or not. Voila, a contingency table.

Next we worked through calculating percentages of total, row percentages, and column percentages. I like to use different color markers for this so it is easier to reference percentage types. From here, it is an easy matter to begin discussing marginal and conditional distributions, to compare these distributions, and to use these to have students begin to think about variables being dependent or independent.
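Here is a minimal sketch of the same percentage calculations in Python; the jeans poll counts are made up for illustration and are not from any of my classes.

```python
import pandas as pd

# Made-up poll counts: rows are gender, columns are jeans / no jeans.
counts = pd.DataFrame({"Jeans": [12, 9], "No Jeans": [6, 8]},
                      index=["Male", "Female"])

total = counts.to_numpy().sum()
print(counts / total)                          # percentages of the grand total
print(counts.div(counts.sum(axis=1), axis=0))  # row percentages
print(counts / counts.sum(axis=0))             # column percentages
```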

It is surprising how difficult it is for students to get independence and dependence straight in their minds. They are so used to thinking of independent and dependent variables from a mathematical perspective that it is difficult for them to shift gears. Rather than get into a formal look at independence and dependence (such as with probabilities), I simply ask them to consider whether the marginal and conditional distributions track along similarly. For example, if the class is 60% male, then it is reasonable to expect that roughly 60% of jean wearers would be male. If the marginal and conditional distributions are "close," then the two variables are independent.

On the other hand, if we see marked differences between the marginal and conditional distributions, then the two variables are dependent. Knowing that an item has a certain characteristic, such as being female, alters my perception of how likely that person is to wear jeans.
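A short sketch of that informal check, continuing with the hypothetical jeans counts from above: compare the overall (marginal) gender breakdown to the breakdown among jean wearers.

```python
import pandas as pd

counts = pd.DataFrame({"Jeans": [12, 9], "No Jeans": [6, 8]},
                      index=["Male", "Female"])

marginal = counts.sum(axis=1) / counts.to_numpy().sum()  # share of each gender overall
conditional = counts["Jeans"] / counts["Jeans"].sum()    # share of each gender among jean wearers

print(pd.DataFrame({"marginal": marginal, "conditional (jeans)": conditional}))
# If the two columns are close, the variables look independent;
# marked differences suggest dependence.
```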

I conducted another quick poll for hair color and eye color. I like to keep it simple, so I just broke these into light and dark categories. These two variables are very much dependent, so the marginal and conditional distributions will show differences that everyone can readily see.
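To show what a segmented bar graph of such a poll might look like, here is a sketch with invented light/dark counts; the numbers are not from my classes.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented counts: hair color (rows) versus eye color (columns).
counts = pd.DataFrame({"Dark eyes": [14, 5], "Light eyes": [4, 11]},
                      index=["Dark hair", "Light hair"])

# Convert each row to percentages so every bar totals 100%, then stack.
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100
row_pct.plot(kind="bar", stacked=True)
plt.ylabel("percent")
plt.xticks(rotation=0)
plt.show()
```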

At this point, I have lots of data that students can work with. We used the class survey data to gauge the association between political leaning and gender. I have a worksheet that looks at highest level of education completed and whether or not the person is a smoker. This practice lets students build contingency tables and segmented bar graphs from the data. The discussion of what is created enables students to see many examples and allows me to point out strengths to emulate and weaknesses to avoid.

With all of this practice under their belts, they can now turn to working with the car data that they collected. I have them formulate a question about association and then create a contingency table and segmented bar graph to assess the association. It is easy for students to lose sight of why they are doing this work. As they grind through the data, double-checking their counts and calculating percentages, it is easy to forget why all this effort is being put into working with the data. I reminded the class numerous times not to lose sight of the question they were addressing. We are not creating tables and graphs simply to make the data look nice; we are doing it to understand relationships that may or may not exist in the data.

This work took up nearly 90 minutes. For the next class we looked at what students created. The class presentations are helpful because they allow me to focus on things that are done well and point out issues that need attention. They also enable students to see and hear how to communicate their results. Finally, they provide a forum for viewing data through multiple lenses, hopefully broadening students' perspective on how to analyze data.

The presentations went well. Below is one example of the results presented in class. The group decided there appeared to be an association and the class concurred.


I was pleased with the results presented and the classes indicated they felt comfortable working with contingency tables.