This semester I have delivered a more direct message about why we graph data distributions. I think in the past my students viewed graphing as just part of the required response and weren't really considering the ramifications of the data distribution. So I have decided to hammer this point home.
We generated some data and looked at graphs that displayed the data distributions. In looking at shape, I emphasized that symmetry tells us which summary statistics we can use to describe the data. Moderately to extremely skewed data indicate that the mean and standard deviation are not appropriate; in this situation we should use the median and interquartile range (IQR).
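To make the shape discussion concrete, here is a minimal sketch in Python using only the standard library. The data are invented for illustration (they are not the data my class generated), and `text_histogram` is a hypothetical helper name. It draws a crude text histogram and compares the mean to the median; a mean pulled well above the median is one sign of right skew.

```python
from collections import Counter
import statistics

def text_histogram(values):
    """Return one line per integer value in the range, with one '*' per observation."""
    counts = Counter(values)
    return [f"{v:>3} | {'*' * counts[v]}" for v in range(min(values), max(values) + 1)]

# Hypothetical right-skewed data, invented for illustration only
data = [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 7, 9, 12]

print("\n".join(text_histogram(data)))

mean = statistics.mean(data)
median = statistics.median(data)
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1

print(f"mean = {mean:.2f}, median = {median}, IQR = {iqr}")
```

With a long right tail like this one, the graph and the gap between mean and median both point toward reporting the median and IQR.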
The same goes for gaps and outliers. Their presence distorts the mean and standard deviation. As I told my class, an outlier may be screwing up your analysis, but you can't simply drop it out of annoyance. We worked with how many movies each student watched in a movie theater this summer. The data contained an outlier. I said that unless we knew something about this data point, such as that the individual was a movie critic, we were stuck working with the outlier included. In fact, our work just increased, because we need to understand what impact, if any, the outlier is having on the overall distribution. This means we calculate summary statistics both with and without the outlier.
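The with-and-without comparison can be sketched as follows. The movie counts below are made up for illustration, not my class's actual responses, and `summarize` is a hypothetical helper. The sketch also assumes the outlier is simply the largest value, which is fine here but is not a general rule for identifying outliers.

```python
import statistics

def summarize(values):
    """Return mean, standard deviation, median, and IQR for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

# Hypothetical counts of movies seen in a theater (invented for illustration)
movies = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5, 30]

# Assumption: treat the single largest value as the suspected outlier
outlier = max(movies)
without = [m for m in movies if m != outlier]

with_outlier = summarize(movies)
without_outlier = summarize(without)

for name in with_outlier:
    print(f"{name:>6}: with = {with_outlier[name]:.2f}, without = {without_outlier[name]:.2f}")
```

In a data set like this, the mean and standard deviation shift noticeably when the outlier is set aside, while the median does not move. The outlier still stays in the data unless we have a reason to exclude it; the comparison only shows how much influence it has.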
Modes produce a different issue. Multiple modes typically indicate that distinct sub-groups are present in the data. This means we'll need to determine whether the sub-groups exist and then analyze each identified sub-group separately--more work again!
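Here is a small sketch of what analyzing sub-groups separately might look like once they have been identified. The group labels "A" and "B" and all of the values are hypothetical; in practice, figuring out what the sub-groups are is the hard part and is not something this code does.

```python
import statistics
from collections import defaultdict

# Hypothetical (group, value) records; labels and numbers are invented
records = [
    ("A", 2), ("A", 3), ("A", 3), ("A", 4),
    ("B", 9), ("B", 10), ("B", 10), ("B", 11),
]

# Collect the values belonging to each sub-group
groups = defaultdict(list)
for label, value in records:
    groups[label].append(value)

# Summarize each sub-group on its own: (mean, standard deviation)
by_group = {
    label: (statistics.mean(vals), statistics.stdev(vals))
    for label, vals in groups.items()
}

# The pooled mean, for comparison
pooled_mean = statistics.mean([value for _, value in records])

print(by_group)
print("pooled mean:", pooled_mean)
```

Here the pooled mean lands in the gap between the two clusters, where no observation actually sits, which is why a single summary of bimodal data can mislead.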
It is important to realize that graphing is a key step in understanding what you can and cannot do with your data. I am hopeful that students will gain a better appreciation of this fact and become better at determining which statistical measures are appropriate for the data.
Tuesday, August 27, 2013