28 November 2012

how to summarize data.

Hi! I hope everyone had a great time with family and friends last week. I'm very thankful that I got to spend the week relaxing and seeing old faces.

Today I wanted to write up a little guide on how I summarize data. This post was motivated by poker... I know nothing about poker, but after grasping some of the concepts, I went straight to the UCI Machine Learning Repository to see if I could find a poker-related data set to play with in R. Fortunately, the Poker Hand data set is one of the most popular ones.

This guide comes from old notes that I took for my senior thesis last year, and they came in handy! It should be helpful when trying to understand the most basic properties of a data set.

#1. Discover the central tendency.

  • To find the central tendency of your data, look at the sample mean and median.
  • The sample mean and median are not always the same! If these values are different, find out why.
  • Sample mean - the sum of all measurements divided by the number of measurements in the set (or the average). 
  • Note: since the sample mean gives every measurement equal weight, a single extreme value (or outlier) can pull the mean noticeably toward it.
  • Sample median - the middle value of the ordered data. If there is an even number of observations, the median is the average of the two middle values.
  • Note: the data set must be properly ordered before finding the median.
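
Here is a small sketch of the two definitions above, using Python's built-in statistics module. The numbers are made up for illustration (they are not from the Poker Hand data set), and one of them is deliberately extreme so the mean and median disagree.

```python
from statistics import mean, median

# Hypothetical sample: seven measurements, one of them extreme.
values = [3, 1, 4, 2, 5, 3, 40]

# Sample mean: the sum of all measurements divided by how many there are.
sample_mean = mean(values)

# Sample median: the middle value of the ordered data.
# median() sorts a copy of the data internally, so no manual ordering is needed.
sample_median = median(values)

# With an even number of observations, the median is the average
# of the two middle values.
even_median = median([1, 2, 3, 4])

print("mean:", sample_mean)
print("median:", sample_median)
print("median of an even-sized sample:", even_median)
```

The mean lands well above the median here, which is the cue to go looking for the extreme value.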

#2. Measure the variability.

  • Measuring variability means describing how spread out the data are around the center of the data set. There are a few ways to do this depending on how the data are distributed.
  • Range - subtract the smallest value from the largest value.
  • Note: the range tends to grow as the sample size grows, because each new observation can only widen it or leave it unchanged. It's only fair to compare the ranges of two or more samples if the sample sizes are equal.
  • Variance - roughly the average squared distance of the measurements from the sample mean (the sample variance divides the sum of squared distances by n - 1 rather than n).
  • Note: if your data are measured in seconds, then the units of variance are seconds-squared. (I hope that makes sense, I could only define the units with an example!)
  • Standard deviation - the typical distance of a measurement from the mean.
  • Note: standard deviation is the square root of the variance, so it is measured in the original units of the sample.
  • Interquartile range - the distance between the upper and lower quartiles or the difference between the 75th and 25th percentile.
  • Note: quartiles split the ordered data into four equal parts at the 25th, 50th and 75th percentiles. These are the values a box plot is drawn from.
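
The four measures above can be sketched with Python's statistics module (quantiles needs Python 3.8 or newer). The timings are hypothetical. Quartiles can be computed under several conventions, so the values from statistics.quantiles with its default method may differ slightly from what another tool reports.

```python
from statistics import quantiles, stdev, variance

# Hypothetical measurements, in seconds.
times = [12, 15, 11, 19, 14, 13, 16, 18]

# Range: largest value minus smallest value (seconds).
data_range = max(times) - min(times)

# Sample variance: sum of squared distances from the mean, divided by n - 1.
# Its units are seconds squared.
sample_var = variance(times)

# Sample standard deviation: square root of the variance, back in seconds.
sample_sd = stdev(times)

# Quartiles: the 25th, 50th and 75th percentiles (default "exclusive" method).
q1, q2, q3 = quantiles(times, n=4)

# Interquartile range: upper quartile minus lower quartile.
iqr = q3 - q1

print("range:", data_range)
print("variance:", sample_var)
print("standard deviation:", sample_sd)
print("quartiles:", q1, q2, q3)
print("IQR:", iqr)
```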


#3. Visualize the data.

  • It's good to visualize your data so you can see its distribution (where the center of the data occurs and how the observations are spread out around that center). 
  • One useful way is to use histograms, which are graphs that display how often values fall into each interval (or bin).
  • Box plots, like histograms, are organized to give you a sense of dispersion and skewness. I like box plots because you can pinpoint the extreme values.
  • Scatter plots are used to see how bivariate data are distributed. This is where you can see whether there is a correlation between x and y, and whether it's positive or negative.
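
A real plot needs a plotting library, but the two ideas behind histograms and scatter plots can be sketched in plain Python. The helper names text_histogram and pearson_r are made up for this example, the data are hypothetical, and Pearson's coefficient is just one common way to put a number on the direction of a correlation.

```python
from collections import Counter
from math import sqrt

def text_histogram(values):
    """Return one line per distinct value, with one '*' per occurrence."""
    counts = Counter(values)
    return [f"{value:>3} | {'*' * counts[value]}" for value in sorted(counts)]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much x and y each vary on their own.
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)

# Hypothetical univariate data: frequency of each value.
ranks = [2, 3, 3, 5, 5, 5, 7, 9]
for line in text_histogram(ranks):
    print(line)

# Hypothetical bivariate data: is the correlation positive or negative?
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
r = pearson_r(x, y)
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print("correlation:", direction)
```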

Cool, huh? One last point to consider is sensitivity to outliers. Sample means, ranges and standard deviations are sensitive to outliers, whereas medians and IQRs are not.
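
To see that sensitivity in action, here is a small sketch that summarizes the same made-up sample twice: once as is, and once with the last value replaced by an extreme one (as if it had been mistyped). The summary helper is hypothetical and uses Python's statistics module (3.8 or newer).

```python
from statistics import mean, median, quantiles, stdev

def summary(values):
    """Return the center and spread measures for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
        "iqr": q3 - q1,
    }

# Hypothetical sample, and the same sample with one extreme value.
clean = [10, 11, 12, 13, 14, 15, 16, 17]
with_outlier = clean[:-1] + [170]

clean_summary = summary(clean)
outlier_summary = summary(with_outlier)

for name in ("mean", "median", "stdev", "iqr"):
    print(name, clean_summary[name], outlier_summary[name])
```

In this example the mean and standard deviation jump once the extreme value is added, while the median and IQR do not move at all.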

