Today I wanted to write up a little guide on how I summarize data. This post was motivated by poker... I know nothing about poker but after grasping some of the concepts, I went straight to the UCI Machine Learning Repository to see if I could find a poker-related data set to play with in R. Fortunately, the Poker Hand data set is one of the most popular ones.
This guide comes from old notes that I took for my senior thesis last year, and they came in handy! It should be helpful when you are trying to understand the most basic properties of a data set.
#1. Discover the central tendency.
- To find the central tendency of your data, look at the sample mean and median.
- The sample mean and median are not always the same! If these values are different, find out why.
- Sample mean - the sum of all measurements divided by the number of measurements in the set (or the average).
- Note: since the sample mean weights every measurement equally, any extreme value (or outlier) pulls the mean toward it. The median is much less affected, which is often why the two differ.
- Sample median - the middle value of the ordered data. If there is an even number of observations, the median is the average of the two middle values.
- Note: the data set must be properly ordered before finding the median.
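As a quick illustration of the points above, here is a minimal sketch in base R. The vector `x` is made-up data (not the Poker Hand data set), chosen so that one extreme value separates the mean from the median.

```r
# Hypothetical measurements; 100 is a deliberate outlier
x <- c(2, 3, 5, 7, 100)

# Sample mean: sum of all measurements divided by their count
x_mean <- mean(x)        # same as sum(x) / length(x)

# Sample median: middle value of the ordered data
x_median <- median(x)    # median() handles the ordering for you

# Dropping the outlier moves the mean a lot and the median only a little
x_trimmed <- x[x != 100]
mean_no_outlier <- mean(x_trimmed)
median_no_outlier <- median(x_trimmed)

print(c(mean = x_mean, median = x_median))
print(c(mean = mean_no_outlier, median = median_no_outlier))
```

When the mean and median disagree this much, a single extreme value is usually the first thing to check for.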
#2. Measure the variability.
- Determining the variability means measuring how the data are spread out relative to the center of the data set. There are a few ways to do this depending on how the data are distributed.
- Range - subtract the smallest value from the largest value.
- Note: the range tends to grow as the sample size increases, because a larger sample is more likely to include extreme values. It's only fair to compare the ranges of two or more samples if the sample sizes are equal.
- Variance - the average squared deviation of the data from the mean. For a sample, the sum of squared deviations is divided by n - 1 rather than n.
- Note: if units of your data are measured in seconds, then the units of variance are seconds-squared. (I hope that makes sense, I could only define the units with an example!)
- Standard deviation - the measure of dispersion (or variation) from the mean.
- Note: standard deviation is the square root of the variance, so it is measured in the original units of the sample.
- Interquartile range - the distance between the upper and lower quartiles or the difference between the 75th and 25th percentile.
- Note: the quartiles (the 25th, 50th, and 75th percentiles) split an ordered data set into four parts with an equal number of observations. They are the values a box plot is drawn from.
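The four measures above can be sketched in base R as follows. The vector `secs` is hypothetical (made-up timings in seconds), picked to echo the units example. Note that `quantile()` supports several quartile definitions through its `type` argument, so values worked out by hand may differ slightly from R's default.

```r
# Hypothetical timings in seconds (made-up values)
secs <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Range: largest value minus smallest value
# (range() returns c(min, max), so diff() gives the spread)
r <- diff(range(secs))

# Sample variance (divides by n - 1); units are seconds-squared
v <- var(secs)

# Standard deviation: square root of the variance, back in seconds
s <- sd(secs)

# Quartiles, and the interquartile range (75th minus 25th percentile)
q <- quantile(secs, probs = c(0.25, 0.50, 0.75))
iqr <- IQR(secs)

print(c(range = r, variance = v, sd = s, IQR = iqr))
print(q)
```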
#3. Visualize the data.
- It's good to visualize your data so you can see its distribution (where the center of the data occurs and how the observations are spread out around that center).
- One useful way is a histogram, a graph that groups the values into bins and displays how many observations fall into each one.
- Box plots, like histograms, are organized to give you a sense of dispersion and skewness. I like box plots because you can pinpoint the extreme values.
- Scatter plots are used to see how bivariate data are distributed. This is where you can see whether there is a correlation between x and y -- and whether it's positive or negative.
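Here is a rough sketch of the three plots using base R graphics. The data are simulated purely for illustration (they are not from the Poker Hand data set), and the names `x`, `y`, `extremes`, and `r_xy` are my own placeholders.

```r
# Hypothetical bivariate data, simulated so that y rises with x
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)
y <- 2 * x + rnorm(100, sd = 15)

# Histogram: how many observations fall into each bin
hist(x, main = "Histogram of x", xlab = "x")

# Box plot: median, quartiles, and any points flagged as extreme
boxplot(x, horizontal = TRUE, main = "Box plot of x")

# The values a box plot would draw as individual points
extremes <- boxplot.stats(x)$out

# Scatter plot: how x and y vary together
plot(x, y, main = "y against x")

# Correlation coefficient: its sign gives the direction of the relationship
r_xy <- cor(x, y)
print(r_xy)
```

A positive `r_xy` goes with a cloud of points that rises from left to right, and a negative one with a cloud that falls.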