andy anchovy: November 2012

28 November 2012

how to summarize data.

Hi! I hope everyone had a great time with family and friends last week. I'm very thankful that I got to spend the week relaxing and seeing old faces.

Today I wanted to write up a little guide on how I summarize data. This post was motivated by poker... I know nothing about poker but after grasping some of the concepts, I went straight to the UCI Machine Learning Repository to see if I could find a poker-related data set to play with in R. Fortunately, the Poker Hand data set is one of the most popular ones.

This guide comes from old notes that I took for my senior thesis last year, and they came in handy! It should be helpful when you're trying to understand the most basic properties of a data set.

#1. Discover the central tendency.

To find the central tendency of your data, look at the sample mean and median.
The sample mean and median are not always the same! If these values are different, find out why.
Sample mean - the sum of all measurements divided by the number of measurements in the set (or the average).
Note: since the sample mean weights each measurement equally, any extreme value (or outlier) will pull the mean toward it.
Sample median - the middle value of the ordered data. If there is an even number of observations, the median is the average of the two middle values.
Note: the data set must be properly ordered before finding the median.
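Here's a minimal sketch of the two measures above. I work in R for this post, so Python's built-in statistics module is a swapped-in choice, and the measurements are made up for illustration.

```python
from statistics import mean, median

# Made-up measurements for illustration; 40 is a deliberate extreme value.
measurements = [5, 3, 40, 2, 6, 3, 4]

sample_mean = mean(measurements)      # sum of the values / number of values
sample_median = median(measurements)  # median() orders the data internally

print("mean:", sample_mean)
print("median:", sample_median)

# With an even number of observations, the median is the average
# of the two middle values of the ordered data (here 2 and 3).
even_median = median([4, 1, 3, 2])
print("median of an even-sized set:", even_median)
```

The mean lands well above the median here because of the single extreme value, which is exactly the kind of difference worth investigating.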

#2. Measure the variability.

Determining the variability means to measure how the data are spread out relative to the center of the data set. There are a few ways to do this depending on how the data are distributed.
Range - subtract the smallest value from the largest value.
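Note: the range tends to increase as the sample size increases, since a larger sample has more chances to include extreme values. It's only fair to compare the ranges between two or more samples if the sample sizes are equal.
Variance - the average squared distance of the observations from the mean (the sample variance divides by n - 1 rather than n).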
Note: if units of your data are measured in seconds, then the units of variance are seconds-squared. (I hope that makes sense, I could only define the units with an example!)
Standard deviation - the measure of dispersion (or variation) from the mean.
Note: standard deviation is determined by the square root of variance and is measured in the original units of the sample.
Interquartile range - the distance between the upper and lower quartiles or the difference between the 75th and 25th percentile.
Note: quartiles (the 25th, 50th, and 75th percentiles) split the ordered data into four equal parts, and they are what a box plot is drawn from.
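The four measures above can be sketched in a few lines. This is an illustration only: it uses Python's statistics module rather than R, and the timings are made up. Quartile conventions differ between tools, so the IQR from another package may not match exactly.

```python
from statistics import quantiles, stdev, variance

# Made-up timings in seconds, for illustration only.
seconds = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(seconds) - min(seconds)  # largest minus smallest, in seconds
sample_var = variance(seconds)            # divides by n - 1; units are seconds squared
sample_sd = stdev(seconds)                # square root of the variance; back in seconds

# quantiles(..., n=4) returns the three cut points: 25th, 50th, 75th percentiles.
q1, q2, q3 = quantiles(seconds, n=4)
iqr = q3 - q1                             # spread of the middle half of the data

print("range:", data_range)
print("variance:", sample_var)
print("standard deviation:", sample_sd)
print("IQR:", iqr)
```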

#3. Visualize the data.

It's good to visualize your data so you can see its distribution (where the center of the data occurs and how the observations are spread out around that center).
One useful way is to use histograms, which are graphs that display how often the values fall into each interval (or bin).
Box plots, like histograms, are organized to give you a sense of dispersion and skewness. I like box plots because you can pinpoint the extreme values.
Scatter plots are used to see how bivariate data are distributed. This is when you determine if there is a correlation between x and y -- and if it's positive or negative.
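The ideas behind these three plots can be sketched without a plotting library. This is a rough, text-only illustration in Python with made-up values. The helper names (text_histogram, box_plot_outliers, pearson_r) are hypothetical, and the 1.5 x IQR rule for flagging extreme values is a common box plot convention that I'm assuming here, not something defined in the notes above.

```python
from collections import Counter
from math import sqrt
from statistics import mean, quantiles


def text_histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    return Counter((v // bin_width) * bin_width for v in values)


def box_plot_outliers(values):
    """Return values outside the 1.5 * IQR fences (assumed convention)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired observations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up values; 40 is a deliberate extreme value.
values = [2, 3, 3, 4, 5, 6, 40]

# Histogram: one row of '#' marks per bin of width 10.
bins = text_histogram(values, bin_width=10)
for edge in sorted(bins):
    print(f"{edge:>3}-{edge + 9:<3} {'#' * bins[edge]}")

# Box plot: the points that would be drawn beyond the whiskers.
outliers = box_plot_outliers(values)
print("extreme values:", outliers)

# Scatter plot: the sign of r says whether the correlation is positive or negative.
r_up = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
r_down = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
print("r (positive example):", r_up)
print("r (negative example):", r_down)
```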

Cool, huh? One last point to consider is sensitivity to outliers. Sample means and standard deviations are sensitive to outliers, whereas medians and IQRs are far more resistant.
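To see the sensitivity to outliers in action, here's a small sketch that summarizes the same made-up data twice, once as-is and once with the largest value swapped for an extreme one. It's in Python (not R), and the summarize helper is a hypothetical name used only for this illustration.

```python
from statistics import mean, median, quantiles, stdev


def summarize(values):
    """Return center and spread measures for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
        "iqr": q3 - q1,
    }


# Made-up data; the second list replaces the largest value with an outlier.
clean = summarize([2, 3, 3, 4, 5, 6, 7])
with_outlier = summarize([2, 3, 3, 4, 5, 6, 40])

for name in ("mean", "median", "sd", "iqr"):
    print(f"{name:>6}: {clean[name]:.2f} -> {with_outlier[name]:.2f}")
```

In this example the median and IQR don't move at all, while the mean and standard deviation both jump.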

15 November 2012

daily joy and appreciation.

Can you believe it's already Thanksgiving next week? It seems like this morning I was looking in the mirror to fix my cap and gown. Anyways, I recently reread a book that was given to me a while back titled, "Politically Incorrect Secrets for Getting Through College" written by Dr. Nicole Radziwill. As a gift to all students everywhere, Dr. Radziwill provides a link to the free pdf ebook!

As stated in the title, the book provides secrets for getting through college in a funny, smart, and motivational way. I enjoyed the book because it gave me a new perspective on things. The cool part about the book is that these secrets can be applied at any time. I'm happy that I stumbled across this book again because now I have the politically incorrect secrets for getting through post-grad life :)

Dr. Radziwill explains a three-point plan for success and making your dreams come true. One of these points says to choose a daily joy and appreciation. I'd like to share my daily joy and appreciation with you all!

While taking a drive today, I got to relax and really enjoy this sunset! It was so peaceful that I had to dangerously pull out my phone for a picture. My appreciation for today is dedicated to Occam's Razor by Avinash Kaushik, a blog that's all about decision-making and web analytics.

web analytics gold.

Fact: Avinash Kaushik is a pure genius.

Avinash Kaushik is the Digital Marketing Evangelist for Google, the Co-Founder and Chief Education Officer for Market Motive, and the author of my next two must-reads (Web Analytics 2.0 and Web Analytics: An Hour A Day).

Occam's Razor by Avinash Kaushik. If you're infatuated with web analytics, then it should be a requirement to read everything from this blog. I appreciate this blog because you can learn about web analytics and because Kaushik's articles radiate inspiration.

I'm still rummaging through all of his great blog posts, but I just finished reading and taking notes on Web Analytics 101: Definitions: Goals, Metrics, KPIs, Dimensions, Targets. Yes, I take notes in my Moleskine on everything and anything I find awesome and useful. Kaushik is brilliant because he explains concepts and terms with examples that just make sense. Before reading this article, I couldn't tell a metric from a dimension, but Kaushik defines the terms extremely well. A metric (count or ratio) is simply a number, but a dimension is an attribute of the visitor and their activity on your website. Dimensions are also important for analysis because they help to group your web data.
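Here's how I picture the difference, as a rough Python sketch. Everything in it is hypothetical: the visit records, the "source" dimension, and the metrics_by_dimension helper are invented for illustration and don't come from Kaushik's article or from any analytics tool.

```python
from collections import defaultdict

# Hypothetical visit records. "source" is a dimension (an attribute of the visit);
# the visit count and conversion rate computed below are metrics (numbers).
visits = [
    {"source": "search", "converted": True},
    {"source": "search", "converted": False},
    {"source": "search", "converted": False},
    {"source": "email", "converted": True},
    {"source": "email", "converted": False},
    {"source": "social", "converted": False},
]


def metrics_by_dimension(records, dimension):
    """Group records by a dimension and compute metrics for each group."""
    grouped = defaultdict(lambda: {"visits": 0, "conversions": 0})
    for record in records:
        bucket = grouped[record[dimension]]
        bucket["visits"] += 1                             # a count metric
        bucket["conversions"] += int(record["converted"])
    return {
        value: {**m, "conversion_rate": m["conversions"] / m["visits"]}  # a ratio metric
        for value, m in grouped.items()
    }


report = metrics_by_dimension(visits, "source")
for source, m in report.items():
    print(source, m["visits"], f"{m['conversion_rate']:.0%}")
```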

In addition to the examples, Kaushik plants pieces of advice and reminders throughout his articles that reinforce your understanding and learning. Now whenever I think about business objectives I'll remember that they must be DUMB: Doable, Understandable, Manageable, Beneficial. If you want to learn more about web analytics and relevant topics, then Occam's Razor is the place to go!!

don't be screened out.

Whether you're looking for your first job or looking to advance in the professional workforce, you may be asked to schedule a phone interview.

With more people earning degrees, the competition becomes increasingly fierce. Also, as a result of advancing technology, we can practically learn about anything we desire! Basically, the chances of anyone getting an offer are slimming down as more smarties pop out. There's also the economy... but I won't go there.

On the other hand, it's just as tough, if not tougher, for companies to find these smarties! Again, better technology means more competition, which applies to companies too. As companies are faced with a million problems, they post job openings in hopes of finding someone who can solve these problems.

HMM? What's the fast and efficient way of sorting through all the candidates? Oh yeah, phone interviews. Interviews are stressful by nature, but there are ways to combat those sweaty palms. Prepare yourself by learning all about your future employer and their hiring practices. Be energetic because who wants to hire a Negative Nancy?

While I was preparing, I found an article on CNNMoney that had some useful tips. In "Don't wear pajamas for a phone interview," Annie Stevens suggests wearing business attire, eating a medicated cough drop beforehand, having a photo of the interviewer on your computer screen, and taking notes. Interesting, huh?! Check out the article for other tips like these!

10 November 2012

more than raw.

Hi! Do you remember your earliest years of science class? Last night I tried to 5S my closet but stopped when I found my lab notebook from sixth grade!! I kept this notebook because of my sixth grade epiphany, which was realizing that I enjoy science.

This lab notebook contains the first lab experiment I conducted that required me to collect raw data. I remember going home and stressing over the numbers that didn't help me answer the questions for analysis. Back when books were still the main way to find answers, I pulled out my textbook and found equations that used my raw data to find derived data and BAM BAM BAMMM!!! It was like looking through a microscope and finding the perfect adjustment for a magnified and crystal clear view of a chloroplast. The analysis was clearer than ever.

Why am I telling you guys about this? I was inspired to share my nerdy moment because of an article that I found on Viget's Advance blog. I'm infatuated with this company and their blogs. The article (Change is Good) was written in 2010, but the information is still very relevant and useful.

The author, Viget marketing strategist Anjali Merchant, explains why focusing on raw numbers only reveals raw numbers. Check it out!

08 November 2012

political numbers.

Let's talk politics.

Okay, just kidding. Political science was never my thing but thanks to the Revolution Analytics blog and a post from yesterday (How Nate Silver won the election with Data Science) by David Smith, politics became my thing... or at least for about 20 minutes!

Smith explains the great details of Nate Silver's successes as a statistician such as using many data sources, understanding correlations, consistency in methodologies, and great communication skills. Check out Smith's article!

Nate Silver's forecasting analysis concluded that President Obama had a 90.9% chance of winning! I'm a big believer in numbers and numbers do not lie. Whether politics or data are your thing, read "As Nation and Parties Change, Republicans Are at an Electoral College Disadvantage" because I guarantee you'll learn something (plus, the graphs and charts are neat)!

07 November 2012

something to think about.

On a daily basis, we are required to make all sorts of decisions. Amid the sheer number of them, I lost my sense of what it means to really make a decision. Yesterday, I came across an insightful blog post written by one of the most influential people in my life, Dr. Nicole Radziwill.

In "Decidere: The Power of Decision," Dr. Radziwill notes the Latin origin of the word "decision," which is decidere - to cut off all other options. I (naively) thought that post-grad life would be a breezy walk in the park, but I was so wrong... and I'm happy that I was wrong. The anxiety of making the wrong decision plagues my mind every day, but why should I worry about this?

"Being submerged in a continual stream of decisions not only weakens mental energy, but depletes emotional reserves (and willpower) too."

When I started the search for my first job in the professional workforce, I was worried about anything and everything. Where and when should I apply to jobs for better chances to be hired? Will I have a chance with this company? The infinite list of worries and questions clouded my judgment and, ultimately, deterred my endeavors to find a job.

Thank you, Dr. Radziwill, for illuminating my subconscious and showing me how to improve the quality of life by making decisions.

04 November 2012

data and the future.

I recently found an awesome blog that provides information on analytics! The website is called Decision Stats and I think it's great because it shows readers where they can learn for free!

Right now I'm learning how to incorporate the data that is collected from Google Analytics into R to forecast information!! I'll be keeping in touch with the results!

show me the numbers.

Why am I crazy about numbers? Well, it all started last year with my senior thesis. A group of us led by Dr. Nicole Radziwill had the opportunity to analyze the production data from Starr Hill Brewery and the goal was to improve the overall beer brewing process. By using R to play with the numbers, we were able to use:

descriptive statistics to characterize key metrics in the brewing process
k-means clustering and derivative dynamic time warping to distinguish between good and bad batches
multiple regression models to predict product volume loss during manufacturing
and an ANOVA (analysis of variance) to determine if there was a statistically significant difference in percentage losses between products

Cool huh? The project also employed Lean Six Sigma management methods and the DMAIC framework. Any time there are problems to solve with data, count me in!!
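To give a feel for the ANOVA step, here's a minimal sketch of the one-way ANOVA F statistic written out by hand. It's in Python rather than the R we used, the numbers are invented toy values (not the brewery's data), and the anova_f helper is a hypothetical name. It stops at the F statistic; judging significance would still mean comparing F against an F distribution, which this sketch doesn't do.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean squares."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: how far values are from their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Invented percentage losses for three made-up products.
product_a = [1.0, 2.0, 3.0]
product_b = [2.0, 3.0, 4.0]
product_c = [5.0, 6.0, 7.0]

f_stat = anova_f([product_a, product_b, product_c])
print("F statistic:", f_stat)
```

A large F means the product means differ by more than the scatter within each product would suggest.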