Datathon 2018

This page houses information for the 2018 Datathon event.

Basic Statistics

Statistics can help you communicate trends in data. You might find some of the following basic statistics helpful in your analysis:

  • Mean 
  • Mode 
  • Median 
  • Standard deviation
  • p-Value

 

Mean

Mean = Average
Knowing the average gives you a typical value for the data, and a rough guide to how much the next entry might be
  • To get an average, add all the entries together and divide by the number of entries
In this example, about 258.3 is the average

250 + 300 + 225 = 775
775 ÷ 3 ≈ 258.3 is the mean (or average)
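
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the three entries from the example; the variable names are illustrative.

```python
from statistics import mean

# The three entries from the example above
entries = [250, 300, 225]

total = sum(entries)            # add all the entries together
average = total / len(entries)  # divide by the number of entries

print(total)                     # 775
print(round(average, 2))         # 258.33

# The standard library's mean() does the same calculation
print(round(mean(entries), 2))   # 258.33
```

To answer a question like "is this entry above or below average?", compare the entry to `average` directly, for example `300 > average`.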

What is the mean good for?

Sometimes it’s useful to know if something is above or below average.

For example, is your grade above or below the average of the rest of the class?

Mode

Mode = the most repeated number or category

  • To find the mode, write down all the data and count how often each value or category appears; the one that appears most often is the mode

The mode is the only measure of average that can be used with nominal data (i.e. categories).

For example, late-night users of the library were classified by faculty as:

  • 14% science students
  • 32% social science students
  • 54% biological sciences students

The median or mean cannot be calculated (what would it be, a bio-soci-sci student?).

The mode is biological science students since they are most common.
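
Counting categories is easy to do in code. The sketch below assumes a hypothetical tally of 50 late-night users, chosen only because it reproduces the 14% / 32% / 54% split above; the real head count is not given.

```python
from collections import Counter

# Hypothetical tally of 50 users that matches the percentages above:
# 7/50 = 14%, 16/50 = 32%, 27/50 = 54%
users = (["science"] * 7
         + ["social science"] * 16
         + ["biological sciences"] * 27)

# Count how often each category appears
counts = Counter(users)

# The mode is the category with the highest count
mode_category, mode_count = counts.most_common(1)[0]

print(mode_category)                          # biological sciences
print(round(100 * mode_count / len(users)))   # 54
```

Note that the categories are never added or averaged, only counted, which is why the mode works for nominal data.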

What is the mode good for?

The mode can be used for scheduling: at what hours are most people buying the product, coming in to the store, or calling for help?

The mode is most helpful when a single value is repeated much more often than others.

Median

Median = Middle value
  • To find the median: Put all the numbers in order
–If there is an odd number of results, the median is the middle number
–If there is an even number of results, the median is the mean of the two central numbers
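
The two cases above (odd and even number of results) can be sketched in Python. The numbers are made up for illustration, and `median_by_hand` is a hypothetical helper name.

```python
from statistics import median

def median_by_hand(values):
    """Follow the steps above: sort, then pick the middle."""
    ordered = sorted(values)      # put all the numbers in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of results: the middle number
        return ordered[mid]
    # Even number of results: the mean of the two central numbers
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_results = [7, 3, 9, 1, 5]        # sorted: 1, 3, 5, 7, 9
even_results = [7, 3, 9, 1, 5, 11]   # sorted: 1, 3, 5, 7, 9, 11

print(median_by_hand(odd_results))   # 5
print(median_by_hand(even_results))  # 6.0

# The standard library's median() gives the same answers
print(median(odd_results))           # 5
print(median(even_results))          # 6.0
```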
 
What’s the median good for?
–If you have outliers (numbers that are much higher or lower than the rest of your data), those outliers skew the average so that it is not a good description of the data
–Skewed data leans more toward one side than the other
–Some data is generally reported as a median, such as rent and income
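
To see why the median is preferred for data like rent, compare the mean and median of a small set with one outlier. The rents below are invented for illustration only.

```python
from statistics import mean, median

# Hypothetical monthly rents; the last one is an outlier
rents = [900, 950, 1000, 1050, 6100]

print(f"mean:   {mean(rents):.0f}")    # mean:   2000
print(f"median: {median(rents):.0f}")  # median: 1000

# Four of the five rents are near 1000, so the median describes
# a typical rent better than the mean, which the outlier pulls upward.
```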

Standard Deviation

Standard deviation summarizes the amount by which every value within a dataset varies from the mean (average)
  • For step-by-step instructions on calculating the standard deviation, see Khan Academy: https://www.khanacademy.org/math/probability/data-distributions-a1/summarizing-spread-distributions/a/calculating-standard-deviation-step-by-step
When the values in a dataset are close together the standard deviation is small.
When the values are spread apart the standard deviation is large.
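
The sketch below compares two made-up datasets with the same mean, one close together and one spread apart. It assumes the population form of the standard deviation (divide by the number of values); the sample form divides by one less than that and is available as `statistics.stdev`. The helper name `population_sd` is illustrative.

```python
import math
from statistics import pstdev

def population_sd(values):
    """Population standard deviation, computed step by step."""
    m = sum(values) / len(values)                 # 1. find the mean
    squared = [(v - m) ** 2 for v in values]      # 2. square each deviation
    return math.sqrt(sum(squared) / len(values))  # 3. average them, take the root

# Both datasets have a mean of 50
close_together = [48, 49, 50, 51, 52]
spread_apart = [10, 30, 50, 70, 90]

print(round(population_sd(close_together), 2))  # 1.41
print(round(population_sd(spread_apart), 2))    # 28.28

# The standard library's pstdev() computes the same quantity
print(round(pstdev(close_together), 2))         # 1.41
```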
 
What's the standard deviation good for?
 
In many datasets, values deviate from the mean only by chance, and such datasets often display a normal distribution. In a dataset with a normal distribution, most of the values are clustered around the mean, while a few values are extremely high or extremely low (the bell curve). Many natural phenomena display a normal distribution.
 
When a value is more than two standard deviations away from the average, it sits in the tails of a bell curve distribution (about 95% of normally distributed values fall within two standard deviations of the mean), and it's telling you that the situation is not normal.
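
As a rough illustration of the two-standard-deviation rule, the sketch below flags unusual values in an invented list of daily visitor counts. It assumes the population standard deviation; with very small datasets a single extreme value inflates the standard deviation itself, so treat the result as a hint rather than a verdict.

```python
from statistics import mean, pstdev

# Hypothetical daily visitor counts; one day looks out of the ordinary
visitors = [50, 52, 48, 51, 49, 50, 53, 47, 50, 90]

average = mean(visitors)
sd = pstdev(visitors)

# Keep only the values more than two standard deviations from the mean
unusual = [v for v in visitors if abs(v - average) > 2 * sd]

print(unusual)  # [90]
```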
 
Source: Quora

p-Value

The p-Value is used to show statistical significance: evidence that the relationship between variables is stronger than random chance alone would produce
–In hypothesis testing, the null hypothesis = there is no relationship.
–A researcher calculates a p-Value, which is the probability of observing an effect at least as large as the one measured, given that the null hypothesis is true.
The p-Value is a number between 0 and 1.
A small p-Value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, suggesting that there is a relationship between the two variables and that the result is unlikely to be due to chance alone. It's statistically significant.
A large p-Value (typically > 0.05) indicates weak evidence against the null hypothesis, so there may be no relationship between the variables.
–The cutoff for significance is different in different fields of study. In some fields, it’s much lower than 5%.
–p-Values close to the 0.05 cutoff are marginal, so always report your p-Values, and let the readers make their own conclusions.
 
Finding the p-Value is complex; for more instruction, go here: https://www.wikihow.com/Calculate-P-Value
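
One simple case where a p-Value can be computed exactly is a coin-flip experiment. This is a sketch of a one-sided exact binomial test, not the method from the linked guide: the null hypothesis is that the coin is fair, and the p-Value is the probability of getting at least the observed number of heads. The function name and the flip counts are made up for illustration.

```python
from math import comb

def p_value_at_least(heads, flips):
    """Probability of seeing `heads` or more heads in `flips` tosses
    of a fair coin (the null hypothesis)."""
    # Count the outcomes with `heads` or more heads, out of 2**flips
    # equally likely outcomes
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favourable / 2 ** flips

# 15 heads in 20 flips: unlikely under a fair coin
print(round(p_value_at_least(15, 20), 4))  # 0.0207

# 12 heads in 20 flips: quite plausible under a fair coin
print(round(p_value_at_least(12, 20), 4))  # 0.2517
```

With the usual 0.05 cutoff, the first result would count as statistically significant and the second would not.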
 
What's the p-value good for?
If you want to test someone's hypothesis, you could conduct your own tests and see whether your data supports or contradicts it. This is sometimes called validating a claim. It's important in science, because scientific claims need to be repeatable.