Statistics vocab and methods

… Population: The set of ‘things’ of interest, e.g. the students at your school.

… Parameter: a number (e.g. mean height) that describes the whole population.

… Census: a collection of every data point in the population.

… Sample: a selection or subset taken from the population.

… Sampling Frame: The list of population members from which you are going to take a sample.

… Statistic: A number calculated from a sample that can be used to estimate a population parameter.

… What are the advantages and disadvantages of a sample vs. census? Which would you use to find the mean life of bulbs produced by a factory?

… The sampling distribution of a statistic is the probability distribution of all possible values of that statistic. For example, if you took a random sample of 100 people from a town to obtain their mean height, you could do this many times to get different mean values. We could then show the distribution of these means (which would have a much smaller spread than the distribution of heights in the population).

Sampling methods:

… Simple Random Sampling

e.g. number every member of the population, then select members at random until you reach your required sample size.

… Systematic Sampling

e.g. Split up a population of 3000 into 60 groups. Each group will therefore contain 3000/60 = 50 members.

We will choose a single member from each group, i.e. a sample size of 60.

Now choose a random number between 1 and 50. That numbered member of the population is chosen and then every 50th member thereafter.

Note that a portion at the end of the sampling frame may be left out of the sampling procedure. To avoid this, a random number can be generated to start at a position which will give data in that end portion a chance of being selected.
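The systematic procedure above can be sketched in Python. This is only an illustration, assuming the sampling frame is simply the numbers 1 to 3000; the function name systematic_sample is made up for this example.

```python
import random

def systematic_sample(frame, sample_size, rng=random):
    """Pick a random start in the first group, then every k-th member."""
    k = len(frame) // sample_size      # group size, e.g. 3000 // 60 = 50
    start = rng.randrange(k)           # random index 0..k-1 within the first group
    return [frame[start + i * k] for i in range(sample_size)]

frame = list(range(1, 3001))           # assumed frame: members numbered 1..3000
sample = systematic_sample(frame, 60)
print(len(sample), sample[:3])
```

Each run gives a different starting member, but consecutive chosen members are always 50 apart.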

… Stratified sampling

Find the fraction of the population that each group (a ‘stratum’, or layer) represents and multiply this by the sample size required.

E.g. There are 60 students in year 13 and the school has 800 students in total. For a total sample size of 40 we would take:

60/800 x 40 = 3 students from year 13 using a simple random sampling method.

Remember that if you get decimals you may need to adjust the sample sizes of each category up or down to the nearest whole number. Do check by adding up the samples at the end!

For example, if your total sample size was 80 and your stratified sample sizes for each group were:

23.8, 41.7, 14.5

then your adjusted sample sizes would be

24, 42, 14 (notice that we rounded the last sample DOWN rather than up otherwise they would have added to 81!)
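The stratified calculation and the rounding check can be sketched in Python. The notes do not prescribe which group to adjust, so this sketch assumes a ‘largest remainder’ rule (round everything down, then give the spare places to the groups with the largest decimal parts); the function name round_to_total is made up.

```python
import math

def round_to_total(exact_sizes, total):
    """Round group sizes so that they add up to the required total."""
    sizes = [math.floor(x) for x in exact_sizes]
    spare = total - sum(sizes)                     # places still to hand out
    # Groups ordered by largest decimal part first (assumed rule)
    order = sorted(range(len(sizes)),
                   key=lambda i: exact_sizes[i] - sizes[i], reverse=True)
    for i in order[:spare]:
        sizes[i] += 1
    return sizes

year_13 = 60 * 40 / 800                            # 60/800 of a sample of 40
print(year_13)                                     # 3.0
print(round_to_total([23.8, 41.7, 14.5], 80))      # [24, 42, 14]
```

For the example above this rule happens to give the same adjusted sizes as the notes, and the total is guaranteed to match the required sample size.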

The above sampling methods require you to number every member of the population, which may not be practical if the population has millions of members!

In that case, the following sampling methods can be used:

… Opportunity Sampling

e.g. choose whatever data points you have access to until you have your required sample size.

… Quota Sampling

This is the same as Opportunity Sampling, but using distinct groups to represent your sample.

e.g. you might want a sample of 20 boys and 20 girls from the students in your school. You could survey students for these groups by standing at the school gate and asking whoever turns up until you have reached 20 in each group.

… Cluster Sampling

Cluster Sampling involves dividing the population into distinct groups, or ‘clusters’, then randomly selecting a certain number of these clusters. Then, randomly sample within the selected clusters to get the sample size required. This method is often used when a population is spread out across a large geographical area. You can identify several representative ‘cluster’ areas and then sample within them using an appropriate method.

Note that these methods may introduce significant bias: for example, only Year 7 students might turn up at the school gate at the particular time you were taking a survey!

Always try to think about HOW you would obtain the data and where you might be able to obtain a sampling frame (list).

Representing data and measures of dispersion

… The mode of a distribution is the x value at the highest point.

… Skewness of distributions can be found visually by looking at the trailing ‘tail’ of the distribution.

A stretched tail in the positive x direction is described as ‘positive skew’ and will ‘pull’ the median and mean to the right: mode < median < mean. Positive skew is also indicated if Q3 – Q2 > Q2 – Q1

A stretched tail in the negative direction is described as ‘negative skew’ and will ‘pull’ the median and mean to the left: mean < median < mode. Negative skew is also indicated if Q2 – Q1 > Q3 – Q2

… Skewness of a distribution can be calculated using 3(mean – median)/σ.

A positive value (e.g. 1) = positively skewed

0 = symmetric distribution

A negative value (e.g. -1) = negatively skewed
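As a quick check of the skewness formula, here is a short Python sketch. The data list is made up for illustration, and σ is taken as the population standard deviation (pstdev), which is an assumption.

```python
import statistics

def pearson_skew(data):
    """Skewness = 3(mean - median)/sigma."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    sigma = statistics.pstdev(data)    # population standard deviation (assumed)
    return 3 * (mean - median) / sigma

data = [1, 2, 2, 3, 3, 3, 4, 10]       # made-up data with a long right-hand tail
print(pearson_skew(data))              # positive, so positive skew
print(pearson_skew([1, 2, 3, 4, 5]))   # symmetric data gives 0
```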

… For skewed data, the median is the better average to use as it will be more representative of the given distribution (it is less affected by outliers)

… Outliers are commonly defined as data values which are:

a) at least 1.5 x IQR beyond the nearer quartile (Q1 or Q3)

b) at least 2 standard deviations from the mean (MEI)

If an outlier is detected, further investigation is needed to see if it should be removed from the original data.

… When plotting box and whisker diagrams, extend the ‘whiskers’ ONLY up to the outlier boundaries (if outliers have been identified in the question). Mark any outliers with an ‘X’ on your diagram.

… Finding Quartiles for DISCRETE data, placed in order of increasing size. (Note that this is NOT for grouped data, which we would have to assume is continuous).

1) The (n+1)/4 method

If you get a decimal, find the mean between the values above and below

2) The EdExcel n/4 method

If you get a whole number, find the mean between that value and the value above.

If you get a decimal, round UP to the value above.

3) The ‘splitting’ method.

First calculate the median position using (n+1)/2

Then count how many values are below this (for example, let’s say there are N=6 values below the median)

use (N+1)/2 to find the Q1 position, counting from the start of the data.

If you get a decimal, take the mean between the values above and below.

All three methods will usually only give estimates for Q1 and Q3, as they often don’t split the data exactly into 25% : 75% (for Q1) or 75% : 25% (for Q3).

The EdExcel method does work, but tends to be less accurate in some cases.

Method 1) is probably the easiest because you can also use it to find percentiles.

For example, if you wanted to find the 30th percentile for 17 discrete data values:

P30 position = (30/100)(n+1) = 0.3(18) = 5.4th position.

This is a decimal, so P30 = the mean between the 5th and 6th values.

However, the EdExcel method will also work in a similar way.

–

Finding Quartiles of discrete data using the n+1 method

1, 5, 7, 8, 11, 14

The Q1 position = (n+1)/4 = 7/4 = 1.75th term… but as this is a decimal, find the mean between the values above and below:

Q1 = (1+5)/2 = 3

1, 5, 7, 8, 11, 14, 17, 24

The Q3 position = 3/4 × (8+1) = 6.75th term… but as this is a decimal find the mean between the values above and below = (14 + 17)/2 = 15.5

This method can also be used to find percentiles, e.g.

30th percentile position = (30/100) × (8+1) = 2.7th value

This is a decimal, so find the mean between the values above and below this position:

P30 = (5+7)/2 = 6
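The (n+1) rule above can be written as a short Python function to check your working. This is a sketch of the rule exactly as described in these notes (not a library percentile function); the name percentile_n_plus_1 is made up, and it assumes the position lies between 1 and n.

```python
from fractions import Fraction

def percentile_n_plus_1(data, percent):
    """Percentile by the (n+1) rule; assumes 1 <= position <= n."""
    ordered = sorted(data)
    position = Fraction(percent * (len(ordered) + 1), 100)   # 1-based position
    below = int(position)                                    # whole-number part
    if position == below:                                    # lands exactly on a value
        return ordered[below - 1]
    return (ordered[below - 1] + ordered[below]) / 2         # mean of the neighbours

print(percentile_n_plus_1([1, 5, 7, 8, 11, 14], 25))          # Q1 -> 3.0
print(percentile_n_plus_1([1, 5, 7, 8, 11, 14, 17, 24], 75))  # Q3 -> 15.5
print(percentile_n_plus_1([1, 5, 7, 8, 11, 14, 17, 24], 30))  # P30 -> 6.0
```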

–

Finding quartiles using the ‘splitting’ method

First calculate the median, Q2, which is found from the (n+1)/2 term.

You may need to find the median between two values, for example:

1, 5, 5, 8, 11, 11

Median position = (6+1)/2 = 3.5th term, which is between the values 5 and 8.

So median = the mean of 5 and 8 = (5+8)/2 = 6.5

To find the lower quartile, Q1, count the number of terms, n, to the LEFT of the median. Q1 is then found from the (n+1)/2 term counting from the beginning of the data.

In the above case, our median was 6.5 and there are 3 numbers to the left:

Q1 position = (3+1)/2 = 2nd term

So Q1 = 5

To find the upper quartile, Q3, count the number of terms, n, to the RIGHT of the median. Q3 is then found from the (n+1)/2 term, counting from the first value to the RIGHT of the median.

In the above case, our median was 6.5 and there are 3 numbers to the right (8,11,11):

Q3 position = (3+1)/2 = 2nd term

So Q3 = 11
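The ‘splitting’ method amounts to finding the median of each half of the data, which can be sketched in Python as below. The function name is made up, and the sketch assumes at least two data values; when n is odd, the median value itself is left out of both halves.

```python
import statistics

def quartiles_by_splitting(data):
    """Return (Q1, Q2, Q3) using the median of each half of the ordered data."""
    ordered = sorted(data)
    half = len(ordered) // 2                       # number of values each side of the median
    lower, upper = ordered[:half], ordered[-half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

print(quartiles_by_splitting([1, 5, 5, 8, 11, 11]))   # (5, 6.5, 11)
```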

–

Finding Quartiles using the EdExcel method

1, 5, 7, 8, 11, 14

The Q1 position = n/4 = 6/4 = 1.5th term… but as this is a decimal, round UP to the next term:

Q1 = 5

1, 5, 7, 8, 11, 14, 17, 24

The Q3 position = 3/4 × 8 = 6th term… and as this is a whole number, find the midpoint between this term and the next term = (14 + 17)/2 = 15.5

This method can also be used to find percentiles, e.g.

30th percentile position = (30/100) × 8 = 2.4th value

This is a decimal, so round UP to the next value:

P30 = 7
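For comparison, here is the n/4 rule as described above in the same style. Again this is only a sketch with a made-up function name, and it assumes the position is less than n.

```python
import math
from fractions import Fraction

def percentile_n_method(data, percent):
    """Percentile by the n rule: whole position -> mean with next value, decimal -> round UP."""
    ordered = sorted(data)
    position = Fraction(percent * len(ordered), 100)   # 1-based position
    if position.denominator == 1:                      # whole number
        k = int(position)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[math.ceil(position) - 1]            # decimal: round UP

print(percentile_n_method([1, 5, 7, 8, 11, 14], 25))          # Q1 -> 5
print(percentile_n_method([1, 5, 7, 8, 11, 14, 17, 24], 75))  # Q3 -> 15.5
print(percentile_n_method([1, 5, 7, 8, 11, 14, 17, 24], 30))  # P30 -> 7
```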

–

… The interquartile range is a measure of how spread out (or dispersed) the middle 50% of data is. IQR = Q3 – Q1.

… The interpercentile range is a measure of spread. For example, a commonly used interpercentile range is P90 – P10, which tells us the spread of the central 80% of the data.

… Q4 is the value of the largest term in the data set

… Calculating Quartiles for CONTINUOUS data, OR for any data in a grouped frequency table (because we have to assume it is continuous).

no need to add 1 to the number of observations

Use interpolation (position on the bottom, score on the top)

Q1 position = 0.25n

Q2 position = 0.5n

Q3 position = 0.75n

For grouped frequency tables, identify the group containing the quartile you want and use the lower and upper class boundaries to set up your interpolation scores. For example:

Time/min.    No. Students    Cumul. Freq
30-31        2               2
32-33        25              27
34-36        30              57
37-39        13              70

Q1 position = 70/4 = 17.5, which occurs in the 2nd group (time in this example is continuous data, so there is no need for the +1)

31.5        Q1          33.5     <- scores
|-----------|-----------|
2           17.5        27       <- positions

Scores ratio = Positions ratio

(Q1 – 31.5) / (33.5 – 31.5) = (17.5 – 2) / (27 – 2)

So Q1 = 31.5 + 2 × (15.5/25) = 32.74

… Note that if Q1 landed in the first group, then the positions would run from 0 to 2 (on the bottom of the chart above).
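The interpolation step can be checked with a few lines of Python. This sketch uses the boundaries and cumulative frequencies from the table above; the function name interpolate is made up.

```python
def interpolate(position, pos_low, pos_high, score_low, score_high):
    """Linear interpolation: scores ratio = positions ratio."""
    fraction = (position - pos_low) / (pos_high - pos_low)
    return score_low + fraction * (score_high - score_low)

# Q1 position 17.5 lies in the 32-33 group (boundaries 31.5 to 33.5, positions 2 to 27)
q1 = interpolate(17.5, 2, 27, 31.5, 33.5)
print(round(q1, 2))   # 32.74
```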

Variance and Standard Deviation

… ‘Deviations’ in a set of data can be found by subtracting the mean from each data point, (x – xbar)

… Variance is also a way to measure the spread of data about a mean value (xbar). It is calculated by finding the mean of all the squares of the deviations (x – xbar). Squaring removes any negative values if x is below the mean.

So:

Variance = mean square deviation, msd = Σ(x – xbar)² / n

… Note that in more advanced statistics, when we calculate the variance of a SAMPLE taken from a POPULATION, we divide by n-1 instead of n. This is because the calculated variance of the sample (using n) will tend to UNDERESTIMATE the variance of the population from which the sample was taken.

So using ‘n-1’ in the calculation results in a better estimate of the population variance. Check your exam syllabus – usually you will only need to use the ‘n’ version :).

Here is a useful video explaining why we do this:

https://youtu.be/KkaU2ur3Ymw

… Sxx is an abbreviation for the sum of the squares of the deviations,

Sxx = Σ(x – xbar)² , or

Sxx = Σ(x²) – n(xbar)², or

Sxx = Σ(x²) – (Σx)²/n (EdExcel)

or for a grouped frequency table:

Sxx = Σf(x – xbar)² , or

Sxx = Σ(fx²) – n(xbar)², or

Sxx = Σ(fx²) – (Σfx)²/n (EdExcel), where n = Σf

… Variance = Sxx/n (if your data is the entire population)

or

Variance = Sxx/(n-1) if your data is a sample from the population. This gives the best unbiased estimate of the population variance (MEI)

… Standard deviation, σ = √variance = √(Sxx/n) …or √(Sxx/(n-1) )

… Another useful formula for Variance is “the mean of the squares minus the square of the mean”

Variance = ( Σ(x² ) / n ) – xbar²

(But note that you would have to multiply this whole formula by n/(n-1) to get the best unbiased estimate of the population if your data were from a sample… This is usually only relevant to MEI candidates).

… If your data is summarised in a frequency table, then n = Σf and so:

EDEXCEL / AQA

Variance = Σf(x – xbar)² / Σf = Σ(fx²) / Σf – ( Σfx / Σf )²

These are exactly the same as the above equations, just with frequencies. Note that if you have a grouped frequency table, you will have to estimate the x values by taking the midpoint of each class.

MEI

Sxx = Σ(x – xbar)² = Σ(x²) – n(xbar)²

… But if your data is summarised in a frequency table, then:

Sxx = Σf(x – xbar)² = Σ(fx²) – n(xbar)²

Note that if you have a grouped frequency table, you will have to estimate the x values by taking the midpoint of each class

… Standard deviation and variance come in two flavours, one when the whole population has been measured, and the other for when a sample has been used to estimate statistics for the population:

σ² = population variance = mean square deviation (msd) = Sxx/n

σ = population standard deviation = root mean square deviation (rmsd) = √(Sxx/n)

s² = sample variance = Sxx/(n-1)

s = sample standard deviation = √( Sxx/(n-1) )
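The two ‘flavours’ can be compared in Python. The data values below are made up for illustration; the results are cross-checked against the functions in Python's statistics module.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]                   # made-up data
n = len(data)
xbar = sum(data) / n
sxx = sum(x * x for x in data) - n * xbar ** 2    # Sxx = Σ(x²) – n(xbar)²

pop_var = sxx / n                                 # σ² (divide by n)
sample_var = sxx / (n - 1)                        # s² (divide by n-1)
print(sxx, pop_var, math.sqrt(pop_var))           # 32.0 4.0 2.0
print(sample_var, math.sqrt(sample_var))          # slightly larger than the population values
```

Here statistics.pvariance(data) corresponds to σ² and statistics.variance(data) corresponds to s².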

Discrete random variables (when you have a probability distribution of a complete population)

… The expectation is the expected mean of a discrete random variable X, found from its distribution:

E(X) = ΣxP(X=x)

To find the expectation E(X) multiply each probability by the corresponding score, then add the results.

… Variance of a discrete random variable tells us how spread out the distribution is:

Variance = “The expectation of the squares minus the square of the expectation”

Variance = E(X²) – [E(X)]²

To find the expectation E(X²) multiply each probability by the corresponding score SQUARED, then add the results.

… So a typical table for calculating expectation and variance will have rows:

x

P(X = x)

xP(X=x)

x²

x²P(X = x)
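The table method can be mirrored in Python. As an illustration (not taken from an exam question), this sketch uses the distribution of a fair die and exact fractions.

```python
from fractions import Fraction

dist = {x: Fraction(1, 6) for x in range(1, 7)}     # P(X = x) for a fair die
assert sum(dist.values()) == 1                      # probabilities must add up to 1

e_x = sum(x * p for x, p in dist.items())           # E(X)  = Σ x P(X = x)
e_x2 = sum(x ** 2 * p for x, p in dist.items())     # E(X²) = Σ x² P(X = x)
var = e_x2 - e_x ** 2                               # Var(X) = E(X²) – [E(X)]²
print(e_x, e_x2, var)                               # 7/2 91/6 35/12
```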

… Standard deviation (σ) = √variance and is a more meaningful measure of the spread of data.

In a ‘normal’ distribution of data (a bell shaped histogram), about 68% of all the data lies within 1 standard deviation above or below the mean.

+/-2σ accounts for 95% of all the data.

+/-3σ accounts for 99.7% of all the data. Knowing this can be useful, for example, if you are designing a product and want it to work or fit 99.7% of the people who use it.

We can also use this idea to test if an experimental result disproves a hypothesis, given a certain ‘significance level’…

Coding

… Data values can be ‘coded’ into a new data set by adding, subtracting, multiplying or dividing.

Usually you will be applying a coding formula such as:

y = ax + b

where a and b are numbers and y is the new coded data point. From the coded data, we can find σy and ybar (the mean).

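To find the new mean ‘ybar’, apply the coding formula to the original mean:

ybar = a(xbar) + b

To find the new standard deviation, σy

σy = |a|(σx)

Noting that ‘b’ does not affect the spread of the data, and that a standard deviation is never negative (so if a is negative, use its size only).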

Note that:

1) Adding or subtracting has NO effect on the standard deviation

2) Multiplying or dividing the data DOES affect the standard deviation… To find the new standard deviation, multiply (or divide) the original standard deviation by the coding number that you multiplied (or divided) the original data by.

3) Adding or subtracting DOES affect the original mean (simply add or subtract the coding number from the original mean)

4) Multiplying or dividing DOES affect the mean (simply multiply or divide the mean by the coding number)

This is summarised by the following:

E(aX + b) = aE(X) + b

Var(aX + b) = a²Var(X)

StdDev(aX + b) = |a|[ StdDev(X) ]

Car salesman example: the mean number of cars sold per month = 23.6 and the standard deviation = 5.2

The car salesman earns a set amount of £500 plus commission of £100 per car sold

The mean salary per month = 23.6(100) + 500 = £2860

The standard deviation of salary per month = 5.2(100) = £520
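The car salesman example can be checked in Python. The function names here are made up; the numbers come from the example above.

```python
def coded_mean(a, b, mean_x):
    """Mean of y = ax + b."""
    return a * mean_x + b

def coded_sd(a, sd_x):
    """Standard deviation of y = ax + b (b has no effect)."""
    return abs(a) * sd_x

mean_salary = coded_mean(100, 500, 23.6)
sd_salary = coded_sd(100, 5.2)
print(round(mean_salary, 2), round(sd_salary, 2))   # 2860.0 520.0
```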

Histograms

… These questions usually require the ‘counting squares’ type methods, firstly to calculate the frequency represented by each block (counting larger blocks is easier than counting the tiny squares!).

… It is very helpful to make yourself a grouped frequency table with a mid point of each group (x-bar). Then find the sum of the products of f and x-bar and divide by the total frequency to calculate the mean.

… To find the median from a histogram, divide the total frequency by 2 (there is no need to add 1, because data shown in a histogram is treated as continuous). Then count using your blocks to find the bar in which the median occurs. Use the formula Class Width = f/frequency density to find out ‘how far’ into that bar you need to go to reach the median.

Using a calculator to find statistics – this is really useful for checking your calculations!

… You can use your calculator to find a mean, variance, standard deviation and the product moment correlation coefficient, (pmcc):

- MODE > 2 (STAT) > 1 (1-VAR)
- Input your data into the table. To activate the frequency column for each data point: SHIFT > SETUP > scroll down > 3 (STAT) > 1 (ON)
- Press AC (the calculator will store your tabulated data)
- SHIFT > 1 (STAT) > 4 (VAR)
- This will give you the value of:

n = the number of data points

xbar = the mean,

σx = the standard deviation

sx = the sample standard deviation, which uses the n-1 correction as a better (slightly higher) estimate for the population standard deviation from which the sample was taken.

- Using the other menu choices from SHIFT > 1 (STAT) you can also access Σx, Σ(x^2) and the minimum and maximum values.

Permutations

… Permutations are the number of ways you can arrange different objects, so order IS important.

… If you have n objects, then the permutations are n!

… If r of those objects are the same, then there will be n!/r! permutations

… If you choose r objects out of n DIFFERENT objects, then the permutations (arrangements) of doing this are n!/(n-r)! This is notated “nPr”.

Rather than remembering this formula, the easiest way to work this out is using the “spaces method”. For example:

If I have 5 different objects, A B C D E, what are the possible arrangements (permutations) of choosing 3 objects?

_ _ _

There are 5 ways of filling the first space, 4 ways to fill the second space and 3 ways to fill the third space:

5P3 = 5 × 4 × 3 = 60
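You can confirm the ‘spaces method’ answer in Python by listing every arrangement. This is just a check using the standard library (math.perm needs Python 3.8 or later).

```python
from itertools import permutations
from math import factorial, perm

arrangements = list(permutations("ABCDE", 3))   # every ordered choice of 3 letters
print(len(arrangements))                        # 60
print(perm(5, 3))                               # 60, i.e. 5P3
print(factorial(5) // factorial(5 - 3))         # 60, i.e. n!/(n-r)!
```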

Permutations (arrangements) Questions

… In how many different ways can you colour a 3 x 3 grid using blue, green and red paint if each colour is used three times?

(Hint: first find how many arrangements are possible if all the colours were different)

… How many ways can 5 men and 3 women be arranged in a row if no two women are standing next to one another?

(Hint: try using the ‘space out’ method, considering the number of ways of arranging the 5 men, and then the 3 women)

… How many arrangements of the letters BANANA can be made such that no two N’s appear in adjacent positions?

(Hint: first find the number of arrangements of the letters bearing in mind that some letters are repeated. Then treat ‘NN’ as a single letter and find the number of arrangements again).

… In how many ways can the letters in the word SUCCESS be arranged if no two S’s are next to one another?

(Hint: use the ‘space out method’ but note the repetitions)

Combinations

… Combinations are when you choose r objects out of n and the order of the objects DOES NOT matter (for example ABC is the same thing as CBA).

… So if you choose 3 letters out of A, B, C and D, how many possible combinations are there? nCr = 4C3 = n!/((n-r)! r!) = 4!/((4-3)! 3!) = 4

… The nCr combinations are used in binomial expansion, and relate to Pascal’s triangle:

The power on the bracket to be expanded (n) is the row number (starting at n=0)

The power on each x term in the expansion (r) is the column number (starting at r=0)
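A short Python check of the 4C3 example and the link to Pascal's triangle (a sketch using the standard library only):

```python
from itertools import combinations
from math import comb

selections = ["".join(c) for c in combinations("ABCD", 3)]
print(selections)                         # ['ABC', 'ABD', 'ACD', 'BCD']
print(comb(4, 3))                         # 4

row_4 = [comb(4, r) for r in range(5)]    # row n = 4 of Pascal's triangle
print(row_4)                              # [1, 4, 6, 4, 1]
```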

Combinations Questions

… How many different combinations (selections) of 3 letters can be made from the letters A B C D E ?

(Hint: start with the number of permutations possible and then consider that order doesn’t matter in combinations).

… How many ways can a committee of 3 men and 2 women be formed from 7 men and 10 women?

(Hint: consider 5 spaces and work out the combinations of men that can occupy 3 spaces and then the combinations of women that occupy the remaining 2 spaces).

… How many ways are there of selecting 5 letters from the word ADVANTAGE ? (Take care with this one! Consider each case separately where you get 0 A’s, 1 A, 2 A’s and 3 A’s)

Discrete probability distributions (probability mass functions)

… A random variable is usually given a capital letter, e.g. X

… A particular value of the random variable is usually given the lower case letter, e.g. x

… P(X=x) means “the probability that the random variable X is some particular value x”. So the probability distribution of a fair die would be:

x         1     2     3     4     5     6
P(X=x)    1/6   1/6   1/6   1/6   1/6   1/6

… Note that all the probabilities must always add up to 1, which is a good check for your calculations.

… “Show the Probability distribution…” usually means draw a table with rows for x and P(X=x)

… In order to work out probability distributions, you will often need to view the ‘probability space’ for that particular question. For example, here are some common probability space ‘tools’ for questions involving:

- A dice thrown twice, or two coins flipped: use a 2-way table to show the possible outcomes.
- Fair coin flips: use letters to work out probabilities, e.g. If N is the number of tails thrown in three throws, then P(N=1) = P(THH or HTH or HHT) = 3(0.5³)
- Drawing counters from a bag: use a probability tree when the counters are NOT replaced (so the second draw becomes dependent on the first draw).

… The expectation is the mean value. If the question asks something like, “find the expected value…”, then you know that you will need to calculate the expectation.

… Probability distributions.

E(X) = Σ(xp)

To find the expectation E(X) multiply each probability by the corresponding score, then add the results. A table with rows of x and P(X = x) is useful here.

… Var(X) = E(X²) – [E(X)]²

(Find the expectation of the squares and subtract E(X) squared).

… If the data is ‘coded’, for example:

E(aX + b) = aE(X) + b

or

VAR(aX + b) = a²VAR(X)

… The cumulative distribution function, F(x) = P(X ≤ x)

… Probability distributions can be sketched: for discrete distributions use bar graphs with heights representing probability.

… Note that P(A|B) means, “the probability of A happening given that B has already happened”.

S1 Probability (and playing darts)

… The probability of event A happening, P(A), can be thought of as a circle on a dartboard. The larger the area, the more chance there is of a dart that you throw hitting within the circle.

… If two events can happen at the same time (so your dart could land in both circles A and B), then you have an intersection probability which is shown as P(A ∩ B).

… The probability of your dart landing in A or B, or BOTH is given by the ‘Union’ symbol:

P(A ∪ B)

The Union is the total area of the (possibly) overlapping circles A and B (but only count the overlapping part once!)

… The ‘complement’ (an apostrophe symbol) means ‘Not’. For example,

A’ means everything outside of set A

So, A’ means Not A, or the Complement of A. Imagine there is a force field that doesn’t let your pencil enter the A circle. Hatch everywhere OUTSIDE of A.

… For questions involving sets and subsets, it is usually easiest to use a Venn Diagram, especially for “…Given that…” Type questions (conditional probability)

… It can often be useful to start with an ‘x’ in the centre and work outwards using data given in the question.

… Mutually exclusive events (events which cannot happen at the same time): you can test whether two events are mutually exclusive by checking whether:

P(A) + P(B) = P(A ∪ B).

This is called the ‘Addition Rule for Mutually Exclusive events’.

… Two mutually exclusive events can be visualised using a Venn diagram with two separate circles in the box (the ‘universal set’ or ‘sample space’)

… Using the ‘Addition Rule’, if two events are NOT mutually exclusive (i.e. both events could happen at the same time), then:

P(A ∪ B) = P(A) + P(B) – P(A ∩ B)

Think of two areas, A and B, which overlap. If you add area A to area B then you are counting the overlap twice… That’s why we need to subtract one of the overlaps to find P(A ∪ B).

… Independent events are events that do not affect each other. So if A and B are independent, the probability of A happening is the same whether B has happened or not. We can say:

P(A|B) = P(A)

… To test whether two events are independent, check whether:

P(A ∩ B) = P(A) x P(B)

This is called the ‘Multiplication Rule applied to independent events’

… Conditional Probability: for example, the probability of event L happening given that event G has already happened:

P(L|G) = P(L ∩ G) / P(G)

Note that the top of this fraction is not P(L) x P(G) because the two events may NOT be independent.
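These rules can be tested by listing a whole sample space in Python. The two events below (A: the first die is even, B: the total is 7, for two fair dice) are made up purely for illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))    # 36 equally likely outcomes

def prob(event):
    """Probability = favourable outcomes / total outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] % 2 == 0        # hypothetical event: first die is even
B = lambda o: o[0] + o[1] == 7     # hypothetical event: total is 7

p_a, p_b = prob(A), prob(B)
p_a_and_b = prob(lambda o: A(o) and B(o))
p_a_or_b = p_a + p_b - p_a_and_b                   # addition rule
p_a_given_b = p_a_and_b / p_b                      # conditional probability
print(p_a_and_b == p_a * p_b)                      # True, so A and B are independent
print(p_a_given_b == p_a)                          # True, as expected for independent events
```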

… Another way to think of ‘given that’ conditional probability situations is to imagine that your universe ‘shrinks’ in this case to the bubble called event G (which has already happened)

… Questions can often be approached by first writing down a ‘data list’ of the probabilities given in the question. Then apply the relevant formula (some of which are given in your formulae booklet).

… Watch out for OR cases, especially when there are two or more ‘picks’. For example, “what is the probability of selecting two students at random from the school, one of whom is a girl hockey player and the other a boy?”

There are two ways to do this:

P(G ∩ H) × P(B) … OR … P(B) × P(G ∩ H)

After choosing a girl hockey player, the total number of students decreases by 1, which affects the second pick.

The OR means ADD the probabilities together.

… ‘Given that’ probability questions involving the normal distribution:

If P(X>45) = 0.625 (call this event A)

and

P(X>57) = 0.322 (call this event B)

Then the probability of B happening given that A has already happened:

P(B|A) = P(A∩B)/P(A) = 0.322/0.625

Here, P(A∩B) must be the same as P(B) because of the ‘overlap’ in the distribution: the only way that both A and B can happen is if X>57

Normal Distributions

… A distribution of measurements (‘data’) often follows what we call a ‘normal distribution’. About 68% of all the measurements lie within one standard deviation of the mean. The standard deviation is therefore a measure of the spread of data (like the Inter-Quartile range)

… Normal distribution for a continuous random variable, X~N(mean,variance)

… The total area under a normal distribution is always 1.

… When comparing distributions, always comment on the means and the standard deviations (the spread of the data).

… The number of standard deviations from the mean is also called the “Standard Score”, “sigma” or “z-score”.

… We can take any Normal Distribution and convert it to The Standard Normal Distribution.

This is done by converting any value to what is called a Standard Score (or “z-score”):

- first subtract the mean,
- then divide by the Standard Deviation

Doing this is called “Standardizing”:

… Draw and label two distributions, one above the other; one for the random variable X, and the other for the standardised score, Z.

… Calculating a Standardised Z score = (value – mean)/standard deviation

… Standardised Z scores help us to compare two different measurements, for example: is machine A more accurate than machine B at making bottle tops?

… We can also use a Z-score to look up the area (probability) under the bell curve to the left of the z-score. To do this we use the ‘standard normal distribution table’ or your calculator (much easier!). “Phi of z”, Φ(z) = P(Z < z)

… Calculating a z score and then finding cumulative probabilities P(Z<z) from the normal distribution tables / calculator.

… If you are given a probability, you can use the inverse normal distribution on your calculator to find the standardised z score. Note that you may need to form and solve simultaneous equations using the above z score formula.

… Usually answers in statistics are required to either 3 or 4 significant figures.

… Note that P(X<x) is exactly the same thing as P(X≤x) because the variable X is continuous

… Note that the Normal PD function on your calculator gives you the y-value (probability density) of the normal curve at a certain x-value. We’re mostly interested in the Normal CD function (cumulative distribution).
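If you want to check calculator work, Python's statistics.NormalDist (Python 3.8 or later) behaves like the Normal CD and inverse normal functions. The values mean = 50, standard deviation = 4 and x = 58 below are assumed purely for illustration.

```python
from statistics import NormalDist

mu, sigma = 50, 4                      # assumed values, for illustration only
x = 58

z = (x - mu) / sigma                   # standardised score: (value - mean)/sd
phi = NormalDist().cdf(z)              # Φ(z) = P(Z < z) on the standard normal
same = NormalDist(mu, sigma).cdf(x)    # P(X < x) without standardising first
z_back = NormalDist().inv_cdf(phi)     # inverse normal: probability -> z score
print(z, round(phi, 4), round(z_back, 4))
```

Note that NormalDist takes the standard deviation as its second argument, whereas the notation X~N(mean, variance) uses the variance.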

Geometric distributions

… Geometric distributions ask,”what is the probability of getting a success on the Xth attempt?” (When all other previous attempts were failures). X ~ Geo(p) where p is the probability of success in any attempt.

… P(X = x) = p(1-p)^(x-1)

Binomial distributions

… Binomial distributions ask, “what is the probability of getting X successes in n attempts?” X ~ B(n,p) where n is the number of attempts and p is the probability of success in any attempt.

… Questions in which the outcome can only be one of two possibilities (success or fail) are usually binomial (or geometric).

… The assumptions made in a binomial distribution are:

- Each trial results in one of two possible outcomes (success or failure),
- The probability of success is the same for each trial
- The trials are independent, meaning here that a success in one trial does not influence the probability of success in another.
- There are a set number of trials.

… Always define what your random variable (X) is before writing the distribution model. For example:

“X is the number of people who have to wait more than 30 minutes at A&E”

If the probability of a person having to wait more than 30 minutes = 0.3 and 12 people come into A&E, then:

X ~ B(12,0.3)

… For a binomial distribution, P(X=r) = (nCr)(p^r)(1-p)^(n-r)

or your calculator’s Binomial PD function.

… To find P(X ≤ r) use the cumulative binomial tables or your calculator’s Binomial CD function.

You may need to use the result that P(X ≥ r) = 1 – P(X ≤ r-1), for example if we have 8 tests and we want to know the probability of at least 2 successes:

P(X ≥ 2) = 1 – P(X ≤ 1)

…You may be asked to find a value s such that:

P(X ≥ s) < 0.01

One way to approach this is to work through the cumulative probabilities, for example:

P(X ≤ 6) = 0.9894 … so P(X ≥ 7) = 1 – 0.9894 = 0.0106

P(X ≤ 7) = 0.9984 … so P(X ≥ 8) = 1 – 0.9984 = 0.0016

Therefore s = 8 is the smallest such value.

… If you are asked to find, for example, P(4 ≤ X < 8), then write out a number line and highlight the values which are required:

1 2 3 4* 5* 6* 7* 8 9

Then we can say:

P(4 ≤ X < 8) = P(X ≤ 7) – P(X ≤ 3)

… Note that geometric distributions are very similar to binomial distributions but without the nCr coefficient and with r=1
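The binomial and geometric formulas above can be written directly in Python as a check. This is a sketch using X ~ B(12, 0.3) from the A&E example; the function names are made up.

```python
from math import comb

def binom_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

def binom_cdf(r, n, p):
    """P(X <= r) for X ~ B(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(r + 1))

def geo_pmf(x, p):
    """P(X = x) for X ~ Geo(p): first success on attempt x."""
    return p * (1 - p) ** (x - 1)

n, p = 12, 0.3
at_least_2 = 1 - binom_cdf(1, n, p)                  # P(X ≥ 2) = 1 – P(X ≤ 1)
between = binom_cdf(7, n, p) - binom_cdf(3, n, p)    # P(4 ≤ X < 8)
print(round(at_least_2, 4), round(between, 4))
print(geo_pmf(3, 0.5))                               # 0.125
```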

Hypothesis testing

… The flow of answering a hypothesis test question is usually:

- Define the variable, X in words. For example, “X is the number of people who pass their driving test first time”.
- Define the probability, p, in words. For example, “p is the probability that a person passes their driving test the first time”.
- State the distribution of X. For example, “X ~ B(20,0.6)
- State the null and alternative hypotheses, for example:

“Ho: p = 0.6, H1: p <0.6”.

If the question mentions that, for example, “…it is believed that the probability has changed…” then we would know that H1:p ≠ 0.6 and the we would have a ‘two-tailed’ test. - State the significance level used for this hypotheses test. For example, “α = 5%”. If we have a two-tailed test, then this significance level is split between the upper and lower tails equally.
- Write down the ‘rejection criterion’, which for example, could either be:

- “Reject Ho if P(X ≤ observed value) < 0.05”.

Note that the inequality direction P(X≤6) will always match the H1 alternative hypothesis, H1: p<0.6

or - “Reject Ho if the observed value falls in the critical region”.

- Then use the cumulative tables, or the formula, or your calculator to find the probability of a result at least as extreme as the observed value, in this case P(X≤6).

Alternatively, find the critical value x, the largest value for which P(X≤x) is still less than 0.05 (i.e. x is the boundary of the critical region). Note that the BPD List function on your calculator can be useful for this.

- State whether there is sufficient evidence to reject Ho (or not). ALSO state what this means in reality. For example: “there is insufficient evidence to reject Ho, so there is no evidence to suggest that the driving instructor is overestimating his pass rate”.
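
The probability step can be sketched in Python as below, assuming the driving test example with X ~ B(20, 0.6), H1: p < 0.6 and α = 5%. The observed value of 9 passes is made up for illustration and is not taken from the notes; the function names are also hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

def lower_tail_test(n, p0, observed, alpha):
    """Test H0: p = p0 against H1: p < p0.

    Returns the lower-tail probability P(X <= observed) under H0 and
    whether H0 is rejected at significance level alpha.
    """
    tail_probability = binom_cdf(observed, n, p0)
    return tail_probability, tail_probability < alpha

# observed=9 is an invented figure, used only to show the mechanics
tail_probability, reject = lower_tail_test(n=20, p0=0.6, observed=9, alpha=0.05)
print(round(tail_probability, 4), reject)
```

With these made-up figures the tail probability is well above 0.05, so there would be insufficient evidence to reject Ho.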

… Note that the ACTUAL significance level of the test is found by adding the probabilities in the critical region(s) together. Because X can only take whole-number values, this total will usually be a little less than the stated significance level (the ‘tail probability’). For example:

If you are carrying out a one-tailed (upper) test with a significance level of 5%, you may find that the probability of a result landing in the critical region is actually 4.3%.
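
A minimal sketch of finding a critical region and its actual significance level is shown below. It uses a lower-tailed test on the driving test example, X ~ B(20, 0.6) with α = 5%, rather than the upper-tailed case just described; the function names are illustrative.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

def lower_critical_value(n, p0, alpha):
    """Largest x with P(X <= x) < alpha, or None if there is no such x."""
    critical = None
    for x in range(n + 1):
        if binom_cdf(x, n, p0) < alpha:
            critical = x      # x is still inside the critical region
        else:
            break             # cumulative probabilities only increase
    return critical

x = lower_critical_value(20, 0.6, 0.05)
actual_level = binom_cdf(x, 20, 0.6)  # total probability of the critical region
print(x, round(actual_level, 4))      # 7 0.021
```

So the critical region is X ≤ 7, and the actual significance level is about 2.1%, noticeably less than the stated 5%.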

Correlation and regression

Spearman’s Rank Correlation Coefficient

… To rank two sets of data X and Y, you give the rank 1 to the highest of the x values, 2 to the next highest and so on. Then do the same for the y values.

… Calculate the difference, d, between each pair of ranks, then calculate d^2 and Σd^2.

… The Spearman’s rank correlation coefficient is given by:

rs = 1 – (6Σd^2) / (n(n^2 – 1))

This number lies between -1 (a completely negative correlation) and +1 (a completely positive correlation). A value of 0 means there is no correlation at all.

Note that you can only use this formula if there are NO tied ranks. If there ARE tied ranks, then use the calculator method described below to find the Product Moment Correlation Coefficient (pmcc) of the ranks.
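
The ranking and formula steps can be sketched in Python as follows. The marks from the two judges are made-up data, and the function names are illustrative; the code refuses to run if there are tied values, since the formula does not apply then.

```python
def ranks_high_to_low(values):
    """Give rank 1 to the highest value, 2 to the next, and so on (no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_no_ties(xs, ys):
    """Spearman's rank correlation coefficient using the sum of d^2 formula."""
    if len(set(xs)) != len(xs) or len(set(ys)) != len(ys):
        raise ValueError("tied values: the d^2 formula should not be used")
    n = len(xs)
    rank_x = ranks_high_to_low(xs)
    rank_y = ranks_high_to_low(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n**2 - 1))

# Invented marks awarded by two judges to five contestants
judge_x = [86, 72, 91, 65, 78]
judge_y = [80, 75, 88, 70, 60]

print(ranks_high_to_low(judge_x))   # [2, 4, 1, 5, 3]
print(ranks_high_to_low(judge_y))   # [2, 3, 1, 4, 5]
print(round(spearman_no_ties(judge_x, judge_y), 3))  # 0.7
```

Here Σd^2 = 6 and n = 5, so rs = 1 – 36/120 = 0.7.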

… When asked to comment on the Spearman’s rank correlation coefficient, you can use descriptions such as:

rs = -0.841 “There is a strong negative correlation between X and Y, so as one increases, the other tends to decrease” (strong disagreement!)

rs = 0.611 “There is a reasonable positive correlation” (fair agreement)

rs = 0.169 “There is only a weak positive correlation…” (little agreement)

… Spearman’s Rank Correlation Coefficient is the same as the Product Moment Correlation Coefficient (pmcc) calculated from the ranks. So if you enter the ranks (not the original data) into the table, you can use the calculator to calculate this value:

MODE > 2 (STAT) > 2 (A + BX) > [enter data in table] > AC

then

SHIFT > 1 (STAT) > 5 (REG) > r

This is particularly useful if there are tied ranks when you cannot use the usual formula.
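
The same idea can be sketched in Python: rank the data, then calculate the pmcc of the ranks. The sketch assumes the usual convention that tied values share the average of the ranks they occupy, which these notes do not spell out, and all data and function names are made up for illustration.

```python
from math import sqrt

def average_ranks(values):
    """Rank 1 for the highest value; tied values share the mean of their ranks."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # best rank taken by this value
        count = ordered.count(v)                # number of values tied with it
        ranks.append(first + (count - 1) / 2)   # mean of first .. first+count-1
    return ranks

def pmcc(xs, ys):
    """Product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman's coefficient as the pmcc of the ranks (works with ties)."""
    return pmcc(average_ranks(xs), average_ranks(ys))

# No ties: agrees with the sum of d^2 formula (rs = 0.7 for this invented data)
print(round(spearman([86, 72, 91, 65, 78], [80, 75, 88, 70, 60]), 4))

# Tied values in the first list: the sum of d^2 formula could not be used here
print(average_ranks([1, 2, 2, 3]))
print(round(spearman([1, 2, 2, 3], [1, 2, 3, 4]), 4))
```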

Regression and the Product Moment Correlation Coefficient (pmcc), r

… The product moment correlation coefficient is ‘r’. This tells us how correlated two variables are to each other.

r = 0 means there is no linear correlation between the two variables.

r = 1 means they have a perfect positive correlation (positive gradient)

r = -1 means they have a perfect negative correlation (negative gradient)

So r can be any value from -1 to 1 inclusive.

Calculating a Regression line (line of best fit) using Sxx, Sxy and Syy.

… Sxx and Syy are the sums of the squares of the deviations from the mean, and Sxy is the sum of the products of the deviations:

Sxx = Σ(x – xbar)² = Σx² – n(xbar)²

Syy = Σ(y – ybar)² = Σy² – n(ybar)²

Sxy = Σ(x – xbar)(y – ybar) = Σxy – n(xbar)(ybar)

… r = Sxy / √(SxxSyy)
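
A short Python sketch of these formulas is given below, using the Σx², Σy² and Σxy versions. The five data pairs are made up for illustration and the function name is hypothetical.

```python
from math import sqrt

def summary_sums(xs, ys):
    """Return (Sxx, Syy, Sxy) using the sum-based forms of the formulas."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum(x * x for x in xs) - n * xbar**2
    syy = sum(y * y for y in ys) - n * ybar**2
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar
    return sxx, syy, sxy

# Invented data: xbar = 3 and ybar = 4
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

sxx, syy, sxy = summary_sums(xs, ys)
r = sxy / sqrt(sxx * syy)
print(sxx, syy, sxy)   # 10.0 6.0 6.0
print(round(r, 4))     # 0.7746
```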

… It’s worth noting that:

Variance = Sxx/n = Σ(x²)/n – xbar²

But you would have to multiply this whole formula by n/(n-1) to get the best unbiased estimate of the population variance if your data were from a sample (MEI candidates only)…
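
The two versions of the variance can be compared in Python. The sketch below uses made-up data and checks the hand calculation against the standard library's statistics module, where pvariance divides by n and variance divides by n – 1.

```python
import statistics

data = [1, 2, 3, 4, 5]   # invented sample
n = len(data)
xbar = sum(data) / n
sxx = sum(x * x for x in data) - n * xbar**2

variance_n = sxx / n                    # Sxx / n
unbiased = variance_n * n / (n - 1)     # multiplied by n/(n-1), i.e. Sxx / (n - 1)

print(variance_n, statistics.pvariance(data))
print(unbiased, statistics.variance(data))
```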

… The equation of the regression line of y on x is: y = bx + a

b = Sxy/Sxx and a = ybar – b(xbar)
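
As an illustration, the sketch below calculates a and b for a small made-up data set and uses the line to predict y. The function names are hypothetical.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the regression line of y on x, y = bx + a."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Invented data: Sxx = 10, Sxy = 6, xbar = 3, ybar = 4
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = regression_y_on_x(xs, ys)
print(round(a, 2), round(b, 2))   # 2.2 0.6

# Predict y when x = 3.5 (inside the range of the data)
print(round(b * 3.5 + a, 2))      # 4.3
```

Note that the line always passes through the point (xbar, ybar), here (3, 4).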

… The line of “y on x” should be used when y is the ‘response’ variable (dependent) and x is the ‘explanatory’ variable (independent).

… The regression line of “y on x” may well be slightly different to the line of “x on y”, as the first method minimises y ‘residuals’ (vertical distances from the predicted value to the data point), and the second method minimises x ‘residuals’ (horizontal distances).
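
The difference between the two lines can be seen in a small Python sketch with made-up data. It assumes the line of x on y is found by swapping the roles of x and y in the formulas, so its gradient (for x in terms of y) is Sxy/Syy; that formula is not stated in these notes.

```python
def s_values(xs, ys):
    """Return (xbar, ybar, Sxx, Syy, Sxy)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return xbar, ybar, sxx, syy, sxy

# Invented data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
xbar, ybar, sxx, syy, sxy = s_values(xs, ys)

# y on x: y = b*x + a, minimising vertical residuals
b = sxy / sxx
a = ybar - b * xbar

# x on y: x = d*y + c, minimising horizontal residuals (roles swapped)
d = sxy / syy
c = xbar - d * ybar

print(f"y on x: y = {b:.2f}x + {a:.2f}")
print(f"x on y: x = {d:.2f}y + {c:.2f}")
```

For this data the line of y on x is y = 0.6x + 2.2, while the line of x on y is x = y – 1, which rearranges to y = x + 1. Both pass through (xbar, ybar) but have different gradients.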

… The calculator STAT mode can calculate r, a and b :

MENU > STATISTICS > A + BX > [enter data in table] > OPTN > REGRESSION CALC

This will give you the values of a, b and r. Your line of best fit is then given by the equation y = bx + a.

… Hypothesis tests for correlation – using the Product Moment Correlation Coefficient

Use the PMCC (product moment correlation coefficient) critical value table, which is given in your formula booklet.

Here’s a tutorial to take you through how it works:

https://youtu.be/fTh5GnDqZfw

Please see attached summary sheet from this video.

… The Chi Squared test is used:

- to assess how closely an observed distribution of data matches an expected distribution (e.g. for 60 rolls of a supposedly fair die, do the observed scores match the expected numbers, for example 10 rolls of “1”?)
- to examine whether two variables are connected in some way, i.e. are they independent?

You won’t need to know the Chi Squared test for your maths course, but here is a useful website which explains it in more detail:

http://ccnmtl.columbia.edu/projects/qmss/the_chisquare_test/about_the_chisquare_test.html