Let’s play a game. Heads or tails?
I assume everyone in the world knows this little game: throw a coin in the air and check, when it lands, which side is up. It is also one of the simplest decision-making tools, and it is even used in football games. We assume that the coin has no bias, so we have a 50% chance of heads and a 50% chance of tails. To be sure of this 50/50 chance there is even a specially minted coin for the National Football League. So it really is 50/50 then, isn’t it?
Well… let’s start with an extreme: we toss once. Now we have 100% of one side and 0% of the other. Does this mean that the 50/50 isn’t true? No: we get this result because we have tried only once. So, let’s throw another! This toss again has a 50/50 chance, so it is quite possible to end up with a 2-0 score. Does this prove we have a system that doesn’t work? No, it’s all about sample size.
In a random game like this every toss has a 50/50 chance, but there is a 25% chance that two tosses both come up heads (and the same 25% for two tails). We can even end up with 10-0 after 10 tosses, but the chance of ten heads in a row is only about 0.1%. These extremes are so rare that they don’t really play a big role. We have to acknowledge that they can happen, but they are rare.
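The streak percentages above follow from multiplying 0.5 by itself once per toss. A minimal sketch, assuming a fair coin and independent tosses (the function name `streak_probability` is only illustrative):

```python
def streak_probability(tosses: int, p_heads: float = 0.5) -> float:
    """Chance that every one of `tosses` independent tosses shows heads."""
    return p_heads ** tosses


for tosses in (1, 2, 10):
    # 0.5, 0.25 and 1/1024 (about 0.1%)
    print(f"{tosses:>2} heads in a row: {streak_probability(tosses):.2%}")
```

The chance of a streak of either side (all heads or all tails) is twice these values.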
Now we throw 100 times and log the number of heads and tails. On the far left of the X-axis we have 100 heads and on the far right 100 tails; in the middle is the 50/50 situation. Now we play this round of 100 tosses thousands of times and log all those rounds in a graph. The result will look something like the included graph: (approximately) a normal distribution.
It shows the chance of every outcome, from 100% heads (on one side of the graph) to 100% tails (on the other side). Both of those chances are very small, and the 50/50 outcome in the middle has the highest chance. The green area is the 95% region: 95% of the rounds will end up in this green zone.
We have to accept that, in theory, the extreme situations can occur, but their chances are so small that in the real world we ignore them. Instead of taking all extremes into account we say that 95% of the time the result is within these margins: the 95% confidence level. We then ignore the 2.5% areas at each extreme. It is also possible to use 99% certainty, but usually 95% is enough. The 95% region reaches 1.96 standard deviations to either side of the middle; this 1.96 is called the z-score, just remember this one…
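The experiment behind the graph can be imitated with a small simulation. This is only a sketch: the number of rounds, the fixed seed and the name `simulate_rounds` are assumptions made for the example. For 100 fair tosses the standard deviation of the number of heads is 5, so the 95% region is roughly 50 ± 9.8, which in whole tosses means 41 to 59 heads.

```python
import random


def simulate_rounds(rounds: int = 5000, tosses: int = 100, seed: int = 42) -> list[int]:
    """Play `rounds` rounds of `tosses` fair coin tosses; return heads per round."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        sum(rng.random() < 0.5 for _ in range(tosses))
        for _ in range(rounds)
    ]


counts = simulate_rounds()

# 50 +/- 1.96 * 5 heads, narrowed to whole numbers of heads
low, high = 41, 59
inside = sum(low <= heads <= high for heads in counts) / len(counts)

print(f"fewest heads in a round: {min(counts)}")
print(f"most heads in a round:   {max(counts)}")
print(f"share of rounds within {low}..{high} heads: {inside:.1%}")
```

Because the number of heads can only be a whole number, the share lands close to 95% rather than exactly on it.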
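The 1.96 does not have to be memorised; it can be derived from the standard normal distribution. A minimal sketch using Python's standard library (the helper name `z_score` is just for this example):

```python
from statistics import NormalDist


def z_score(confidence: float) -> float:
    """Two-sided z-score: cut off (1 - confidence) / 2 in each tail."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)


print(round(z_score(0.95), 2))  # 1.96
print(round(z_score(0.99), 2))  # 2.58
```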
Margin of Error
With this information we can calculate the margin of error: half the width of the green area in the graph, the distance from the middle to its edge. So when we have, for example, 45% ± 5%, that means the real value is probably (95% certain) between 40% and 50%. There is still a 5% chance that the real value lies outside this range.
The formula for this calculation is: z * √ ( p*(1-p) / n )
- z = the z-score (1.96 for a 95% certainty)
- p = the measured proportion of the outcome we look at, for example the share of heads
- n = sample size, number of measurements
In the example of a heads-or-tails game:
- p = 0.60 (10 tosses, 6 heads)
- 1-p = the chance of not getting p, 1 minus 0.60 = 0.40
- n = the number of times we try, in this example 10 tosses
With these values we calculate the margin of error (e):
- e = 1.96 * √ ( 0.60*(1-0.60) / 10)
- e = 1.96 * √ ( 0.24/10 )
- e = 1.96 * √0.024
- e = 0.30 –> 30%
When we toss 10 times and get 60% heads we say we have 60% ± 30%. Not a great score!
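The calculation above can be written as a small function. A minimal sketch, assuming the z-score of 1.96 for 95% confidence (the name `margin_of_error` is only illustrative):

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """e = z * sqrt(p * (1 - p) / n), returned as a fraction (0.30 = 30%)."""
    return z * math.sqrt(p * (1 - p) / n)


e = margin_of_error(p=0.60, n=10)
print(f"60% ± {e:.0%}")  # 60% ± 30%
```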
But we can lower this margin of error by increasing the sample size. When we get the same 60% but try more often we get the following results:
- n = 10 –> e = 30%
- n = 100 –> e = 9.6%
- n = 1000 –> e = 3.0%
- n = 10000 –> e = 0.96%
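The list above can be reproduced with a short loop. This sketch uses the same formula with p = 0.60 and z = 1.96; it also shows that the margin shrinks with the square root of n, so 100 times more samples give a margin that is only 10 times smaller.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """e = z * sqrt(p * (1 - p) / n), as a fraction."""
    return z * math.sqrt(p * (1 - p) / n)


margins = {n: margin_of_error(0.60, n) for n in (10, 100, 1000, 10000)}

for n, e in margins.items():
    print(f"n = {n:>5} -> e = {e * 100:.2f}%")

# 100 times more samples, 10 times smaller margin
print(round(margins[10] / margins[1000], 6))  # 10.0
```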
So the number of samples is very important in our game. The question is: how certain do you need to be? When running an A/B test we have to be sure that we tell the truth. Maybe the result isn’t what we expected, but that can also be valuable information.
Suppose, as in the A/B test example, we test whether a green button works better than a blue one and get the following results:
- Blue: conversion rate = 15% ± 5% (that means: between 10 and 20%)
- Green: conversion rate = 20% ± 5% (that means: between 15 and 25%)
With these results we still have no answer: the two ranges overlap, so the real conversion rates could be the same. The margin of error is just too high and we need more samples to give a good answer. But what is enough? We don’t want to spend more time and effort than needed, but a result with too high a margin of error is just wasted time and resources. So we have to calculate how many samples we need before we start.
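Whether two results can be told apart can be checked by comparing their ranges. This is a rough rule of thumb in the spirit of the text, not a formal significance test, and the helper names are made up for this example:

```python
def interval(rate: float, margin: float) -> tuple[float, float]:
    """Lower and upper bound of rate ± margin."""
    return (rate - margin, rate + margin)


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when the two ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


# 15% ± 5% versus 20% ± 5%: the ranges overlap, so no answer yet
print(overlaps(interval(0.15, 0.05), interval(0.20, 0.05)))  # True

# The same rates with ± 1%: the ranges are clearly separated
print(overlaps(interval(0.15, 0.01), interval(0.20, 0.01)))  # False
```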
For this calculation we can use the Cochran formula: n = ( z² * p * (1-p) ) / e²
So, if we want to check whether the distribution of heads and tails is really 50/50 with a 2% margin of error, we get this calculation:
- n = ( 1.96² * 0.5 * (1-0.5) ) / 0.02²
- n = ( 1.96² * 0.25 ) / 0.02²
- n = 0.9604 / 0.0004
- n = 2401
In this case we have to flip the coin 2401 times to get to this ±2%.
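The Cochran formula translates directly into code. A minimal sketch, assuming z = 1.96 (the function name `cochran_sample_size` is only illustrative). The result is rounded up because we cannot take a fraction of a sample:

```python
import math


def cochran_sample_size(p: float, e: float, z: float = 1.96) -> int:
    """n = z² * p * (1 - p) / e², rounded up to a whole number of samples."""
    n = (z ** 2) * p * (1 - p) / (e ** 2)
    # Round away floating-point noise first, so that an exact result
    # such as 2401 is not pushed up to 2402 by a tiny rounding error.
    return math.ceil(round(n, 6))


print(cochran_sample_size(p=0.5, e=0.02))  # 2401
```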
For the blue and green buttons with a ±1% margin of error we get (we use the 15% as a known value for the blue button; we don’t know green yet):
- n = ( z² * p * (1-p) ) / e²
- n = ( 1.96² * 0.15 * (1-0.15) ) / 0.01²
- n = 0.489804 / 0.0001
- n = 4898.04 –> 4899 (we have to round up)
We need 4899 samples (for each situation!) to get a useful test. The reason is that each page runs its own yes/no test: does the user click or not?
That means we have to display the page with the blue button 4899 times and the page with the green button also 4899 times.
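To see how quickly the required number of page views grows when we ask for a smaller margin, the same formula can be run for several targets. A sketch assuming the 15% conversion rate from the example and z = 1.96 (names are illustrative):

```python
import math


def cochran_sample_size(p: float, e: float, z: float = 1.96) -> int:
    """n = z² * p * (1 - p) / e², rounded up."""
    n = (z ** 2) * p * (1 - p) / (e ** 2)
    return math.ceil(round(n, 6))  # round() guards against float noise


sizes = {e: cochran_sample_size(p=0.15, e=e) for e in (0.05, 0.02, 0.01)}

for e, n in sizes.items():
    # Each variant (blue and green) needs its own n page views
    print(f"±{e:.0%}: {n} views per button, {2 * n} in total")
```

Halving the margin of error roughly quadruples the number of samples, because e is squared in the formula.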
One last addition: sometimes the calculated sample size is larger than, or a large part of, the population. For example: we have an election for the new mayor. There are 2 candidates and the population of the (small) city is 2000 people.
We want to run a poll with 95% confidence and a 2% margin of error.
- n = ( z² * p * (1-p) ) / e²
- n = ( 1.96² * 0.5 * (1-0.5)) / 0.02²
- n = 2401
But the population is just 2000 persons, how can we run a poll in this case?
In situations where the population is small we can use the following addition:
sn = n / ( 1 + (n-1)/N )
- n = the previously calculated sample size, 2401 in our case
- N = the size of the population
- sn = the corrected sample size for a small population
When we use these values we get:
- sn = 2401 / (1+ (2401-1)/2000)
- sn = 2401 / ( 1 + 2400/2000 )
- sn = 2401 / ( 1 + 1.2 )
- sn = 2401 / 2.2
- sn = 1091.4 –> 1092 (we have to round up)
That means we have to ask 1092 people to know, with just a 2% margin of error, which candidate is ahead. If a 5% margin of error were good enough we could have done the poll with a sample size (sn) of 323 people.
This correction can be important if, for example, you want to send a questionnaire to your customers and the number of customers is limited.
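The correction for small populations is a one-line function. A minimal sketch (the function name is made up for this example); the value 385 used for the 5% case is the Cochran result 384.16 rounded up:

```python
import math


def finite_population_sample_size(n: int, population: int) -> int:
    """sn = n / (1 + (n - 1) / N), rounded up to whole people."""
    sn = n / (1 + (n - 1) / population)
    return math.ceil(round(sn, 6))  # round() guards against float noise


# 2% margin of error: n = 2401 from the Cochran formula
print(finite_population_sample_size(2401, 2000))  # 1092

# 5% margin of error: n = 385 (384.16 rounded up)
print(finite_population_sample_size(385, 2000))   # 323
```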