Statistical fundamentals for testing

Stenburgen Ruwa
5 min read · Oct 8, 2020

A basic understanding of statistics can help you evaluate A/B test results and case studies correctly. In this blog I will give an overview of the statistical concepts that every digital marketer and CRO specialist should know.

Sampling populations, parameters, and statistics

A population can be thought of as all the potential users, people in a group, or things we want to measure. A blog post on the CXL website by Matt Gershoff uses the example of figuring out the difference in temperature between the coffee served at a couple of coffee shops.

The question in this experiment is whether one shop serves hotter coffee than the other. In this case we use statistics to compare a collection, or sample, of things. When comparing the temperature of the coffee cups at the two shops, the parameter of interest is the mean temperature at each shop. The purpose is to find out whether there is a difference in mean temperature between the two shops.

Since we will never know the true population of all the cups at each coffee shop, we take a sample of cups from each shop and compute a statistic on that sample; we estimate the population mean using the mean of the sample. The true population has a mean and a standard deviation, conventionally represented by the Greek symbols μ (mu) and σ (sigma).

The statistics from our small sample of coffee cups are written with Latin symbols; the sample mean is represented by x̄ (x-bar). Once we have a sample, we use the sample statistics to make inferences about the population parameters.

Mean, variance, and standard deviation

To understand important A/B testing concepts like statistical significance, power, and the importance of sample size, we first need to understand the mean, which is the most common measure of central tendency.

For example, if we measure a bunch of coffee cups and plot the temperatures of the cups of coffee on a graph, we get a spread of data. The value at the mid-point of that spread, often drawn as a red line, is the mean, also known as the central tendency.

When thinking about A/B testing and the data you collect, you also need to think about the shape of the data, or how spread out it is, which is known as variance. The most common measure of variability in statistics is the standard deviation.
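To make these quantities concrete, here is a minimal Python sketch using the standard library; the temperature values are made-up stand-ins for the coffee-cup measurements, not real data from the CXL example.

```python
# Mean, variance, and standard deviation for a hypothetical
# sample of coffee-cup temperatures (values are invented).
import statistics

temperatures = [71.2, 68.5, 70.1, 69.8, 72.4, 70.9, 69.3, 71.7]  # °C

mean = statistics.mean(temperatures)          # central tendency
variance = statistics.variance(temperatures)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(temperatures)      # spread around the mean

print(f"mean = {mean:.2f}, variance = {variance:.2f}, std dev = {std_dev:.2f}")
```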

Confidence intervals

A confidence interval is a range of values defined in such a way that there is a specified probability that the value of the parameter lies within it. The first ingredient of a confidence interval is the mean; the other ingredients, which determine how wide the interval is, are the sample size, the variability of the data, the shape of the data, and the confidence level.

This has some important practical implications for A/B testing. The confidence interval expresses the amount of error allowed in an A/B test, and is a measure of the reliability of the estimate. We work with confidence intervals because the true conversion rate cannot be measured directly. If the tool you use to calculate confidence intervals says you can be 95% confident that the conversion rate is 5% plus or minus X, then you need to treat that plus or minus as the margin of error.
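As a sketch of where that plus-or-minus comes from, the snippet below computes a 95% confidence interval for a conversion rate using the normal approximation; the visitor and conversion counts are assumptions chosen for illustration.

```python
# 95% confidence interval for a conversion rate (normal approximation).
import math

conversions, visitors = 500, 10_000
p_hat = conversions / visitors                 # observed conversion rate (5%)
z = 1.96                                       # z-score for 95% confidence
margin = z * math.sqrt(p_hat * (1 - p_hat) / visitors)

print(f"conversion rate = {p_hat:.3f} ± {margin:.3f}")
# -> roughly 0.050 ± 0.004, i.e. "5% plus or minus 0.4 points"
```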

Statistical Significance and the p-value

Statistical significance helps us measure whether a result is likely due to chance; a statistically significant result is one that is unlikely to be explained by chance alone. Whenever we talk about statistical significance, the p-value comes into play.

The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming there is no real difference between the variants. At a 95% confidence level, a p-value below the 5% significance threshold is what lets us call a result significant. A misconception to remember is that the p-value doesn't tell us the probability that A is better than B. Similarly, it doesn't tell us the probability that we will be making a mistake in selecting A over B.
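To show where a p-value comes from in practice, here is a hedged sketch using a two-proportion z-test from statsmodels; the control and variant counts are invented for the example.

```python
# Two-proportion z-test on made-up control/variant counts.
from statsmodels.stats.proportion import proportions_ztest

conversions = [500, 560]      # control, variant (assumed numbers)
visitors = [10_000, 10_000]

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p-value = {p_value:.3f}")
# If p_value < 0.05 the result is called statistically significant; note
# this is NOT the probability that the variant is better than the control.
```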

Statistical Power

Statistical power is the probability that a test of significance will reject a false null hypothesis. In layman's terms, statistical power is the likelihood that a study will detect an effect when there is an effect to be detected. It is determined by the size of the effect you want to detect and the size of the sample used to detect it. Larger samples offer greater test sensitivity than smaller samples.
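One way to build intuition for power is to simulate it. The sketch below assumes a true lift from a 5% to a 6% conversion rate (both numbers are made up) and counts how often a two-proportion z-test detects it at different sample sizes.

```python
# Empirical power: simulate many A/B tests with a real underlying
# difference and count how often the test reaches p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
p_a, p_b = 0.05, 0.06              # true conversion rates (assumed)
alpha = 0.05
n_sims = 2_000

for n in (1_000, 5_000, 20_000):   # visitors per variant
    rejections = 0
    for _ in range(n_sims):
        conv_a = rng.binomial(n, p_a)
        conv_b = rng.binomial(n, p_b)
        pooled = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        z = (conv_b / n - conv_a / n) / se
        p_value = 2 * stats.norm.sf(abs(z))
        rejections += p_value < alpha
    print(f"n = {n:>6} per variant -> empirical power ≈ {rejections / n_sims:.2f}")
# Power grows with the sample size: the same true lift that is
# usually missed at n = 1,000 is almost always detected at n = 20,000.
```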

Sample size and how to calculate it

As with any other statistical procedure, a common question is: what sample size is needed? For A/B testing, the right sample size comes down to how large a difference you want to be able to detect, should one exist at all.

The factors to consider when calculating sample size are how large a difference you want to detect, the confidence level, the power, and the variability of the data. For conversion rates, values closer to 50% have higher variability.

When calculating sample size, the variables involved are the control group's expected conversion rate, the minimum relative change in conversions you want to be able to detect (the lift), and how confident you want to be. The inverse of that confidence is how much risk of a type I error, or false positive, you are willing to accept.
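Putting those variables together, here is a sketch of a pre-test sample size calculation with statsmodels; the 5% baseline rate, 20% relative lift, 95% confidence, and 80% power are all assumed inputs, not prescriptions.

```python
# Sample size per variant for a two-proportion test.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                      # control group's expected conversion rate
lift = 0.20                          # minimum relative change to detect
effect = proportion_effectsize(baseline, baseline * (1 + lift))

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,                      # accepted type I error (false positive) risk
    power=0.80,                      # chance of detecting the lift if it exists
    alternative="two-sided",
)
print(f"visitors needed per variant: {n_per_variant:.0f}")
# -> on the order of 8,000 visitors per variant under these assumptions
```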

Statistics Traps

Below I will cover a few statistics traps you may encounter when testing.

1. Regression to the mean and sampling error

One of the big traps is stopping a test too early, which relates to a couple of concepts called regression to the mean and sampling error. Sampling error is the error that arises when the sample is not representative of its population.
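A small simulation makes the danger of early stopping visible: in an A/A test with no real difference, repeatedly peeking at the results inflates the false-positive rate well above the nominal 5%. The rates and checkpoints below are illustrative assumptions.

```python
# "Peeking" at an A/A test at several checkpoints and stopping at the
# first p < 0.05 declares far more false winners than a single look.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
p = 0.05                              # same true rate in both groups
checkpoints = [2_000, 4_000, 6_000, 8_000, 10_000]
n_sims = 2_000
false_positives = 0

for _ in range(n_sims):
    a = rng.binomial(1, p, size=10_000)
    b = rng.binomial(1, p, size=10_000)
    for n in checkpoints:             # peek at the running test
        ca, cb = a[:n].sum(), b[:n].sum()
        pooled = (ca + cb) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        z = (cb / n - ca / n) / se
        if 2 * stats.norm.sf(abs(z)) < 0.05:
            false_positives += 1      # declared a "winner" that isn't there
            break

print(f"false-positive rate with peeking ≈ {false_positives / n_sims:.2%}")
# Typically well above the 5% you would expect from a single look.
```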

2. Too many variants

Often when testing, you will hear the suggestion to test as many variants as possible, so that one of the tests will definitely work. When doing optimization, you should instead make sure that every variant is hypothesis-driven.

The problem is that the more test variants you run, the higher the probability of declaring one of them a winner when actually there is none. With a 95% confidence level, we accept a 5% probability of a type I error (false positive). This low error probability, however, is only valid in the case of one test variant, as the arithmetic below shows. When running your tests, limit yourself to three variants, each hypothesis-driven.
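The arithmetic behind this trap is short enough to compute directly; the sketch below assumes independent variants each tested at a 5% significance level.

```python
# Family-wise error rate: the chance of at least one false positive
# across k independent variants, each tested at alpha = 0.05.
alpha = 0.05
for k in (1, 3, 5, 10):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:>2} variants -> P(at least one false winner) = {fwer:.1%}")
# 1 -> 5.0%, 3 -> 14.3%, 5 -> 22.6%, 10 -> 40.1%
```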

3. Click rates and conversion rates

A statistical trap we often see is inflating the level of importance of click rates relative to conversion rates. Just because you increased visits to a product page, or because visitors place an item in the shopping cart more often, doesn't mean that you have increased the macro goal of conversions. It also doesn't mean that more people will complete an actual purchase.

In this case it is very important to select a main KPI and a main metric before you go into testing.

4. Frequentist vs. Bayesian test procedures

Another statistics trap people fall into is the philosophical debate between frequentist statistics and Bayesian statistics. In a nutshell, the difference is that from a Bayesian point of view a hypothesis is assigned a probability, while from a frequentist point of view a test is run without assigning a probability to the hypothesis.
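As a minimal illustration of the Bayesian view, the sketch below assigns each variant a Beta posterior (starting from a flat Beta(1, 1) prior) and uses Monte Carlo sampling to estimate the probability that B beats A; the counts are invented for the example.

```python
# Bayesian A/B comparison: Beta posteriors plus Monte Carlo sampling
# give P(B > A) directly, a probability statement about the hypothesis.
import numpy as np

rng = np.random.default_rng(0)
conv_a, n_a = 500, 10_000    # control conversions / visitors (assumed)
conv_b, n_b = 560, 10_000    # variant conversions / visitors (assumed)

post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

print(f"P(B beats A) = {(post_b > post_a).mean():.2%}")
# Unlike a p-value, this is a direct probability that B is better than A.
```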
