Significance ($1-\alpha$): Also called the confidence level, it is the probability of correctly accepting the null hypothesis when it is in fact true, i.e. the probability that the test does not report a difference when no difference exists in the population.
Type I error ($\alpha$): The probability of rejecting a true null hypothesis, or incorrectly concluding that a difference exists (the null hypothesis is not appropriate) when in fact a difference really does not exist in the population. A false positive decision.
(Statistical) Power ($1-\beta$): The probability of correctly rejecting the null hypothesis when it is in fact false, and thus, the probability of observing a difference in the sample when an equal or greater difference is present in the population.
| | $H_0$ True | $H_0$ False ($H_a$ True) |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Power ($1-\beta$) |
| Accept $H_0$ | Significance ($1-\alpha$) | Type II error ($\beta$) |
$$ \alpha = \Pr(\text{Reject } H_0 \mid H_0 \text{ is True}) $$
$$ 1 - \alpha = \Pr(\text{Accept } H_0 \mid H_0 \text{ is True}) $$
$$ \beta = \Pr(\text{Accept } H_0 \mid H_0 \text{ is False}) $$
$$ 1 - \beta = \Pr(\text{Reject } H_0 \mid H_0 \text{ is False}) $$
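These four probabilities can be checked empirically. As a sketch (my own illustration, not part of the original notebook; the sample size $n=25$ and true shift $\delta=0.5$ are arbitrary choices), the following simulates many one-sample z-tests under both hypotheses and estimates $\alpha$ and $1-\beta$ from the rejection rates:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, n, delta, trials = 0.05, 25, 0.5, 200_000
z_crit = norm.isf(alpha / 2)                        # two-tailed critical value

# z statistic of a one-sample z-test with sigma = 1: z = xbar * sqrt(n)
z_h0 = rng.normal(0.0, 1.0, trials)                 # H0 true: z ~ N(0, 1)
z_h1 = rng.normal(delta * np.sqrt(n), 1.0, trials)  # H1 true: mean shifted by delta*sqrt(n)

type1 = np.mean(np.abs(z_h0) > z_crit)              # estimates alpha
power = np.mean(np.abs(z_h1) > z_crit)              # estimates 1 - beta

print(f"estimated alpha: {type1:.3f}  (nominal {alpha})")
print(f"estimated power: {power:.3f}")
```

The rejection rate under $H_0$ recovers the nominal $\alpha$, and the rejection rate under $H_1$ matches the theoretical power $\Phi(\delta\sqrt{n} - Z_{\alpha/2})$ up to Monte Carlo noise.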
from IPython.display import Image
Image("images/Power.jpg")
In the above figure, each curve represents the difference between two groups on some continuous measurement $\delta=p-p_0$. The null hypothesis, that there is no difference between the two groups, is represented by the curve on the left ($\mu=p_0$). The alternative hypothesis, that there is some difference between the two groups, is represented by the curve on the right ($\mu=p=p_0+\delta$). Each distribution is also characterized by its variance ($\sigma^2$) which is usually assumed to be the same for both distributions.
Image("images/sample_size.png")
Considering the two normal distributions in Figure 1: $$ N_1(\mu, \dfrac{\sigma}{\sqrt{n}}),~~N_2(\mu+\delta, \dfrac{\sigma}{\sqrt{n}}) $$
For a two-tailed test, the critical value for the left normal distribution in Figure 1 is located at $$ \mu + Z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}} $$
The critical value for the right normal distribution in Figure 1 is located at $$ \mu + \delta - Z_{\beta} \dfrac{\sigma}{\sqrt{n}} $$
The critical value is the same point for both distributions. Setting the two expressions equal to each other, one finds
$$ \delta = ( Z_{\alpha/2} + Z_{\beta}) \dfrac{\sigma}{\sqrt{n}}, $$ from which one determines the sample size: $$ n = \dfrac{( Z_{\alpha/2} + Z_{\beta})^2}{(\delta/\sigma)^2} $$
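As a quick numerical sketch of this formula (my own illustration; the effect $\delta = 0.5$ and $\sigma = 1$ are arbitrary choices):

```python
from scipy.stats import norm

alpha, beta = 0.05, 0.2        # 95% confidence level, 80% power
delta, sigma = 0.5, 1.0        # hypothetical effect and standard deviation

z_a = norm.isf(alpha / 2)      # Z_{alpha/2}, about 1.960 (two-tailed)
z_b = norm.isf(beta)           # Z_beta, about 0.842
n = (z_a + z_b) ** 2 / (delta / sigma) ** 2
print(round(n, 1))             # -> 31.4, i.e. 32 subjects per group after rounding up
```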
Effect size: the magnitude of the difference between groups divided by the standard deviation $$ {\rm Effect ~~ size} =\dfrac{\delta}{\sigma} $$
The effect size is the main finding of a quantitative study. While a P value can inform the reader whether an effect exists, the P value will not reveal the size of the effect.
For an A/B test, the quantity of interest is the difference in conversion rates, $\delta = r_A - r_B$.
The test statistic $$ \dfrac{(r_A - r_B)}{{\rm SE}} $$ follows the normal distribution $N(0,1)$, where, with equal group sizes $n_A = n_B = n$, $$ {\rm SE} = \sqrt{\sigma^2_A/n_A + \sigma^2_B/n_B} = \sqrt{\sigma^2_A + \sigma^2_B} \sqrt{1/n} =\sigma/\sqrt{n}, $$ from which one finds $$ \sigma^2 = \sigma^2_A + \sigma^2_B = r_A(1-r_A) + r_B (1-r_B). $$
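A minimal sketch of this test statistic in code (my own illustration; the rates $r_A = 0.20$, $r_B = 0.26$ and group size $n = 1000$ are hypothetical):

```python
import numpy as np
from scipy.stats import norm

rA, rB, n = 0.20, 0.26, 1000              # hypothetical conversion rates and group size
sigma2 = rA * (1 - rA) + rB * (1 - rB)    # sigma^2 = sigma_A^2 + sigma_B^2
se = np.sqrt(sigma2 / n)                  # standard error of rB - rA
z = (rB - rA) / se                        # test statistic, ~ N(0, 1) under H0
p_two_tail = 2 * norm.sf(abs(z))          # two-tailed p-value
print(f"z = {z:.2f}, p = {p_two_tail:.4f}")
```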
One can then determine the sample size for the A/B test with the previous formula:
For one-tail test case, the sample size is determined by
$$ n_1 = \dfrac{(Z_\alpha + Z_{\beta})^2\left[r_A(1-r_A) + r_B (1-r_B)\right]}{(r_A - r_B)^2} $$
For two-tail test case, the sample size is determined by
$$ n_2 = \dfrac{(Z_{\alpha/2} + Z_{\beta})^2\left[r_A(1-r_A) + r_B (1-r_B)\right]}{(r_A - r_B)^2} $$
where $Z_x$ denotes the upper critical value of the standard normal distribution, defined by $\Pr(Z > Z_x) = x$.
In some cases, the following formula is used $$ n_2 = \dfrac{\left(Z_{\alpha/2}\sqrt{2\bar r(1-\bar r)} + Z_{\beta}\sqrt{r_A(1-r_A) + r_B (1-r_B)}\right)^2}{(r_A - r_B)^2}, $$ where $\bar r = (r_A+r_B)/2$.
The above expression shows that the smaller the difference between the conversion rates of variations A and B that we want to detect, the larger the sample size required for the test.
The conversion rate of variation B: 26% ($r_B = 0.26$).
The minimum difference between the conversion rates of variations A and B is 6%. With a baseline rate of 20% ($r_A = 0.2$), this absolute difference of 6% corresponds to a relative difference of 30% (20% * 0.3 = 6%).
Variation B performed better than variation A.
Let’s choose a confidence level of 95%, i.e. a significance level of $\alpha = 0.05$.
By convention, the statistical power for A/B tests is set to 80%, i.e. $\beta = 0.2$.
from scipy.stats import norm

def samplesize_ABtest(rA, dr, alpha, beta, sides):
    """
    Return the sample size per group for the A/B test, given
    - baseline conversion rate rA
    - required minimum difference dr = rB - rA
    - significance level alpha
    - statistical power 1 - beta
    - sides: 1 for a one-tailed test, 2 for a two-tailed test
    """
    assert dr > 1.e-20
    rB = rA + dr                          # conversion rate of group B
    zscore = norm.isf(alpha)              # one-tailed critical value
    zpower = norm.isf(beta)
    sum_var = rA*(1-rA) + rB*(1-rB)       # sigma_A^2 + sigma_B^2
    if sides == 2:                        # two-tailed critical value
        zscore = norm.isf(alpha/2)
    return (zscore + zpower)**2 * sum_var / dr**2
norm.isf(0.05/2), norm.isf(1 - 0.05/2)  # Z_{alpha/2} and its negative
norm.isf(0.2), norm.isf(0.8)            # Z_beta and its negative
import numpy as np

size = samplesize_ABtest(0.2, 0.06, 0.05, 0.2, 2)
print("Sample size of group A in the two-tail test: {}\nThe total sample size is: {}".format(np.floor(size), np.floor(2*size)))
Therefore, if the conversion rate of group A is 20% and the expected conversion rate of group B is at least 26%, we’ll have to run the experiment until each variation gets 768 visitors in order to check statistical significance at the 5% significance level with 80% statistical power. The total number of visitors in the experiment should therefore be 1536.
If a third variation C were added to the experiment, it would also have to be compared with variation A (the control). We would therefore have to fill variation C with $n$ visitors as well, making the total number of experiment visitors $3n = 3 \times 768$.
A pharmaceutical company has developed a brand new antibiotic against pathogen X. No other antibiotics are available, so no comparison can be made with existing antibiotic treatments. However, it is known from field data that 70% recover from the disease (although effects on production are tremendous). It is expected (hoped) that after use of the antibiotic 95% of the animals will improve and that the duration of the disease will be shorter as well. Thus,
alpha=0.05
norm.isf(alpha/2)
beta=1-0.8
norm.isf(beta)
import numpy as np
from scipy.stats import norm

def samplesize_ABtest2(rA, dr, alpha, beta, sides):
    """
    Return the sample size per group for the A/B test (pooled-variance formula), given
    - baseline conversion rate rA
    - required minimum difference dr = rB - rA
    - significance level alpha
    - statistical power 1 - beta
    - sides: 1 for a one-tailed test, 2 for a two-tailed test
    """
    assert dr > 1.e-20
    rB = rA + dr                              # conversion rate of group B
    zscore = norm.isf(alpha)
    zpower = norm.isf(beta)
    r_avg = (rA + rB) / 2
    pooled_var = 2 * r_avg * (1 - r_avg)      # 2 * rbar * (1 - rbar)
    sum_var = rA*(1-rA) + rB*(1-rB)           # sigma_A^2 + sigma_B^2
    if sides == 2:                            # two-tailed critical value
        zscore = norm.isf(alpha/2)
    return (zscore*np.sqrt(pooled_var) + zpower*np.sqrt(sum_var))**2 / dr**2
size = samplesize_ABtest2(0.7, 0.25, 0.05, 0.2, 1)
print("Sample size of group A in the one-tail test: {} \n The total sample size is: {}".format(np.floor(size), np.floor(2*size)))
Let’s assume that we would like to compute the minimal sample size to detect a 20% increase in conversion rates where the control conversion rate is 50%.
Let’s plug in the numbers into the formula.
The control conversion rate $r_A$ is equal to 50%.
The variation conversion rate $r_B$ is equal to 1.2 * 50% = 60%.
We assume an equal ratio of visitors to both control and variation.
We assume statistical power $1-\beta=0.8$, i.e. $\beta=0.2$.
We assume a significance level of $\alpha=0.05$.
size = samplesize_ABtest(0.5, 0.1, 0.05, 0.2, 2)
print("Sample size of group A in the two-tail test: {}. The total sample size is: {}".format(int(size),
int(size*2)))
size = samplesize_ABtest2(0.5, 0.1, 0.05, 0.2, 2)
print("Sample size of group A in the two-tail test: {}. The total sample size is: {}".format(int(size),
int(size*2)))
import numpy as np
from statsmodels.stats.power import TTestIndPower

# parameters for power analysis
rA = 0.7
dr = 0.25
rB = rA + dr
r = (rA + rB) / 2
pooled_var = 2 * r * (1 - r)           # alternatively rA*(1-rA) + rB*(1-rB)
effect_size = dr / np.sqrt(pooled_var)
alpha = 0.05
power = 0.8
# perform power analysis
analysis = TTestIndPower()
result = analysis.solve_power(effect_size, power=power, nobs1=None, ratio=1.0, alpha=alpha)
print('Effect size: {:.2f}, and sample Size: {:.2f}'.format(effect_size,result))
# calculate power curves for varying sample and effect size
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestIndPower

# parameters for power analysis
effect_sizes = np.array([0.2, 0.5, 0.8])
sample_sizes = np.arange(5, 100)

# calculate power curves from multiple power analyses
analysis = TTestIndPower()
analysis.plot_power(dep_var='nobs', nobs=sample_sizes, effect_size=effect_sizes)
plt.show()