Power Analysis

  • Significance ($1-\alpha$): Also called the confidence level, it is the probability of correctly retaining the null hypothesis when it is true, i.e. of not declaring a difference when no difference actually exists in the population.

  • Type I error ($\alpha$): The probability of rejecting a true null hypothesis, or incorrectly concluding that a difference exists (the null hypothesis is not appropriate) when in fact a difference really does not exist in the population. A false positive decision.

  • (Statistical) Power ($1-\beta$): The probability of correctly rejecting the null hypothesis when it is in fact false, and thus, the probability of observing a difference in the sample when an equal or greater difference is present in the population.

  • Type II error ($\beta$): The probability of accepting a false null hypothesis, or incorrectly concluding that a difference does not exist (the null hypothesis is appropriate) when in fact a difference really does exist in the population. A false negative decision.
              $H_0$ True                 $H_0$ False ($H_a$ True)
Reject $H_0$  Type I error ($\alpha$)    Power ($1-\beta$)
Accept $H_0$  Significance ($1-\alpha$)  Type II error ($\beta$)

$$ \alpha = \Pr(\text{Reject } H_0 \mid H_0 \text{ is True}) $$

$$ 1 - \alpha = \Pr(\text{Accept } H_0 \mid H_0 \text{ is True}) $$

$$ \beta = \Pr(\text{Accept } H_0 \mid H_0 \text{ is False}) $$

$$ 1 - \beta = \Pr(\text{Reject } H_0 \mid H_0 \text{ is False}) $$
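These four probabilities can be illustrated with a short Monte Carlo sketch (the values here are illustrative: a one-sample z-test with $\delta=0.5$, $\sigma=1$, $n=50$):

```python
import numpy as np

rng = np.random.default_rng(0)

mu0, delta, sigma, n = 0.0, 0.5, 1.0, 50   # illustrative values
alpha = 0.05
z_crit = 1.96  # two-sided critical value for alpha = 0.05

n_sims = 20_000
# Sample means simulated under H0 (mu = mu0) and under H1 (mu = mu0 + delta)
means_h0 = rng.normal(mu0, sigma / np.sqrt(n), n_sims)
means_h1 = rng.normal(mu0 + delta, sigma / np.sqrt(n), n_sims)

# z statistics for the test of H0: mu = mu0
z_h0 = (means_h0 - mu0) / (sigma / np.sqrt(n))
z_h1 = (means_h1 - mu0) / (sigma / np.sqrt(n))

type1 = np.mean(np.abs(z_h0) > z_crit)   # empirical Type I error rate, close to alpha
power = np.mean(np.abs(z_h1) > z_crit)   # empirical power (1 - beta)
print(type1, power)
```

The empirical rejection rate under $H_0$ approximates $\alpha$, and the rejection rate under $H_1$ approximates the power $1-\beta$.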

In [37]:
from IPython.display import Image
Image("images/Power.jpg")
Out[37]:

In the above figure, each curve represents the difference between two groups on some continuous measurement $\delta=p-p_0$. The null hypothesis, that there is no difference between the two groups, is represented by the curve on the left ($\mu=p_0$). The alternative hypothesis, that there is some difference between the two groups, is represented by the curve on the right ($\mu=p=p_0+\delta$). Each distribution is also characterized by its variance ($\sigma^2$) which is usually assumed to be the same for both distributions.

  • $\delta$ is called the margin of error, sometimes denoted $E$: the maximum expected difference between the observed sample mean and the true population mean.


In [38]:
Image("images/sample_size.png")
Out[38]:
  • A larger sample size increases power ($1-\beta$) because the standard error of the mean decreases as $1/\sqrt{N}$.
  • For a given alpha level and separation between the $H_0$ and $H_1$ distributions, the probability of rejecting $H_0$ when it is false (the area under the $H_1$ curve beyond the critical value) is smaller when there are fewer subjects (A) than when there are more subjects (B).

Sample size determination

  • Considering the two normal distributions in Figure 1: $$ N_1(\mu, \dfrac{\sigma}{\sqrt{n}}),~~N_2(\mu+\delta, \dfrac{\sigma}{\sqrt{n}}) $$

  • The critical line for the left normal distribution in Figure 1 is located (for the two-tailed case) at $$ \mu + Z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}} $$

  • The critical line for the right normal distribution in Figure 1 is located at $$ \mu + \delta - Z_{\beta} \dfrac{\sigma}{\sqrt{n}} $$

The critical line is at the same location for both distributions. Setting the two expressions equal to each other, one finds

$$ \delta = ( Z_{\alpha/2} + Z_{\beta}) \dfrac{\sigma}{\sqrt{n}} $$ from which one determines the sample size as $$ n = \dfrac{( Z_{\alpha/2} + Z_{\beta})^2}{(\dfrac{\delta}{\sigma})^2} $$
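As a quick numeric check of this formula (the scenario values below are hypothetical), detecting a half-standard-deviation shift ($\delta/\sigma = 0.5$) at $\alpha=0.05$ and 80% power requires about 32 subjects:

```python
from scipy.stats import norm

def sample_size(delta, sigma, alpha=0.05, beta=0.2):
    """n = (Z_{alpha/2} + Z_beta)^2 / (delta/sigma)^2 for a two-sided test."""
    z_alpha = norm.isf(alpha / 2)  # upper alpha/2 quantile of N(0,1)
    z_beta = norm.isf(beta)        # upper beta quantile of N(0,1)
    return (z_alpha + z_beta) ** 2 / (delta / sigma) ** 2

n = sample_size(delta=0.5, sigma=1.0)
print(n)  # about 31.4, rounded up to 32 subjects
```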

Effect size: the magnitude of the difference between groups divided by the standard deviation $$ {\rm Effect ~~ size} =\dfrac{\delta}{\sigma} $$

The effect size is the main finding of a quantitative study. While a P value can inform the reader whether an effect exists, the P value will not reveal the size of the effect.

Determining the sample size in A/B tests

  • Control group A and test group B with equal size $n_A=n_B=n$
  • Conversion rate (Baseline) of control group A is $r_A$
  • Conversion rate of test group B is $r_B$
  • Difference in the conversion rates: $\delta = r_B - r_A$

  • The test statistic $$ \dfrac{(r_A - r_B)}{{\rm SE}} $$ follows the standard normal distribution $N(0,1)$ under $H_0$, where $$ {\rm SE} = \sqrt{\sigma^2_A/n_A + \sigma^2_B/n_B} = \sqrt{\sigma^2_A + \sigma^2_B} \sqrt{1/n} =\sigma/\sqrt{n}, $$ from which one finds $$ \sigma^2 = \sigma^2_A + \sigma^2_B = r_A(1-r_A) + r_B (1-r_B). $$

  • Determine the sample size for the A/B test with the previous formula:

For the one-tailed test, the sample size is determined by

$$ n_1 = \dfrac{(Z_\alpha + Z_{\beta})^2\left[r_A(1-r_A) + r_B (1-r_B)\right]}{(r_A - r_B)^2} $$

For the two-tailed test, the sample size is determined by

$$ n_2 = \dfrac{(Z_{\alpha/2} + Z_{\beta})^2\left[r_A(1-r_A) + r_B (1-r_B)\right]}{(r_A - r_B)^2} $$

where $Z_x$ denotes the upper $x$-quantile of the standard normal distribution, i.e. $\Pr(Z > Z_x) = x$.

In some cases, the following formula is used $$ n_2 = \dfrac{\left(Z_{\alpha/2}\sqrt{2\bar r(1-\bar r)} + Z_{\beta}\sqrt{r_A(1-r_A) + r_B (1-r_B)}\right)^2}{(r_A - r_B)^2}, $$ where $\bar r = (r_A+r_B)/2$.

The expressions above show that the smaller the difference between the conversion rates of variations A and B that is to be detected, the greater the sample size required for the test.
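The variance identity $\sigma^2 = r_A(1-r_A) + r_B(1-r_B)$ used in these formulas can be verified with a quick simulation (the rates and group size here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
rA, rB, n = 0.2, 0.26, 5000   # illustrative conversion rates and group size

# Simulate many A/B experiments and look at the spread of the observed difference
n_sims = 10_000
rhat_A = rng.binomial(n, rA, n_sims) / n
rhat_B = rng.binomial(n, rB, n_sims) / n
emp_var = np.var(rhat_A - rhat_B)

# Theoretical variance of the difference: sigma^2 / n
theo_var = (rA * (1 - rA) + rB * (1 - rB)) / n
print(emp_var, theo_var)
```

The empirical variance of $\hat r_A - \hat r_B$ matches $\sigma^2/n$ to within simulation noise.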

An example

  • The conversion rate of variation A: 20% ($r_A = 0.2$);
  • The conversion rate of variation B: 26% ($r_B = 0.26$).

  • The minimum detectable difference between the conversion rates of variations A and B is 6% in absolute terms, which corresponds to a relative difference of 30% (20% × 0.3 = 6%).

  • Variation B performed better than variation A.

  • Let’s choose a confidence level of 95%, i.e. a significance level of 5%.

  • Statistical power in A/B tests is conventionally set to 80%, i.e. $\beta=0.2$.

In [12]:
import numpy as np
from scipy.stats import norm
def samplesize_ABtest(rA, dr, alpha, beta, sides):
    """
    return sample size for the AB test, 
    given 
    - baseline conversion rate rA 
    - required minimum difference dr = rB-rA,
    - significance level alpha
    - statistical power 1-beta
    """
    
    assert dr>1.e-20
    # conversion rate of B group
    rB = rA + dr
    
    zscore = norm.isf(alpha)
    zpower = norm.isf(beta)
    
    pooled_var = rA*(1-rA) + rB*(1-rB) # 2*r*(1-r)
    
    
    if sides ==2:  # two tailed
        zscore = norm.isf(alpha/2)
        
    n12 = (zscore+zpower)**2 * pooled_var/ dr**2  
    return n12
In [14]:
norm.isf(0.05/2),norm.isf(1-0.05/2)
Out[14]:
(1.9599639845400545, -1.959963984540054)
In [15]:
norm.isf(0.2),norm.isf(0.8)
Out[15]:
(0.8416212335729142, -0.8416212335729143)
In [17]:
size = samplesize_ABtest(0.2, 0.06, 0.05, 0.2, 2)
print("Sample size of group A in the two-tail test: {} \n The total sample size is: {}".format(np.floor(size), 
                                                                                               np.floor(2*size)))
Sample size of group A in the two-tail test: 768.0 
 The total sample size is: 1536.0

Therefore, if the conversion rate of group A is 20% and the expected conversion rate of group B is at least 26%, we’ll have to run the experiment until each variation gets 768 visitors in order to reach statistical significance at the 5% level with 80% statistical power. Thus, the total number of experiment visitors should be 1536.

If there is a third variation C in the experiment, it should also be compared with variation A (the control). Therefore, we’ll have to fill variation C with n visitors as well, making the total number of experiment visitors equal to $3n = 3 \times 768$.

Example 2

A pharmaceutical company has developed a brand new antibiotic against pathogen X. No other antibiotics are available, so no comparison can be made with existing antibiotic treatments. However, it is known from field data that 70% of the animals recover from the disease (although the effects on production are tremendous). It is expected (hoped) that after use of the antibiotic 95% of the animals will improve and that the duration of the disease will be shorter as well. Thus,

  • $p_1 = 0.70$ and $p_2 = 0.95$.
  • Choose a two-sided confidence level of $95\%$ and a power of $80\%$.
  • The corresponding critical values are $Z_{\alpha/2}=1.96$ and $Z_{\beta}=0.84$.
  • Using the formula, calculate n for each group.
In [11]:
alpha=0.05
norm.isf(alpha/2)
Out[11]:
1.9599639845400545
In [18]:
beta=1-0.8
norm.isf(beta)
Out[18]:
0.8416212335729143
In [8]:
import numpy as np
from scipy.stats import norm
def samplesize_ABtest2(rA, dr, alpha, beta, sides):
    """
    return sample size for the AB test, 
    given 
    - baseline conversion rate rA 
    - required minimum difference dr = rB-rA,
    - significance level alpha
    - statistical power 1-beta
    """
    
    assert dr>1.e-20
    # conversion rate of B group
    rB = rA + dr
    
    zscore = norm.isf(alpha)
    zpower = norm.isf(beta)
    
    r_avg = (rA+rB)/2
    pooled_var = 2*r_avg*(1-r_avg)
    sum_var = rA*(1-rA) + rB*(1-rB) # 2*r*(1-r)
    
    if sides ==2:
        zscore = norm.isf(alpha/2)
        
    n12 = (zscore*np.sqrt(pooled_var)+zpower*np.sqrt(sum_var))**2/ dr**2  
    return n12
In [10]:
size = samplesize_ABtest2(0.7, 0.25, 0.05, 0.2, 2)
print("Sample size of group A in the two-tail test: {} \n The total sample size is: {}".format(np.floor(size), np.floor(2*size)))
Sample size of group A in the two-tail test: 35.0 
 The total sample size is: 70.0

Example 3

Let’s assume that we would like to compute the minimal sample size to detect a 20% increase in conversion rates where the control conversion rate is 50%.

Let’s plug in the numbers into the formula.

The control conversion rate $r_1$ equals 50%.

The variation conversion rate $r_2$ equals 1.2 × 50% = 60%.

We assume an equal ratio of visitors in the control and the variation.

We assume a statistical power of $1-\beta=0.8$, i.e. $\beta=0.2$.

We assume a significance level of $\alpha=0.05$.

In [21]:
size = samplesize_ABtest(0.5, 0.1, 0.05, 0.2, 2)
print("Sample size of group A in the two-tail test: {}. The total sample size is: {}".format(int(size), 
                                                                                             int(size*2)))
Sample size of group A in the two-tail test: 384. The total sample size is: 769
In [22]:
size = samplesize_ABtest2(0.5, 0.1, 0.05, 0.2, 2)
print("Sample size of group A in the two-tail test: {}. The total sample size is: {}".format(int(size), 
                                                                                             int(size*2)))
Sample size of group A in the two-tail test: 387. The total sample size is: 774

Using the statsmodels package

In [36]:
import numpy as np
from statsmodels.stats.power import TTestIndPower
# parameters for power analysis
rA = 0.7
dr = 0.25
rB = rA+dr

r = (rA+rB)/2

pooled_var = 2*r*(1-r) # rA*(1-rA) + rB*(1-rB)
effect_size = dr/np.sqrt(pooled_var)

alpha = 0.05
power = 0.8  #
# perform power analysis
analysis = TTestIndPower()
result = analysis.solve_power(effect_size, power=power, nobs1=None, ratio=1.0, alpha=alpha)
print('Effect size: {:.2f}, and sample Size: {:.2f}'.format(effect_size,result))
Effect size: 0.47, and sample Size: 73.50
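For proportions specifically, statsmodels also provides NormalIndPower together with proportion_effectsize, which uses Cohen's h (an arcsine-transformed effect size) rather than the pooled-variance effect size above, so the resulting sample size differs somewhat. A sketch using the Example 2 rates:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h for the Example 2 rates rA = 0.70, rB = 0.95
h = proportion_effectsize(0.95, 0.70)

# z-test power analysis for two independent samples
analysis = NormalIndPower()
n1 = analysis.solve_power(effect_size=h, power=0.8, alpha=0.05, ratio=1.0)
print('Effect size h: {:.2f}, sample size per group: {:.1f}'.format(h, n1))
```

With these numbers the arcsine approach asks for roughly 31 subjects per group, in the same ballpark as the 35 obtained from the exact-variance formula.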
In [41]:
# calculate power curves for varying sample and effect size
from numpy import array
from statsmodels.stats.power import TTestIndPower
import matplotlib.pyplot as plt

# parameters for power analysis
effect_sizes = array([0.2, 0.5, 0.8])
sample_sizes = array(range(5, 100))
# calculate power curves from multiple power analyses
analysis = TTestIndPower()
analysis.plot_power(dep_var='nobs', nobs=sample_sizes, effect_size=effect_sizes)
plt.show()