Central Limit Theorem

Applied Statistics
Statistics
Data Analysis
Python
Author
Published

September 15, 2023

Note

This is the joint work of Khush Shah and Kajal Panchal.

Abstract

According to the central limit theorem, the sample mean is approximately normally distributed for large sample sizes. This exercise verifies the theorem experimentally on simulated data as well as on a real dataset. Using the Python NumPy library, 1000 random samples of sizes 10, 20, 40, 50, and 100 are drawn from each of three distributions (Uniform, Normal, and Laplace). For the real data, the Global YouTube Statistics dataset from Kaggle is used, from which 500 samples of sizes 30, 50, and 100 are drawn, both with and without replacement. Sample means are calculated and plotted to examine their underlying distribution for each dataset and sample size. For the simulated data, the sample means are normally distributed for large n, irrespective of the original population distribution. For the YouTube dataset, the distributions of sample means drawn with replacement are right-skewed for all sample sizes, while those drawn without replacement are right-skewed for smaller sample sizes but approximately normal for sample size 100.

Introduction

Central Limit Theorem

Let \(X_1, X_2, ..., X_n\) be IID random variables with mean \(\mu\) and variance \(\sigma^2\). Let \(\bar{X_n} = \frac{1}{n} \sum_{i=1}^{n} X_i\). Then

\[Z_n = \frac{\bar{X_n} - \mu}{\sqrt{{\sigma^2}/{n}}} = \frac{\sqrt{n}(\bar{X_n} - \mu)}{\sigma} \xrightarrow{d} N(0,1)\]

In other words,

\[\lim_{n \to \infty} P(Z_n \leq z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} dx\]

where \(\Phi(z)\) is the CDF of the standard normal distribution.

Using the above theorem, probability statements about the sample mean can be approximated using a normal distribution. It’s the probability statements that are being approximated, not the random variable itself.
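As a quick illustration (a minimal sketch using only NumPy and the standard library; the threshold 6 and the Uniform(-5, 15) population are arbitrary choices), the normal approximation of a probability statement about the sample mean can be compared against a brute-force simulation:

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# Population: Uniform(-5, 15), so mu = (a + b) / 2 and sigma = (b - a) / sqrt(12)
a, b, n = -5, 15, 40
mu = (a + b) / 2
sigma = (b - a) / sqrt(12)

# CLT approximation of P(sample mean <= 6)
z = (6 - mu) / (sigma / sqrt(n))
p_clt = 0.5 * (1 + erf(z / sqrt(2)))  # Phi(z) without scipy

# Monte Carlo estimate of the same probability
sample_means = rng.uniform(a, b, size=(100_000, n)).mean(axis=1)
p_sim = np.mean(sample_means <= 6)

print(f"CLT approximation: {p_clt:.4f}, simulation: {p_sim:.4f}")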

Samples and Distributions

To verify the central limit theorem, this exercise uses two types of data sets. The first is simulated: 1000 random samples of sizes 10, 20, 40, 50, and 100 are generated from each of three distributions (Uniform, Normal, and Laplace), that is, 1000 × 3 × 5 = 15,000 sample sets in total. These samples are generated using NumPy library functions. The second is a real data set, the Global YouTube Statistics data from Kaggle, from which 500 samples of sizes 30, 50, and 100 are drawn, both with and without replacement, using the numpy.random.choice() function. Sample means are then calculated and plotted to examine the underlying distribution of the sample means.

Distributions

The distributions considered here are of continuous random variables. The probability density functions of these distributions are given below:

1. Uniform Distribution

If a random variable X has the probability density function below, it is said to follow the Uniform distribution.

\[f(x) = \begin{cases} \frac{1}{b-a} & a \leq x \leq b \\ 0 & otherwise \end{cases}\]

\(f(x)\) is the probability density function of the uniform distribution.

\(a\) and \(b\) are the lower and upper limits of the distribution.

The PDF of the uniform distribution is defined over the closed interval \([a, b]\): it is constant within this interval and zero outside of it.
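As a quick sanity check (a minimal NumPy sketch, using the same a = -5 and b = 15 as in the simulations below), the closed-form mean \((a+b)/2\) and variance \((b-a)^2/12\) of the uniform distribution can be verified empirically:

import numpy as np

rng = np.random.default_rng(42)
a, b = -5, 15

draws = rng.uniform(a, b, size=1_000_000)

# Closed-form moments of Uniform(a, b)
print("theoretical mean:", (a + b) / 2, " empirical:", round(draws.mean(), 3))
print("theoretical var :", (b - a) ** 2 / 12, " empirical:", round(draws.var(), 3))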

2. Normal Distribution

If a random variable has the probability density function below, its distribution is called the Normal or Gaussian distribution.

\[f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]

\(f(x)\) is the probability density function of the normal distribution.

\(\mu\) is the mean of the distribution; it locates the peak of the bell curve.

\(\sigma\) is the standard deviation of the distribution; it controls the width/spread of the bell curve.

This formula describes how the probability density varies for different values of x in a normal distribution. The normal distribution is symmetric and bell-shaped, and it is often used to model various natural phenomena and statistical processes due to its mathematical properties and prevalence in real-world data.
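As a small sketch (NumPy only; the grid and its width are arbitrary), the formula above can be evaluated directly and checked to integrate to approximately one:

import numpy as np

mu, sigma = 5, 2
x = np.linspace(mu - 6 * sigma, mu + 6 * sigma, 10_001)
dx = x[1] - x[0]

# Evaluate the normal PDF directly from the closed form above
pdf = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# A Riemann sum over a wide enough grid should be close to 1
print("approximate integral of the PDF:", round(float(np.sum(pdf) * dx), 6))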

3. Laplace Distribution

If a random variable has the probability density function below, its distribution is called the Laplace distribution.

\[f(x) = \frac{1}{2b} e^{-\frac{|x-\mu|}{b}}\]

\(f(x)\) is the probability density function of the Laplace distribution.

\(\mu\) is the mean (location parameter) of the distribution; it locates the peak of the density.

\(b\) is the scale parameter of the distribution; it controls the width/spread of the density.

The absolute value \(|x - \mu|\) in the exponent makes the Laplace distribution symmetric about its mean \(\mu\). It is characterized by a sharp peak at \(\mu\) and heavy tails. The Laplace distribution is often used in statistics and data analysis, particularly when data is sparse or exhibits sudden changes or outliers.
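As a quick sanity check (a minimal NumPy sketch using the same \(\mu = 5\) and \(b = 2\) as in the simulations below; for Laplace(\(\mu\), b) the variance is \(2b^2\)):

import numpy as np

rng = np.random.default_rng(7)
mu, b = 5, 2

draws = rng.laplace(mu, b, size=1_000_000)

# For Laplace(mu, b): mean = mu, variance = 2 * b**2
print("theoretical mean:", mu, " empirical:", round(draws.mean(), 3))
print("theoretical var :", 2 * b ** 2, " empirical:", round(draws.var(), 3))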

Samples

1. Parameter used to generate samples for Uniform distribution

numpy.random.uniform(low = -5, high = 15, size = n), with n taking each of the sample sizes 10, 20, 40, 50, and 100

Low: Lower boundary of the output interval. All values generated will be greater than or equal to low.

High: Upper boundary of the output interval. All values generated will be less than high.

2. Parameter used to generate samples for Normal distribution

numpy.random.normal(loc = 5, scale = 2, size = n), with n taking each of the sample sizes 10, 20, 40, 50, and 100

Loc: Mean of the distribution.

Scale: Standard deviation of the distribution.

3. Parameter used to generate samples for Laplace distribution

numpy.random.laplace(loc = 5, scale = 2, size = n), with n taking each of the sample sizes 10, 20, 40, 50, and 100

Loc: The position, \(\mu\), of the distribution peak.

Scale: The exponential decay (the scale parameter \(b\) in the PDF above).
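In practice one call is made per sample size, with the size argument set to the number of draws in that sample; a minimal sketch of the pattern mirrored by the Results code below:

import numpy as np

sizes = (10, 20, 40, 50, 100)

# One array of draws per sample size, for each of the three distributions
uniform_samples = [np.random.uniform(-5, 15, n) for n in sizes]
normal_samples = [np.random.normal(5, 2, n) for n in sizes]
laplace_samples = [np.random.laplace(5, 2, n) for n in sizes]

print([s.size for s in uniform_samples])  # [10, 20, 40, 50, 100]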

Results

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"
import matplotlib.pyplot as plt

# Sample sizes used for the simulated data
sizes = (10, 20, 40, 50, 100)
samp_sizes = len(sizes)

1. Uniform Distribution

Test

# Collect the sample means for each sample size
means = [[] for _ in range(samp_sizes)]
min_mean = np.inf
max_mean = -np.inf

for s in range(1000):
    np.random.seed(s)  # one seed per replication for reproducibility
    for i in range(samp_sizes):
        uniform = np.random.uniform(-5, 15, sizes[i])  # one sample of size sizes[i]
        mean = np.mean(uniform)
        means[i].append(mean)

        # track the overall range of the sample means
        if mean < min_mean:
            min_mean = mean
        if mean > max_mean:
            max_mean = mean

df = pd.DataFrame({f"Sample_Size{sizes[i]}" : means[i] for i in range(samp_sizes)})
df
Sample_Size10 Sample_Size20 Sample_Size40 Sample_Size50 Sample_Size100
0 7.315326 6.275947 3.516244 5.064370 4.943963
1 1.292586 4.917143 4.468110 6.249390 4.599627
2 2.176020 2.764809 5.425378 5.080627 4.479303
3 4.350047 3.690273 4.471169 5.215467 4.636943
4 6.571719 5.177381 5.392267 5.529318 5.465427
... ... ... ... ... ...
995 4.110400 5.407123 4.593660 4.931387 4.980018
996 3.776225 3.260758 5.139498 5.573408 4.083426
997 5.521992 5.848969 5.080605 3.698350 4.168001
998 8.681317 3.745099 4.122767 5.075479 4.939119
999 4.640361 4.032694 6.760931 6.696176 5.004901

1000 rows × 5 columns

fig = ff.create_distplot(means, [f"Sample_Size{sizes[i]}" for i in range(samp_sizes)], bin_size=0.1, colors=("red", "blue", "green", "orange", "purple"))
fig.add_shape(type="line", x0=-5, y0=0, x1=15, y1=1, line=dict(color="red", width=0))
fig.update_layout(title_text="Distributions of Sample Means (Uniform)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()

Comments

  1. Sample means are highly dispersed for smaller sample sizes: every value in \([-5, 15]\) is equally likely, so individual draws, and hence small-sample averages, vary widely.

  2. As the sample size increases, the variance of the sample means decreases (a numeric check against \(\sigma/\sqrt{n}\) follows below).
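As a sketch (reusing the means list computed above), the empirical standard deviation of the sample means can be compared with the CLT prediction \(\sigma/\sqrt{n}\), where \(\sigma^2 = (15 - (-5))^2 / 12\) for this uniform population:

sigma = (15 - (-5)) / np.sqrt(12)  # population std of Uniform(-5, 15)

for i, n in enumerate(sizes):
    print(f"n = {n:3d}: empirical std of sample means = {np.std(means[i]):.3f}, "
          f"sigma/sqrt(n) = {sigma / np.sqrt(n):.3f}")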

2. Normal Distribution

Test

means = [[] for _ in range(samp_sizes)]
min_mean = np.inf
max_mean = -np.inf

for s in range(1000):
    np.random.seed(s)
    for i in range(samp_sizes):
        normal = np.random.normal(5, 2, sizes[i])
        mean = np.mean(normal)
        means[i].append(mean)

        if mean < min_mean:
            min_mean = mean
        if mean > max_mean:
            max_mean = mean
df = pd.DataFrame({f"Sample_Size{sizes[i]}" : means[i] for i in range(samp_sizes)})
df
Sample_Size10 Sample_Size20 Sample_Size40 Sample_Size50 Sample_Size100
0 6.476046 5.590546 4.274789 5.685278 5.012593
1 4.805718 4.917324 5.343190 4.969505 5.285519
2 3.737300 4.638546 4.996812 4.999921 5.160925
3 4.719044 4.765772 4.744089 5.015770 5.036081
4 4.476193 5.368613 5.182945 5.108665 4.995094
... ... ... ... ... ...
995 3.865953 4.784016 5.127216 5.254136 5.046751
996 4.784242 4.400601 5.338084 5.556812 4.916327
997 5.180634 4.311766 5.079467 5.345439 4.925435
998 5.565577 4.487421 4.834397 5.146891 4.695533
999 5.389380 4.846252 5.443953 5.209518 4.898074

1000 rows × 5 columns

fig = ff.create_distplot(means, [f"Sample_Size{sizes[i]}" for i in range(samp_sizes)], bin_size=0.1, colors=("red", "blue", "green", "orange", "purple"))
fig.add_shape(type="line", x0=-5, y0=0, x1=15, y1=1, line=dict(color="red", width=0))
fig.update_layout(title_text="Distributions of Sample Means (Normal)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()

Comments

  1. The sample means are normally distributed for all sample sizes.
  2. The distribution becomes narrower and remains symmetric about the mean as the sample size increases.
  3. Consequently, confidence intervals for the population mean become shorter (a numeric check follows below).
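As a sketch (reusing the means computed in the cell above), the mean of the sample means should stay near the population mean \(\mu = 5\) while their spread should track \(\sigma/\sqrt{n}\) with \(\sigma = 2\):

for i, n in enumerate(sizes):
    print(f"n = {n:3d}: mean of sample means = {np.mean(means[i]):.3f}, "
          f"std = {np.std(means[i]):.3f} vs 2/sqrt(n) = {2 / np.sqrt(n):.3f}")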

3. Laplace Distribution

Test

means = [[] for _ in range(samp_sizes)]
min_mean = np.inf
max_mean = -np.inf

for s in range(1000):
    np.random.seed(s)
    for i in range(samp_sizes):
        laplace = np.random.laplace(5, 2, sizes[i])
        mean = np.mean(laplace)
        means[i].append(mean)

        if mean < min_mean:
            min_mean = mean
        if mean > max_mean:
            max_mean = mean
df = pd.DataFrame({f"Sample_Size{sizes[i]}" : means[i] for i in range(samp_sizes)})
df
Sample_Size10 Sample_Size20 Sample_Size40 Sample_Size50 Sample_Size100
0 5.984471 5.383774 4.462571 4.961485 4.958143
1 2.418243 4.872504 4.714305 5.486043 4.687591
2 3.904200 4.097192 5.393066 4.933448 4.642493
3 4.715620 4.379835 4.964626 5.047558 4.802165
4 5.758612 5.035049 5.177542 5.195536 5.238218
... ... ... ... ... ...
995 4.885447 5.879259 4.660367 4.725307 4.796725
996 4.823875 4.373668 4.914142 5.072047 4.726472
997 5.449383 6.437400 4.855217 4.347397 4.698067
998 6.312200 4.659888 4.620671 5.114754 4.913229
999 4.705271 4.550642 5.892077 5.763841 5.006041

1000 rows × 5 columns

fig = ff.create_distplot(means, [f"Sample_Size{sizes[i]}" for i in range(samp_sizes)], bin_size=0.1, colors=("red", "blue", "green", "orange", "purple"))
fig.add_shape(type="line", x0=-5, y0=0, x1=15, y1=1, line=dict(color="red", width=0))
fig.update_layout(title_text="Distributions of Sample Means (Laplace)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()

Comments

  1. The sample means are approximately normally distributed for all sample sizes considered.
  2. The variance of the sample means decreases as the sample size increases; standardizing the means as in the CLT statement gives roughly a standard normal, as checked below.
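As a sketch (reusing the means above; for a Laplace(\(\mu = 5\), \(b = 2\)) population the variance is \(2b^2 = 8\)), standardizing the sample means as in the CLT statement should give roughly zero mean and unit standard deviation:

mu, sigma = 5, np.sqrt(2 * 2 ** 2)  # Laplace(5, b=2): variance = 2 * b**2 = 8

for i, n in enumerate(sizes):
    z = np.sqrt(n) * (np.array(means[i]) - mu) / sigma
    print(f"n = {n:3d}: mean(Z_n) = {z.mean():+.3f}, std(Z_n) = {z.std():.3f}")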

Real Data Set

Global YouTube dataset: Source

df = pd.read_csv("./A2/YouTube_Statistics.csv")
df
rank Youtuber subscribers video views category Title uploads Country Abbreviation channel_type ... subscribers_for_last_30_days created_year created_month created_date Gross tertiary education enrollment (%) Population Unemployment rate Urban_population Latitude Longitude
0 1 T-Series 245000000 2.280000e+11 Music T-Series 20082 India IN Music ... 2000000.0 2006.0 Mar 13.0 28.1 1.366418e+09 5.36 471031528.0 20.593684 78.962880
1 2 YouTube Movies 170000000 0.000000e+00 Film & Animation youtubemovies 1 United States US Games ... NaN 2006.0 Mar 5.0 88.2 3.282395e+08 14.70 270663028.0 37.090240 -95.712891
2 3 MrBeast 166000000 2.836884e+10 Entertainment MrBeast 741 United States US Entertainment ... 8000000.0 2012.0 Feb 20.0 88.2 3.282395e+08 14.70 270663028.0 37.090240 -95.712891
3 4 Cocomelon - Nursery Rhymes 162000000 1.640000e+11 Education Cocomelon - Nursery Rhymes 966 United States US Education ... 1000000.0 2006.0 Sep 1.0 88.2 3.282395e+08 14.70 270663028.0 37.090240 -95.712891
4 5 SET India 159000000 1.480000e+11 Shows SET India 116536 India IN Entertainment ... 1000000.0 2006.0 Sep 20.0 28.1 1.366418e+09 5.36 471031528.0 20.593684 78.962880
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
990 991 Natan por Aí 12300000 9.029610e+09 Sports Natan por Aí 1200 Brazil BR Entertainment ... 700000.0 2017.0 Feb 12.0 51.3 2.125594e+08 12.08 183241641.0 -14.235004 -51.925280
991 992 Free Fire India Official 12300000 1.674410e+09 People & Blogs Free Fire India Official 1500 India IN Games ... 300000.0 2018.0 Sep 14.0 28.1 1.366418e+09 5.36 471031528.0 20.593684 78.962880
992 993 Panda 12300000 2.214684e+09 NaN HybridPanda 2452 United Kingdom GB Games ... 1000.0 2006.0 Sep 11.0 60.0 6.683440e+07 3.85 55908316.0 55.378051 -3.435973
993 994 RobTopGames 12300000 3.741235e+08 Gaming RobTopGames 39 Sweden SE Games ... 100000.0 2012.0 May 9.0 67.0 1.028545e+07 6.48 9021165.0 60.128161 18.643501
994 995 Make Joke Of 12300000 2.129774e+09 Comedy Make Joke Of 62 India IN Comedy ... 100000.0 2017.0 Aug 1.0 28.1 1.366418e+09 5.36 471031528.0 20.593684 78.962880

995 rows × 28 columns

Population Distribution

subscribers = df["subscribers"]

mean = np.mean(subscribers)

fig = ff.create_distplot([subscribers], ["Subscribers"], bin_size=1000000, colors=("red",))
fig.update_layout(title_text=f"Distribution of Subscribers (Mean: {int(mean)})", xaxis_title="Subscribers", yaxis_title="Density")
fig.show()
sample_size = (30,50,100)
# sample with replacement
mean_with_replacement = [[] for _ in range(len(sample_size))]
# sample without replacement
mean_without_replacement = [[] for _ in range(len(sample_size))]

# calculating mean of sample with replacement, 500 samples
for s in range(500):
    for i in range(len(sample_size)):
        # random sampling with replacement
        sample = np.random.choice(subscribers, sample_size[i], replace=True)
        mean_with_replacement[i].append(np.mean(sample))
        # random sampling without replacement
        sample = np.random.choice(subscribers, sample_size[i], replace=False)
        mean_without_replacement[i].append(np.mean(sample))
# with replacement
df_means_wr = pd.DataFrame({f"Sample_Size{sample_size[i]}" : mean_with_replacement[i] for i in range(len(sample_size))})
df_means_wr
Sample_Size30 Sample_Size50 Sample_Size100
0 2.181000e+07 23064000.0 23113000.0
1 2.439667e+07 21542000.0 21662000.0
2 2.002667e+07 30156000.0 23871000.0
3 2.223333e+07 19546000.0 25976000.0
4 2.373333e+07 26302000.0 24369000.0
... ... ... ...
495 1.946667e+07 25854000.0 20054000.0
496 3.431000e+07 22392000.0 22566000.0
497 2.102000e+07 26390000.0 23828000.0
498 2.321333e+07 22754000.0 21636000.0
499 2.286333e+07 22706000.0 23879000.0

500 rows × 3 columns

fig = ff.create_distplot(mean_with_replacement, [f"Sample_Size{sample_size[i]}" for i in range(len(sample_size))], bin_size=100000, colors=("red", "blue", "green"))
fig.update_layout(title_text="Distributions of Sample Means (With Replacement)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()
# without replacement
df_means_wo = pd.DataFrame({f"Sample_Size{sample_size[i]}" : mean_without_replacement[i] for i in range(len(sample_size))})
df_means_wo
Sample_Size30 Sample_Size50 Sample_Size100
0 1.953667e+07 22990000.0 25088000.0
1 2.045000e+07 24594000.0 23897000.0
2 2.202000e+07 22940000.0 22299000.0
3 2.734333e+07 20158000.0 24075000.0
4 2.540667e+07 22112000.0 22228000.0
... ... ... ...
495 2.444000e+07 20702000.0 24035000.0
496 1.970000e+07 22114000.0 23646000.0
497 2.421000e+07 29664000.0 22453000.0
498 2.367667e+07 21016000.0 23258000.0
499 2.631000e+07 21296000.0 24449000.0

500 rows × 3 columns

fig = ff.create_distplot(mean_without_replacement, [f"Sample_Size{sample_size[i]}" for i in range(len(sample_size))], bin_size=100000, colors=("red", "blue", "green"))
fig.update_layout(title_text="Distributions of Sample Means (Without Replacement)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()
means = []

for i in range(len(sample_size)):
    means.append((np.mean(mean_with_replacement[i]), np.mean(mean_without_replacement[i])))

means = np.array(means)
fig = go.Figure()
fig.update_layout(xaxis = dict(tickmode = 'array',
        tickvals = [i for i in range(len(sample_size))],
        ticktext = [str(i) for i in sample_size]
    ),
    title='Means for different sample sizes',
    xaxis_title='Sample Size',
    yaxis_title='Mean of Sample Means')
fig.add_trace(go.Bar(x=[i+0.05 for i in range(len(sample_size))], y=means[:,1], name='Mean without Replacement', width=0.4, opacity=0.75))
# fig.add_trace(go.Scatter(x=[i+0.05 for i in range(len(sample_size))], y=means[:,1], name='Standard Deviation estimate', line=dict(color='blue', width=2)))
fig.add_trace(go.Bar(x=[i-0.1 for i in range(len(sample_size))], y=means[:,0], name='Mean with Replacement', width=0.4, opacity=0.75))
# fig.add_trace(go.Scatter(x=[i-0.1 for i in range(len(sample_size))], y=means[:,0], name='Mean estimate', line=dict(color='yellow', width=2)))

fig.add_trace(go.Scatter(x=[i-0.5 for i in range(len(sample_size)+1)], y=[mean]*(len(sample_size)+1), mode='lines', name='True Mean', line=dict(color='purple', width=1.5)))

fig.show()

Conclusions

Simulated Data

  1. The distribution of the sample means is approximately Normal. By visual inspection of the kernel density estimate (KDE) fitted over the histograms, the sample means are normally distributed for large sample sizes, irrespective of the original population distribution.

  2. For each distribution, the variance of the sample means is inversely proportional to the sample size (\(\mathrm{Var}(\bar{X_n}) = \sigma^2/n\)): the larger the sample size, the smaller the variance of the sample-mean distribution. Hence the confidence in estimating the population mean increases with the sample size, which coincides with the central limit theorem.

  3. For each distribution, the mean of the sample means is approximately equal to the mean of the original population.

  4. Comparing the three distributions, the sample means from the uniform distribution are more spread out than those from the other two, while the sample means drawn from the normal distribution are the most compact of the three. This is because the variance of the uniform distribution used here is the highest of the three, as quantified below.
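Concretely, the population variances implied by the parameters used above explain this ordering:

\[\sigma^2_{\text{Uniform}(-5,\,15)} = \frac{(15 - (-5))^2}{12} \approx 33.3, \qquad \sigma^2_{\text{Normal}(5,\,2)} = 2^2 = 4, \qquad \sigma^2_{\text{Laplace}(5,\,b=2)} = 2b^2 = 8\]

so the standard deviations of the corresponding sample means are roughly \(5.77/\sqrt{n}\), \(2/\sqrt{n}\), and \(2.83/\sqrt{n}\), respectively.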

Real Dataset

  1. The original population distribution is skewed to the right. As expected from the central limit theorem, the distribution of sample means moves toward normality as the sample size grows: the means drawn without replacement are approximately normal at sample size 100, while noticeable right skew remains at smaller sample sizes and for sampling with replacement.

  2. Difference between sampling with replacement and without replacement: when sampling is done with replacement, the sample means are more spread out than when sampling without replacement. This is because the same observation can be selected more than once when sampling with replacement, so the sample mean can vary more; without replacement, each observation appears at most once per sample, so the mean is more stable. A short numerical sketch of this finite-population effect follows below.
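A minimal numerical sketch of this effect (reusing the subscribers, sample_size, mean_with_replacement, and mean_without_replacement variables from the sampling code above, and applying the standard finite population correction \(\sqrt{(N - n)/(N - 1)}\)):

N = len(subscribers)
sigma_pop = np.std(subscribers)  # population standard deviation of the subscriber counts

for i, n in enumerate(sample_size):
    se_with = sigma_pop / np.sqrt(n)                   # standard error, sampling with replacement
    se_without = se_with * np.sqrt((N - n) / (N - 1))  # finite population correction
    print(f"n = {n:3d}: theoretical SE with / without replacement = {se_with:,.0f} / {se_without:,.0f}, "
          f"empirical = {np.std(mean_with_replacement[i]):,.0f} / {np.std(mean_without_replacement[i]):,.0f}")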