Central Limit Theorem

Applied Statistics
Statistics
Data Analysis
Python
Author
Published

September 15, 2023

Note

This is the joint work of Khush Shah and Kajal Panchal.

Abstract

According to the central limit theorem, the sample mean is approximately normally distributed for large sample sizes. This exercise verifies the theorem experimentally on simulated data as well as on a real dataset. Using the Python NumPy library, 1000 random samples of sizes 10, 20, 40, 50, and 100 are drawn from each of three distributions (Uniform, Normal, and Laplace). For the real data, the Global YouTube Statistics dataset from Kaggle is used, from which 500 samples of sizes 30, 50, and 100 are drawn, both with and without replacement. Sample means are calculated and plotted to examine their underlying distribution for each dataset and sample size. For the simulated data, the sample means are normally distributed for large n, irrespective of the original population distribution. For the YouTube dataset, the distributions of sample means drawn with replacement are right-skewed for all sample sizes, while those drawn without replacement are right-skewed for smaller sample sizes but approximately normal for sample size 100.

Introduction

Central Limit Theorem

Let \(X_1, X_2, ..., X_n\) be IID random variables with mean \(\mu\) and variance \(\sigma^2\). Let \(\bar{X_n} = \frac{1}{n} \sum_{i=1}^{n} X_i\). Then

\[Z_n = \frac{\bar{X_n} - \mu}{\sqrt{{\sigma^2}/{n}}} = \frac{\sqrt{n}(\bar{X_n} - \mu)}{\sigma} \xrightarrow{d} N(0,1)\]

In other words,

\[\lim_{n \to \infty} P(Z_n \leq z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} dx\]

where \(\Phi(z)\) is the CDF of the standard normal distribution.

Using the above theorem, probability statements about the sample mean can be approximated using a normal distribution. It’s the probability statements that are being approximated, not the random variable itself.
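As a quick illustration (a minimal sketch using only NumPy and the standard library; the threshold 6 and the Uniform(-5, 15) population are arbitrary choices), the normal approximation of a probability statement about the sample mean can be compared against a brute-force simulation:

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# Population: Uniform(-5, 15), so mu = (a + b) / 2 and sigma = (b - a) / sqrt(12)
a, b, n = -5, 15, 40
mu = (a + b) / 2
sigma = (b - a) / sqrt(12)

# CLT approximation of P(sample mean <= 6)
z = (6 - mu) / (sigma / sqrt(n))
p_clt = 0.5 * (1 + erf(z / sqrt(2)))  # Phi(z) without scipy

# Monte Carlo estimate of the same probability
sample_means = rng.uniform(a, b, size=(100_000, n)).mean(axis=1)
p_sim = np.mean(sample_means <= 6)

print(f"CLT approximation: {p_clt:.4f}, simulation: {p_sim:.4f}")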

Samples and Distributions

To verify the central limit theorem, this exercise uses two types of data sets. The first is simulated: 1000 random samples of sizes 10, 20, 40, 50, and 100 are generated from each of three distributions (Uniform, Normal, and Laplace), that is, 1000 × 3 × 5 = 15,000 sample sets in total. These samples are generated using NumPy library functions. The second is a real data set, the Global YouTube Statistics data from Kaggle, from which 500 samples of sizes 30, 50, and 100 are drawn, both with and without replacement, using the numpy.random.choice() function. Sample means are then calculated and plotted to examine the underlying distribution of the sample means.

Distributions

The distributions considered here are of continuous random variables. The probability density functions of these distributions are given below:

1. Uniform Distribution

If a random variable X has the probability density function below, it is said to follow the Uniform distribution.

\[f(x) = \begin{cases} \frac{1}{b-a} & a \leq x \leq b \\ 0 & otherwise \end{cases}\]

\(f(x)\) is the probability density function of the uniform distribution.

\(a\) and \(b\) are the lower and upper limits of the distribution.

The PDF of the uniform distribution is defined over the closed interval \([a, b]\): it is constant within this interval and zero outside of it.
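As a quick sanity check (a minimal NumPy sketch, using the same a = -5 and b = 15 as in the simulations below), the closed-form mean \((a+b)/2\) and variance \((b-a)^2/12\) of the uniform distribution can be verified empirically:

import numpy as np

rng = np.random.default_rng(42)
a, b = -5, 15

draws = rng.uniform(a, b, size=1_000_000)

# Closed-form moments of Uniform(a, b)
print("theoretical mean:", (a + b) / 2, " empirical:", round(draws.mean(), 3))
print("theoretical var :", (b - a) ** 2 / 12, " empirical:", round(draws.var(), 3))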

2. Normal Distribution

If a random variable has the probability density function below, its distribution is called the Normal or Gaussian distribution.

\[f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]

\(f(x)\) is the probability density function of the normal distribution.

\(\mu\) is the mean of the distribution; it locates the peak of the bell curve.

\(\sigma\) is the standard deviation of the distribution; it controls the width/spread of the bell curve.

This formula describes how the probability density varies for different values of x in a normal distribution. The normal distribution is symmetric and bell-shaped, and it is often used to model various natural phenomena and statistical processes due to its mathematical properties and prevalence in real-world data.
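As a small sketch (NumPy only; the grid and its width are arbitrary), the formula above can be evaluated directly and checked to integrate to approximately one:

import numpy as np

mu, sigma = 5, 2
x = np.linspace(mu - 6 * sigma, mu + 6 * sigma, 10_001)
dx = x[1] - x[0]

# Evaluate the normal PDF directly from the closed form above
pdf = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# A Riemann sum over a wide enough grid should be close to 1
print("approximate integral of the PDF:", round(float(np.sum(pdf) * dx), 6))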

3. Laplace Distribution

If a random variable has the probability density function below, its distribution is called the Laplace distribution.

\[f(x) = \frac{1}{2b} e^{-\frac{|x-\mu|}{b}}\]

\(f(x)\) is the probability density function of the Laplace distribution.

\(\mu\) is the mean (location parameter) of the distribution; it locates the peak of the density.

\(b\) is the scale parameter of the distribution; it controls the width/spread of the density.

The absolute value \(|x - \mu|\) in the exponent makes the Laplace distribution symmetric about its mean \(\mu\). It is characterized by a sharp peak at \(\mu\) and heavy tails. The Laplace distribution is often used in statistics and data analysis, particularly when data is sparse or exhibits sudden changes or outliers.
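As a quick sanity check (a minimal NumPy sketch using the same \(\mu = 5\) and \(b = 2\) as in the simulations below; for Laplace(\(\mu\), b) the variance is \(2b^2\)):

import numpy as np

rng = np.random.default_rng(7)
mu, b = 5, 2

draws = rng.laplace(mu, b, size=1_000_000)

# For Laplace(mu, b): mean = mu, variance = 2 * b**2
print("theoretical mean:", mu, " empirical:", round(draws.mean(), 3))
print("theoretical var :", 2 * b ** 2, " empirical:", round(draws.var(), 3))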

Samples

1. Parameter used to generate samples for Uniform distribution

numpy.random.uniform(low = -5, high = 15, size = n), with n taking each of the sample sizes 10, 20, 40, 50, and 100

Low: Lower boundary of the output interval. All values generated will be greater than or equal to low.

High: Upper boundary of the output interval. All values generated will be less than high.

2. Parameter used to generate samples for Normal distribution

numpy.random.normal(loc = 5, scale = 2, size = n), with n taking each of the sample sizes 10, 20, 40, 50, and 100

Loc: Mean of the distribution.

Scale: Standard deviation of the distribution.

3. Parameter used to generate samples for Laplace distribution

numpy.random.laplace(loc = 5, scale = 2, size = n), with n taking each of the sample sizes 10, 20, 40, 50, and 100

Loc: The position, \(\mu\), of the distribution peak.

Scale: The exponential decay (the scale parameter \(b\) in the PDF above).
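In practice one call is made per sample size, with the size argument set to the number of draws in that sample; a minimal sketch of the pattern mirrored by the Results code below:

import numpy as np

sizes = (10, 20, 40, 50, 100)

# One array of draws per sample size, for each of the three distributions
uniform_samples = [np.random.uniform(-5, 15, n) for n in sizes]
normal_samples = [np.random.normal(5, 2, n) for n in sizes]
laplace_samples = [np.random.laplace(5, 2, n) for n in sizes]

print([s.size for s in uniform_samples])  # [10, 20, 40, 50, 100]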

Results

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"
import matplotlib.pyplot as plt

# Sample sizes used for the simulated data
sizes = (10, 20, 40, 50, 100)
samp_sizes = len(sizes)

1. Uniform Distribution

Test

# Collect the sample means for each sample size
means = [[] for _ in range(samp_sizes)]
min_mean = np.inf
max_mean = -np.inf

for s in range(1000):
    np.random.seed(s)  # one seed per replication for reproducibility
    for i in range(samp_sizes):
        uniform = np.random.uniform(-5, 15, sizes[i])  # one sample of size sizes[i]
        mean = np.mean(uniform)
        means[i].append(mean)

        # track the overall range of the sample means
        if mean < min_mean:
            min_mean = mean
        if mean > max_mean:
            max_mean = mean

df = pd.DataFrame({f"Sample_Size{sizes[i]}" : means[i] for i in range(samp_sizes)})
df
Sample_Size10 Sample_Size20 Sample_Size40 Sample_Size50 Sample_Size100
0 7.315326 6.275947 3.516244 5.064370 4.943963
1 1.292586 4.917143 4.468110 6.249390 4.599627
2 2.176020 2.764809 5.425378 5.080627 4.479303
3 4.350047 3.690273 4.471169 5.215467 4.636943
4 6.571719 5.177381 5.392267 5.529318 5.465427
... ... ... ... ... ...
995 4.110400 5.407123 4.593660 4.931387 4.980018
996 3.776225 3.260758 5.139498 5.573408 4.083426
997 5.521992 5.848969 5.080605 3.698350 4.168001
998 8.681317 3.745099 4.122767 5.075479 4.939119
999 4.640361 4.032694 6.760931 6.696176 5.004901

1000 rows × 5 columns

fig = ff.create_distplot(means, [f"Sample_Size{sizes[i]}" for i in range(samp_sizes)], bin_size=0.1, colors=("red", "blue", "green", "orange", "purple"))
fig.add_shape(type="line", x0=-5, y0=0, x1=15, y1=1, line=dict(color="red", width=0))
fig.update_layout(title_text="Distributions of Sample Means (Uniform)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()

Comments

  1. Sample means are highly dispersed for smaller sample sizes: every value in \([-5, 15]\) is equally likely, so individual draws, and hence small-sample averages, vary widely.

  2. As the sample size increases, the variance of the sample means decreases (a numeric check against \(\sigma/\sqrt{n}\) follows below).
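As a sketch (reusing the means list computed above), the empirical standard deviation of the sample means can be compared with the CLT prediction \(\sigma/\sqrt{n}\), where \(\sigma^2 = (15 - (-5))^2 / 12\) for this uniform population:

sigma = (15 - (-5)) / np.sqrt(12)  # population std of Uniform(-5, 15)

for i, n in enumerate(sizes):
    print(f"n = {n:3d}: empirical std of sample means = {np.std(means[i]):.3f}, "
          f"sigma/sqrt(n) = {sigma / np.sqrt(n):.3f}")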

2. Normal Distribution

Test

means = [[] for _ in range(samp_sizes)]
min_mean = np.inf
max_mean = -np.inf

for s in range(1000):
    np.random.seed(s)
    for i in range(samp_sizes):
        normal = np.random.normal(5, 2, sizes[i])
        mean = np.mean(normal)
        means[i].append(mean)

        if mean < min_mean:
            min_mean = mean
        if mean > max_mean:
            max_mean = mean
df = pd.DataFrame({f"Sample_Size{sizes[i]}" : means[i] for i in range(samp_sizes)})
df
Sample_Size10 Sample_Size20 Sample_Size40 Sample_Size50 Sample_Size100
0 6.476046 5.590546 4.274789 5.685278 5.012593
1 4.805718 4.917324 5.343190 4.969505 5.285519
2 3.737300 4.638546 4.996812 4.999921 5.160925
3 4.719044 4.765772 4.744089 5.015770 5.036081
4 4.476193 5.368613 5.182945 5.108665 4.995094
... ... ... ... ... ...
995 3.865953 4.784016 5.127216 5.254136 5.046751
996 4.784242 4.400601 5.338084 5.556812 4.916327
997 5.180634 4.311766 5.079467 5.345439 4.925435
998 5.565577 4.487421 4.834397 5.146891 4.695533
999 5.389380 4.846252 5.443953 5.209518 4.898074

1000 rows × 5 columns

fig = ff.create_distplot(means, [f"Sample_Size{sizes[i]}" for i in range(samp_sizes)], bin_size=0.1, colors=("red", "blue", "green", "orange", "purple"))
fig.add_shape(type="line", x0=-5, y0=0, x1=15, y1=1, line=dict(color="red", width=0))
fig.update_layout(title_text="Distributions of Sample Means (Normal)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()

Comments

  1. The sample means are normally distributed for all sample sizes.
  2. The distribution becomes narrower and remains symmetric about the mean as the sample size increases.
  3. Consequently, confidence intervals for the population mean become shorter (a numeric check follows below).
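As a sketch (reusing the means computed in the cell above), the mean of the sample means should stay near the population mean \(\mu = 5\) while their spread should track \(\sigma/\sqrt{n}\) with \(\sigma = 2\):

for i, n in enumerate(sizes):
    print(f"n = {n:3d}: mean of sample means = {np.mean(means[i]):.3f}, "
          f"std = {np.std(means[i]):.3f} vs 2/sqrt(n) = {2 / np.sqrt(n):.3f}")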

3. Laplace Distribution

Test

means = [[] for _ in range(samp_sizes)]
min_mean = np.inf
max_mean = -np.inf

for s in range(1000):
    np.random.seed(s)
    for i in range(samp_sizes):
        laplace = np.random.laplace(5, 2, sizes[i])
        mean = np.mean(laplace)
        means[i].append(mean)

        if mean < min_mean:
            min_mean = mean
        if mean > max_mean:
            max_mean = mean
df = pd.DataFrame({f"Sample_Size{sizes[i]}" : means[i] for i in range(samp_sizes)})
df
Sample_Size10 Sample_Size20 Sample_Size40 Sample_Size50 Sample_Size100
0 5.984471 5.383774 4.462571 4.961485 4.958143
1 2.418243 4.872504 4.714305 5.486043 4.687591
2 3.904200 4.097192 5.393066 4.933448 4.642493
3 4.715620 4.379835 4.964626 5.047558 4.802165
4 5.758612 5.035049 5.177542 5.195536 5.238218
... ... ... ... ... ...
995 4.885447 5.879259 4.660367 4.725307 4.796725
996 4.823875 4.373668 4.914142 5.072047 4.726472
997 5.449383 6.437400 4.855217 4.347397 4.698067
998 6.312200 4.659888 4.620671 5.114754 4.913229
999 4.705271 4.550642 5.892077 5.763841 5.006041

1000 rows × 5 columns

fig = ff.create_distplot(means, [f"Sample_Size{sizes[i]}" for i in range(samp_sizes)], bin_size=0.1, colors=("red", "blue", "green", "orange", "purple"))
fig.add_shape(type="line", x0=-5, y0=0, x1=15, y1=1, line=dict(color="red", width=0))
fig.update_layout(title_text="Distributions of Sample Means (Laplace)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()

Comments

  1. The sample means are approximately normally distributed for all sample sizes considered.
  2. The variance of the sample means decreases as the sample size increases; standardizing the means as in the CLT statement gives roughly a standard normal, as checked below.
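As a sketch (reusing the means above; for a Laplace(\(\mu = 5\), \(b = 2\)) population the variance is \(2b^2 = 8\)), standardizing the sample means as in the CLT statement should give roughly zero mean and unit standard deviation:

mu, sigma = 5, np.sqrt(2 * 2 ** 2)  # Laplace(5, b=2): variance = 2 * b**2 = 8

for i, n in enumerate(sizes):
    z = np.sqrt(n) * (np.array(means[i]) - mu) / sigma
    print(f"n = {n:3d}: mean(Z_n) = {z.mean():+.3f}, std(Z_n) = {z.std():.3f}")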

Real Data Set

Global YouTube dataset: Source

df = pd.read_csv("./A2/YouTube_Statistics.csv")
df
rank Youtuber subscribers video views category Title uploads Country Abbreviation channel_type ... subscribers_for_last_30_days created_year created_month created_date Gross tertiary education enrollment (%) Population Unemployment rate Urban_population Latitude Longitude
0 1 T-Series 245000000 2.280000e+11 Music T-Series 20082 India IN Music ... 2000000.0 2006.0 Mar 13.0 28.1 1.366418e+09 5.36 471031528.0 20.593684 78.962880
1 2 YouTube Movies 170000000 0.000000e+00 Film & Animation youtubemovies 1 United States US Games ... NaN 2006.0 Mar 5.0 88.2 3.282395e+08 14.70 270663028.0 37.090240 -95.712891
2 3 MrBeast 166000000 2.836884e+10 Entertainment MrBeast 741 United States US Entertainment ... 8000000.0 2012.0 Feb 20.0 88.2 3.282395e+08 14.70 270663028.0 37.090240 -95.712891
3 4 Cocomelon - Nursery Rhymes 162000000 1.640000e+11 Education Cocomelon - Nursery Rhymes 966 United States US Education ... 1000000.0 2006.0 Sep 1.0 88.2 3.282395e+08 14.70 270663028.0 37.090240 -95.712891
4 5 SET India 159000000 1.480000e+11 Shows SET India 116536 India IN Entertainment ... 1000000.0 2006.0 Sep 20.0 28.1 1.366418e+09 5.36 471031528.0 20.593684 78.962880
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
990 991 Natan por Aí 12300000 9.029610e+09 Sports Natan por Aí 1200 Brazil BR Entertainment ... 700000.0 2017.0 Feb 12.0 51.3 2.125594e+08 12.08 183241641.0 -14.235004 -51.925280
991 992 Free Fire India Official 12300000 1.674410e+09 People & Blogs Free Fire India Official 1500 India IN Games ... 300000.0 2018.0 Sep 14.0 28.1 1.366418e+09 5.36 471031528.0 20.593684 78.962880
992 993 Panda 12300000 2.214684e+09 NaN HybridPanda 2452 United Kingdom GB Games ... 1000.0 2006.0 Sep 11.0 60.0 6.683440e+07 3.85 55908316.0 55.378051 -3.435973
993 994 RobTopGames 12300000 3.741235e+08 Gaming RobTopGames 39 Sweden SE Games ... 100000.0 2012.0 May 9.0 67.0 1.028545e+07 6.48 9021165.0 60.128161 18.643501
994 995 Make Joke Of 12300000 2.129774e+09 Comedy Make Joke Of 62 India IN Comedy ... 100000.0 2017.0 Aug 1.0 28.1 1.366418e+09 5.36 471031528.0 20.593684 78.962880

995 rows × 28 columns

Population Distribution

subscribers = df["subscribers"]

mean = np.mean(subscribers)

fig = ff.create_distplot([subscribers], ["Subscribers"], bin_size=1000000, colors=("red",))
fig.update_layout(title_text=f"Distribution of Subscribers (Mean: {int(mean)})", xaxis_title="Subscribers", yaxis_title="Density")
fig.show()
sample_size = (30,50,100)
# sample with replacement
mean_with_replacement = [[] for _ in range(len(sample_size))]
# sample without replacement
mean_without_replacement = [[] for _ in range(len(sample_size))]

# calculating mean of sample with replacement, 500 samples
for s in range(500):
    for i in range(len(sample_size)):
        # random sampling with replacement
        sample = np.random.choice(subscribers, sample_size[i], replace=True)
        mean_with_replacement[i].append(np.mean(sample))
        # random sampling without replacement
        sample = np.random.choice(subscribers, sample_size[i], replace=False)
        mean_without_replacement[i].append(np.mean(sample))
# with replacement
df_means_wr = pd.DataFrame({f"Sample_Size{sample_size[i]}" : mean_with_replacement[i] for i in range(len(sample_size))})
df_means_wr
Sample_Size30 Sample_Size50 Sample_Size100
0 2.181000e+07 23064000.0 23113000.0
1 2.439667e+07 21542000.0 21662000.0
2 2.002667e+07 30156000.0 23871000.0
3 2.223333e+07 19546000.0 25976000.0
4 2.373333e+07 26302000.0 24369000.0
... ... ... ...
495 1.946667e+07 25854000.0 20054000.0
496 3.431000e+07 22392000.0 22566000.0
497 2.102000e+07 26390000.0 23828000.0
498 2.321333e+07 22754000.0 21636000.0
499 2.286333e+07 22706000.0 23879000.0

500 rows × 3 columns

fig = ff.create_distplot(mean_with_replacement, [f"Sample_Size{sample_size[i]}" for i in range(len(sample_size))], bin_size=100000, colors=("red", "blue", "green"))
fig.update_layout(title_text="Distributions of Sample Means (With Replacement)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()
# without replacement
df_means_wo = pd.DataFrame({f"Sample_Size{sample_size[i]}" : mean_without_replacement[i] for i in range(len(sample_size))})
df_means_wo
Sample_Size30 Sample_Size50 Sample_Size100
0 1.953667e+07 22990000.0 25088000.0
1 2.045000e+07 24594000.0 23897000.0
2 2.202000e+07 22940000.0 22299000.0
3 2.734333e+07 20158000.0 24075000.0
4 2.540667e+07 22112000.0 22228000.0
... ... ... ...
495 2.444000e+07 20702000.0 24035000.0
496 1.970000e+07 22114000.0 23646000.0
497 2.421000e+07 29664000.0 22453000.0
498 2.367667e+07 21016000.0 23258000.0
499 2.631000e+07 21296000.0 24449000.0

500 rows × 3 columns

fig = ff.create_distplot(mean_without_replacement, [f"Sample_Size{sample_size[i]}" for i in range(len(sample_size))], bin_size=100000, colors=("red", "blue", "green"))
fig.update_layout(title_text="Distributions of Sample Means (Without Replacement)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()
means = []

for i in range(len(sample_size)):
    means.append((np.mean(mean_with_replacement[i]), np.mean(mean_without_replacement[i])))

means = np.array(means)
fig = go.Figure()
fig.update_layout(xaxis = dict(tickmode = 'array',
        tickvals = [i for i in range(len(sample_size))],
        ticktext = [str(i) for i in sample_size]
    ),
    title='Means for different sample sizes',
    xaxis_title='Sample Size',
    yaxis_title='Mean of Sample Means')
fig.add_trace(go.Bar(x=[i+0.05 for i in range(len(sample_size))], y=means[:,1], name='Mean without Replacement', width=0.4, opacity=0.75))
# fig.add_trace(go.Scatter(x=[i+0.05 for i in range(len(sample_size))], y=means[:,1], name='Standard Deviation estimate', line=dict(color='blue', width=2)))
fig.add_trace(go.Bar(x=[i-0.1 for i in range(len(sample_size))], y=means[:,0], name='Mean with Replacement', width=0.4, opacity=0.75))
# fig.add_trace(go.Scatter(x=[i-0.1 for i in range(len(sample_size))], y=means[:,0], name='Mean estimate', line=dict(color='yellow', width=2)))

fig.add_trace(go.Scatter(x=[i-0.5 for i in range(len(sample_size)+1)], y=[mean]*(len(sample_size)+1), mode='lines', name='True Mean', line=dict(color='purple', width=1.5)))

fig.show()

Conclusions

Simulated Data

  1. The distribution of the sample means is approximately Normal. By visual inspection of the kernel density estimate (KDE) fitted over the histograms, the sample means are normally distributed for large sample sizes, irrespective of the original population distribution.

  2. For each distribution, the variance of the sample means is inversely proportional to the sample size (\(\mathrm{Var}(\bar{X_n}) = \sigma^2/n\)): the larger the sample size, the smaller the variance of the sample-mean distribution. Hence the confidence in estimating the population mean increases with the sample size, which coincides with the central limit theorem.

  3. For each distribution, the mean of the sample means is approximately equal to the mean of the original population.

  4. Comparing the three distributions, the sample means from the uniform distribution are more spread out than those from the other two, while the sample means drawn from the normal distribution are the most compact of the three. This is because the variance of the uniform distribution used here is the highest of the three, as quantified below.
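Concretely, the population variances implied by the parameters used above explain this ordering:

\[\sigma^2_{\text{Uniform}(-5,\,15)} = \frac{(15 - (-5))^2}{12} \approx 33.3, \qquad \sigma^2_{\text{Normal}(5,\,2)} = 2^2 = 4, \qquad \sigma^2_{\text{Laplace}(5,\,b=2)} = 2b^2 = 8\]

so the standard deviations of the corresponding sample means are roughly \(5.77/\sqrt{n}\), \(2/\sqrt{n}\), and \(2.83/\sqrt{n}\), respectively.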

Real Dataset

  1. The original population distribution is skewed to the right. As expected from the central limit theorem, the distribution of sample means moves toward normality as the sample size grows: the means drawn without replacement are approximately normal at sample size 100, while noticeable right skew remains at smaller sample sizes and for sampling with replacement.

  2. Difference between sampling with replacement and without replacement: when sampling is done with replacement, the sample means are more spread out than when sampling without replacement. This is because the same observation can be selected more than once when sampling with replacement, so the sample mean can vary more; without replacement, each observation appears at most once per sample, so the mean is more stable. A short numerical sketch of this finite-population effect follows below.
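A minimal numerical sketch of this effect (reusing the subscribers, sample_size, mean_with_replacement, and mean_without_replacement variables from the sampling code above, and applying the standard finite population correction \(\sqrt{(N - n)/(N - 1)}\)):

N = len(subscribers)
sigma_pop = np.std(subscribers)  # population standard deviation of the subscriber counts

for i, n in enumerate(sample_size):
    se_with = sigma_pop / np.sqrt(n)                   # standard error, sampling with replacement
    se_without = se_with * np.sqrt((N - n) / (N - 1))  # finite population correction
    print(f"n = {n:3d}: theoretical SE with / without replacement = {se_with:,.0f} / {se_without:,.0f}, "
          f"empirical = {np.std(mean_with_replacement[i]):,.0f} / {np.std(mean_without_replacement[i]):,.0f}")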