import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"
import matplotlib.pyplot as plt
Central Limit Theorem
This is a combined work of Khush Shah and Kajal Panchal.
Abstract
According to the central limit theorem, the sample average has approximately a Normal distribution for large sample sizes. The following exercise experimentally verifies this theorem on a simulated as well as a real dataset. 1000 random samples of sizes 10, 20, 40, 50, and 100 each are drawn from 3 different distributions (Uniform, Normal, and Laplace) using the Python NumPy library. To verify it on a real dataset, the Global YouTube Statistics dataset was taken from Kaggle. 500 samples of sizes 30, 50, and 100 are drawn, each with and without replacement. Sample means are calculated and plotted to check the underlying distribution of sample means for each dataset and the respective sample sizes. In the case of simulated data, sample means are normally distributed for large n, irrespective of the original population distribution. In the case of the YouTube statistics dataset, sample mean distributions drawn with replacement are right skewed for all sample sizes, and those without replacement are right skewed for smaller sample sizes but approximately normal for sample size 100.
Introduction
Central Limit Theorem
Let \(X_1, X_2, ..., X_n\) be IID random variables with mean \(\mu\) and variance \(\sigma^2\). Let \(\bar{X_n} = \frac{1}{n} \sum_{i=1}^{n} X_i\). Then
\[Z_n = \frac{\bar{X}_n - \mu}{\sqrt{\sigma^2/n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \rightsquigarrow N(0,1)\]
In other words,
\[\lim_{n \to \infty} P(Z_n \leq z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} dx\]
where \(\Phi(z)\) is the CDF of the standard normal distribution and \(\rightsquigarrow\) denotes convergence in distribution.
Using the above theorem, probability statements about the sample mean can be approximated using a normal distribution. It’s the probability statements that are being approximated, not the random variable itself.
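To make this concrete (an added illustration, not part of the original exercise), the sketch below standardizes the sample mean of Uniform(-5, 15) draws — the same population used later in this report — and compares the empirical distribution of \(Z_n\) with \(\Phi(z)\) at a few points. The seed, the number of repetitions, and the use of numpy.random.default_rng are arbitrary choices for this sketch.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)      # illustrative seed
n, reps = 50, 10_000                # one sample size, many repetitions
a, b = -5, 15                       # Uniform(a, b), as used later in this exercise
mu, sigma = (a + b) / 2, (b - a) / np.sqrt(12)

# Standardize each sample mean: Z_n = sqrt(n) * (X_bar - mu) / sigma
xbars = rng.uniform(a, b, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbars - mu) / sigma

# Compare the empirical P(Z_n <= z) with Phi(z) at a few points
for q in (-1.96, -1.0, 0.0, 1.0, 1.96):
    phi = 0.5 * (1 + erf(q / sqrt(2)))
    print(f"z={q:+.2f}  empirical={np.mean(z <= q):.3f}  Phi(z)={phi:.3f}")
```

The empirical probabilities land close to \(\Phi(z)\) even though each individual draw is far from normal.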
Samples and Distributions
To verify the central limit theorem, this exercise uses two types of data sets. The first is simulated data: 1000 random samples are generated for each of 3 distributions (Uniform, Normal, and Laplace) at each of the sample sizes 10, 20, 40, 50, and 100, i.e. 1000 × 3 × 5 = 15,000 sample sets in total. These samples are generated using NumPy library functions. The second is a real dataset: Global YouTube Statistics from Kaggle. 500 samples of sizes 30, 50, and 100 are taken, each with and without replacement, using the numpy.random.choice() function. Sample means are then calculated and plotted to check their underlying distribution.
Distributions
The distributions considered here are of continuous random variables. Their probability density functions are given below:
1. Uniform Distribution
If the probability distribution of a random variable X follows the function below, then it is called a Uniform distribution.
\[f(x) = \begin{cases} \frac{1}{b-a} & a \leq x \leq b \\ 0 & otherwise \end{cases}\]
\(f(x)\) is the probability density function of the uniform distribution.
\(a\) and \(b\) are the lower and upper limits of the distribution.
The PDF of the Uniform distribution is defined over a closed interval \([a, b]\); it is constant within this interval and zero outside of it.
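For reference (standard formulas for the Uniform distribution, stated here with the parameters used later in this exercise, \(a = -5\) and \(b = 15\)):

\[E[X] = \frac{a+b}{2} = 5, \qquad \mathrm{Var}(X) = \frac{(b-a)^2}{12} = \frac{400}{12} \approx 33.3\]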
2. Normal Distribution
If the probability distribution of a random variable follows the probability density function below, then the distribution is called a Normal or Gaussian distribution.
\[f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
\(f(x)\) is the probability density function of the normal distribution.
\(\mu\) is the mean of the distribution; it represents the location of the peak of the bell curve.
\(\sigma\) is the standard deviation of the distribution; it represents the width/spread of the bell curve.
This formula describes how the probability density varies for different values of x in a normal distribution. The normal distribution is symmetric and bell-shaped, and it is often used to model various natural phenomena and statistical processes due to its mathematical properties and prevalence in real-world data.
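A note added for context: since the simulated samples below are drawn from a Normal population with \(\mu = 5\) and \(\sigma = 2\), the sample mean is exactly (not just approximately) normal for every sample size:

\[\bar{X}_n \sim N\!\left(5, \frac{4}{n}\right)\]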
3. Laplace Distribution
If the probability distribution of a random variable follows the probability density function below, then the distribution is called a Laplace distribution.
\[f(x) = \frac{1}{2b} e^{-\frac{|x-\mu|}{b}}\]
\(f(x)\) is the probability density function of the Laplace distribution.
\(\mu\) is the mean of the distribution; it represents the location of the peak.
\(b\) is the scale parameter of the distribution; it represents the width/spread of the distribution.
The absolute value \(|x-\mu|\) in the exponent ensures that the Laplace distribution is symmetric about its mean \(\mu\). It is characterized by a sharp peak at \(\mu\) and heavy tails. The Laplace distribution is often used in statistics and data analysis, particularly in situations where data is sparse or exhibits sudden changes or outliers.
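For reference (standard moments of the Laplace distribution, stated with the parameters used later in this exercise, \(\mu = 5\) and \(b = 2\)):

\[E[X] = \mu = 5, \qquad \mathrm{Var}(X) = 2b^2 = 8\]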
Samples
1. Parameters used to generate samples for the Uniform distribution
numpy.random.uniform(low = -5, high = 15, size = n), for each sample size n in (10, 20, 40, 50, 100)
low: Lower boundary of the output interval. All values generated will be greater than or equal to low.
high: Upper boundary of the output interval. All values generated will be less than high.
2. Parameters used to generate samples for the Normal distribution
numpy.random.normal(loc = 5, scale = 2, size = n), for each sample size n in (10, 20, 40, 50, 100)
loc: Mean of the distribution.
scale: Standard deviation of the distribution.
3. Parameters used to generate samples for the Laplace distribution
numpy.random.laplace(loc = 5, scale = 2, size = n), for each sample size n in (10, 20, 40, 50, 100)
loc: The position, \(\mu\), of the distribution peak.
scale: The exponential decay, i.e. the scale parameter \(b\).
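The Results section below draws each sample individually inside a loop with a fixed seed per iteration. As an aside (an added sketch, not the approach used below), the same collections of sample means could be produced in a vectorized way; the variable names and the single default_rng seed here are illustrative assumptions.

```python
import numpy as np

sizes = (10, 20, 40, 50, 100)
rng = np.random.default_rng(42)  # illustrative seed, not the per-sample seeds used below

# Draw a (1000, n) array per sample size and average along rows: 1000 sample means at once
uniform_means = {n: rng.uniform(-5, 15, size=(1000, n)).mean(axis=1) for n in sizes}
normal_means  = {n: rng.normal(5, 2, size=(1000, n)).mean(axis=1) for n in sizes}
laplace_means = {n: rng.laplace(5, 2, size=(1000, n)).mean(axis=1) for n in sizes}
```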
Results
sizes = (10, 20, 40, 50, 100)
samp_sizes = len(sizes)
1. Uniform Distribution
Test
# 1000 sample means per sample size; also track the overall min/max sample mean
means = [[] for _ in range(samp_sizes)]
min_mean = np.inf
max_mean = -np.inf

for s in range(1000):
    np.random.seed(s)
    for i in range(samp_sizes):
        uniform = np.random.uniform(-5, 15, sizes[i])
        mean = np.mean(uniform)
        means[i].append(mean)
        if mean < min_mean:
            min_mean = mean
        if mean > max_mean:
            max_mean = mean

# one column of sample means per sample size
df = pd.DataFrame({f"Sample_Size{sizes[i]}": means[i] for i in range(samp_sizes)})
df
| | Sample_Size10 | Sample_Size20 | Sample_Size40 | Sample_Size50 | Sample_Size100 |
---|---|---|---|---|---|
0 | 7.315326 | 6.275947 | 3.516244 | 5.064370 | 4.943963 |
1 | 1.292586 | 4.917143 | 4.468110 | 6.249390 | 4.599627 |
2 | 2.176020 | 2.764809 | 5.425378 | 5.080627 | 4.479303 |
3 | 4.350047 | 3.690273 | 4.471169 | 5.215467 | 4.636943 |
4 | 6.571719 | 5.177381 | 5.392267 | 5.529318 | 5.465427 |
... | ... | ... | ... | ... | ... |
995 | 4.110400 | 5.407123 | 4.593660 | 4.931387 | 4.980018 |
996 | 3.776225 | 3.260758 | 5.139498 | 5.573408 | 4.083426 |
997 | 5.521992 | 5.848969 | 5.080605 | 3.698350 | 4.168001 |
998 | 8.681317 | 3.745099 | 4.122767 | 5.075479 | 4.939119 |
999 | 4.640361 | 4.032694 | 6.760931 | 6.696176 | 5.004901 |
1000 rows × 5 columns
fig = ff.create_distplot(means, [f"Sample_Size{sizes[i]}" for i in range(samp_sizes)], bin_size=0.1, colors=("red", "blue", "green", "orange", "purple"))
fig.add_shape(type="line", x0=-5, y0=0, x1=15, y1=1, line=dict(color="red", width=0))
fig.update_layout(title_text="Distributions of Sample Means (Uniform)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()
2. Normal Distribution
Test
means = [[] for _ in range(samp_sizes)]
min_mean = np.inf
max_mean = -np.inf

for s in range(1000):
    np.random.seed(s)
    for i in range(samp_sizes):
        normal = np.random.normal(5, 2, sizes[i])
        mean = np.mean(normal)
        means[i].append(mean)
        if mean < min_mean:
            min_mean = mean
        if mean > max_mean:
            max_mean = mean

df = pd.DataFrame({f"Sample_Size{sizes[i]}": means[i] for i in range(samp_sizes)})
df
| | Sample_Size10 | Sample_Size20 | Sample_Size40 | Sample_Size50 | Sample_Size100 |
---|---|---|---|---|---|
0 | 6.476046 | 5.590546 | 4.274789 | 5.685278 | 5.012593 |
1 | 4.805718 | 4.917324 | 5.343190 | 4.969505 | 5.285519 |
2 | 3.737300 | 4.638546 | 4.996812 | 4.999921 | 5.160925 |
3 | 4.719044 | 4.765772 | 4.744089 | 5.015770 | 5.036081 |
4 | 4.476193 | 5.368613 | 5.182945 | 5.108665 | 4.995094 |
... | ... | ... | ... | ... | ... |
995 | 3.865953 | 4.784016 | 5.127216 | 5.254136 | 5.046751 |
996 | 4.784242 | 4.400601 | 5.338084 | 5.556812 | 4.916327 |
997 | 5.180634 | 4.311766 | 5.079467 | 5.345439 | 4.925435 |
998 | 5.565577 | 4.487421 | 4.834397 | 5.146891 | 4.695533 |
999 | 5.389380 | 4.846252 | 5.443953 | 5.209518 | 4.898074 |
1000 rows × 5 columns
fig = ff.create_distplot(means, [f"Sample_Size{sizes[i]}" for i in range(samp_sizes)], bin_size=0.1, colors=("red", "blue", "green", "orange", "purple"))
fig.add_shape(type="line", x0=-5, y0=0, x1=15, y1=1, line=dict(color="red", width=0))
fig.update_layout(title_text="Distributions of Sample Means (Normal)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()
Comments
- The sample means are normally distributed for all sample sizes.
- The distribution becomes narrower and more symmetric about the mean as the sample size increases.
- Confidence intervals become shorter (a worked example follows below).
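To make the last point concrete (an added illustration using the simulation parameters \(\mu = 5\), \(\sigma = 2\) and the standard 95% normal quantile 1.96), the half-width of a 95% confidence interval for \(\mu\) based on \(\bar{X}_n\) is

\[1.96 \cdot \frac{\sigma}{\sqrt{n}} = \frac{1.96 \cdot 2}{\sqrt{n}} \approx 1.24 \text{ for } n = 10, \quad \approx 0.39 \text{ for } n = 100.\]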
3. Laplace Distribution
Test
means = [[] for _ in range(samp_sizes)]
min_mean = np.inf
max_mean = -np.inf

for s in range(1000):
    np.random.seed(s)
    for i in range(samp_sizes):
        laplace = np.random.laplace(5, 2, sizes[i])
        mean = np.mean(laplace)
        means[i].append(mean)
        if mean < min_mean:
            min_mean = mean
        if mean > max_mean:
            max_mean = mean

df = pd.DataFrame({f"Sample_Size{sizes[i]}": means[i] for i in range(samp_sizes)})
df
| | Sample_Size10 | Sample_Size20 | Sample_Size40 | Sample_Size50 | Sample_Size100 |
---|---|---|---|---|---|
0 | 5.984471 | 5.383774 | 4.462571 | 4.961485 | 4.958143 |
1 | 2.418243 | 4.872504 | 4.714305 | 5.486043 | 4.687591 |
2 | 3.904200 | 4.097192 | 5.393066 | 4.933448 | 4.642493 |
3 | 4.715620 | 4.379835 | 4.964626 | 5.047558 | 4.802165 |
4 | 5.758612 | 5.035049 | 5.177542 | 5.195536 | 5.238218 |
... | ... | ... | ... | ... | ... |
995 | 4.885447 | 5.879259 | 4.660367 | 4.725307 | 4.796725 |
996 | 4.823875 | 4.373668 | 4.914142 | 5.072047 | 4.726472 |
997 | 5.449383 | 6.437400 | 4.855217 | 4.347397 | 4.698067 |
998 | 6.312200 | 4.659888 | 4.620671 | 5.114754 | 4.913229 |
999 | 4.705271 | 4.550642 | 5.892077 | 5.763841 | 5.006041 |
1000 rows × 5 columns
fig = ff.create_distplot(means, [f"Sample_Size{sizes[i]}" for i in range(samp_sizes)], bin_size=0.1, colors=("red", "blue", "green", "orange", "purple"))
fig.add_shape(type="line", x0=-5, y0=0, x1=15, y1=1, line=dict(color="red", width=0))
fig.update_layout(title_text="Distributions of Sample Means (Laplace)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()
Comments
- The sample means are normally distributed for all sample sizes.
- The variance of the sample mean decreases as the sample size increases (a quick numerical check follows below).
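As a quick numerical check (added here, and assuming the `sizes` and Laplace `means` variables from the cell above are still in scope), the empirical variance of the sample means can be compared with the value \(2b^2/n = 8/n\) predicted for the mean of \(n\) Laplace draws with \(b = 2\):

```python
# Empirical variance of the Laplace sample means vs. the predicted 2*b^2/n = 8/n
for i, n in enumerate(sizes):
    print(f"n={n:4d}  empirical var={np.var(means[i]):.4f}  predicted 8/n={8 / n:.4f}")
```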
Real Data Set
Global YouTube dataset: Source
df = pd.read_csv("./A2/YouTube_Statistics.csv")
df
| | rank | Youtuber | subscribers | video views | category | Title | uploads | Country | Abbreviation | channel_type | ... | subscribers_for_last_30_days | created_year | created_month | created_date | Gross tertiary education enrollment (%) | Population | Unemployment rate | Urban_population | Latitude | Longitude |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | T-Series | 245000000 | 2.280000e+11 | Music | T-Series | 20082 | India | IN | Music | ... | 2000000.0 | 2006.0 | Mar | 13.0 | 28.1 | 1.366418e+09 | 5.36 | 471031528.0 | 20.593684 | 78.962880 |
1 | 2 | YouTube Movies | 170000000 | 0.000000e+00 | Film & Animation | youtubemovies | 1 | United States | US | Games | ... | NaN | 2006.0 | Mar | 5.0 | 88.2 | 3.282395e+08 | 14.70 | 270663028.0 | 37.090240 | -95.712891 |
2 | 3 | MrBeast | 166000000 | 2.836884e+10 | Entertainment | MrBeast | 741 | United States | US | Entertainment | ... | 8000000.0 | 2012.0 | Feb | 20.0 | 88.2 | 3.282395e+08 | 14.70 | 270663028.0 | 37.090240 | -95.712891 |
3 | 4 | Cocomelon - Nursery Rhymes | 162000000 | 1.640000e+11 | Education | Cocomelon - Nursery Rhymes | 966 | United States | US | Education | ... | 1000000.0 | 2006.0 | Sep | 1.0 | 88.2 | 3.282395e+08 | 14.70 | 270663028.0 | 37.090240 | -95.712891 |
4 | 5 | SET India | 159000000 | 1.480000e+11 | Shows | SET India | 116536 | India | IN | Entertainment | ... | 1000000.0 | 2006.0 | Sep | 20.0 | 28.1 | 1.366418e+09 | 5.36 | 471031528.0 | 20.593684 | 78.962880 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
990 | 991 | Natan por Aï¿ | 12300000 | 9.029610e+09 | Sports | Natan por Aï¿ | 1200 | Brazil | BR | Entertainment | ... | 700000.0 | 2017.0 | Feb | 12.0 | 51.3 | 2.125594e+08 | 12.08 | 183241641.0 | -14.235004 | -51.925280 |
991 | 992 | Free Fire India Official | 12300000 | 1.674410e+09 | People & Blogs | Free Fire India Official | 1500 | India | IN | Games | ... | 300000.0 | 2018.0 | Sep | 14.0 | 28.1 | 1.366418e+09 | 5.36 | 471031528.0 | 20.593684 | 78.962880 |
992 | 993 | Panda | 12300000 | 2.214684e+09 | NaN | HybridPanda | 2452 | United Kingdom | GB | Games | ... | 1000.0 | 2006.0 | Sep | 11.0 | 60.0 | 6.683440e+07 | 3.85 | 55908316.0 | 55.378051 | -3.435973 |
993 | 994 | RobTopGames | 12300000 | 3.741235e+08 | Gaming | RobTopGames | 39 | Sweden | SE | Games | ... | 100000.0 | 2012.0 | May | 9.0 | 67.0 | 1.028545e+07 | 6.48 | 9021165.0 | 60.128161 | 18.643501 |
994 | 995 | Make Joke Of | 12300000 | 2.129774e+09 | Comedy | Make Joke Of | 62 | India | IN | Comedy | ... | 100000.0 | 2017.0 | Aug | 1.0 | 28.1 | 1.366418e+09 | 5.36 | 471031528.0 | 20.593684 | 78.962880 |
995 rows × 28 columns
Population Distribution
subscribers = df["subscribers"]

mean = np.mean(subscribers)

fig = ff.create_distplot([subscribers], ["Subscribers"], bin_size=1000000, colors=("red",))
fig.update_layout(title_text=f"Distribution of Subscribers (Mean: {int(mean)})", xaxis_title="Subscribers", yaxis_title="Density")
fig.show()
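As an added check (assuming the `subscribers` series defined above), the right skew of the population can also be summarized numerically; `describe()` and `skew()` are standard pandas Series methods.

```python
# Numerical summary of the subscriber counts; a positive skewness confirms the right skew
print(subscribers.describe())
print("skewness:", subscribers.skew())
```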
sample_size = (30, 50, 100)
# sample with replacement
mean_with_replacement = [[] for _ in range(len(sample_size))]
# sample without replacement
mean_without_replacement = [[] for _ in range(len(sample_size))]

# calculating means of samples with and without replacement, 500 samples each
for s in range(500):
    for i in range(len(sample_size)):
        # random sampling with replacement
        sample = np.random.choice(subscribers, sample_size[i], replace=True)
        mean_with_replacement[i].append(np.mean(sample))
        # random sampling without replacement
        sample = np.random.choice(subscribers, sample_size[i], replace=False)
        mean_without_replacement[i].append(np.mean(sample))
# with replacement
df_means_wr = pd.DataFrame({f"Sample_Size{sample_size[i]}": mean_with_replacement[i] for i in range(len(sample_size))})
df_means_wr
| | Sample_Size30 | Sample_Size50 | Sample_Size100 |
---|---|---|---|
0 | 2.181000e+07 | 23064000.0 | 23113000.0 |
1 | 2.439667e+07 | 21542000.0 | 21662000.0 |
2 | 2.002667e+07 | 30156000.0 | 23871000.0 |
3 | 2.223333e+07 | 19546000.0 | 25976000.0 |
4 | 2.373333e+07 | 26302000.0 | 24369000.0 |
... | ... | ... | ... |
495 | 1.946667e+07 | 25854000.0 | 20054000.0 |
496 | 3.431000e+07 | 22392000.0 | 22566000.0 |
497 | 2.102000e+07 | 26390000.0 | 23828000.0 |
498 | 2.321333e+07 | 22754000.0 | 21636000.0 |
499 | 2.286333e+07 | 22706000.0 | 23879000.0 |
500 rows × 3 columns
fig = ff.create_distplot(mean_with_replacement, [f"Sample_Size{sample_size[i]}" for i in range(len(sample_size))], bin_size=100000, colors=("red", "blue", "green"))
fig.update_layout(title_text="Distributions of Sample Means (With Replacement)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()
# without replacement
df_means_wo = pd.DataFrame({f"Sample_Size{sample_size[i]}": mean_without_replacement[i] for i in range(len(sample_size))})
df_means_wo
| | Sample_Size30 | Sample_Size50 | Sample_Size100 |
---|---|---|---|
0 | 1.953667e+07 | 22990000.0 | 25088000.0 |
1 | 2.045000e+07 | 24594000.0 | 23897000.0 |
2 | 2.202000e+07 | 22940000.0 | 22299000.0 |
3 | 2.734333e+07 | 20158000.0 | 24075000.0 |
4 | 2.540667e+07 | 22112000.0 | 22228000.0 |
... | ... | ... | ... |
495 | 2.444000e+07 | 20702000.0 | 24035000.0 |
496 | 1.970000e+07 | 22114000.0 | 23646000.0 |
497 | 2.421000e+07 | 29664000.0 | 22453000.0 |
498 | 2.367667e+07 | 21016000.0 | 23258000.0 |
499 | 2.631000e+07 | 21296000.0 | 24449000.0 |
500 rows × 3 columns
fig = ff.create_distplot(mean_without_replacement, [f"Sample_Size{sample_size[i]}" for i in range(len(sample_size))], bin_size=100000, colors=("red", "blue", "green"))
fig.update_layout(title_text="Distributions of Sample Means (Without Replacement)", xaxis_title="Sample Mean", yaxis_title="Density")
fig.show()
means = []

for i in range(len(sample_size)):
    means.append((np.mean(mean_with_replacement[i]), np.mean(mean_without_replacement[i])))
means = np.array(means)
fig = go.Figure()
fig.update_layout(xaxis=dict(tickmode='array',
                             tickvals=[i for i in range(len(sample_size))],
                             ticktext=[str(i) for i in sample_size]
                             ),
                  title='Means for different sample sizes',
                  xaxis_title='Sample Size',
                  yaxis_title='Mean of Sample Means')
fig.add_trace(go.Bar(x=[i+0.05 for i in range(len(sample_size))], y=means[:,1], name='Mean without Replacement', width=0.4, opacity=0.75))
# fig.add_trace(go.Scatter(x=[i+0.05 for i in range(len(sample_size))], y=means[:,1], name='Standard Deviation estimate', line=dict(color='blue', width=2)))
fig.add_trace(go.Bar(x=[i-0.1 for i in range(len(sample_size))], y=means[:,0], name='Mean with Replacement', width=0.4, opacity=0.75))
# fig.add_trace(go.Scatter(x=[i-0.1 for i in range(len(sample_size))], y=means[:,0], name='Mean estimate', line=dict(color='yellow', width=2)))
fig.add_trace(go.Scatter(x=[i-0.5 for i in range(len(sample_size)+1)], y=[mean]*(len(sample_size)+1), mode='lines', name='True Mean', line=dict(color='purple', width=1.5)))

fig.show()
Conclusions:
Simulated Data
The distribution of sample means is approximately Normal. By visual inspection of the kernel density estimate (KDE) fitted to the histogram, we can see that the sample means are normally distributed for large sample sizes, irrespective of the original population distribution.
For each distribution, the variance of the sample means is inversely proportional to the sample size: the larger the sample size, the smaller the variance of the sample-mean distribution. The confidence in estimating the population mean therefore increases with the sample size, which coincides with the central limit theorem.
For each distribution, the mean of the sample means is approximately equal to the mean of the original population.
Comparing the three distributions, the sample means from the Uniform distribution are more spread out than those from the other two, while the sample means drawn from the Normal distribution are the most compact of the three. This is because the Uniform distribution used here has a higher variance than the other two, as summarized below.
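For reference, the population variances implied by the parameters listed earlier (computed here from the standard formulas, not stated in the original text) are

\[\sigma^2_{\mathrm{Uniform}} = \frac{(15 - (-5))^2}{12} \approx 33.3, \qquad \sigma^2_{\mathrm{Normal}} = 2^2 = 4, \qquad \sigma^2_{\mathrm{Laplace}} = 2 \cdot 2^2 = 8.\]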
Real Dataset
The original population distribution is skewed to the right. However, the sample means are normally distributed for large sample sizes, irrespective of the actual population distribution, as expected by the central limit theorem.
Difference between sampling with replacement and without replacement: when sampling is done with replacement, the sample means are more spread out than when sampling without replacement. This is because the same observation can be selected more than once when sampling with replacement, so the mean can vary more. When we sample without replacement, the same observation cannot be selected more than once, so the mean is more stable.
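This difference can be quantified with the standard finite population correction (a textbook result, added here for context): when \(n\) items are sampled without replacement from a population of size \(N\) (here \(N = 995\)) with variance \(\sigma^2\),

\[\mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1},\]

whereas sampling with replacement gives \(\sigma^2/n\); the factor \(\frac{N-n}{N-1} < 1\) makes the without-replacement sample means slightly less dispersed.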
Comments
Sample means are highly dispersed for smaller sample sizes, since every observation carries equal weight in the average and a few extreme values can dominate a small sample.
As the sample size increases, the variance of the sample means decreases.