引导问题

14.1. 引导问题#

根据分布的可加性,我们知道卡方分布是具备可加性的。具体来说,若 \(X_i\) 是独立同分布的卡方分布随机变量,即 \(X_i \sim \chi^2(1)\) 。于是,

\[ \begin{eqnarray*} S_{k} = \sum_{i=1}^{k}X_i \sim \chi^2(k) \end{eqnarray*} \]

我们考虑 1 个、2 个、10 个以及 30 个随机变量之和的密度函数见图 Fig. 14.1

_images/Lect13_sum_of_chisquares.png

Fig. 14.1 多个卡方分布之和的密度函数#

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats


def prompt_distribution() -> str:
    options = {
        "1": "chi2",
        "2": "uniform",
        "3": "exponential",
        "4": "poisson",
    }
    prompt = (
        "请选择基准分布(输入数字):\n"
        "1. Chi-square distribution (df=2)\n"
        "2. Uniform distribution on [0, 1]\n"
        "3. Exponential distribution (rate=1)\n"
        "4. Poisson distribution (λ=2)\n"
        "你的选择: "
    )
    while True:
        choice = input(prompt).strip()
        if choice in options:
            return options[choice]
        print("请输入 1-4 的数字。")


def read_parameters():
    print("CLT Demo: Comparing sums with different sample sizes")
    dist_key = prompt_distribution()
    return dist_key


def build_pdf(dist_key: str, n: int, x: np.ndarray):
    if dist_key == "chi2":
        return stats.chi2.pdf(x, df=n * 2)
    elif dist_key == "uniform":
        mean = 0.5 * n
        std = np.sqrt(n / 12)
        return stats.norm.pdf(x, loc=mean, scale=max(std, 1e-8))
    elif dist_key == "exponential":
        return stats.gamma.pdf(x, a=n, scale=1)
    else:  # poisson
        mean = 2 * n
        std = np.sqrt(2 * n)
        return stats.norm.pdf(x, loc=mean, scale=max(std, 1e-8))


def determine_range(dist_key: str, n_values):
    if dist_key == "chi2":
        x_min = 0
        x_max = max(12, 4 * max(n_values))
    elif dist_key == "uniform":
        mean = 0.5 * max(n_values)
        std = np.sqrt(max(n_values) / 12)
        x_min = mean - 6 * std
        x_max = mean + 6 * std
    elif dist_key == "exponential":
        x_min = 0
        x_max = max(12, 5 * max(n_values))
    else:  # poisson
        mean = 2 * max(n_values)
        std = np.sqrt(2 * max(n_values))
        x_min = mean - 6 * std
        x_max = mean + 6 * std
    return x_min, x_max


def plot_comparison(dist_key: str):
    n_values = [1, 5, 10, 30]
    x_min, x_max = determine_range(dist_key, n_values)
    x = np.linspace(x_min, x_max, 1200)

    fig, axes = plt.subplots(2, 2, figsize=(10, 7))
    axes = axes.flatten()

    for ax, n in zip(axes, n_values):
        pdf = build_pdf(dist_key, n, x)
        ax.plot(x, pdf, "b-", linewidth=2, label=f"n = {n}")
        ax.set_xlabel("x")
        ax.set_ylabel("Density")
        ax.set_title(f"Sum of {n} variables")
        ax.set_xlim(x_min, x_max)
        ax.set_ylim(bottom=0)
        ax.grid(True, alpha=0.3)
        ax.legend()

    dist_label = {
        "chi2": "chi-square(2)",
        "uniform": "uniform[0,1]",
        "exponential": "exponential(1)",
        "poisson": "poisson(λ=2)",
    }[dist_key]

    fig.suptitle(f"CLT Illustration: summing {dist_label} variables", fontsize=14)
    fig.tight_layout(rect=[0, 0, 1, 0.96])
    plt.show()


def main():
    dist_key = read_parameters()
    plot_comparison(dist_key)


if __name__ == "__main__":
    main()

Remark

随着 \(k\) 的增加, \(S_k\) 的密度函数图像越来越接近正态分布曲线。

对于卡方分布 \(\chi^2(k)\) 而言,其期望和方差分别为

\[ E(S_k)=k, \quad \text{Var}(S_k) = 2k.\]

\(k\) 增加时, \(S_k\) 的密度函数的位置右移,且 \(p_{k}(s)\) 的方差也增大。这意味着这个分布的中心趋向 \(\infty\) ,其方差也趋向 \(\infty\) ,分布极不稳定。因此,直接讨论 \(S_k\) 的分布是有困难的。于是,在中心极限定理的研究中均对 \(S_k\) 进行标准化,即

\[ S_k^\ast = \frac{S_k - E(S_k)}{\sqrt{\text{Var}(S_k)}} \]

可以证明

\[\begin{split} \begin{eqnarray*} E(S_k^\ast) = 0\\ \text{Var}(S_k^\ast) = 1 \end{eqnarray*} \end{split}\]

一个很自然的问题是 \(S_k^{\ast}\) 的极限分布是否为标准正态分布 \(N(0,1)\) ?

Question

中心极限定理本身研究的是在什么条件下,随机变量之和的极限分布是正态分布。