13.5. Summary#

13.5.1. Terminology Review#

Use the flashcards below to help you review the terminology introduced in this chapter.

13.5.2. Key Take-Aways#

Jointly Distributed Random Variables

  • Jointly distributed random variables are functions on a common sample space, \(S\). Each maps to the real line, so a collection of \(n\) jointly distributed random variables maps from \(S\) to a point in \(\mathbb{R}^n\).

  • For jointly distributed random variables, we can define joint PMFs, CDFs, and pdfs.

  • The marginal pdf of a random variable is the individual pdf of that random variable. It can be found by integrating out the other variable(s) in the joint pdf. For example, if \(X\) and \(Y\) are random variables with joint pdf \(f_{XY}(x,y)\), then the marginal pdf of \(X\) is \begin{equation*} f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x,y)~dy. \end{equation*}

  • The mean vector for a vector of random variables \(\mathbf{X} = \left[ X_0, X_1, \ldots, X_{n-1}\right]^T\) is the vector of means, \( \boldsymbol{\mu} = E\left[ \mathbf{X} \right] = \left[ \mu_0, \mu_1, \ldots, \mu_{n-1}\right]^T\), where \(\mu_i = E[X_i]\).

  • The covariance matrix is \(\mathbf{K} = E \left[ \left(\mathbf{X} -\boldsymbol{\mu} \right)\left(\mathbf{X} -\boldsymbol{\mu} \right)^T \right]\).

  • Jointly Normal random variables \(\mathbf{X} = \left[ X_0, X_1, \ldots, X_{n-1}\right]^T\) have joint density \begin{align*} f_\mathbf{X}(\mathbf{x})=\frac{1}{\sqrt{(2 \pi)^n \operatorname{det} \mathbf{K}}} \exp \left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \mathbf{K}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right), \end{align*} where \(\boldsymbol{\mu}\) is the mean vector and \(\mathbf{K}\) is the covariance matrix (see the sketch after this list).

  • For a pair of jointly distributed random variables, the joint pdf can be visualized using a surface plot or contours of equal probability density, which illustrate the sets of points in \(\mathbb{R}^2\) where \(f_{XY}(x,y) = c\) for different values of \(c\).

  • For pairs of jointly Normal random variables, the contours of equal probability density are ellipses centered on the random vector’s mean.
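
The short sketch below is a minimal illustration of the jointly Normal density formula above: it evaluates the formula directly for an assumed example mean vector and covariance matrix (the numbers `mu`, `K`, and the test point `x` are arbitrary choices, not values from the chapter) and checks the result against SciPy's multivariate_normal.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Example (assumed) mean vector and covariance matrix for a pair (X0, X1)
mu = np.array([1.0, -2.0])
K = np.array([[2.0, 0.8],
              [0.8, 1.0]])

def jointly_normal_pdf(x, mu, K):
    """Evaluate the jointly Normal density formula at the point x."""
    n = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(K))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(K) @ diff)

x = np.array([0.5, -1.0])
print(jointly_normal_pdf(x, mu, K))       # direct use of the formula above
print(multivariate_normal(mu, K).pdf(x))  # SciPy returns the same value
```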

Standardization and Linear Transforms

  • Standardization changes numerical features to have mean 0 and variance 1.

  • Without standardization, differences in how the data are expressed, such as the use of different units, can cause some features to be given more weight than others in later processing steps.

  • Projecting pairwise data onto a basis that is rotated relative to the standard axes can reduce or eliminate correlation among the data.

  • Consider a general linear transform \(\mathbf{Y} = \mathbf{A} \mathbf{X} + \mathbf{b}\), where \(\mathbf{A}\) is a constant matrix, \(\mathbf{X}\) is a random vector, and \(\mathbf{b}\) is a constant vector. Then the output mean vector and covariance matrix are given by \begin{align*} \boldsymbol{\mu}_Y &= \mathbf{A} \boldsymbol{\mu}_X + \mathbf{b}, \mbox{ and} \\ \mathbf{K}_Y &= \mathbf{A} \mathbf{K}_X \mathbf{A}^T, \end{align*} as illustrated in the sketch after this list.
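
Here is a minimal sketch of both ideas, using arbitrary example values for \(\boldsymbol{\mu}_X\), \(\mathbf{K}_X\), \(\mathbf{A}\), and \(\mathbf{b}\) (all assumed for illustration): it standardizes simulated data and then checks numerically that a linear transform changes the mean vector and covariance matrix as stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Example (assumed) mean vector and covariance matrix for X
mu_X = np.array([3.0, -1.0])
K_X = np.array([[4.0, 1.2],
                [1.2, 2.0]])

# Draw many samples of X; each row is one observation
X = rng.multivariate_normal(mu_X, K_X, size=100_000)

# Standardization: every feature ends up with mean 0 and variance 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0), X_std.var(axis=0))   # approximately [0, 0] and [1, 1]

# Linear transform Y = A X + b applied to each observation
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([5.0, -2.0])
Y = X @ A.T + b

# Sample statistics match mu_Y = A mu_X + b and K_Y = A K_X A^T
print(Y.mean(axis=0), A @ mu_X + b)
print(np.cov(Y, rowvar=False), A @ K_X @ A.T)
```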

Eigenvalues and Eigenvectors

  • A nonzero vector \(\mathbf{v}\) is an eigenvector of a matrix \(\mathbf{M}\) if \(\mathbf{M}\mathbf{v} = \lambda \mathbf{v}\) for some scalar \(\lambda\); that is, the effect of the linear transformation \(\mathbf{M}\) on \(\mathbf{v}\) is only a scaling. The value \(\lambda\) is the eigenvalue corresponding to \(\mathbf{v}\).

  • The eigenvalues of a matrix can be found by solving the characteristic equation, \(\det \left( \lambda \mathbf{I} - \mathbf{M} \right) = 0\).

  • The modal matrix of a matrix \(\mathbf{M}\) has the normalized eigenvectors of \(\mathbf{M}\) as its columns.

  • Eigendecomposition (or diagonalization) expresses a matrix \(\mathbf{M}\) that has a full set of linearly independent eigenvectors as \(\mathbf{M} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{-1}\), where \(\mathbf{U}\) is the modal matrix of \(\mathbf{M}\), and \(\boldsymbol{\Lambda}\) is a diagonal matrix of the corresponding eigenvalues of \(\mathbf{M}\).

  • The determinant of a matrix is equal to the product of the matrix’s eigenvalues (as the sketch after this list illustrates).
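
The following sketch verifies these facts numerically with NumPy for an arbitrary example symmetric matrix (the matrix `M` is assumed for illustration): the eigenvector definition, the eigendecomposition, and the determinant/eigenvalue relationship.

```python
import numpy as np

# Example (assumed) symmetric matrix, e.g., a covariance matrix
M = np.array([[2.0, 0.8],
              [0.8, 1.0]])

# Eigenvalues and normalized eigenvectors; the columns of U form the modal matrix
lam, U = np.linalg.eigh(M)
Lambda = np.diag(lam)

# The defining property M v = lambda v holds for each eigenvector
for i in range(len(lam)):
    print(np.allclose(M @ U[:, i], lam[i] * U[:, i]))

# Eigendecomposition: M = U Lambda U^{-1}
print(np.allclose(M, U @ Lambda @ np.linalg.inv(U)))

# The determinant equals the product of the eigenvalues
print(np.isclose(np.linalg.det(M), np.prod(lam)))
```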

Decorrelating Random Vectors and Multi-Dimensional Data

  • Decorrelation is often applied to a data set to reduce the correlation, and thus the linear dependence, across variables or features.

  • Decorrelation is often a first step before applying dimensionality reduction, in which the data is mapped to a lower-dimensional space.

  • Dimensionality reduction can be used to enable visualization of high-dimensional data, reduce the required computational complexity, or compress the data for more efficient storage and communication.

  • The discrete Karhunen-Loève Transform (KLT) decorrelates a vector of random variables. Given \(\mathbf{X}\) with non-singular covariance matrix \(\mathbf{K}_X\), the discrete KLT is \(\mathbf{Y} = \mathbf{U}^T \mathbf{X}\), where \(\mathbf{U}\) is the modal matrix of \(\mathbf{K}_X\). Then \(\mathbf{K}_Y = \boldsymbol{\Lambda}\).

  • For multi-dimensional numeric data, Principal Components Analysis (PCA) decorrelates the data using \(\mathbf{y} = \hat{\mathbf{U}}^T \mathbf{x}\), where \(\hat{\mathbf{U}}\) is the modal matrix of the sample covariance matrix of the data \(\mathbf{x}\) (see the sketch after this list).
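
A minimal sketch of this decorrelation step, using an assumed example covariance matrix and simulated data: after projecting onto the modal matrix of \(\mathbf{K}_X\), the covariance matrix of the transformed data is approximately diagonal, with the eigenvalues of \(\mathbf{K}_X\) on its diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)

# Example (assumed) covariance matrix with correlated components
K_X = np.array([[3.0, 1.5],
                [1.5, 2.0]])
X = rng.multivariate_normal([0.0, 0.0], K_X, size=100_000)

# Modal matrix of K_X: its columns are the normalized eigenvectors of K_X
lam, U = np.linalg.eigh(K_X)

# Discrete KLT / PCA rotation: y = U^T x, applied to every data point (row)
Y = X @ U

# The covariance matrix of the transformed data is approximately diagonal,
# with the eigenvalues of K_X on the diagonal
print(np.cov(Y, rowvar=False))
print(np.diag(lam))
```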

Principal Components Analysis

  • PCA is usually used for dimensionality reduction.

  • The PCA basis vectors create new features that capture the most significant (i.e., the highest variance) directions in the data.

  • A scree plot is a line plot that shows the eigenvalues of the covariance matrix of a data set, sorted in decreasing order and plotted as a function of their position in that order.

  • Scree plots are non-increasing. They typically decrease quickly at first and then transition to decreasing slowly after an “elbow” in the curve.

  • The elbow can be used to determine how many features should be preserved at the output of PCA.

  • Explained variance quantifies the proportion of the total variance in the data that is preserved for a given number of retained PCA features.

  • Train-test split is used to evaluate whether a model has been overfitted to the training data by evaluating it on a separate set of test data points. The training and testing sets are created by randomly partitioning the original data set.

  • Scikit-learn has a function train_test_split in the model_selection submodule for easily performing train-test split.

  • Scikit-learn has a PCA class in the decomposition submodule whose constructor takes the number of features (components) to preserve and returns a PCA object. Given an object called pca that was created by sklearn.decomposition.PCA(), we can get the output features from PCA using pca.fit_transform(), as the sketch below shows.
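
A minimal end-to-end sketch of these steps using the scikit-learn functions named above; the synthetic data generation is assumed purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Synthetic (assumed) data: 500 points in 5 dimensions with correlated features
Z = rng.standard_normal((500, 2))
X = Z @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((500, 5))

# Randomly partition the data into training and test sets
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Keep two principal components; fit the transform on the training data only
pca = PCA(n_components=2)
Y_train = pca.fit_transform(X_train)   # decorrelated, reduced training features
Y_test = pca.transform(X_test)         # apply the same mapping to the test data

# Explained variance ratio: proportion of total variance kept by each component
print(pca.explained_variance_ratio_)
print(Y_train.shape, Y_test.shape)
```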