This repository includes (future) informal posts displayed on my personal website. If there are errors in these posts, feel free to contact me 🤗
In this post, we dive into three essential convergence theorems in integration theory and explore how the Lebesgue integral handles limits so effectively.
High-dimensional statistics explores the complexities that arise when dealing with data sets where the number of features $d$ is large, often exceeding the number of samples $N$. Traditional statistical methods can struggle in these scenarios due to the curse of dimensionality, making it essential to develop specialized tools and techniques.
Expectation Maximization (EM) is a ubiquitous algorithm for performing maximum likelihood estimation. In this series, I present the EM algorithm for estimating the parameters of Gaussian mixture models (GMMs).
The Kalman filter, a cornerstone in estimation theory, is a powerful algorithm that excels at inferring the hidden state of a system based on noisy measurements.
Inverse transform sampling is a method for generating random numbers from any probability distribution by applying its inverse cumulative distribution function $F^{-1}$ to uniform random numbers.
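As a minimal sketch (not taken from the post itself), here is inverse transform sampling for the exponential distribution, whose inverse CDF has the closed form $F^{-1}(u) = -\ln(1-u)/\lambda$:

```python
import numpy as np

def sample_exponential(rate, size, rng=np.random.default_rng(0)):
    """Inverse transform sampling for Exp(rate): F^{-1}(u) = -ln(1 - u) / rate."""
    u = rng.uniform(size=size)      # uniform draws on (0, 1)
    return -np.log1p(-u) / rate     # push them through the inverse CDF

samples = sample_exponential(rate=2.0, size=10_000)
print(samples.mean())  # should be close to 1 / rate = 0.5
```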
Expectation maximization is extremely useful when we have to deal with latent variable models (for example, the unobserved component assignments in a mixture model).
Boole's inequality gives an upper bound on the probability that at least one of a countable collection of events occurs, expressed in terms of the probabilities of the individual events.
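Concretely, for events $A_1, A_2, \dots$,

$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) \le \sum_{i=1}^{\infty} P(A_i).$$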
Bessel's correction is the adjustment that makes the sample variance an unbiased estimator of the population variance.
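Concretely, dividing by $n-1$ rather than $n$ yields the unbiased estimator

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad \mathbb{E}[s^2] = \sigma^2.$$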
Lebesgue integrals can be visualized in a similar way to Riemann sums.
Projecting our intuition from two- and three-dimensional spaces onto high-dimensional spaces can go wildly wrong.
Classical central limit theorems characterize the error in computing the mean of a set of independent random variables. The effective sample size helps generalize this to dependent/correlated sequences of random variables.
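One common form of the effective sample size for a stationary sequence of length $N$ with autocorrelations $\rho_k$ is

$$N_{\text{eff}} = \frac{N}{1 + 2\sum_{k=1}^{\infty} \rho_k}.$$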
Complicated likelihoods with intractable normalizing constants are commonplace in many modern machine learning methods. Score matching is an approach to fit these models which circumvents the need to approximate these intractable constants.
Empirical Bayesian methods take a counterintuitive approach to the problem of choosing priors: selecting priors that are informed by the data itself.
The field of optimal transport is concerned with finding routes for the movement of mass that minimize cost. Here, we review two of the most popular framings of the OT problem and demonstrate some solutions with simple numerical examples.
Dirichlet process mixture models provide an attractive alternative to finite mixture models because they don't require the modeler to specify the number of components a priori.
Conjugate gradient descent is an approach to optimization that accounts for second-order structure of the objective function.
Constructing and evaluating positive semidefinite kernel functions is a major challenge in statistics and machine learning. By leveraging Bochner's theorem, we can approximate a kernel function by transforming samples from its spectral density.
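As an illustrative sketch of this idea (random Fourier features for the RBF kernel, not necessarily the exact construction used in the post): the spectral density of the RBF kernel is Gaussian, so sampling frequencies from it gives an explicit approximate feature map.

```python
import numpy as np

def rff_features(X, n_features=500, lengthscale=1.0, rng=np.random.default_rng(0)):
    """Random Fourier features approximating k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, n_features))  # draws from the spectral density
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Z = rff_features(X)
K_approx = Z @ Z.T  # approximates the exact RBF kernel matrix
```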
Normalizing flows are a family of methods for flexibly approximating complex distributions. By combining ideas from probability theory, statistics, and deep learning, the learned distributions can be much more complex than traditional approaches to density estimation.
Martingales are a special type of stochastic process that are, in a sense, unpredictable.
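Formally, the defining property is that the conditional expectation of the next value, given the history, equals the current value:

$$\mathbb{E}[X_{n+1} \mid X_1, \dots, X_n] = X_n.$$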
Inducing point approximations for Gaussian processes can be formalized into a Bayesian model using a variational inference approach.
Most research in machine learning and computational statistics focuses on advancing methodology. However, a less-hyped topic — but an extremely important one — is the actual implementation of these methods using programming languages and compilers.
Stochastic variational inference (SVI) is a family of methods that exploits stochastic optimization techniques to speed up variational approaches and scale them to large datasets.
The natural gradient generalizes the classical gradient to account for non-Euclidean geometries.
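Concretely, the natural gradient rescales the ordinary gradient by the inverse Fisher information matrix $F(\theta)$:

$$\tilde{\nabla}\mathcal{L}(\theta) = F(\theta)^{-1}\,\nabla\mathcal{L}(\theta).$$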
Linear dimensionality reduction is a cornerstone of machine learning and statistics. Here we review a 2015 paper by Cunningham and Ghahramani that unifies the zoo of linear dimensionality reduction methods by casting each of them as a special case of a very general optimization problem.
Inducing points provide a strategy for lowering the computational cost of Gaussian process prediction by closely modeling only a subset of the input space.
Minimizing the $\chi^2$ divergence between a true posterior and an approximate posterior is equivalent to minimizing an upper bound on the log marginal likelihood.
Here, we discuss and visualize the mode-seeking behavior of the reverse KL divergence.
Mixed models are effectively a special case of hierarchical models. In this post, I try to draw some connections between these jargon-filled modeling approaches.
Bayesian posterior inference requires the analyst to specify a full probabilistic model of the data generating process. Gibbs posteriors are a broader family of distributions that are intended to relax this requirement and to allow arbitrary loss functions.
Estimators based on sampling schemes can be 'Rao-Blackwellized' to reduce their variance.
Bayesian models provide a principled way to make inferences about underlying parameters. But under what conditions do those inferences converge to the truth?
'Recommending that scientists use Bayes' theorem is like giving the neighborhood kids the key to your F-16' and other critiques.
The power iteration algorithm is a numerical approach to computing the top eigenvector and eigenvalue of a matrix.
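A minimal sketch of the algorithm (assuming a symmetric matrix with a dominant eigenvalue):

```python
import numpy as np

def power_iteration(A, n_iter=1000, rng=np.random.default_rng(0)):
    """Estimate the top eigenvalue/eigenvector of A by repeated multiplication and normalization."""
    v = rng.normal(size=A.shape[1])
    for _ in range(n_iter):
        v = A @ v
        v /= np.linalg.norm(v)
    eigenvalue = v @ A @ v  # Rayleigh quotient
    return eigenvalue, v

A = np.array([[2.0, 1.0], [1.0, 3.0]])
print(power_iteration(A))  # matches the largest eigenpair from np.linalg.eigh(A)
```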
Matrix musings.
A brief review of shrinkage in ridge regression and a comparison to OLS.
The binomial model is a simple method for determining the prices of options.
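As a sketch (the post's own notation and conventions may differ), here is a Cox-Ross-Rubinstein binomial tree for pricing a European call:

```python
import numpy as np

def binomial_call_price(S0, K, T, r, sigma, n_steps=200):
    """European call price in the Cox-Ross-Rubinstein binomial tree."""
    dt = T / n_steps
    u = np.exp(sigma * np.sqrt(dt))       # up factor
    d = 1.0 / u                           # down factor
    p = (np.exp(r * dt) - d) / (u - d)    # risk-neutral up probability

    # Stock prices and option payoffs at maturity.
    j = np.arange(n_steps + 1)
    S_T = S0 * u**j * d**(n_steps - j)
    values = np.maximum(S_T - K, 0.0)

    # Step backward through the tree, discounting expected values.
    for _ in range(n_steps):
        values = np.exp(-r * dt) * (p * values[1:] + (1 - p) * values[:-1])
    return values[0]

print(binomial_call_price(S0=100, K=100, T=1.0, r=0.05, sigma=0.2))  # ~10.45, near the Black-Scholes value
```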
BFGS is a second-order optimization method -- a close relative of Newton's method -- that approximates the Hessian of the objective function.
Tweedie distributions are a very general family of distributions that includes the Gaussian, Poisson, and Gamma (among many others) as special cases.
Here, we discuss two distributions which arise as scale mixtures of normals: the Laplace and the Student-$t$.
A brief review of Gaussian processes with simple visualizations.
A brief review of three types of stochastic processes: Wiener processes, generalized Wiener processes, and Itô processes.
Slice sampling is a method for obtaining random samples from an arbitrary distribution. Here, we walk through the basic steps of slice sampling and present two visual examples.
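To make the steps concrete, here is a minimal sketch for a standard normal target, where the slice $\{x : f(x) > y\}$ can be computed in closed form (a general implementation would need a stepping-out/shrinkage procedure):

```python
import numpy as np

def slice_sample_normal(n_samples=10_000, rng=np.random.default_rng(0)):
    """Slice sampling for a standard normal with unnormalized density f(x) = exp(-x^2 / 2)."""
    x = 0.0
    samples = []
    for _ in range(n_samples):
        y = rng.uniform(0, np.exp(-0.5 * x**2))  # vertical step: uniform height under the density at x
        width = np.sqrt(-2.0 * np.log(y))        # the horizontal slice is (-width, width)
        x = rng.uniform(-width, width)           # horizontal step: uniform over the slice
        samples.append(x)
    return np.array(samples)

draws = slice_sample_normal()
print(draws.mean(), draws.std())  # approximately 0 and 1
```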
Bayesian model averaging provides a way to combine information across statistical models and account for the uncertainty embedded in each.
The James-Stein estimator dominates the MLE by sharing information across seemingly unrelated variables.
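For reference, with $p \ge 3$ means observed as $x \sim \mathcal{N}(\theta, \sigma^2 I_p)$, the basic James-Stein estimator shrinks the observations toward zero:

$$\hat{\theta}_{\text{JS}} = \left(1 - \frac{(p-2)\sigma^2}{\|x\|^2}\right) x.$$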
Expectation maximization can be seen as a special case of variational inference when the approximating distribution for the parameters $q(\theta)$ is taken to be a point mass.
Copulas are flexible statistical tools for modeling correlation structure between variables.
Schur complements are quantities that arise often in linear algebra in the context of block matrix inversion. Here, we review the basics and show an application in statistics.
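For a block matrix $M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$ with $A$ invertible, the Schur complement of $A$ is

$$M/A = D - C A^{-1} B,$$

which shows up, for example, as the conditional covariance of a partitioned Gaussian.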
Hamiltonian Monte Carlo (HMC) is an MCMC method that borrows ideas from physics. Here, we'll give a brief overview and a simple example implementation.
Markov Chain Monte Carlo (MCMC) methods encompass a broad class of tools for fitting Bayesian models. Here, we'll review some of the basic motivation behind MCMC and a couple of the most well-known methods.
There exists a duality between maximum likelihood estimation and finding the maximum entropy distribution subject to a set of linear constraints.
Bayesian and frequentist methods can lead people to very different conclusions. One instance of this is Lindley's paradox, in which a hypothesis test arrives at opposite conclusions depending on whether a Bayesian or a frequentist test is used.
Choosing a prior distribution is a philosophically and practically challenging part of Bayesian data analysis. Noninformative priors try to skirt this issue by placing equal weight on all possible parameter values; however, these priors are often 'improper' -- we review this issue here.
Probabilistic PCA generalizes traditional PCA into a probabilistic model whose maximum likelihood estimate corresponds to the traditional version. Here, we give step-by-step derivations for some of the quantities of interest.
The Dirichlet process (DP) is one of the most common -- and one of the most simple -- prior distributions used in Bayesian nonparametric models. In this post, we'll review a couple different interpretations of DPs.
In many types of programming, random seeds are used to make computational results reproducible by generating a known set of random numbers. However, the choice of a random seed can affect results in non-trivial ways.
In prediction problems, we often fit one model, evaluate its performance, and test it on unseen data. But what if we could combine multiple models at once and leverage their combined performance? This is the spirit of 'boosting': creating an ensemble of learning algorithms, which perform better together than each does independently. Here, we'll give a quick overview of boosting, and we'll review one of the most influential boosting algorithms, AdaBoost.
Statistical 'whitening' is a family of procedures for standardizing and decorrelating a set of variables. Here, we'll review this concept in a general sense, and see two specific examples.
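As a minimal sketch of one such procedure (PCA whitening via the eigendecomposition of the sample covariance, not necessarily the exact examples in the post):

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    """Decorrelate and standardize the columns of X (n samples x d features)."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs / np.sqrt(eigvals + eps)    # whitening transform, one column per component
    return Xc @ W

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 3.0]], size=5000)
Z = pca_whiten(X)
print(np.cov(Z, rowvar=False).round(2))  # approximately the identity matrix
```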
Matrix decomposition methods factor a matrix $A$ into a product of two other matrices, $A = BC$. In this post, we review some of the most common matrix decompositions, and why they're useful.
As their name suggests, 'quasi-likelihoods' are quantities that aren't formally likelihood functions, but can be used as replacements for formal likelihoods in more general settings.
The representer theorem is a powerful result that implies a certain type of duality between solutions to function estimation problems.
Generalized linear models are flexible tools for modeling various response distributions. This post covers one common way of fitting them.
In this post, we cover a condition that is necessary and sufficient for the LASSO estimator to work correctly.
When we construct and analyze statistical estimators, we often assume that the model is correctly specified. However, in practice, this is rarely the case --- our assumed models are usually approximations of the truth, but they're useful nonetheless.
The Gumbel max trick is a method for sampling from discrete distributions using only a deterministic function of the distributions' parameters.
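A minimal sketch of the trick: add independent Gumbel(0, 1) noise to the log-probabilities and take the argmax.

```python
import numpy as np

def gumbel_max_sample(log_probs, rng):
    """Draw one category index from (unnormalized) log-probabilities via the Gumbel-max trick."""
    gumbel_noise = -np.log(-np.log(rng.uniform(size=len(log_probs))))  # Gumbel(0, 1) draws
    return np.argmax(log_probs + gumbel_noise)

rng = np.random.default_rng(0)
probs = np.array([0.2, 0.5, 0.3])
draws = [gumbel_max_sample(np.log(probs), rng) for _ in range(10_000)]
print(np.bincount(draws) / len(draws))  # approximately [0.2, 0.5, 0.3]
```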
Control variates are a class of methods for reducing the variance of a generic Monte Carlo estimator.
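As a toy sketch (not from the post itself), consider estimating $\mathbb{E}[e^U]$ for $U \sim \mathrm{Uniform}(0,1)$ using $U$ itself, whose mean $1/2$ is known, as the control variate:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)
f = np.exp(u)                  # want E[e^U] = e - 1
g = u                          # control variate with known mean 1/2

cov_fg = np.cov(f, g)
c = cov_fg[0, 1] / cov_fg[1, 1]            # estimated optimal coefficient
cv_estimate = np.mean(f - c * (g - 0.5))   # control-variate estimator

print(np.mean(f), cv_estimate, np.e - 1)   # both near e - 1; the second has lower variance
```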
Ridge regression --- a regularized variant of ordinary least squares --- is useful for dealing with collinearity and non-identifiability. Here, we'll explore some of the linear algebra behind it.
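The central object is the ridge estimator

$$\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y,$$

which is well defined even when $X^\top X$ is singular.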
In this post we'll cover a simple algorithm for managing a portfolio of assets called the Universal Portfolio, developed by Thomas Cover in the 90s. Although the method was developed in the context of finance, it applies more generally to the setting of online learning.
Maximum likelihood estimation (MLE) is one of the most popular and well-studied methods for creating statistical estimators. This post will review conditions under which the MLE is consistent.
This post briefly covers a broad class of statistical estimators: M-estimators. We'll review the basic definition, some well-known special cases, and some of its asymptotic properties.
Maximum entropy distributions are those that are the 'least informative' (i.e., have the greatest entropy) among a class of distributions with certain constraints. The principle of maximum entropy has roots across information theory, statistical mechanics, Bayesian probability, and philosophy. For this post, we'll focus on the simple definition of maximum entropy distributions.
VC dimension is a measure of the complexity of a statistical model. In essence, a model with a higher VC dimension is able to learn more complex mappings between data and labels. In this post, we'll firm up this definition and walk through a couple simple examples.
I take for granted that I can easily generate random samples from a variety of probability distributions in NumPy, R, and other statistical software. However, the process for generating these quantities is somewhat nontrivial, and we'll look under the hood at one example in this post.
When thinking about the convergence of random quantities, two types of convergence that are often confused with one another are convergence in probability and almost sure convergence. Here, I give the definition of each and a simple example that illustrates the difference. The example comes from the textbook *Statistical Inference* by Casella and Berger, but I'll step through the example in more detail.