This repository includes (future) informal posts displayed on my personal website. If there are errors in these posts, feel free to contact me 🤗
In this post, we dive into three essential convergence theorems in integration theory and explore how the Lebesgue integral handles limits so effectively.
High-dimensional statistics explores the complexities that arise when dealing with data sets where the number of features $d$ is large, often exceeding the number of samples $N$. Traditional statistical methods can struggle in these scenarios due to the curse of dimensionality, making it essential to develop specialized tools and techniques.
Expectation Maximization (EM) is a ubiquitous algorithm for performing maximum likelihood estimation. In this series, I present the EM algorithm for estimating the parameters of Gaussian mixture models (GMMs).
The Kalman filter, a cornerstone in estimation theory, is a powerful algorithm that excels at inferring the hidden state of a system based on noisy measurements.
Inverse transform sampling is a method for generating random numbers from any probability distribution by applying its inverse cumulative distribution function $F^{-1}$ to uniform random numbers.
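As a minimal sketch (not taken from the post itself), here is inverse transform sampling for the exponential distribution, whose inverse CDF has the closed form $F^{-1}(u) = -\ln(1-u)/\lambda$:

```python
import numpy as np

def sample_exponential(rate, size, rng=np.random.default_rng(0)):
    """Inverse transform sampling for Exp(rate): F^{-1}(u) = -ln(1 - u) / rate."""
    u = rng.uniform(size=size)      # uniform draws on (0, 1)
    return -np.log1p(-u) / rate     # push them through the inverse CDF

samples = sample_exponential(rate=2.0, size=10_000)
print(samples.mean())  # should be close to 1 / rate = 0.5
```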
Expectation maximization is extremely useful when we have to deal with latent variable models (for example, the unobserved component assignments in a mixture model).
Boole's inequality gives an upper bound on the probability that at least one of a countable collection of events occurs, expressed in terms of the probabilities of the individual events.
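Concretely, for events $A_1, A_2, \dots$,

$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) \le \sum_{i=1}^{\infty} P(A_i).$$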
Bessel's correction is the adjustment that makes the sample variance an unbiased estimator of the population variance.
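Concretely, dividing by $n-1$ rather than $n$ yields the unbiased estimator

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad \mathbb{E}[s^2] = \sigma^2.$$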
Lebesgue integrals can be visualized in a similar way to Riemann sums.
Projecting our intuition from two- and three-dimensional spaces onto high-dimensional spaces can go wildly wrong.
Classical central limit theorems characterize the error in computing the mean of a set of independent random variables. The effective sample size helps generalize this to dependent/correlated sequences of random variables.
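One common form of the effective sample size for a stationary sequence of length $N$ with autocorrelations $\rho_k$ is

$$N_{\text{eff}} = \frac{N}{1 + 2\sum_{k=1}^{\infty} \rho_k}.$$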
Complicated likelihoods with intractable normalizing constants are commonplace in many modern machine learning methods. Score matching is an approach to fit these models which circumvents the need to approximate these intractable constants.
Empirical Bayesian methods take a counterintuitive approach to the problem of choosing priors: selecting priors that are informed by the data itself.
The field of optimal transport is concerned with finding routes for the movement of mass that minimize cost. Here, we review two of the most popular framings of the OT problem and demonstrate some solutions with simple numerical examples.
Dirichlet process mixture models provide an attractive alternative to finite mixture models because they don't require the modeler to specify the number of components a priori.
Conjugate gradient descent is an approach to optimization that accounts for second-order structure of the objective function.
Constructing and evaluating positive semidefinite kernel functions is a major challenge in statistics and machine learning. By leveraging Bochner's theorem, we can approximate a kernel function by transforming samples from its spectral density.
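As an illustrative sketch of this idea (random Fourier features for the RBF kernel, not necessarily the exact construction used in the post): the spectral density of the RBF kernel is Gaussian, so sampling frequencies from it gives an explicit approximate feature map.

```python
import numpy as np

def rff_features(X, n_features=500, lengthscale=1.0, rng=np.random.default_rng(0)):
    """Random Fourier features approximating k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, n_features))  # draws from the spectral density
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Z = rff_features(X)
K_approx = Z @ Z.T  # approximates the exact RBF kernel matrix
```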
Normalizing flows are a family of methods for flexibly approximating complex distributions. By combining ideas from probability theory, statistics, and deep learning, the learned distributions can be much more complex than traditional approaches to density estimation.
Martingales are a special type of stochastic process that are, in a sense, unpredictable.
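Formally, the defining property is that the conditional expectation of the next value, given the history, equals the current value:

$$\mathbb{E}[X_{n+1} \mid X_1, \dots, X_n] = X_n.$$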
Inducing point approximations for Gaussian processes can be formalized into a Bayesian model using a variational inference approach.
Most research in machine learning and computational statistics focuses on advancing methodology. However, a less-hyped topic — but an extremely important one — is the actual implementation of these methods using programming languages and compilers.
Stochastic variational inference (SVI) is a family of methods that exploits stochastic optimization techniques to speed up variational approaches and scale them to large datasets.
The natural gradient generalizes the classical gradient to account for non-Euclidean geometries.
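Concretely, the natural gradient rescales the ordinary gradient by the inverse Fisher information matrix $F(\theta)$:

$$\tilde{\nabla}\mathcal{L}(\theta) = F(\theta)^{-1}\,\nabla\mathcal{L}(\theta).$$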
Linear dimensionality reduction is a cornerstone of machine learning and statistics. Here we review a 2015 paper by Cunningham and Ghahramani that unifies the zoo of linear dimensionality reduction methods by casting each of them as a special case of a very general optimization problem.
Inducing points provide a strategy for lowering the computational cost of Gaussian process prediction by closely modeling only a subset of the input space.
Minimizing the $\chi^2$ divergence between a true posterior and an approximate posterior is equivalent to minimizing an upper bound on the log marginal likelihood.
Here, we discuss and visualize the mode-seeking behavior of the reverse KL divergence.
Mixed models are effectively a special case of hierarchical models. In this post, I try to draw some connections between these jargon-filled modeling approaches.
Bayesian posterior inference requires the analyst to specify a full probabilistic model of the data generating process. Gibbs posteriors are a broader family of distributions that are intended to relax this requirement and to allow arbitrary loss functions.
Estimators based on sampling schemes can be 'Rao-Blackwellized' to reduce their variance.
Bayesian models provide a principled way to make inferences about underlying parameters. But under what conditions do those inferences converge to the truth?
'Recommending that scientists use Bayes' theorem is like giving the neighborhood kids the key to your F-16' and other critiques.
The power iteration algorithm is a numerical approach to computing the top eigenvector and eigenvalue of a matrix.
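A minimal sketch of the algorithm (assuming a symmetric matrix with a dominant eigenvalue):

```python
import numpy as np

def power_iteration(A, n_iter=1000, rng=np.random.default_rng(0)):
    """Estimate the top eigenvalue/eigenvector of A by repeated multiplication and normalization."""
    v = rng.normal(size=A.shape[1])
    for _ in range(n_iter):
        v = A @ v
        v /= np.linalg.norm(v)
    eigenvalue = v @ A @ v  # Rayleigh quotient
    return eigenvalue, v

A = np.array([[2.0, 1.0], [1.0, 3.0]])
print(power_iteration(A))  # matches the largest eigenpair from np.linalg.eigh(A)
```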
Matrix musings.
A brief review of shrinkage in ridge regression and a comparison to OLS.
The binomial model is a simple method for determining the prices of options.
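As a sketch (the post's own notation and conventions may differ), here is a Cox-Ross-Rubinstein binomial tree for pricing a European call:

```python
import numpy as np

def binomial_call_price(S0, K, T, r, sigma, n_steps=200):
    """European call price in the Cox-Ross-Rubinstein binomial tree."""
    dt = T / n_steps
    u = np.exp(sigma * np.sqrt(dt))       # up factor
    d = 1.0 / u                           # down factor
    p = (np.exp(r * dt) - d) / (u - d)    # risk-neutral up probability

    # Stock prices and option payoffs at maturity.
    j = np.arange(n_steps + 1)
    S_T = S0 * u**j * d**(n_steps - j)
    values = np.maximum(S_T - K, 0.0)

    # Step backward through the tree, discounting expected values.
    for _ in range(n_steps):
        values = np.exp(-r * dt) * (p * values[1:] + (1 - p) * values[:-1])
    return values[0]

print(binomial_call_price(S0=100, K=100, T=1.0, r=0.05, sigma=0.2))  # ~10.45, near the Black-Scholes value
```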
BFGS is a second-order optimization method -- a close relative of Newton's method -- that approximates the Hessian of the objective function.
Tweedie distributions are a very general family of distributions that includes the Gaussian, Poisson, and Gamma (among many others) as special cases.
Here, we discuss two distributions which arise as scale mixtures of normals: the Laplace and the Student-$t$.
A brief review of Gaussian processes with simple visualizations.
A brief review of three types of stochastic processes: Wiener processes, generalized Wiener processes, and Itô processes.
Slice sampling is a method for obtaining random samples from an arbitrary distribution. Here, we walk through the basic steps of slice sampling and present two visual examples.
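To make the steps concrete, here is a minimal sketch for a standard normal target, where the slice $\{x : f(x) > y\}$ can be computed in closed form (a general implementation would need a stepping-out/shrinkage procedure):

```python
import numpy as np

def slice_sample_normal(n_samples=10_000, rng=np.random.default_rng(0)):
    """Slice sampling for a standard normal with unnormalized density f(x) = exp(-x^2 / 2)."""
    x = 0.0
    samples = []
    for _ in range(n_samples):
        y = rng.uniform(0, np.exp(-0.5 * x**2))  # vertical step: uniform height under the density at x
        width = np.sqrt(-2.0 * np.log(y))        # the horizontal slice is (-width, width)
        x = rng.uniform(-width, width)           # horizontal step: uniform over the slice
        samples.append(x)
    return np.array(samples)

draws = slice_sample_normal()
print(draws.mean(), draws.std())  # approximately 0 and 1
```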
Bayesian model averaging provides a way to combine information across statistical models and account for the uncertainty embedded in each.
The James-Stein estimator dominates the MLE by sharing information across seemingly unrelated variables.
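For reference, with $p \ge 3$ means observed as $x \sim \mathcal{N}(\theta, \sigma^2 I_p)$, the basic James-Stein estimator shrinks the observations toward zero:

$$\hat{\theta}_{\text{JS}} = \left(1 - \frac{(p-2)\sigma^2}{\|x\|^2}\right) x.$$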
Expectation maximization can be seen as a special case of variational inference when the approximating distribution for the parameters $q(\theta)$ is taken to be a point mass.
Copulas are flexible statistical tools for modeling correlation structure between variables.
Schur complements are quantities that arise often in linear algebra in the context of block matrix inversion. Here, we review the basics and show an application in statistics.
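For a block matrix $M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$ with $A$ invertible, the Schur complement of $A$ is

$$M/A = D - C A^{-1} B,$$

which shows up, for example, as the conditional covariance of a partitioned Gaussian.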
Hamiltonian Monte Carlo (HMC) is an MCMC method that borrows ideas from physics. Here, we'll give a brief overview and a simple example implementation.
Markov Chain Monte Carlo (MCMC) methods encompass a broad class of tools for fitting Bayesian models. Here, we'll review some of the basic motivation behind MCMC and a couple of the most well-known methods.
There exists a duality between maximum likelihood estimation and finding the maximum entropy distribution subject to a set of linear constraints.
Bayesian and frequentist methods can lead people to very different conclusions. One instance of this is Lindley's paradox, in which a hypothesis test arrives at opposite conclusions depending on whether a Bayesian or a frequentist test is used.
Choosing a prior distribution is a philosophically and practically challenging part of Bayesian data analysis. Noninformative priors try to skirt this issue by placing equal weight on all possible parameter values; however, these priors are often 'improper' -- we review this issue here.
Probabilistic PCA generalizes traditional PCA into a probabilistic model whose maximum likelihood estimate corresponds to the traditional version. Here, we give step-by-step derivations for some of the quantities of interest.
The Dirichlet process (DP) is one of the most common -- and one of the most simple -- prior distributions used in Bayesian nonparametric models. In this post, we'll review a couple different interpretations of DPs.
In many types of programming, random seeds are used to make computational results reproducible by generating a known set of random numbers. However, the choice of a random seed can affect results in non-trivial ways.
In prediction problems, we often fit one model, evaluate its performance, and test it on unseen data. But what if we could combine multiple models at once and leverage their combined performance? This is the spirit of 'boosting': creating an ensemble of learning algorithms, which perform better together than each does independently. Here, we'll give a quick overview of boosting, and we'll review one of the most influential boosting algorithms, AdaBoost.
Statistical 'whitening' is a family of procedures for standardizing and decorrelating a set of variables. Here, we'll review this concept in a general sense, and see two specific examples.
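As a minimal sketch of one such procedure (PCA whitening via the eigendecomposition of the sample covariance, not necessarily the exact examples in the post):

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    """Decorrelate and standardize the columns of X (n samples x d features)."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs / np.sqrt(eigvals + eps)    # whitening transform, one column per component
    return Xc @ W

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 3.0]], size=5000)
Z = pca_whiten(X)
print(np.cov(Z, rowvar=False).round(2))  # approximately the identity matrix
```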
Matrix decomposition methods factor a matrix $A$ into a product of two other matrices, $A = BC$. In this post, we review some of the most common matrix decompositions, and why they're useful.
As their name suggests, 'quasi-likelihoods' are quantities that aren't formally likelihood functions, but can be used as replacements for formal likelihoods in more general settings.
The representer theorem is a powerful result that implies a certain type of duality between solutions to function estimation problems.
Generalized linear models are flexible tools for modeling various response distributions. This post covers one common way of fitting them.
In this post, we cover a condition that is necessary and sufficient for the LASSO estimator to work correctly.
When we construct and analyze statistical estimators, we often assume that the model is correctly specified. However, in practice, this is rarely the case --- our assumed models are usually approximations of the truth, but they're useful nonetheless.
The Gumbel max trick is a method for sampling from discrete distributions using only a deterministic function of the distributions' parameters.
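A minimal sketch of the trick: add independent Gumbel(0, 1) noise to the log-probabilities and take the argmax.

```python
import numpy as np

def gumbel_max_sample(log_probs, rng):
    """Draw one category index from (unnormalized) log-probabilities via the Gumbel-max trick."""
    gumbel_noise = -np.log(-np.log(rng.uniform(size=len(log_probs))))  # Gumbel(0, 1) draws
    return np.argmax(log_probs + gumbel_noise)

rng = np.random.default_rng(0)
probs = np.array([0.2, 0.5, 0.3])
draws = [gumbel_max_sample(np.log(probs), rng) for _ in range(10_000)]
print(np.bincount(draws) / len(draws))  # approximately [0.2, 0.5, 0.3]
```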
Control variates are a class of methods for reducing the variance of a generic Monte Carlo estimator.
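As a toy sketch (not from the post itself), consider estimating $\mathbb{E}[e^U]$ for $U \sim \mathrm{Uniform}(0,1)$ using $U$ itself, whose mean $1/2$ is known, as the control variate:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)
f = np.exp(u)                  # want E[e^U] = e - 1
g = u                          # control variate with known mean 1/2

cov_fg = np.cov(f, g)
c = cov_fg[0, 1] / cov_fg[1, 1]            # estimated optimal coefficient
cv_estimate = np.mean(f - c * (g - 0.5))   # control-variate estimator

print(np.mean(f), cv_estimate, np.e - 1)   # both near e - 1; the second has lower variance
```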
Ridge regression --- a regularized variant of ordinary least squares --- is useful for dealing with collinearity and non-identifiability. Here, we'll explore some of the linear algebra behind it.
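The central object is the ridge estimator

$$\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y,$$

which is well defined even when $X^\top X$ is singular.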
In this post we'll cover a simple algorithm for managing a portfolio of assets called the Universal Portfolio, developed by Thomas Cover in the 90s. Although the method was developed in the context of finance, it applies more generally to the setting of online learning.
Maximum likelihood estimation (MLE) is one of the most popular and well-studied methods for creating statistical estimators. This post will review conditions under which the MLE is consistent.
This post briefly covers a broad class of statistical estimators: M-estimators. We'll review the basic definition, some well-known special cases, and some of its asymptotic properties.
Maximum entropy distributions are those that are the 'least informative' (i.e., have the greatest entropy) among a class of distributions with certain constraints. The principle of maximum entropy has roots across information theory, statistical mechanics, Bayesian probability, and philosophy. For this post, we'll focus on the simple definition of maximum entropy distributions.
VC dimension is a measure of the complexity of a statistical model. In essence, a model with a higher VC dimension is able to learn more complex mappings between data and labels. In this post, we'll firm up this definition and walk through a couple simple examples.
I take for granted that I can easily generate random samples from a variety of probability distributions in NumPy, R, and other statistical software. However, the process for generating these quantities is somewhat nontrivial, and we'll look under the hood at one example in this post.
When thinking about the convergence of random quantities, two types of convergence that are often confused with one another are convergence in probability and almost sure convergence. Here, I give the definition of each and a simple example that illustrates the difference. The example comes from the textbook *Statistical Inference* by Casella and Berger, but I'll step through the example in more detail.