As their name suggests, “quasi-likelihoods” are quantities that aren’t formally likelihood functions, but can be used as replacements for formal likelihoods in more general settings.
A major reason that quasi-likelihoods are useful is that they don’t require the statistician to specify a full probability distribution for the data – instead, only the first and second moments must be specified. This allows the modeler to have more flexibility without having to impose unnecessary (and incorrect) assumptions.
Most results for quasi-likelihoods are in the context of the exponential family, which we review here first.
Recall the form of the exponential family of distributions:
\[f(y; \theta) = \exp\left\{ a(y) b(\theta) + c(\theta) + d(y) \right\}.\]for some functions $a$, $b$, $c$, and $d$. In canonical form, we have $a(y) = y$ and $b(\theta) = \theta$. The form of the likelihood then simplifies to
\[f(y; \theta) = \exp\left\{ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}\]where we call $\phi$ a “dispersion parameter”.
Generalized linear models (GLMs) extend the classical linear model to allow for various types of response distributions. For a response vector $\mathbf{Y}$, a matrix of covariates $\mathbf{X}$, and a vector of coefficients $\boldsymbol{\beta}$, a GLM has the model structure of
\[\mathbb{E}[\mathbf{Y} | \mathbf{X}] = \mu(\mathbf{X}) = g^{-1}(\mathbf{X} \boldsymbol{\beta})\]where $g(\cdot)$ is a link function. Equivalently, we have $g(\mu(\mathbf{X})) = \mathbf{X}\boldsymbol{\beta}$, which we typically define to be the parameter $\theta$: $\theta = g(\mu(\mathbf{X})) = g(\mu)$. We also define $\eta$ to be the linear predictor: $\eta = \mathbf{X} \boldsymbol{\beta}$.
Additionally, $\mathbf{Y} | \mathbf{X}$ is typically assumed to come from an exponential family distribution. For a sample of size $n$, the likelihood is then
\[L(\theta) = \prod\limits_{i=1}^n \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\}.\]The corresponding log-likelihood is
\[\ell(\theta) = \sum\limits_{i=1}^n \left[\frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi)\right].\]To perform maximum likelihood estimation, we take the derivative w.r.t. the coefficients $\beta_1, \dots, \beta_p$ and set these expressions to zero. This can be accomplished with the chain rule:
\[\frac{\partial \ell_i}{\partial \beta_j} = \frac{\partial \ell_i}{\partial \theta_i} \cdot \frac{\partial \theta_i}{\partial \mu_i} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot \frac{\partial \eta_i}{\partial \beta_j}.\]Breaking these partial derivatives down, we have
\begin{align} \frac{\partial \ell_i}{\partial \theta_i} &= \frac{y_i - b’(\theta_i)}{a(\phi)} = \frac{y_i - \mu_i}{a(\phi)} \\ \frac{\partial \theta_i}{\partial \mu_i} &= g’(\mu_i) = (b’‘(y_i))^{-1} = \frac{a(\phi)}{\text{Var}(y_i)} = \frac{1}{v(y_i)} \\ \frac{\partial \eta_i}{\partial \beta_j} &= X_{ij} \\ \end{align}
The expression for $\frac{\partial \ell_i}{\partial \theta_i}$ follows because $b’(\theta)$ is equal to the mean $\mu$ in the exponential family. We won’t specify $\frac{\partial \mu_i}{\partial \eta_j}$ further, just to allow for an arbitrary relationship between $\mu$ and $\eta$.
Putting these together, we have
\begin{align} \frac{\partial \ell_i}{\partial \beta_j} &= \frac{y_i - \mu_i}{a(\phi)} \cdot \frac{1}{v(y_i)} \cdot \frac{\partial \mu_i}{\partial \eta_j} \cdot X_{ij} \\ &= \frac{(y_i - \mu_i) X_{ij}}{a(\phi) v(\mu_i)} \cdot \frac{\partial \mu_i}{\partial \eta_j} \end{align}
Setting these equations to zero gives us the score equations needed to perform MLE:
\[\sum\limits_{i=1}^n \frac{(y_i - \mu_i) X_{ij}}{a(\phi) v(\mu_i)} \cdot \frac{\partial \mu_i}{\partial \eta_j} = 0, \;\;\; j = 1, \dots, p\]Notice that these equations depend on the specified probability distribution through $\mu_i$ and the variance function $v(\mu)$. Specifically, once a distribution is specified, this variance function is automatically specified as well.
We now turn to quasi-likelihoods, which allow us to avoid specifying an entire distribution. Instead, we directly specify the mean-variance relationship.
A quasi-likelihood is defined as
\[\int_y^\mu \frac{y - t}{a(\phi) V(t)} dt\]where $V$ is a function that relates the variance to the mean, and $a(\phi)$ is a (unknown) dispersion parameter. Differentiating, we have
\[\frac{\partial Q}{\partial \mu} = \frac{y - \mu}{a(\phi) V(\mu)}\]We can see already that this expression is equivalent to the typical likelihood case, $\frac{\partial \ell}{\partial \mu}$.
To get the full “quasi-score equations”, we have to perform one more step of the chain rule, by multiplying by $\frac{\partial \mu}{\partial \beta}$:
\[\sum\limits_{i=1}^n \frac{(y_i - \mu_i)}{a(\phi) v(\mu_i)} \cdot \frac{\partial \mu_i}{\partial \beta_j} = 0, \;\;\; j = 1, \dots, p\]Notice that the coefficient estimates $\hat{\beta}$ are the same for the full GLM case and for quasi-likelihoods. However, they will have different variance estimates.
To build intuition, it’s useful to note that many well-known distributions correspond to a particular choice of the mean-variance relationship in the world of quasi-likelihood world. The converse is not true: there are many quasi-likelihood specifications that do not map onto any known distributions.
Table 9.1 from Peter McCullagh’s book Generalized Linear Models shows several such examples:
The column $Q(\mu; y)$ here is calculated by computing the integral
\[\int_y^\mu \frac{y - t}{a(\phi) V(t)} dt\]for the specified function $V(\mu)$, and it is assumed $a(\phi) = 1$ for simplicity.
For example, in the Gaussian case we have $V(\mu) = 1$, which implies
\begin{align} \int_y^\mu \frac{y - t}{a(\phi) V(t)} dt &= \int_y^\mu \frac{y - t}{1 \cdot 1} dt \\ &= \int_y^\mu (y - t) dt \\ &= -\frac12 (y - t)^2 \big\rvert_y^\mu \\ &= -\frac12 (y - \mu)^2 \end{align}
In the Poisson case, we have $V(\mu) = \mu$, which implies
\begin{align} \int_y^\mu \frac{y - t}{a(\phi) V(t)} dt &= \int_y^\mu \frac{y - t}{1 \cdot t} dt \\ &= \int_y^\mu \left(\frac{y}{t} - 1\right) dt \\ &= y \log \mu - \mu \\ \end{align}
However, notice that the table also includes variance function with no corresponding closed-form likelihood. An interesting case of this is when $V(\mu) = \mu^\zeta$, where $\zeta \neq 0, 1, 2$. This power law mean-variance relationship occurs often in ecology through Taylor’s law, and more generally in statistics it appears in the family of distributions known as the Tweedie distribution.