Binh Ho

Copulas and Sklar's Theorem

Copulas are flexible statistical tools for modeling correlation structure between variables.

Background

Consider $p$ random variables, $X_1, \dots, X_p$. Specifying a joint distribution over these variables would allow us to directly model the covariance between them. However, specifying a proper model directly can become difficult or impossible in some cases. For example, when the variables have mixed data types (e.g., some are continuous and some are integer-valued), it’s not obvious how to specify a joint distribution over them. Furthermore, some distributions (e.g., the Poisson distribution) do not have a natural multivariate extension, like the Gaussian does.

Copulas solve these issues by disentangling two parts of the modeling decision: specifying the dependence structure that ties the variables together, and specifying the marginal distribution of each variable.

Sklar’s Theorem

Copulas have solid theoretical foundations through Sklar’s Theorem (which was proven by Abe Sklar):

For any random variables $X_1, \dots, X_p$ with joint CDF $F(x_1, \dots, x_p)$ and marginal CDFs $F_j(x) = P(X_j \le x)$, there exists a copula $C$ such that $F(x_1, \dots, x_p) = C(F_1(x_1), \dots, F_p(x_p))$. Furthermore, if each $F_j(x)$ is continuous, then $C$ is unique.

In essence, Sklar’s Theorem says that a joint CDF can be properly decomposed into the marginal CDFs and a copula that describes the variables’ dependence on one another.

Copulas

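More practically, a copula model is specified by two things: the marginal CDFs $F_1, \dots, F_p$ of the individual variables, and a copula function $C_\phi$ with parameters $\phi$ that describes their dependence.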

Given these two quantities, a copula models the multivariate distribution function as

$$F(x_1, \dots, x_p) = C_\phi(F_1(x_1), \dots, F_p(x_p)).$$

In other words, the overall CDF of the variables separates into the marginal CDF of each variable and a copula that couples them. This is often written as

$$F(x_1, \dots, x_p) = C_\phi(u_1, \dots, u_p)$$

where $u_j = F_j(x_j)$ lies in $[0, 1]$ because a valid CDF returns a number in $[0, 1]$. When $F_j$ is continuous, the random variable $F_j(X_j)$ is moreover uniformly distributed on $[0, 1]$. So, a copula can be described as a function that maps $[0, 1]^p$ to $[0, 1]$.
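The claim that pushing a continuous variable through its own CDF gives a uniform variable can be checked numerically. The sketch below is only an illustration: the Exponential distribution, its scale of 2.0, the sample size, and the seed are arbitrary choices that do not come from the text above.

```python
import numpy as np
from scipy.stats import expon, kstest

rng = np.random.default_rng(0)

# Draw from a continuous distribution (an Exponential, chosen arbitrarily).
x = expon.rvs(scale=2.0, size=5000, random_state=rng)

# Push each draw through its own CDF: u = F(x).
u = expon.cdf(x, scale=2.0)

# Compare u against Uniform(0, 1) with a Kolmogorov-Smirnov test.
ks_stat, p_value = kstest(u, "uniform")
print(f"min={u.min():.3f}, max={u.max():.3f}, KS statistic={ks_stat:.3f}")
```

A small KS statistic indicates that the transformed values are close to uniform. For a discrete marginal such as the Poisson, the transformed values still lie in $[0, 1]$ but are not uniform.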

Likelihoods for copula models

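To write down and compute the likelihood for one of these models, we first need to be able to obtain the corresponding PDF. In particular, assuming continuous marginals, we need to take the mixed partial derivative of $F(x_1, \dots, x_p)$ with respect to all $p$ variables. By the chain rule, this yields

$$\frac{\partial^p}{\partial x_1 \cdots \partial x_p} F(x_1, \dots, x_p) = c_\phi(u) \prod_{j=1}^p f_j(x_j; \theta_j)$$

where $u = (u_1, \dots, u_p) = (F_1(x_1), \dots, F_p(x_p))$, $f_j$ is the density of the $j$th marginal, $\theta_j$ is a parameter vector containing the parameters of the $j$th marginal CDF, and $c_\phi(u) = \frac{\partial^p}{\partial u_1 \cdots \partial u_p} C_\phi(u)$ is the copula density. The overall likelihood for a sample across all $p$ variables is then

$$L(x) = c_\phi(u) \prod_{j=1}^p f_j(x_j; \theta_j).$$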
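In practice the likelihood is usually evaluated on the log scale and over many observations. As a sketch, assume $n$ independent samples and write $x_{ij}$ for the $j$th variable of the $i$th sample (this indexing is introduced here for illustration and is not part of the notation above). The product then turns into a sum of a copula term and $p$ marginal terms per sample:

```latex
\ell(\phi, \theta_1, \dots, \theta_p)
  = \sum_{i=1}^{n} \left[ \log c_\phi(u_{i1}, \dots, u_{ip})
  + \sum_{j=1}^{p} \log f_j(x_{ij}; \theta_j) \right],
\qquad u_{ij} = F_j(x_{ij}; \theta_j).
```

The marginal parameters $\theta_j$ enter both terms, since each $u_{ij}$ depends on $\theta_j$ through the marginal CDF.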

Gaussian copula

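In a Gaussian copula, we define $C$ as

$$C(u_1, \dots, u_p) = \Phi_C(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_p))$$

where each $u_j \in [0, 1]$, $\Phi^{-1}$ is the inverse of the standard normal CDF $\Phi$, and $\Phi_C$ is the CDF of a zero-mean $p$-dimensional Gaussian with correlation matrix $C$. Here $C$ with arguments denotes the copula function, while the subscript $C$ denotes the correlation matrix.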

Writing this out more fully, we have

$$F(x_1, \dots, x_p) = C_\phi(u_1, \dots, u_p) = C_\phi(F_1(x_1), \dots, F_p(x_p)) = \Phi_C(\Phi^{-1}(F_1(x_1)), \dots, \Phi^{-1}(F_p(x_p)))$$

where $F_1, \dots, F_p$ are the marginal CDFs that can be specified by the modeler.
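The Gaussian copula CDF can be evaluated directly with SciPy by mapping each $u_j$ through the standard normal quantile function and then calling a multivariate normal CDF. This is a minimal sketch: the helper name `gaussian_copula_cdf` and the test point `[0.3, 0.8]` are hypothetical choices made for illustration.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula_cdf(u, corr):
    """Evaluate Phi_C(Phi^{-1}(u_1), ..., Phi^{-1}(u_p)) for correlation matrix corr."""
    t = norm.ppf(np.asarray(u, dtype=float))
    mvn = multivariate_normal(mean=np.zeros(len(t)), cov=corr)
    return float(mvn.cdf(t))

u = [0.3, 0.8]

# With an identity correlation matrix the copula reduces to the product u_1 * u_2.
independent = gaussian_copula_cdf(u, np.eye(2))

# With positive correlation the value lies between u_1 * u_2 and min(u_1, u_2).
correlated = gaussian_copula_cdf(u, np.array([[1.0, 0.7], [0.7, 1.0]]))
print(independent, correlated)
```

The multivariate normal CDF is computed by numerical integration, so the results are accurate only up to the integrator's tolerance.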

The density function of the Gaussian copula is then

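$$
\begin{aligned}
c_\phi(u) = \frac{\partial^p}{\partial u_1 \cdots \partial u_p} C_\phi(u)
&= \phi_C(\Phi^{-1}(u)) \prod_{j=1}^p \frac{1}{\phi(\Phi^{-1}(u_j))} \\
&= \frac{(2\pi)^{-p/2} |C|^{-1/2} \exp\left(-\frac{1}{2} t^\top C^{-1} t\right)}{(2\pi)^{-p/2} \exp\left(-\frac{1}{2} t^\top t\right)} \qquad (t = \Phi^{-1}(u)) \\
&= |C|^{-1/2} \exp\left(-\frac{1}{2} t^\top (C^{-1} - I) t\right)
\end{aligned}
$$

where $\phi(\cdot)$ is the standard normal PDF, $\phi_C(\cdot)$ is the PDF of the zero-mean $p$-dimensional Gaussian with correlation matrix $C$, and $\Phi^{-1}$ is applied elementwise.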
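The closed form $|C|^{-1/2} \exp\left(-\frac{1}{2} t^\top (C^{-1} - I) t\right)$ with $t = \Phi^{-1}(u)$ is easy to sanity-check in code. The sketch below assumes standard normal marginals purely for the check, because in that case the copula density times the marginal densities must equal the ordinary multivariate normal density. The helper name `gaussian_copula_logpdf` and the evaluation point are hypothetical.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula_logpdf(u, corr):
    """log c(u) = -0.5 * log|C| - 0.5 * t^T (C^{-1} - I) t, with t = Phi^{-1}(u)."""
    t = norm.ppf(np.asarray(u, dtype=float))
    _, logdet = np.linalg.slogdet(corr)
    quad = t @ (np.linalg.inv(corr) - np.eye(len(t))) @ t
    return -0.5 * logdet - 0.5 * quad

corr = np.array([[1.0, 0.7], [0.7, 1.0]])
x = np.array([0.4, -1.1])

# With standard normal marginals, u_j = Phi(x_j) and f_j is the standard normal PDF.
u = norm.cdf(x)
log_joint = gaussian_copula_logpdf(u, corr) + norm.logpdf(x).sum()

# Reference value: the bivariate normal log-density with the same correlation.
log_mvn = multivariate_normal(mean=np.zeros(2), cov=corr).logpdf(x)
print(log_joint, log_mvn)
```

The two printed values should agree up to floating-point error. Swapping in other marginals only changes how `u` and the marginal log-densities are computed; the copula term stays the same.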

Copulas as latent variable models

The Gaussian copula model also has an equivalent latent variable formulation. Let $Z_1, \dots, Z_p$ be latent variables distributed with the multivariate Gaussian structure

$$(Z_1, \dots, Z_p) \sim N(0, C).$$

Then the observed variable $X_j$ is related to $Z_j$ by

$$X_j = F_j^{-1}(\Phi(Z_j))$$

where $F_j^{-1}$ is the inverse (quantile function) of the $j$th marginal CDF $F_j$, and $\Phi$ is the standard normal CDF.
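A short calculation shows why this construction gives each observed variable the intended marginal. It relies on two standard facts: $\Phi(Z_j)$ is uniform on $[0, 1]$ because $Z_j$ is standard normal, and the quantile function, taken as the generalized inverse $F_j^{-1}(u) = \inf\{x : F_j(x) \ge u\}$, satisfies $F_j^{-1}(u) \le x$ exactly when $u \le F_j(x)$.

```latex
P(X_j \le x)
  = P\left( F_j^{-1}(\Phi(Z_j)) \le x \right)
  = P\left( \Phi(Z_j) \le F_j(x) \right)
  = F_j(x).
```

The dependence between the $X_j$ comes entirely from the correlation matrix of the latent Gaussian, since each $X_j$ is a monotone transformation of its own $Z_j$.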

Sampling from copulas

We can use the LVM formulation above to sample from a Gaussian copula with arbitrary marginals. Below is an example in Python to sample two correlated Poisson variables.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.stats import multivariate_normal
from scipy.stats import poisson

# Correlation matrix of the latent Gaussian variables
cov_mat = np.array([
    [1.0, 0.7],
    [0.7, 1.0]])

n = 1000
p = 2

# Generate latent variables Z ~ N(0, cov_mat)
Z = multivariate_normal.rvs(mean=np.zeros(p), cov=cov_mat, size=n)

# Pass through the standard normal CDF so each column lies in [0, 1]
Z_tilde = norm.cdf(Z)

# Apply the inverse CDF (quantile function) of the target marginal, here Poisson(10)
X = poisson.ppf(q=Z_tilde, mu=10)

# Plot
plt.scatter(X[:, 0], X[:, 1])
plt.show()

This code generates two correlated Poisson variables:

[Figure: Poisson RVs]
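Beyond the scatter plot, the samples can be checked numerically. The sketch below repeats the same sampling steps with a fixed seed and a larger sample size (both arbitrary choices made here), then looks at the marginal moments and at the Spearman rank correlation before and after the transformation.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal, poisson, spearmanr

rng = np.random.default_rng(0)
corr = np.array([[1.0, 0.7], [0.7, 1.0]])

# Same steps as the sampler: latent Gaussian -> uniform -> Poisson quantile.
Z = multivariate_normal.rvs(mean=np.zeros(2), cov=corr, size=5000, random_state=rng)
X = poisson.ppf(norm.cdf(Z), mu=10)

# Each marginal should look like Poisson(10): mean and variance both near 10.
sample_means = X.mean(axis=0)
sample_vars = X.var(axis=0)

# Rank correlation of the latent and observed pairs; ties in the discrete
# Poisson values mean the two are close but not identical.
rho_latent, _ = spearmanr(Z[:, 0], Z[:, 1])
rho_observed, _ = spearmanr(X[:, 0], X[:, 1])
print(sample_means, sample_vars, rho_latent, rho_observed)
```

The Pearson correlation of the observed variables will generally differ from the 0.7 set on the latent scale, because the marginal transformations are nonlinear.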

References