Linear Mixed Model#

When we want to study a specific genetic variant as a fixed effect but know there are countless unmeasured factors affecting our trait, linear mixed models offer a practical solution: use a random effect based on genetic similarity to absorb all that unmeasured heterogeneity.

Graphical Summary#

Fig

Key Formula#

The linear mixed model ("mixed" because it incorporates both fixed and random effects) can represent non-independent data structures. For a single variant it is written as

\[ \mathbf{Y} = \mathbf{X}\beta + \mathbf{g} + \boldsymbol{\epsilon} \]

where:

\(\mathbf{Y}\) is the \(N \times 1\) vector of phenotypes
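\(\mathbf{X}\) is the \(N \times 1\) genotype vector for the specific variant we’re testing, entering as a fixed effect (if \(p\) measurable covariates such as age and sex are added, \(\mathbf{X}\) becomes an \(N \times (p+1)\) design matrix and \(\beta\) a \((p+1) \times 1\) vector)
\(\beta\) is the scalar effect of the tested variant (unknown, to be estimated; treated as fixed here, although some models treat it as random)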
\(\mathbf{g} \sim N(0,\sigma^2_g\mathbf{G})\) is the \(N \times 1\) vector of random effects that captures all the genetic factors we can’t or don’t want to model explicitly - population structure, polygenic background, family relationships, and countless unknown genetic influences
\(\boldsymbol{\epsilon}\) is the \(N \times 1\) vector of residual errors, where \(\boldsymbol{\epsilon} \sim N(0, \sigma^2_e\mathbf{I})\)
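
To make the pieces of the formula concrete, here is a minimal sketch that simulates phenotypes from this model. The sample size, allele frequencies, effect size, variance components, and the way \(\mathbf{G}\) is built from random background genotypes are all illustrative assumptions, not values taken from this section.

```r
library(MASS)  # mvrnorm() draws from a multivariate normal

set.seed(1)
N <- 200   # number of individuals (assumed)
M <- 500   # number of background variants used to build G (assumed)

# Background genotypes coded 0/1/2, standardized column by column
Z_raw <- matrix(rbinom(N * M, size = 2, prob = 0.3), nrow = N, ncol = M)
Z <- scale(Z_raw)
G <- tcrossprod(Z) / M            # N x N genetic relationship matrix

# Fixed effect: one tested variant, centered
x <- rbinom(N, size = 2, prob = 0.4)
X <- matrix(x - mean(x), ncol = 1)
beta <- 0.5                       # assumed effect of the tested variant

sigma2_g <- 0.6                   # assumed genetic variance component
sigma2_e <- 0.4                   # assumed residual variance component

# Random effect g ~ N(0, sigma2_g * G) and residual eps ~ N(0, sigma2_e * I)
g   <- mvrnorm(1, mu = rep(0, N), Sigma = sigma2_g * G)
eps <- rnorm(N, mean = 0, sd = sqrt(sigma2_e))

# Y = X beta + g + eps
Y <- X %*% beta + g + eps

# Covariance of Y implied by the model, given X
V <- sigma2_g * G + sigma2_e * diag(N)
```

Because \(\mathbf{g}\) and \(\boldsymbol{\epsilon}\) are independent, the phenotypes have covariance \(\sigma^2_g\mathbf{G} + \sigma^2_e\mathbf{I}\), which is what `V` holds. The off-diagonal entries of `V` are how the model encodes non-independence between related individuals.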

Technical Details#

Random Effects Decomposition: \(\mathbf{g}\) and \(\mathbf{Zu}\)#

When testing a genetic variant, we cannot explicitly model every factor that contributes to the trait (age, sex, ancestry, thousands of genes, unknown factors, and so on). Instead of the complete model:

\[ \mathbf{Y} = \text{genetic variant} + \text{age} + \text{sex} + \text{ancestry} + \text{gene}_1 + \ldots + \text{gene}_{20000} + \text{unknown factors} + \boldsymbol{\epsilon} \]

We approximate it as:

\[ \mathbf{Y} = \text{genetic variant} + \text{age} + \text{sex} + \mathbf{g} + \boldsymbol{\epsilon} \]

where \(\mathbf{g}\) soaks up the effects of ancestry, polygenic background, and all the other genetic factors we can’t explicitly model.

The random effects term \(\mathbf{g}\) can be decomposed as:

\[ \mathbf{g} = \mathbf{Z} \mathbf{u} \]

where:

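\(\mathbf{Z}\) is the \(N \times M\) genotype matrix for \(M\) genetic variants
\(\mathbf{u} \sim N(0, \sigma_u^2\mathbf{I})\) is an \(M \times 1\) vector of random SNP effects
\(\mathbf{G}\) is the genetic relationship matrix (GRM) that measures genetic similarity between individuals, effectively capturing all the unmeasured genetic factors through genome-wide similarity patterns; with standardized genotypes it is commonly computed as \(\mathbf{G} = \mathbf{Z}\mathbf{Z}^\top / M\)

Under this decomposition, \(\text{Var}(\mathbf{g}) = \sigma_u^2\mathbf{Z}\mathbf{Z}^\top = \sigma_g^2\mathbf{G}\) with \(\sigma_g^2 = M\sigma_u^2\), which recovers \(\mathbf{g} \sim N(0,\sigma^2_g\mathbf{G})\). This formulation, in which every variant contributes a small random effect, is formally known as the infinitesimal model.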
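
The link between the SNP-level view \(\mathbf{Z}\mathbf{u}\) and the individual-level view \(\mathbf{g}\) can be checked numerically. The sketch below assumes standardized genotype columns and the common convention \(\mathbf{G} = \mathbf{Z}\mathbf{Z}^\top / M\); the allele counts and the value of \(\sigma_u^2\) are made up for illustration.

```r
# Allele counts for 5 individuals at 3 variants (illustrative values)
Z_raw <- matrix(c(0, 1, 1,
                  2, 0, 0,
                  1, 1, 0,
                  0, 0, 0,
                  0, 2, 2),
                nrow = 5, byrow = TRUE)

Z <- scale(Z_raw)          # center and scale each variant
M <- ncol(Z)

# GRM under the assumed convention G = Z Z' / M
G <- tcrossprod(Z) / M

sigma2_u <- 0.2            # assumed per-SNP effect variance
sigma2_g <- M * sigma2_u   # implied genetic variance component

# Covariance of g = Z u, written two ways
cov_from_u <- sigma2_u * tcrossprod(Z)   # Var(Z u) = sigma2_u * Z Z'
cov_from_g <- sigma2_g * G               # sigma2_g * G

all.equal(cov_from_u, cov_from_g)        # TRUE up to floating-point tolerance
```

The practical consequence is that the model never needs the \(M\) individual SNP effects: the \(N \times N\) matrix `G` carries everything the random effect requires.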

Popular LMM Methods in Statistical Genetics#

Method	Purpose	Key Innovation	Scale/Application
GCTA	Heritability estimation and association testing	Uses genome-wide SNPs to construct genetic relationship matrix (GRM)	\(h^2 = \frac{\sigma^2_g}{\sigma^2_g + \sigma^2_e}\) estimation
GEMMA	Fast genome-wide association studies with population structure control	Efficient eigendecomposition for kinship correction	Computationally efficient for large-scale data
BOLT-LMM	Ultra-fast mixed model association testing	Bayesian mixture-of-Gaussians prior on SNP effects, fit with iterative approximations that avoid building the GRM explicitly	Hundreds of thousands of individuals
SAIGE	Association testing for binary and quantitative traits	Saddlepoint approximation for unbalanced case-control data	Large-scale biobanks with rare outcomes
REGENIE	Whole genome regression with prediction	Two-step ridge regression avoiding explicit mixed model	Ultra-large biobanks (millions of samples)
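
As a rough illustration of what GRM-based association testing computes, the sketch below estimates \(\beta\) by generalized least squares under the covariance \(\sigma^2_g\mathbf{G} + \sigma^2_e\mathbf{I}\). It treats the variance components as known, which is an assumption made for brevity: real tools estimate them (for example by REML) and rely on much faster algebra. All data are simulated, and this is not the implementation of any tool in the table.

```r
set.seed(2)
N <- 100   # individuals (assumed)
M <- 300   # background variants (assumed)

# GRM from standardized simulated genotypes
Z <- scale(matrix(rbinom(N * M, size = 2, prob = 0.3), nrow = N, ncol = M))
G <- tcrossprod(Z) / M

sigma2_g <- 0.5   # treated as known (assumption)
sigma2_e <- 0.5   # treated as known (assumption)
V <- sigma2_g * G + sigma2_e * diag(N)

# Fixed effects: intercept plus one tested variant
x <- rbinom(N, size = 2, prob = 0.4)
X <- cbind(1, x)

# Simulate Y with correlated noise: if V = t(L) %*% L, then t(L) %*% z has covariance V
L <- chol(V)
Y <- X %*% c(0, 0.3) + drop(t(L) %*% rnorm(N))

# GLS: beta_hat = (X' V^-1 X)^-1 X' V^-1 Y
Vinv_X   <- solve(V, X)
XtVinvX  <- crossprod(X, Vinv_X)
beta_hat <- solve(XtVinvX, crossprod(Vinv_X, Y))
se       <- sqrt(diag(solve(XtVinvX)))

# Heritability implied by the assumed variance components
h2 <- sigma2_g / (sigma2_g + sigma2_e)
```

The second entry of `beta_hat` is the estimated effect of the tested variant, and `se` gives the matching standard errors, so their ratio is the usual Wald statistic.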

Example#

What happens when we acknowledge that our trait is influenced not just by one specific variant, but also by a “genetic background” from all the other variants we’re not explicitly testing? Let’s see how linear mixed models capture this reality.

We’ll use the same 5 individuals, but now we’ll model two sources of genetic influence: a fixed effect from one specific variant we’re studying, plus a random polygenic effect that represents the combined influence of all variants acting as genetic background. How much does each component contribute to the total genetic influence on our trait?

Setup#

# Clear the environment
rm(list = ls())
set.seed(13)
library(MASS) # For mvrnorm function
# Define genotypes for 5 individuals at 3 variants
# These represent actual alleles at each position
# For example, Individual 1 has genotypes: CC, CT, AT
genotypes <- c(
 "CC", "CT", "AT",  # Individual 1
 "TT", "TT", "AA",  # Individual 2
 "CT", "CT", "AA",  # Individual 3
 "CC", "TT", "AA",  # Individual 4
 "CC", "CC", "TT"   # Individual 5
)

# Reshape into a matrix
N <- 5  # number of individuals
M <- 3  # number of variants
geno_matrix <- matrix(genotypes, nrow = N, ncol = M, byrow = TRUE)
rownames(geno_matrix) <- paste("Individual", 1:N)
colnames(geno_matrix) <- paste("Variant", 1:M)

alt_alleles <- c("T", "C", "T")

# Convert to raw genotype matrix using the additive model
Xraw_additive <- matrix(0, nrow = N, ncol = M) # count number of non-reference alleles

rownames(Xraw_additive) <- rownames(geno_matrix)
colnames(Xraw_additive) <- colnames(geno_matrix)

for (i in 1:N) {
  for (j in 1:M) {
    alleles <- strsplit(geno_matrix[i,j], "")[[1]]
    Xraw_additive[i,j] <- sum(alleles == alt_alleles[j])
  }
}

X <- scale(Xraw_additive, center = TRUE, scale = TRUE)