SDPR (Summary-Statistics-Based Dirichelt Process Regression for Polygenic Risk Prediction)
Source:R/regularized_regression.R
sdpr.RdThis function is a wrapper for the SDPR C++ implementation, which performs Markov Chain Monte Carlo (MCMC) for estimating effect sizes and heritability based on summary statistics and reference LD matrices.
Usage
sdpr(
bhat,
LD,
n,
per_variant_sample_size = NULL,
array = NULL,
a = 0.1,
c = 1,
M = 1000,
a0k = 0.5,
b0k = 0.5,
iter = 1000,
burn = 200,
thin = 5,
n_threads = 1,
opt_llk = 1,
verbose = TRUE
)Arguments
- bhat
A vector of marginal beta values for each SNP.
- LD
A list of LD matrices, where each matrix corresponds to a subset of SNPs.
- n
The total sample size of the GWAS.
- per_variant_sample_size
(Optional) A vector of sample sizes for each SNP. If NULL (default), it will be initialized to a vector of length equal to `bhat`, with all values set to `n`.
- array
(Optional) A vector of genotyping array information for each SNP. If NULL (default), it will be initialized to a vector of 1's with length equal to `bhat`.
- a
Factor to shrink the reference LD matrix. Default is 0.1.
- c
Factor to correct for the deflation. Default is 1.
- M
Max number of variance components. Default is 1000.
- a0k
Hyperparameter for inverse gamma distribution. Default is 0.5.
- b0k
Hyperparameter for inverse gamma distribution. Default is 0.5.
- iter
Number of iterations for MCMC. Default is 1000.
- burn
Number of burn-in iterations for MCMC. Default is 200.
- thin
Thinning interval for MCMC. Default is 5.
- n_threads
Number of threads to use. Default is 1.
- opt_llk
Which likelihood to evaluate. 1 for equation 6 (slightly shrink the correlation of SNPs) and 2 for equation 5 (SNPs genotyped on different arrays in a separate cohort). Default is 1.
- verbose
Whether to print verbose output. Default is true.
Note
This function is a wrapper for the SDPR C++ implementation, which is a rewritten and adopted version of the SDPR package. The original SDPR documentation is available at https://htmlpreview.github.io/?https://github.com/eldronzhou/SDPR/blob/main/doc/Manual.html
Examples
# Generate example data
set.seed(985115)
n <- 350
p <- 16
sigmasq_error <- 0.5
zeroes <- rbinom(p, 1, 0.6)
beta.true <- rnorm(p, 1, sd = 4)
beta.true[zeroes] <- 0
X <- cbind(matrix(rnorm(n * p), nrow = n))
X <- scale(X, center = TRUE, scale = FALSE)
y <- X %*% matrix(beta.true, ncol = 1) + rnorm(n, 0, sqrt(sigmasq_error))
y <- scale(y, center = TRUE, scale = FALSE)
# Calculate sufficient statistics
XtX <- t(X) %*% X
Xty <- t(X) %*% y
yty <- t(y) %*% y
# Set the prior
K <- 9
sigma0 <- c(0.001, .1, .5, 1, 5, 10, 20, 30, .005)
omega0 <- rep(1 / K, K)
# Calculate summary statistics
b.hat <- sapply(1:p, function(j) {
summary(lm(y ~ X[, j]))$coefficients[-1, 1]
})
s.hat <- sapply(1:p, function(j) {
summary(lm(y ~ X[, j]))$coefficients[-1, 2]
})
R.hat <- cor(X)
var_y <- var(y)
sigmasq_init <- 1.5
# Run SDPR
LD <- list(blk1 = R.hat)
out <- sdpr(b.hat, LD, n)
# In sample prediction correlations
cor(X %*% out$beta_est, y) #
#> [,1]
#> [1,] 0.9983291