Phenotype Data Simulation#
Goal#
Here we use two strategies to simulate phenotype data (Y matrix) based on individual-level genotype data (X matrix).
1. Total heritability (\(\phi_{total}\))#
Effect size are all set to be 1. This value actually doesn’t matter so much because phi will eventually control the variance. Under this case, all SNPs share the same effect size, and they altogether contribute to explain \(\phi_{total}\) (eg.0.5) variance of the total variance. To get Y, we assume a multivariate gaussian distribution \(\textbf{Y} \sim N(\textbf{X} \beta, \sigma^2)\), and \(\sigma^2\) can be estimated by the equation below.
\(\phi_{total} = \dfrac{var(X \beta)}{\sigma^2 + var(X \beta)}\)
\(\sigma^2 = \frac{var(X \beta)(1-\phi_{total})}{\phi_{total}}\)
2. SNP level heritability#
Assume there are total number of \(a\) causal variants. We assign \(\beta_1 = 1\) as the initialize setting.
For each causal variant we have:
\(\frac{Var(X_1 \beta_1)}{Var(Y)} = \frac{Var(X_2 \beta_2)}{Var(Y)} = ... = \frac{Var(X_a \beta_a)}{Var(Y)} = \phi_{SNP}\)
In that way, we have: if \(\beta_1^2 Var(X_1) = \beta_2^2Var(X_2) = ... = \beta_a^2 Var(X_a)\). Then we have \(\beta_2 = \sqrt{\frac{\beta_1^2 Var(X_1)}{Var(X_2)}}\), …, \(\beta_a = \sqrt{\frac{\beta_1^2 Var(X_1)}{Var(X_a)}}\).
Then we will have \(\frac{Var(X_1 \beta_1)}{Var(Y)} = \frac{Var(X_1 \beta_1)}{Var(X \beta) + \sigma^2} = \phi_{SNP}\). Therefore, we have \(\sigma^2 = \frac{Var(X_1 \beta_1)}{\phi_{SNP}} - Var(X \beta)\)
Input#
genofile
:PLINK file of real genotype data, consisting of .bed
, .bim
, and .fam
files.
The .bed file stores the binary-encoded genotype information for each individual across all SNPs, making the data storage efficient. The .bim file is a text file containing information about each SNP, including chromosome number, SNP ID, genetic distance, base-pair position, and the two alleles.
The other parameters can be found in simxQTL repo. https://github.com/StatFunGen/simxQTL
.
Output#
An rds matrix, with genotype matrix X (dimension: m * n, m: number of sample, n: number of SNP ) and phenotype (trait) matrix (dimension: m * a, m : number of samples, a: number of simulated traits)
Example output:
result = readRDS("../sample_999_real_simulation_3_ncausal_5_trait.rds")
dim(result$X)
dim(result$Y)
- 1153
- 10534
- 1153
- 5
str(result$variant)
List of 5
$ : int [1:3] 196 5616 6566
$ : int [1:2] 5616 6566
$ : int [1:2] 196 6566
$ : int [1:2] 196 5616
$ : int [1:2] 196 6566
In this region, we simulated 3 causal variants (196, 5616, 6566). Each causal variant is distributed by distribution in 5 traits.
Simulation code#
Step 1: Randomly select regions#
Select gene regions to simulate - randomly select one gene from each 1,329 different TADB blocks.
Some genes might exist in more than one TADB block. Total unique gene number after random selection: 1287.
library(tidyverse)
mapper = read_tsv("./Data/gene_cis_TADB_mapper.tsv")
df_selected <- mapper %>% filter(`#chr` != "chrX") %>%
group_by(TADB_index) %>%
slice_sample(n = 1) %>%
ungroup()
genes_selected = df_selected %>% pull(gene_id)
genes_selected %>% length()
#genes_selected %>% save_tsv("/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/selected_gene_id.tsv")
genes_selected = read_tsv("./Data/selected_gene_id.tsv", col_names = "gene")$gene
mapper %>% filter(gene_id %in% genes_selected) %>% pull(TADB_index) %>% unique() %>% length()
Rows: 126063 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (5): #chr, gene_id, gene_name, TADB_index, TADB_id
dbl (6): gene_cis_start, gene_cis_end, TSS, end.x, TADB_start, TADB_end
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1329 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): gene
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
genes_selected %>% unique() %>% length()
Step 2: Make a copy these selected genotypes to another folder#
import os
import shutil
# Path to the file with gene names
gene_file_path = './Data/selected_gene_id.tsv'
source_folder = '/mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/plink_by_gene/extended_cis_before_winsorize_plink_files'
destination_folder = '/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype'
with open(gene_file_path, 'r') as file:
gene_names = [line.strip() for line in file]
if not os.path.exists(destination_folder):
os.makedirs(destination_folder)
for filename in os.listdir(source_folder):
if any(gene_name in filename for gene_name in gene_names):
# Copy the file to the destination folder
shutil.copy(os.path.join(source_folder, filename), destination_folder)
print('Files copied successfully.')
Step 3: Simulation#
1. Figure 2A simulation (2 trait)#
[real_simulation_2trait]
# ======= Define parameters =======
parameter: genofile = paths
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
parameter: n_trait = 2
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: share_pattern = "all"
parameter: independent = False
parameter: container = ""
# ======= Workflow input/output =======
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'
# ======= Main R block =======
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container
# --- Load necessary libraries ---
library("MASS")
library("plink2R")
library("dplyr")
library("readr")
library("tidyverse")
# install simulation package
# devtools::install_github("StatFunGen/simxQTL", build_vignettes = FALSE)
# BiocManager::install("StatFunGen/pecotmr")
library("pecotmr")
library("simxQTL")
# --- Read genotype data from PLINK files ---
simu_file = ${_input:r}
geno <- read_plink(${_input:nr})
# --- Extract gene name from file path (e.g., "ENSG00000123456") ---
gene_name = str_extract(simu_file, "ENSG[0-9]+")
# --- Load gene-TSS mapping file ---
gene_tss_map = read_tsv("./Data/gene_cis_TADB_mapper.tsv")
# --- Find TSS (transcription start site) of the gene ---
TSS_pos <- gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
# --- Keep SNPs within ±1.5Mb window around TSS ---
keep_index <- which(geno$bim$V4 > (TSS_pos - 1500000) | geno$bim$V4 < (TSS_pos + 1500000))
geno$bed <- geno$bed[, keep_index]
# --- Filter SNPs by missing rate and minor allele frequency (MAF) ---
imiss <- 0.1 # Allow up to 10% missingness
maf <- 0.05 # Require MAF > 5%
Xmat <- filter_X(geno$bed, imiss, maf)
# --- Set simulation parameters ---
ncausal <- ${n_causal}
ntrait <- ${n_trait}
indep = ${"TRUE" if independent else "FALSE"}
# --- Select causal variants ---
if (indep) {
LD_vars = 1 # Initialize LD check
if (ncausal == 1) {
# Only one causal variant needed
vars = sample(1:ncol(Xmat), size = ncausal)
} else {
# Ensure selected causal variants are approximately independent (LD < 0.3)
while (length(LD_vars != 0)) {
vars = sample(1:ncol(Xmat), size = ncausal)
cor_mat = cor(Xmat[, vars])
LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
}
}
} else {
LD_vars = 1 # Initialize LD check
if (ncausal == 1) {
# Only one causal variant needed
vars = sample(1:ncol(Xmat), size = ncausal)
} else {
# Ensure selected variants are not perfectly correlated
while (length(LD_vars != 0)) {
vars = sample(1:ncol(Xmat), size = ncausal)
cor_mat = cor(Xmat[, vars])
LD_vars = which(colSums(abs(cor_mat) == 1) > 1)
}
}
}
# --- Initialize effect size matrix ---
B <- matrix(0, nrow = ncol(Xmat), ncol = ntrait)
# --- Simulate effect sizes for each trait ---
for (i in 1:2) {
causal_index <- vars
beta <- sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
B[, i] <- beta
}
# --- Save list of causal variants per trait ---
variant <- list()
for (i in 1:ncol(B)) {
variant[[i]] <- which(B[, i] != 0)
}
# --- Simulate multi-trait phenotypes ---
phenotype <- sim_multi_traits(
G = Xmat,
B = B,
h2g = ${h2g},
is_h2g_total = ${"TRUE" if total_h2g else "FALSE"}
)
phenotype <- phenotype$P # Extract phenotype matrix
# --- Bundle output data into a list ---
data <- list()
data[["X"]] <- Xmat # Genotype matrix
data[["Y"]] <- phenotype # Phenotype matrix
data[["variant"]] <- variant # Causal variants
# --- Save simulation output as .rds ---
saveRDS(data, ${_output:r})
2. Figure 2A simulation (5 trait)#
[real_simulation_5trait]
# ======= Define parameters =======
parameter: genofile = paths
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
parameter: n_trait = 5
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: share_pattern = "all"
parameter: independent = False
parameter: container = ""
# ======= Workflow input/output =======
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'
# ======= Main R block =======
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container
# --- Load necessary libraries ---
library("MASS")
library("plink2R")
library("dplyr")
library("readr")
library("tidyverse")
# install simulation package
# devtools::install_github("StatFunGen/simxQTL", build_vignettes = FALSE)
# BiocManager::install("StatFunGen/pecotmr")
library("pecotmr")
library("simxQTL")
# --- Read genotype data ---
simu_file = ${_input:r}
geno <- read_plink(${_input:nr})
# --- Extract gene name ---
gene_name = str_extract(simu_file, "ENSG[0-9]+")
gene_tss_map = read_tsv("./Data/gene_cis_TADB_mapper.tsv")
# --- Filter SNPs within ±1.5Mb of TSS ---
TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)
geno$bed <- geno$bed[, keep_index]
# --- Filter SNPs by missing rate and MAF ---
imiss <- 0.1
maf <- 0.05
Xmat <- filter_X(geno$bed, imiss, maf)
# --- Set simulation parameters ---
ncausal <- ${n_causal}
ntrait <- ${n_trait}
indep = ${"TRUE" if independent else "FALSE"}
# --- Select causal variants ---
if (indep) {
LD_vars = 1 # Initialize LD check
if (ncausal == 1) {
# Only one causal variant needed
vars = sample(1:ncol(Xmat), size = ncausal)
} else {
# Ensure selected variants are approximately independent (LD < 0.3)
while (length(LD_vars != 0)) {
vars = sample(1:ncol(Xmat), size = ncausal)
cor_mat = cor(Xmat[, vars])
LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
}
}
} else {
LD_vars = 1 # Initialize LD check
if (ncausal == 1) {
# Only one causal variant needed
vars = sample(1:ncol(Xmat), size = ncausal)
} else {
# Ensure selected variants are not perfectly correlated
while (length(LD_vars != 0)) {
vars = sample(1:ncol(Xmat), size = ncausal)
cor_mat = cor(Xmat[, vars])
LD_vars = which(colSums(abs(cor_mat) == 1) > 1)
}
}
}
# --- Load predefined proportions for assigning variant sharing across traits ---
prop <- readRDS("./Data/trait6_prop.rds")[, 1:4]
proportions <- prop[ncausal, ]
# --- Sample variant names based on trait-sharing proportions ---
sampled_name <- as.numeric(sample(names(proportions), size = ncausal, prob = proportions, replace = TRUE))
# --- Initialize phenotype list and effect size matrix ---
phenotype <- list()
B <- matrix(0, nrow = ncol(Xmat), ncol = ntrait)
# --- Configuration matrix for causal variants across traits ---
config <- matrix(0, nrow = ntrait, ncol = ncausal)
# --- Assign causal variants to traits ---
for (i in 1:ncausal) {
coloc_trait <- sample(1:ntrait, sampled_name[i])
config[coloc_trait, i] <- 1
}
# --- Simulate effect sizes and phenotypes for each trait ---
for (i in 1:nrow(config)) {
beta <- B[, i, drop = FALSE]
index <- which(config[i, ] == 1)
if (length(index) > 0) {
causal_index <- vars[index]
beta <- sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
B[, i] <- beta
pheno_single <- sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
phenotype[[i]] <- pheno_single$P
} else {
# No causal variants for this trait: simulate under null
pheno_single <- sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
phenotype[[i]] <- pheno_single$P
}
}
# --- Identify causal variants for each trait ---
variant <- list()
for (i in 1:ncol(B)) {
variant[[i]] <- which(B[, i] != 0)
}
# --- Combine phenotype matrices into one dataframe ---
X <- Xmat
Y <- bind_cols(phenotype)
colnames(Y) <- paste0("Trait", 1:ntrait)
# --- Bundle and save simulation output ---
data <- list()
data[["X"]] <- Xmat
data[["Y"]] <- Y
data[["variant"]] <- variant
saveRDS(data, ${_output:r})
3. Figure 2A simulation (10 trait)#
[real_simulation_10trait]
# ======= Define parameters =======
parameter: genofile = paths
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
parameter: n_trait = 10
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: independent = False
parameter: share_pattern = "all"
parameter: container = ""
# ======= Workflow input/output =======
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'
# ======= Main R block =======
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container
# --- Load necessary libraries ---
library("MASS")
library("plink2R")
library("dplyr")
library("readr")
library("tidyverse")
# install simulation package
# devtools::install_github("StatFunGen/simxQTL", build_vignettes = FALSE)
# BiocManager::install("StatFunGen/pecotmr")
library("pecotmr")
library("simxQTL")
# --- Read genotype data ---
simu_file = ${_input:r}
geno <- read_plink(${_input:nr})
# --- Extract gene name ---
gene_name = str_extract(simu_file, "ENSG[0-9]+")
gene_tss_map = read_tsv("./Data/gene_cis_TADB_mapper.tsv")
# --- Filter SNPs within ±1.5Mb of TSS ---
TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)
geno$bed = geno$bed[, keep_index]
# --- Filter SNPs by missing rate and MAF ---
imiss = 0.1
maf = 0.05
Xmat = filter_X(geno$bed, imiss, maf)
# --- Set simulation parameters ---
ncausal = ${n_causal}
ntrait = ${n_trait}
indep = ${"TRUE" if independent else "FALSE"}
# --- Select causal variants ---
if (indep) {
LD_vars = 1 # Initialize LD check
if (ncausal == 1) {
# Only one causal variant needed
vars = sample(1:ncol(Xmat), size = ncausal)
} else {
# Ensure selected variants are approximately independent (LD < 0.3)
while (length(LD_vars != 0)) {
vars = sample(1:ncol(Xmat), size = ncausal)
cor_mat = cor(Xmat[, vars])
LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
}
}
} else {
LD_vars = 1 # Initialize LD check
if (ncausal == 1) {
# Only one causal variant needed
vars = sample(1:ncol(Xmat), size = ncausal)
} else {
# Ensure selected variants are not perfectly correlated
while (length(LD_vars != 0)) {
vars = sample(1:ncol(Xmat), size = ncausal)
cor_mat = cor(Xmat[, vars])
LD_vars = which(colSums(abs(cor_mat) == 1) > 1)
}
}
}
# --- Load predefined trait-sharing proportions ---
prop = readRDS("./Data/trait10_prop.rds")
proportions = prop[ncausal,]
# --- Sample number of traits shared by each variant ---
sampled_name <- as.numeric(sample(names(proportions), size = ncausal, prob = proportions, replace = TRUE))
# --- Initialize phenotype list and effect size matrix ---
phenotype = list()
B = matrix(0, nrow = ncol(Xmat), ncol = ntrait)
# --- Configuration matrix for variant sharing across traits ---
config = matrix(0, nrow = ntrait, ncol = ncausal)
# --- Assign causal variants to traits ---
for (i in 1:ncausal) {
coloc_trait = sample(c(1:ntrait), sampled_name[i])
config[coloc_trait, i] = 1
}
# --- Simulate effect sizes and phenotypes ---
for (i in 1:nrow(config)) {
beta = B[, i, drop = FALSE]
index = which(config[i,] == 1)
if (length(index) > 0) {
causal_index = vars[index]
beta = sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
B[, i] = beta
pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
phenotype[[i]] = pheno_single$P
} else {
# No causal variant for this trait: simulate under null
pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
phenotype[[i]] = pheno_single$P
}
}
# --- Identify causal variants for each trait ---
variant = list()
for (i in 1:ncol(B)) {
variant[[i]] = which(B[, i] != 0)
}
# --- Combine phenotype matrices ---
X = Xmat
Y = bind_cols(phenotype)
colnames(Y) = paste0("Trait", c(1:ntrait))
# --- Bundle and save output ---
data = list()
data[["X"]] = Xmat
data[["Y"]] = Y
data[["variant"]] = variant
saveRDS(data, ${_output:r})
4. Figure 2A simulation (20 trait)#
[real_simulation_20trait]
# ======= Define parameters =======
parameter: genofile = paths
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
parameter: n_trait = 20
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: independent = False
parameter: share_pattern = "all"
parameter: container = ""
# ======= Workflow input/output =======
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'
# ======= Main R block =======
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container
# --- Load necessary libraries ---
library("MASS")
library("plink2R")
library("dplyr")
library("readr")
library("tidyverse")
# install simulation package
# devtools::install_github("StatFunGen/simxQTL", build_vignettes = FALSE)
# BiocManager::install("StatFunGen/pecotmr")
library("pecotmr")
library("simxQTL")
# --- Read genotype data ---
simu_file = ${_input:r}
geno <- read_plink(${_input:nr})
# --- Extract gene name ---
gene_name = str_extract(simu_file, "ENSG[0-9]+")
gene_tss_map = read_tsv("./Data/gene_cis_TADB_mapper.tsv")
# --- Filter SNPs within ±1.5Mb of TSS ---
TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)
geno$bed = geno$bed[, keep_index]
# --- Filter SNPs by missing rate and MAF ---
imiss = 0.1
maf = 0.05
Xmat = filter_X(geno$bed, imiss, maf)
# --- Set simulation parameters ---
ncausal = ${n_causal}
ntrait = ${n_trait}
indep = ${"TRUE" if independent else "FALSE"}
# --- Select causal variants ---
if (indep) {
LD_vars = 1 # Initialize LD check
if (ncausal == 1) {
# Only one causal variant needed
vars = sample(1:ncol(Xmat), size = ncausal)
} else {
# Ensure selected variants are approximately independent (LD < 0.3)
while (length(LD_vars != 0)) {
vars = sample(1:ncol(Xmat), size = ncausal)
cor_mat = cor(Xmat[, vars])
LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
}
}
} else {
LD_vars = 1 # Initialize LD check
if (ncausal == 1) {
# Only one causal variant needed
vars = sample(1:ncol(Xmat), size = ncausal)
} else {
# Ensure selected variants are not perfectly correlated
while (length(LD_vars != 0)) {
vars = sample(1:ncol(Xmat), size = ncausal)
cor_mat = cor(Xmat[, vars])
LD_vars = which(colSums(abs(cor_mat) == 1) > 1)
}
}
}
# --- Load predefined trait-sharing proportions ---
prop = readRDS("./Data/trait17_prop.rds")
proportions = prop[ncausal,]
# --- Sample number of traits shared by each variant ---
sampled_name <- as.numeric(sample(names(proportions), size = ncausal, prob = proportions, replace = TRUE))
# --- Initialize phenotype list and effect size matrix ---
phenotype = list()
B = matrix(0, nrow = ncol(Xmat), ncol = ntrait)
# --- Configuration matrix for variant sharing across traits ---
config = matrix(0, nrow = ntrait, ncol = ncausal)
# --- Assign causal variants to traits ---
for (i in 1:ncausal) {
coloc_trait = sample(c(1:ntrait), sampled_name[i])
config[coloc_trait, i] = 1
}
# --- Simulate effect sizes and phenotypes ---
for (i in 1:nrow(config)) {
beta = B[, i, drop = FALSE]
index = which(config[i, ] == 1)
if (length(index) > 0) {
causal_index = vars[index]
beta = sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
B[, i] = beta
pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
phenotype[[i]] = pheno_single$P
} else {
# No causal variant for this trait: simulate under null
pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
phenotype[[i]] = pheno_single$P
}
}
# --- Identify causal variants for each trait ---
variant = list()
for (i in 1:ncol(B)) {
variant[[i]] = which(B[, i] != 0)
}
# --- Combine phenotype matrices ---
X = Xmat
Y = bind_cols(phenotype)
colnames(Y) = paste0("Trait", c(1:ntrait))
# --- Bundle and save output ---
data = list()
data[["X"]] = Xmat
data[["Y"]] = Y
data[["variant"]] = variant
saveRDS(data, ${_output:r})
5. Job submission bash file#
2 trait#
work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_2trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code
# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err
source ~/mamba_activate.sh
sos run ~/colocboost-paper/Simulation_Studies/1_Phenotype_Simulation.ipynb JOB \
--genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
--n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
--cwd /home/hs3393/cb_Mar/simulation_data/
EOF
for ncausal in 1 2 3; do
base_sh="base_script"
output_script="${job}_causal_${ncausal}.sh"
cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"| sed "s|CAUSAL|${ncausal}|g" > ${output_script}
sbatch ${output_script}
done
5 trait#
work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_5trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code
# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err
source ~/mamba_activate.sh
sos run ~/colocboost-paper/Simulation_Studies/1_Phenotype_Simulation.ipynb JOB \
--genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
--n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
--cwd /home/hs3393/cb_Mar/simulation_data/
EOF
for ncausal in 1 2 3 4; do
base_sh="base_script"
output_script="${job}_causal_${ncausal}.sh"
cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"| sed "s|CAUSAL|${ncausal}|g" > ${output_script}
sbatch ${output_script}
done
10 trait#
work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_10trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code
# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err
source ~/mamba_activate.sh
sos run ~/colocboost-paper/Simulation_Studies/1_Phenotype_Simulation.ipynb JOB \
--genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
--n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
--cwd /home/hs3393/cb_Mar/simulation_data/
EOF
for ncausal in 1 2 3 4 5; do
base_sh="base_script"
output_script="${job}_causal_${ncausal}.sh"
cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"| sed "s|CAUSAL|${ncausal}|g" > ${output_script}
sbatch ${output_script}
done
20 trait#
work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_20trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code
# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err
source ~/mamba_activate.sh
sos run ~/colocboost-paper/Simulation_Studies/1_Phenotype_Simulation.ipynb JOB \
--genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
--n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
--cwd /home/hs3393/cb_Mar/simulation_data/
EOF
for ncausal in 1 2 3 4 5; do
base_sh="base_script"
output_script="${job}_causal_${ncausal}.sh"
cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"| sed "s|CAUSAL|${ncausal}|g" > ${output_script}
#sbatch ${output_script}
done
7. Supplementary job submission#
work_dir="/home/hs3393/cb_Mar/simulation_data/all_share_simulation/"
job="all_share_simulation"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code
# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err
source ~/mamba_activate.sh
sos run ~/colocboost-paper/Simulation_Studies/1_Phenotype_Simulation.ipynb JOB \
--genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
--n_causal CAUSAL --mem 30G --h2g 0.05 --independent --n_trait 20 \
--cwd /home/hs3393/cb_Mar/simulation_data/JOB/result
EOF
for ncausal in 1 2 3 4; do
base_sh="base_script"
output_script="${job}_causal_${ncausal}.sh"
cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"| sed "s|CAUSAL|${ncausal}|g" > ${output_script}
sbatch ${output_script}
done