Phenotype data simulation#
Goal#
Here we use two strategies to simulate phenotype data (Y matrix) based on individual-level genotype data (X matrix).
Input#
genofile
: PLINK files of real genotypes, /mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/plink_by_gene/extended_cis_before_winsorize_plink_files/*.bim
The other parameters can be found in the simxQTL repo: https://github.com/StatFunGen/simxQTL.
Output#
An RDS file containing a genotype matrix X (dimension m * n; m: number of samples, n: number of SNPs) and a phenotype (trait) matrix Y (dimension m * a; m: number of samples, a: number of simulated traits).
Example output:
result = readRDS("/home/hs3393/cb_Mar/simulation_data/real_simulation_5trait/sample_999_real_simulation_3_ncausal_5_trait.rds")
dim(result$X)  # 1153 10534
dim(result$Y)  # 1153 5
str(result$variant)
List of 5
$ : int [1:3] 196 5616 6566
$ : int [1:2] 5616 6566
$ : int [1:2] 196 6566
$ : int [1:2] 196 5616
$ : int [1:2] 196 6566
In this region, we simulated 3 causal variants (indices 196, 5616, and 6566). Each causal variant is randomly assigned to a subset of the 5 traits, so the traits share some, but not necessarily all, causal variants.
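For a quick check of how the causal variants are shared across traits, the variant list can be turned into a variant-by-trait indicator matrix; an illustrative snippet using the result object loaded above:
causal_ids <- sort(unique(unlist(result$variant)))
indicator <- sapply(result$variant, function(v) as.integer(causal_ids %in% v))
rownames(indicator) <- causal_ids
colnames(indicator) <- paste0("Trait", seq_along(result$variant))
indicator  # rows: causal variant indices; columns: traits; 1 = causal for that trait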
Simulation Strategy#
Total heritability (\(\phi_{total}\))#
Effect sizes are all set to 1. The exact value does not matter much, since \(\phi_{total}\) ultimately controls the variance. In this setting, all causal SNPs share the same effect size and together explain a fraction \(\phi_{total}\) (e.g. 0.5) of the total variance. To get Y, we assume a multivariate Gaussian distribution \(\textbf{Y} \sim N(\textbf{X} \beta, \sigma^2)\), where \(\sigma^2\) can be derived from the equations below.
\(\phi_{total} = \dfrac{var(X \beta)}{\sigma^2 + var(X \beta)}\)
\(\sigma^2 = \frac{var(X \beta)(1-\phi_{total})}{\phi_{total}}\)
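As a minimal illustration of this step with a toy genotype matrix (hypothetical values; the actual pipeline uses the simxQTL helper functions shown below), \(\sigma^2\) and Y could be generated as:
set.seed(1)
n <- 500                                   # samples
X <- matrix(rbinom(n * 10, 2, 0.3), n)     # toy genotype matrix with 10 SNPs
beta <- rep(1, 10)                         # all effect sizes set to 1
phi_total <- 0.5                           # total heritability
g <- as.vector(X %*% beta)
sigma2 <- var(g) * (1 - phi_total) / phi_total
Y <- g + rnorm(n, sd = sqrt(sigma2))       # Y ~ N(X beta, sigma^2)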
SNP level heritability#
Assume there are \(a\) causal variants in total. We initialize \(\beta_1 = 1\).
For each causal variant we require the same per-SNP heritability:
\(\frac{Var(X_1 \beta_1)}{Var(Y)} = \frac{Var(X_2 \beta_2)}{Var(Y)} = ... = \frac{Var(X_a \beta_a)}{Var(Y)} = \phi_{SNP}\)
Equivalently, \(\beta_1^2 Var(X_1) = \beta_2^2 Var(X_2) = ... = \beta_a^2 Var(X_a)\), which gives \(\beta_2 = \sqrt{\frac{\beta_1^2 Var(X_1)}{Var(X_2)}}\), …, \(\beta_a = \sqrt{\frac{\beta_1^2 Var(X_1)}{Var(X_a)}}\).
Then \(\frac{Var(X_1 \beta_1)}{Var(Y)} = \frac{Var(X_1 \beta_1)}{Var(X \beta) + \sigma^2} = \phi_{SNP}\), so \(\sigma^2 = \frac{Var(X_1 \beta_1)}{\phi_{SNP}} - Var(X \beta)\).
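A similar sketch for the per-SNP heritability setting, scaling each causal effect so that every causal SNP explains the same fraction \(\phi_{SNP}\) of the trait variance (again toy values, not the simxQTL implementation):
set.seed(2)
X <- matrix(rbinom(500 * 10, 2, 0.3), 500)   # toy genotype matrix with 10 SNPs
phi_snp <- 0.05                              # per-SNP heritability
causal <- c(1, 4, 7)                         # hypothetical causal columns
beta <- numeric(ncol(X))
beta[causal[1]] <- 1                         # initialize beta_1 = 1
# Scale the remaining effects so beta_j^2 * Var(X_j) is constant across causal SNPs
beta[causal[-1]] <- sqrt(var(X[, causal[1]]) / apply(X[, causal[-1], drop = FALSE], 2, var))
g <- as.vector(X %*% beta)
sigma2 <- var(X[, causal[1]] * beta[causal[1]]) / phi_snp - var(g)   # must stay positive
Y <- g + rnorm(nrow(X), sd = sqrt(sigma2))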
Simulation code#
Step 1: Randomly select regions#
Select gene regions to simulate: randomly pick one gene from each of the 1,329 TADB blocks.
Because some genes overlap more than one TADB block, the random selection yields 1,287 unique genes.
library(tidyverse)

mapper = read_tsv("/home/hs3393/fungen-xqtl-analysis/resource//gene_cis_TADB_mapper.tsv")

df_selected <- mapper %>%
    filter(`#chr` != "chrX") %>%
    group_by(TADB_index) %>%
    slice_sample(n = 1) %>%
    ungroup()

genes_selected = df_selected %>% pull(gene_id)
genes_selected %>% length()

#genes_selected %>% save_tsv("/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/selected_gene_id.tsv")

genes_selected = read_tsv("/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/selected_gene_id.tsv", col_names = "gene")$gene
mapper %>% filter(gene_id %in% genes_selected) %>% pull(TADB_index) %>% unique() %>% length()
genes_selected %>% unique() %>% length()
Step 2: Copy the selected genotype files to another folder#
import os
import shutil

# Path to the file with the selected gene IDs
gene_file_path = '/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/selected_gene_id.tsv'
source_folder = '/mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/plink_by_gene/extended_cis_before_winsorize_plink_files'
destination_folder = '/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype'

# Read the selected gene IDs
with open(gene_file_path, 'r') as file:
    gene_names = [line.strip() for line in file]

# Create the destination folder if it does not exist
if not os.path.exists(destination_folder):
    os.makedirs(destination_folder)

# Copy every genotype file whose name contains a selected gene ID
for filename in os.listdir(source_folder):
    if any(gene_name in filename for gene_name in gene_names):
        shutil.copy(os.path.join(source_folder, filename), destination_folder)

print('Files copied successfully.')
Step 3: Simulation#
1. Figure 2A simulation (2 trait)#
[real_simulation_2trait]
# genofile: genotype files in PLINK format; one file set per gene (TAD) region
parameter: genofile = paths
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
# number of traits (phenotypes) to simulate
parameter: n_trait = 2
# number of causal variants
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: share_pattern = "all"
# if True, causal variants are resampled until they are in low LD with each other
parameter: independent = False
parameter: container = ""
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container
library("MASS")
library("plink2R")
library("dplyr")
library("readr")
library("tidyverse")
# Source helper functions to read the genotype matrix and impute missing data
source("~/cloud_colocalization/simulation_code/simulate_linreg.R")
source("~/cloud_colocalization/simulation_code/misc.R")
# read the plink file
simu_file = ${_input:r}
geno <- read_plink(${_input:nr})
gene_name = str_extract(simu_file, "ENSG[0-9]+")
gene_tss_map = read_tsv("/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv")
# Extract the TSS position for the given gene
TSS_pos <- gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
# Define the genomic region around the TSS (+/- 1.5Mb) and filter SNPs
keep_index <- which(geno$bim$V4 > (TSS_pos - 1500000) & geno$bim$V4 < (TSS_pos + 1500000))
geno$bed <- geno$bed[, keep_index]
# Set thresholds for filtering SNPs
imiss <- 0.1 # Maximum missing rate allowed (10%)
maf <- 0.05 # Minimum minor allele frequency (5%)
# Filter SNPs based on missing rate and MAF
Xmat <- filter_X(geno$bed, imiss, maf)
# Define the number of causal variants and traits
ncausal <- ${n_causal}
ntrait <- ${n_trait}
indep = ${"TRUE" if independent else "FALSE"}
# Select causal variants randomly
# If `independent` is TRUE, resample until all causal variants are in low LD (pairwise |r| < 0.3)
if (indep) {
    LD_vars = 1 # Initialize LD_vars
    if (ncausal == 1) {
        # If only one causal variant, just sample it
        vars = sample(1:ncol(Xmat), size = ncausal)
    } else {
        # Repeat sampling until the selected variants are quasi-independent
        while (length(LD_vars) != 0) {
            vars = sample(1:ncol(Xmat), size = ncausal)
            cor_mat = cor(Xmat[, vars])
            LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
        }
    }
} else {
    vars = sample(1:ncol(Xmat), size = ncausal)
}
# Initialize effect size matrix (B) with zeros
B <- matrix(0, nrow = ncol(Xmat), ncol = ntrait)
# Simulate effect sizes for each trait (all traits share the same causal variants)
for (i in 1:ntrait) {
    causal_index <- vars
    beta <- sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
    B[, i] <- beta
}
# Save the causal variant indices per trait
variant <- list()
for (i in 1:ncol(B)) {
    variant[[i]] <- which(B[, i] != 0)
}
# Simulate multi-trait phenotypes
phenotype <- sim_multi_traits(
    G = Xmat,
    B = B,
    h2g = ${h2g},
    is_h2g_total = ${"TRUE" if total_h2g else "FALSE"}
)
phenotype <- phenotype$P
data <- list()
data[["X"]] <- Xmat
data[["Y"]] <- phenotype
data[["variant"]] <- variant
saveRDS(data, ${_output:r})
2. Figure 2A simulation (5 trait)#
[real_simulation_5trait]
# genofile: genotype files in PLINK format; one file set per gene (TAD) region
parameter: genofile = paths
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
# number of traits (phenotypes) to simulate
parameter: n_trait = 5
# number of causal variants
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: share_pattern = "all"
# if True, causal variants are resampled until they are in low LD with each other
parameter: independent = False
parameter: container = ""
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container
library("MASS")
library("plink2R")
library("dplyr")
library("readr")
library("tidyverse")
# Source helper functions to read the genotype matrix and impute missing data
source("~/cloud_colocalization/simulation_code/simulate_linreg.R")
source("~/cloud_colocalization/simulation_code/misc.R")
# read the plink file
simu_file = ${_input:r}
geno <- read_plink(${_input:nr})
gene_name = str_extract(simu_file, "ENSG[0-9]+")
gene_tss_map = read_tsv("/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv")
# Filter variants by distance to the TSS (+/- 1.5Mb)
TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
keep_index = which(geno$bim$V4 > TSS_pos - 1500000 & geno$bim$V4 < TSS_pos + 1500000)
geno$bed <- geno$bed[, keep_index]
# Filter out columns with missing rate > 0.1
imiss <- 0.1
# Filter out columns with MAF < 0.05
maf <- 0.05
# Apply filtering
Xmat <- filter_X(geno$bed, imiss, maf)
# Define the number of causal variants and traits
ncausal <- ${n_causal}
ntrait <- ${n_trait}
indep = ${"TRUE" if independent else "FALSE"}
# If `independent` is TRUE, resample until all causal variants are in low LD (pairwise |r| < 0.3)
if (indep) {
    LD_vars = 1 # Initialize LD_vars
    if (ncausal == 1) {
        # If only one causal variant, just sample it
        vars = sample(1:ncol(Xmat), size = ncausal)
    } else {
        # Repeat sampling until the selected variants are quasi-independent
        while (length(LD_vars) != 0) {
            vars = sample(1:ncol(Xmat), size = ncausal)
            cor_mat = cor(Xmat[, vars])
            LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
        }
    }
} else {
    vars = sample(1:ncol(Xmat), size = ncausal)
}
# Load predefined proportions for causal variant sampling
prop <- readRDS("/home/hs3393/cloud_colocalization/simulation_code/trait6_prop.rds")[, 1:4]
proportions <- prop[ncausal, ]
# For each causal variant, draw how many traits it will be causal for (weights from the proportion file)
sampled_name <- as.numeric(sample(names(proportions), size = ncausal, prob = proportions, replace = TRUE))
# Initialize phenotype list and effect size matrix
phenotype <- list()
B <- matrix(0, nrow = ncol(Xmat), ncol = ntrait)
# Configuration matrix for trait-variant relationships
config <- matrix(0, nrow = ntrait, ncol = ncausal)
# Assign each causal variant to a randomly chosen subset of traits
for (i in 1:ncausal) {
    coloc_trait <- sample(1:ntrait, sampled_name[i])
    config[coloc_trait, i] <- 1
}
# Simulate effect sizes and phenotypes
for (i in 1:nrow(config)) {
    beta <- B[, i, drop = FALSE]
    index <- which(config[i, ] == 1)
    if (length(index) > 0) {
        causal_index <- vars[index]
        beta <- sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
        B[, i] <- beta
        pheno_single <- sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
        phenotype[[i]] <- pheno_single$P
    } else {
        # No causal variant assigned to this trait; beta stays zero
        pheno_single <- sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
        phenotype[[i]] <- pheno_single$P
    }
}
# Identify causal variants for each trait
variant <- list()
for (i in 1:ncol(B)) {
    variant[[i]] <- which(B[, i] != 0)
}
# Combine phenotype data
X <- Xmat
Y <- bind_cols(phenotype)
colnames(Y) <- paste0("Trait", 1:ntrait)
# Store results in a list
data <- list()
data[["X"]] <- Xmat
data[["Y"]] <- Y
data[["variant"]] <- variant
# Save results
saveRDS(data, ${_output:r})
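In the 5-, 10- and 20-trait workflows, sharing of causal variants across traits is encoded in a trait-by-variant configuration matrix: for each causal variant, the number of traits it affects is drawn with weights taken from the predefined proportion files (trait*_prop.rds), and that many traits are then sampled at random. A minimal standalone sketch of this assignment step, with made-up proportions in place of the .rds file:
set.seed(3)
ntrait <- 5
ncausal <- 3
# Hypothetical sharing weights: probability that a causal variant affects 1, 2, 3 or 4 of the traits
proportions <- c("1" = 0.2, "2" = 0.3, "3" = 0.3, "4" = 0.2)
# For each causal variant, draw how many traits it is causal for
sampled_name <- as.numeric(sample(names(proportions), size = ncausal, prob = proportions, replace = TRUE))
# Build the trait-by-variant configuration matrix (1 = variant is causal for that trait)
config <- matrix(0, nrow = ntrait, ncol = ncausal)
for (i in 1:ncausal) {
    coloc_trait <- sample(1:ntrait, sampled_name[i])
    config[coloc_trait, i] <- 1
}
config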
3. Figure 2A simulation (10 trait)#
[real_simulation_10trait]
# genofile: genotype files in PLINK format; one file set per gene (TAD) region
parameter: genofile = paths
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
# number of traits (phenotypes) to simulate
parameter: n_trait = 10
# number of causal variants
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
# if True, causal variants are resampled until they are in low LD with each other
parameter: independent = False
parameter: share_pattern = "all"
parameter: container = ""
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container
library("MASS")
library("plink2R")
library("dplyr")
library("readr")
library("tidyverse")
# Source helper functions to read the genotype matrix and impute missing data
source("~/cloud_colocalization/simulation_code/simulate_linreg.R")
source("~/cloud_colocalization/simulation_code/misc.R")
# read the plink file
simu_file = ${_input:r}
geno <- read_plink(${_input:nr})
gene_name = str_extract(simu_file, "ENSG[0-9]+")
gene_tss_map = read_tsv("/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv")
# Filter variants by distance to the TSS (+/- 1.5Mb)
TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
keep_index = which(geno$bim$V4 > TSS_pos - 1500000 & geno$bim$V4 < TSS_pos + 1500000)
geno$bed = geno$bed[,keep_index]
# filter out columns with missing rate > 0.1
imiss = 0.1
# filter out columns with MAF < 0.05
maf = 0.05
Xmat = filter_X(geno$bed, imiss, maf)
# define the number of causal variants and traits
ncausal = ${n_causal}
ntrait = ${n_trait}
indep = ${"TRUE" if independent else "FALSE"}
# If `independent` is TRUE, resample until all causal variants are in low LD (pairwise |r| < 0.3)
if (indep) {
    LD_vars = 1 # Initialize LD_vars
    if (ncausal == 1) {
        # If only one causal variant, just sample it
        vars = sample(1:ncol(Xmat), size = ncausal)
    } else {
        # Repeat sampling until the selected variants are quasi-independent
        while (length(LD_vars) != 0) {
            vars = sample(1:ncol(Xmat), size = ncausal)
            cor_mat = cor(Xmat[, vars])
            LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
        }
    }
} else {
    vars = sample(1:ncol(Xmat), size = ncausal)
}
prop = readRDS("/home/hs3393/cloud_colocalization/simulation_code/trait10_prop.rds")
proportions = prop[ncausal,]
sampled_name <- as.numeric(sample(names(proportions), size = ncausal, prob = proportions, replace = TRUE))
phenotype = list()
B = matrix(0, nrow = ncol(Xmat), ncol = ntrait)
config = matrix(0, nrow = ntrait, ncol = ncausal)
# Assign each causal variant to a randomly chosen subset of traits
for (i in 1:ncausal) {
    coloc_trait = sample(1:ntrait, sampled_name[i])
    config[coloc_trait, i] = 1
}
# Simulate effect sizes and phenotypes
for (i in 1:nrow(config)) {
    beta = B[, i, drop = FALSE]
    index = which(config[i, ] == 1)
    if (length(index) > 0) {
        causal_index = vars[index]
        beta = sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
        B[, i] = beta
        pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
        phenotype[[i]] = pheno_single$P
    } else {
        # No causal variant assigned to this trait; beta stays zero
        pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
        phenotype[[i]] = pheno_single$P
    }
}
# Identify causal variants for each trait
variant = list()
for (i in 1:ncol(B)) {
    variant[[i]] = which(B[, i] != 0)
}
X = Xmat
Y = bind_cols(phenotype)
colnames(Y) = paste0("Trait", c(1:ntrait))
data = list()
data[["X"]] = Xmat
data[["Y"]] = Y
data[["variant"]] = variant
saveRDS(data, ${_output:r})
4. Figure 2A simulation (20 trait)#
[real_simulation_20trait]
# genofile: genotype files in PLINK format; one file set per gene (TAD) region
parameter: genofile = paths
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
# number of traits (phenotypes) to simulate
parameter: n_trait = 20
# number of causal variants
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
# if True, causal variants are resampled until they are in low LD with each other
parameter: independent = False
parameter: share_pattern = "all"
parameter: container = ""
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container
library("MASS")
library("plink2R")
library("dplyr")
library("readr")
library("tidyverse")
# Source helper functions to read the genotype matrix and impute missing data
source("~/cloud_colocalization/simulation_code/simulate_linreg.R")
source("~/cloud_colocalization/simulation_code/misc.R")
# read the plink file
simu_file = ${_input:r}
geno <- read_plink(${_input:nr})
gene_name = str_extract(simu_file, "ENSG[0-9]+")
gene_tss_map = read_tsv("/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv")
# Filter variants by distance to the TSS (+/- 1.5Mb)
TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
keep_index = which(geno$bim$V4 > TSS_pos - 1500000 & geno$bim$V4 < TSS_pos + 1500000)
geno$bed = geno$bed[,keep_index]
# filter out columns with missing rate > 0.1
imiss = 0.1
# filter out columns with MAF < 0.05
maf = 0.05
Xmat = filter_X(geno$bed, imiss, maf)
# define the number of causal variants and traits
ncausal = ${n_causal}
ntrait = ${n_trait}
indep = ${"TRUE" if independent else "FALSE"}
# If `independent` is TRUE, resample until all causal variants are in low LD (pairwise |r| < 0.3)
if (indep) {
    LD_vars = 1 # Initialize LD_vars
    if (ncausal == 1) {
        # If only one causal variant, just sample it
        vars = sample(1:ncol(Xmat), size = ncausal)
    } else {
        # Repeat sampling until the selected variants are quasi-independent
        while (length(LD_vars) != 0) {
            vars = sample(1:ncol(Xmat), size = ncausal)
            cor_mat = cor(Xmat[, vars])
            LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
        }
    }
} else {
    vars = sample(1:ncol(Xmat), size = ncausal)
}
prop = readRDS("/home/hs3393/cloud_colocalization/simulation_code/trait17_prop.rds")
proportions = prop[ncausal,]
sampled_name <- as.numeric(sample(names(proportions), size = ncausal, prob = proportions, replace = TRUE))
phenotype = list()
B = matrix(0, nrow = ncol(Xmat), ncol = ntrait)
config = matrix(0, nrow = ntrait, ncol = ncausal)
# Assign each causal variant to a randomly chosen subset of traits
for (i in 1:ncausal) {
    coloc_trait = sample(1:ntrait, sampled_name[i])
    config[coloc_trait, i] = 1
}
# Simulate effect sizes and phenotypes
for (i in 1:nrow(config)) {
    beta = B[, i, drop = FALSE]
    index = which(config[i, ] == 1)
    if (length(index) > 0) {
        causal_index = vars[index]
        beta = sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
        B[, i] = beta
        pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
        phenotype[[i]] = pheno_single$P
    } else {
        # No causal variant assigned to this trait; beta stays zero
        pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
        phenotype[[i]] = pheno_single$P
    }
}
# Identify causal variants for each trait
variant = list()
for (i in 1:ncol(B)) {
    variant[[i]] = which(B[, i] != 0)
}
X = Xmat
Y = bind_cols(phenotype)
colnames(Y) = paste0("Trait", c(1:ntrait))
data = list()
data[["X"]] = Xmat
data[["Y"]] = Y
data[["variant"]] = variant
saveRDS(data, ${_output:r})
5. Job submission bash scripts#
2 trait#
work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_2trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code
# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err
source ~/mamba_activate.sh
sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \
--genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
--n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
--cwd /home/hs3393/cb_Mar/simulation_data/
EOF
for ncausal in 1 2 3; do
base_sh="base_script"
output_script="${job}_causal_${ncausal}.sh"
cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"| sed "s|CAUSAL|${ncausal}|g" > ${output_script}
sbatch ${output_script}
done
5 trait#
work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_5trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code
# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err
source ~/mamba_activate.sh
sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \
--genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
--n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
--cwd /home/hs3393/cb_Mar/simulation_data/
EOF
for ncausal in 1 2 3 4; do
base_sh="base_script"
output_script="${job}_causal_${ncausal}.sh"
cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"| sed "s|CAUSAL|${ncausal}|g" > ${output_script}
sbatch ${output_script}
done
10 trait#
work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_10trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code
# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err
source ~/mamba_activate.sh
sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \
--genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
--n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
--cwd /home/hs3393/cb_Mar/simulation_data/
EOF
for ncausal in 1 2 3 4 5; do
base_sh="base_script"
output_script="${job}_causal_${ncausal}.sh"
cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"| sed "s|CAUSAL|${ncausal}|g" > ${output_script}
sbatch ${output_script}
done
20 trait#
work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_20trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code
# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err
source ~/mamba_activate.sh
sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \
--genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
--n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
--cwd /home/hs3393/cb_Mar/simulation_data/
EOF
for ncausal in 1 2 3 4 5; do
base_sh="base_script"
output_script="${job}_causal_${ncausal}.sh"
cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"| sed "s|CAUSAL|${ncausal}|g" > ${output_script}
sbatch ${output_script}
done
6. Supplementary job submission#
work_dir="/home/hs3393/cb_Mar/simulation_data/all_share_simulation/"
job="all_share_simulation"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code
# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err
source ~/mamba_activate.sh
sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \
--genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
--n_causal CAUSAL --mem 30G --h2g 0.05 --independent --n_trait 20 \
--cwd /home/hs3393/cb_Mar/simulation_data/JOB/result
EOF
for ncausal in 1 2 3 4; do
base_sh="base_script"
output_script="${job}_causal_${ncausal}.sh"
cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"| sed "s|CAUSAL|${ncausal}|g" > ${output_script}
sbatch ${output_script}
done