Phenotype data simulation

Phenotype data simulation#

Goal#

Here we use two strategies to simulate phenotype data (Y matrix) based on individual-level genotype data (X matrix).

Input#

genofile: plink file of real genotyope, /mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/plink_by_gene/extended_cis_before_winsorize_plink_files/*.bim

The other parameters can be found in simxQTL repo. https://github.com/StatFunGen/simxQTL.

Output#

An rds matrix, with genotype matrix X (dimension: m * n, m: number of sample, n: number of SNP ) and phenotype (trait) matrix (dimension: m * a, m : number of samples, a: number of simulated traits)

Example output:

result = readRDS("/home/hs3393/cb_Mar/simulation_data/real_simulation_5trait/sample_999_real_simulation_3_ncausal_5_trait.rds")

dim(result$X)

dim(result$Y)

1153
10534

1153
5

str(result$variant)

List of 5
 $ : int [1:3] 196 5616 6566
 $ : int [1:2] 5616 6566
 $ : int [1:2] 196 6566
 $ : int [1:2] 196 5616
 $ : int [1:2] 196 6566

In this region, we simulated 3 causal variants (196, 5616, 6566). Each causal variant is distributed by distribution in 5 traits.

Simulation Strategy#

Total heritability (\(\phi_{total}\))#

Effect size are all set to be 1. This value actually doesn’t matter so much because phi will eventually control the variance. Under this case, all SNPs share the same effect size, and they altogether contribute to explain \(\phi_{total}\) (eg.0.5) variance of the total variance. To get Y, we assume a multivariate gaussian distribution \(\textbf{Y} \sim N(\textbf{X} \beta, \sigma^2)\), and \(\sigma^2\) can be estimated by the equation below.

\(\phi_{total} = \dfrac{var(X \beta)}{\sigma^2 + var(X \beta)}\)

\(\sigma^2 = \frac{var(X \beta)(1-\phi_{total})}{\phi_{total}}\)

SNP level heritability#

Assume there are total number of \(a\) causal variants. We assign \(\beta_1 = 1\) as the initialize setting.
For each causal variant we have:

\(\frac{Var(X_1 \beta_1)}{Var(Y)} = \frac{Var(X_2 \beta_2)}{Var(Y)} = ... = \frac{Var(X_a \beta_a)}{Var(Y)} = \phi_{SNP}\)

In that way, we have: if \(\beta_1^2 Var(X_1) = \beta_2^2Var(X_2) = ... = \beta_a^2 Var(X_a)\). Then we have \(\beta_2 = \sqrt{\frac{\beta_1^2 Var(X_1)}{Var(X_2)}}\), …, \(\beta_a = \sqrt{\frac{\beta_1^2 Var(X_1)}{Var(X_a)}}\).
Then we will have \(\frac{Var(X_1 \beta_1)}{Var(Y)} = \frac{Var(X_1 \beta_1)}{Var(X \beta) + \sigma^2} = \phi_{SNP}\). Therefore, we have \(\sigma^2 = \frac{Var(X_1 \beta_1)}{\phi_{SNP}} - Var(X \beta)\)

Simulation code#

Step 1: Randomly select regions#

Select gene regions to simulate - randomly select one gene from each 1,329 different TADB blocks.

Some genes might exist in more than one TADB block. Total unique gene number after random selection: 1287.

library(tidyverse)

mapper = read_tsv("/home/hs3393/fungen-xqtl-analysis/resource//gene_cis_TADB_mapper.tsv")

df_selected <- mapper %>% filter(`#chr` != "chrX")  %>%
  group_by(TADB_index) %>%
  slice_sample(n = 1) %>%
  ungroup()

genes_selected = df_selected %>% pull(gene_id) 

genes_selected %>% length()

#genes_selected %>% save_tsv("/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/selected_gene_id.tsv")

genes_selected = read_tsv("/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/selected_gene_id.tsv", col_names = "gene")$gene

mapper %>% filter(gene_id %in% genes_selected)  %>% pull(TADB_index) %>% unique() %>% length()

Rows: 126063 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (5): #chr, gene_id, gene_name, TADB_index, TADB_id
dbl (6): gene_cis_start, gene_cis_end, TSS, end.x, TADB_start, TADB_end

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1329

Rows: 1329 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): gene

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1329

genes_selected %>% unique() %>% length()

1287

Step 2: Make a copy these selected genotypes to another folder#

import os
import shutil

# Path to the file with gene names
gene_file_path = '/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/selected_gene_id.tsv'

source_folder = '/mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/plink_by_gene/extended_cis_before_winsorize_plink_files'

destination_folder = '/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype'

with open(gene_file_path, 'r') as file:
    gene_names = [line.strip() for line in file]

if not os.path.exists(destination_folder):
    os.makedirs(destination_folder)

for filename in os.listdir(source_folder):
    if any(gene_name in filename for gene_name in gene_names):
        # Copy the file to the destination folder
        shutil.copy(os.path.join(source_folder, filename), destination_folder)

print('Files copied successfully.')

Step 3: Simulation#

1. Figure 2A simulation (2 trait)#

[real_simulation_2trait]
parameter: genofile = paths
# pheno_file: give genotype file (in plink)，we can read the gentype matrix. These files are separated by TADs.
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
# specify the number of causal variants
parameter: n_trait = 2
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: share_pattern = "all"
parameter: independent = False
# specify the number of traits (phenotypes)
parameter: container = ""
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'
R:  expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container 
    library("MASS")
    library("plink2R")
    library("dplyr")
    library("readr")
    library("tidyverse")
    # source some functions to read matrix and inpute the missing data
    source("~/cloud_colocalization/simulation_code/simulate_linreg.R")
    source("~/cloud_colocalization/simulation_code/misc.R")
    # read the plink file
    simu_file = ${_input:r}
    geno <- read_plink(${_input:nr})
    gene_name = str_extract(simu_file, "ENSG[0-9]+")
    gene_tss_map = read_tsv("/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv")

    # Extract the TSS position for the given gene
    TSS_pos <- gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]

    # Define the genomic region around the TSS (+/- 1.5Mb) and filter SNPs
    keep_index <- which(geno$bim$V4 > (TSS_pos - 1500000) | geno$bim$V4 < (TSS_pos + 1500000))
    geno$bed <- geno$bed[, keep_index]

    # Set thresholds for filtering SNPs
    imiss <- 0.1  # Maximum missing rate allowed (10%)
    maf <- 0.05   # Minimum minor allele frequency (5%)

    # Filter SNPs based on missing rate and MAF
    Xmat <- filter_X(geno$bed, imiss, maf)

    # Define the number of causal variants and traits
    ncausal <- ${n_causal}
    ntrait <- ${n_trait}
    indep = ${"TRUE" if independent else "FALSE"}

    # Select causal variants randomly
    # If we want to limit the LD between causal variants, un-annotate the code below to limit < 0.3 LD between variants   
    if (indep) {
        LD_vars = 1  # Initialize LD_vars

        if (ncausal == 1) {
            # If only one causal variant, just sample it
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Repeat sampling until selected variables are quasi independent
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)  
                cor_mat = cor(Xmat[, vars]) 
                LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
            }
        }
    } else {
        vars = sample(1:ncol(Xmat), size = ncausal)
    }


    # Initialize effect size matrix (B) with zeros
    B <- matrix(0, nrow = ncol(Xmat), ncol = ntrait)

    # Simulate effect sizes for the first two traits
    for (i in 1:2) {
        causal_index <- vars
        beta <- sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
        B[, i] <- beta
    }

    # save causal variants per trait
    variant <- list()
    for (i in 1:ncol(B)) {
        variant[[i]] <- which(B[, i] != 0)
    }

    # Simulate multi-trait phenotypes
    phenotype <- sim_multi_traits(
        G = Xmat,
        B = B,
        h2g = ${h2g},
        is_h2g_total = ${"TRUE" if total_h2g else "FALSE"}
    )
    phenotype <- phenotype$P

    data <- list()
    data[["X"]] <- Xmat
    data[["Y"]] <- phenotype
    data[["variant"]] <- variant

    saveRDS(data, ${_output:r})

2. Figure 2A simulation (5 trait)#

[real_simulation_5trait]
parameter: genofile = paths
# pheno_file: give genotype file (in plink)，we can read the gentype matrix. These files are separated by TADs.
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
# specify the number of causal variants
parameter: n_trait = 5
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: share_pattern = "all"
parameter: independent = False
# specify the number of traits (phenotypes)
parameter: container = ""
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'
R:  expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container 
    library("MASS")
    library("plink2R")
    library("dplyr")
    library("readr")
    library("tidyverse")
    # source some functions to read matrix and inpute the missing data
    source("~/cloud_colocalization/simulation_code/simulate_linreg.R")
    source("~/cloud_colocalization/simulation_code/misc.R")
    # read the plink file
    simu_file = ${_input:r}
    geno <- read_plink(${_input:nr})
    gene_name = str_extract(simu_file, "ENSG[0-9]+")
    gene_tss_map = read_tsv("/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv")
    # filter by distance with. TSS
    TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
    keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)
    geno$bed <- geno$bed[, keep_index]

    # Filter out columns with missing rate > 0.1
    imiss <- 0.1

    # Filter out columns with MAF < 0.05
    maf <- 0.05

    # Apply filtering
    Xmat <- filter_X(geno$bed, imiss, maf)

    # Only keep the first 4000 variants
    ncausal <- ${n_causal}
    ntrait <- ${n_trait}
    indep = ${"TRUE" if independent else "FALSE"}

    if (indep) {
        LD_vars = 1  # Initialize LD_vars

        if (ncausal == 1) {
            # If only one causal variant, just sample it
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Repeat sampling until selected variables are quasi independent
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)  
                cor_mat = cor(Xmat[, vars]) 
                LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
            }
        }
    } else {
        vars = sample(1:ncol(Xmat), size = ncausal)
    }
    # Load predefined proportions for causal variant sampling
    prop <- readRDS("/home/hs3393/cloud_colocalization/simulation_code/trait6_prop.rds")[, 1:4]
    proportions <- prop[ncausal, ]

    # Sample variant names based on predefined proportions
    sampled_name <- as.numeric(sample(names(proportions), size = ncausal, prob = proportions, replace = TRUE))

    # Initialize phenotype list and effect size matrix
    phenotype <- list()
    B <- matrix(0, nrow = ncol(Xmat), ncol = ntrait)

    # Configuration matrix for trait-variant relationships
    config <- matrix(0, nrow = ntrait, ncol = ncausal)

    # Assign causal variants to traits
    for (i in 1:ncausal) {
        coloc_trait <- sample(1:ntrait, sampled_name[i])
        config[coloc_trait, i] <- 1
    }

    # Simulate effect sizes and phenotypes
    for (i in 1:nrow(config)) {
        beta <- B[, i, drop = FALSE]
        index <- which(config[i, ] == 1)

        if (length(index) > 0) {
            causal_index <- vars[index]
            beta <- sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
            B[, i] <- beta
            pheno_single <- sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
            phenotype[[i]] <- pheno_single$P
        } else {
            pheno_single <- sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
            phenotype[[i]] <- pheno_single$P
        }
    }

    # Identify causal variants for each trait
    variant <- list()
    for (i in 1:ncol(B)) {
        variant[[i]] <- which(B[, i] != 0)
    }

    # Combine phenotype data
    X <- Xmat
    Y <- bind_cols(phenotype)
    colnames(Y) <- paste0("Trait", 1:ntrait)

    # Store results in a list
    data <- list()
    data[["X"]] <- Xmat
    data[["Y"]] <- Y
    data[["variant"]] <- variant

    # Save results
    saveRDS(data, ${_output:r})

3. Figure 2A simulation (10 trait)#

[real_simulation_10trait]
parameter: genofile = paths
# pheno_file: give genotype file (in plink)，we can read the gentype matrix. These files are separated by TADs.
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
# specify the number of causal variants
parameter: n_trait = 10
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: independent = False
parameter: share_pattern = "all"
# specify the number of traits (phenotypes)
parameter: container = ""
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'
R:  expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container 
    library("MASS")
    library("plink2R")
    library("dplyr")
    library("readr")
    library("tidyverse")
    # source some functions to read matrix and inpute the missing data
    source("~/cloud_colocalization/simulation_code/simulate_linreg.R")
    source("~/cloud_colocalization/simulation_code/misc.R")
    # read the plink file
    simu_file = ${_input:r}
    geno <- read_plink(${_input:nr})
    gene_name = str_extract(simu_file, "ENSG[0-9]+")
    gene_tss_map = read_tsv("/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv")
    # filter by distance with. TSS
    TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
    keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)
    geno$bed = geno$bed[,keep_index]
    # filter out columns with missing rate > 0.1
    imiss = 0.1
    # filter out columns with MAF < 0.05
    maf = 0.05
    Xmat = filter_X(geno$bed, imiss, maf)
    # only keep the first 4000 variants
    ncausal = ${n_causal}
    ntrait = ${n_trait}
    indep = ${"TRUE" if independent else "FALSE"}
    if (indep) {
        LD_vars = 1  # Initialize LD_vars

        if (ncausal == 1) {
            # If only one causal variant, just sample it
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Repeat sampling until selected variables are quasi independent
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)  
                cor_mat = cor(Xmat[, vars]) 
                LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
            }
        }
    } else {
        vars = sample(1:ncol(Xmat), size = ncausal)
    }
  
    prop = readRDS("/home/hs3393/cloud_colocalization/simulation_code/trait10_prop.rds")
    proportions = prop[ncausal,]
    sampled_name <- as.numeric(sample(names(proportions), size =  ncausal, prob = proportions, replace = TRUE))
    phenotype = list()
    B = matrix(0, nrow =  ncol(Xmat), ncol = ntrait)
    config = matrix(0, nrow =  ntrait, ncol = ncausal)
    for(i in c(1:ncausal)){
          coloc_trait = sample(c(1:ntrait), sampled_name[i])
          config[coloc_trait, i] = 1
    }
    for(i in 1:nrow(config)){
        beta = B[,i, drop = FALSE]
        index = which(config[i,] == 1)
        if(length(index) > 0){
            causal_index = vars[index]
            beta = sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
            B[, i] = beta
            pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
            phenotype[[i]] = pheno_single$P
        }else{
            pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
            phenotype[[i]] = pheno_single$P
        
        }
    }
    variant = list()
    for(i in 1:ncol(B)){
        variant[[i]] = which(B[,i] != 0)
    }
    X = Xmat
    Y = bind_cols(phenotype)
    colnames(Y) = paste0("Trait", c(1:ntrait))
    data = list()
    data[["X"]] = Xmat
    data[["Y"]] = Y
    data[["variant"]] = variant
    saveRDS(data, ${_output:r})

4. Figure 2A simulation (20 trait)#

[real_simulation_20trait]
parameter: genofile = paths
# pheno_file: give genotype file (in plink)，we can read the gentype matrix. These files are separated by TADs.
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
# specify the number of causal variants
parameter: n_trait = 20
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: independent = False
parameter: share_pattern = "all"
# specify the number of traits (phenotypes)
parameter: container = ""
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'
R:  expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container 
    library("MASS")
    library("plink2R")
    library("dplyr")
    library("readr")
    library("tidyverse")
    # source some functions to read matrix and inpute the missing data
    source("~/cloud_colocalization/simulation_code/simulate_linreg.R")
    source("~/cloud_colocalization/simulation_code/misc.R")
    # read the plink file
    simu_file = ${_input:r}
    geno <- read_plink(${_input:nr})
    gene_name = str_extract(simu_file, "ENSG[0-9]+")
    gene_tss_map = read_tsv("/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv")
    # filter by distance with. TSS
    TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
    keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)
    geno$bed = geno$bed[,keep_index]
    # filter out columns with missing rate > 0.1
    imiss = 0.1
    # filter out columns with MAF < 0.05
    maf = 0.05
    Xmat = filter_X(geno$bed, imiss, maf)
    # only keep the first 4000 variants
    ncausal = ${n_causal}
    ntrait = ${n_trait}
    indep = ${"TRUE" if independent else "FALSE"}
    if (indep) {
        LD_vars = 1  # Initialize LD_vars

        if (ncausal == 1) {
            # If only one causal variant, just sample it
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Repeat sampling until selected variables are quasi independent
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)  
                cor_mat = cor(Xmat[, vars]) 
                LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
            }
        }
    } else {
        vars = sample(1:ncol(Xmat), size = ncausal)
    }
  
    prop = readRDS("/home/hs3393/cloud_colocalization/simulation_code/trait17_prop.rds")
    proportions = prop[ncausal,]
    sampled_name <- as.numeric(sample(names(proportions), size =  ncausal, prob = proportions, replace = TRUE))
    phenotype = list()
    B = matrix(0, nrow =  ncol(Xmat), ncol = ntrait)
    config = matrix(0, nrow =  ntrait, ncol = ncausal)
    for(i in c(1:ncausal)){
          coloc_trait = sample(c(1:ntrait), sampled_name[i])
          config[coloc_trait, i] = 1
    }
    for(i in 1:nrow(config)){
        beta = B[,i, drop = FALSE]
        index = which(config[i,] == 1)
        if(length(index) > 0){
            causal_index = vars[index]
            beta = sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
            B[, i] = beta
            pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
            phenotype[[i]] = pheno_single$P
        }else{
            pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
            phenotype[[i]] = pheno_single$P
        
        }
    }
    variant = list()
    for(i in 1:ncol(B)){
        variant[[i]] = which(B[,i] != 0)
    }
    X = Xmat
    Y = bind_cols(phenotype)
    colnames(Y) = paste0("Trait", c(1:ntrait))
    data = list()
    data[["X"]] = Xmat
    data[["Y"]] = Y
    data[["variant"]] = variant
    saveRDS(data, ${_output:r})

5. Job submission bash file#

2 trait#

work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_2trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code

# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err

source ~/mamba_activate.sh

sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \
    --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
    --n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
    --cwd /home/hs3393/cb_Mar/simulation_data/

EOF

for ncausal in 1 2 3; do
    base_sh="base_script"
    output_script="${job}_causal_${ncausal}.sh"
    cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"|  sed "s|CAUSAL|${ncausal}|g"  > ${output_script}
    sbatch ${output_script}
done

5 trait#

work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_5trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code

# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err

source ~/mamba_activate.sh

sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \
        --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
        --n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
        --cwd /home/hs3393/cb_Mar/simulation_data/
EOF

for ncausal in 1 2 3 4; do
    base_sh="base_script"
    output_script="${job}_causal_${ncausal}.sh"
    cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"|  sed "s|CAUSAL|${ncausal}|g"  > ${output_script}
    sbatch ${output_script}
done

10 trait#

work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_10trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code

# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err

source ~/mamba_activate.sh

sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \
        --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
        --n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
        --cwd /home/hs3393/cb_Mar/simulation_data/
EOF

for ncausal in 1 2 3 4 5; do
    base_sh="base_script"
    output_script="${job}_causal_${ncausal}.sh"
    cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"|  sed "s|CAUSAL|${ncausal}|g"  > ${output_script}
    sbatch ${output_script}
done

20 trait#

work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_20trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code

# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err

source ~/mamba_activate.sh

sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \
        --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
        --n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
        --cwd /home/hs3393/cb_Mar/simulation_data/
EOF

for ncausal in 1 2 3 4 5; do
    base_sh="base_script"
    output_script="${job}_causal_${ncausal}.sh"
    cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"|  sed "s|CAUSAL|${ncausal}|g"  > ${output_script}
    sbatch ${output_script}
done

6. Supplementary: all shared senario#

[all_share_simulation]
parameter: genofile = paths
# pheno_file: give genotype file (in plink)，we can read the gentype matrix. These files are separated by TADs.
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "40G"
parameter: numThreads = 1
parameter: n_causal = 1
# specify the number of causal variants
parameter: n_trait = 50
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: independent = False
parameter: share_pattern = "all"
# specify the number of traits (phenotypes)
parameter: container = ""
input: genofile, group_by = 1
output: f'{cwd:a}/sample_{_index}_h2g_{h2g}_ncausal_{n_causal}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R:  expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container 
    library("MASS")
    library("plink2R")
    library("dplyr")
    library("readr")
    library("tidyverse")
    # source some functions to read matrix and inpute the missing data
    source("~/cloud_colocalization/simulation_code/simulate_linreg.R")
    source("~/cloud_colocalization/simulation_code/misc.R")
    # read the plink file
    simu_file = ${_input:r}
    geno <- read_plink(${_input:nr})
    gene_name = str_extract(simu_file, "ENSG[0-9]+")
    gene_tss_map = read_tsv("/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv")

    # Extract the TSS position for the given gene
    TSS_pos <- gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]

    # Define the genomic region around the TSS (+/- 1.5Mb) and filter SNPs
    keep_index <- which(geno$bim$V4 > (TSS_pos - 1500000) | geno$bim$V4 < (TSS_pos + 1500000))
    geno$bed <- geno$bed[, keep_index]

    # Set thresholds for filtering SNPs
    imiss <- 0.1  # Maximum missing rate allowed (10%)
    maf <- 0.05   # Minimum minor allele frequency (5%)

    # Filter SNPs based on missing rate and MAF
    Xmat <- filter_X(geno$bed, imiss, maf)

    # Define the number of causal variants and traits
    ncausal <- ${n_causal}
    ntrait <- ${n_trait}
    indep = ${"TRUE" if independent else "FALSE"}
    if (indep) {
        LD_vars = 1  # Initialize LD_vars

        if (ncausal == 1) {
            # If only one causal variant, just sample it
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Repeat sampling until selected variables are quasi independent
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)  
                cor_mat = cor(Xmat[, vars]) 
                LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
            }
        }
    } else {
        vars = sample(1:ncol(Xmat), size = ncausal)
    }

    # Initialize effect size matrix (B) with zeros
    B <- matrix(0, nrow = ncol(Xmat), ncol = ntrait)

    # Simulate effect sizes for the first two traits
    for (i in 1:ntrait) {
        causal_index <- vars
        beta <- sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
        B[, i] <- beta
    }

    # save causal variants per trait
    variant <- list()
    for (i in 1:ncol(B)) {
        variant[[i]] <- which(B[, i] != 0)
    }

    # Simulate multi-trait phenotypes
    phenotype <- sim_multi_traits(
        G = Xmat,
        B = B,
        h2g = ${h2g},
        is_h2g_total = ${"TRUE" if total_h2g else "FALSE"}
    )
    phenotype <- phenotype$P

    data <- list()
    data[["X"]] <- Xmat
    data[["Y"]] <- phenotype
    data[["variant"]] <- variant

    saveRDS(data, ${_output:r})

7. Supplementary job submission#

work_dir="/home/hs3393/cb_Mar/simulation_data/all_share_simulation/"
job="all_share_simulation"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code

# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err

source ~/mamba_activate.sh

sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \
        --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
        --n_causal CAUSAL --mem 30G --h2g 0.05 --independent --n_trait 20 \
        --cwd /home/hs3393/cb_Mar/simulation_data/JOB/result
EOF

for ncausal in 1 2 3 4; do
    base_sh="base_script"
    output_script="${job}_causal_${ncausal}.sh"
    cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"|  sed "s|CAUSAL|${ncausal}|g"  > ${output_script}
    sbatch ${output_script}
done