Phenotype Data Simulation

Phenotype Data Simulation#

Goal#

Here we use two strategies to simulate phenotype data (Y matrix) based on individual-level genotype data (X matrix).

1. Total heritability (\(\phi_{total}\))#

Effect size are all set to be 1. This value actually doesn’t matter so much because phi will eventually control the variance. Under this case, all SNPs share the same effect size, and they altogether contribute to explain \(\phi_{total}\) (eg.0.5) variance of the total variance. To get Y, we assume a multivariate gaussian distribution \(\textbf{Y} \sim N(\textbf{X} \beta, \sigma^2)\), and \(\sigma^2\) can be estimated by the equation below.

\(\phi_{total} = \dfrac{var(X \beta)}{\sigma^2 + var(X \beta)}\)

\(\sigma^2 = \frac{var(X \beta)(1-\phi_{total})}{\phi_{total}}\)

2. SNP level heritability#

Assume there are total number of \(a\) causal variants. We assign \(\beta_1 = 1\) as the initialize setting.
For each causal variant we have:

\(\frac{Var(X_1 \beta_1)}{Var(Y)} = \frac{Var(X_2 \beta_2)}{Var(Y)} = ... = \frac{Var(X_a \beta_a)}{Var(Y)} = \phi_{SNP}\)

In that way, we have: if \(\beta_1^2 Var(X_1) = \beta_2^2Var(X_2) = ... = \beta_a^2 Var(X_a)\). Then we have \(\beta_2 = \sqrt{\frac{\beta_1^2 Var(X_1)}{Var(X_2)}}\), …, \(\beta_a = \sqrt{\frac{\beta_1^2 Var(X_1)}{Var(X_a)}}\).
Then we will have \(\frac{Var(X_1 \beta_1)}{Var(Y)} = \frac{Var(X_1 \beta_1)}{Var(X \beta) + \sigma^2} = \phi_{SNP}\). Therefore, we have \(\sigma^2 = \frac{Var(X_1 \beta_1)}{\phi_{SNP}} - Var(X \beta)\)

Input#

genofile:PLINK file of real genotype data, consisting of .bed, .bim, and .fam files.

The .bed file stores the binary-encoded genotype information for each individual across all SNPs, making the data storage efficient. The .bim file is a text file containing information about each SNP, including chromosome number, SNP ID, genetic distance, base-pair position, and the two alleles.

The other parameters can be found in simxQTL repo. https://github.com/StatFunGen/simxQTL.

Output#

An rds matrix, with genotype matrix X (dimension: m * n, m: number of sample, n: number of SNP ) and phenotype (trait) matrix (dimension: m * a, m : number of samples, a: number of simulated traits)

Example output:

result = readRDS("../sample_999_real_simulation_3_ncausal_5_trait.rds")

dim(result$X)

dim(result$Y)

1153
10534

1153
5

str(result$variant)

List of 5
 $ : int [1:3] 196 5616 6566
 $ : int [1:2] 5616 6566
 $ : int [1:2] 196 6566
 $ : int [1:2] 196 5616
 $ : int [1:2] 196 6566

In this region, we simulated 3 causal variants (196, 5616, 6566). Each causal variant is distributed by distribution in 5 traits.

Simulation code#

Step 1: Randomly select regions#

Select gene regions to simulate - randomly select one gene from each 1,329 different TADB blocks.

Some genes might exist in more than one TADB block. Total unique gene number after random selection: 1287.

library(tidyverse)

mapper = read_tsv("./Data/gene_cis_TADB_mapper.tsv")

df_selected <- mapper %>% filter(`#chr` != "chrX")  %>%
  group_by(TADB_index) %>%
  slice_sample(n = 1) %>%
  ungroup()

genes_selected = df_selected %>% pull(gene_id) 

genes_selected %>% length()

#genes_selected %>% save_tsv("/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/selected_gene_id.tsv")

genes_selected = read_tsv("./Data/selected_gene_id.tsv", col_names = "gene")$gene

mapper %>% filter(gene_id %in% genes_selected)  %>% pull(TADB_index) %>% unique() %>% length()

Rows: 126063 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (5): #chr, gene_id, gene_name, TADB_index, TADB_id
dbl (6): gene_cis_start, gene_cis_end, TSS, end.x, TADB_start, TADB_end

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1329

Rows: 1329 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): gene

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1329

genes_selected %>% unique() %>% length()

1287

Step 2: Make a copy these selected genotypes to another folder#

import os
import shutil

# Path to the file with gene names
gene_file_path = './Data/selected_gene_id.tsv'

source_folder = '/mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/plink_by_gene/extended_cis_before_winsorize_plink_files'

destination_folder = '/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype'

with open(gene_file_path, 'r') as file:
    gene_names = [line.strip() for line in file]

if not os.path.exists(destination_folder):
    os.makedirs(destination_folder)

for filename in os.listdir(source_folder):
    if any(gene_name in filename for gene_name in gene_names):
        # Copy the file to the destination folder
        shutil.copy(os.path.join(source_folder, filename), destination_folder)

print('Files copied successfully.')

Step 3: Simulation#

1. Figure 2A simulation (2 trait)#

[real_simulation_2trait]
# ======= Define parameters =======
parameter: genofile = paths
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
parameter: n_trait = 2
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: share_pattern = "all"
parameter: independent = False
parameter: container = ""

# ======= Workflow input/output =======
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'

# ======= Main R block =======
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container 

    # --- Load necessary libraries ---
    library("MASS")
    library("plink2R")
    library("dplyr")
    library("readr")
    library("tidyverse")
  
    # install simulation package
    # devtools::install_github("StatFunGen/simxQTL", build_vignettes = FALSE)
    # BiocManager::install("StatFunGen/pecotmr")
    library("pecotmr")
    library("simxQTL")

    # --- Read genotype data from PLINK files ---
    simu_file = ${_input:r}
    geno <- read_plink(${_input:nr})

    # --- Extract gene name from file path (e.g., "ENSG00000123456") ---
    gene_name = str_extract(simu_file, "ENSG[0-9]+")

    # --- Load gene-TSS mapping file ---
    gene_tss_map = read_tsv("./Data/gene_cis_TADB_mapper.tsv")

    # --- Find TSS (transcription start site) of the gene ---
    TSS_pos <- gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]

    # --- Keep SNPs within ±1.5Mb window around TSS ---
    keep_index <- which(geno$bim$V4 > (TSS_pos - 1500000) | geno$bim$V4 < (TSS_pos + 1500000))
    geno$bed <- geno$bed[, keep_index]

    # --- Filter SNPs by missing rate and minor allele frequency (MAF) ---
    imiss <- 0.1  # Allow up to 10% missingness
    maf <- 0.05   # Require MAF > 5%
    Xmat <- filter_X(geno$bed, imiss, maf)

    # --- Set simulation parameters ---
    ncausal <- ${n_causal}
    ntrait <- ${n_trait}
    indep = ${"TRUE" if independent else "FALSE"}

    # --- Select causal variants ---
    if (indep) {
        LD_vars = 1  # Initialize LD check

        if (ncausal == 1) {
            # Only one causal variant needed
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Ensure selected causal variants are approximately independent (LD < 0.3)
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)  
                cor_mat = cor(Xmat[, vars]) 
                LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
            }
        }
    } else {
        LD_vars = 1  # Initialize LD check

        if (ncausal == 1) {
            # Only one causal variant needed
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Ensure selected variants are not perfectly correlated
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)  
                cor_mat = cor(Xmat[, vars]) 
                LD_vars = which(colSums(abs(cor_mat) == 1) > 1)
            }
        }
    }

    # --- Initialize effect size matrix ---
    B <- matrix(0, nrow = ncol(Xmat), ncol = ntrait)

    # --- Simulate effect sizes for each trait ---
    for (i in 1:2) {
        causal_index <- vars
        beta <- sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
        B[, i] <- beta
    }

    # --- Save list of causal variants per trait ---
    variant <- list()
    for (i in 1:ncol(B)) {
        variant[[i]] <- which(B[, i] != 0)
    }

    # --- Simulate multi-trait phenotypes ---
    phenotype <- sim_multi_traits(
        G = Xmat,
        B = B,
        h2g = ${h2g},
        is_h2g_total = ${"TRUE" if total_h2g else "FALSE"}
    )
    phenotype <- phenotype$P  # Extract phenotype matrix

    # --- Bundle output data into a list ---
    data <- list()
    data[["X"]] <- Xmat       # Genotype matrix
    data[["Y"]] <- phenotype  # Phenotype matrix
    data[["variant"]] <- variant  # Causal variants

    # --- Save simulation output as .rds ---
    saveRDS(data, ${_output:r})

2. Figure 2A simulation (5 trait)#

[real_simulation_5trait]
# ======= Define parameters =======
parameter: genofile = paths
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
parameter: n_trait = 5
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: share_pattern = "all"
parameter: independent = False
parameter: container = ""

# ======= Workflow input/output =======
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'

# ======= Main R block =======
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container 

    # --- Load necessary libraries ---
    library("MASS")
    library("plink2R")
    library("dplyr")
    library("readr")
    library("tidyverse")
  
    # install simulation package
    # devtools::install_github("StatFunGen/simxQTL", build_vignettes = FALSE)
    # BiocManager::install("StatFunGen/pecotmr")
    library("pecotmr")
    library("simxQTL")

    # --- Read genotype data ---
    simu_file = ${_input:r}
    geno <- read_plink(${_input:nr})

    # --- Extract gene name ---
    gene_name = str_extract(simu_file, "ENSG[0-9]+")
    gene_tss_map = read_tsv("./Data/gene_cis_TADB_mapper.tsv")

    # --- Filter SNPs within ±1.5Mb of TSS ---
    TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
    keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)
    geno$bed <- geno$bed[, keep_index]

    # --- Filter SNPs by missing rate and MAF ---
    imiss <- 0.1
    maf <- 0.05
    Xmat <- filter_X(geno$bed, imiss, maf)

    # --- Set simulation parameters ---
    ncausal <- ${n_causal}
    ntrait <- ${n_trait}
    indep = ${"TRUE" if independent else "FALSE"}

    # --- Select causal variants ---
    if (indep) {
        LD_vars = 1  # Initialize LD check

        if (ncausal == 1) {
            # Only one causal variant needed
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Ensure selected variants are approximately independent (LD < 0.3)
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)  
                cor_mat = cor(Xmat[, vars]) 
                LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
            }
        }
    } else {
        LD_vars = 1  # Initialize LD check

        if (ncausal == 1) {
            # Only one causal variant needed
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Ensure selected variants are not perfectly correlated
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)  
                cor_mat = cor(Xmat[, vars]) 
                LD_vars = which(colSums(abs(cor_mat) == 1) > 1)
            }
        }
    }

    # --- Load predefined proportions for assigning variant sharing across traits ---
    prop <- readRDS("./Data/trait6_prop.rds")[, 1:4]
    proportions <- prop[ncausal, ]

    # --- Sample variant names based on trait-sharing proportions ---
    sampled_name <- as.numeric(sample(names(proportions), size = ncausal, prob = proportions, replace = TRUE))

    # --- Initialize phenotype list and effect size matrix ---
    phenotype <- list()
    B <- matrix(0, nrow = ncol(Xmat), ncol = ntrait)

    # --- Configuration matrix for causal variants across traits ---
    config <- matrix(0, nrow = ntrait, ncol = ncausal)

    # --- Assign causal variants to traits ---
    for (i in 1:ncausal) {
        coloc_trait <- sample(1:ntrait, sampled_name[i])
        config[coloc_trait, i] <- 1
    }

    # --- Simulate effect sizes and phenotypes for each trait ---
    for (i in 1:nrow(config)) {
        beta <- B[, i, drop = FALSE]
        index <- which(config[i, ] == 1)

        if (length(index) > 0) {
            causal_index <- vars[index]
            beta <- sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
            B[, i] <- beta
            pheno_single <- sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
            phenotype[[i]] <- pheno_single$P
        } else {
            # No causal variants for this trait: simulate under null
            pheno_single <- sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
            phenotype[[i]] <- pheno_single$P
        }
    }

    # --- Identify causal variants for each trait ---
    variant <- list()
    for (i in 1:ncol(B)) {
        variant[[i]] <- which(B[, i] != 0)
    }

    # --- Combine phenotype matrices into one dataframe ---
    X <- Xmat
    Y <- bind_cols(phenotype)
    colnames(Y) <- paste0("Trait", 1:ntrait)

    # --- Bundle and save simulation output ---
    data <- list()
    data[["X"]] <- Xmat
    data[["Y"]] <- Y
    data[["variant"]] <- variant

    saveRDS(data, ${_output:r})

3. Figure 2A simulation (10 trait)#

[real_simulation_10trait]
# ======= Define parameters =======
parameter: genofile = paths
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
parameter: n_trait = 10
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: independent = False
parameter: share_pattern = "all"
parameter: container = ""

# ======= Workflow input/output =======
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'

# ======= Main R block =======
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container 

    # --- Load necessary libraries ---
    library("MASS")
    library("plink2R")
    library("dplyr")
    library("readr")
    library("tidyverse")
  
    # install simulation package
    # devtools::install_github("StatFunGen/simxQTL", build_vignettes = FALSE)
    # BiocManager::install("StatFunGen/pecotmr")
    library("pecotmr")
    library("simxQTL")

    # --- Read genotype data ---
    simu_file = ${_input:r}
    geno <- read_plink(${_input:nr})

    # --- Extract gene name ---
    gene_name = str_extract(simu_file, "ENSG[0-9]+")
    gene_tss_map = read_tsv("./Data/gene_cis_TADB_mapper.tsv")

    # --- Filter SNPs within ±1.5Mb of TSS ---
    TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
    keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)
    geno$bed = geno$bed[, keep_index]

    # --- Filter SNPs by missing rate and MAF ---
    imiss = 0.1
    maf = 0.05
    Xmat = filter_X(geno$bed, imiss, maf)

    # --- Set simulation parameters ---
    ncausal = ${n_causal}
    ntrait = ${n_trait}
    indep = ${"TRUE" if independent else "FALSE"}

    # --- Select causal variants ---
    if (indep) {
        LD_vars = 1  # Initialize LD check

        if (ncausal == 1) {
            # Only one causal variant needed
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Ensure selected variants are approximately independent (LD < 0.3)
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)  
                cor_mat = cor(Xmat[, vars]) 
                LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
            }
        }
    } else {
        LD_vars = 1  # Initialize LD check

        if (ncausal == 1) {
            # Only one causal variant needed
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Ensure selected variants are not perfectly correlated
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)  
                cor_mat = cor(Xmat[, vars]) 
                LD_vars = which(colSums(abs(cor_mat) == 1) > 1)
            }
        }
    }

    # --- Load predefined trait-sharing proportions ---
    prop = readRDS("./Data/trait10_prop.rds")
    proportions = prop[ncausal,]

    # --- Sample number of traits shared by each variant ---
    sampled_name <- as.numeric(sample(names(proportions), size = ncausal, prob = proportions, replace = TRUE))

    # --- Initialize phenotype list and effect size matrix ---
    phenotype = list()
    B = matrix(0, nrow = ncol(Xmat), ncol = ntrait)

    # --- Configuration matrix for variant sharing across traits ---
    config = matrix(0, nrow = ntrait, ncol = ncausal)

    # --- Assign causal variants to traits ---
    for (i in 1:ncausal) {
        coloc_trait = sample(c(1:ntrait), sampled_name[i])
        config[coloc_trait, i] = 1
    }

    # --- Simulate effect sizes and phenotypes ---
    for (i in 1:nrow(config)) {
        beta = B[, i, drop = FALSE]
        index = which(config[i,] == 1)

        if (length(index) > 0) {
            causal_index = vars[index]
            beta = sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
            B[, i] = beta
            pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
            phenotype[[i]] = pheno_single$P
        } else {
            # No causal variant for this trait: simulate under null
            pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
            phenotype[[i]] = pheno_single$P
        }
    }

    # --- Identify causal variants for each trait ---
    variant = list()
    for (i in 1:ncol(B)) {
        variant[[i]] = which(B[, i] != 0)
    }

    # --- Combine phenotype matrices ---
    X = Xmat
    Y = bind_cols(phenotype)
    colnames(Y) = paste0("Trait", c(1:ntrait))

    # --- Bundle and save output ---
    data = list()
    data[["X"]] = Xmat
    data[["Y"]] = Y
    data[["variant"]] = variant

    saveRDS(data, ${_output:r})

4. Figure 2A simulation (20 trait)#

[real_simulation_20trait]
# ======= Define parameters =======
parameter: genofile = paths
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "30G"
parameter: numThreads = 1
parameter: n_trait = 20
parameter: n_causal = 1
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: independent = False
parameter: share_pattern = "all"
parameter: container = ""

# ======= Workflow input/output =======
input: genofile, group_by = 1
output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'

# ======= Main R block =======
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container 

    # --- Load necessary libraries ---
    library("MASS")
    library("plink2R")
    library("dplyr")
    library("readr")
    library("tidyverse")
  
    # install simulation package
    # devtools::install_github("StatFunGen/simxQTL", build_vignettes = FALSE)
    # BiocManager::install("StatFunGen/pecotmr")
    library("pecotmr")
    library("simxQTL")

    # --- Read genotype data ---
    simu_file = ${_input:r}
    geno <- read_plink(${_input:nr})

    # --- Extract gene name ---
    gene_name = str_extract(simu_file, "ENSG[0-9]+")
    gene_tss_map = read_tsv("./Data/gene_cis_TADB_mapper.tsv")

    # --- Filter SNPs within ±1.5Mb of TSS ---
    TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
    keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)
    geno$bed = geno$bed[, keep_index]

    # --- Filter SNPs by missing rate and MAF ---
    imiss = 0.1
    maf = 0.05
    Xmat = filter_X(geno$bed, imiss, maf)

    # --- Set simulation parameters ---
    ncausal = ${n_causal}
    ntrait = ${n_trait}
    indep = ${"TRUE" if independent else "FALSE"}

    # --- Select causal variants ---
    if (indep) {
        LD_vars = 1  # Initialize LD check

        if (ncausal == 1) {
            # Only one causal variant needed
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Ensure selected variants are approximately independent (LD < 0.3)
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)  
                cor_mat = cor(Xmat[, vars]) 
                LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
            }
        }
    } else {
        LD_vars = 1  # Initialize LD check

        if (ncausal == 1) {
            # Only one causal variant needed
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Ensure selected variants are not perfectly correlated
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)  
                cor_mat = cor(Xmat[, vars]) 
                LD_vars = which(colSums(abs(cor_mat) == 1) > 1)
            }
        }
    }

    # --- Load predefined trait-sharing proportions ---
    prop = readRDS("./Data/trait17_prop.rds")
    proportions = prop[ncausal,]

    # --- Sample number of traits shared by each variant ---
    sampled_name <- as.numeric(sample(names(proportions), size = ncausal, prob = proportions, replace = TRUE))

    # --- Initialize phenotype list and effect size matrix ---
    phenotype = list()
    B = matrix(0, nrow = ncol(Xmat), ncol = ntrait)

    # --- Configuration matrix for variant sharing across traits ---
    config = matrix(0, nrow = ntrait, ncol = ncausal)

    # --- Assign causal variants to traits ---
    for (i in 1:ncausal) {
        coloc_trait = sample(c(1:ntrait), sampled_name[i])
        config[coloc_trait, i] = 1
    }

    # --- Simulate effect sizes and phenotypes ---
    for (i in 1:nrow(config)) {
        beta = B[, i, drop = FALSE]
        index = which(config[i, ] == 1)

        if (length(index) > 0) {
            causal_index = vars[index]
            beta = sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
            B[, i] = beta
            pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
            phenotype[[i]] = pheno_single$P
        } else {
            # No causal variant for this trait: simulate under null
            pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)
            phenotype[[i]] = pheno_single$P
        }
    }

    # --- Identify causal variants for each trait ---
    variant = list()
    for (i in 1:ncol(B)) {
        variant[[i]] = which(B[, i] != 0)
    }

    # --- Combine phenotype matrices ---
    X = Xmat
    Y = bind_cols(phenotype)
    colnames(Y) = paste0("Trait", c(1:ntrait))

    # --- Bundle and save output ---
    data = list()
    data[["X"]] = Xmat
    data[["Y"]] = Y
    data[["variant"]] = variant

    saveRDS(data, ${_output:r})

5. Job submission bash file#

2 trait#

work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_2trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code

# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err

source ~/mamba_activate.sh

sos run ~/colocboost-paper/Simulation_Studies/1_Phenotype_Simulation.ipynb JOB \
    --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
    --n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
    --cwd /home/hs3393/cb_Mar/simulation_data/

EOF

for ncausal in 1 2 3; do
    base_sh="base_script"
    output_script="${job}_causal_${ncausal}.sh"
    cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"|  sed "s|CAUSAL|${ncausal}|g"  > ${output_script}
    sbatch ${output_script}
done

5 trait#

work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_5trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code

# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err

source ~/mamba_activate.sh

sos run ~/colocboost-paper/Simulation_Studies/1_Phenotype_Simulation.ipynb JOB \
        --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
        --n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
        --cwd /home/hs3393/cb_Mar/simulation_data/
EOF

for ncausal in 1 2 3 4; do
    base_sh="base_script"
    output_script="${job}_causal_${ncausal}.sh"
    cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"|  sed "s|CAUSAL|${ncausal}|g"  > ${output_script}
    sbatch ${output_script}
done

10 trait#

work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_10trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code

# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err

source ~/mamba_activate.sh

sos run ~/colocboost-paper/Simulation_Studies/1_Phenotype_Simulation.ipynb JOB \
        --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
        --n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
        --cwd /home/hs3393/cb_Mar/simulation_data/
EOF

for ncausal in 1 2 3 4 5; do
    base_sh="base_script"
    output_script="${job}_causal_${ncausal}.sh"
    cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"|  sed "s|CAUSAL|${ncausal}|g"  > ${output_script}
    sbatch ${output_script}
done

20 trait#

work_dir="/home/hs3393/cb_Mar/simulation_data/"
job="real_simulation_20trait"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code

# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err

source ~/mamba_activate.sh

sos run ~/colocboost-paper/Simulation_Studies/1_Phenotype_Simulation.ipynb JOB \
        --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
        --n_causal CAUSAL --mem 30G --h2g 0.05 --independent \
        --cwd /home/hs3393/cb_Mar/simulation_data/
EOF

for ncausal in 1 2 3 4 5; do
    base_sh="base_script"
    output_script="${job}_causal_${ncausal}.sh"
    cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"|  sed "s|CAUSAL|${ncausal}|g"  > ${output_script}
    #sbatch ${output_script}
done

6. Supplementary: all shared senario#

[all_share_simulation]
# ======= Define parameters =======
parameter: genofile = paths
parameter: cwd = path("output")
parameter: job_size = 30
parameter: walltime = "100h"
parameter: mem = "40G"
parameter: numThreads = 1
parameter: n_causal = 1
parameter: n_trait = 50
parameter: h2g = 0.05
parameter: total_h2g = False
parameter: independent = False
parameter: share_pattern = "all"
parameter: container = ""

# ======= Workflow input/output =======
input: genofile, group_by = 1
output: f'{cwd:a}/sample_{_index}_h2g_{h2g}_ncausal_{n_causal}.rds'

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'

# ======= Main R block =======
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container 

    # --- Load necessary libraries ---
    library("MASS")
    library("plink2R")
    library("dplyr")
    library("readr")
    library("tidyverse")
  
    # install simulation package
    # devtools::install_github("StatFunGen/simxQTL", build_vignettes = FALSE)
    # BiocManager::install("StatFunGen/pecotmr")
    library("pecotmr")
    library("simxQTL")

    # --- Read genotype data ---
    simu_file = ${_input:r}
    geno <- read_plink(${_input:nr})

    # --- Extract gene name ---
    gene_name = str_extract(simu_file, "ENSG[0-9]+")
    gene_tss_map = read_tsv("./Data/gene_cis_TADB_mapper.tsv")

    # --- Filter SNPs within ±1.5Mb of TSS ---
    TSS_pos <- gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]
    keep_index <- which(geno$bim$V4 > (TSS_pos - 1500000) | geno$bim$V4 < (TSS_pos + 1500000))
    geno$bed <- geno$bed[, keep_index]

    # --- Filter SNPs by missing rate and MAF ---
    imiss <- 0.1
    maf <- 0.05
    Xmat <- filter_X(geno$bed, imiss, maf)

    # --- Set simulation parameters ---
    ncausal <- ${n_causal}
    ntrait <- ${n_trait}
    indep = ${"TRUE" if independent else "FALSE"}

    # --- Select causal variants ---
    if (indep) {
        LD_vars = 1  # Initialize LD check

        if (ncausal == 1) {
            # Only one causal variant needed
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Ensure selected variants are approximately independent (LD < 0.3)
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)
                cor_mat = cor(Xmat[, vars])
                LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)
            }
        }
    } else {
        LD_vars = 1  # Initialize LD check

        if (ncausal == 1) {
            vars = sample(1:ncol(Xmat), size = ncausal)
        } else {
            # Avoid perfectly correlated variants (|cor| = 1)
            while (length(LD_vars != 0)) {
                vars = sample(1:ncol(Xmat), size = ncausal)
                cor_mat = cor(Xmat[, vars])
                LD_vars = which(colSums(abs(cor_mat) == 1) > 1)
            }
        }
    }

    # --- Initialize effect size matrix ---
    B <- matrix(0, nrow = ncol(Xmat), ncol = ntrait)

    # --- Simulate effect sizes for all traits (shared causal variants) ---
    for (i in 1:ntrait) {
        causal_index <- vars
        beta <- sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
        B[, i] <- beta
    }

    # --- Identify causal variants for each trait ---
    variant <- list()
    for (i in 1:ncol(B)) {
        variant[[i]] <- which(B[, i] != 0)
    }

    # --- Simulate multi-trait phenotypes ---
    phenotype <- sim_multi_traits(
        G = Xmat,
        B = B,
        h2g = ${h2g},
        is_h2g_total = ${"TRUE" if total_h2g else "FALSE"}
    )
    phenotype <- phenotype$P

    # --- Bundle and save output ---
    data <- list()
    data[["X"]] <- Xmat
    data[["Y"]] <- phenotype
    data[["variant"]] <- variant

    saveRDS(data, ${_output:r})

7. Supplementary job submission#

work_dir="/home/hs3393/cb_Mar/simulation_data/all_share_simulation/"
job="all_share_simulation"
mkdir -p ${work_dir}
mkdir -p ${work_dir}/code
mkdir -p ${work_dir}/log
cd ${work_dir}/code

# Create the base_script file and write the bash code into it
cat << 'EOF' > base_script
#!/bin/bash -l
# NOTE the -l flag!
#
#SBATCH -t 10:00:00
#SBATCH --mem=30000
#SBATCH -J JOB
#SBATCH -o WORK_DIR/log/JOB."%j".out
#SBATCH -e WORK_DIR/log/JOB."%j".err

source ~/mamba_activate.sh

sos run ~/colocboost-paper/Simulation_Studies/1_Phenotype_Simulation.ipynb JOB \
        --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \
        --n_causal CAUSAL --mem 30G --h2g 0.05 --independent --n_trait 20 \
        --cwd /home/hs3393/cb_Mar/simulation_data/JOB/result
EOF

for ncausal in 1 2 3 4; do
    base_sh="base_script"
    output_script="${job}_causal_${ncausal}.sh"
    cat ${base_sh}| sed "s|WORK_DIR|${work_dir}|g" |sed "s|JOB|${job}|g"|  sed "s|CAUSAL|${ncausal}|g"  > ${output_script}
    sbatch ${output_script}
done