{ "cells": [ { "cell_type": "markdown", "id": "64d84e81-a286-4825-8940-b122f55be68b", "metadata": { "kernel": "SoS" }, "source": [ "# Secondary simulations\n", "\n", "This notebook include data simulation of 50 traits, and complex configurations suggested in HyprColoc paper." ] }, { "cell_type": "markdown", "id": "b8ec89af-da53-456f-bbfc-58175ab5e0b5", "metadata": { "kernel": "SoS" }, "source": [ "## Goal\n", "\n", "Because we don't have 50 traits to estimate and reflect the true configurations, we used a different approach: for each causal variant,\n", "we randomly select 10-25 traits to colocalize on that variant.\n", "\n", "We also simulated 10 traits complex configuration cases described and extended from Hyprcoloc paper." ] }, { "cell_type": "markdown", "id": "d795264f-41f2-48ad-aa11-a6e019d25a9b", "metadata": { "kernel": "SoS", "tags": [] }, "source": [ "## Input\n", "\n", "`genofile`: plink file of real genotyope, `/mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/plink_by_gene/extended_cis_before_winsorize_plink_files/*.bim`\n", "\n", "The other parameters can be found in simxQTL repo. `https://github.com/StatFunGen/simxQTL`.\n", " \n", "## Output\n", "\n", "An rds matrix, with genotype matrix X (dimension: m * n, m: number of sample, n: number of SNP ) and phenotype (trait) matrix (dimension: m * a, m : number of samples, a: number of simulated traits) \n", "\n", "Example output: " ] }, { "cell_type": "code", "execution_count": 5, "id": "3199cef0-90b6-4489-8aff-eb28f65ec34f", "metadata": { "kernel": "R", "tags": [] }, "outputs": [], "source": [ "result = readRDS(\"/home/hs3393/cb_Mar/simulation_data/simulation_551rand_complex/sample_39_h2g_0.05_10trait_cluster_5+5+1rand.rds\")" ] }, { "cell_type": "code", "execution_count": 6, "id": "9312870e-902e-4293-83a6-cfe9c341633a", "metadata": { "kernel": "R", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. \n", "
    1. 3620
    2. 5240
    \n", "
  2. \n", "\t
  3. 3620
  4. \n", "\t
  5. 3620
  6. \n", "\t
  7. \n", "
    1. 3620
    2. 5240
    \n", "
  8. \n", "\t
  9. 3620
  10. \n", "\t
  11. \n", "
    1. 250
    2. 5240
    \n", "
  12. \n", "\t
  13. 250
  14. \n", "\t
  15. 250
  16. \n", "\t
  17. 250
  18. \n", "\t
  19. \n", "
    1. 250
    2. 5240
    \n", "
  20. \n", "
\n" ], "text/latex": [ "\\begin{enumerate}\n", "\\item \\begin{enumerate*}\n", "\\item 3620\n", "\\item 5240\n", "\\end{enumerate*}\n", "\n", "\\item 3620\n", "\\item 3620\n", "\\item \\begin{enumerate*}\n", "\\item 3620\n", "\\item 5240\n", "\\end{enumerate*}\n", "\n", "\\item 3620\n", "\\item \\begin{enumerate*}\n", "\\item 250\n", "\\item 5240\n", "\\end{enumerate*}\n", "\n", "\\item 250\n", "\\item 250\n", "\\item 250\n", "\\item \\begin{enumerate*}\n", "\\item 250\n", "\\item 5240\n", "\\end{enumerate*}\n", "\n", "\\end{enumerate}\n" ], "text/markdown": [ "1. 1. 3620\n", "2. 5240\n", "\n", "\n", "\n", "2. 3620\n", "3. 3620\n", "4. 1. 3620\n", "2. 5240\n", "\n", "\n", "\n", "5. 3620\n", "6. 1. 250\n", "2. 5240\n", "\n", "\n", "\n", "7. 250\n", "8. 250\n", "9. 250\n", "10. 1. 250\n", "2. 5240\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "[[1]]\n", "[1] 3620 5240\n", "\n", "[[2]]\n", "[1] 3620\n", "\n", "[[3]]\n", "[1] 3620\n", "\n", "[[4]]\n", "[1] 3620 5240\n", "\n", "[[5]]\n", "[1] 3620\n", "\n", "[[6]]\n", "[1] 250 5240\n", "\n", "[[7]]\n", "[1] 250\n", "\n", "[[8]]\n", "[1] 250\n", "\n", "[[9]]\n", "[1] 250\n", "\n", "[[10]]\n", "[1] 250 5240\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "result$variant" ] }, { "cell_type": "markdown", "id": "c067f072-6983-4079-9d0b-fab6a54569b5", "metadata": { "kernel": "R" }, "source": [ "In this region, we simulated **3** causal variants (250, 3620, 5240). Each causal variant is distributed in **10** traits." ] }, { "cell_type": "markdown", "id": "3c8deff5-b934-458e-ad12-929dda6c36df", "metadata": { "kernel": "SoS", "tags": [] }, "source": [ "## Simulation code" ] }, { "cell_type": "code", "execution_count": null, "id": "0b14c7fb-57a8-4402-8175-cfe9bbcf7bb0", "metadata": { "kernel": "SoS", "tags": [] }, "outputs": [], "source": [ "[simulation_50trait]\n", "parameter: genofile = paths\n", "# pheno_file: give genotype file (in plink),we can read the gentype matrix. These files are separated by TADs.\n", "parameter: cwd = path(\"output\")\n", "parameter: job_size = 30\n", "parameter: walltime = \"100h\"\n", "parameter: mem = \"30G\"\n", "parameter: numThreads = 1\n", "parameter: independent = False\n", "# for each variant, how many traits it randomly colocalize at\n", "parameter: n_trait = 50\n", "parameter: h2g = 0.05\n", "parameter: ncausal = 5\n", "parameter: share_pattern = \"all\"\n", "# specify the number of traits (phenotypes)\n", "parameter: container = \"\"\n", "input: genofile, group_by = 1\n", "output: f'{cwd:a}/{step_name}/sample_{_index}_h2g_{h2g}_50_trait_ncausal_{ncausal}.rds'\n", "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n", "R: expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n", " library(\"MASS\")\n", " library(\"plink2R\")\n", " library(\"dplyr\")\n", " library(\"readr\")\n", " library(\"tidyverse\")\n", " # source some functions to read matrix and inpute the missing data\n", " source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n", " source(\"~/cloud_colocalization/simulation_code/misc.R\")\n", " # read the plink file\n", " simu_file = ${_input:r}\n", " geno <- read_plink(${_input:nr})\n", " gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n", " gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n", " # filter by distance with. TSS\n", " TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n", " keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)\n", " geno$bed = geno$bed[,keep_index]\n", " # filter out columns with missing rate > 0.1\n", " imiss = 0.1\n", " # filter out columns with MAF < 0.05\n", " maf = 0.05\n", " Xmat = filter_X(geno$bed, imiss, maf)\n", " ncausal = ${ncausal}\n", " indep = ${\"TRUE\" if independent else \"FALSE\"}\n", " \n", " if (indep) {\n", " LD_vars = 1 # Initialize LD_vars\n", "\n", " if (ncausal == 1) {\n", " # If only one causal variant, just sample it\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " } else {\n", " # Repeat sampling until selected variables are quasi independent\n", " while (length(LD_vars != 0)) {\n", " vars = sample(1:ncol(Xmat), size = ncausal) \n", " cor_mat = cor(Xmat[, vars]) \n", " LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)\n", " }\n", " }\n", " } else {\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " }\n", "\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " B = matrix(0, nrow = ncol(Xmat), ncol = 50)\n", " \n", " trait_number_vector = sample(x = c(10:25), size = ncausal, replace = TRUE)\n", " \n", " trait_list = list()\n", " for(i in 1:ncausal){\n", " trait_list[[i]] = sample(x = 1:50, size = trait_number_vector[i])\n", " \n", " }\n", " \n", " \n", " \n", " phenotype = list()\n", " for(i in 1:50){\n", " index = which(unlist(lapply(trait_list, function(x) i %in% x)))\n", " if(length(index) > 0){ \n", " beta = sim_beta_fix_variant(G = Xmat, causal_index = vars[index], is_h2g_total = FALSE)\n", " B[, i] = beta\n", " pheno_single = sim_multi_traits(G = Xmat, B = B[,i, drop = FALSE], h2g = 0.05, is_h2g_total = FALSE)\n", " phenotype[[i]] = pheno_single$P\n", " }else{\n", " pheno_single = sim_multi_traits(G = Xmat, B = B[,i, drop = FALSE], h2g = 0.05, is_h2g_total = FALSE)\n", " phenotype[[i]] = pheno_single$P\n", " \n", " }\n", "\n", " }\n", " variant = list()\n", " for(i in 1:ncol(B)){\n", " variant[[i]] = which(B[,i] != 0)\n", " }\n", " X = Xmat\n", " Y = bind_cols(phenotype)\n", " colnames(Y) = paste0(\"Trait\", c(1:50))\n", " data = list()\n", " data[[\"X\"]] = Xmat\n", " data[[\"Y\"]] = Y\n", " data[[\"variant\"]] = variant\n", " saveRDS(data, ${_output:r})" ] }, { "cell_type": "markdown", "id": "c6db48ae-0df7-4a97-9a8e-eee894698ce4", "metadata": { "kernel": "SoS" }, "source": [ "## Phenotype Simulation" ] }, { "cell_type": "code", "execution_count": null, "id": "42ea6b19-0f12-43ea-b0ab-86e7324f73f7", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "work_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n", "job=\"simulation_50trait\"\n", "mkdir -p ${work_dir}\n", "mkdir -p ${work_dir}/code\n", "mkdir -p ${work_dir}/log\n", "cd ${work_dir}/code\n", "\n", "# Create the base_script file and write the bash code into it\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 10:00:00\n", "#SBATCH --mem=30000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n", "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n", "\n", "source ~/mamba_activate.sh\n", "\n", "sos run /home/hs3393/cb_Mar/simulation_code/5.Simulation_secondary.ipynb JOB \\\n", " --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n", " --ncausal CAUSAL --mem 30G --h2g 0.05 --independent \\\n", " --cwd /home/hs3393/cb_Mar/simulation_data/\n", "EOF\n", "\n", "for ncausal in 1 2 3 4 5; do\n", " base_sh=\"base_script\"\n", " output_script=\"${job}_causal_${ncausal}.sh\"\n", " cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\"| sed \"s|CAUSAL|${ncausal}|g\" > ${output_script}\n", " sbatch ${output_script}\n", "done" ] }, { "cell_type": "markdown", "id": "42176543-bef9-4fad-9b66-72b74e81c1e3", "metadata": { "kernel": "SoS" }, "source": [ "## 50 trait: Run ColocBoost" ] }, { "cell_type": "code", "execution_count": null, "id": "662e1430-abb8-40eb-908c-325ed87bdaf3", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "data_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n", "job=\"simulation_50trait\"\n", "work_dir=\"/home/hs3393/cb_Mar/simulation_result/\"\n", "\n", "mkdir -p ${work_dir}/${job}/code\n", "mkdir -p ${work_dir}/${job}/log\n", "mkdir -p ${work_dir}/${job}/result\n", "\n", "cd ${work_dir}/${job}/code\n", "\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 80:00:00\n", "#SBATCH --mem=30000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n", "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n", "\n", "source /home/hs3393/mamba_activate.sh\n", "module load Singularity\n", "\n", "cd DATA_DIR/JOB\n", "sos run /home/hs3393/cb_Mar/simulation_code/2.Run_Colocboost.ipynb colocboost \\\n", " --simufile $(find -type f -name '*_ncausal_NCAUSAL*.rds') \\\n", " --mem 40G --trait 50 \\\n", " --cwd WORK_DIR/JOB/result\n", "EOF\n", "\n", "\n", "base_script=\"base_script\"\n", "for ncausal in 1 2 3 4 5; do\n", " output_script=\"ncausal_${ncausal}.sh\"\n", " cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|NCAUSAL|${ncausal}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\" > ${output_script}\n", " sbatch ${output_script}\n", "done" ] }, { "cell_type": "markdown", "id": "7eb53040-3036-47d1-8e31-8067b6f52204", "metadata": { "kernel": "SoS" }, "source": [ "## 50 trait: Run Hyprcoloc" ] }, { "cell_type": "code", "execution_count": null, "id": "4044ef4d-7c0b-46d4-b119-b06a9abde27f", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "data_dir=\"/home/hs3393/cb_Mar/simulation_data/simulation_50trait/\"\n", "job=\"simulation_50trait\"\n", "work_dir=\"/home/hs3393/cb_Mar/simulation_result/hyprcoloc/\"\n", "#!/bin/bash\n", "\n", "mkdir -p ${work_dir}/${job}/code\n", "mkdir -p ${work_dir}/${job}/log\n", "mkdir -p ${work_dir}/${job}/result\n", "\n", "cd ${work_dir}/${job}/code\n", "\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 80:00:00\n", "#SBATCH --mem=20000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n", "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n", "\n", "source /home/hs3393/mamba_activate.sh\n", "module load Singularity\n", "\n", "cd DATA_DIR\n", "sos run /home/hs3393/cb_Mar/simulation_code/3.Other_Methods.ipynb hyprcoloc_set \\\n", " --simufile $(find -type f -name '*_ncausal_NCAUSAL*.rds') \\\n", " --mem 20G --trait 50 \\\n", " --cwd WORK_DIR/JOB/result\n", "EOF\n", "\n", "\n", "base_script=\"base_script\"\n", "for ncausal in 1 2 3 4 5; do\n", " output_script=\"ncausal_${ncausal}.sh\"\n", " cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|NCAUSAL|${ncausal}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\" > ${output_script}\n", " sbatch ${output_script}\n", "done" ] }, { "cell_type": "markdown", "id": "6e42bbf2-2c53-460e-9d6b-e6a8cf9c88f8", "metadata": { "kernel": "SoS" }, "source": [ "## 50 trait: Colocboost summary" ] }, { "cell_type": "code", "execution_count": null, "id": "15dd0249-ec4e-4322-8365-e8d742460ef4", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "data_dir=\"/home/hs3393/cb_Mar/simulation_result/simulation_50trait/\"\n", "mkdir -p ${data_dir}/summary\n", "cd ${data_dir}/summary\n", "\n", "cat << 'EOF' > summary_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 8:00:00\n", "#SBATCH --mem=30000\n", "#SBATCH -J sum\n", "#SBATCH -o DATA_DIR/log/summary.\"%j\".out\n", "#SBATCH -e DATA_DIR/log/summary.\"%j\".err\n", "\n", "source ~/mamba_activate.sh\n", "\n", "sos run /home/hs3393/cb_Mar/simulation_code/4.Result_Summary.ipynb coloc_summary \\\n", " --folder DATA_DIR/result \\\n", " --cwd DATA_DIR/summary\n", "EOF\n", "\n", "\n", "base_script=\"summary_script\"\n", "output_script=\"summary.sh\"\n", "cat ${base_script}| sed \"s|DATA_DIR|${data_dir}|g\" > ${output_script}\n", "sbatch ${output_script}\n" ] }, { "cell_type": "markdown", "id": "413263e6-31c6-4e6b-9e2b-063ab44a9293", "metadata": { "kernel": "SoS" }, "source": [ "## 50 trait, Hyprcoloc summary" ] }, { "cell_type": "code", "execution_count": null, "id": "f838cdd0-4456-481e-ae9f-6ff832bac0c3", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "data_dir=\"/home/hs3393/cb_Mar/simulation_result/hyprcoloc/simulation_50trait/\"\n", "mkdir -p ${data_dir}/summary\n", "cd ${data_dir}/summary\n", "\n", "cat << 'EOF' > summary_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 8:00:00\n", "#SBATCH --mem=30000\n", "#SBATCH -J sum\n", "#SBATCH -o DATA_DIR/log/summary.\"%j\".out\n", "#SBATCH -e DATA_DIR/log/summary.\"%j\".err\n", "\n", "source ~/mamba_activate.sh\n", "\n", "sos run /home/hs3393/cb_Mar/simulation_code/4.Result_Summary.ipynb coloc_summary \\\n", " --folder DATA_DIR/result \\\n", " --cwd DATA_DIR/summary\n", "EOF\n", "\n", "\n", "base_script=\"summary_script\"\n", "output_script=\"summary.sh\"\n", "cat ${base_script}| sed \"s|DATA_DIR|${data_dir}|g\" > ${output_script}\n", "sbatch ${output_script}\n" ] }, { "cell_type": "markdown", "id": "3f3d5e75-0fd1-4c8e-8bf0-3bc2eaf54c1a", "metadata": { "kernel": "SoS" }, "source": [ "## Complex simulation" ] }, { "cell_type": "code", "execution_count": null, "id": "6e7bd40f-c0ad-4811-9646-de373b7ac43b", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "[simulation_55_complex]\n", "parameter: genofile = paths\n", "# pheno_file: give genotype file (in plink),we can read the gentype matrix. These files are separated by TADs.\n", "parameter: cwd = path(\"output\")\n", "parameter: job_size = 30\n", "parameter: walltime = \"100h\"\n", "parameter: mem = \"30G\"\n", "parameter: numThreads = 1\n", "# specify the number of causal variants\n", "parameter: n_trait = 10\n", "parameter: h2g = 0.05\n", "parameter: total_h2g = False\n", "parameter: share_pattern = \"all\"\n", "parameter: independent = False\n", "# specify the number of traits (phenotypes)\n", "parameter: container = \"\"\n", "input: genofile, group_by = 1\n", "output: f'{cwd:a}/{step_name}/sample_{_index}_h2g_{h2g}_10trait_cluster_5+5.rds'\n", "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n", "R: expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n", " library(\"MASS\")\n", " library(\"plink2R\")\n", " library(\"dplyr\")\n", " library(\"readr\")\n", " library(\"tidyverse\")\n", " # source some functions to read matrix and inpute the missing data\n", " source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n", " source(\"~/cloud_colocalization/simulation_code/misc.R\")\n", " # read the plink file\n", " simu_file = ${_input:r}\n", " geno <- read_plink(${_input:nr})\n", " gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n", " gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n", " # filter by distance with. TSS\n", " TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n", " keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)\n", " geno$bed = geno$bed[,keep_index]\n", " # filter out columns with missing rate > 0.1\n", " imiss = 0.1\n", " # filter out columns with MAF < 0.05\n", " maf = 0.05\n", " Xmat = filter_X(geno$bed, imiss, maf)\n", " indep = ${\"TRUE\" if independent else \"FALSE\"}\n", " if(indep){ LD_thresh = 0.3} else{ LD_thresh = 1}\n", " LD_vars = 1\n", " while(length(LD_vars != 0)){\n", " \n", " B1 = sim_beta(G = Xmat, ncausal = 1, ntrait = 5, \n", " is_h2g_total = FALSE, shared_pattern = \"all\")\n", " B2 = sim_beta(G = Xmat, ncausal = 1, ntrait = 5, \n", " is_h2g_total = FALSE, shared_pattern = \"all\")\n", " B = cbind(B1, B2)\n", " variant = list()\n", " for(i in 1:ncol(B)){\n", " variant[[i]] = which(B[,i] != 0)\n", " }\n", "\n", " var_mat = unique(unlist(variant))\n", " cor_mat = cor(Xmat[,var_mat]) \n", " LD_vars = which(colSums(abs(cor_mat) > LD_thresh) > 1)\n", " }\n", " phenotype = sim_multi_traits(G = Xmat, B = B, h2g = ${h2g}, is_h2g_total = ${\"TRUE\" if total_h2g else \"FALSE\"})\n", " phenotype = phenotype$P\n", " X = Xmat\n", " Y = phenotype\n", " data = list()\n", " data[[\"X\"]] = Xmat\n", " data[[\"Y\"]] = Y\n", " data[[\"variant\"]] = variant\n", " saveRDS(data, ${_output:r})" ] }, { "cell_type": "code", "execution_count": null, "id": "13abddbc-b9e1-4f93-ab13-c5bca02776a8", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "[simulation_3322_complex]\n", "parameter: genofile = paths\n", "# pheno_file: give genotype file (in plink),we can read the gentype matrix. These files are separated by TADs.\n", "parameter: cwd = path(\"output\")\n", "parameter: job_size = 30\n", "parameter: walltime = \"100h\"\n", "parameter: mem = \"30G\"\n", "parameter: numThreads = 1\n", "# specify the number of causal variants\n", "parameter: n_trait = 10\n", "parameter: h2g = 0.05\n", "parameter: total_h2g = False\n", "parameter: share_pattern = \"all\"\n", "parameter: independent = False\n", "# specify the number of traits (phenotypes)\n", "parameter: container = \"\"\n", "input: genofile, group_by = 1\n", "output: f'{cwd:a}/{step_name}//sample_{_index}_h2g_{h2g}_10trait_cluster_3+3+2+2.rds'\n", "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n", "R: expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n", " library(\"MASS\")\n", " library(\"plink2R\")\n", " library(\"dplyr\")\n", " library(\"readr\")\n", " library(\"tidyverse\")\n", " # source some functions to read matrix and inpute the missing data\n", " source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n", " source(\"~/cloud_colocalization/simulation_code/misc.R\")\n", " # read the plink file\n", " simu_file = ${_input:r}\n", " geno <- read_plink(${_input:nr})\n", " gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n", " gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n", " # filter by distance with. TSS\n", " TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n", " keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)\n", " geno$bed = geno$bed[,keep_index]\n", " # filter out columns with missing rate > 0.1\n", " imiss = 0.1\n", " # filter out columns with MAF < 0.05\n", " maf = 0.05\n", " Xmat = filter_X(geno$bed, imiss, maf)\n", " indep = ${\"TRUE\" if independent else \"FALSE\"}\n", " if(indep){ LD_thresh = 0.3} else{ LD_thresh = 1}\n", " LD_vars = 1\n", " while(length(LD_vars != 0)){\n", " B1 = sim_beta(G = Xmat, ncausal = 1, ntrait = 3, \n", " is_h2g_total = FALSE, shared_pattern = \"all\")\n", " B2 = sim_beta(G = Xmat, ncausal = 1, ntrait = 3, \n", " is_h2g_total = FALSE, shared_pattern = \"all\")\n", " B3 = sim_beta(G = Xmat, ncausal = 1, ntrait = 2, \n", " is_h2g_total = FALSE, shared_pattern = \"all\")\n", " B4 = sim_beta(G = Xmat, ncausal = 1, ntrait = 2, \n", " is_h2g_total = FALSE, shared_pattern = \"all\")\n", " B = cbind(B1, B2, B3, B4)\n", " variant = list()\n", " for(i in 1:ncol(B)){\n", " variant[[i]] = which(B[,i] != 0)\n", " }\n", "\n", " var_mat = unique(unlist(variant))\n", " cor_mat = cor(Xmat[,var_mat]) \n", " LD_vars = which(colSums(abs(cor_mat) > LD_thresh) > 1)\n", " }\n", " phenotype = sim_multi_traits(G = Xmat, B = B, h2g = ${h2g}, is_h2g_total = ${\"TRUE\" if total_h2g else \"FALSE\"})\n", " phenotype = phenotype$P\n", " X = Xmat\n", " Y = phenotype\n", " data = list()\n", " data[[\"X\"]] = Xmat\n", " data[[\"Y\"]] = Y\n", " data[[\"variant\"]] = variant\n", " saveRDS(data, ${_output:r})" ] }, { "cell_type": "code", "execution_count": null, "id": "e7dcb799-93f9-4d81-b0a0-cea7739fe8d7", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "[simulation_551rand_complex]\n", "parameter: genofile = paths\n", "# pheno_file: give genotype file (in plink),we can read the gentype matrix. These files are separated by TADs.\n", "parameter: cwd = path(\"output\")\n", "parameter: job_size = 30\n", "parameter: walltime = \"100h\"\n", "parameter: mem = \"30G\"\n", "parameter: numThreads = 1\n", "# specify the number of causal variants\n", "parameter: n_trait = 10\n", "parameter: h2g = 0.05\n", "parameter: total_h2g = False\n", "parameter: share_pattern = \"all\"\n", "parameter: independent = False\n", "# specify the number of traits (phenotypes)\n", "parameter: container = \"\"\n", "input: genofile, group_by = 1\n", "output: f'{cwd:a}/{step_name}//sample_{_index}_h2g_{h2g}_10trait_cluster_5+5+1rand.rds'\n", "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n", "R: expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n", " library(\"MASS\")\n", " library(\"plink2R\")\n", " library(\"dplyr\")\n", " library(\"readr\")\n", " library(\"tidyverse\")\n", " # source some functions to read matrix and inpute the missing data\n", " source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n", " source(\"~/cloud_colocalization/simulation_code/misc.R\")\n", " # read the plink file\n", " simu_file = ${_input:r}\n", " geno <- read_plink(${_input:nr})\n", " gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n", " gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n", " # filter by distance with. TSS\n", " TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n", " keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)\n", " geno$bed = geno$bed[,keep_index]\n", " # filter out columns with missing rate > 0.1\n", " imiss = 0.1\n", " # filter out columns with MAF < 0.05\n", " maf = 0.05\n", " Xmat = filter_X(geno$bed, imiss, maf)\n", " \n", " LD_vars = 1\n", " if (indep) {\n", " LD_vars = 1 # Initialize LD_vars\n", "\n", " if (ncausal == 1) {\n", " # If only one causal variant, just sample it\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " } else {\n", " # Repeat sampling until selected variables are quasi independent\n", " while (length(LD_vars != 0)) {\n", " vars = sample(1:ncol(Xmat), size = 3) \n", " cor_mat = cor(Xmat[, vars]) \n", " LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)\n", " }\n", " }\n", " } else {\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " }\n", " \n", " B = matrix(0, nrow = ncol(Xmat), ncol = 10)\n", " var_vec = list()\n", " rand_var_trait = sample(x = 1:10, size = 4)\n", " for(i in 1:5){\n", " if(i %in% rand_var_trait){\n", " var_vec[[i]] = c(vars[1], vars[3])\n", "\n", " }else{\n", " var_vec[[i]] = c(vars[1])\n", " }\n", " beta = sim_beta_fix_variant(G = Xmat, causal_index = var_vec[[i]], is_h2g_total = FALSE)\n", " B[, i] = beta\n", " }\n", " for(i in 6:10){\n", " if(i %in% rand_var_trait){\n", " var_vec[[i]] = c(vars[2], vars[3])\n", "\n", " }else{\n", " var_vec[[i]] = c(vars[2])\n", " }\n", " beta = sim_beta_fix_variant(G = Xmat, causal_index = var_vec[[i]], is_h2g_total = FALSE)\n", " B[, i] = beta\n", "\n", " }\n", "\n", " variant = list()\n", "\n", " for(i in 1:ncol(B)){\n", " variant[[i]] = which(B[,i] != 0)\n", " }\n", " phenotype = sim_multi_traits(G = Xmat, B = B, h2g = ${h2g}, is_h2g_total = ${\"TRUE\" if total_h2g else \"FALSE\"})\n", " phenotype = phenotype$P\n", " X = Xmat\n", " Y = phenotype\n", " data = list()\n", " data[[\"X\"]] = Xmat\n", " data[[\"Y\"]] = Y\n", " data[[\"variant\"]] = variant\n", " saveRDS(data, ${_output:r})\n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "46e8d6a7-616d-4206-860e-a4c83947ef7f", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "[simulation_552rand_complex]\n", "parameter: genofile = paths\n", "# pheno_file: give genotype file (in plink),we can read the gentype matrix. These files are separated by TADs.\n", "parameter: cwd = path(\"output\")\n", "parameter: job_size = 30\n", "parameter: walltime = \"100h\"\n", "parameter: mem = \"30G\"\n", "parameter: numThreads = 1\n", "# specify the number of causal variants\n", "parameter: n_trait = 10\n", "parameter: h2g = 0.05\n", "parameter: total_h2g = False\n", "parameter: share_pattern = \"all\"\n", "parameter: independent = False\n", "# specify the number of traits (phenotypes)\n", "parameter: container = \"\"\n", "input: genofile, group_by = 1\n", "output: f'{cwd:a}/{step_name}/sample_{_index}_h2g_{h2g}_10trait_cluster_5+5+2rand.rds'\n", "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n", "R: expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n", " library(\"MASS\")\n", " library(\"plink2R\")\n", " library(\"dplyr\")\n", " library(\"readr\")\n", " library(\"tidyverse\")\n", " # source some functions to read matrix and inpute the missing data\n", " source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n", " source(\"~/cloud_colocalization/simulation_code/misc.R\")\n", " # read the plink file\n", " # read the plink file\n", " simu_file = ${_input:r}\n", " geno <- read_plink(${_input:nr})\n", " gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n", " gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n", " # filter by distance with. TSS\n", " TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n", " keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)\n", " geno$bed = geno$bed[,keep_index]\n", " # filter out columns with missing rate > 0.1\n", " imiss = 0.1\n", " # filter out columns with MAF < 0.05\n", " maf = 0.05\n", " Xmat = filter_X(geno$bed, imiss, maf)\n", " \n", " LD_vars = 1\n", " if (indep) {\n", " LD_vars = 1 # Initialize LD_vars\n", "\n", " if (ncausal == 1) {\n", " # If only one causal variant, just sample it\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " } else {\n", " # Repeat sampling until selected variables are quasi independent\n", " while (length(LD_vars != 0)) {\n", " vars = sample(1:ncol(Xmat), size = 4) \n", " cor_mat = cor(Xmat[, vars]) \n", " LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)\n", " }\n", " }\n", " } else {\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " }\n", " \n", " B = matrix(0, nrow = ncol(Xmat), ncol = 10)\n", " var_vec = list()\n", " rand_var_trait1 = sample(x = 1:10, size = 4)\n", " rand_var_trait2 = sample(x = 1:10, size = 4)\n", " for(i in 1:5){\n", " var_vec[[i]] = c(vars[1])\n", " if(i %in% rand_var_trait1){\n", " var_vec[[i]] = c(var_vec[[i]], vars[3])\n", "\n", " }\n", " if(i %in% rand_var_trait2){\n", " var_vec[[i]] = c(var_vec[[i]], vars[4])\n", " }\n", " \n", " beta = sim_beta_fix_variant(G = Xmat, causal_index = var_vec[[i]], is_h2g_total = FALSE)\n", " B[, i] = beta\n", " }\n", " for(i in 6:10){\n", " var_vec[[i]] = c(vars[2])\n", " if(i %in% rand_var_trait1){\n", " var_vec[[i]] = c(var_vec[[i]], vars[3])\n", "\n", " }\n", " if(i %in% rand_var_trait2){\n", " var_vec[[i]] = c(var_vec[[i]], vars[4])\n", " }\n", " \n", " beta = sim_beta_fix_variant(G = Xmat, causal_index = var_vec[[i]], is_h2g_total = FALSE)\n", " B[, i] = beta\n", " }\n", "\n", " variant = list()\n", "\n", " for(i in 1:ncol(B)){\n", " variant[[i]] = which(B[,i] != 0)\n", " }\n", " phenotype = sim_multi_traits(G = Xmat, B = B, h2g = ${h2g}, is_h2g_total = ${\"TRUE\" if total_h2g else \"FALSE\"})\n", " phenotype = phenotype$P\n", " X = Xmat\n", " Y = phenotype\n", " data = list()\n", " data[[\"X\"]] = Xmat\n", " data[[\"Y\"]] = Y\n", " data[[\"variant\"]] = variant\n", " saveRDS(data, ${_output:r})" ] }, { "cell_type": "markdown", "id": "b3b90f13-7a6e-4313-a3c8-ed961e735e13", "metadata": { "kernel": "SoS" }, "source": [ "## Complex simulation - bash submission" ] }, { "cell_type": "code", "execution_count": null, "id": "66484a46-4fbd-4aab-aec6-9c99db6f1fec", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "work_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n", "job=\"simulation_55_complex\"\n", "mkdir -p ${work_dir}\n", "mkdir -p ${work_dir}/code\n", "mkdir -p ${work_dir}/log\n", "cd ${work_dir}/code\n", "\n", "# Create the base_script file and write the bash code into it\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 10:00:00\n", "#SBATCH --mem=30000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n", "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n", "\n", "source ~/mamba_activate.sh\n", "\n", "sos run /home/hs3393/cb_Mar/simulation_code/5.Simulation_secondary.ipynb JOB \\\n", " --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n", " --mem 30G --h2g 0.05 --independent \\\n", " --cwd /home/hs3393/cb_Mar/simulation_data/\n", "EOF\n", "\n", "base_sh=\"base_script\"\n", "output_script=\"${job}.sh\"\n", "cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\" > ${output_script}\n", "sbatch ${output_script}" ] }, { "cell_type": "code", "execution_count": null, "id": "341f4d04-65e8-44a5-920b-a2434c6cd777", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "work_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n", "job=\"simulation_3322_complex\"\n", "mkdir -p ${work_dir}\n", "mkdir -p ${work_dir}/code\n", "mkdir -p ${work_dir}/log\n", "cd ${work_dir}/code\n", "\n", "# Create the base_script file and write the bash code into it\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 10:00:00\n", "#SBATCH --mem=30000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n", "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n", "\n", "source ~/mamba_activate.sh\n", "\n", "sos run /home/hs3393/cb_Mar/simulation_code/5.Simulation_secondary.ipynb JOB \\\n", " --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n", " --mem 30G --h2g 0.05 --independent \\\n", " --cwd /home/hs3393/cb_Mar/simulation_data/\n", "EOF\n", "\n", "base_sh=\"base_script\"\n", "output_script=\"${job}.sh\"\n", "cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\" > ${output_script}\n", "sbatch ${output_script}" ] }, { "cell_type": "code", "execution_count": null, "id": "50cc8c23-634f-49dc-aefe-7e6faa11d3a0", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "work_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n", "job=\"simulation_551rand_complex\"\n", "mkdir -p ${work_dir}\n", "mkdir -p ${work_dir}/code\n", "mkdir -p ${work_dir}/log\n", "cd ${work_dir}/code\n", "\n", "# Create the base_script file and write the bash code into it\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 10:00:00\n", "#SBATCH --mem=30000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n", "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n", "\n", "source ~/mamba_activate.sh\n", "\n", "sos run /home/hs3393/cb_Mar/simulation_code/5.Simulation_secondary.ipynb JOB \\\n", " --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n", " --mem 30G --h2g 0.05 --independent \\\n", " --cwd /home/hs3393/cb_Mar/simulation_data/\n", "EOF\n", "\n", "base_sh=\"base_script\"\n", "output_script=\"${job}.sh\"\n", "cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\" > ${output_script}\n", "sbatch ${output_script}" ] }, { "cell_type": "code", "execution_count": null, "id": "c1ae576e-5806-4882-8f28-704474ad5aac", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "work_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n", "job=\"simulation_552rand_complex\"\n", "mkdir -p ${work_dir}\n", "mkdir -p ${work_dir}/code\n", "mkdir -p ${work_dir}/log\n", "cd ${work_dir}/code\n", "\n", "# Create the base_script file and write the bash code into it\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 10:00:00\n", "#SBATCH --mem=30000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n", "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n", "\n", "source ~/mamba_activate.sh\n", "\n", "sos run /home/hs3393/cb_Mar/simulation_code/5.Simulation_secondary.ipynb JOB \\\n", " --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n", " --mem 30G --h2g 0.05 --independent \\\n", " --cwd /home/hs3393/cb_Mar/simulation_data/\n", "EOF\n", "\n", "base_sh=\"base_script\"\n", "output_script=\"${job}.sh\"\n", "cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\" > ${output_script}\n", "sbatch ${output_script}" ] }, { "cell_type": "markdown", "id": "de94d812-a998-4348-9714-3a41b0d75ceb", "metadata": { "kernel": "SoS" }, "source": [ "## Run Colocboost" ] }, { "cell_type": "code", "execution_count": null, "id": "6ef1d1d2-5a30-4fb5-9fe7-59c5e0ebc5e7", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "data_dir=\"/home/hs3393/cb_Mar/simulation_data//simulation_3322_complex/\"\n", "job=\"simulation_3322_complex\"\n", "work_dir=\"/home/hs3393/cb_Mar/simulation_result/complex_simulation/\"\n", "#!/bin/bash\n", "\n", "mkdir -p ${work_dir}/${job}/code\n", "mkdir -p ${work_dir}/${job}/log\n", "mkdir -p ${work_dir}/${job}/result\n", "\n", "cd ${work_dir}/${job}/code\n", "\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 100:00:00\n", "#SBATCH --mem=40000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n", "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n", "\n", "source /home/hs3393/mamba_activate.sh\n", "module load Singularity\n", "\n", "cd DATA_DIR\n", "sos run /home/hs3393/cb_Mar/simulation_code/2.Run_Colocboost.ipynb colocboost \\\n", " --simufile $(find -type f -name '*.rds') \\\n", " --mem 40G --trait 10 \\\n", " --cwd WORK_DIR/JOB/result\n", "EOF\n", "\n", "\n", "base_script=\"base_script\"\n", "output_script=\"job_${job}.sh\"\n", "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\" > ${output_script}\n", "sbatch ${output_script}\n", "\n", "data_dir=\"/home/hs3393/cb_Mar/simulation_data//simulation_55_complex/\"\n", "job=\"simulation_55_complex\"\n", "work_dir=\"/home/hs3393/cb_Mar/simulation_result/complex_simulation/\"\n", "#!/bin/bash\n", "\n", "mkdir -p ${work_dir}/${job}/code\n", "mkdir -p ${work_dir}/${job}/log\n", "mkdir -p ${work_dir}/${job}/result\n", "\n", "cd ${work_dir}/${job}/code\n", "\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 100:00:00\n", "#SBATCH --mem=40000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n", "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n", "\n", "source /home/hs3393/mamba_activate.sh\n", "module load Singularity\n", "\n", "cd DATA_DIR\n", "sos run /home/hs3393/cb_Mar/simulation_code/2.Run_Colocboost.ipynb colocboost \\\n", " --simufile $(find -type f -name '*.rds') \\\n", " --mem 40G --trait 10 \\\n", " --cwd WORK_DIR/JOB/result\n", "EOF\n", "\n", "\n", "base_script=\"base_script\"\n", "output_script=\"job_${job}.sh\"\n", "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\" > ${output_script}\n", "sbatch ${output_script}\n", "\n", "data_dir=\"/home/hs3393/cb_Mar/simulation_data//simulation_551rand_complex/\"\n", "job=\"simulation_551rand_complex\"\n", "work_dir=\"/home/hs3393/cb_Mar/simulation_result/complex_simulation/\"\n", "#!/bin/bash\n", "\n", "mkdir -p ${work_dir}/${job}/code\n", "mkdir -p ${work_dir}/${job}/log\n", "mkdir -p ${work_dir}/${job}/result\n", "\n", "cd ${work_dir}/${job}/code\n", "\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 100:00:00\n", "#SBATCH --mem=40000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n", "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n", "\n", "source /home/hs3393/mamba_activate.sh\n", "module load Singularity\n", "\n", "cd DATA_DIR\n", "sos run /home/hs3393/cb_Mar/simulation_code/2.Run_Colocboost.ipynb colocboost \\\n", " --simufile $(find -type f -name '*.rds') \\\n", " --mem 40G --trait 10 \\\n", " --cwd WORK_DIR/JOB/result\n", "EOF\n", "\n", "\n", "base_script=\"base_script\"\n", "output_script=\"job_${job}.sh\"\n", "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\" > ${output_script}\n", "sbatch ${output_script}\n", "\n", "data_dir=\"/home/hs3393/cb_Mar/simulation_data/simulation_552rand_complex/\"\n", "job=\"simulation_552rand_complex\"\n", "work_dir=\"/home/hs3393/cb_Mar/simulation_result/complex_simulation/\"\n", "#!/bin/bash\n", "\n", "mkdir -p ${work_dir}/${job}/code\n", "mkdir -p ${work_dir}/${job}/log\n", "mkdir -p ${work_dir}/${job}/result\n", "\n", "cd ${work_dir}/${job}/code\n", "\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 100:00:00\n", "#SBATCH --mem=40000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n", "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n", "\n", "source /home/hs3393/mamba_activate.sh\n", "module load Singularity\n", "\n", "cd DATA_DIR\n", "sos run /home/hs3393/cb_Mar/simulation_code/2.Run_Colocboost.ipynb colocboost \\\n", " --simufile $(find -type f -name '*.rds') \\\n", " --mem 40G --trait 10 \\\n", " --cwd WORK_DIR/JOB/result\n", "EOF\n", "\n", "\n", "base_script=\"base_script\"\n", "output_script=\"job_${job}.sh\"\n", "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\" > ${output_script}\n", "sbatch ${output_script}\n" ] }, { "cell_type": "markdown", "id": "632c5bb4-5e74-4a10-be27-0a8821f41039", "metadata": { "kernel": "SoS" }, "source": [ "## Run Hyprcoloc" ] }, { "cell_type": "code", "execution_count": null, "id": "abd463f2-30ee-43f3-a502-431ec581f64e", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "data_dir=\"/home/hs3393/cb_Mar/simulation_data//simulation_3322_complex/\"\n", "job=\"simulation_3322_complex\"\n", "work_dir=\"/home/hs3393/cb_Mar/simulation_result/hyprcoloc/complex_simulation/\"\n", "#!/bin/bash\n", "\n", "mkdir -p ${work_dir}/${job}/code\n", "mkdir -p ${work_dir}/${job}/log\n", "mkdir -p ${work_dir}/${job}/result\n", "\n", "cd ${work_dir}/${job}/code\n", "\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 100:00:00\n", "#SBATCH --mem=40000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n", "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n", "\n", "source /home/hs3393/mamba_activate.sh\n", "module load Singularity\n", "\n", "cd DATA_DIR\n", "sos run /home/hs3393/cb_Mar/simulation_code/3.Other_Methods.ipynb hyprcoloc_set \\\n", " --simufile $(find -type f -name '*.rds') \\\n", " --mem 40G --trait 10 \\\n", " --cwd WORK_DIR/JOB/result\n", "EOF\n", "\n", "\n", "base_script=\"base_script\"\n", "output_script=\"job_${job}.sh\"\n", "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\" > ${output_script}\n", "sbatch ${output_script}\n", "\n", "data_dir=\"/home/hs3393/cb_Mar/simulation_data//simulation_55_complex/\"\n", "job=\"simulation_55_complex\"\n", "work_dir=\"/home/hs3393/cb_Mar/simulation_result/hyprcoloc/complex_simulation/\"\n", "#!/bin/bash\n", "\n", "mkdir -p ${work_dir}/${job}/code\n", "mkdir -p ${work_dir}/${job}/log\n", "mkdir -p ${work_dir}/${job}/result\n", "\n", "cd ${work_dir}/${job}/code\n", "\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 100:00:00\n", "#SBATCH --mem=40000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n", "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n", "\n", "source /home/hs3393/mamba_activate.sh\n", "module load Singularity\n", "\n", "cd DATA_DIR\n", "sos run /home/hs3393/cb_Mar/simulation_code/3.Other_Methods.ipynb hyprcoloc_set \\\n", " --simufile $(find -type f -name '*.rds') \\\n", " --mem 40G --trait 10 \\\n", " --cwd WORK_DIR/JOB/result\n", "EOF\n", "\n", "\n", "base_script=\"base_script\"\n", "output_script=\"job_${job}.sh\"\n", "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\" > ${output_script}\n", "sbatch ${output_script}\n", "\n", "data_dir=\"/home/hs3393/cb_Mar/simulation_data//simulation_551rand_complex/\"\n", "job=\"simulation_551rand_complex\"\n", "work_dir=\"/home/hs3393/cb_Mar/simulation_result/hyprcoloc/complex_simulation/\"\n", "#!/bin/bash\n", "\n", "mkdir -p ${work_dir}/${job}/code\n", "mkdir -p ${work_dir}/${job}/log\n", "mkdir -p ${work_dir}/${job}/result\n", "\n", "cd ${work_dir}/${job}/code\n", "\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 100:00:00\n", "#SBATCH --mem=40000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n", "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n", "\n", "source /home/hs3393/mamba_activate.sh\n", "module load Singularity\n", "\n", "cd DATA_DIR\n", "sos run /home/hs3393/cb_Mar/simulation_code/3.Other_Methods.ipynb hyprcoloc_set \\\n", " --simufile $(find -type f -name '*.rds') \\\n", " --mem 40G --trait 10 \\\n", " --cwd WORK_DIR/JOB/result\n", "EOF\n", "\n", "\n", "base_script=\"base_script\"\n", "output_script=\"job_${job}.sh\"\n", "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\" > ${output_script}\n", "sbatch ${output_script}\n", "\n", "data_dir=\"/home/hs3393/cb_Mar/simulation_data/simulation_552rand_complex/\"\n", "job=\"simulation_552rand_complex\"\n", "work_dir=\"/home/hs3393/cb_Mar/simulation_result/hyprcoloc/complex_simulation/\"\n", "#!/bin/bash\n", "\n", "mkdir -p ${work_dir}/${job}/code\n", "mkdir -p ${work_dir}/${job}/log\n", "mkdir -p ${work_dir}/${job}/result\n", "\n", "cd ${work_dir}/${job}/code\n", "\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 100:00:00\n", "#SBATCH --mem=40000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n", "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n", "\n", "source /home/hs3393/mamba_activate.sh\n", "module load Singularity\n", "\n", "cd DATA_DIR\n", "sos run /home/hs3393/cb_Mar/simulation_code/3.Other_Methods.ipynb hyprcoloc_set \\\n", " --simufile $(find -type f -name '*.rds') \\\n", " --mem 40G --trait 10 \\\n", " --cwd WORK_DIR/JOB/result\n", "EOF\n", "\n", "\n", "base_script=\"base_script\"\n", "output_script=\"job_${job}.sh\"\n", "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\" > ${output_script}\n", "sbatch ${output_script}\n" ] } ], "metadata": { "kernelspec": { "display_name": "SoS", "language": "sos", "name": "sos" }, "language_info": { "codemirror_mode": "sos", "file_extension": ".sos", "mimetype": "text/x-sos", "name": "sos", "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", "pygments_lexer": "sos" }, "sos": { "kernels": [ [ "R", "ir", "R", "", "" ], [ "SoS", "sos", "", "", "sos" ] ], "version": "0.24.3" } }, "nbformat": 4, "nbformat_minor": 5 }