{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "64d84e81-a286-4825-8940-b122f55be68b",
   "metadata": {
    "kernel": "SoS"
   },
   "source": [
    "# Secondary simulations\n",
    "\n",
    "This notebook include data simulation of 50 traits, and complex configurations suggested in HyprColoc paper."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8ec89af-da53-456f-bbfc-58175ab5e0b5",
   "metadata": {
    "kernel": "SoS"
   },
   "source": [
    "## Goal\n",
    "\n",
    "Because we don't have 50 traits to estimate and reflect the true configurations, we used a different approach: for each causal variant,\n",
    "we randomly select 10-25 traits to colocalize on that variant.\n",
    "\n",
    "We also simulated 10 traits complex configuration cases described and extended from Hyprcoloc paper."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d795264f-41f2-48ad-aa11-a6e019d25a9b",
   "metadata": {
    "kernel": "SoS",
    "tags": []
   },
   "source": [
    "## Input\n",
    "\n",
    "`genofile`: plink file of real genotyope, `/mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/plink_by_gene/extended_cis_before_winsorize_plink_files/*.bim`\n",
    "\n",
    "The other parameters can be found in simxQTL repo. `https://github.com/StatFunGen/simxQTL`.\n",
    "    \n",
    "## Output\n",
    "\n",
    "An rds matrix, with genotype matrix X (dimension: m * n, m: number of sample, n: number of SNP ) and phenotype (trait) matrix (dimension: m * a, m : number of samples, a: number of simulated traits) \n",
    "\n",
    "Example output:    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3199cef0-90b6-4489-8aff-eb28f65ec34f",
   "metadata": {
    "kernel": "R",
    "tags": []
   },
   "outputs": [],
   "source": [
    "result = readRDS(\"/home/hs3393/cb_Mar/simulation_data/simulation_551rand_complex/sample_39_h2g_0.05_10trait_cluster_5+5+1rand.rds\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "9312870e-902e-4293-83a6-cfe9c341633a",
   "metadata": {
    "kernel": "R",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<ol>\n",
       "\t<li><style>\n",
       ".list-inline {list-style: none; margin:0; padding: 0}\n",
       ".list-inline>li {display: inline-block}\n",
       ".list-inline>li:not(:last-child)::after {content: \"\\00b7\"; padding: 0 .5ex}\n",
       "</style>\n",
       "<ol class=list-inline><li>3620</li><li>5240</li></ol>\n",
       "</li>\n",
       "\t<li>3620</li>\n",
       "\t<li>3620</li>\n",
       "\t<li><style>\n",
       ".list-inline {list-style: none; margin:0; padding: 0}\n",
       ".list-inline>li {display: inline-block}\n",
       ".list-inline>li:not(:last-child)::after {content: \"\\00b7\"; padding: 0 .5ex}\n",
       "</style>\n",
       "<ol class=list-inline><li>3620</li><li>5240</li></ol>\n",
       "</li>\n",
       "\t<li>3620</li>\n",
       "\t<li><style>\n",
       ".list-inline {list-style: none; margin:0; padding: 0}\n",
       ".list-inline>li {display: inline-block}\n",
       ".list-inline>li:not(:last-child)::after {content: \"\\00b7\"; padding: 0 .5ex}\n",
       "</style>\n",
       "<ol class=list-inline><li>250</li><li>5240</li></ol>\n",
       "</li>\n",
       "\t<li>250</li>\n",
       "\t<li>250</li>\n",
       "\t<li>250</li>\n",
       "\t<li><style>\n",
       ".list-inline {list-style: none; margin:0; padding: 0}\n",
       ".list-inline>li {display: inline-block}\n",
       ".list-inline>li:not(:last-child)::after {content: \"\\00b7\"; padding: 0 .5ex}\n",
       "</style>\n",
       "<ol class=list-inline><li>250</li><li>5240</li></ol>\n",
       "</li>\n",
       "</ol>\n"
      ],
      "text/latex": [
       "\\begin{enumerate}\n",
       "\\item \\begin{enumerate*}\n",
       "\\item 3620\n",
       "\\item 5240\n",
       "\\end{enumerate*}\n",
       "\n",
       "\\item 3620\n",
       "\\item 3620\n",
       "\\item \\begin{enumerate*}\n",
       "\\item 3620\n",
       "\\item 5240\n",
       "\\end{enumerate*}\n",
       "\n",
       "\\item 3620\n",
       "\\item \\begin{enumerate*}\n",
       "\\item 250\n",
       "\\item 5240\n",
       "\\end{enumerate*}\n",
       "\n",
       "\\item 250\n",
       "\\item 250\n",
       "\\item 250\n",
       "\\item \\begin{enumerate*}\n",
       "\\item 250\n",
       "\\item 5240\n",
       "\\end{enumerate*}\n",
       "\n",
       "\\end{enumerate}\n"
      ],
      "text/markdown": [
       "1. 1. 3620\n",
       "2. 5240\n",
       "\n",
       "\n",
       "\n",
       "2. 3620\n",
       "3. 3620\n",
       "4. 1. 3620\n",
       "2. 5240\n",
       "\n",
       "\n",
       "\n",
       "5. 3620\n",
       "6. 1. 250\n",
       "2. 5240\n",
       "\n",
       "\n",
       "\n",
       "7. 250\n",
       "8. 250\n",
       "9. 250\n",
       "10. 1. 250\n",
       "2. 5240\n",
       "\n",
       "\n",
       "\n",
       "\n",
       "\n"
      ],
      "text/plain": [
       "[[1]]\n",
       "[1] 3620 5240\n",
       "\n",
       "[[2]]\n",
       "[1] 3620\n",
       "\n",
       "[[3]]\n",
       "[1] 3620\n",
       "\n",
       "[[4]]\n",
       "[1] 3620 5240\n",
       "\n",
       "[[5]]\n",
       "[1] 3620\n",
       "\n",
       "[[6]]\n",
       "[1]  250 5240\n",
       "\n",
       "[[7]]\n",
       "[1] 250\n",
       "\n",
       "[[8]]\n",
       "[1] 250\n",
       "\n",
       "[[9]]\n",
       "[1] 250\n",
       "\n",
       "[[10]]\n",
       "[1]  250 5240\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "result$variant"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c067f072-6983-4079-9d0b-fab6a54569b5",
   "metadata": {
    "kernel": "R"
   },
   "source": [
    "In this region, we simulated **3** causal variants (250, 3620, 5240). Each causal variant is distributed in **10** traits."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c8deff5-b934-458e-ad12-929dda6c36df",
   "metadata": {
    "kernel": "SoS",
    "tags": []
   },
   "source": [
    "## Simulation code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b14c7fb-57a8-4402-8175-cfe9bbcf7bb0",
   "metadata": {
    "kernel": "SoS",
    "tags": []
   },
   "outputs": [],
   "source": [
    "[simulation_50trait]\n",
    "parameter: genofile = paths\n",
    "# pheno_file: give genotype file (in plink)，we can read the gentype matrix. These files are separated by TADs.\n",
    "parameter: cwd = path(\"output\")\n",
    "parameter: job_size = 30\n",
    "parameter: walltime = \"100h\"\n",
    "parameter: mem = \"30G\"\n",
    "parameter: numThreads = 1\n",
    "parameter: independent = False\n",
    "# for each variant, how many traits it randomly colocalize at\n",
    "parameter: n_trait = 50\n",
    "parameter: h2g = 0.05\n",
    "parameter: ncausal = 5\n",
    "parameter: share_pattern = \"all\"\n",
    "# specify the number of traits (phenotypes)\n",
    "parameter: container = \"\"\n",
    "input: genofile, group_by = 1\n",
    "output: f'{cwd:a}/{step_name}/sample_{_index}_h2g_{h2g}_50_trait_ncausal_{ncausal}.rds'\n",
    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n",
    "R:  expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n",
    "    library(\"MASS\")\n",
    "    library(\"plink2R\")\n",
    "    library(\"dplyr\")\n",
    "    library(\"readr\")\n",
    "    library(\"tidyverse\")\n",
    "    # source some functions to read matrix and inpute the missing data\n",
    "    source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n",
    "    source(\"~/cloud_colocalization/simulation_code/misc.R\")\n",
    "    # read the plink file\n",
    "    simu_file = ${_input:r}\n",
    "    geno <- read_plink(${_input:nr})\n",
    "    gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n",
    "    gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n",
    "    # filter by distance with. TSS\n",
    "    TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n",
    "    keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)\n",
    "    geno$bed = geno$bed[,keep_index]\n",
    "    # filter out columns with missing rate > 0.1\n",
    "    imiss = 0.1\n",
    "    # filter out columns with MAF < 0.05\n",
    "    maf = 0.05\n",
    "    Xmat = filter_X(geno$bed, imiss, maf)\n",
    "    ncausal = ${ncausal}\n",
    "    indep = ${\"TRUE\" if independent else \"FALSE\"}\n",
    "  \n",
    "    if (indep) {\n",
    "        LD_vars = 1  # Initialize LD_vars\n",
    "\n",
    "        if (ncausal == 1) {\n",
    "            # If only one causal variant, just sample it\n",
    "            vars = sample(1:ncol(Xmat), size = ncausal)\n",
    "        } else {\n",
    "            # Repeat sampling until selected variables are quasi independent\n",
    "            while (length(LD_vars != 0)) {\n",
    "                vars = sample(1:ncol(Xmat), size = ncausal)  \n",
    "                cor_mat = cor(Xmat[, vars]) \n",
    "                LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)\n",
    "            }\n",
    "        }\n",
    "    } else {\n",
    "        vars = sample(1:ncol(Xmat), size = ncausal)\n",
    "    }\n",
    "\n",
    "    vars = sample(1:ncol(Xmat), size = ncausal)\n",
    "    B = matrix(0, nrow =  ncol(Xmat), ncol = 50)\n",
    "  \n",
    "    trait_number_vector = sample(x = c(10:25), size = ncausal, replace = TRUE)\n",
    "  \n",
    "    trait_list = list()\n",
    "    for(i in 1:ncausal){\n",
    "      trait_list[[i]] = sample(x = 1:50, size =  trait_number_vector[i])\n",
    "      \n",
    "    }\n",
    "      \n",
    "  \n",
    "    \n",
    "    phenotype = list()\n",
    "    for(i in 1:50){\n",
    "        index = which(unlist(lapply(trait_list, function(x) i %in% x)))\n",
    "         if(length(index) > 0){            \n",
    "            beta = sim_beta_fix_variant(G = Xmat, causal_index = vars[index], is_h2g_total = FALSE)\n",
    "            B[, i] = beta\n",
    "            pheno_single = sim_multi_traits(G = Xmat, B = B[,i, drop = FALSE], h2g = 0.05, is_h2g_total = FALSE)\n",
    "            phenotype[[i]] = pheno_single$P\n",
    "          }else{\n",
    "            pheno_single = sim_multi_traits(G = Xmat, B = B[,i, drop = FALSE], h2g = 0.05, is_h2g_total = FALSE)\n",
    "            phenotype[[i]] = pheno_single$P\n",
    "  \n",
    "          }\n",
    "\n",
    "    }\n",
    "    variant = list()\n",
    "    for(i in 1:ncol(B)){\n",
    "        variant[[i]] = which(B[,i] != 0)\n",
    "    }\n",
    "    X = Xmat\n",
    "    Y = bind_cols(phenotype)\n",
    "    colnames(Y) = paste0(\"Trait\", c(1:50))\n",
    "    data = list()\n",
    "    data[[\"X\"]] = Xmat\n",
    "    data[[\"Y\"]] = Y\n",
    "    data[[\"variant\"]] = variant\n",
    "    saveRDS(data, ${_output:r})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6db48ae-0df7-4a97-9a8e-eee894698ce4",
   "metadata": {
    "kernel": "SoS"
   },
   "source": [
    "## Phenotype Simulation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42ea6b19-0f12-43ea-b0ab-86e7324f73f7",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "work_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n",
    "job=\"simulation_50trait\"\n",
    "mkdir -p ${work_dir}\n",
    "mkdir -p ${work_dir}/code\n",
    "mkdir -p ${work_dir}/log\n",
    "cd ${work_dir}/code\n",
    "\n",
    "# Create the base_script file and write the bash code into it\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 10:00:00\n",
    "#SBATCH --mem=30000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n",
    "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n",
    "\n",
    "source ~/mamba_activate.sh\n",
    "\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/5.Simulation_secondary.ipynb JOB \\\n",
    "        --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n",
    "        --ncausal CAUSAL --mem 30G --h2g 0.05 --independent \\\n",
    "        --cwd /home/hs3393/cb_Mar/simulation_data/\n",
    "EOF\n",
    "\n",
    "for ncausal in 1 2 3 4 5; do\n",
    "    base_sh=\"base_script\"\n",
    "    output_script=\"${job}_causal_${ncausal}.sh\"\n",
    "    cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\"|  sed \"s|CAUSAL|${ncausal}|g\"  > ${output_script}\n",
    "    sbatch ${output_script}\n",
    "done"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42176543-bef9-4fad-9b66-72b74e81c1e3",
   "metadata": {
    "kernel": "SoS"
   },
   "source": [
    "## 50 trait: Run ColocBoost"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "662e1430-abb8-40eb-908c-325ed87bdaf3",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "data_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n",
    "job=\"simulation_50trait\"\n",
    "work_dir=\"/home/hs3393/cb_Mar/simulation_result/\"\n",
    "\n",
    "mkdir -p ${work_dir}/${job}/code\n",
    "mkdir -p ${work_dir}/${job}/log\n",
    "mkdir -p ${work_dir}/${job}/result\n",
    "\n",
    "cd ${work_dir}/${job}/code\n",
    "\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 80:00:00\n",
    "#SBATCH --mem=30000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n",
    "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n",
    "\n",
    "source /home/hs3393/mamba_activate.sh\n",
    "module load Singularity\n",
    "\n",
    "cd DATA_DIR/JOB\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/2.Run_Colocboost.ipynb colocboost \\\n",
    "    --simufile $(find -type f -name '*_ncausal_NCAUSAL*.rds') \\\n",
    "    --mem 40G --trait 50 \\\n",
    "    --cwd WORK_DIR/JOB/result\n",
    "EOF\n",
    "\n",
    "\n",
    "base_script=\"base_script\"\n",
    "for ncausal in 1 2 3 4 5; do\n",
    "    output_script=\"ncausal_${ncausal}.sh\"\n",
    "    cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|NCAUSAL|${ncausal}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\"  > ${output_script}\n",
    "    sbatch ${output_script}\n",
    "done"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7eb53040-3036-47d1-8e31-8067b6f52204",
   "metadata": {
    "kernel": "SoS"
   },
   "source": [
    "## 50 trait: Run Hyprcoloc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4044ef4d-7c0b-46d4-b119-b06a9abde27f",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "data_dir=\"/home/hs3393/cb_Mar/simulation_data/simulation_50trait/\"\n",
    "job=\"simulation_50trait\"\n",
    "work_dir=\"/home/hs3393/cb_Mar/simulation_result/hyprcoloc/\"\n",
    "#!/bin/bash\n",
    "\n",
    "mkdir -p ${work_dir}/${job}/code\n",
    "mkdir -p ${work_dir}/${job}/log\n",
    "mkdir -p ${work_dir}/${job}/result\n",
    "\n",
    "cd ${work_dir}/${job}/code\n",
    "\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 80:00:00\n",
    "#SBATCH --mem=20000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n",
    "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n",
    "\n",
    "source /home/hs3393/mamba_activate.sh\n",
    "module load Singularity\n",
    "\n",
    "cd DATA_DIR\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/3.Other_Methods.ipynb hyprcoloc_set \\\n",
    "    --simufile $(find -type f -name '*_ncausal_NCAUSAL*.rds') \\\n",
    "    --mem 20G --trait 50 \\\n",
    "    --cwd WORK_DIR/JOB/result\n",
    "EOF\n",
    "\n",
    "\n",
    "base_script=\"base_script\"\n",
    "for ncausal in 1 2 3 4 5; do\n",
    "    output_script=\"ncausal_${ncausal}.sh\"\n",
    "    cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|NCAUSAL|${ncausal}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\"  > ${output_script}\n",
    "    sbatch ${output_script}\n",
    "done"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e42bbf2-2c53-460e-9d6b-e6a8cf9c88f8",
   "metadata": {
    "kernel": "SoS"
   },
   "source": [
    "## 50 trait: Colocboost summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15dd0249-ec4e-4322-8365-e8d742460ef4",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "data_dir=\"/home/hs3393/cb_Mar/simulation_result/simulation_50trait/\"\n",
    "mkdir -p ${data_dir}/summary\n",
    "cd ${data_dir}/summary\n",
    "\n",
    "cat << 'EOF' > summary_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 8:00:00\n",
    "#SBATCH --mem=30000\n",
    "#SBATCH -J sum\n",
    "#SBATCH -o DATA_DIR/log/summary.\"%j\".out\n",
    "#SBATCH -e DATA_DIR/log/summary.\"%j\".err\n",
    "\n",
    "source ~/mamba_activate.sh\n",
    "\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/4.Result_Summary.ipynb coloc_summary \\\n",
    "    --folder DATA_DIR/result \\\n",
    "    --cwd DATA_DIR/summary\n",
    "EOF\n",
    "\n",
    "\n",
    "base_script=\"summary_script\"\n",
    "output_script=\"summary.sh\"\n",
    "cat ${base_script}|  sed \"s|DATA_DIR|${data_dir}|g\"  > ${output_script}\n",
    "sbatch ${output_script}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "413263e6-31c6-4e6b-9e2b-063ab44a9293",
   "metadata": {
    "kernel": "SoS"
   },
   "source": [
    "## 50 trait, Hyprcoloc summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f838cdd0-4456-481e-ae9f-6ff832bac0c3",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "data_dir=\"/home/hs3393/cb_Mar/simulation_result/hyprcoloc/simulation_50trait/\"\n",
    "mkdir -p ${data_dir}/summary\n",
    "cd ${data_dir}/summary\n",
    "\n",
    "cat << 'EOF' > summary_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 8:00:00\n",
    "#SBATCH --mem=30000\n",
    "#SBATCH -J sum\n",
    "#SBATCH -o DATA_DIR/log/summary.\"%j\".out\n",
    "#SBATCH -e DATA_DIR/log/summary.\"%j\".err\n",
    "\n",
    "source ~/mamba_activate.sh\n",
    "\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/4.Result_Summary.ipynb coloc_summary \\\n",
    "    --folder DATA_DIR/result \\\n",
    "    --cwd DATA_DIR/summary\n",
    "EOF\n",
    "\n",
    "\n",
    "base_script=\"summary_script\"\n",
    "output_script=\"summary.sh\"\n",
    "cat ${base_script}|  sed \"s|DATA_DIR|${data_dir}|g\"  > ${output_script}\n",
    "sbatch ${output_script}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f3d5e75-0fd1-4c8e-8bf0-3bc2eaf54c1a",
   "metadata": {
    "kernel": "SoS"
   },
   "source": [
    "## Complex simulation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e7bd40f-c0ad-4811-9646-de373b7ac43b",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "[simulation_55_complex]\n",
    "parameter: genofile = paths\n",
    "# pheno_file: give genotype file (in plink)，we can read the gentype matrix. These files are separated by TADs.\n",
    "parameter: cwd = path(\"output\")\n",
    "parameter: job_size = 30\n",
    "parameter: walltime = \"100h\"\n",
    "parameter: mem = \"30G\"\n",
    "parameter: numThreads = 1\n",
    "# specify the number of causal variants\n",
    "parameter: n_trait = 10\n",
    "parameter: h2g = 0.05\n",
    "parameter: total_h2g = False\n",
    "parameter: share_pattern = \"all\"\n",
    "parameter: independent = False\n",
    "# specify the number of traits (phenotypes)\n",
    "parameter: container = \"\"\n",
    "input: genofile, group_by = 1\n",
    "output: f'{cwd:a}/{step_name}/sample_{_index}_h2g_{h2g}_10trait_cluster_5+5.rds'\n",
    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n",
    "R:  expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n",
    "    library(\"MASS\")\n",
    "    library(\"plink2R\")\n",
    "    library(\"dplyr\")\n",
    "    library(\"readr\")\n",
    "    library(\"tidyverse\")\n",
    "    # source some functions to read matrix and inpute the missing data\n",
    "    source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n",
    "    source(\"~/cloud_colocalization/simulation_code/misc.R\")\n",
    "    # read the plink file\n",
    "    simu_file = ${_input:r}\n",
    "    geno <- read_plink(${_input:nr})\n",
    "    gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n",
    "    gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n",
    "    # filter by distance with. TSS\n",
    "    TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n",
    "    keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)\n",
    "    geno$bed = geno$bed[,keep_index]\n",
    "    # filter out columns with missing rate > 0.1\n",
    "    imiss = 0.1\n",
    "    # filter out columns with MAF < 0.05\n",
    "    maf = 0.05\n",
    "    Xmat = filter_X(geno$bed, imiss, maf)\n",
    "    indep = ${\"TRUE\" if independent else \"FALSE\"}\n",
    "    if(indep){ LD_thresh = 0.3} else{ LD_thresh = 1}\n",
    "    LD_vars = 1\n",
    "    while(length(LD_vars != 0)){\n",
    "      \n",
    "        B1 = sim_beta(G = Xmat, ncausal = 1, ntrait = 5, \n",
    "                     is_h2g_total = FALSE, shared_pattern = \"all\")\n",
    "        B2 = sim_beta(G = Xmat, ncausal = 1, ntrait = 5, \n",
    "                     is_h2g_total = FALSE, shared_pattern = \"all\")\n",
    "        B = cbind(B1, B2)\n",
    "        variant = list()\n",
    "        for(i in 1:ncol(B)){\n",
    "          variant[[i]] = which(B[,i] != 0)\n",
    "        }\n",
    "\n",
    "        var_mat = unique(unlist(variant))\n",
    "        cor_mat = cor(Xmat[,var_mat]) \n",
    "        LD_vars = which(colSums(abs(cor_mat) > LD_thresh) > 1)\n",
    "    }\n",
    "    phenotype = sim_multi_traits(G = Xmat, B = B, h2g = ${h2g}, is_h2g_total = ${\"TRUE\" if total_h2g else \"FALSE\"})\n",
    "    phenotype = phenotype$P\n",
    "    X = Xmat\n",
    "    Y = phenotype\n",
    "    data = list()\n",
    "    data[[\"X\"]] = Xmat\n",
    "    data[[\"Y\"]] = Y\n",
    "    data[[\"variant\"]] = variant\n",
    "    saveRDS(data, ${_output:r})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13abddbc-b9e1-4f93-ab13-c5bca02776a8",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "[simulation_3322_complex]\n",
    "parameter: genofile = paths\n",
    "# pheno_file: give genotype file (in plink)，we can read the gentype matrix. These files are separated by TADs.\n",
    "parameter: cwd = path(\"output\")\n",
    "parameter: job_size = 30\n",
    "parameter: walltime = \"100h\"\n",
    "parameter: mem = \"30G\"\n",
    "parameter: numThreads = 1\n",
    "# specify the number of causal variants\n",
    "parameter: n_trait = 10\n",
    "parameter: h2g = 0.05\n",
    "parameter: total_h2g = False\n",
    "parameter: share_pattern = \"all\"\n",
    "parameter: independent = False\n",
    "# specify the number of traits (phenotypes)\n",
    "parameter: container = \"\"\n",
    "input: genofile, group_by = 1\n",
    "output: f'{cwd:a}/{step_name}//sample_{_index}_h2g_{h2g}_10trait_cluster_3+3+2+2.rds'\n",
    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n",
    "R:  expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n",
    "    library(\"MASS\")\n",
    "    library(\"plink2R\")\n",
    "    library(\"dplyr\")\n",
    "    library(\"readr\")\n",
    "    library(\"tidyverse\")\n",
    "    # source some functions to read matrix and inpute the missing data\n",
    "    source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n",
    "    source(\"~/cloud_colocalization/simulation_code/misc.R\")\n",
    "    # read the plink file\n",
    "    simu_file = ${_input:r}\n",
    "    geno <- read_plink(${_input:nr})\n",
    "    gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n",
    "    gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n",
    "    # filter by distance with. TSS\n",
    "    TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n",
    "    keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)\n",
    "    geno$bed = geno$bed[,keep_index]\n",
    "    # filter out columns with missing rate > 0.1\n",
    "    imiss = 0.1\n",
    "    # filter out columns with MAF < 0.05\n",
    "    maf = 0.05\n",
    "    Xmat = filter_X(geno$bed, imiss, maf)\n",
    "    indep = ${\"TRUE\" if independent else \"FALSE\"}\n",
    "    if(indep){ LD_thresh = 0.3} else{ LD_thresh = 1}\n",
    "    LD_vars = 1\n",
    "    while(length(LD_vars != 0)){\n",
    "        B1 = sim_beta(G = Xmat, ncausal = 1, ntrait = 3, \n",
    "                     is_h2g_total = FALSE, shared_pattern = \"all\")\n",
    "        B2 = sim_beta(G = Xmat, ncausal = 1, ntrait = 3, \n",
    "                     is_h2g_total = FALSE, shared_pattern = \"all\")\n",
    "        B3 = sim_beta(G = Xmat, ncausal = 1, ntrait = 2, \n",
    "                     is_h2g_total = FALSE, shared_pattern = \"all\")\n",
    "        B4 = sim_beta(G = Xmat, ncausal = 1, ntrait = 2, \n",
    "                     is_h2g_total = FALSE, shared_pattern = \"all\")\n",
    "        B = cbind(B1, B2, B3, B4)\n",
    "        variant = list()\n",
    "        for(i in 1:ncol(B)){\n",
    "          variant[[i]] = which(B[,i] != 0)\n",
    "        }\n",
    "\n",
    "        var_mat = unique(unlist(variant))\n",
    "        cor_mat = cor(Xmat[,var_mat]) \n",
    "        LD_vars = which(colSums(abs(cor_mat) > LD_thresh) > 1)\n",
    "    }\n",
    "    phenotype = sim_multi_traits(G = Xmat, B = B, h2g = ${h2g}, is_h2g_total = ${\"TRUE\" if total_h2g else \"FALSE\"})\n",
    "    phenotype = phenotype$P\n",
    "    X = Xmat\n",
    "    Y = phenotype\n",
    "    data = list()\n",
    "    data[[\"X\"]] = Xmat\n",
    "    data[[\"Y\"]] = Y\n",
    "    data[[\"variant\"]] = variant\n",
    "    saveRDS(data, ${_output:r})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7dcb799-93f9-4d81-b0a0-cea7739fe8d7",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "[simulation_551rand_complex]\n",
    "parameter: genofile = paths\n",
    "# pheno_file: give genotype file (in plink)，we can read the gentype matrix. These files are separated by TADs.\n",
    "parameter: cwd = path(\"output\")\n",
    "parameter: job_size = 30\n",
    "parameter: walltime = \"100h\"\n",
    "parameter: mem = \"30G\"\n",
    "parameter: numThreads = 1\n",
    "# specify the number of causal variants\n",
    "parameter: n_trait = 10\n",
    "parameter: h2g = 0.05\n",
    "parameter: total_h2g = False\n",
    "parameter: share_pattern = \"all\"\n",
    "parameter: independent = False\n",
    "# specify the number of traits (phenotypes)\n",
    "parameter: container = \"\"\n",
    "input: genofile, group_by = 1\n",
    "output: f'{cwd:a}/{step_name}//sample_{_index}_h2g_{h2g}_10trait_cluster_5+5+1rand.rds'\n",
    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n",
    "R:  expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n",
    "    library(\"MASS\")\n",
    "    library(\"plink2R\")\n",
    "    library(\"dplyr\")\n",
    "    library(\"readr\")\n",
    "    library(\"tidyverse\")\n",
    "    # source some functions to read matrix and inpute the missing data\n",
    "    source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n",
    "    source(\"~/cloud_colocalization/simulation_code/misc.R\")\n",
    "    # read the plink file\n",
    "    simu_file = ${_input:r}\n",
    "    geno <- read_plink(${_input:nr})\n",
    "    gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n",
    "    gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n",
    "    # filter by distance with. TSS\n",
    "    TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n",
    "    keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)\n",
    "    geno$bed = geno$bed[,keep_index]\n",
    "    # filter out columns with missing rate > 0.1\n",
    "    imiss = 0.1\n",
    "    # filter out columns with MAF < 0.05\n",
    "    maf = 0.05\n",
    "    Xmat = filter_X(geno$bed, imiss, maf)\n",
    "  \n",
    "    LD_vars = 1\n",
    "    if (indep) {\n",
    "        LD_vars = 1  # Initialize LD_vars\n",
    "\n",
    "        if (ncausal == 1) {\n",
    "            # If only one causal variant, just sample it\n",
    "            vars = sample(1:ncol(Xmat), size = ncausal)\n",
    "        } else {\n",
    "            # Repeat sampling until selected variables are quasi independent\n",
    "            while (length(LD_vars != 0)) {\n",
    "                vars = sample(1:ncol(Xmat), size = 3)  \n",
    "                cor_mat = cor(Xmat[, vars]) \n",
    "                LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)\n",
    "            }\n",
    "        }\n",
    "    } else {\n",
    "        vars = sample(1:ncol(Xmat), size = ncausal)\n",
    "    }\n",
    "  \n",
    "    B = matrix(0, nrow =  ncol(Xmat), ncol = 10)\n",
    "    var_vec = list()\n",
    "    rand_var_trait = sample(x = 1:10, size = 4)\n",
    "    for(i in 1:5){\n",
    "        if(i %in% rand_var_trait){\n",
    "            var_vec[[i]] = c(vars[1], vars[3])\n",
    "\n",
    "        }else{\n",
    "            var_vec[[i]] = c(vars[1])\n",
    "    }\n",
    "        beta = sim_beta_fix_variant(G = Xmat, causal_index = var_vec[[i]], is_h2g_total = FALSE)\n",
    "        B[, i] = beta\n",
    "    }\n",
    "    for(i in 6:10){\n",
    "    if(i %in% rand_var_trait){\n",
    "        var_vec[[i]] = c(vars[2], vars[3])\n",
    "\n",
    "    }else{\n",
    "        var_vec[[i]] = c(vars[2])\n",
    "    }\n",
    "    beta = sim_beta_fix_variant(G = Xmat, causal_index = var_vec[[i]], is_h2g_total = FALSE)\n",
    "    B[, i] = beta\n",
    "\n",
    "    }\n",
    "\n",
    "    variant = list()\n",
    "\n",
    "    for(i in 1:ncol(B)){\n",
    "        variant[[i]] = which(B[,i] != 0)\n",
    "    }\n",
    "    phenotype = sim_multi_traits(G = Xmat, B = B, h2g = ${h2g}, is_h2g_total = ${\"TRUE\" if total_h2g else \"FALSE\"})\n",
    "    phenotype = phenotype$P\n",
    "    X = Xmat\n",
    "    Y = phenotype\n",
    "    data = list()\n",
    "    data[[\"X\"]] = Xmat\n",
    "    data[[\"Y\"]] = Y\n",
    "    data[[\"variant\"]] = variant\n",
    "    saveRDS(data, ${_output:r})\n",
    "  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46e8d6a7-616d-4206-860e-a4c83947ef7f",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "[simulation_552rand_complex]\n",
    "parameter: genofile = paths\n",
    "# pheno_file: give genotype file (in plink)，we can read the gentype matrix. These files are separated by TADs.\n",
    "parameter: cwd = path(\"output\")\n",
    "parameter: job_size = 30\n",
    "parameter: walltime = \"100h\"\n",
    "parameter: mem = \"30G\"\n",
    "parameter: numThreads = 1\n",
    "# specify the number of causal variants\n",
    "parameter: n_trait = 10\n",
    "parameter: h2g = 0.05\n",
    "parameter: total_h2g = False\n",
    "parameter: share_pattern = \"all\"\n",
    "parameter: independent = False\n",
    "# specify the number of traits (phenotypes)\n",
    "parameter: container = \"\"\n",
    "input: genofile, group_by = 1\n",
    "output: f'{cwd:a}/{step_name}/sample_{_index}_h2g_{h2g}_10trait_cluster_5+5+2rand.rds'\n",
    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n",
    "R:  expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n",
    "    library(\"MASS\")\n",
    "    library(\"plink2R\")\n",
    "    library(\"dplyr\")\n",
    "    library(\"readr\")\n",
    "    library(\"tidyverse\")\n",
    "    # source some functions to read matrix and inpute the missing data\n",
    "    source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n",
    "    source(\"~/cloud_colocalization/simulation_code/misc.R\")\n",
    "    # read the plink file\n",
    "    # read the plink file\n",
    "    simu_file = ${_input:r}\n",
    "    geno <- read_plink(${_input:nr})\n",
    "    gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n",
    "    gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n",
    "    # filter by distance with. TSS\n",
    "    TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n",
    "    keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)\n",
    "    geno$bed = geno$bed[,keep_index]\n",
    "    # filter out columns with missing rate > 0.1\n",
    "    imiss = 0.1\n",
    "    # filter out columns with MAF < 0.05\n",
    "    maf = 0.05\n",
    "    Xmat = filter_X(geno$bed, imiss, maf)\n",
    "  \n",
    "    LD_vars = 1\n",
    "    if (indep) {\n",
    "        LD_vars = 1  # Initialize LD_vars\n",
    "\n",
    "        if (ncausal == 1) {\n",
    "            # If only one causal variant, just sample it\n",
    "            vars = sample(1:ncol(Xmat), size = ncausal)\n",
    "        } else {\n",
    "            # Repeat sampling until selected variables are quasi independent\n",
    "            while (length(LD_vars != 0)) {\n",
    "                vars = sample(1:ncol(Xmat), size = 4)  \n",
    "                cor_mat = cor(Xmat[, vars]) \n",
    "                LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)\n",
    "            }\n",
    "        }\n",
    "    } else {\n",
    "        vars = sample(1:ncol(Xmat), size = ncausal)\n",
    "    }\n",
    "  \n",
    "    B = matrix(0, nrow =  ncol(Xmat), ncol = 10)\n",
    "    var_vec = list()\n",
    "    rand_var_trait1 = sample(x = 1:10, size = 4)\n",
    "    rand_var_trait2 = sample(x = 1:10, size = 4)\n",
    "    for(i in 1:5){\n",
    "        var_vec[[i]] = c(vars[1])\n",
    "        if(i %in% rand_var_trait1){\n",
    "            var_vec[[i]] = c(var_vec[[i]], vars[3])\n",
    "\n",
    "        }\n",
    "        if(i %in% rand_var_trait2){\n",
    "            var_vec[[i]] = c(var_vec[[i]], vars[4])\n",
    "        }\n",
    "            \n",
    "        beta = sim_beta_fix_variant(G = Xmat, causal_index = var_vec[[i]], is_h2g_total = FALSE)\n",
    "        B[, i] = beta\n",
    "    }\n",
    "    for(i in 6:10){\n",
    "        var_vec[[i]] = c(vars[2])\n",
    "        if(i %in% rand_var_trait1){\n",
    "            var_vec[[i]] = c(var_vec[[i]], vars[3])\n",
    "\n",
    "        }\n",
    "        if(i %in% rand_var_trait2){\n",
    "            var_vec[[i]] = c(var_vec[[i]], vars[4])\n",
    "        }\n",
    "            \n",
    "        beta = sim_beta_fix_variant(G = Xmat, causal_index = var_vec[[i]], is_h2g_total = FALSE)\n",
    "        B[, i] = beta\n",
    "    }\n",
    "\n",
    "    variant = list()\n",
    "\n",
    "    for(i in 1:ncol(B)){\n",
    "        variant[[i]] = which(B[,i] != 0)\n",
    "    }\n",
    "    phenotype = sim_multi_traits(G = Xmat, B = B, h2g = ${h2g}, is_h2g_total = ${\"TRUE\" if total_h2g else \"FALSE\"})\n",
    "    phenotype = phenotype$P\n",
    "    X = Xmat\n",
    "    Y = phenotype\n",
    "    data = list()\n",
    "    data[[\"X\"]] = Xmat\n",
    "    data[[\"Y\"]] = Y\n",
    "    data[[\"variant\"]] = variant\n",
    "    saveRDS(data, ${_output:r})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3b90f13-7a6e-4313-a3c8-ed961e735e13",
   "metadata": {
    "kernel": "SoS"
   },
   "source": [
    "## Complex simulation - bash submission"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66484a46-4fbd-4aab-aec6-9c99db6f1fec",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "work_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n",
    "job=\"simulation_55_complex\"\n",
    "mkdir -p ${work_dir}\n",
    "mkdir -p ${work_dir}/code\n",
    "mkdir -p ${work_dir}/log\n",
    "cd ${work_dir}/code\n",
    "\n",
    "# Create the base_script file and write the bash code into it\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 10:00:00\n",
    "#SBATCH --mem=30000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n",
    "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n",
    "\n",
    "source ~/mamba_activate.sh\n",
    "\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/5.Simulation_secondary.ipynb JOB \\\n",
    "        --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n",
    "        --mem 30G --h2g 0.05  --independent \\\n",
    "        --cwd /home/hs3393/cb_Mar/simulation_data/\n",
    "EOF\n",
    "\n",
    "base_sh=\"base_script\"\n",
    "output_script=\"${job}.sh\"\n",
    "cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\" > ${output_script}\n",
    "sbatch ${output_script}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "341f4d04-65e8-44a5-920b-a2434c6cd777",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "work_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n",
    "job=\"simulation_3322_complex\"\n",
    "mkdir -p ${work_dir}\n",
    "mkdir -p ${work_dir}/code\n",
    "mkdir -p ${work_dir}/log\n",
    "cd ${work_dir}/code\n",
    "\n",
    "# Create the base_script file and write the bash code into it\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 10:00:00\n",
    "#SBATCH --mem=30000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n",
    "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n",
    "\n",
    "source ~/mamba_activate.sh\n",
    "\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/5.Simulation_secondary.ipynb JOB \\\n",
    "        --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n",
    "        --mem 30G --h2g 0.05 --independent \\\n",
    "        --cwd /home/hs3393/cb_Mar/simulation_data/\n",
    "EOF\n",
    "\n",
    "base_sh=\"base_script\"\n",
    "output_script=\"${job}.sh\"\n",
    "cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\" > ${output_script}\n",
    "sbatch ${output_script}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50cc8c23-634f-49dc-aefe-7e6faa11d3a0",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "work_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n",
    "job=\"simulation_551rand_complex\"\n",
    "mkdir -p ${work_dir}\n",
    "mkdir -p ${work_dir}/code\n",
    "mkdir -p ${work_dir}/log\n",
    "cd ${work_dir}/code\n",
    "\n",
    "# Create the base_script file and write the bash code into it\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 10:00:00\n",
    "#SBATCH --mem=30000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n",
    "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n",
    "\n",
    "source ~/mamba_activate.sh\n",
    "\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/5.Simulation_secondary.ipynb JOB \\\n",
    "        --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n",
    "        --mem 30G --h2g 0.05  --independent \\\n",
    "        --cwd /home/hs3393/cb_Mar/simulation_data/\n",
    "EOF\n",
    "\n",
    "base_sh=\"base_script\"\n",
    "output_script=\"${job}.sh\"\n",
    "cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\" > ${output_script}\n",
    "sbatch ${output_script}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1ae576e-5806-4882-8f28-704474ad5aac",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "work_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n",
    "job=\"simulation_552rand_complex\"\n",
    "mkdir -p ${work_dir}\n",
    "mkdir -p ${work_dir}/code\n",
    "mkdir -p ${work_dir}/log\n",
    "cd ${work_dir}/code\n",
    "\n",
    "# Create the base_script file and write the bash code into it\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 10:00:00\n",
    "#SBATCH --mem=30000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n",
    "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n",
    "\n",
    "source ~/mamba_activate.sh\n",
    "\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/5.Simulation_secondary.ipynb JOB \\\n",
    "        --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n",
    "        --mem 30G --h2g 0.05 --independent \\\n",
    "        --cwd /home/hs3393/cb_Mar/simulation_data/\n",
    "EOF\n",
    "\n",
    "base_sh=\"base_script\"\n",
    "output_script=\"${job}.sh\"\n",
    "cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\" > ${output_script}\n",
    "sbatch ${output_script}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de94d812-a998-4348-9714-3a41b0d75ceb",
   "metadata": {
    "kernel": "SoS"
   },
   "source": [
    "## Run Colocboost"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ef1d1d2-5a30-4fb5-9fe7-59c5e0ebc5e7",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "data_dir=\"/home/hs3393/cb_Mar/simulation_data//simulation_3322_complex/\"\n",
    "job=\"simulation_3322_complex\"\n",
    "work_dir=\"/home/hs3393/cb_Mar/simulation_result/complex_simulation/\"\n",
    "#!/bin/bash\n",
    "\n",
    "mkdir -p ${work_dir}/${job}/code\n",
    "mkdir -p ${work_dir}/${job}/log\n",
    "mkdir -p ${work_dir}/${job}/result\n",
    "\n",
    "cd ${work_dir}/${job}/code\n",
    "\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 100:00:00\n",
    "#SBATCH --mem=40000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n",
    "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n",
    "\n",
    "source /home/hs3393/mamba_activate.sh\n",
    "module load Singularity\n",
    "\n",
    "cd DATA_DIR\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/2.Run_Colocboost.ipynb colocboost \\\n",
    "    --simufile $(find -type f -name '*.rds') \\\n",
    "    --mem 40G --trait 10 \\\n",
    "    --cwd WORK_DIR/JOB/result\n",
    "EOF\n",
    "\n",
    "\n",
    "base_script=\"base_script\"\n",
    "output_script=\"job_${job}.sh\"\n",
    "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\"  > ${output_script}\n",
    "sbatch ${output_script}\n",
    "\n",
    "data_dir=\"/home/hs3393/cb_Mar/simulation_data//simulation_55_complex/\"\n",
    "job=\"simulation_55_complex\"\n",
    "work_dir=\"/home/hs3393/cb_Mar/simulation_result/complex_simulation/\"\n",
    "#!/bin/bash\n",
    "\n",
    "mkdir -p ${work_dir}/${job}/code\n",
    "mkdir -p ${work_dir}/${job}/log\n",
    "mkdir -p ${work_dir}/${job}/result\n",
    "\n",
    "cd ${work_dir}/${job}/code\n",
    "\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 100:00:00\n",
    "#SBATCH --mem=40000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n",
    "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n",
    "\n",
    "source /home/hs3393/mamba_activate.sh\n",
    "module load Singularity\n",
    "\n",
    "cd DATA_DIR\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/2.Run_Colocboost.ipynb colocboost \\\n",
    "    --simufile $(find -type f -name '*.rds') \\\n",
    "    --mem 40G --trait 10 \\\n",
    "    --cwd WORK_DIR/JOB/result\n",
    "EOF\n",
    "\n",
    "\n",
    "base_script=\"base_script\"\n",
    "output_script=\"job_${job}.sh\"\n",
    "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\"  > ${output_script}\n",
    "sbatch ${output_script}\n",
    "\n",
    "data_dir=\"/home/hs3393/cb_Mar/simulation_data//simulation_551rand_complex/\"\n",
    "job=\"simulation_551rand_complex\"\n",
    "work_dir=\"/home/hs3393/cb_Mar/simulation_result/complex_simulation/\"\n",
    "#!/bin/bash\n",
    "\n",
    "mkdir -p ${work_dir}/${job}/code\n",
    "mkdir -p ${work_dir}/${job}/log\n",
    "mkdir -p ${work_dir}/${job}/result\n",
    "\n",
    "cd ${work_dir}/${job}/code\n",
    "\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 100:00:00\n",
    "#SBATCH --mem=40000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n",
    "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n",
    "\n",
    "source /home/hs3393/mamba_activate.sh\n",
    "module load Singularity\n",
    "\n",
    "cd DATA_DIR\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/2.Run_Colocboost.ipynb colocboost \\\n",
    "    --simufile $(find -type f -name '*.rds') \\\n",
    "    --mem 40G --trait 10 \\\n",
    "    --cwd WORK_DIR/JOB/result\n",
    "EOF\n",
    "\n",
    "\n",
    "base_script=\"base_script\"\n",
    "output_script=\"job_${job}.sh\"\n",
    "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\"  > ${output_script}\n",
    "sbatch ${output_script}\n",
    "\n",
    "data_dir=\"/home/hs3393/cb_Mar/simulation_data/simulation_552rand_complex/\"\n",
    "job=\"simulation_552rand_complex\"\n",
    "work_dir=\"/home/hs3393/cb_Mar/simulation_result/complex_simulation/\"\n",
    "#!/bin/bash\n",
    "\n",
    "mkdir -p ${work_dir}/${job}/code\n",
    "mkdir -p ${work_dir}/${job}/log\n",
    "mkdir -p ${work_dir}/${job}/result\n",
    "\n",
    "cd ${work_dir}/${job}/code\n",
    "\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 100:00:00\n",
    "#SBATCH --mem=40000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n",
    "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n",
    "\n",
    "source /home/hs3393/mamba_activate.sh\n",
    "module load Singularity\n",
    "\n",
    "cd DATA_DIR\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/2.Run_Colocboost.ipynb colocboost \\\n",
    "    --simufile $(find -type f -name '*.rds') \\\n",
    "    --mem 40G --trait 10 \\\n",
    "    --cwd WORK_DIR/JOB/result\n",
    "EOF\n",
    "\n",
    "\n",
    "base_script=\"base_script\"\n",
    "output_script=\"job_${job}.sh\"\n",
    "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\"  > ${output_script}\n",
    "sbatch ${output_script}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "632c5bb4-5e74-4a10-be27-0a8821f41039",
   "metadata": {
    "kernel": "SoS"
   },
   "source": [
    "## Run Hyprcoloc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "abd463f2-30ee-43f3-a502-431ec581f64e",
   "metadata": {
    "kernel": "SoS"
   },
   "outputs": [],
   "source": [
    "data_dir=\"/home/hs3393/cb_Mar/simulation_data//simulation_3322_complex/\"\n",
    "job=\"simulation_3322_complex\"\n",
    "work_dir=\"/home/hs3393/cb_Mar/simulation_result/hyprcoloc/complex_simulation/\"\n",
    "#!/bin/bash\n",
    "\n",
    "mkdir -p ${work_dir}/${job}/code\n",
    "mkdir -p ${work_dir}/${job}/log\n",
    "mkdir -p ${work_dir}/${job}/result\n",
    "\n",
    "cd ${work_dir}/${job}/code\n",
    "\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 100:00:00\n",
    "#SBATCH --mem=40000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n",
    "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n",
    "\n",
    "source /home/hs3393/mamba_activate.sh\n",
    "module load Singularity\n",
    "\n",
    "cd DATA_DIR\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/3.Other_Methods.ipynb hyprcoloc_set \\\n",
    "    --simufile $(find -type f -name '*.rds') \\\n",
    "    --mem 40G --trait 10 \\\n",
    "    --cwd WORK_DIR/JOB/result\n",
    "EOF\n",
    "\n",
    "\n",
    "base_script=\"base_script\"\n",
    "output_script=\"job_${job}.sh\"\n",
    "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\"  > ${output_script}\n",
    "sbatch ${output_script}\n",
    "\n",
    "data_dir=\"/home/hs3393/cb_Mar/simulation_data//simulation_55_complex/\"\n",
    "job=\"simulation_55_complex\"\n",
    "work_dir=\"/home/hs3393/cb_Mar/simulation_result/hyprcoloc/complex_simulation/\"\n",
    "#!/bin/bash\n",
    "\n",
    "mkdir -p ${work_dir}/${job}/code\n",
    "mkdir -p ${work_dir}/${job}/log\n",
    "mkdir -p ${work_dir}/${job}/result\n",
    "\n",
    "cd ${work_dir}/${job}/code\n",
    "\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 100:00:00\n",
    "#SBATCH --mem=40000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n",
    "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n",
    "\n",
    "source /home/hs3393/mamba_activate.sh\n",
    "module load Singularity\n",
    "\n",
    "cd DATA_DIR\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/3.Other_Methods.ipynb hyprcoloc_set \\\n",
    "    --simufile $(find -type f -name '*.rds') \\\n",
    "    --mem 40G --trait 10 \\\n",
    "    --cwd WORK_DIR/JOB/result\n",
    "EOF\n",
    "\n",
    "\n",
    "base_script=\"base_script\"\n",
    "output_script=\"job_${job}.sh\"\n",
    "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\"  > ${output_script}\n",
    "sbatch ${output_script}\n",
    "\n",
    "data_dir=\"/home/hs3393/cb_Mar/simulation_data//simulation_551rand_complex/\"\n",
    "job=\"simulation_551rand_complex\"\n",
    "work_dir=\"/home/hs3393/cb_Mar/simulation_result/hyprcoloc/complex_simulation/\"\n",
    "#!/bin/bash\n",
    "\n",
    "mkdir -p ${work_dir}/${job}/code\n",
    "mkdir -p ${work_dir}/${job}/log\n",
    "mkdir -p ${work_dir}/${job}/result\n",
    "\n",
    "cd ${work_dir}/${job}/code\n",
    "\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 100:00:00\n",
    "#SBATCH --mem=40000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n",
    "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n",
    "\n",
    "source /home/hs3393/mamba_activate.sh\n",
    "module load Singularity\n",
    "\n",
    "cd DATA_DIR\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/3.Other_Methods.ipynb hyprcoloc_set \\\n",
    "    --simufile $(find -type f -name '*.rds') \\\n",
    "    --mem 40G --trait 10 \\\n",
    "    --cwd WORK_DIR/JOB/result\n",
    "EOF\n",
    "\n",
    "\n",
    "base_script=\"base_script\"\n",
    "output_script=\"job_${job}.sh\"\n",
    "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\"  > ${output_script}\n",
    "sbatch ${output_script}\n",
    "\n",
    "data_dir=\"/home/hs3393/cb_Mar/simulation_data/simulation_552rand_complex/\"\n",
    "job=\"simulation_552rand_complex\"\n",
    "work_dir=\"/home/hs3393/cb_Mar/simulation_result/hyprcoloc/complex_simulation/\"\n",
    "#!/bin/bash\n",
    "\n",
    "mkdir -p ${work_dir}/${job}/code\n",
    "mkdir -p ${work_dir}/${job}/log\n",
    "mkdir -p ${work_dir}/${job}/result\n",
    "\n",
    "cd ${work_dir}/${job}/code\n",
    "\n",
    "cat << 'EOF' > base_script\n",
    "#!/bin/bash -l\n",
    "# NOTE the -l flag!\n",
    "#\n",
    "#SBATCH -t 100:00:00\n",
    "#SBATCH --mem=40000\n",
    "#SBATCH -J JOB\n",
    "#SBATCH -o WORK_DIR/JOB/log/JOB.%j.out\n",
    "#SBATCH -e WORK_DIR/JOB/log/JOB.%j.err\n",
    "\n",
    "source /home/hs3393/mamba_activate.sh\n",
    "module load Singularity\n",
    "\n",
    "cd DATA_DIR\n",
    "sos run /home/hs3393/cb_Mar/simulation_code/3.Other_Methods.ipynb hyprcoloc_set \\\n",
    "    --simufile $(find -type f -name '*.rds') \\\n",
    "    --mem 40G --trait 10 \\\n",
    "    --cwd WORK_DIR/JOB/result\n",
    "EOF\n",
    "\n",
    "\n",
    "base_script=\"base_script\"\n",
    "output_script=\"job_${job}.sh\"\n",
    "cat ${base_script}| sed \"s|WORK_DIR|${work_dir}|g\" | sed \"s|JOB|${job}|g\" | sed \"s|DATA_DIR|${data_dir}|g\"  > ${output_script}\n",
    "sbatch ${output_script}\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "SoS",
   "language": "sos",
   "name": "sos"
  },
  "language_info": {
   "codemirror_mode": "sos",
   "file_extension": ".sos",
   "mimetype": "text/x-sos",
   "name": "sos",
   "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter",
   "pygments_lexer": "sos"
  },
  "sos": {
   "kernels": [
    [
     "R",
     "ir",
     "R",
     "",
     ""
    ],
    [
     "SoS",
     "sos",
     "",
     "",
     "sos"
    ]
   ],
   "version": "0.24.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}