{ "cells": [ { "cell_type": "markdown", "id": "4ab9c2ef-b677-4d31-a979-f89eef9bad3f", "metadata": { "kernel": "SoS", "tags": [] }, "source": [ "# Phenotype data simulation\n", "\n", "## Goal\n", "\n", "Here we use two strategies to simulate phenotype data (Y matrix) based on individual-level genotype data (X matrix)." ] }, { "cell_type": "markdown", "id": "36576543-2bc8-4a16-aee7-a2e46584202e", "metadata": { "kernel": "SoS" }, "source": [ "## Input\n", "\n", "`genofile`: plink file of real genotyope, `/mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/plink_by_gene/extended_cis_before_winsorize_plink_files/*.bim`\n", "\n", "The other parameters can be found in simxQTL repo. `https://github.com/StatFunGen/simxQTL`." ] }, { "cell_type": "markdown", "id": "46219b10-544a-4799-8e20-c53ee667ab53", "metadata": { "kernel": "SoS" }, "source": [ "## Output\n", "\n", "An rds matrix, with genotype matrix X (dimension: m * n, m: number of sample, n: number of SNP ) and phenotype (trait) matrix (dimension: m * a, m : number of samples, a: number of simulated traits) \n", "\n", "Example output:" ] }, { "cell_type": "code", "execution_count": 2, "id": "d7af7a25-eb58-4629-b550-c1aa87f2249e", "metadata": { "kernel": "R", "tags": [] }, "outputs": [], "source": [ "result = readRDS(\"/home/hs3393/cb_Mar/simulation_data/real_simulation_5trait/sample_999_real_simulation_3_ncausal_5_trait.rds\")" ] }, { "cell_type": "code", "execution_count": 5, "id": "8320e406-bb6f-416d-89f3-72349f5ebc36", "metadata": { "kernel": "R", "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "
  1. 1153
  2. 10534
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 1153\n", "\\item 10534\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 1153\n", "2. 10534\n", "\n", "\n" ], "text/plain": [ "[1] 1153 10534" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
  1. 1153
  2. 5
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 1153\n", "\\item 5\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 1153\n", "2. 5\n", "\n", "\n" ], "text/plain": [ "[1] 1153 5" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dim(result$X)\n", "\n", "dim(result$Y)" ] }, { "cell_type": "code", "execution_count": 4, "id": "6a51fe20-fe97-463a-b5c3-a9980dc3932d", "metadata": { "kernel": "R", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "List of 5\n", " $ : int [1:3] 196 5616 6566\n", " $ : int [1:2] 5616 6566\n", " $ : int [1:2] 196 6566\n", " $ : int [1:2] 196 5616\n", " $ : int [1:2] 196 6566\n" ] } ], "source": [ "str(result$variant)" ] }, { "cell_type": "markdown", "id": "0c067a16-f7e1-4d5f-9bdd-a28e833ce6b2", "metadata": { "kernel": "R" }, "source": [ "In this region, we simulated **3** causal variants (196, 5616, 6566). Each causal variant is distributed by distribution in **5** traits." ] }, { "cell_type": "markdown", "id": "94328d35-ea80-4497-b6b3-ad9eb9af28c5", "metadata": { "kernel": "SoS" }, "source": [ "## Simulation Strategy" ] }, { "cell_type": "markdown", "id": "3761df65-8ec2-4efc-b11f-a91fbe54aa96", "metadata": { "kernel": "SoS" }, "source": [ "### Total heritability ($\\phi_{total}$)\n", "\n", "Effect size are all set to be 1. This value actually doesn't matter so much because phi will eventually control the variance. Under this case, all SNPs share the same effect size, and they altogether contribute to explain $\\phi_{total}$ (eg.0.5) variance of the total variance. To get Y, we assume a multivariate gaussian distribution $\\textbf{Y} \\sim N(\\textbf{X} \\beta, \\sigma^2)$, and $\\sigma^2$ can be estimated by the equation below.\n", "\n", "$\\phi_{total} = \\dfrac{var(X \\beta)}{\\sigma^2 + var(X \\beta)}$\n", "\n", "$\\sigma^2 = \\frac{var(X \\beta)(1-\\phi_{total})}{\\phi_{total}}$ " ] }, { "cell_type": "markdown", "id": "90feaff5-da51-40eb-86c4-a14f03c46919", "metadata": { "kernel": "SoS" }, "source": [ "### SNP level heritability\n", "\n", "1. Assume there are total number of $a$ causal variants. We assign $\\beta_1 = 1$ as the initialize setting.\n", "\n", "2. For each causal variant we have:\n", "\n", "$\\frac{Var(X_1 \\beta_1)}{Var(Y)} = \\frac{Var(X_2 \\beta_2)}{Var(Y)} = ... = \\frac{Var(X_a \\beta_a)}{Var(Y)} = \\phi_{SNP}$\n", "\n", "3. In that way, we have: if $\\beta_1^2 Var(X_1) = \\beta_2^2Var(X_2) = ... = \\beta_a^2 Var(X_a)$. Then we have $\\beta_2 = \\sqrt{\\frac{\\beta_1^2 Var(X_1)}{Var(X_2)}}$, ..., $\\beta_a = \\sqrt{\\frac{\\beta_1^2 Var(X_1)}{Var(X_a)}}$.\n", "\n", "5. Then we will have\n", "$\\frac{Var(X_1 \\beta_1)}{Var(Y)} = \\frac{Var(X_1 \\beta_1)}{Var(X \\beta) + \\sigma^2} = \\phi_{SNP}$. Therefore, we have $\\sigma^2 = \\frac{Var(X_1 \\beta_1)}{\\phi_{SNP}} - Var(X \\beta)$" ] }, { "cell_type": "markdown", "id": "aca5a021-7999-4b56-ab94-434f8e216440", "metadata": { "kernel": "SoS" }, "source": [ "# Simulation code" ] }, { "cell_type": "markdown", "id": "e76def0c-ed4f-48fe-95bd-ba0c55bcecf6", "metadata": { "kernel": "SoS" }, "source": [ "## Step 1: Randomly select regions\n", "\n", "Select gene regions to simulate - randomly select one gene from each 1,329 different TADB blocks. \n", "\n", "Some genes might exist in more than one TADB block. Total unique gene number after random selection: 1287." 
] }, { "cell_type": "code", "execution_count": 206, "id": "a86c30a1-e83e-4e43-9f13-9e767b731f71", "metadata": { "kernel": "R", "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[1mRows: \u001b[22m\u001b[34m126063\u001b[39m \u001b[1mColumns: \u001b[22m\u001b[34m11\u001b[39m\n", "\u001b[36m──\u001b[39m \u001b[1mColumn specification\u001b[22m \u001b[36m────────────────────────────────────────────────────────\u001b[39m\n", "\u001b[1mDelimiter:\u001b[22m \"\\t\"\n", "\u001b[31mchr\u001b[39m (5): #chr, gene_id, gene_name, TADB_index, TADB_id\n", "\u001b[32mdbl\u001b[39m (6): gene_cis_start, gene_cis_end, TSS, end.x, TADB_start, TADB_end\n", "\n", "\u001b[36mℹ\u001b[39m Use `spec()` to retrieve the full column specification for this data.\n", "\u001b[36mℹ\u001b[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.\n" ] }, { "data": { "text/html": [ "1329" ], "text/latex": [ "1329" ], "text/markdown": [ "1329" ], "text/plain": [ "[1] 1329" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[1mRows: \u001b[22m\u001b[34m1329\u001b[39m \u001b[1mColumns: \u001b[22m\u001b[34m1\u001b[39m\n", "\u001b[36m──\u001b[39m \u001b[1mColumn specification\u001b[22m \u001b[36m────────────────────────────────────────────────────────\u001b[39m\n", "\u001b[1mDelimiter:\u001b[22m \"\\t\"\n", "\u001b[31mchr\u001b[39m (1): gene\n", "\n", "\u001b[36mℹ\u001b[39m Use `spec()` to retrieve the full column specification for this data.\n", "\u001b[36mℹ\u001b[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.\n" ] }, { "data": { "text/html": [ "1329" ], "text/latex": [ "1329" ], "text/markdown": [ "1329" ], "text/plain": [ "[1] 1329" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "library(tidyverse)\n", "\n", "mapper = read_tsv(\"/home/hs3393/fungen-xqtl-analysis/resource//gene_cis_TADB_mapper.tsv\")\n", "\n", "\n", "df_selected <- mapper %>% filter(`#chr` != \"chrX\") %>%\n", " group_by(TADB_index) %>%\n", " slice_sample(n = 1) %>%\n", " ungroup()\n", "\n", "\n", "genes_selected = df_selected %>% pull(gene_id) \n", "\n", "genes_selected %>% length()\n", "\n", "#genes_selected %>% save_tsv(\"/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/selected_gene_id.tsv\")\n", "\n", "genes_selected = read_tsv(\"/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/selected_gene_id.tsv\", col_names = \"gene\")$gene\n", "\n", "mapper %>% filter(gene_id %in% genes_selected) %>% pull(TADB_index) %>% unique() %>% length()" ] }, { "cell_type": "code", "execution_count": 208, "id": "274c0434-137f-40f7-9f09-031b3f51026a", "metadata": { "kernel": "R", "tags": [] }, "outputs": [ { "data": { "text/html": [ "1287" ], "text/latex": [ "1287" ], "text/markdown": [ "1287" ], "text/plain": [ "[1] 1287" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "genes_selected %>% unique() %>% length()" ] }, { "cell_type": "markdown", "id": "1c0fd962-aaf7-4068-adc6-8e83150fa98e", "metadata": { "kernel": "SoS" }, "source": [ "## Step 2: Make a copy these selected genotypes to another folder" ] }, { "cell_type": "code", "execution_count": null, "id": "8a9b3987-1c69-4997-974a-f51e0a83bf32", "metadata": { "kernel": "Python3", "tags": [] }, "outputs": [], "source": [ "import os\n", "import shutil\n", "\n", "# Path to the file with gene names\n", "gene_file_path = 
'/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/selected_gene_id.tsv'\n", "\n", "source_folder = '/mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/plink_by_gene/extended_cis_before_winsorize_plink_files'\n", "\n", "destination_folder = '/home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype'\n", "\n", "with open(gene_file_path, 'r') as file:\n", "    gene_names = [line.strip() for line in file]\n", "\n", "if not os.path.exists(destination_folder):\n", "    os.makedirs(destination_folder)\n", "\n", "for filename in os.listdir(source_folder):\n", "    if any(gene_name in filename for gene_name in gene_names):\n", "        # Copy the file to the destination folder\n", "        shutil.copy(os.path.join(source_folder, filename), destination_folder)\n", "\n", "print('Files copied successfully.')\n" ] }, { "cell_type": "markdown", "id": "750122bd-607e-4d45-a30b-ebe4977daba8", "metadata": { "kernel": "Python3" }, "source": [ "## Step 3: Simulation\n", "\n", "### 1. Figure 2A simulation (2 trait)" ] }, { "cell_type": "code", "execution_count": null, "id": "8bbde1db-e09f-40c1-bb33-f55a7a718af7", "metadata": { "kernel": "SoS", "tags": [] }, "outputs": [], "source": [ "[real_simulation_2trait]\n", "parameter: genofile = paths\n", "# genofile: plink genotype files from which the genotype matrix is read; these files are split by TAD\n", "parameter: cwd = path(\"output\")\n", "parameter: job_size = 30\n", "parameter: walltime = \"100h\"\n", "parameter: mem = \"30G\"\n", "parameter: numThreads = 1\n", "# specify the number of traits (phenotypes)\n", "parameter: n_trait = 2\n", "# specify the number of causal variants\n", "parameter: n_causal = 1\n", "parameter: h2g = 0.05\n", "parameter: total_h2g = False\n", "parameter: share_pattern = \"all\"\n", "parameter: independent = False\n", "parameter: container = \"\"\n", "input: genofile, group_by = 1\n", "output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'\n", "R: expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n", " library(\"MASS\")\n", " library(\"plink2R\")\n", " library(\"dplyr\")\n", " library(\"readr\")\n", " library(\"tidyverse\")\n", " # source helper functions that read the genotype matrix and impute missing data\n", " source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n", " source(\"~/cloud_colocalization/simulation_code/misc.R\")\n", " # read the plink file\n", " simu_file = ${_input:r}\n", " geno <- read_plink(${_input:nr})\n", " gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n", " gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n", "\n", " # Extract the TSS position for the given gene\n", " TSS_pos <- gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n", "\n", " # Define the genomic region around the TSS (+/- 1.5Mb) and filter SNPs\n", " keep_index <- which(geno$bim$V4 > (TSS_pos - 1500000) & geno$bim$V4 < (TSS_pos + 1500000))\n", " geno$bed <- geno$bed[, keep_index]\n", "\n", " # Set thresholds for filtering SNPs\n", " imiss <- 0.1 # Maximum missing rate allowed (10%)\n", " maf <- 0.05 # Minimum minor allele frequency (5%)\n", "\n", " # Filter SNPs based on missing rate and MAF\n", " Xmat <- filter_X(geno$bed, imiss, maf)\n", "\n", " # Define the number of causal variants and traits\n", " ncausal <- ${n_causal}\n", " ntrait <- ${n_trait}\n", " indep = ${\"TRUE\" if independent else \"FALSE\"}\n", "\n", " # Select causal variants 
randomly\n", " # If we want to limit the LD between causal variants, un-annotate the code below to limit < 0.3 LD between variants \n", " if (indep) {\n", " LD_vars = 1 # Initialize LD_vars\n", "\n", " if (ncausal == 1) {\n", " # If only one causal variant, just sample it\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " } else {\n", " # Repeat sampling until selected variables are quasi independent\n", " while (length(LD_vars != 0)) {\n", " vars = sample(1:ncol(Xmat), size = ncausal) \n", " cor_mat = cor(Xmat[, vars]) \n", " LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)\n", " }\n", " }\n", " } else {\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " }\n", "\n", "\n", " # Initialize effect size matrix (B) with zeros\n", " B <- matrix(0, nrow = ncol(Xmat), ncol = ntrait)\n", "\n", " # Simulate effect sizes for the first two traits\n", " for (i in 1:2) {\n", " causal_index <- vars\n", " beta <- sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)\n", " B[, i] <- beta\n", " }\n", "\n", " # save causal variants per trait\n", " variant <- list()\n", " for (i in 1:ncol(B)) {\n", " variant[[i]] <- which(B[, i] != 0)\n", " }\n", "\n", " # Simulate multi-trait phenotypes\n", " phenotype <- sim_multi_traits(\n", " G = Xmat,\n", " B = B,\n", " h2g = ${h2g},\n", " is_h2g_total = ${\"TRUE\" if total_h2g else \"FALSE\"}\n", " )\n", " phenotype <- phenotype$P\n", "\n", " data <- list()\n", " data[[\"X\"]] <- Xmat\n", " data[[\"Y\"]] <- phenotype\n", " data[[\"variant\"]] <- variant\n", "\n", " saveRDS(data, ${_output:r})" ] }, { "cell_type": "markdown", "id": "1692baa7-33b8-45da-8eb3-219ace2f3526", "metadata": { "kernel": "SoS" }, "source": [ "### 2. Figure 2A simulation (5 trait)" ] }, { "cell_type": "code", "execution_count": null, "id": "c73a1829-8d26-484c-9ba0-6416d341a75b", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "[real_simulation_5trait]\n", "parameter: genofile = paths\n", "# pheno_file: give genotype file (in plink),we can read the gentype matrix. These files are separated by TADs.\n", "parameter: cwd = path(\"output\")\n", "parameter: job_size = 30\n", "parameter: walltime = \"100h\"\n", "parameter: mem = \"30G\"\n", "parameter: numThreads = 1\n", "# specify the number of causal variants\n", "parameter: n_trait = 5\n", "parameter: n_causal = 1\n", "parameter: h2g = 0.05\n", "parameter: total_h2g = False\n", "parameter: share_pattern = \"all\"\n", "parameter: independent = False\n", "# specify the number of traits (phenotypes)\n", "parameter: container = \"\"\n", "input: genofile, group_by = 1\n", "output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'\n", "R: expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n", " library(\"MASS\")\n", " library(\"plink2R\")\n", " library(\"dplyr\")\n", " library(\"readr\")\n", " library(\"tidyverse\")\n", " # source some functions to read matrix and inpute the missing data\n", " source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n", " source(\"~/cloud_colocalization/simulation_code/misc.R\")\n", " # read the plink file\n", " simu_file = ${_input:r}\n", " geno <- read_plink(${_input:nr})\n", " gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n", " gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n", " # filter by distance with. 
TSS\n", " TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n", " keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)\n", " geno$bed <- geno$bed[, keep_index]\n", "\n", " # Filter out columns with missing rate > 0.1\n", " imiss <- 0.1\n", "\n", " # Filter out columns with MAF < 0.05\n", " maf <- 0.05\n", "\n", " # Apply filtering\n", " Xmat <- filter_X(geno$bed, imiss, maf)\n", "\n", " # Only keep the first 4000 variants\n", " ncausal <- ${n_causal}\n", " ntrait <- ${n_trait}\n", " indep = ${\"TRUE\" if independent else \"FALSE\"}\n", "\n", " if (indep) {\n", " LD_vars = 1 # Initialize LD_vars\n", "\n", " if (ncausal == 1) {\n", " # If only one causal variant, just sample it\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " } else {\n", " # Repeat sampling until selected variables are quasi independent\n", " while (length(LD_vars != 0)) {\n", " vars = sample(1:ncol(Xmat), size = ncausal) \n", " cor_mat = cor(Xmat[, vars]) \n", " LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)\n", " }\n", " }\n", " } else {\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " }\n", " # Load predefined proportions for causal variant sampling\n", " prop <- readRDS(\"/home/hs3393/cloud_colocalization/simulation_code/trait6_prop.rds\")[, 1:4]\n", " proportions <- prop[ncausal, ]\n", "\n", " # Sample variant names based on predefined proportions\n", " sampled_name <- as.numeric(sample(names(proportions), size = ncausal, prob = proportions, replace = TRUE))\n", "\n", " # Initialize phenotype list and effect size matrix\n", " phenotype <- list()\n", " B <- matrix(0, nrow = ncol(Xmat), ncol = ntrait)\n", "\n", " # Configuration matrix for trait-variant relationships\n", " config <- matrix(0, nrow = ntrait, ncol = ncausal)\n", "\n", " # Assign causal variants to traits\n", " for (i in 1:ncausal) {\n", " coloc_trait <- sample(1:ntrait, sampled_name[i])\n", " config[coloc_trait, i] <- 1\n", " }\n", "\n", " # Simulate effect sizes and phenotypes\n", " for (i in 1:nrow(config)) {\n", " beta <- B[, i, drop = FALSE]\n", " index <- which(config[i, ] == 1)\n", "\n", " if (length(index) > 0) {\n", " causal_index <- vars[index]\n", " beta <- sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)\n", " B[, i] <- beta\n", " pheno_single <- sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)\n", " phenotype[[i]] <- pheno_single$P\n", " } else {\n", " pheno_single <- sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)\n", " phenotype[[i]] <- pheno_single$P\n", " }\n", " }\n", "\n", " # Identify causal variants for each trait\n", " variant <- list()\n", " for (i in 1:ncol(B)) {\n", " variant[[i]] <- which(B[, i] != 0)\n", " }\n", "\n", " # Combine phenotype data\n", " X <- Xmat\n", " Y <- bind_cols(phenotype)\n", " colnames(Y) <- paste0(\"Trait\", 1:ntrait)\n", "\n", " # Store results in a list\n", " data <- list()\n", " data[[\"X\"]] <- Xmat\n", " data[[\"Y\"]] <- Y\n", " data[[\"variant\"]] <- variant\n", "\n", " # Save results\n", " saveRDS(data, ${_output:r})" ] }, { "cell_type": "markdown", "id": "6a90244b-9972-49b9-9a57-f3d71ea44473", "metadata": { "kernel": "SoS" }, "source": [ "### 3. 
Figure 2A simulation (10 trait)" ] }, { "cell_type": "code", "execution_count": null, "id": "43ab5a46-d084-47e3-bbcd-629efa9ee642", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "[real_simulation_10trait]\n", "parameter: genofile = paths\n", "# pheno_file: give genotype file (in plink),we can read the gentype matrix. These files are separated by TADs.\n", "parameter: cwd = path(\"output\")\n", "parameter: job_size = 30\n", "parameter: walltime = \"100h\"\n", "parameter: mem = \"30G\"\n", "parameter: numThreads = 1\n", "# specify the number of causal variants\n", "parameter: n_trait = 10\n", "parameter: n_causal = 1\n", "parameter: h2g = 0.05\n", "parameter: total_h2g = False\n", "parameter: independent = False\n", "parameter: share_pattern = \"all\"\n", "# specify the number of traits (phenotypes)\n", "parameter: container = \"\"\n", "input: genofile, group_by = 1\n", "output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'\n", "R: expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n", " library(\"MASS\")\n", " library(\"plink2R\")\n", " library(\"dplyr\")\n", " library(\"readr\")\n", " library(\"tidyverse\")\n", " # source some functions to read matrix and inpute the missing data\n", " source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n", " source(\"~/cloud_colocalization/simulation_code/misc.R\")\n", " # read the plink file\n", " simu_file = ${_input:r}\n", " geno <- read_plink(${_input:nr})\n", " gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n", " gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n", " # filter by distance with. TSS\n", " TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n", " keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)\n", " geno$bed = geno$bed[,keep_index]\n", " # filter out columns with missing rate > 0.1\n", " imiss = 0.1\n", " # filter out columns with MAF < 0.05\n", " maf = 0.05\n", " Xmat = filter_X(geno$bed, imiss, maf)\n", " # only keep the first 4000 variants\n", " ncausal = ${n_causal}\n", " ntrait = ${n_trait}\n", " indep = ${\"TRUE\" if independent else \"FALSE\"}\n", " if (indep) {\n", " LD_vars = 1 # Initialize LD_vars\n", "\n", " if (ncausal == 1) {\n", " # If only one causal variant, just sample it\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " } else {\n", " # Repeat sampling until selected variables are quasi independent\n", " while (length(LD_vars != 0)) {\n", " vars = sample(1:ncol(Xmat), size = ncausal) \n", " cor_mat = cor(Xmat[, vars]) \n", " LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)\n", " }\n", " }\n", " } else {\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " }\n", " \n", " prop = readRDS(\"/home/hs3393/cloud_colocalization/simulation_code/trait10_prop.rds\")\n", " proportions = prop[ncausal,]\n", " sampled_name <- as.numeric(sample(names(proportions), size = ncausal, prob = proportions, replace = TRUE))\n", " phenotype = list()\n", " B = matrix(0, nrow = ncol(Xmat), ncol = ntrait)\n", " config = matrix(0, nrow = ntrait, ncol = ncausal)\n", " for(i in c(1:ncausal)){\n", " coloc_trait = sample(c(1:ntrait), sampled_name[i])\n", " config[coloc_trait, i] = 1\n", " }\n", " for(i in 1:nrow(config)){\n", " beta = B[,i, drop = FALSE]\n", " index = which(config[i,] == 1)\n", " if(length(index) > 0){\n", " causal_index = vars[index]\n", " beta = sim_beta_fix_variant(G = Xmat, 
causal_index = causal_index, is_h2g_total = FALSE)\n", " B[, i] = beta\n", " pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)\n", " phenotype[[i]] = pheno_single$P\n", " }else{\n", " pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)\n", " phenotype[[i]] = pheno_single$P\n", " \n", " }\n", " }\n", " variant = list()\n", " for(i in 1:ncol(B)){\n", " variant[[i]] = which(B[,i] != 0)\n", " }\n", " X = Xmat\n", " Y = bind_cols(phenotype)\n", " colnames(Y) = paste0(\"Trait\", c(1:ntrait))\n", " data = list()\n", " data[[\"X\"]] = Xmat\n", " data[[\"Y\"]] = Y\n", " data[[\"variant\"]] = variant\n", " saveRDS(data, ${_output:r})" ] }, { "cell_type": "markdown", "id": "39514c63-1891-45e1-b527-4b1401ee712c", "metadata": { "kernel": "SoS" }, "source": [ "### 4. Figure 2A simulation (20 trait)" ] }, { "cell_type": "code", "execution_count": null, "id": "f3ef6d15-f018-44b0-abd2-89bf318eb4ee", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "[real_simulation_20trait]\n", "parameter: genofile = paths\n", "# pheno_file: give genotype file (in plink),we can read the gentype matrix. These files are separated by TADs.\n", "parameter: cwd = path(\"output\")\n", "parameter: job_size = 30\n", "parameter: walltime = \"100h\"\n", "parameter: mem = \"30G\"\n", "parameter: numThreads = 1\n", "# specify the number of causal variants\n", "parameter: n_trait = 20\n", "parameter: n_causal = 1\n", "parameter: h2g = 0.05\n", "parameter: total_h2g = False\n", "parameter: independent = False\n", "parameter: share_pattern = \"all\"\n", "# specify the number of traits (phenotypes)\n", "parameter: container = \"\"\n", "input: genofile, group_by = 1\n", "output: f'{cwd:a}/{step_name}/sample_{_index}_real_simulation_{n_causal}_ncausal_{n_trait}_trait.rds'\n", "R: expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n", " library(\"MASS\")\n", " library(\"plink2R\")\n", " library(\"dplyr\")\n", " library(\"readr\")\n", " library(\"tidyverse\")\n", " # source some functions to read matrix and inpute the missing data\n", " source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n", " source(\"~/cloud_colocalization/simulation_code/misc.R\")\n", " # read the plink file\n", " simu_file = ${_input:r}\n", " geno <- read_plink(${_input:nr})\n", " gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n", " gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n", " # filter by distance with. 
TSS\n", " TSS_pos = gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n", " keep_index = which(geno$bim$V4 > TSS_pos - 1500000 | geno$bim$V4 < TSS_pos + 1500000)\n", " geno$bed = geno$bed[,keep_index]\n", " # filter out columns with missing rate > 0.1\n", " imiss = 0.1\n", " # filter out columns with MAF < 0.05\n", " maf = 0.05\n", " Xmat = filter_X(geno$bed, imiss, maf)\n", " # only keep the first 4000 variants\n", " ncausal = ${n_causal}\n", " ntrait = ${n_trait}\n", " indep = ${\"TRUE\" if independent else \"FALSE\"}\n", " if (indep) {\n", " LD_vars = 1 # Initialize LD_vars\n", "\n", " if (ncausal == 1) {\n", " # If only one causal variant, just sample it\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " } else {\n", " # Repeat sampling until selected variables are quasi independent\n", " while (length(LD_vars != 0)) {\n", " vars = sample(1:ncol(Xmat), size = ncausal) \n", " cor_mat = cor(Xmat[, vars]) \n", " LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)\n", " }\n", " }\n", " } else {\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " }\n", " \n", " prop = readRDS(\"/home/hs3393/cloud_colocalization/simulation_code/trait17_prop.rds\")\n", " proportions = prop[ncausal,]\n", " sampled_name <- as.numeric(sample(names(proportions), size = ncausal, prob = proportions, replace = TRUE))\n", " phenotype = list()\n", " B = matrix(0, nrow = ncol(Xmat), ncol = ntrait)\n", " config = matrix(0, nrow = ntrait, ncol = ncausal)\n", " for(i in c(1:ncausal)){\n", " coloc_trait = sample(c(1:ntrait), sampled_name[i])\n", " config[coloc_trait, i] = 1\n", " }\n", " for(i in 1:nrow(config)){\n", " beta = B[,i, drop = FALSE]\n", " index = which(config[i,] == 1)\n", " if(length(index) > 0){\n", " causal_index = vars[index]\n", " beta = sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)\n", " B[, i] = beta\n", " pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)\n", " phenotype[[i]] = pheno_single$P\n", " }else{\n", " pheno_single = sim_multi_traits(G = Xmat, B = as.matrix(beta), h2g = ${h2g}, is_h2g_total = FALSE)\n", " phenotype[[i]] = pheno_single$P\n", " \n", " }\n", " }\n", " variant = list()\n", " for(i in 1:ncol(B)){\n", " variant[[i]] = which(B[,i] != 0)\n", " }\n", " X = Xmat\n", " Y = bind_cols(phenotype)\n", " colnames(Y) = paste0(\"Trait\", c(1:ntrait))\n", " data = list()\n", " data[[\"X\"]] = Xmat\n", " data[[\"Y\"]] = Y\n", " data[[\"variant\"]] = variant\n", " saveRDS(data, ${_output:r})" ] }, { "cell_type": "markdown", "id": "b90179bb-a31b-4b51-aa49-b0fde95853c0", "metadata": { "kernel": "SoS" }, "source": [ "### 5. 
Job submission bash file" ] }, { "cell_type": "markdown", "id": "45c4ca61-a038-4251-8f7e-d89ed6520c47", "metadata": { "kernel": "SoS" }, "source": [ "#### 2 trait" ] }, { "cell_type": "code", "execution_count": null, "id": "14e57874-f533-4fb2-b206-c721fc23257d", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "work_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n", "job=\"real_simulation_2trait\"\n", "mkdir -p ${work_dir}\n", "mkdir -p ${work_dir}/code\n", "mkdir -p ${work_dir}/log\n", "cd ${work_dir}/code\n", "\n", "# Create the base_script file and write the bash code into it\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 10:00:00\n", "#SBATCH --mem=30000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n", "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n", "\n", "source ~/mamba_activate.sh\n", "\n", "sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \\\n", " --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n", " --n_causal CAUSAL --mem 30G --h2g 0.05 --independent \\\n", " --cwd /home/hs3393/cb_Mar/simulation_data/\n", "\n", "EOF\n", "\n", "for ncausal in 1 2 3; do\n", " base_sh=\"base_script\"\n", " output_script=\"${job}_causal_${ncausal}.sh\"\n", " cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\"| sed \"s|CAUSAL|${ncausal}|g\" > ${output_script}\n", " sbatch ${output_script}\n", "done" ] }, { "cell_type": "markdown", "id": "426686b8-93e6-4d0d-b346-99a84523d442", "metadata": { "kernel": "SoS" }, "source": [ "#### 5 trait" ] }, { "cell_type": "code", "execution_count": null, "id": "21dec879-4b62-4f41-88f0-fb253089e352", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "work_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n", "job=\"real_simulation_5trait\"\n", "mkdir -p ${work_dir}\n", "mkdir -p ${work_dir}/code\n", "mkdir -p ${work_dir}/log\n", "cd ${work_dir}/code\n", "\n", "# Create the base_script file and write the bash code into it\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 10:00:00\n", "#SBATCH --mem=30000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n", "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n", "\n", "source ~/mamba_activate.sh\n", "\n", "sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \\\n", " --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n", " --n_causal CAUSAL --mem 30G --h2g 0.05 --independent \\\n", " --cwd /home/hs3393/cb_Mar/simulation_data/\n", "EOF\n", "\n", "for ncausal in 1 2 3 4; do\n", " base_sh=\"base_script\"\n", " output_script=\"${job}_causal_${ncausal}.sh\"\n", " cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\"| sed \"s|CAUSAL|${ncausal}|g\" > ${output_script}\n", " sbatch ${output_script}\n", "done" ] }, { "cell_type": "markdown", "id": "a64f0c9c-88d3-440f-bed9-ecfd51dc45a8", "metadata": { "kernel": "SoS" }, "source": [ "#### 10 trait " ] }, { "cell_type": "code", "execution_count": null, "id": "4eb3a707-ffef-4111-8969-be2bd1a20bc5", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "work_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n", "job=\"real_simulation_10trait\"\n", "mkdir -p ${work_dir}\n", "mkdir -p ${work_dir}/code\n", "mkdir -p ${work_dir}/log\n", "cd ${work_dir}/code\n", "\n", "# Create the base_script file and write the bash code into it\n", "cat << 'EOF' > base_script\n", 
"#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 10:00:00\n", "#SBATCH --mem=30000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n", "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n", "\n", "source ~/mamba_activate.sh\n", "\n", "sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \\\n", " --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n", " --n_causal CAUSAL --mem 30G --h2g 0.05 --independent \\\n", " --cwd /home/hs3393/cb_Mar/simulation_data/\n", "EOF\n", "\n", "for ncausal in 1 2 3 4 5; do\n", " base_sh=\"base_script\"\n", " output_script=\"${job}_causal_${ncausal}.sh\"\n", " cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\"| sed \"s|CAUSAL|${ncausal}|g\" > ${output_script}\n", " sbatch ${output_script}\n", "done" ] }, { "cell_type": "markdown", "id": "724f4b5b-345f-4bbd-8fdc-9e093e944e06", "metadata": { "kernel": "SoS" }, "source": [ "#### 20 trait" ] }, { "cell_type": "code", "execution_count": null, "id": "48cc19c1-fe51-4dab-aa96-4d310fb344ae", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "work_dir=\"/home/hs3393/cb_Mar/simulation_data/\"\n", "job=\"real_simulation_20trait\"\n", "mkdir -p ${work_dir}\n", "mkdir -p ${work_dir}/code\n", "mkdir -p ${work_dir}/log\n", "cd ${work_dir}/code\n", "\n", "# Create the base_script file and write the bash code into it\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 10:00:00\n", "#SBATCH --mem=30000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n", "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n", "\n", "source ~/mamba_activate.sh\n", "\n", "sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \\\n", " --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n", " --n_causal CAUSAL --mem 30G --h2g 0.05 --independent \\\n", " --cwd /home/hs3393/cb_Mar/simulation_data/\n", "EOF\n", "\n", "for ncausal in 1 2 3 4 5; do\n", " base_sh=\"base_script\"\n", " output_script=\"${job}_causal_${ncausal}.sh\"\n", " cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\"| sed \"s|CAUSAL|${ncausal}|g\" > ${output_script}\n", " sbatch ${output_script}\n", "done" ] }, { "cell_type": "markdown", "id": "68e7a5ee-ec53-4d21-99b9-0ccca0edecf7", "metadata": { "kernel": "R" }, "source": [ "### 6. Supplementary: all shared senario" ] }, { "cell_type": "code", "execution_count": null, "id": "d9de19c9-0214-45ca-99d6-9d38939ac954", "metadata": { "kernel": "SoS", "tags": [] }, "outputs": [], "source": [ "[all_share_simulation]\n", "parameter: genofile = paths\n", "# pheno_file: give genotype file (in plink),we can read the gentype matrix. 
These files are separated by TADs.\n", "parameter: cwd = path(\"output\")\n", "parameter: job_size = 30\n", "parameter: walltime = \"100h\"\n", "parameter: mem = \"40G\"\n", "parameter: numThreads = 1\n", "parameter: n_causal = 1\n", "# specify the number of causal variants\n", "parameter: n_trait = 50\n", "parameter: h2g = 0.05\n", "parameter: total_h2g = False\n", "parameter: independent = False\n", "parameter: share_pattern = \"all\"\n", "# specify the number of traits (phenotypes)\n", "parameter: container = \"\"\n", "input: genofile, group_by = 1\n", "output: f'{cwd:a}/sample_{_index}_h2g_{h2g}_ncausal_{n_causal}.rds'\n", "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n", "R: expand = '${ }', stdout = f\"{_output:n}.stdout\", stderr = f\"{_output:n}.stderr\", container = container \n", " library(\"MASS\")\n", " library(\"plink2R\")\n", " library(\"dplyr\")\n", " library(\"readr\")\n", " library(\"tidyverse\")\n", " # source some functions to read matrix and inpute the missing data\n", " source(\"~/cloud_colocalization/simulation_code/simulate_linreg.R\")\n", " source(\"~/cloud_colocalization/simulation_code/misc.R\")\n", " # read the plink file\n", " simu_file = ${_input:r}\n", " geno <- read_plink(${_input:nr})\n", " gene_name = str_extract(simu_file, \"ENSG[0-9]+\")\n", " gene_tss_map = read_tsv(\"/home/hs3393/coloc/fungen-xqtl-analysis/resource/gene_cis_TADB_mapper.tsv\")\n", "\n", " # Extract the TSS position for the given gene\n", " TSS_pos <- gene_tss_map$TSS[which(gene_tss_map$gene_id == gene_name)][1]\n", "\n", " # Define the genomic region around the TSS (+/- 1.5Mb) and filter SNPs\n", " keep_index <- which(geno$bim$V4 > (TSS_pos - 1500000) | geno$bim$V4 < (TSS_pos + 1500000))\n", " geno$bed <- geno$bed[, keep_index]\n", "\n", " # Set thresholds for filtering SNPs\n", " imiss <- 0.1 # Maximum missing rate allowed (10%)\n", " maf <- 0.05 # Minimum minor allele frequency (5%)\n", "\n", " # Filter SNPs based on missing rate and MAF\n", " Xmat <- filter_X(geno$bed, imiss, maf)\n", "\n", " # Define the number of causal variants and traits\n", " ncausal <- ${n_causal}\n", " ntrait <- ${n_trait}\n", " indep = ${\"TRUE\" if independent else \"FALSE\"}\n", " if (indep) {\n", " LD_vars = 1 # Initialize LD_vars\n", "\n", " if (ncausal == 1) {\n", " # If only one causal variant, just sample it\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " } else {\n", " # Repeat sampling until selected variables are quasi independent\n", " while (length(LD_vars != 0)) {\n", " vars = sample(1:ncol(Xmat), size = ncausal) \n", " cor_mat = cor(Xmat[, vars]) \n", " LD_vars = which(colSums(abs(cor_mat) > 0.3) > 1)\n", " }\n", " }\n", " } else {\n", " vars = sample(1:ncol(Xmat), size = ncausal)\n", " }\n", "\n", " # Initialize effect size matrix (B) with zeros\n", " B <- matrix(0, nrow = ncol(Xmat), ncol = ntrait)\n", "\n", " # Simulate effect sizes for the first two traits\n", " for (i in 1:ntrait) {\n", " causal_index <- vars\n", " beta <- sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)\n", " B[, i] <- beta\n", " }\n", "\n", " # save causal variants per trait\n", " variant <- list()\n", " for (i in 1:ncol(B)) {\n", " variant[[i]] <- which(B[, i] != 0)\n", " }\n", "\n", " # Simulate multi-trait phenotypes\n", " phenotype <- sim_multi_traits(\n", " G = Xmat,\n", " B = B,\n", " h2g = ${h2g},\n", " is_h2g_total = ${\"TRUE\" if total_h2g else \"FALSE\"}\n", " )\n", " phenotype <- 
phenotype$P\n", "\n", " data <- list()\n", " data[[\"X\"]] <- Xmat\n", " data[[\"Y\"]] <- phenotype\n", " data[[\"variant\"]] <- variant\n", "\n", " saveRDS(data, ${_output:r})" ] }, { "cell_type": "markdown", "id": "38f26334-8698-42eb-871d-3f0922723ae5", "metadata": { "kernel": "SoS" }, "source": [ "### 7. Supplementary job submission" ] }, { "cell_type": "code", "execution_count": null, "id": "54198b2d-52c2-4872-ac17-3d0ed13ac815", "metadata": { "kernel": "SoS" }, "outputs": [], "source": [ "work_dir=\"/home/hs3393/cb_Mar/simulation_data/all_share_simulation/\"\n", "job=\"all_share_simulation\"\n", "mkdir -p ${work_dir}\n", "mkdir -p ${work_dir}/code\n", "mkdir -p ${work_dir}/log\n", "cd ${work_dir}/code\n", "\n", "# Create the base_script file and write the bash code into it\n", "cat << 'EOF' > base_script\n", "#!/bin/bash -l\n", "# NOTE the -l flag!\n", "#\n", "#SBATCH -t 10:00:00\n", "#SBATCH --mem=30000\n", "#SBATCH -J JOB\n", "#SBATCH -o WORK_DIR/log/JOB.\"%j\".out\n", "#SBATCH -e WORK_DIR/log/JOB.\"%j\".err\n", "\n", "source ~/mamba_activate.sh\n", "\n", "sos run /home/hs3393/cb_Mar/simulation_code/1.Phenotype_simulation.ipynb JOB \\\n", " --genofile `ls /home/hs3393/cloud_colocalization/simulation_data/selected_genes_genotype/*.bim` \\\n", " --n_causal CAUSAL --mem 30G --h2g 0.05 --independent --n_trait 20 \\\n", " --cwd /home/hs3393/cb_Mar/simulation_data/JOB/result\n", "EOF\n", "\n", "for ncausal in 1 2 3 4; do\n", " base_sh=\"base_script\"\n", " output_script=\"${job}_causal_${ncausal}.sh\"\n", " cat ${base_sh}| sed \"s|WORK_DIR|${work_dir}|g\" |sed \"s|JOB|${job}|g\"| sed \"s|CAUSAL|${ncausal}|g\" > ${output_script}\n", " sbatch ${output_script}\n", "done" ] } ], "metadata": { "kernelspec": { "display_name": "SoS", "language": "sos", "name": "sos" }, "language_info": { "codemirror_mode": "sos", "file_extension": ".sos", "mimetype": "text/x-sos", "name": "sos", "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", "pygments_lexer": "sos" }, "sos": { "kernels": [ [ "Python3", "python3", "Python3", "#FFD91A", "" ], [ "R", "ir", "R", "", "r" ], [ "SoS", "sos", "", "", "sos" ] ], "version": "0.24.3" } }, "nbformat": 4, "nbformat_minor": 5 }