Genomic Relationship Matrices#
This workflow generates genomic relationship matrices (GRM) under the leave-one-chromosome-out (LOCO) theme.
Description#
For each chromosome, we compute a GRM using data excluding this chromosome. Computation is implemented using GCTA software package.
Running the grm pipeline executes four steps in sequence:
Generate LOCO file list — for each chromosome, write a list of the per-chromosome genotype files of all other chromosomes (leave-one-chromosome-out).
Compute LOCO-GRM — run
GCTAon that leave-one-out set to produce the GRM for the held-out chromosome.Format output for APEX — reshape each GRM into the APEX-compatible kinship table.
Generate APEX input list — write a master list mapping each chromosome to its APEX-formatted LOCO GRM.
Timing: ~5-10 min on typical compute infrastructure.
Input Files#
File |
Description |
|---|---|
|
Two-column, tab-separated list: chromosome number and the path to that chromosome’s PLINK 1.0 |
|
Per-chromosome PLINK 1.0 genotype files referenced by the list above. Common variants (MAF > 1%) are recommended for GRM. |
Output Files#
File |
Description |
|---|---|
|
Per-chromosome leave-one-chromosome-out file list (all chromosomes except chr N). |
|
LOCO GRM for chr N, computed by GCTA from all other chromosomes. |
|
LOCO GRM reformatted for APEX (LMM) input. |
|
Master list mapping each chromosome to its APEX-formatted LOCO GRM. |
Minimal Working Example#
sos run pipeline/GRM.ipynb grm \
--cwd output/grm_uf \
--genoFile input/genotype/protocol_example.genotype.list
Command interface#
sos run GRM.ipynb -h
[global]
# Work directory & output directory
parameter: cwd = path("output")
# The filename name for output data
parameter: container = ''
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
parameter: genoFile = paths
cwd = f"{cwd:a}"
## Code below are copied from genotype_formatting.ipynb. Change should be made to codes in that document and sync with this chunk.
from sos.utils import expand_size
import os
def get_genotype_file(geno_file_paths):
#
def valid_geno_file(x):
suffixes = path(x).suffixes
if suffixes[-1] == '.bed':
return True
if len(suffixes)>1 and ''.join(suffixes[-2:]) == ".vcf.gz":
return True
return False
#
def complete_geno_path(x, geno_file):
if not valid_geno_file(x):
raise ValueError(f"Genotype file {x} should be VCF (end with .vcf.gz) or PLINK bed file (end with .bed)")
if not os.path.isfile(x):
# relative path
if not os.path.isfile(f'{geno_file:ad}/' + x):
raise ValueError(f"Cannot find genotype file {x}")
else:
x = f'{geno_file:ad}/' + x
return x
#
def format_chrom(chrom):
if chrom.startswith('chr'):
chrom = chrom[3:]
return chrom
# Inputs are either VCF or bed, or a vector of them
if len(geno_file_paths) > 1:
if all([valid_geno_file(x) for x in geno_file_paths]):
return paths(geno_file_paths)
else:
raise ValueError(f"Invalid input {geno_file_paths}")
# Input is one genotype file or text list of genotype files
geno_file = geno_file_paths[0]
if valid_geno_file(geno_file):
return path(geno_file)
else:
units = [x.strip().split() for x in open(geno_file).readlines() if x.strip() and not x.strip().startswith('#')]
if all([len(x) == 1 for x in units]):
return paths([complete_geno_path(x[0], geno_file) for x in units])
elif all([len(x) == 2 for x in units]):
genos = dict([(format_chrom(x[0]), path(complete_geno_path(x[1], geno_file))) for x in units])
else:
raise ValueError(f"{geno_file} should contain one column of file names, or two columns of chrom number and corresponding file name")
return genos
genotypes = get_genotype_file(genoFile)
Step 1. Generate LOCO file list#
# Generate LOCO file list
[grm_1]
chrom = list(genotypes.keys())
input: for_each = 'chrom'
output: f'{cwd}/{genotypes[_chrom]:bn}.loco.txt'
with open(_output, 'w') as f:
f.write('\n'.join([str(genotypes[x].with_suffix('')) for x in genotypes if x != _chrom]))
Step 2. Compute LOCO-GRM#
# Compute LOCO-GRM
[grm_2]
output: f'{_input:nn}.grm.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
gcta64 \
--mbfile $[_input] \
--make-grm-gz \
--out $[_output:nn]
Step 3. Format output for APEX#
# Format output for APEX
[grm_3]
output: f'{_input:nn}.apex.grm'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
# Load necessary libraries
library(dplyr)
library(readr)
library(purrr)
# Read data
grm <- read_delim(gzfile("$[_input]"), '\t', col_names = FALSE) %>%
select(1, 2, 4) %>%
set_names(c("#id1", "id2", "kinship"))
id <- read_delim("$[_input:n].id", '\t', col_names = FALSE)[, 2] %>%
unlist()
# Modify grm data
grm_updated <- grm %>%
mutate(`#id1` = map_chr(`#id1`, ~id[.x]),
`id2` = map_chr(`id2`, ~id[.x]))
# Write to output
grm_updated %>%
write_delim("$[_output]", '\t')
Step 4. Generate APEX input list#
Anticipated Results#
The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.
# Generate APEX input list
[grm_4]
input: group_by = "all"
output: f'{cwd}/{path(list(genotypes.values())[1]):bnn}.loco_grm_list.txt'
R: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
# Load necessary libraries
library(dplyr)
library(readr)
library(purrr)
# Define chromosomal names
chromosomes <- paste0("chr", c($[paths(genotypes.keys()):r,]), collapse = "")
# Define input files and directory
input_files <- c($[_input:r,])
# Create and sort the tibble
grm_tibble <- tibble(`#chr` = chromosomes, dir = input_files) %>%
arrange(`#chr`)
# Write to output
grm_tibble %>%
write_delim("$[_output]", "\t")