Genomic Relationship Matrices#

This workflow generates genomic relationship matrices (GRM) under the leave-one-chromosome-out (LOCO) theme.

Description#

For each chromosome, we compute a GRM using data excluding this chromosome. Computation is implemented using GCTA software package.

Running the grm pipeline executes four steps in sequence:

  1. Generate LOCO file list — for each chromosome, write a list of the per-chromosome genotype files of all other chromosomes (leave-one-chromosome-out).

  2. Compute LOCO-GRM — run GCTA on that leave-one-out set to produce the GRM for the held-out chromosome.

  3. Format output for APEX — reshape each GRM into the APEX-compatible kinship table.

  4. Generate APEX input list — write a master list mapping each chromosome to its APEX-formatted LOCO GRM.

Timing: ~5-10 min on typical compute infrastructure.

Input Files#

File

Description

input/genotype/protocol_example.genotype.list

Two-column, tab-separated list: chromosome number and the path to that chromosome’s PLINK 1.0 .bed file. One row per chromosome.

output/plink/protocol_example.genotype.chr<N>.{bed,bim,fam}

Per-chromosome PLINK 1.0 genotype files referenced by the list above. Common variants (MAF > 1%) are recommended for GRM.

Output Files#

File

Description

output/grm_uf/protocol_example.genotype.chr<N>.loco.txt

Per-chromosome leave-one-chromosome-out file list (all chromosomes except chr N).

output/grm_uf/protocol_example.genotype.chr<N>.loco.grm.gz

LOCO GRM for chr N, computed by GCTA from all other chromosomes.

output/grm_uf/protocol_example.genotype.chr<N>.loco.apex.grm

LOCO GRM reformatted for APEX (LMM) input.

output/grm_uf/protocol_example.genotype.loco_grm_list.txt

Master list mapping each chromosome to its APEX-formatted LOCO GRM.

Minimal Working Example#

sos run pipeline/GRM.ipynb grm \
    --cwd output/grm_uf \
    --genoFile input/genotype/protocol_example.genotype.list

Command interface#

sos run GRM.ipynb -h
[global]
# Work directory & output directory
parameter: cwd = path("output")
# The filename name for output data
parameter: container = ''
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
parameter: genoFile = paths
cwd = f"{cwd:a}"

## Code below are copied from genotype_formatting.ipynb. Change should be made to codes in that document and sync with this chunk.
from sos.utils import expand_size
import os

def get_genotype_file(geno_file_paths):
    #
    def valid_geno_file(x):
        suffixes = path(x).suffixes
        if suffixes[-1] == '.bed':
            return True
        if len(suffixes)>1 and ''.join(suffixes[-2:]) == ".vcf.gz":
            return True
        return False
    #
    def complete_geno_path(x, geno_file):
        if not valid_geno_file(x):
            raise ValueError(f"Genotype file {x} should be VCF (end with .vcf.gz) or PLINK bed file (end with .bed)")
        if not os.path.isfile(x):
            # relative path
            if not os.path.isfile(f'{geno_file:ad}/' + x):
                raise ValueError(f"Cannot find genotype file {x}")
            else:
                x = f'{geno_file:ad}/' + x
        return x
    # 
    def format_chrom(chrom):
        if chrom.startswith('chr'):
            chrom = chrom[3:]
        return chrom
    # Inputs are either VCF or bed, or a vector of them 
    if len(geno_file_paths) > 1:
        if all([valid_geno_file(x) for x in geno_file_paths]):
            return paths(geno_file_paths)
        else: 
            raise ValueError(f"Invalid input {geno_file_paths}")
    # Input is one genotype file or text list of genotype files
    geno_file = geno_file_paths[0]
    if valid_geno_file(geno_file):
        return path(geno_file)
    else: 
        units = [x.strip().split() for x in open(geno_file).readlines() if x.strip() and not x.strip().startswith('#')]
        if all([len(x) == 1 for x in units]):
            return paths([complete_geno_path(x[0], geno_file) for x in units])
        elif all([len(x) == 2 for x in units]):
            genos = dict([(format_chrom(x[0]), path(complete_geno_path(x[1], geno_file))) for x in units])
        else:
            raise ValueError(f"{geno_file} should contain one column of file names, or two columns of chrom number and corresponding file name")
        return genos
                        
genotypes = get_genotype_file(genoFile)

Step 1. Generate LOCO file list#

# Generate LOCO file list
[grm_1]
chrom = list(genotypes.keys())
input: for_each = 'chrom'
output: f'{cwd}/{genotypes[_chrom]:bn}.loco.txt'
with open(_output, 'w') as f:
    f.write('\n'.join([str(genotypes[x].with_suffix('')) for x in genotypes if x != _chrom]))

Step 2. Compute LOCO-GRM#

# Compute LOCO-GRM
[grm_2]
output: f'{_input:nn}.grm.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
   gcta64 \
   --mbfile $[_input] \
   --make-grm-gz \
   --out $[_output:nn]

Step 3. Format output for APEX#

# Format output for APEX
[grm_3]
output: f'{_input:nn}.apex.grm'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container

    # Load necessary libraries
    library(dplyr)
    library(readr)
    library(purrr)

    # Read data
    grm <- read_delim(gzfile("$[_input]"), '\t', col_names = FALSE) %>%
      select(1, 2, 4) %>%
      set_names(c("#id1", "id2", "kinship"))

    id <- read_delim("$[_input:n].id", '\t', col_names = FALSE)[, 2] %>%
      unlist()

    # Modify grm data
    grm_updated <- grm %>%
      mutate(`#id1` = map_chr(`#id1`, ~id[.x]),
             `id2` = map_chr(`id2`, ~id[.x]))

    # Write to output
    grm_updated %>%
      write_delim("$[_output]", '\t')

Step 4. Generate APEX input list#

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.

# Generate APEX input list
[grm_4]
input: group_by = "all"
output: f'{cwd}/{path(list(genotypes.values())[1]):bnn}.loco_grm_list.txt'
R: expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container

    # Load necessary libraries
    library(dplyr)
    library(readr)
    library(purrr)

    # Define chromosomal names
    chromosomes <- paste0("chr", c($[paths(genotypes.keys()):r,]), collapse = "")

    # Define input files and directory
    input_files <- c($[_input:r,])

    # Create and sort the tibble
    grm_tibble <- tibble(`#chr` = chromosomes, dir = input_files) %>%
      arrange(`#chr`)

    # Write to output
    grm_tibble %>%
      write_delim("$[_output]", "\t")