Genotype Data Formatting

Genotype Data Formatting#

This module implements a collection of workflows used to format genotype data.

Description#

We include steps for the formatting of genotype files. This includes the conversion between VCF and PLINK formats, the splitting of data (by specified input, chromosomes or genes) and the merging of data (by specified input, or by chromosomes).

Input Files#

This minimal working example starts from the per-chromosome toy genotype VCFs under input/genotype/:

File	Description
`input/genotype/protocol_example.genotype.chr{1..22}.vcf.gz`	22 per-chromosome toy genotype VCFs (60 samples)

The pipeline also accepts a single whole-genome VCF or PLINK bed/bim/fam bundle, a list of VCF/PLINK files, or a two-column (chrom, file) list.

Output Files#

File	Description
`output/genotype_formatting/plink/protocol_example.genotype.chr{1..22}.{bed,bim,fam}`	Per-chromosome PLINK bundles (Step 1)
`output/genotype_formatting/plink/protocol_example.genotype.merged.{bed,bim,fam}`	Single merged genome-wide PLINK bundle (Step 2)
`output/genotype_formatting/genotype_by_chrom/protocol_example.genotype.merged.{1..22}.{bed,bim,fam}`	Merged bundle partitioned back by chromosome (Step 3)

The merged PLINK bundle is the input expected by downstream genotype workers (GWAS_QC, GRM, PCA).

Minimal Working Example#

The data and singularity container bioinfo.sif can be downloaded from Synapse.

Step 1. VCF to PLINK#

Convert the 22 per-chromosome toy VCFs into per-chromosome PLINK bed/bim/fam bundles. Timing: <1 min on the toy data.

Each input VCF becomes one PLINK bundle named after the input stem.

sos run pipeline/genotype_formatting.ipynb vcf_to_plink \
    --genoFile `ls input/genotype/protocol_example.genotype.chr*.vcf.gz | grep -vE "rawchr|withfmt|add_chr"` \
    --cwd output/genotype_formatting/plink \
    -j 4

Step 2. Merge PLINK Files#

Timing: <1 min on the toy data.

This step merges all per-chromosome PLINK files into one genome-wide bundle. Note: only plink (v1.9) can perform the merge — plink2 does not support it.

sos run pipeline/genotype_formatting.ipynb merge_plink \
    --genoFile `ls output/genotype_formatting/plink/protocol_example.genotype.chr*.bed` \
    --name protocol_example.genotype.merged \
    --cwd output/genotype_formatting/plink \
    -j 2

Step 3. Partition Genotype Data by Chromosome#

Split the merged genome-wide bundle back into per-chromosome PLINK files. Timing: <1 min on the toy data.

sos run pipeline/genotype_formatting.ipynb genotype_by_chrom \
    --genoFile output/genotype_formatting/plink/protocol_example.genotype.merged.bed \
    --cwd output/genotype_formatting/genotype_by_chrom \
    --chrom `cut -f 1 output/genotype_formatting/plink/protocol_example.genotype.merged.bim | uniq | sed "s/chr//g"` \
    -j 4

Command Interface#

sos run genotype_formatting.ipynb -h

[global]
parameter: modular_script_dir = path('code/script')  # override with --modular-script-dir
# Work directory & output directory
parameter: cwd = path("output")
# The filename name for containers
parameter: container = ''
import re
parameter: entrypoint= ""
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
# the path to a bed file or VCF file, a vector of bed files or VCF files, or a text file listing the bed files or VCF files to process
parameter: genoFile = paths
parameter: name = ""
# use this function to edit memory string for PLINK input
from sos.utils import expand_size
cwd = f"{cwd:a}"

import os
def get_genotype_file(geno_file_paths):
    def valid_geno_file(x):
        suffixes = path(x).suffixes
        if suffixes[-1] in ['.bed', '.pgen']:
            return True
        if len(suffixes) > 1 and ''.join(suffixes[-2:]) == ".vcf.gz":
            return True
        return False
    #
    def complete_geno_path(x, geno_file):
        if not valid_geno_file(x):
            raise ValueError(f"Genotype file {x} should be VCF (end with .vcf.gz) or PLINK bed file (end with .bed)")
        if not os.path.isfile(x):
            # relative path
            if not os.path.isfile(f'{geno_file:ad}/' + x):
                raise ValueError(f"Cannot find genotype file {x}")
            else:
                x = f'{geno_file:ad}/' + x
        return x
    # 
    def format_chrom(chrom):
        if chrom.startswith('chr'):
            chrom = chrom[3:]
        return chrom
    # Inputs are either VCF or bed, or a vector of them 
    if len(geno_file_paths) > 1:
        if all([valid_geno_file(x) for x in geno_file_paths]):
            return paths(geno_file_paths)
        else: 
            raise ValueError(f"Invalid input {geno_file_paths}")
    # Input is one genotype file or text list of genotype files
    geno_file = geno_file_paths[0]
    if valid_geno_file(geno_file):
        return paths(geno_file)
    else: 
        units = [x.strip().split() for x in open(geno_file).readlines() if x.strip() and not x.strip().startswith('#')]
        if all([len(x) == 1 for x in units]):
            return paths([complete_geno_path(x[0], geno_file) for x in units])
        elif all([len(x) == 2 for x in units]):
            genos = dict([(format_chrom(x[0]), path(complete_geno_path(x[1], geno_file))) for x in units])
        else:
            raise ValueError(f"{geno_file} should contain one column of file names, or two columns of chrom number and corresponding file name")
        return genos

# Determine if the file is in PLINK1 (BED/BIM/FAM) or PLINK2 (PGEN/PVAR/PSAM) format
def determine_plink_format(file_path):
    """
    Determine the PLINK file format based on file extensions and companion files.
    
    Args:
        file_path (str or Path): Path to the input file
    
    Returns:
        str: 'plink1' or 'plink2'
    """
    # Convert to string if it's a Path object
    file_path = str(file_path)
    
    # Check direct file extensions
    if file_path.endswith('.bed'):
        return 'plink1'
    elif file_path.endswith('.pgen'):
        return 'plink2'
    
    # If the file doesn't have a standard extension, try to infer format
    try:
        # Remove the file extension if present
        base_path = file_path.rsplit('.', 1)[0] if '.' in file_path else file_path
        
        # Check for PLINK1 companion files
        plink1_companion_files = [
            f"{base_path}.bim",
            f"{base_path}.fam"
        ]
        
        # Check for PLINK2 companion files
        plink2_companion_files = [
            f"{base_path}.pvar",
            f"{base_path}.psam"
        ]
        
        # Check PLINK1 format
        if all(os.path.exists(f) for f in plink1_companion_files):
            return 'plink1'
        
        # Check PLINK2 format
        if all(os.path.exists(f) for f in plink2_companion_files):
            return 'plink2'
    
    except Exception as e:
        print(f"Error determining PLINK format: {e}")
    
    # Default to PLINK1 if can't determine
    return 'plink1'

# Get the appropriate PLINK command prefix
def get_plink_command_prefix(file_path):
    format_type = determine_plink_format(file_path)
    if format_type == 'plink1':
        return "--bfile"
    else:  # plink2
        return "--pfile"

# Get output file extension based on format
def get_output_extension(format_type):
    if format_type == 'plink1':
        return "bed"
    else:  # plink2
        return "pgen"       
     
# Choose the make-bed or make-pgen command based on desired output format
def get_make_command(output_format):
    if output_format == 'plink1':
        return '--make-bed'
    else:  # plink2
        return '--make-pgen'
genoFile = get_genotype_file(genoFile)

PLINK to VCF#

[plink_to_vcf_1]
if isinstance(genoFile, dict):
    genoFile = genoFile.values()

input: genoFile, group_by = 1
output: f'{cwd}/{_input:bn}.vcf.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout', container = container, entrypoint=entrypoint
    bash ${modular_script_dir}/data_preprocessing/genotype/genotype_formatting.sh plink_to_vcf_1 \
        --genoFile "${_input}" \
        --output "${_output}" \
        --mem "${mem}" \
        --numThreads ${numThreads}

bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    python3 ${modular_script_dir}/data_preprocessing/genotype/genotype_formatting.py \
        --step vcf_gz_summary \
        --data-files "${_output}"

VCF to PLINK#

Export VCF files to PLINK 1.0 format, without keeping allele orders by default. The resulting PLINK will lose ref/alt allele information but will go by major/minor allele, as conventionally used in standard PLINK format. Notice that PLINK 1.0 format does not allow for dosages. PLINK 2.0 format support it, but it is generally not supported by downstreams data analysis.

In the following code block the option --vcf-half-call m treat half-call as missing.

[vcf_to_plink]
parameter: remove_duplicates = False
parameter: add_chr = True
parameter: keep_dosage = False
# The path to the file that contains the list of samples to remove (format FID, IID)
parameter: remove_samples = path('.')
# The path to the file that contains the list of samples to keep (format FID, IID)
parameter: keep_samples = path('.')
fail_if(not (keep_samples.is_file() or keep_samples == path('.')), msg = f'Cannot find ``{keep_samples}``')
fail_if(not (remove_samples.is_file() or remove_samples == path('.')), msg = f'Cannot find ``{remove_samples}``')

if isinstance(genoFile, dict):
    genoFile = genoFile.values()

input: genoFile, group_by = 1
output: f'{cwd}/{_input:bnn}.{"pgen" if keep_dosage else "bed"}'

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/data_preprocessing/genotype/genotype_formatting.sh vcf_to_plink \
        --cwd "${cwd}" \
        --genoFile "${_input}" \
        --numThreads ${numThreads}

Split PLINK genotypes into specified regions#

[genotype_by_region_1]
# cis window size
parameter: window = 0
# Region definition
parameter: region_list = path
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]
input: genoFile, for_each = 'regions'
output: f'{cwd}/{region_list:bn}_genotype_by_region/{_input:bn}.{_regions[3]}.bed'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
    bash ${modular_script_dir}/data_preprocessing/genotype/genotype_formatting.sh genotype_by_region_1 \
        --genoFile "${_input}" \
        --output "${_output}" \
        --region-chrom "${_regions[0]}" \
        --region-start "${_regions[1]}" \
        --region-end "${_regions[2]}" \
        --window ${window} \
        --numThreads ${numThreads}

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
    python3 ${modular_script_dir}/data_preprocessing/genotype/genotype_formatting.py \
        --step file_size_summary \
        --data-files "${_output}"

Compute LD matrices for given input region#

PLINK based implementation#

FIXME: Hao, I suggest including all contents for LD matrix storage type benchmarking into this repo, so we justify why we would like to save the data as square 0, float 16 and using npz format. Perhaps we should start a folder called “code/auxillary” to keep notebooks such as these? You can then remove what you have in the brain-xqtl-analysis repository after you migrate all the relevant contents here.

[ld_by_region_plink_1]
# Region definition
parameter: region_list = path
parameter: float_type = 16
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]
input: genoFile, for_each = 'regions'
output: f'{cwd}/{region_list:bn}_LD/{_input:bn}.{_regions[0]}_{_regions[1]}_{_regions[2]}.float{float_type}.npz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/data_preprocessing/genotype/genotype_formatting.py \
        --step ld_by_region_plink_1 \
        --genoFile "${_input}" \
        --region-chrom "${_regions[0]}" \
        --region-start "${_regions[1]}" \
        --region-end "${_regions[2]}" \
        --float-type ${float_type} \
        --output "${_output}" \
        --numThreads ${numThreads}

ldstore2 based implementation#

This is good for larger sample sizes such as data from UK Biobank although we are not facing this challenge in the FunGen-xQTL project.

FIXME: we need to build ldstore2 into a container image. According to Diana it should be

pip3 install https://files.pythonhosted.org/packages/a8/fd/f98ab7dea176f42cb61b80450b795ef19b329e8eb715b87b0d13c2a0854d/ldstore-0.1.9.tar.gz 

FIXME: Diana, what’s the input for this workflow?

Create `master` file for `ldstore2`#

The master file is a semicolon-separated text file and contains no space. It contains the following mandatory column names and one dataset per line:

FIXME: Diana, this documentation is not clearly written. I cannot understand it. What are the mandatory column names? What does it mean by one data-set per line?

For the Z file, the format should be rsid:chrom:pos:a1:a2. Formatting for chromosome should be 01,02,03 etc
List of samples

The LDstore draft is currently availale here with the code to prepare for the genotypic input here. A minimal working example can be found [here]

Split PLINK by Chromosome#

Pls note that .pvar file have meta_file headers before the variant info dataframe#

sos run /home/rl3328/xqtl-protocol/code/data_preprocessing/genotype/genotype_formatting.ipynb genotype_by_chrom
–genoFile /home/rl3328/motor_data/geno_data/final/LD_pruning/chrAll_GRCh38_biallelic.motor_eQTL.plink_qc.pgen
–cwd /home/rl3328/motor_data/geno_data/final/genotype_by_chrom
–chrom grep -v '^#' /home/rl3328/motor_data/geno_data/final/LD_pruning/chrAll_GRCh38_biallelic.motor_eQTL.plink_qc.pvar | cut -f 1 | uniq | sed "s/chr//g"

[genotype_by_chrom_1]
stop_if(len(paths(genoFile))>1, msg = "This workflow expects one input genotype file.")
parameter: chrom = list
chrom = list(set(chrom))
input: genoFile, for_each = "chrom"
plink_command = get_plink_command_prefix(_input)
output_format = determine_plink_format(_input)
output_ext = get_output_extension(output_format)
make_command = get_make_command(output_format)
output: f'{cwd}/{_input:bn}.{_chrom}.{output_ext}'
# look up for genotype file
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/data_preprocessing/genotype/genotype_formatting.sh genotype_by_chrom \
        --cwd "${cwd}" \
        --genoFile "${_input}" \
        --name "${name}" \
        --output "${_output}" \
        --plink-command "${plink_command}" \
        --make-command "${make_command}" \
        --output-format "${output_format}" \
        --chrom ${_chrom} \
        --mem "${mem}" \
        --numThreads ${numThreads}

[genotype_by_chrom_2]
input: group_by = "all"
output_format = determine_plink_format(_input)
output_ext = get_output_extension(output_format)
output: f'{_input[0]:nn}.{step_name[:-2]}_files.txt'
sos_run("write_data_list", data_files = _input, out = _output, ext = f"{output_ext}")

[plink_to_vcf_2]
input: group_by = "all"
output: f'{_input[0]:nnn}.{step_name[:-2]}_files.txt'
sos_run("write_data_list", data_files = _input, out = _output, ext = "vcf.gz")

[genotype_by_region_2]
input: group_by = "all"
output: f'{_input[0]:nn}.{step_name[:-2]}_files.txt'
sos_run("write_data_list", data_files = _input, out = _output, ext = "bed")

[ld_by_region_*_2]
parameter: region_list = path
input: group_by = "all"
output: f'{cwd}/{region_list:bn}_LD/{genoFile:bn}.ld.list'
sos_run("write_data_list", data_files = _input, out = _output, ext = "npy.gz")

[write_data_list]
parameter: out = path
parameter: ext = str
parameter: data_files = paths
input: data_files
output: out
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/data_preprocessing/genotype/genotype_formatting.py \
        --step write_data_list \
        --data-files "${"::".join([str(x) for x in _input])}" \
        --ext "${ext}" \
        --output "${_output}" \
        --numThreads ${numThreads}

Split VCF by Chromosome#

FIXME: add this as needed

Merge PLINK files#

[merge_plink]
skip_if(len(genoFile) == 1)
# File prefix for the analysis output
parameter: name = str
# The path to the file that contains the list of samples to keep (format FID, IID)
parameter: keep_samples = path('.')
parameter: extra_plink_opts = []
parameter: mem = '300G'
input: genoFile, group_by = 'all'
output: f"{cwd}/{name}{_input[0]:x}"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/data_preprocessing/genotype/genotype_formatting.sh merge_plink \
        --cwd "${cwd}" \
        --genoFile "${_input}" \
        --name "${name}" \
        --mem "${mem}" \
        --numThreads ${numThreads}

Merge VCF files#

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.

[merge_vcf]
skip_if(len(genoFile) == 1)
# File prefix for the analysis output
parameter: name = str
input: genoFile, group_by = 'all'
output:  f"{cwd}/{name}.vcf.gz"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
    bash ${modular_script_dir}/data_preprocessing/genotype/genotype_formatting.sh merge_vcf \
        --genoFile "${_input}" \
        --output "${_output}"

bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    python3 ${modular_script_dir}/data_preprocessing/genotype/genotype_formatting.py \
        --step vcf_gz_summary \
        --data-files "${_output}"