This function loads a mixture of data sets for a specific region, including individual-level data (genotype, phenotype, and covariate data) and summary statistics (sumstats, LD). It runs load_regional_univariate_data and load_rss_data multiple times for the different datasets.

Usage

load_multitask_regional_data(
  region,
  genotype_list = NULL,
  phenotype_list = NULL,
  covariate_list = NULL,
  conditions_list_individual = NULL,
  match_geno_pheno = NULL,
  maf_cutoff = 0,
  mac_cutoff = 0,
  xvar_cutoff = 0,
  imiss_cutoff = 0,
  association_window = NULL,
  extract_region_name = NULL,
  region_name_col = NULL,
  keep_indel = TRUE,
  keep_samples = NULL,
  keep_variants = NULL,
  phenotype_header = 4,
  scale_residuals = FALSE,
  tabix_header = TRUE,
  sumstat_path_list = NULL,
  column_file_path_list = NULL,
  LD_meta_file_path_list = NULL,
  match_LD_sumstat = NULL,
  conditions_list_sumstat = NULL,
  n_samples = 0,
  n_cases = 0,
  n_controls = 0,
  extract_sumstats_region_name = NULL,
  sumstats_region_name_col = NULL,
  comment_string = "#",
  extract_coordinates = NULL
)

Arguments

region

The region that tabix uses to subset the input dataset.

genotype_list

A vector of PLINK bed files containing genotype data.

phenotype_list

A vector of phenotype file names.

covariate_list

A vector of covariate file names corresponding to the phenotype file vector.

conditions_list_individual

A vector of strings representing different conditions or groups.

match_geno_pheno

A vector of indices matching phenotypes to genotypes, used when there are multiple genotype PLINK files.
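The sketch below illustrates one way the matching vector could be laid out. It assumes, which this page does not state explicitly, that match_geno_pheno has one entry per phenotype file and that each entry is the position of the corresponding file in genotype_list. All file names are hypothetical.

```r
# Hypothetical inputs: two genotype files and three phenotype files.
genotype_list <- c("cohortA.bed", "cohortB.bed")
phenotype_list <- c("A_tissue1.bed.gz", "A_tissue2.bed.gz", "B_tissue1.bed.gz")
covariate_list <- c("A_tissue1.cov.gz", "A_tissue2.cov.gz", "B_tissue1.cov.gz")
conditions_list_individual <- c("A_tissue1", "A_tissue2", "B_tissue1")

# ASSUMPTION: entry i is the index into genotype_list for phenotype i.
match_geno_pheno <- c(1, 1, 2)

# Basic consistency checks before passing the vectors on.
stopifnot(
  length(phenotype_list) == length(covariate_list),
  length(phenotype_list) == length(conditions_list_individual),
  length(phenotype_list) == length(match_geno_pheno),
  all(match_geno_pheno %in% seq_along(genotype_list))
)

# Group the conditions by the genotype file they are matched to.
by_genotype <- split(conditions_list_individual, genotype_list[match_geno_pheno])
print(by_genotype)
```

Checking the lengths up front makes a mismatch between the phenotype, covariate, and condition vectors fail early rather than inside the loader.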

maf_cutoff

Minimum minor allele frequency (MAF) cutoff. Default is 0.

mac_cutoff

Minimum minor allele count (MAC) cutoff. Default is 0.

xvar_cutoff

Minimum variance cutoff. Default is 0.

imiss_cutoff

Maximum individual missingness cutoff. Default is 0.

association_window

A string of chr:start-end for the association analysis window (cis or trans). If not provided, all genotype data will be loaded.
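As an illustration of the chr:start-end format, the base R sketch below builds and parses such a string. The helper names make_window and parse_window are hypothetical and are not part of the package.

```r
# Build a "chr:start-end" string from its parts (hypothetical helper).
make_window <- function(chrom, start, end) {
  stopifnot(start <= end)
  # scientific = FALSE avoids output such as "1e+06" for large positions
  paste0(
    chrom, ":",
    format(start, scientific = FALSE, trim = TRUE), "-",
    format(end, scientific = FALSE, trim = TRUE)
  )
}

# Split a "chr:start-end" string back into its parts (hypothetical helper).
parse_window <- function(window) {
  m <- regmatches(window, regexec("^([^:]+):([0-9]+)-([0-9]+)$", window))[[1]]
  if (length(m) != 4) stop("expected a string of the form chr:start-end")
  list(chrom = m[2], start = as.numeric(m[3]), end = as.numeric(m[4]))
}

window <- make_window("chr1", 1000000, 2000000)
parsed <- parse_window(window)
print(window)
```

Formatting positions without scientific notation matters because R prints a value such as 1000000 as 1e+06 by default, which would not match the expected format.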

extract_region_name

A list of vectors of strings (e.g., gene ID ENSG00000269699) to subset the information when there are multiple regions available. Default is NULL.

region_name_col

Column name containing the region name. Default is NULL.

keep_indel

Logical indicating whether to keep insertions/deletions (INDELs). Default is TRUE.

keep_samples

A vector of sample names to keep. Default is NULL.

phenotype_header

Number of rows to skip at the beginning of the transposed phenotype file (default is 4 for chr, start, end, and ID).

scale_residuals

Logical indicating whether to scale residuals. Default is FALSE.

tabix_header

Logical indicating whether the tabix file has a header. Default is TRUE.

sumstat_path_list

A vector of file paths to the summary statistics.

column_file_path_list

A vector of file paths to the column files used for column mapping.

LD_meta_file_path_list

A vector of paths to LD metadata files. Each LD metadata file is a data frame specifying LD blocks with columns "chrom", "start", "end", and "path". "start" and "end" denote the positions of the LD blocks. "path" is the path to each LD block, optionally including bim file paths.
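A minimal sketch of an LD metadata table with the four columns named above. The block paths are hypothetical, and writing the table as a tab-delimited file with a header is an assumption, since the on-disk format is not specified here.

```r
# LD metadata with the documented columns; block paths are hypothetical.
# Integer positions keep write.table from emitting values like "1e+06".
ld_meta <- data.frame(
  chrom = c("chr1", "chr1"),
  start = c(1000000L, 2000000L),
  end   = c(2000000L, 3000000L),
  path  = c("ld/chr1_1000000_2000000.ld.gz", "ld/chr1_2000000_3000000.ld.gz"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: tab-delimited text with a header row.
ld_meta_path <- tempfile(fileext = ".tsv")
write.table(ld_meta, ld_meta_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back to confirm the layout survived the round trip.
ld_back <- read.table(ld_meta_path, header = TRUE, sep = "\t",
                      stringsAsFactors = FALSE)
print(ld_back)

# The path would then be supplied as one element of LD_meta_file_path_list.
LD_meta_file_path_list <- c(ld_meta_path)
```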

match_LD_sumstat

A vector of indices matching sumstats to LD, used when there are multiple sumstat files.

conditions_list_sumstat

A vector of strings representing different sumstats.

n_samples

User-specified sample size. If unknown, set to 0 to retrieve it from the sumstat file.

n_cases

User-specified number of cases.

n_controls

User-specified number of controls.

extract_sumstats_region_name

User-specified gene/phenotype name used to further subset the phenotype data.

sumstats_region_name_col

The column to filter on when matching extract_sumstats_region_name.

comment_string

Comment character in the column mapping file. Default is "#".

extract_coordinates

Optional data frame with columns "chrom" and "pos" for specific coordinates extraction.

Value

A list containing individual_data and sumstat_data. individual_data contains the following components, if they exist:

  • residual_Y: A list of residualized phenotype values (either a vector or a matrix).

  • residual_X: A list of residualized genotype matrices for each condition.

  • residual_Y_scalar: Scaling factor for residualized phenotype values.

  • residual_X_scalar: Scaling factor for residualized genotype values.

  • dropped_sample: A list of dropped samples for X, Y, and covariates.

  • covar: Covariate data.

  • Y: Original phenotype data.

  • X_data: Original genotype data.

  • X: Filtered genotype matrix.

  • maf: Minor allele frequency (MAF) for each variant.

  • chrom: Chromosome of the region.

  • grange: Genomic range of the region (start and end positions).

  • Y_coordinates: Phenotype coordinates if a region is specified.

sumstat_data contains the following components, if they exist:

  • sumstats: A list of summary statistics for the matched LD_info; each sublist contains sumstats, n, and var_y from load_rss_data.

  • LD_info: A list of LD information; each sublist contains combined_LD_variants, combined_LD_matrix, and ref_panel from load_LD_matrix.
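Because components are only present "if they exist", downstream code may want to access them defensively. The sketch below uses a hand-built mock list that only imitates the documented shape. The values, the condition name cond1, and the choice of NULL for a missing block are all assumptions, and get_component is a hypothetical helper.

```r
# Mock object imitating the documented shape; NOT real output.
mock_result <- list(
  individual_data = list(
    residual_Y = list(cond1 = matrix(c(0.1, -0.2, 0.3), ncol = 1)),
    residual_X = list(cond1 = matrix(c(0, 1, 2, 1, 0, 1), nrow = 3))
  ),
  sumstat_data = NULL  # ASSUMPTION: absent when no summary statistics are given
)

# Return a named component, or NULL when the block or component is missing.
get_component <- function(result, block, name) {
  part <- result[[block]]
  if (is.null(part)) return(NULL)
  part[[name]]
}

res_Y <- get_component(mock_result, "individual_data", "residual_Y")
sumstats <- get_component(mock_result, "sumstat_data", "sumstats")

# Number of samples per condition in the residualized phenotypes.
n_obs <- vapply(res_Y, NROW, integer(1))
print(n_obs)
```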

Loading individual-level data from multiple cohorts

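A minimal sketch of assembling the individual-level arguments, using only argument names from the Usage section. All file names and cutoff values are hypothetical, and whether genotype paths should include the .bed extension is not stated on this page. The call itself is guarded so the snippet runs even when the function or the files are not available.

```r
# Hypothetical inputs: one genotype file shared by two phenotype files.
individual_args <- list(
  region = "chr1:1000000-2000000",
  genotype_list = "cohortA.bed",
  phenotype_list = c("cohortA_tissue1.bed.gz", "cohortA_tissue2.bed.gz"),
  covariate_list = c("cohortA_tissue1.cov.gz", "cohortA_tissue2.cov.gz"),
  conditions_list_individual = c("tissue1", "tissue2"),
  maf_cutoff = 0.01,    # illustrative value, not a recommendation
  imiss_cutoff = 0.1,   # illustrative value, not a recommendation
  keep_indel = FALSE
)

# One covariate file and one condition label per phenotype file.
stopifnot(
  length(individual_args$phenotype_list) == length(individual_args$covariate_list),
  length(individual_args$phenotype_list) ==
    length(individual_args$conditions_list_individual)
)

# Only attempt the call when the function is loaded and the inputs exist.
inputs_present <- all(file.exists(c(
  individual_args$genotype_list,
  individual_args$phenotype_list,
  individual_args$covariate_list
)))
if (exists("load_multitask_regional_data", mode = "function") && inputs_present) {
  res <- do.call(load_multitask_regional_data, individual_args)
}
```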

Loading summary statistics from multiple cohorts or data sets

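A minimal sketch of assembling the summary-statistics arguments, again using only argument names from the Usage section. The file names are hypothetical. A single LD metadata file is used so that match_LD_sumstat can stay at its default, and n_samples is left at 0 so the sample size is retrieved from the sumstat file as described above.

```r
# Hypothetical inputs: two sumstat files sharing one LD reference.
sumstat_args <- list(
  region = "chr1:1000000-2000000",
  sumstat_path_list = c("gwas1.tsv.gz", "gwas2.tsv.gz"),
  column_file_path_list = c("gwas1_columns.txt", "gwas2_columns.txt"),
  LD_meta_file_path_list = "ld_meta.tsv",
  conditions_list_sumstat = c("gwas1", "gwas2"),
  n_samples = 0,        # 0 means: retrieve the sample size from the file
  comment_string = "#"
)

# One column-mapping file and one label per sumstat file.
stopifnot(
  length(sumstat_args$sumstat_path_list) ==
    length(sumstat_args$column_file_path_list),
  length(sumstat_args$sumstat_path_list) ==
    length(sumstat_args$conditions_list_sumstat)
)

# Only attempt the call when the function is loaded and the inputs exist.
sumstat_inputs_present <- all(file.exists(c(
  sumstat_args$sumstat_path_list,
  sumstat_args$column_file_path_list,
  sumstat_args$LD_meta_file_path_list
)))
if (exists("load_multitask_regional_data", mode = "function") &&
    sumstat_inputs_present) {
  res_sumstat <- do.call(load_multitask_regional_data, sumstat_args)
}
```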