This function loads a mixture of data sets for a specific region, including individual-level data (genotype, phenotype, covariate data) and/or summary statistics (sumstats, LD). It runs load_regional_univariate_data and load_rss_data multiple times, once for each dataset.

Usage

load_multitask_regional_data(
  region,
  genotype_list = NULL,
  phenotype_list = NULL,
  covariate_list = NULL,
  conditions_list_individual = NULL,
  match_geno_pheno = NULL,
  maf_cutoff = 0,
  mac_cutoff = 0,
  xvar_cutoff = 0,
  imiss_cutoff = 0,
  association_window = NULL,
  extract_region_name = NULL,
  region_name_col = NULL,
  keep_indel = TRUE,
  keep_samples = NULL,
  keep_variants = NULL,
  phenotype_header = 4,
  scale_residuals = FALSE,
  tabix_header = TRUE,
  sumstat_path_list = NULL,
  column_file_path_list = NULL,
  LD_meta_file_path_list = NULL,
  match_LD_sumstat = NULL,
  conditions_list_sumstat = NULL,
  n_samples = 0,
  n_cases = 0,
  n_controls = 0,
  extract_sumstats_region_name = NULL,
  sumstats_region_name_col = NULL,
  comment_string = "#",
  extract_coordinates = NULL
)

Arguments

region: The region that tabix uses to subset the input dataset.
genotype_list: A vector of PLINK bed file paths containing genotype data.
phenotype_list: A vector of phenotype file names.
covariate_list: A vector of covariate file names corresponding to the phenotype file vector.
conditions_list_individual: A vector of strings representing different conditions or groups.
match_geno_pheno: A vector of indices of phenotypes matched to genotypes when there are multiple genotype PLINK files.
maf_cutoff: Minimum minor allele frequency (MAF) cutoff. Default is 0.
mac_cutoff: Minimum minor allele count (MAC) cutoff. Default is 0.
xvar_cutoff: Minimum variance cutoff. Default is 0.
imiss_cutoff: Maximum individual missingness cutoff. Default is 0.
association_window: A string in chr:start-end format for the association analysis window (cis or trans). If not provided, all genotype data will be loaded.
extract_region_name: A list of vectors of strings (e.g., gene ID ENSG00000269699) to subset the information when there are multiple regions available. Default is NULL.
region_name_col: Column name containing the region name. Default is NULL.
keep_indel: Logical indicating whether to keep insertions/deletions (INDELs). Default is TRUE.
keep_samples: A vector of sample names to keep. Default is NULL.
phenotype_header: Number of rows to skip at the beginning of the transposed phenotype file (default is 4 for chr, start, end, and ID).
scale_residuals: Logical indicating whether to scale residuals. Default is FALSE.
tabix_header: Logical indicating whether the tabix file has a header. Default is TRUE.
sumstat_path_list: A vector of file paths to the summary statistics.
column_file_path_list: A vector of file paths to the column files used for mapping.
LD_meta_file_path_list: A vector of paths to LD metadata files. Each LD metadata file is a data frame specifying LD blocks with columns "chrom", "start", "end", and "path". "start" and "end" denote the positions of LD blocks. "path" is the path of each LD block, optionally including bim file paths.
match_LD_sumstat: A vector of indices of sumstats matched to LD when there are multiple sumstat files.
conditions_list_sumstat: A vector of strings representing different sumstats.
n_samples: User-specified sample size. If unknown, set as 0 to retrieve from the sumstat file.
n_cases: User-specified number of cases.
n_controls: User-specified number of controls.
extract_sumstats_region_name: User-specified gene/phenotype name used to further subset the phenotype data.
sumstats_region_name_col: Filter this specific column for the extract_sumstats_region_name.
comment_string: Comment character in the column mapping file. Default is "#".
extract_coordinates: Optional data frame with columns "chrom" and "pos" for specific coordinates extraction.
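
The table-shaped arguments described above can be prepared in plain R before the call. The following is a minimal sketch, not the package's own example: the block boundaries, file paths, and the "chr1" chromosome label are placeholders, and writing the LD metadata as a tab-delimited file is an assumption about the expected on-disk format.

```r
# Hypothetical LD block metadata with the four documented columns.
# The paths are placeholders and do not point to real LD files.
ld_meta <- data.frame(
  chrom = c("chr1", "chr1"),
  start = c(1000000L, 2000000L),
  end   = c(2000000L, 3000000L),
  path  = c("ld/block1.ld.gz", "ld/block2.ld.gz"),
  stringsAsFactors = FALSE
)

# Assumption: the metadata file is tab-delimited with a header row.
# Its path is what would go into LD_meta_file_path_list.
ld_meta_file <- tempfile(fileext = ".tsv")
write.table(ld_meta, ld_meta_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Optional coordinates to extract, with the documented "chrom" and "pos" columns.
extract_coordinates <- data.frame(
  chrom = c("chr1", "chr1"),
  pos   = c(1200000L, 1500000L),
  stringsAsFactors = FALSE
)

# Association window in the documented chr:start-end form.
association_window <- sprintf("%s:%d-%d", "chr1", 1000000L, 3000000L)
```

Building the window string with sprintf keeps the positions as integers, which avoids scientific notation such as 1e+06 ending up in the region string.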

Value

A list containing individual_data and sumstat_data. individual_data contains the following components, if they exist:

residual_Y: A list of residualized phenotype values (either a vector or a matrix).
residual_X: A list of residualized genotype matrices for each condition.
residual_Y_scalar: Scaling factor for residualized phenotype values.
residual_X_scalar: Scaling factor for residualized genotype values.
dropped_sample: A list of dropped samples for X, Y, and covariates.
covar: Covariate data.
Y: Original phenotype data.
X_data: Original genotype data.
X: Filtered genotype matrix.
maf: Minor allele frequency (MAF) for each variant.
chrom: Chromosome of the region.
grange: Genomic range of the region (start and end positions).
Y_coordinates: Phenotype coordinates if a region is specified.

sumstat_data contains the following components, if they exist:

sumstats: A list of summary statistics for the matched LD_info. Each sublist contains sumstats, n, and var_y from load_rss_data.
LD_info: A list of LD information. Each sublist contains LD_variants, LD_matrix, and ref_panel from load_LD_matrix.
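
Because components are only present "if they exist", it can help to check what a result actually contains before using it. The sketch below uses a mock list with made-up values that only mimics the top-level layout described above; it is not output from the function.

```r
# Mock result: same top-level names as documented, values invented for illustration.
mock_res <- list(
  individual_data = list(
    residual_Y = list(cond1 = c(0.1, -0.2, 0.3)),
    maf        = c(0.30, 0.12)
  ),
  sumstat_data = NULL  # stands in for a call that loaded no summary statistics
)

# Names of the components present under each top-level element.
available <- lapply(mock_res, names)

# Guard before touching summary-statistics components.
has_sumstats <- !is.null(mock_res$sumstat_data$sumstats)
```

The same two checks can be applied unchanged to a real return value in place of mock_res.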

Loading individual-level data from multiple cohorts
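
A call for this case might look like the sketch below. Every file name and condition label is a placeholder, the region string is assumed to use the same chr:start-end form as association_window, and match_geno_pheno is read here as "phenotype i uses genotype file match_geno_pheno[i]", which is an interpretation of the argument description rather than something the text confirms. The call only runs if the function is already available in the session and the placeholder files exist.

```r
# Placeholder inputs for two cohorts; none of these files are real.
genotype_list  <- c("cohort1.bed", "cohort2.bed")
phenotype_list <- c("cohort1_condA.bed.gz", "cohort1_condB.bed.gz", "cohort2_condA.bed.gz")
covariate_list <- c("cohort1_condA.cov.gz", "cohort1_condB.cov.gz", "cohort2_condA.cov.gz")
conditions_list_individual <- c("cohort1_condA", "cohort1_condB", "cohort2_condA")

# Assumed meaning: entry i is the index in genotype_list used for phenotype i.
match_geno_pheno <- c(1, 1, 2)

# Only attempt the call when the function is attached and all inputs are present.
inputs_ready <- exists("load_multitask_regional_data", mode = "function") &&
  all(file.exists(c(genotype_list, phenotype_list, covariate_list)))

if (inputs_ready) {
  res <- load_multitask_regional_data(
    region = "chr1:1000000-2000000",
    genotype_list = genotype_list,
    phenotype_list = phenotype_list,
    covariate_list = covariate_list,
    conditions_list_individual = conditions_list_individual,
    match_geno_pheno = match_geno_pheno,
    maf_cutoff = 0.01
  )
  # Show the top-level components of the individual-level result.
  str(res$individual_data, max.level = 1)
}
```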

Loading summary statistics from multiple cohorts or data sets
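
A corresponding sketch for summary statistics follows. The file names and condition labels are placeholders, and the region string is assumed to be in chr:start-end form. It uses a single LD metadata file and leaves match_LD_sumstat at its default, because the exact format for pairing several LD panels with several sumstat files is not spelled out here. n_samples is left at 0 so the sample size is retrieved from the sumstat file, as documented.

```r
# Placeholder inputs for two summary-statistics data sets; none of these files are real.
sumstat_path_list      <- c("study1.sumstats.tsv.gz", "study2.sumstats.tsv.gz")
column_file_path_list  <- c("study1.columns.txt", "study2.columns.txt")
LD_meta_file_path_list <- c("ld_meta.tsv")
conditions_list_sumstat <- c("study1", "study2")

# Only attempt the call when the function is attached and all inputs are present.
sumstat_inputs_ready <- exists("load_multitask_regional_data", mode = "function") &&
  all(file.exists(c(sumstat_path_list, column_file_path_list, LD_meta_file_path_list)))

if (sumstat_inputs_ready) {
  res <- load_multitask_regional_data(
    region = "chr1:1000000-2000000",
    sumstat_path_list = sumstat_path_list,
    column_file_path_list = column_file_path_list,
    LD_meta_file_path_list = LD_meta_file_path_list,
    conditions_list_sumstat = conditions_list_sumstat,
    n_samples = 0,          # 0 means: take the sample size from the sumstat file
    comment_string = "#"
  )
  # Show the top-level components of the summary-statistics result.
  str(res$sumstat_data, max.level = 1)
}
```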