This function loads a mixture of data sets for a specific region, including individual-level data (genotype, phenotype, covariate data) or summary statistics (sumstats, LD). It runs load_regional_univariate_data and load_rss_data multiple times for different datasets.
Source: R/file_utils.R
load_multitask_regional_data.Rd
Usage
load_multitask_regional_data(
region,
genotype_list = NULL,
phenotype_list = NULL,
covariate_list = NULL,
conditions_list_individual = NULL,
match_geno_pheno = NULL,
maf_cutoff = 0,
mac_cutoff = 0,
xvar_cutoff = 0,
imiss_cutoff = 0,
association_window = NULL,
extract_region_name = NULL,
region_name_col = NULL,
keep_indel = TRUE,
keep_samples = NULL,
keep_variants = NULL,
phenotype_header = 4,
scale_residuals = FALSE,
tabix_header = TRUE,
sumstat_path_list = NULL,
column_file_path_list = NULL,
LD_meta_file_path_list = NULL,
match_LD_sumstat = NULL,
conditions_list_sumstat = NULL,
n_samples = 0,
n_cases = 0,
n_controls = 0,
extract_sumstats_region_name = NULL,
sumstats_region_name_col = NULL,
comment_string = "#",
extract_coordinates = NULL
)
Arguments
- region
The region that tabix uses to subset the input dataset.
- genotype_list
A vector of PLINK bed files containing genotype data.
- phenotype_list
A vector of phenotype file names.
- covariate_list
A vector of covariate file names corresponding to the phenotype file vector.
- conditions_list_individual
A vector of strings representing different conditions or groups.
- match_geno_pheno
A vector of indices matching phenotypes to genotypes when there are multiple genotype PLINK files.
- maf_cutoff
Minimum minor allele frequency (MAF) cutoff. Default is 0.
- mac_cutoff
Minimum minor allele count (MAC) cutoff. Default is 0.
- xvar_cutoff
Minimum variance cutoff. Default is 0.
- imiss_cutoff
Maximum individual missingness cutoff. Default is 0.
- association_window
A string of chr:start-end for the association analysis window (cis or trans). If not provided, all genotype data will be loaded.
- extract_region_name
A list of vectors of strings (e.g., gene ID ENSG00000269699) to subset the information when there are multiple regions available. Default is NULL.
- region_name_col
Column name containing the region name. Default is NULL.
- keep_indel
Logical indicating whether to keep insertions/deletions (INDELs). Default is TRUE.
- keep_samples
A vector of sample names to keep. Default is NULL.
- phenotype_header
Number of rows to skip at the beginning of the transposed phenotype file (default is 4 for chr, start, end, and ID).
- scale_residuals
Logical indicating whether to scale residuals. Default is FALSE.
- tabix_header
Logical indicating whether the tabix file has a header. Default is TRUE.
- sumstat_path_list
A vector of file paths to the summary statistics.
- column_file_path_list
A vector of file paths to the column files used for mapping.
- LD_meta_file_path_list
A vector of paths to LD_metadata files. LD_metadata is a data frame specifying LD blocks with columns "chrom", "start", "end", and "path". "start" and "end" denote the positions of the LD blocks. "path" is the path of each LD block, optionally including bim file paths.
- match_LD_sumstat
A vector of indices matching sumstats to LD when there are multiple sumstat files.
- conditions_list_sumstat
A vector of strings representing different sumstats.
- n_samples
User-specified sample size. If unknown, set as 0 to retrieve from the sumstat file.
- n_cases
User-specified number of cases.
- n_controls
User-specified number of controls.
- extract_sumstats_region_name
User-specified gene/phenotype name used to further subset the phenotype data.
- sumstats_region_name_col
The column to filter on for extract_sumstats_region_name.
- comment_string
Comment character in the column_mapping file. Default is "#".
- extract_coordinates
Optional data frame with columns "chrom" and "pos" for specific coordinates extraction.
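The region string and the LD_metadata layout described above can be illustrated with a small base R sketch. It is not the package's implementation: `parse_region` and `overlapping_blocks` are hypothetical helpers, the file paths are made up, and the overlap rule (a block is kept if it touches the region at all) is an assumption made for illustration.

```r
# Hypothetical helper: split a "chr:start-end" string into its parts.
parse_region <- function(region) {
  m <- regmatches(region, regexec("^(chr)?([^:]+):([0-9]+)-([0-9]+)$", region))[[1]]
  if (length(m) == 0) stop("region must look like 'chr1:1000-2000'")
  list(chrom = m[3], start = as.numeric(m[4]), end = as.numeric(m[5]))
}

# Toy LD_metadata with the documented columns; the paths are placeholders.
ld_meta <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  start = c(0, 1500, 0),
  end   = c(1500, 3000, 2000),
  path  = c("ld/chr1_0_1500.ld", "ld/chr1_1500_3000.ld", "ld/chr2_0_2000.ld"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep LD blocks on the same chromosome that overlap the region.
overlapping_blocks <- function(ld_meta, region) {
  r <- parse_region(region)
  same_chrom <- sub("^chr", "", ld_meta$chrom) == r$chrom
  overlaps <- ld_meta$start <= r$end & ld_meta$end >= r$start
  ld_meta[same_chrom & overlaps, , drop = FALSE]
}

blocks <- overlapping_blocks(ld_meta, "chr1:1000-2000")

# A data frame in the shape documented for extract_coordinates (toy positions).
coords <- data.frame(chrom = "1", pos = c(1200, 1800))
```

Here `blocks` holds the two chr1 rows, because the region spans the boundary between them; how the real loader resolves chromosome prefixes and block boundaries may differ.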
Value
A list containing individual_data and sumstat_data. individual_data contains the following components, if they exist:
residual_Y: A list of residualized phenotype values (either a vector or a matrix).
residual_X: A list of residualized genotype matrices for each condition.
residual_Y_scalar: Scaling factor for residualized phenotype values.
residual_X_scalar: Scaling factor for residualized genotype values.
dropped_sample: A list of dropped samples for X, Y, and covariates.
covar: Covariate data.
Y: Original phenotype data.
X_data: Original genotype data.
X: Filtered genotype matrix.
maf: Minor allele frequency (MAF) for each variant.
chrom: Chromosome of the region.
grange: Genomic range of the region (start and end positions).
Y_coordinates: Phenotype coordinates if a region is specified.
sumstat_data contains the following components, if they exist:
sumstats: A list of summary statistics for the matched LD_info; each sublist contains sumstats, n, var_y from load_rss_data.
LD_info: A list of LD information; each sublist contains combined_LD_variants, combined_LD_matrix, ref_panel from load_LD_matrix.
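Because either half of the returned list may be absent, downstream code will usually check what is present before using it. The sketch below builds a mock object with made-up values that mimics the documented structure; it does not call load_multitask_regional_data, and the condition names `cond_A` and `cond_B` are hypothetical.

```r
# Mock return value: only individual-level data was supplied, so sumstat_data is NULL.
res <- list(
  individual_data = list(
    residual_Y = list(cond_A = c(0.1, -0.3, 0.2), cond_B = c(1.2, 0.4, -0.9)),
    residual_X = list(
      cond_A = matrix(c(0, 1, 2, 1, 0, 1), nrow = 3),
      cond_B = matrix(c(2, 1, 0, 0, 1, 1), nrow = 3)
    )
  ),
  sumstat_data = NULL
)

# is.null() covers both a missing element and an element explicitly set to NULL.
has_individual <- !is.null(res$individual_data)
has_sumstat <- !is.null(res$sumstat_data)

# Number of residualized phenotype values per condition.
n_per_condition <- vapply(res$individual_data$residual_Y, length, integer(1))
```

The same `is.null()` check applies to individual components such as `residual_Y_scalar` or `Y_coordinates`, which are only present in some configurations.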