keep_significant_qtls_from_region
- ukbppp_dl.pgwas.keep_significant_qtls_from_region(synapse_folder_id: str = 'syn51365308', download_location: str = './ukb_ppp_dl/data', res_location: str = './ukb_ppp_dl/results', login_kwargs: dict[str, Any] = {}, regenie_sep: str = ' ', regenie_columns: list[str] | None = None, csv_columns: list[str] | None = None, log10p_threshold: float | int = 7, create_log: bool | int = True, log_kwargs: dict[str, Any] = {}, protein_to_process: list[str] | None = None, verbose: bool | int = False, delete_downloaded_tar: bool = True, delete_chr_csv: bool = True, delete_tar_csv: bool = True, delete_tar_log: bool = True, delete_partial_logs: str | bool = 'current', delete_partial_outputs: str | bool = 'current') -> tuple[DataFrame, dict]
Keep significant pGWAS QTLs for an entire Synapse region folder.
This is the main entry point for processing one ancestry group. It:
Lists all protein tar files in the Synapse folder.
For each protein, downloads the tar file from Synapse (if not already downloaded), extracts per-chromosome REGENIE files, filters for significant QTLs, and merges them into a protein-level CSV.
Concatenates all protein-level results into one region-level CSV.
Documents the process and results in logs and output text files.
Optionally cleans up intermediate files.
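The filter-and-merge steps above can be illustrated with a minimal, standard-library sketch. It is not the package's implementation: the helper name filter_significant, the toy REGENIE text, the protein label "ABCA2" and the use of >= for the threshold comparison are all assumptions made for illustration.

```python
import csv
import io


def filter_significant(regenie_text, protein_name, log10p_threshold=7, sep=" "):
    """Hypothetical helper: keep rows whose LOG10P reaches the threshold.

    Each kept row gets a leading "protein_name" key, mirroring the layout
    described for the returned DataFrame. Using >= is an assumption.
    """
    reader = csv.DictReader(io.StringIO(regenie_text), delimiter=sep)
    kept = []
    for row in reader:
        if float(row["LOG10P"]) >= log10p_threshold:
            kept.append({"protein_name": protein_name, **row})
    return kept


# Toy per-chromosome inputs (made-up values, space-separated like REGENIE files).
chr1 = "CHROM GENPOS ID BETA SE LOG10P\n1 1000 rs1 0.12 0.01 9.5\n1 2000 rs2 0.01 0.02 0.3\n"
chr2 = "CHROM GENPOS ID BETA SE LOG10P\n2 500 rs3 -0.20 0.02 7.0\n"

# Merge the per-chromosome results into one protein-level list of rows.
protein_rows = []
for text in (chr1, chr2):
    protein_rows.extend(filter_significant(text, "ABCA2"))
```

In the real function the same idea is applied per protein tar file, and the protein-level results are then concatenated into the region-level CSV.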
Partial results from a previous interrupted run are automatically detected and reused.
This function needs little free disk space because it processes one protein at a time and, when the delete_* options are enabled, removes intermediate files as it goes.
This function creates logs and output text files for each run. This is important and has several benefits:
It allows the function to be safely re-run multiple times without overwriting previous logs and outputs.
It allows the function to automatically detect and reuse partial runs.
It provides a detailed record of the parameters used to create the generated results. This can be crucial for reproducibility in science.
It facilitates sanity checks and monitoring of the results.
It is highly recommended to keep the logs and output text files (at least at the region level, that is to say with create_log=True or create_log=1) for future reference.
Note that even if create_log=False, partial log files will still be created, but they will be deleted at the end of the function.
This function will create many files during the process, but most of them are intermediate files that can be deleted at the end of the run if specified. The main final outputs are the region-level CSV file containing all significant QTLs across all processed proteins, and the region-level output file documenting the process and results.
- Parameters:
synapse_folder_id (str, optional) – Synapse folder ID for the GWAS region, e.g. "syn51365308" for the Combined ancestry. Defaults to PGWAS_REGIONS["Combined"].
download_location (str, optional) – Local directory where tar files are downloaded.
res_location (str, optional) – Local directory where result CSV, log and output files are written.
login_kwargs (dict[str, Any], optional) – Keyword arguments forwarded to synapseclient.Synapse.login(). Common keys include "authToken" (str), "email" (str) and "profile" (str). See synapseclient.Synapse.login() for more information. Alternatively, you can configure your Synapse credentials in a .synapseConfig file and leave this argument empty to use the default login behaviour. You can find more information about .synapseConfig in the Synapse documentation.
regenie_sep (str, optional) – Field separator used in REGENIE files. Default is " " (space).
regenie_columns (list[str] or None, optional) – Subset of REGENIE columns to load, e.g. ["CHROM", "GENPOS", "ID", "BETA", "SE", "LOG10P"]. "LOG10P" must be included. None loads all columns: ["CHROM", "GENPOS", "ID", "ALLELE0", "ALLELE1", "A1FREQ", "INFO", "N", "TEST", "BETA", "SE", "CHISQ", "LOG10P", "EXTRA"].
csv_columns (list[str] or None, optional) – New column names to assign after loading. Must match the length of regenie_columns when both are provided. "LOG10P" must keep the same name. If None, no renaming is done and original REGENIE column names are used.
log10p_threshold (float or int, optional) – Minimum LOG10P value for significance. Default is 7.
create_log (bool or int, optional) – Logging verbosity. 0 disables; 1 creates region-level logs; 2 also creates protein-level logs; 3 also creates per-chromosome logs. Default is True (1).
log_kwargs (dict[str, Any], optional) – Keyword arguments forwarded to save_log(). Recognised key: "log_fname" (str) to override the default log filename. This parameter is not fully implemented yet; in most cases the default filenames are still used.
protein_to_process (list[str] or None, optional) – Whitelist of proteins to process. Accepts Synapse IDs (e.g. "syn52361344") or tar filenames (e.g. "ABCA2_Q9BZC7_OID30146_v1_Cardiometabolic_II.tar"). None processes all proteins.
verbose (bool or int, optional) – Verbosity level. 0/False is silent; higher values print more details at each processing step.
delete_downloaded_tar (bool, optional) – Delete each downloaded tar file after processing to save disk space. Default is True.
delete_chr_csv (bool, optional) – Delete per-chromosome significant-QTL CSV files after merging them into the protein-level CSV. Default is True.
delete_tar_csv (bool, optional) – Delete protein-level significant-QTL CSV files after merging them into the region-level CSV. Default is True.
delete_tar_log (bool, optional) – Delete protein-level log files after the region-level log is created. Default is True.
delete_partial_logs (str or bool, optional) – Controls deletion of partial region log files. "current" or True deletes only the log created by this run; "all" deletes all compatible partial logs; False keeps them all. Default is "current".
delete_partial_outputs (str or bool, optional) – Controls deletion of partial region output text files. Same options as delete_partial_logs. Default is "current".
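The constraints on regenie_columns and csv_columns listed above can be checked before starting a long run. The sketch below is a hypothetical pre-flight helper (check_column_args is not part of ukbppp_dl) written with the standard library only. It assumes that, when regenie_columns is None, csv_columns must match the full list of 14 REGENIE columns; the documentation does not state this case explicitly.

```python
# Full REGENIE column list, as given in the regenie_columns description.
REGENIE_ALL_COLUMNS = [
    "CHROM", "GENPOS", "ID", "ALLELE0", "ALLELE1", "A1FREQ", "INFO",
    "N", "TEST", "BETA", "SE", "CHISQ", "LOG10P", "EXTRA",
]


def check_column_args(regenie_columns=None, csv_columns=None):
    """Hypothetical check: return an {original: new} column-name mapping.

    Raises ValueError when one of the documented constraints is violated.
    """
    source = REGENIE_ALL_COLUMNS if regenie_columns is None else list(regenie_columns)
    if "LOG10P" not in source:
        raise ValueError('"LOG10P" must be included in regenie_columns')
    if csv_columns is None:
        # No renaming: every column keeps its REGENIE name.
        return {name: name for name in source}
    if len(csv_columns) != len(source):
        raise ValueError("csv_columns must have the same length as regenie_columns")
    mapping = dict(zip(source, csv_columns))
    if mapping["LOG10P"] != "LOG10P":
        raise ValueError('"LOG10P" must keep the same name in csv_columns')
    return mapping


# Example: load six columns and rename all of them except LOG10P.
mapping = check_column_args(
    ["CHROM", "GENPOS", "ID", "BETA", "SE", "LOG10P"],
    ["chrom", "pos", "rsid", "beta", "se", "LOG10P"],
)
```

The two lists that pass this check could then be supplied as the regenie_columns and csv_columns arguments.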
- Returns:
all_significant_qtls (pl.DataFrame) – DataFrame of all significant QTLs across all processed proteins. Contains a leading "protein_name" (str) column followed by the REGENIE (or renamed) columns.
final_log_reg (dict) – Final region log dictionary (merged from all partial logs).