keep_significant_qtls_from_region

ukbppp_dl.pgwas.keep_significant_qtls_from_region(synapse_folder_id: str = 'syn51365308', download_location: str = './ukb_ppp_dl/data', res_location: str = './ukb_ppp_dl/results', login_kwargs: dict[str, Any] = {}, regenie_sep: str = ' ', regenie_columns: list[str] | None = None, csv_columns: list[str] | None = None, log10p_threshold: float | int = 7, create_log: bool | int = True, log_kwargs: dict[str, Any] = {}, protein_to_process: list[str] | None = None, verbose: bool | int = False, delete_downloaded_tar: bool = True, delete_chr_csv: bool = True, delete_tar_csv: bool = True, delete_tar_log: bool = True, delete_partial_logs: str | bool = 'current', delete_partial_outputs: str | bool = 'current') → tuple[DataFrame, dict]

Keep significant pGWAS QTLs for an entire Synapse region folder.

This is the main entry point for processing one ancestry group. It:

  1. Lists all protein tar files in the Synapse folder.

  2. For each protein, downloads the tar file from Synapse (if not already downloaded), extracts per-chromosome REGENIE files, filters for significant QTLs, and merges them into a protein-level CSV.

  3. Concatenates all protein-level results into one region-level CSV.

  4. Documents the process and results in logs and output text files.

  5. Optionally cleans up intermediate files.
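
The filter-and-merge part of step 2 can be illustrated with a minimal, standard-library sketch. It is not the library's implementation: the helper names filter_significant and merge_protein are hypothetical, the sample rows are made up, and treating the threshold as inclusive (LOG10P >= threshold) is an assumption.

```python
import csv
import io


def filter_significant(regenie_text, log10p_threshold=7.0, sep=" "):
    """Return the rows (as dicts) whose LOG10P reaches the threshold.

    Hypothetical helper; the inclusive comparison is an assumption.
    """
    reader = csv.DictReader(io.StringIO(regenie_text), delimiter=sep)
    kept = []
    for row in reader:
        try:
            log10p = float(row["LOG10P"])
        except ValueError:
            continue  # skip rows without a numeric LOG10P (e.g. "NA")
        if log10p >= log10p_threshold:
            kept.append(row)
    return kept


def merge_protein(protein_name, chromosome_texts, log10p_threshold=7.0):
    """Merge per-chromosome hits into one protein-level list of rows,
    with a leading "protein_name" column."""
    merged = []
    for text in chromosome_texts:
        for row in filter_significant(text, log10p_threshold):
            merged.append({"protein_name": protein_name, **row})
    return merged


# Made-up per-chromosome content in a space-separated, REGENIE-like layout.
chr1 = (
    "CHROM GENPOS ID BETA SE LOG10P\n"
    "1 1000 rs1 0.12 0.01 8.5\n"
    "1 2000 rs2 0.01 0.02 0.7\n"
)
chr2 = (
    "CHROM GENPOS ID BETA SE LOG10P\n"
    "2 500 rs3 -0.30 0.03 12.1\n"
)

rows = merge_protein("EXAMPLE_PROTEIN", [chr1, chr2])
print([row["ID"] for row in rows])  # ['rs1', 'rs3']
```

Only the two variants with LOG10P of at least 7 survive, and each kept row starts with the protein name, mirroring the layout described for the returned DataFrame.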

Partial results from a previous interrupted run are automatically detected and reused.

This function needs little free disk space because it processes one protein at a time and, when the corresponding delete_* options are enabled, deletes intermediate files as it goes.
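
The one-protein-at-a-time pattern, together with the reuse of earlier results, can be sketched as follows. This is an illustration only: process_proteins, the file names and the placeholder file contents are hypothetical, and the sketch uses the existence of a result file as a stand-in for the log-based detection of partial runs that the function actually performs.

```python
import tempfile
from pathlib import Path


def process_proteins(protein_names, work_dir, delete_downloaded_tar=True):
    """Process proteins one by one, reusing results that already exist.

    Hypothetical sketch: file naming and the "already done" check are
    assumptions, not the library's behaviour.
    """
    work_dir = Path(work_dir)
    reused, processed = [], []
    for name in protein_names:
        result_csv = work_dir / f"{name}_significant.csv"  # assumed naming
        if result_csv.exists():
            reused.append(name)  # result left by a previous run
            continue
        tar_path = work_dir / f"{name}.tar"
        tar_path.write_bytes(b"placeholder")  # stands in for the download
        result_csv.write_text("protein_name,ID,LOG10P\n")  # stands in for filtering
        if delete_downloaded_tar:
            tar_path.unlink()  # at most one tar file is on disk at any time
        processed.append(name)
    return reused, processed


with tempfile.TemporaryDirectory() as tmp:
    first = process_proteins(["P1", "P2"], tmp)
    # A second call over a longer list only processes what is missing.
    second = process_proteins(["P1", "P2", "P3"], tmp)
    leftover_tars = sorted(p.name for p in Path(tmp).glob("*.tar"))

print(first)          # ([], ['P1', 'P2'])
print(second)         # (['P1', 'P2'], ['P3'])
print(leftover_tars)  # []
```

Because each tar file is removed before the next one is fetched, peak disk usage stays close to the size of a single protein archive plus the accumulated results.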

This function creates logs and output text files for each run. This is important and has several benefits:

  1. It allows the function to be safely re-run multiple times without overwriting previous logs and outputs.

  2. It allows the function to automatically detect and reuse partial runs.

  3. It provides a detailed record of the parameters used to create the generated results. This can be crucial for reproducibility in science.

  4. It facilitates sanity checks and monitoring of the results.

It is highly recommended to keep the logs and output text files for future reference, at least at the region level (create_log=True or create_log=1).

Note that even if create_log=False, partial log files will still be created, but they will be deleted at the end of the function.

This function creates many files during a run, but most of them are intermediate files that can be deleted at the end of the run if specified. The main final outputs are the region-level CSV file containing all significant QTLs across all processed proteins and the region-level output file documenting the process and results.

Parameters:
  • synapse_folder_id (str, optional) – Synapse folder ID for the GWAS region, e.g. "syn51365308" for the Combined ancestry. Defaults to PGWAS_REGIONS["Combined"].

  • download_location (str, optional) – Local directory where tar files are downloaded.

  • res_location (str, optional) – Local directory where result CSV, log and output files are written.

  • login_kwargs (dict[str, Any], optional) –

    Keyword arguments forwarded to synapseclient.Synapse.login(). Common keys include "authToken" (str), "email" (str) and "profile" (str). See synapseclient.Synapse.login() for more information.

    Alternatively, you can configure your Synapse credentials in a .synapseConfig file and leave this argument empty to use the default login behaviour. You can find more information about .synapseConfig in the Synapse documentation.

  • regenie_sep (str, optional) – Field separator used in REGENIE files. Default is " " (space).

  • regenie_columns (list[str] or None, optional) – Subset of REGENIE columns to load, e.g. ["CHROM", "GENPOS", "ID", "BETA", "SE", "LOG10P"]. "LOG10P" must be included. None loads all columns: ["CHROM", "GENPOS", "ID", "ALLELE0", "ALLELE1", "A1FREQ", "INFO", "N", "TEST", "BETA", "SE", "CHISQ", "LOG10P", "EXTRA"].

  • csv_columns (list[str] or None, optional) – New column names to assign after loading. Must match the length of regenie_columns when both are provided. "LOG10P" must keep the same name. If None, no renaming is done and original REGENIE column names are used.

  • log10p_threshold (float or int, optional) – Minimum LOG10P value for significance. LOG10P is REGENIE's -log10(p-value), so the default of 7 corresponds to p = 1e-7. Default is 7.

  • create_log (bool or int, optional) – Logging verbosity. 0 disables; 1 creates region-level logs; 2 also creates protein-level logs; 3 also creates per-chromosome logs. Default is True (1).

  • log_kwargs (dict[str, Any], optional) – Keyword arguments forwarded to save_log(). Recognised key: "log_fname" (str) to override the default log filename. This parameter is not fully implemented yet; in most cases the default filenames are still used.

  • protein_to_process (list[str] or None, optional) – Whitelist of proteins to process. Accepts Synapse IDs (e.g. "syn52361344") or tar filenames (e.g. "ABCA2_Q9BZC7_OID30146_v1_Cardiometabolic_II.tar"). None processes all proteins.

  • verbose (bool or int, optional) – Verbosity level. 0/False is silent; higher values print more details at each processing step.

  • delete_downloaded_tar (bool, optional) – Delete each downloaded tar file after processing to save disk space. Default is True.

  • delete_chr_csv (bool, optional) – Delete per-chromosome significant-QTL CSV files after merging them into the protein-level CSV. Default is True.

  • delete_tar_csv (bool, optional) – Delete protein-level significant-QTL CSV files after merging them into the region-level CSV. Default is True.

  • delete_tar_log (bool, optional) – Delete protein-level log files after the region-level log is created. Default is True.

  • delete_partial_logs (str or bool, optional) – Controls deletion of partial region log files. "current" or True deletes only the log created by this run; "all" deletes all compatible partial logs; False keeps them all. Default is "current".

  • delete_partial_outputs (str or bool, optional) – Controls deletion of partial region output text files. Same options as delete_partial_logs. Default is "current".
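
As an illustration of the column-related constraints above, the following sketch checks a set of arguments before calling the function. The helper check_column_args is hypothetical and not part of ukbppp_dl, the renamed column names are arbitrary examples, and the final call is wrapped in a function that is not executed here because it needs the package and Synapse credentials.

```python
from typing import Optional


def check_column_args(
    regenie_columns: Optional[list],
    csv_columns: Optional[list],
) -> None:
    """Check the documented constraints on regenie_columns and csv_columns.

    Hypothetical helper written for this example.
    """
    if regenie_columns is not None and "LOG10P" not in regenie_columns:
        raise ValueError('regenie_columns must include "LOG10P"')
    if regenie_columns is not None and csv_columns is not None:
        if len(regenie_columns) != len(csv_columns):
            raise ValueError("csv_columns must match the length of regenie_columns")
        # LOG10P must keep its name after renaming.
        if csv_columns[regenie_columns.index("LOG10P")] != "LOG10P":
            raise ValueError('"LOG10P" must keep the same name in csv_columns')


kwargs = dict(
    synapse_folder_id="syn51365308",  # Combined ancestry, as in the signature
    regenie_columns=["CHROM", "GENPOS", "ID", "BETA", "SE", "LOG10P"],
    csv_columns=["chrom", "pos", "variant_id", "beta", "se", "LOG10P"],
    log10p_threshold=7,
    create_log=1,  # region-level logs only
    protein_to_process=["ABCA2_Q9BZC7_OID30146_v1_Cardiometabolic_II.tar"],
)
check_column_args(kwargs["regenie_columns"], kwargs["csv_columns"])


def run_region(call_kwargs):
    """Call the documented function; requires ukbppp_dl and Synapse access.

    login_kwargs is left at its default, so credentials are expected to
    come from a .synapseConfig file.
    """
    from ukbppp_dl.pgwas import keep_significant_qtls_from_region

    return keep_significant_qtls_from_region(**call_kwargs)


# Not executed here:
# all_significant_qtls, final_log_reg = run_region(kwargs)
```

Restricting protein_to_process to a single tar filename is a convenient way to try the settings on one protein before processing a whole region.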

Returns:

  • all_significant_qtls (pl.DataFrame) – DataFrame of all significant QTLs across all processed proteins. Contains a leading "protein_name" (str) column followed by the REGENIE (or renamed) columns.

  • final_log_reg (dict) – Final region log dictionary (merged from all partial logs).