ukbppp_dl.pgwas

Functions for processing pGWAS summary statistics files from UKB PPP.

Functions

find_partial_region_logs([example_log_dict, ...])

Find compatible pre-existing partial region log files.

keep_significant_qtls_from_chr_gz_file(...)

Keep significant QTLs from a gzipped summary-statistics file.

keep_significant_qtls_from_region([...])

Keep significant pGWAS QTLs for an entire Synapse region folder.

list_tar_files_in_region_folder([...])

List all protein tar files in the folder of given GWAS region.

merge_partial_output_files(output_fname, ...)

Concatenate partial text output files into a single output file.

merge_partial_region_logs(compatible_log_files)

Merge multiple compatible partial region log dictionaries into one.

merge_significant_qtls_from_csv(csv_fnames)

Concatenate multiple CSV files into one DataFrame.

process_one_chr_from_protein_tar_file(...[, ...])

Keep significant QTLs from one chromosome file of a protein tar.

process_one_tar_file(tar_fname[, ...])

Keep significant QTLs from one protein tar file.

ukbppp_dl.pgwas.list_tar_files_in_region_folder(synapse_id: str = 'syn51365308', login_kwargs: Dict[str, Any] = {}) → List[Tuple[str, str]]

List all protein tar files in the folder of given GWAS region.

Parameters:
  • synapse_id (str, optional) – Synapse folder ID for the GWAS region, e.g. "syn51365308" for the Combined ancestry group. Defaults to PGWAS_REGIONS["Combined"].

  • login_kwargs (Dict[str, Any], optional) –

    Keyword arguments forwarded to synapseclient.Synapse.login(). Common keys include "authToken" (str), "email" (str) and "profile" (str). See synapseclient.Synapse.login() for more information.

    Alternatively, you can configure your Synapse credentials in a .synapseConfig file and leave this argument empty to use the default login behaviour. You can find more information about .synapseConfig in the Synapse documentation.

Returns:

Sorted list of (synapse_id, tar_filename) pairs, one per protein tar file found in the folder. Example return value: [("syn52361344", "ABCA2_Q9BZC7_OID30146_v1_Cardiometabolic_II.tar"), ("syn51470065", "ABHD14B_Q96IU4_OID20921_v1_Neurology.tar.tar")].

Return type:

List[Tuple[str, str]]

ukbppp_dl.pgwas.keep_significant_qtls_from_chr_gz_file(chr_file_gz: str | PathLike[str] | IO[bytes], separator: str = ' ', columns: list[str] | None = None, new_columns: list[str] | None = None, log10p_threshold: float | int = 7, create_log: bool | int = False, log_kwargs: dict[str, Any] = {}, verbose: bool | int = False) → tuple[DataFrame, dict]

Keep significant QTLs from a gzipped summary-statistics file.

May create a JSON log file.

Parameters:
  • chr_file_gz (str or os.PathLike[str] or IO[bytes]) – Path to the .gz file, or a binary file-like object (e.g. extracted from a tarfile). To be passed directly to gzip.open.

  • separator (str, optional) – Field separator used in the REGENIE file. Default is " " (space).

  • columns (list[str] or None, optional) – Subset of column names to load from the REGENIE file, e.g. ["CHROM", "GENPOS", "ID", "LOG10P"]. None loads all columns. LOG10P column must be present.

  • new_columns (list[str] or None, optional) – Rename the loaded columns using this list of new names. Must have the same length as columns when provided. LOG10P column must be present and with the same name for filtering to work.

  • log10p_threshold (float or int, optional) – Minimum LOG10P value for a QTL to be considered significant. Default is 7, i.e. p <= 1e-7, since REGENIE reports LOG10P as -log10(p).

  • create_log (bool or int, optional) – Whether to save a JSON log file. 0/False disables logging; True or any positive integer enables it. Default is False.

  • log_kwargs (dict[str, Any], optional) – Keyword arguments forwarded to save_log(). Recognised key: "log_fname" (str) to override the default log filename.

  • verbose (bool or int, optional) – Verbosity level. 0/False is silent; higher values print more.

Returns:

  • summary_stats_significant (pl.DataFrame) – DataFrame containing only rows where LOG10P >= log10p_threshold.

  • log (dict) – Dictionary with keys: "log_filename" (str | None), "log10p_threshold" (float), "n_tot_qtls" (int), "n_kept_qtls" (int), "source_chr_file" (str).
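The filtering step described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the package's implementation (which returns a polars DataFrame): the helper name filter_significant_rows and the sample rows are hypothetical, and only three of the documented log keys are reproduced.

```python
import csv
import gzip
import io


def filter_significant_rows(chr_file_gz, separator=" ", log10p_threshold=7):
    """Hypothetical helper: keep rows with LOG10P >= log10p_threshold."""
    # gzip.open accepts either a path or a binary file-like object.
    with gzip.open(chr_file_gz, mode="rt", newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter=separator))
    kept = [row for row in rows if float(row["LOG10P"]) >= log10p_threshold]
    log = {
        "log10p_threshold": log10p_threshold,
        "n_tot_qtls": len(rows),
        "n_kept_qtls": len(kept),
    }
    return kept, log


# Tiny in-memory REGENIE-like file with made-up values.
text = "CHROM GENPOS ID LOG10P\n1 100 rs1 3.2\n1 200 rs2 7.0\n1 300 rs3 12.5\n"
buffer = io.BytesIO(gzip.compress(text.encode()))
kept_rows, log = filter_significant_rows(buffer)
```

The comparison is inclusive, matching the documented LOG10P >= log10p_threshold rule, so a row sitting exactly on the threshold is kept.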

ukbppp_dl.pgwas.process_one_chr_from_protein_tar_file(protein_tar_file: TarFile, chr_gz_fname: str, res_location: str = './ukb_ppp_dl/results', separator: str = ' ', columns: list[str] | None = None, new_columns: list[str] | None = None, log10p_threshold: float | int = 7, create_log: bool | int = False, log_kwargs: dict[str, Any] = {}, verbose: bool | int = False) → tuple[str, dict]

Keep significant QTLs from one chromosome file of a protein tar.

May create a JSON log file and will create a CSV file.

Parameters:
  • protein_tar_file (tarfile.TarFile) – Open TarFile object for the protein’s tar archive.

  • chr_gz_fname (str) – Name or partial path of the chromosome .gz file inside the tar, e.g. "chr1.regenie.gz". A substring match is used to locate the actual entry.

  • res_location (str, optional) – Directory where the result CSV (and log) will be written.

  • separator (str, optional) – Field separator used in the REGENIE file. Default is " " (space).

  • columns (list[str] or None, optional) – Subset of column names to load from the REGENIE file, e.g. ["CHROM", "GENPOS", "ID", "LOG10P"]. None loads all columns. LOG10P column must be present.

  • new_columns (list[str] or None, optional) – Rename the loaded columns using this list of new names. Must have the same length as columns when provided. LOG10P column must be present and with the same name for filtering to work.

  • log10p_threshold (float or int, optional) – Minimum LOG10P threshold for significance. Default is 7.

  • create_log (bool or int, optional) – Whether to save a JSON log for this chromosome file. 0/False disables logging; 1/True creates the log. Default is False.

  • log_kwargs (dict[str, Any], optional) – Keyword arguments forwarded to save_log(). Recognised key: "log_fname" (str) to override the default log filename.

  • verbose (bool or int, optional) – Verbosity level.

Returns:

  • res_csv_fname (str) – Path to the CSV file containing significant QTLs for this chromosome.

  • log_chr (dict) – Log dictionary with key "skipped" (bool), and when not skipped: "log10p_threshold" (float), "n_tot_qtls" (int), "n_kept_qtls" (int), "source_chr_file" (str).
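The substring lookup of chr_gz_fname inside the tar archive can be illustrated as follows. This is a sketch only: find_member_by_substring and the member names are hypothetical, and the package may resolve ambiguous matches differently.

```python
import io
import tarfile


def find_member_by_substring(tar, chr_gz_fname):
    """Hypothetical helper: first member whose name contains chr_gz_fname."""
    for member in tar.getmembers():
        if chr_gz_fname in member.name:
            return member
    return None


# Build a small in-memory tar with two made-up entries.
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w") as tar:
    for name in ("protein/chr1_PROT.regenie.gz", "protein/chr2_PROT.regenie.gz"):
        payload = b"placeholder"
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

raw.seek(0)
with tarfile.open(fileobj=raw, mode="r") as tar:
    member = find_member_by_substring(tar, "chr2_")
    member_name = member.name
    # extractfile returns a binary file-like object, which is the kind of
    # input keep_significant_qtls_from_chr_gz_file is documented to accept.
    content = tar.extractfile(member).read()
```

With plain substring matching, a short pattern such as "chr1" would also match "chr10" to "chr19", so a pattern that includes the delimiter following the chromosome number is safer.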

ukbppp_dl.pgwas.merge_significant_qtls_from_csv(csv_fnames: list[str], output_fname: str | None = None, create_log: bool | int = False, log_kwargs: dict[str, Any] = {}, delete_csv: bool = False, verbose: bool | int = False) → tuple[DataFrame, dict]

Concatenate multiple CSV files into one DataFrame.

May create a JSON log file, may create a CSV file and can optionally delete the input CSV files after merging.

Parameters:
  • csv_fnames (list[str]) – Paths to the per-chromosome CSV files to merge. Must be non-empty. Each file must contain at least a LOG10P column. Files must be consistent with each other (e.g. same columns). A file may contain only a header and no data rows.

  • output_fname (str or None, optional) – If provided, the merged DataFrame is written to this CSV path.

  • create_log (bool or int, optional) – If truthy and output_fname is set, save a JSON log alongside the merged CSV.

  • log_kwargs (dict[str, Any], optional) – Keyword arguments forwarded to save_log(). Recognised key: "log_fname" (str) to override the default log filename.

  • delete_csv (bool, optional) – If True, delete the input csv_fnames after merging. Default is False.

  • verbose (bool or int, optional) – Verbosity level.

Returns:

  • all_significant_qtls (pl.DataFrame) – Concatenated DataFrame of all significant QTLs from every input file. Empty (but with a header) if no significant QTL was found.

  • log (dict) – Dictionary with keys: "log_filename" (str | None), "merged_csv_filename" (str | None), "n_csv_merged" (int), "n_kept_qtls" (int), "n_kept_qtls_per_csv" (dict[str, int], mapping each CSV path to its row count), "min_log10p" (float | None).
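The merge semantics (shared header, header-only files allowed, per-file row counts, minimum LOG10P) can be sketched with the csv module. This is an illustrative stand-in, not the package's code: merge_csv_files is a hypothetical name, and it returns plain lists rather than a polars DataFrame.

```python
import csv
import os
import tempfile


def merge_csv_files(csv_fnames):
    """Hypothetical helper: concatenate CSVs that share the same header."""
    if not csv_fnames:
        raise ValueError("csv_fnames must be non-empty")
    header, all_rows, n_per_csv = None, [], {}
    for fname in csv_fnames:
        with open(fname, newline="") as handle:
            reader = csv.reader(handle)
            file_header = next(reader)
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError(f"Inconsistent columns in {fname}")
            rows = list(reader)
        n_per_csv[fname] = len(rows)
        all_rows.extend(rows)
    idx = header.index("LOG10P")
    log = {
        "n_csv_merged": len(csv_fnames),
        "n_kept_qtls": len(all_rows),
        "n_kept_qtls_per_csv": n_per_csv,
        # None when every input file is header-only.
        "min_log10p": min((float(r[idx]) for r in all_rows), default=None),
    }
    return header, all_rows, log


# Two made-up per-chromosome files; the second has a header but no rows.
tmp_dir = tempfile.mkdtemp()
contents = {"chr1.csv": "ID,LOG10P\nrs1,8.5\nrs2,7.1\n", "chr2.csv": "ID,LOG10P\n"}
paths = []
for name, text in contents.items():
    path = os.path.join(tmp_dir, name)
    with open(path, "w", newline="") as handle:
        handle.write(text)
    paths.append(path)

header, rows, log = merge_csv_files(paths)
```

Checking min_log10p against the threshold used upstream is a quick sanity check that no sub-threshold row slipped into the merged result.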

ukbppp_dl.pgwas.process_one_tar_file(tar_fname: str, res_location: str = './ukb_ppp_dl/results', separator: str = ' ', columns: list[str] | None = None, new_columns: list[str] | None = None, log10p_threshold: float | int = 7, create_log: bool | int = True, log_kwargs: dict[str, Any] = {}, verbose: bool | int = False) → tuple[list[str], dict]

Keep significant QTLs from one protein tar file.

May create a JSON log file for the protein and may create JSON log files for each chromosome. Will create a CSV file for each chromosome.

Parameters:
  • tar_fname (str) – Path to the protein .tar file, e.g. "ABCA2_Q9BZC7_OID30146_v1_Cardiometabolic_II.tar".

  • res_location (str, optional) – Directory where result CSVs and logs are written.

  • separator (str, optional) – Field separator used in REGENIE files. Default is " " (space).

  • columns (list[str] or None, optional) – Subset of column names to load from the REGENIE file, e.g. ["CHROM", "GENPOS", "ID", "LOG10P"]. None loads all columns. LOG10P column must be present.

  • new_columns (list[str] or None, optional) – Rename the loaded columns using this list of new names. Must have the same length as columns when provided. LOG10P column must be present and with the same name for filtering to work.

  • log10p_threshold (float or int, optional) – Minimum LOG10P threshold for significance. Default is 7.

  • create_log (bool or int, optional) – Logging verbosity. 0 disables; 1 creates a tar-level JSON log; 2 also creates per-chromosome logs. Default is True (1).

  • log_kwargs (dict[str, Any], optional) – Keyword arguments forwarded to save_log(). Recognised key: "log_fname" (str) to override the default log filename.

  • verbose (bool or int, optional) – Verbosity level.

Returns:

  • all_csv_fnames (list[str]) – Paths to the per-chromosome CSV files of significant QTLs.

  • log_tar (dict) – Log dictionary with keys: "log_filename" (str | None), "tar_fname" (str), "protein_name" (str), "n_chr_files" (int), "skipped_chr_files" (list[str]), "all_csv_fnames" (list[str]), "n_processed_qtls" (int), "log10p_threshold" (float), "regenie_columns" (list[str] | None), "csv_columns" (list[str] | None).

ukbppp_dl.pgwas.find_partial_region_logs(example_log_dict: dict | None = None, synapse_folder_id: str = 'syn51365308', res_location: str = './ukb_ppp_dl/results', regenie_columns: list[str] | None = None, csv_columns: list[str] | None = None, log10p_threshold: float | int = 7, all_tar_files: list[str] = [], verbose: bool | int = False) → dict[str, dict]

Find compatible pre-existing partial region log files.

Parameters:
  • example_log_dict (dict or None, optional) – If provided, the function reads synapse_folder_id, regenie_columns, csv_columns, log10p_threshold, and all_tar_files from this dict instead of the individual keyword arguments. Expected keys match those written by keep_significant_qtls_from_region().

  • synapse_folder_id (str, optional) – Synapse folder ID used to filter log filenames, e.g. "syn51365308".

  • res_location (str, optional) – Directory to search for *.json partial log files.

  • regenie_columns (list[str] or None, optional) – Expected value of the "regenie_columns" key in matching logs.

  • csv_columns (list[str] or None, optional) – Expected value of the "csv_columns" key in matching logs.

  • log10p_threshold (float or int, optional) – Expected threshold. Logs with a different value are excluded.

  • all_tar_files (list[str], optional) – Expected list of tar filenames, e.g. ["ABCA2_Q9BZC7_OID30146_v1_Cardiometabolic_II.tar", ...].

  • verbose (bool or int, optional) – Verbosity level.

Returns:

Mapping from log file path (str) to the parsed log dictionary (dict) for each compatible partial log found.

Return type:

dict[str, dict]
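One way to picture the compatibility search is the sketch below. It assumes that each partial log is a JSON file whose name contains the Synapse folder ID and whose keys are named after the parameters listed above; the helper find_compatible_logs and the file names are hypothetical, and the package's actual criteria may be stricter.

```python
import glob
import json
import os
import tempfile

# Keys assumed to be compared, named after the documented parameters.
COMPARED_KEYS = (
    "synapse_folder_id",
    "regenie_columns",
    "csv_columns",
    "log10p_threshold",
    "all_tar_files",
)


def find_compatible_logs(res_location, reference_log):
    """Hypothetical helper: map log path -> log dict for matching logs."""
    compatible = {}
    for path in sorted(glob.glob(os.path.join(res_location, "*.json"))):
        # Cheap filename filter before parsing the JSON content.
        if reference_log["synapse_folder_id"] not in os.path.basename(path):
            continue
        with open(path) as handle:
            candidate = json.load(handle)
        if all(candidate.get(k) == reference_log.get(k) for k in COMPARED_KEYS):
            compatible[path] = candidate
    return compatible


reference = {
    "synapse_folder_id": "syn51365308",
    "regenie_columns": None,
    "csv_columns": None,
    "log10p_threshold": 7,
    "all_tar_files": ["ABCA2_Q9BZC7_OID30146_v1_Cardiometabolic_II.tar"],
}

# Two made-up partial logs; only the first uses the reference threshold.
res_dir = tempfile.mkdtemp()
for name, threshold in (("PART-syn51365308-log-a.json", 7), ("PART-syn51365308-log-b.json", 5)):
    with open(os.path.join(res_dir, name), "w") as handle:
        json.dump({**reference, "log10p_threshold": threshold}, handle)

found = find_compatible_logs(res_dir, reference)
```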

ukbppp_dl.pgwas.merge_partial_region_logs(compatible_log_files: dict[str, dict]) → dict

Merge multiple compatible partial region log dictionaries into one.

To be compatible, log files must have been created with the same parameters and created by running the function keep_significant_qtls_from_region(). In particular, they must share identical log10p_threshold, regenie_columns, csv_columns, and synapse_folder_id values.

This function cannot be used with already merged log files.

Calling this function directly is not advised unless you are certain that all log files were created with the same parameters and that the merged log will be used consistently.

This function is mainly intended to be used as a helper function for the keep_significant_qtls_from_region() function.

Parameters:

compatible_log_files (dict[str, dict]) – Mapping from log file path to log dictionary, as returned by find_partial_region_logs(). All logs must share identical log10p_threshold, regenie_columns, csv_columns, and synapse_folder_id values.

Returns:

A single merged log dictionary. Returns an empty {} when compatible_log_files is empty.

Return type:

dict

ukbppp_dl.pgwas.merge_partial_output_files(output_fname: str, partial_output_filenames: list[str]) → None

Concatenate partial text output files into a single output file.

This function will create a new text file.

Parameters:
  • output_fname (str) – Path to the final merged output text file to create, e.g. "syn51365308-output_text-2026-05-12--10:00:00.txt".

  • partial_output_filenames (list[str]) – Paths to the partial output files to concatenate, e.g. ["PART-syn51365308-output_text-2026-05-12--09:00:00.txt", ...]. Files are sorted by filename before concatenation so that order matches creation order. Each file’s content is wrapped with a separator banner showing the source filename.
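The sort-then-concatenate behaviour can be sketched as below. The banner layout is an assumption (the exact separator text used by the package is not specified here), and concatenate_with_banners and the file names are hypothetical.

```python
import os
import tempfile


def concatenate_with_banners(output_fname, partial_output_filenames):
    """Hypothetical helper: merge text files in filename order."""
    with open(output_fname, "w") as out:
        # Timestamped names sort lexicographically in creation order.
        for fname in sorted(partial_output_filenames):
            rule = "=" * 20
            out.write(f"{rule} {os.path.basename(fname)} {rule}\n")
            with open(fname) as part:
                out.write(part.read())
            out.write("\n")


# Two made-up partial outputs, passed in reverse creation order.
tmp_dir = tempfile.mkdtemp()
late = os.path.join(tmp_dir, "PART-syn51365308-output_text-2026-05-12--10-00-00.txt")
early = os.path.join(tmp_dir, "PART-syn51365308-output_text-2026-05-12--09-00-00.txt")
with open(late, "w") as handle:
    handle.write("second run\n")
with open(early, "w") as handle:
    handle.write("first run\n")

merged_path = os.path.join(tmp_dir, "syn51365308-output_text-merged.txt")
concatenate_with_banners(merged_path, [late, early])
with open(merged_path) as handle:
    merged_text = handle.read()
```

Sorting only recovers creation order when the timestamps in the filenames are zero-padded and share one format, as in the examples above.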

ukbppp_dl.pgwas.keep_significant_qtls_from_region(synapse_folder_id: str = 'syn51365308', download_location: str = './ukb_ppp_dl/data', res_location: str = './ukb_ppp_dl/results', login_kwargs: dict[str, Any] = {}, regenie_sep: str = ' ', regenie_columns: list[str] | None = None, csv_columns: list[str] | None = None, log10p_threshold: float | int = 7, create_log: bool | int = True, log_kwargs: dict[str, Any] = {}, protein_to_process: list[str] | None = None, verbose: bool | int = False, delete_downloaded_tar: bool = True, delete_chr_csv: bool = True, delete_tar_csv: bool = True, delete_tar_log: bool = True, delete_partial_logs: str | bool = 'current', delete_partial_outputs: str | bool = 'current') → tuple[DataFrame, dict]

Keep significant pGWAS QTLs for an entire Synapse region folder.

This is the main entry point for processing one ancestry group. It:

  1. Lists all protein tar files in the Synapse folder.

  2. For each protein, downloads the tar file from Synapse (if not already downloaded), extracts per-chromosome REGENIE files, filters for significant QTLs, and merges them into a protein-level CSV.

  3. Concatenates all protein-level results into one region-level CSV.

  4. Documents the process and results in logs and output text files.

  5. Optionally cleans up intermediate files.

Partial results from a previous interrupted run are automatically detected and reused.

This function does not require much free disk space because it processes one protein at a time and, if the corresponding delete_* options are enabled, deletes intermediate files along the way.

This function creates logs and output text files for each run. This is important and has several benefits:

  1. It allows the function to be safely re-run multiple times without overwriting previous logs and outputs.

  2. It allows the function to automatically detect and reuse partial runs.

  3. It provides a detailed record of the parameters used to create the generated results. This can be crucial for reproducibility in science.

  4. It facilitates sanity checks and monitoring of the results.

It is highly recommended to keep the logs and output text files, at least at the region level (create_log=True or create_log=1), for future reference.

Note that even if create_log=False, partial log files will still be created, but they will be deleted at the end of the function.

This function will create many files during the process, but most of them are intermediate files that can be deleted at the end of the run if specified. The main final output is the region-level CSV file containing all significant QTLs across all processed proteins, and the region-level output file documenting the process and results.

Parameters:
  • synapse_folder_id (str, optional) – Synapse folder ID for the GWAS region, e.g. "syn51365308" for the Combined ancestry. Defaults to PGWAS_REGIONS["Combined"].

  • download_location (str, optional) – Local directory where tar files are downloaded.

  • res_location (str, optional) – Local directory where result CSV, log and output files are written.

  • login_kwargs (Dict[str, Any], optional) –

    Keyword arguments forwarded to synapseclient.Synapse.login(). Common keys include "authToken" (str), "email" (str) and "profile" (str). See synapseclient.Synapse.login() for more information.

    Alternatively, you can configure your Synapse credentials in a .synapseConfig file and leave this argument empty to use the default login behaviour. You can find more information about .synapseConfig in the Synapse documentation.

  • regenie_sep (str, optional) – Field separator used in REGENIE files. Default is " " (space).

  • regenie_columns (list[str] or None, optional) – Subset of REGENIE columns to load, e.g. ["CHROM", "GENPOS", "ID", "BETA", "SE", "LOG10P"]. "LOG10P" must be included. None loads all columns: ["CHROM", "GENPOS", "ID", "ALLELE0", "ALLELE1", "A1FREQ", "INFO", "N", "TEST", "BETA", "SE", "CHISQ", "LOG10P", "EXTRA"].

  • csv_columns (list[str] or None, optional) – New column names to assign after loading. Must match the length of regenie_columns when both are provided. "LOG10P" must keep the same name. If None, no renaming is done and original REGENIE column names are used.

  • log10p_threshold (float or int, optional) – Minimum LOG10P value for significance. Default is 7.

  • create_log (bool or int, optional) – Logging verbosity. 0 disables; 1 creates region-level logs; 2 also creates protein-level logs; 3 also creates per-chromosome logs. Default is True (1).

  • log_kwargs (dict[str, Any], optional) – Keyword arguments forwarded to save_log(). Recognised key: "log_fname" (str) to override the default log filename. This parameter is not fully implemented yet; in most cases the default filenames are used.

  • protein_to_process (list[str] or None, optional) – Whitelist of proteins to process. Accepts Synapse IDs (e.g. "syn52361344") or tar filenames (e.g. "ABCA2_Q9BZC7_OID30146_v1_Cardiometabolic_II.tar"). None processes all proteins.

  • verbose (bool or int, optional) – Verbosity level. 0/False is silent; higher values print more details at each processing step.

  • delete_downloaded_tar (bool, optional) – Delete each downloaded tar file after processing to save disk space. Default is True.

  • delete_chr_csv (bool, optional) – Delete per-chromosome significant-QTL CSV files after merging them into the protein-level CSV. Default is True.

  • delete_tar_csv (bool, optional) – Delete protein-level significant-QTL CSV files after merging them into the region-level CSV. Default is True.

  • delete_tar_log (bool, optional) – Delete protein-level log files after the region-level log is created. Default is True.

  • delete_partial_logs (str or bool, optional) – Controls deletion of partial region log files. "current" or True deletes only the log created by this run; "all" deletes all compatible partial logs; False keeps them all. Default is "current".

  • delete_partial_outputs (str or bool, optional) – Controls deletion of partial region output text files. Same options as delete_partial_logs. Default is "current".

Returns:

  • all_significant_qtls (pl.DataFrame) – DataFrame of all significant QTLs across all processed proteins. Contains a leading "protein_name" (str) column followed by the REGENIE (or renamed) columns.

  • final_log_reg (dict) – Final region log dictionary (merged from all partial logs).
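The protein_to_process whitelist accepts either Synapse IDs or tar filenames. The sketch below shows one way such a filter could be applied to the (synapse_id, tar_filename) pairs returned by list_tar_files_in_region_folder(). The helper select_proteins is hypothetical and is not part of the package.

```python
def select_proteins(all_tar_files, protein_to_process=None):
    """Hypothetical helper: keep pairs named by Synapse ID or tar filename."""
    if protein_to_process is None:
        # None means "process every protein in the folder".
        return list(all_tar_files)
    wanted = set(protein_to_process)
    return [
        (syn_id, fname)
        for syn_id, fname in all_tar_files
        if syn_id in wanted or fname in wanted
    ]


# Pairs taken from the example return value documented above.
pairs = [
    ("syn52361344", "ABCA2_Q9BZC7_OID30146_v1_Cardiometabolic_II.tar"),
    ("syn51470065", "ABHD14B_Q96IU4_OID20921_v1_Neurology.tar.tar"),
]
by_id = select_proteins(pairs, ["syn52361344"])
by_name = select_proteins(pairs, ["ABHD14B_Q96IU4_OID20921_v1_Neurology.tar.tar"])
everything = select_proteins(pairs)
```

Restricting a first run to one or two proteins this way is a cheap way to check credentials, paths and column settings before processing a whole region.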