ukbppp_dl.pgwas
Functions for processing pGWAS summary statistics files from UKB PPP.
Functions
find_partial_region_logs() | Find compatible pre-existing partial region log files.
keep_significant_qtls_from_chr_gz_file() | Keep significant QTLs from a gzipped summary-statistics file.
keep_significant_qtls_from_region() | Keep significant pGWAS QTLs for an entire Synapse region folder.
list_tar_files_in_region_folder() | List all protein tar files in the folder of given GWAS region.
merge_partial_output_files() | Concatenate partial text output files into a single output file.
merge_partial_region_logs() | Merge multiple compatible partial region log dictionaries into one.
merge_significant_qtls_from_csv() | Concatenate multiple CSV files into one DataFrame.
process_one_chr_from_protein_tar_file() | Keep significant QTLs from one chromosome file of a protein tar.
process_one_tar_file() | Keep significant QTLs from one protein tar file.
- ukbppp_dl.pgwas.list_tar_files_in_region_folder(synapse_id: str = 'syn51365308', login_kwargs: Dict[str, Any] = {}) → List[Tuple[str, str]]
List all protein tar files in the folder of given GWAS region.
- Parameters:
synapse_id (str, optional) – Synapse folder ID for the GWAS region, e.g. "syn51365308" for the Combined ancestry group. Defaults to PGWAS_REGIONS["Combined"].
login_kwargs (Dict[str, Any], optional) – Keyword arguments forwarded to synapseclient.Synapse.login(). Common keys include "authToken" (str), "email" (str) and "profile" (str). See synapseclient.Synapse.login() for more information. Alternatively, you can configure your Synapse credentials in a .synapseConfig file and leave this argument empty to use the default login behaviour. You can find more information about .synapseConfig in the Synapse documentation.
- Returns:
Sorted list of (synapse_id, tar_filename) pairs, one per protein tar file found in the folder. Example: [("syn52361344", "ABCA2_Q9BZC7_OID30146_v1_Cardiometabolic_II.tar"), ("syn51470065", "ABHD14B_Q96IU4_OID20921_v1_Neurology.tar.tar")].
- Return type:
List[Tuple[str, str]]
- ukbppp_dl.pgwas.keep_significant_qtls_from_chr_gz_file(chr_file_gz: str | PathLike[str] | IO[bytes], separator: str = ' ', columns: list[str] | None = None, new_columns: list[str] | None = None, log10p_threshold: float | int = 7, create_log: bool | int = False, log_kwargs: dict[str, Any] = {}, verbose: bool | int = False) → tuple[DataFrame, dict]
Keep significant QTLs from a gzipped summary-statistics file.
May create a JSON log file.
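As an illustration only, the filtering step can be sketched with the standard library. This is not the package's implementation (the documented return type is a pl.DataFrame); the helper name filter_significant_rows is hypothetical, and the sketch assumes the file has a header row that contains a LOG10P column.

```python
import csv
import gzip


def filter_significant_rows(gz_source, separator=" ", log10p_threshold=7.0):
    """Return (header, kept_rows, n_total) for rows with LOG10P >= threshold.

    gz_source may be a path or a binary file-like object, because
    gzip.open accepts both.
    """
    with gzip.open(gz_source, mode="rt", newline="") as handle:
        reader = csv.reader(handle, delimiter=separator)
        header = next(reader)
        # Raises ValueError if the LOG10P column is absent.
        log10p_index = header.index("LOG10P")
        kept_rows = []
        n_total = 0
        for row in reader:
            n_total += 1
            if float(row[log10p_index]) >= log10p_threshold:
                kept_rows.append(row)
    return header, kept_rows, n_total
```

The returned counts correspond in spirit to the "n_tot_qtls" and "n_kept_qtls" entries of the log dictionary described in this entry.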
- Parameters:
chr_file_gz (str or os.PathLike[str] or IO[bytes]) – Path to the .gz file, or a binary file-like object (e.g. extracted from a tarfile). It is passed directly to gzip.open.
separator (str, optional) – Field separator used in the REGENIE file. Default is " " (space).
columns (list[str] or None, optional) – Subset of column names to load from the REGENIE file, e.g. ["CHROM", "GENPOS", "ID", "LOG10P"]. None loads all columns. The LOG10P column must be present.
new_columns (list[str] or None, optional) – New names for the loaded columns. Must have the same length as columns when provided. The LOG10P column must be present and keep the same name for filtering to work.
log10p_threshold (float or int, optional) – Minimum LOG10P value for a QTL to be considered significant. Default is 7 (that is, p <= 1e-7).
create_log (bool or int, optional) – Whether to save a JSON log file. 0/False disables logging; True or any positive integer enables it. Default is False.
log_kwargs (dict[str, Any], optional) – Keyword arguments forwarded to save_log(). Recognised key: "log_fname" (str) to override the default log filename.
verbose (bool or int, optional) – Verbosity level. 0/False is silent; higher values print more.
- Returns:
summary_stats_significant (pl.DataFrame) – DataFrame containing only rows where LOG10P >= log10p_threshold.
log (dict) – Dictionary with keys: "log_filename" (str | None), "log10p_threshold" (float), "n_tot_qtls" (int), "n_kept_qtls" (int), "source_chr_file" (str).
- ukbppp_dl.pgwas.process_one_chr_from_protein_tar_file(protein_tar_file: TarFile, chr_gz_fname: str, res_location: str = './ukb_ppp_dl/results', separator: str = ' ', columns: list[str] | None = None, new_columns: list[str] | None = None, log10p_threshold: float | int = 7, create_log: bool | int = False, log_kwargs: dict[str, Any] = {}, verbose: bool | int = False) → tuple[str, dict]
Keep significant QTLs from one chromosome file of a protein tar.
May create a JSON log file and will create a CSV file.
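The chromosome file is located inside the tar by a substring match on chr_gz_fname. A minimal sketch of such a lookup with the standard tarfile module follows; the helper names find_member_by_substring and open_member are hypothetical, and the package's own matching rule may differ in detail.

```python
import tarfile


def find_member_by_substring(tar, name_fragment):
    """Return the first regular-file member whose name contains name_fragment."""
    for member in tar.getmembers():
        if member.isfile() and name_fragment in member.name:
            return member
    raise KeyError(f"no tar member matches {name_fragment!r}")


def open_member(tar, name_fragment):
    """Return a binary file-like object for the matching member."""
    member = find_member_by_substring(tar, name_fragment)
    # extractfile returns a readable binary object for regular files,
    # which can be handed to gzip.open without writing to disk.
    return tar.extractfile(member)
```

With plain substring matching, a fragment such as "chr1" also matches names containing "chr10", so a more specific fragment is the safer choice in this sketch.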
- Parameters:
protein_tar_file (tarfile.TarFile) – Open TarFile object for the protein’s tar archive.
chr_gz_fname (str) – Name or partial path of the chromosome .gz file inside the tar, e.g. "chr1.regenie.gz". A substring match is used to locate the actual entry.
res_location (str, optional) – Directory where the result CSV (and log) will be written.
separator (str, optional) – Field separator used in the REGENIE file. Default is " " (space).
columns (list[str] or None, optional) – Subset of column names to load from the REGENIE file, e.g. ["CHROM", "GENPOS", "ID", "LOG10P"]. None loads all columns. The LOG10P column must be present.
new_columns (list[str] or None, optional) – New names for the loaded columns. Must have the same length as columns when provided. The LOG10P column must be present and keep the same name for filtering to work.
log10p_threshold (float or int, optional) – Minimum LOG10P threshold for significance. Default is 7.
create_log (bool or int, optional) – Verbosity of per-chromosome logging. 0 disables; 1 creates a JSON log for this chromosome file.
log_kwargs (dict[str, Any], optional) – Keyword arguments forwarded to save_log(). Recognised key: "log_fname" (str) to override the default log filename.
verbose (bool or int, optional) – Verbosity level.
- Returns:
res_csv_fname (str) – Path to the CSV file containing significant QTLs for this chromosome.
log_chr (dict) – Log dictionary with key "skipped" (bool), and when not skipped: "log10p_threshold" (float), "n_tot_qtls" (int), "n_kept_qtls" (int), "source_chr_file" (str).
- ukbppp_dl.pgwas.merge_significant_qtls_from_csv(csv_fnames: list[str], output_fname: str | None = None, create_log: bool | int = False, log_kwargs: dict[str, Any] = {}, delete_csv: bool = False, verbose: bool | int = False) → tuple[DataFrame, dict]
Concatenate multiple CSV files into one DataFrame.
May create a JSON log file, may create a CSV file and can optionally delete the input CSV files after merging.
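The merge can be pictured as a header-checked concatenation. The sketch below uses only the standard csv module rather than the DataFrame library the function returns; the name merge_csv_files and the decision to raise on mismatched headers are assumptions made for illustration.

```python
import csv


def merge_csv_files(csv_fnames):
    """Concatenate CSV files sharing one header.

    Returns (header, rows, rows_per_csv). Header-only files are valid
    and simply contribute zero rows.
    """
    if not csv_fnames:
        raise ValueError("csv_fnames must be non-empty")
    header = None
    rows = []
    rows_per_csv = {}
    for fname in csv_fnames:
        with open(fname, newline="") as handle:
            reader = csv.reader(handle)
            current_header = next(reader)
            if header is None:
                header = current_header
            elif current_header != header:
                raise ValueError(f"inconsistent columns in {fname}")
            file_rows = list(reader)
        rows.extend(file_rows)
        rows_per_csv[fname] = len(file_rows)
    return header, rows, rows_per_csv
```

The per-file counts play the same role as the "n_kept_qtls_per_csv" entry of the log dictionary described in this entry.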
- Parameters:
csv_fnames (list[str]) – Paths to the per-chromosome CSV files to merge. Must be non-empty. Each file must contain at least a LOG10P column. Files must be consistent with each other (e.g. same columns); a file may contain only a header and no rows.
output_fname (str or None, optional) – If provided, the merged DataFrame is written to this CSV path.
create_log (bool or int, optional) – If truthy and output_fname is set, save a JSON log alongside the merged CSV.
log_kwargs (dict[str, Any], optional) – Keyword arguments forwarded to save_log(). Recognised key: "log_fname" (str) to override the default log filename.
delete_csv (bool, optional) – If True, delete the input csv_fnames after merging. Default is False.
verbose (bool or int, optional) – Verbosity level.
- Returns:
all_significant_qtls (pl.DataFrame) – Concatenated DataFrame of all significant QTLs from every input file. Empty (but with a header) if no significant QTL was found.
log (dict) – Dictionary with keys: "log_filename" (str | None), "merged_csv_filename" (str | None), "n_csv_merged" (int), "n_kept_qtls" (int), "n_kept_qtls_per_csv" (dict[str, int], mapping each CSV path to its row count), "min_log10p" (float | None).
- ukbppp_dl.pgwas.process_one_tar_file(tar_fname: str, res_location: str = './ukb_ppp_dl/results', separator: str = ' ', columns: list[str] | None = None, new_columns: list[str] | None = None, log10p_threshold: float | int = 7, create_log: bool | int = True, log_kwargs: dict[str, Any] = {}, verbose: bool | int = False) → tuple[list[str], dict]
Keep significant QTLs from one protein tar file.
May create a JSON log file for the protein and may create JSON log files for each chromosome. Will create a CSV file for each chromosome.
- Parameters:
tar_fname (str) – Path to the protein .tar file, e.g. "ABCA2_Q9BZC7_OID30146_v1_Cardiometabolic_II.tar".
res_location (str, optional) – Directory where result CSVs and logs are written.
separator (str, optional) – Field separator used in REGENIE files. Default is " " (space).
columns (list[str] or None, optional) – Subset of column names to load from the REGENIE file, e.g. ["CHROM", "GENPOS", "ID", "LOG10P"]. None loads all columns. The LOG10P column must be present.
new_columns (list[str] or None, optional) – New names for the loaded columns. Must have the same length as columns when provided. The LOG10P column must be present and keep the same name for filtering to work.
log10p_threshold (float or int, optional) – Minimum LOG10P threshold for significance. Default is 7.
create_log (bool or int, optional) – Logging verbosity. 0 disables; 1 creates a tar-level JSON log; 2 also creates per-chromosome logs. Default is True (1).
log_kwargs (dict[str, Any], optional) – Keyword arguments forwarded to save_log(). Recognised key: "log_fname" (str) to override the default log filename.
verbose (bool or int, optional) – Verbosity level.
- Returns:
all_csv_fnames (list[str]) – Paths to the per-chromosome CSV files of significant QTLs.
log_tar (dict) – Log dictionary with keys: "log_filename" (str | None), "tar_fname" (str), "protein_name" (str), "n_chr_files" (int), "skipped_chr_files" (list[str]), "all_csv_fnames" (list[str]), "n_processed_qtls" (int), "log10p_threshold" (float), "regenie_columns" (list[str] | None), "csv_columns" (list[str] | None).
- ukbppp_dl.pgwas.find_partial_region_logs(example_log_dict: dict | None = None, synapse_folder_id: str = 'syn51365308', res_location: str = './ukb_ppp_dl/results', regenie_columns: list[str] | None = None, csv_columns: list[str] | None = None, log10p_threshold: float | int = 7, all_tar_files: list[str] = [], verbose: bool | int = False) → dict[str, dict]
Find compatible pre-existing partial region log files.
- Parameters:
example_log_dict (dict or None, optional) – If provided, the function reads synapse_folder_id, regenie_columns, csv_columns, log10p_threshold, and all_tar_files from this dict instead of the individual keyword arguments. Expected keys match those written by keep_significant_qtls_from_region().
synapse_folder_id (str, optional) – Synapse folder ID used to filter log filenames, e.g. "syn51365308".
res_location (str, optional) – Directory to search for *.json partial log files.
regenie_columns (list[str] or None, optional) – Expected value of the "regenie_columns" key in matching logs.
csv_columns (list[str] or None, optional) – Expected value of the "csv_columns" key in matching logs.
log10p_threshold (float or int, optional) – Expected threshold. Logs with a different value are excluded.
all_tar_files (list[str], optional) – Expected list of tar filenames, e.g. ["ABCA2_Q9BZC7_OID30146_v1_Cardiometabolic_II.tar", ...].
verbose (bool or int, optional) – Verbosity level.
- Returns:
Mapping from log file path (str) to the parsed log dictionary (dict) for each compatible partial log found.
- Return type:
dict[str, dict]
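A compatibility check of this kind can be sketched as a comparison of selected keys. The helper names below are hypothetical, and the key names in the usage example are assumed to mirror the parameter names; the actual function also inspects filenames and reads the JSON files from res_location, which this sketch leaves out.

```python
def is_compatible_log(log, expected):
    """Return True if log agrees with expected on every key in expected."""
    return all(log.get(key) == value for key, value in expected.items())


def select_compatible_logs(logs_by_path, expected):
    """Keep only the logs whose parameters match the expected values."""
    return {
        path: log
        for path, log in logs_by_path.items()
        if is_compatible_log(log, expected)
    }


# Example: only the first log shares the expected threshold.
expected_params = {
    "synapse_folder_id": "syn51365308",
    "log10p_threshold": 7,
    "regenie_columns": None,
    "csv_columns": None,
}
example_logs = {
    "first.json": dict(expected_params),
    "second.json": {**expected_params, "log10p_threshold": 5},
}
compatible = select_compatible_logs(example_logs, expected_params)
```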
- ukbppp_dl.pgwas.merge_partial_region_logs(compatible_log_files: dict[str, dict]) → dict
Merge multiple compatible partial region log dictionaries into one.
To be compatible, log files must have been created by keep_significant_qtls_from_region() with the same parameters. In particular, they must share identical log10p_threshold, regenie_columns, csv_columns, and synapse_folder_id values.
This function cannot be used with already-merged log files.
Using this function is not advised unless you are certain that all log files were created with the same parameters and that the merged log file will be used consistently.
This function is mainly intended as a helper for keep_significant_qtls_from_region().
- Parameters:
compatible_log_files (dict[str, dict]) – Mapping from log file path to log dictionary, as returned by find_partial_region_logs(). All logs must share identical log10p_threshold, regenie_columns, csv_columns, and synapse_folder_id values.
- Returns:
A single merged log dictionary. Returns an empty {} when compatible_log_files is empty.
- Return type:
dict
- ukbppp_dl.pgwas.merge_partial_output_files(output_fname: str, partial_output_filenames: list[str]) → None
Concatenate partial text output files into a single output file.
This function will create a new text file.
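A minimal sketch of such a concatenation is shown here, assuming UTF-8 text files. The helper name merge_text_files and the banner format are illustrative assumptions; the banner actually written by the package may look different.

```python
import os


def merge_text_files(output_fname, partial_fnames):
    """Concatenate text files, sorted by filename, each under a banner line."""
    with open(output_fname, "w", encoding="utf-8") as out:
        # Sorting by basename keeps timestamped names in creation order.
        for fname in sorted(partial_fnames, key=os.path.basename):
            out.write(f"===== {fname} =====\n")
            with open(fname, encoding="utf-8") as part:
                content = part.read()
            out.write(content)
            # Make sure the next banner starts on its own line.
            if not content.endswith("\n"):
                out.write("\n")
```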
- Parameters:
output_fname (str) – Path to the final merged output text file to create, e.g. "syn51365308-output_text-2026-05-12--10:00:00.txt".
partial_output_filenames (list[str]) – Paths to the partial output files to concatenate, e.g. ["PART-syn51365308-output_text-2026-05-12--09:00:00.txt", ...]. Files are sorted by filename before concatenation so that order matches creation order. Each file’s content is wrapped with a separator banner showing the source filename.
- ukbppp_dl.pgwas.keep_significant_qtls_from_region(synapse_folder_id: str = 'syn51365308', download_location: str = './ukb_ppp_dl/data', res_location: str = './ukb_ppp_dl/results', login_kwargs: dict[str, Any] = {}, regenie_sep: str = ' ', regenie_columns: list[str] | None = None, csv_columns: list[str] | None = None, log10p_threshold: float | int = 7, create_log: bool | int = True, log_kwargs: dict[str, Any] = {}, protein_to_process: list[str] | None = None, verbose: bool | int = False, delete_downloaded_tar: bool = True, delete_chr_csv: bool = True, delete_tar_csv: bool = True, delete_tar_log: bool = True, delete_partial_logs: str | bool = 'current', delete_partial_outputs: str | bool = 'current') → tuple[DataFrame, dict]
Keep significant pGWAS QTLs for an entire Synapse region folder.
This is the main entry point for processing one ancestry group. It:
1. Lists all protein tar files in the Synapse folder.
2. For each protein, downloads the tar file from Synapse (if not already downloaded), extracts per-chromosome REGENIE files, filters for significant QTLs, and merges them into a protein-level CSV.
3. Concatenates all protein-level results into one region-level CSV.
4. Documents the process and results in logs and output text files.
5. Optionally cleans up intermediate files.
Partial results from a previous interrupted run are automatically detected and reused.
This function needs little free disk space because it processes one protein at a time and, if requested, deletes intermediate files along the way.
This function creates logs and output text files for each run, which has several benefits:
It allows the function to be safely re-run multiple times without overwriting previous logs and outputs.
It allows the function to automatically detect and reuse partial runs.
It provides a detailed record of the parameters used to generate the results, which can be crucial for scientific reproducibility.
It facilitates sanity checks and monitoring of the results.
It is highly recommended to keep the logs and output text files (at least at the region level, that is, with create_log=True or create_log=1) for future reference.
Note that even if create_log=False, partial log files are still created, but they are deleted at the end of the function.
This function creates many files during the process, but most are intermediate files that can be deleted at the end of the run if specified. The main final outputs are the region-level CSV file containing all significant QTLs across all processed proteins and the region-level output file documenting the process and results.
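The protein_to_process whitelist accepts either Synapse IDs or tar filenames. One way to picture that selection, applied to the (synapse_id, tar_filename) pairs returned by list_tar_files_in_region_folder(), is sketched below; the helper name select_proteins is hypothetical and the package may implement the filter differently.

```python
def select_proteins(tar_files, protein_to_process=None):
    """Keep the (synapse_id, tar_filename) pairs named in the whitelist.

    None means "process everything", mirroring the documented default.
    """
    if protein_to_process is None:
        return list(tar_files)
    wanted = set(protein_to_process)
    return [
        (synapse_id, tar_filename)
        for synapse_id, tar_filename in tar_files
        if synapse_id in wanted or tar_filename in wanted
    ]


# Pairs taken from the example entries in this documentation.
listed = [
    ("syn52361344", "ABCA2_Q9BZC7_OID30146_v1_Cardiometabolic_II.tar"),
    ("syn51470065", "ABHD14B_Q96IU4_OID20921_v1_Neurology.tar.tar"),
]
by_id = select_proteins(listed, ["syn52361344"])
by_name = select_proteins(listed, ["ABHD14B_Q96IU4_OID20921_v1_Neurology.tar.tar"])
```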
- Parameters:
synapse_folder_id (str, optional) – Synapse folder ID for the GWAS region, e.g. "syn51365308" for the Combined ancestry. Defaults to PGWAS_REGIONS["Combined"].
download_location (str, optional) – Local directory where tar files are downloaded.
res_location (str, optional) – Local directory where result CSV, log and output files are written.
login_kwargs (Dict[str, Any], optional) – Keyword arguments forwarded to synapseclient.Synapse.login(). Common keys include "authToken" (str), "email" (str) and "profile" (str). See synapseclient.Synapse.login() for more information. Alternatively, you can configure your Synapse credentials in a .synapseConfig file and leave this argument empty to use the default login behaviour. You can find more information about .synapseConfig in the Synapse documentation.
regenie_sep (str, optional) – Field separator used in REGENIE files. Default is " " (space).
regenie_columns (list[str] or None, optional) – Subset of REGENIE columns to load, e.g. ["CHROM", "GENPOS", "ID", "BETA", "SE", "LOG10P"]. "LOG10P" must be included. None loads all columns: ["CHROM", "GENPOS", "ID", "ALLELE0", "ALLELE1", "A1FREQ", "INFO", "N", "TEST", "BETA", "SE", "CHISQ", "LOG10P", "EXTRA"].
csv_columns (list[str] or None, optional) – New column names to assign after loading. Must match the length of regenie_columns when both are provided. "LOG10P" must keep the same name. If None, no renaming is done and original REGENIE column names are used.
log10p_threshold (float or int, optional) – Minimum LOG10P value for significance. Default is 7.
create_log (bool or int, optional) – Logging verbosity. 0 disables; 1 creates region-level logs; 2 also creates protein-level logs; 3 also creates per-chromosome logs. Default is True (1).
log_kwargs (dict[str, Any], optional) – Keyword arguments forwarded to save_log(). Recognised key: "log_fname" (str) to override the default log filename. This parameter is not fully implemented yet and it is mostly the default filenames that are used.
protein_to_process (list[str] or None, optional) – Whitelist of proteins to process. Accepts Synapse IDs (e.g. "syn52361344") or tar filenames (e.g. "ABCA2_Q9BZC7_OID30146_v1_Cardiometabolic_II.tar"). None processes all proteins.
verbose (bool or int, optional) – Verbosity level. 0/False is silent; higher values print more details at each processing step.
delete_downloaded_tar (bool, optional) – Delete each downloaded tar file after processing to save disk space. Default is True.
delete_chr_csv (bool, optional) – Delete per-chromosome significant-QTL CSV files after merging them into the protein-level CSV. Default is True.
delete_tar_csv (bool, optional) – Delete protein-level significant-QTL CSV files after merging them into the region-level CSV. Default is True.
delete_tar_log (bool, optional) – Delete protein-level log files after the region-level log is created. Default is True.
delete_partial_logs (str or bool, optional) – Controls deletion of partial region log files. "current" or True deletes only the log created by this run; "all" deletes all compatible partial logs; False keeps them all. Default is "current".
delete_partial_outputs (str or bool, optional) – Controls deletion of partial region output text files. Same options as delete_partial_logs. Default is "current".
- Returns:
all_significant_qtls (pl.DataFrame) – DataFrame of all significant QTLs across all processed proteins. Contains a leading "protein_name" (str) column followed by the REGENIE (or renamed) columns.
final_log_reg (dict) – Final region log dictionary (merged from all partial logs).