Add support for PLINK2 pgen/pvar/psam input across pipeline by ariannalandini · Pull Request #118 · Biostatistics-Unit-HT/Flanders

ariannalandini · 2026-03-25T15:29:53Z

This PR addresses issue #104 by extending the pipeline to support PLINK2 format datasets (.pgen/.pvar/.psam) in addition to the existing PLINK1 (.bed/.bim/.fam) format. Employing dosages (when possible) should improve fine-mapping and allow the use of SAFE-LD-generated files.

The implementation is designed to:

- Accept both formats at input level, giving priority to the PLINK2 format if both are present
- Automatically detect dataset type
- Preserve the original input format throughout processing
- Maintain backward compatibility with existing workflows

Key changes

1. Input validation (validate_columns.py)

Updated validation logic to accept either:
- PLINK1: .bed/.bim/.fam
- PLINK2: .pgen/.pvar/.psam
Validation now passes if at least one complete dataset exists
Priority is implicitly given to PLINK2 when both formats are present
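
As an illustration of the rule above, here is a minimal Python sketch. It is not the code in validate_columns.py: the helper name `detect_plink_format` and the exact error message are assumptions made for this example. A format counts only when all three of its files exist, and PLINK2 wins when both sets are complete.

```python
from pathlib import Path

# Extension sets named in the PR description.
PLINK1_EXTS = (".bed", ".bim", ".fam")
PLINK2_EXTS = (".pgen", ".pvar", ".psam")


def detect_plink_format(prefix: str) -> str:
    """Return "plink2" or "plink1" for a genotype file prefix.

    Hypothetical helper: a format is accepted only if all three of its
    files exist, and PLINK2 takes priority when both sets are complete.
    """
    def complete(exts):
        # Append the extension to the prefix as a string, so prefixes
        # that already contain dots (e.g. "chr1.ld") are handled correctly.
        return all(Path(prefix + ext).is_file() for ext in exts)

    if complete(PLINK2_EXTS):
        return "plink2"
    if complete(PLINK1_EXTS):
        return "plink1"
    raise FileNotFoundError(
        f"No complete PLINK1 or PLINK2 dataset found for prefix: {prefix}"
    )
```

Checking PLINK2 first is what makes the priority "implicit": no separate tie-breaking step is needed when both formats are present.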

2. Dataset detection in main.nf

After INPUT_COLUMNS_VALIDATION, PLINK files are now assigned a file_type (plink1 or plink2), which is passed downstream together with the PLINK files for:

- PROCESS_BFILE
- MUNG_AND_LOCUS_BREAKER
- SUSIE_FINEMAPPING

Downstream channel structure has been updated accordingly.

- Add get_plink2_input_flag() helper to auto-detect format and return correct plink2 flag
- Update run_dentist() to use auto-detected format
- Update prep_susie_ld() to use auto-detected format (already done)
- Update dataset.align() in s02_sumstat_munging_and_aligning.R to support both formats
- Update s01_alpha_sort_alleles_snpid.R to:
  - Detect input format (pgen vs bed)
  - Read variant info from .pvar (pgen) or .bim (bed)
  - Write standard_snpid files for both formats
  - Use correct plink2 flags for extraction
- Update --bfile help strings to mention pgen/psam/pvar support
- Update main.nf to auto-detect and collect pgen/psam/pvar files alongside bed/bim/fam
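
To make the ".pvar or .bim" step concrete, the sketch below reads variant records from either file in Python. The pipeline does this in R, so this is only an illustration: the names `Variant` and `read_variants` are hypothetical, the variants are assumed to be biallelic, and headerless .pvar files are deliberately not handled. Alleles are sorted alphabetically, so the different allele column order of the two formats does not affect the result.

```python
from typing import List, NamedTuple, Tuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    variant_id: str
    alleles: Tuple[str, ...]  # alphabetically sorted pair of alleles


def read_variants(prefix: str, bfile_type: str) -> List[Variant]:
    """Read variants from <prefix>.pvar (plink2) or <prefix>.bim (plink1)."""
    if bfile_type == "plink2":
        return _read_pvar(prefix + ".pvar")
    if bfile_type == "plink1":
        return _read_bim(prefix + ".bim")
    raise ValueError(f"Unknown bfile_type: {bfile_type!r}")


def _read_bim(path: str) -> List[Variant]:
    # .bim has no header; columns are chrom, ID, cM, position, allele 1, allele 2.
    variants = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue
            chrom, vid, _cm, pos, a1, a2 = fields[:6]
            variants.append(Variant(chrom, int(pos), vid, tuple(sorted((a1, a2)))))
    return variants


def _read_pvar(path: str) -> List[Variant]:
    # .pvar may start with "##" meta lines, followed by a "#CHROM ..." header
    # line that names the columns (POS, ID, REF, ALT, ...).
    variants = []
    columns = None
    with open(path) as handle:
        for line in handle:
            if line.startswith("##") or not line.strip():
                continue
            fields = line.split()
            if line.startswith("#CHROM"):
                columns = {name: i for i, name in enumerate(fields)}
                continue
            if columns is None:
                raise ValueError("Headerless .pvar files are not handled in this sketch")
            ref, alt = fields[columns["REF"]], fields[columns["ALT"]]
            variants.append(Variant(
                fields[columns["#CHROM"]],
                int(fields[columns["POS"]]),
                fields[columns["ID"]],
                tuple(sorted((ref, alt))),
            ))
    return variants
```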

edg1983 · 2026-03-25T16:13:47Z

For the sake of long-term maintainability, we could consider refactoring some of the internal logic to move the plink commands out of the R scripts. That way, a single upstream processing step could convert the input data (whatever its format) into the format we need for downstream computations, and we would not have to propagate input-file-format logic through all the scripts.
Anyway, the modifications look good so far. I'll finish reviewing tomorrow.

edg1983

I've added a few minor comments. The only really relevant one is this one
https://github.com/Biostatistics-Unit-HT/Flanders/pull/118/changes#r3009833686

edg1983 · 2026-03-30T13:25:16Z

bin/s01_alpha_sort_alleles_snpid.R

 option_list <- list(
-    make_option("--bfile", default=NULL, help="Path and prefix name of custom LD bfiles (PLINK format .bed .bim .fam)"),
+    make_option("--bfile", default=NULL, help="Path and prefix name of custom LD genotype files (PLINK bed/bim/fam or pgen/psam/pvar)"),
+    make_option("--file-type", default = NULL, help = "Input file type: plink1 or plink2"),


It seems this is not used anywhere downstream. The file type is determined directly from the file path, by checking for the existence of the .pgen file.
https://github.com/Biostatistics-Unit-HT/Flanders/pull/118/changes#diff-3aefe62b0b66beca7581a2aa9f8962deabc09814f5127c72ce787c9086561801R70

If this is the case, I would opt for the automatic logic since, as I understand it, we want to prioritize pgen over bed when both are present, and this is easier to manage by looking at the file path itself.

You're right.

In main.nf I check for the existence of both PLINK formats (L58):

def has_plink1 = plink1_files.every { it.exists() }
def has_plink2 = plink2_files.every { it.exists() }

and then, if both exist, I give priority to the PLINK2 format (L68):

def file_type = has_plink2 ? "plink2" : "plink1"
def bfile_dataset = has_plink2 ? plink2_files : plink1_files

So the priority of the PLINK2 format over PLINK1 is already taken care of in the Nextflow part. Do we still want a second layer of PLINK2 prioritization in this R script, while the other R scripts (bin/funs_locus_breaker_cojo_finemap_all_at_once.R and bin/s02_sumstat_munging_and_aligning.R) just use the bfile_type argument?

OK, I see.
We can let Nextflow manage the priority, and then the scripts will behave in a strictly controlled way according to the bfile_type argument. I agree, let's implement here the same approach used in the other R scripts.
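
The "strictly controlled" behaviour agreed on here could look roughly like the Python sketch below. The real scripts are in R, so this only illustrates the idea: the script trusts the bfile_type value passed by Nextflow and never inspects the file system itself. The helper names `plink2_input_args` and `extract_command` are hypothetical. The flags used (`--bfile`, `--pfile`, `--extract`, `--make-bed`, `--make-pgen`, `--out`) are standard plink2 options, but whether the pipeline combines them this way is an assumption.

```python
# Genotype-input flags of the plink2 command line, keyed by bfile_type.
PLINK2_INPUT_FLAGS = {"plink1": "--bfile", "plink2": "--pfile"}


def plink2_input_args(prefix, bfile_type):
    """Return the plink2 arguments that load `prefix` in the given format."""
    try:
        flag = PLINK2_INPUT_FLAGS[bfile_type]
    except KeyError:
        # Fail loudly instead of guessing the format from the files on disk.
        raise ValueError(
            f"bfile_type must be 'plink1' or 'plink2', got {bfile_type!r}"
        ) from None
    return [flag, prefix]


def extract_command(prefix, bfile_type, snp_list, out_prefix):
    """Build a variant-extraction command that keeps the input format."""
    input_args = plink2_input_args(prefix, bfile_type)
    output_flag = "--make-pgen" if bfile_type == "plink2" else "--make-bed"
    return ["plink2", *input_args, "--extract", snp_list, output_flag, "--out", out_prefix]
```

Because an unknown bfile_type raises an error, the format priority stays in exactly one place (main.nf) and a script cannot silently pick a different dataset than the one Nextflow staged.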

bin/s04_cojo_finemapping.R

bin/s04_susie_finemapping.R

modules/local/process_bfile/main.nf

Sodbo and others added 10 commits March 17, 2026 11:04

Check also for existence of bfile with pgen/pvar/psam extension. Throws error only if bfiles with neither extension set are present

9ff972c

process bfile modified to accept both plink1 and plink2 format as input - added variable file_type = plink2|plink1

78afa45

Specify file extensions of plink genotypes

70edc35

Include value defining whether genotypes input is provided in plink2 (pgen/pvar/psam) or plink1 (bed/bim/fam) format

922286b

Add to prep_susie_ld() the argument specifying bfile_type (plink1|plink2)

26fc788

Add to dataset.align() the argument specifying bfile_type (plink1|plink2)

8c3fa83

Add bfile_type (plink1|plink2) to script arguments

6e279e1

Adding file_type for bfile as value to input channel

621aa35

Remove HT HPC path to local conda

ff0a1b2

ariannalandini requested review from Sodbo, bruno-ariano and edg1983 March 25, 2026 15:29

edg1983 requested changes Mar 30, 2026

Fix typo in argument descriptions

7d30c56

ariannalandini wants to merge 11 commits into main from pgen_support

3 participants
