
Add support for PLINK2 pgen/pvar/psam input across pipeline#118

Open
ariannalandini wants to merge 11 commits into main from pgen_support

@ariannalandini
Collaborator

This PR addresses issue #104 by extending the pipeline to support PLINK2 format datasets (.pgen/.pvar/.psam) in addition to the existing PLINK1 (.bed/.bim/.fam) format. Employing dosages (when possible) should improve fine-mapping and allow the use of SAFE-LD-generated files.

The implementation is designed to:

  • Accept both formats at input level, giving priority to the PLINK2 format if both are present
  • Automatically detect dataset type
  • Preserve the original input format throughout processing
  • Maintain backward compatibility with existing workflows

Key changes

1. Input validation (validate_columns.py)

  • Updated validation logic to accept either:
    • PLINK1: .bed/.bim/.fam
    • PLINK2: .pgen/.pvar/.psam
  • Validation now passes if at least one complete dataset exists
  • Priority is implicitly given to PLINK2 when both formats are present
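
The validation rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in validate_columns.py: the names `has_complete_set` and `detect_plink_format` are hypothetical, and the sketch assumes each dataset is addressed by a shared path prefix.

```python
from pathlib import Path

PLINK1_EXTS = (".bed", ".bim", ".fam")
PLINK2_EXTS = (".pgen", ".pvar", ".psam")


def has_complete_set(prefix, extensions):
    """Return True only if every file of the set exists for this prefix."""
    return all(Path(f"{prefix}{ext}").is_file() for ext in extensions)


def detect_plink_format(prefix):
    """Hypothetical helper: return "plink2" or "plink1" for a path prefix.

    PLINK2 is checked first, so it wins when both complete sets are present.
    A ValueError is raised when neither set is complete.
    """
    if has_complete_set(prefix, PLINK2_EXTS):
        return "plink2"
    if has_complete_set(prefix, PLINK1_EXTS):
        return "plink1"
    raise ValueError(
        f"No complete PLINK1 or PLINK2 file set found for prefix '{prefix}'"
    )
```

Checking PLINK2 first is what makes the priority implicit: no separate "both present" branch is needed.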

2. Dataset detection in main.nf

After INPUT_COLUMNS_VALIDATION, PLINK files are now assigned a file_type (plink1 or plink2), which is passed downstream together with the PLINK files for:

  • PROCESS_BFILE
  • MUNG_AND_LOCUS_BREAKER
  • SUSIE_FINEMAPPING

Downstream channel structure has been updated accordingly.

Sodbo and others added 10 commits March 17, 2026 11:04
- Add get_plink2_input_flag() helper to auto-detect format and return correct plink2 flag
- Update run_dentist() to use auto-detected format
- Update prep_susie_ld() to use auto-detected format (already done)
- Update dataset.align() in s02_sumstat_munging_and_aligning.R to support both formats
- Update s01_alpha_sort_alleles_snpid.R to:
  - Detect input format (pgen vs bed)
  - Read variant info from .pvar (pgen) or .bim (bed)
  - Write standard_snpid files for both formats
  - Use correct plink2 flags for extraction
- Update --bfile help strings to mention pgen/psam/pvar support
- Update main.nf to auto-detect and collect pgen/psam/pvar files alongside bed/bim/fam
…ws error only if bfiles with neither extension sets are present
…ut - added variable file_type = plink2|plink1
…(pgen/pvar/psam) or plink1 (bed/bim/fam) format
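
The commits above mention reading variant info from .pvar (pgen) or .bim (bed). The pipeline does this in R; the Python sketch below only illustrates the difference between the two layouts. The function name `read_variants` is hypothetical, and the sketch assumes the .pvar file carries a `#CHROM` header line (a headerless .pvar is not handled here).

```python
# Column order of a PLINK1 .bim file (no header line).
BIM_COLUMNS = ["chrom", "id", "cm", "pos", "a1", "a2"]


def read_variants(path, file_type):
    """Hypothetical reader: return a list of {chrom, pos, id} dicts."""
    if file_type not in ("plink1", "plink2"):
        raise ValueError(f"Unknown file_type: {file_type!r}")

    variants = []
    header = None  # only used for .pvar
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            if file_type == "plink1":
                row = dict(zip(BIM_COLUMNS, line.split()))
                chrom, pos, var_id = row["chrom"], row["pos"], row["id"]
            else:
                if line.startswith("##"):
                    continue  # skip meta-information lines
                fields = line.split()
                if header is None:
                    if not line.startswith("#CHROM"):
                        # Assumption of this sketch: a header is required.
                        raise ValueError("Expected a #CHROM header in .pvar")
                    header = [name.lstrip("#") for name in fields]
                    continue
                row = dict(zip(header, fields))
                chrom, pos, var_id = row["CHROM"], row["POS"], row["ID"]
            variants.append({"chrom": chrom, "pos": int(pos), "id": var_id})
    return variants
```

Returning the same record shape for both formats keeps the format-specific logic in one place, which is in the spirit of the refactoring suggested later in this thread.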
@edg1983
Collaborator

edg1983 commented Mar 25, 2026

For the sake of long-term maintainability, we could consider refactoring some internal logic to move the plink commands out of the R scripts. That way, a unified upstream step would convert the input data (whatever its format) into the format needed for downstream computations, and we would not have to propagate input-format logic through every script.
Anyway, the modifications look good so far... I'll finish reviewing tomorrow.

Collaborator

@edg1983 edg1983 left a comment

I've added a few minor comments. The only really relevant one is this one
https://github.com/Biostatistics-Unit-HT/Flanders/pull/118/changes#r3009833686

 option_list <- list(
-  make_option("--bfile", default=NULL, help="Path and prefix name of custom LD bfiles (PLINK format .bed .bim .fam)"),
+  make_option("--bfile", default=NULL, help="Path and prefix name of custom LD genotype files (PLINK bed/bim/fam or pgen/psam/pvar)"),
+  make_option("--file-type", default = NULL, help = "Input file type: plink1 or plink2"),
Collaborator

It seems this argument is not used anywhere downstream. The file type is determined directly from the file path, by checking for the existence of the .pgen file.
https://github.com/Biostatistics-Unit-HT/Flanders/pull/118/changes#diff-3aefe62b0b66beca7581a2aa9f8962deabc09814f5127c72ce787c9086561801R70

If this is the case, I would opt for the automatic logic: as I understand it, we want to prioritize pgen over bed when both are present, and that is easier to manage by looking at the file path itself.

Collaborator Author

You're right.

In main.nf I check whether each PLINK format is present (L58):

def has_plink1 = plink1_files.every { it.exists() }
def has_plink2 = plink2_files.every { it.exists() }

and then, if both exist, I give priority to the plink2 format (L68):

def file_type = has_plink2 ? "plink2" : "plink1"
def bfile_dataset = has_plink2 ? plink2_files : plink1_files

So the priority of plink2 over plink1 is already handled in the Nextflow part. Do we still want a second layer of plink2 prioritization in this R script, given that the other R scripts (bin/funs_locus_breaker_cojo_finemap_all_at_once.R and bin/s02_sumstat_munging_and_aligning.R) just use the bfile_type argument?

Collaborator

OK, I see.
We can let Nextflow manage the priority, and the scripts will then behave in a strictly controlled way according to the bfile_type argument. I agree: let's implement the same approach here as in the other R scripts.
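
As a rough illustration of the approach agreed on here (no auto-detection inside the script, only a strict mapping from the file type decided upstream), a Python sketch could look like the following. The function name `plink2_input_args` is hypothetical and the actual scripts are written in R; `--pfile` and `--bfile` are the plink2 options for pgen/pvar/psam and bed/bim/fam prefixes respectively.

```python
# Map the upstream-decided file type to the plink2 input option.
PLINK2_INPUT_FLAGS = {
    "plink2": "--pfile",  # prefix of .pgen/.pvar/.psam
    "plink1": "--bfile",  # prefix of .bed/.bim/.fam
}


def plink2_input_args(prefix, file_type):
    """Hypothetical helper: build the plink2 input arguments for a prefix.

    The file type is trusted as given; an unknown value is an error rather
    than a trigger for re-detecting the format from the files on disk.
    """
    try:
        flag = PLINK2_INPUT_FLAGS[file_type]
    except KeyError:
        raise ValueError(
            f"file_type must be 'plink1' or 'plink2', got {file_type!r}"
        ) from None
    return [flag, str(prefix)]
```

Failing loudly on an unexpected value keeps the priority decision in a single place (the Nextflow layer) instead of letting each script make its own guess.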
