Add support for PLINK2 pgen/pvar/psam input across pipeline#118
ariannalandini wants to merge 11 commits into main
- Add get_plink2_input_flag() helper to auto-detect format and return correct plink2 flag
- Update run_dentist() to use auto-detected format
- Update prep_susie_ld() to use auto-detected format (already done)
- Update dataset.align() in s02_sumstat_munging_and_aligning.R to support both formats
- Update s01_alpha_sort_alleles_snpid.R to:
  - Detect input format (pgen vs bed)
  - Read variant info from .pvar (pgen) or .bim (bed)
  - Write standard_snpid files for both formats
  - Use correct plink2 flags for extraction
- Update --bfile help strings to mention pgen/psam/pvar support
- Update main.nf to auto-detect and collect pgen/psam/pvar files alongside bed/bim/fam
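The auto-detection described in this commit can be sketched roughly as follows. This is a minimal Python illustration, not the PR's R helper: the function name `detect_plink_flag` and the rule that all three files of a set must exist are assumptions made for the example. The flags `--pfile` and `--bfile` are the standard plink2 input flags for the two formats.

```python
from pathlib import Path

# File extensions that make up each format (taken from the PR description).
PLINK2_EXTS = (".pgen", ".pvar", ".psam")
PLINK1_EXTS = (".bed", ".bim", ".fam")


def _all_exist(prefix, exts):
    """Return True if prefix + ext is an existing file for every extension."""
    # Plain string concatenation is used because the prefix may itself
    # contain dots (e.g. "ld.chr1"), which Path.with_suffix would mangle.
    return all(Path(f"{prefix}{ext}").is_file() for ext in exts)


def detect_plink_flag(prefix):
    """Hypothetical helper: choose the plink2 input flag for a file prefix.

    Assumption: PLINK2 files take priority when both sets are present.
    """
    if _all_exist(prefix, PLINK2_EXTS):
        return "--pfile"
    if _all_exist(prefix, PLINK1_EXTS):
        return "--bfile"
    raise FileNotFoundError(
        f"No complete pgen/pvar/psam or bed/bim/fam set found for '{prefix}'"
    )
```

Checking for a complete set, rather than only the .pgen file, avoids selecting a format whose companion files are missing.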
…ws error only if bfiles with neither extesion stes are present
…ut - added variable file_type = plink2|plink1
…(pgen/pvar/psam) or plink1 (bed/bim/fam) format
For the sake of long-term maintainability, we could consider refactoring some internal logic to move the plink commands out of the R scripts. That way, a unified upstream step would convert the input data (whatever its format) into the format we need for downstream computations, and we would not have to propagate input-format logic through all the scripts.
edg1983 left a comment
I've added a few minor comments. The only really relevant one is this one
https://github.com/Biostatistics-Unit-HT/Flanders/pull/118/changes#r3009833686
  option_list <- list(
-   make_option("--bfile", default=NULL, help="Path and prefix name of custom LD bfiles (PLINK format .bed .bim .fam)"),
+   make_option("--bfile", default=NULL, help="Path and prefix name of custom LD genotype files (PLINK bed/bim/fam or pgen/psam/pvar)"),
+   make_option("--file-type", default = NULL, help = "Input file type: plink1 or plink2"),
It seems this is not used anywhere downstream. The file type is determined directly from the file path checking the existence of the .pgen file.
https://github.com/Biostatistics-Unit-HT/Flanders/pull/118/changes#diff-3aefe62b0b66beca7581a2aa9f8962deabc09814f5127c72ce787c9086561801R70
If this is the case, I would opt for the automatic logic, since in my understanding, we want to prioritize pgen over bed when both are present, and this is easier to manage looking at the file path itself.
You're right.
In the main.nf I check for existence of both plink formats (L58):
def has_plink1 = plink1_files.every { it.exists() }
def has_plink2 = plink2_files.every { it.exists() }
and then if both exist, I give priority to the plink 2 format (L68):
def file_type = has_plink2 ? "plink2" : "plink1"
def bfile_dataset = has_plink2 ? plink2_files : plink1_files
So the priority of the plink2 format over plink1 is taken care of in the Nextflow part. Do we still want a second layer that prioritizes the plink2 format in this R script, when the other R scripts (bin/funs_locus_breaker_cojo_finemap_all_at_once.R and bin/s02_sumstat_munging_and_aligning.R) just use the bfile_type argument?
OK, I see.
We can let Nextflow manage the priority, and then the scripts will behave in a strictly controlled way according to the bfile_type argument. I agree, let's implement here the same approach used in the other R scripts.
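The "strictly controlled" behaviour agreed on here could look roughly like the sketch below. It is written in Python for illustration only, while the scripts under discussion are in R. The names `plink_input_args` and `build_extract_command` are hypothetical, as is the exact command layout. The only assumptions taken from the thread are the two allowed values, plink1 and plink2.

```python
# Mapping from the file type decided upstream (by Nextflow) to the
# plink2 input flag. No file-system probing happens at this level.
PLINK_FLAGS = {"plink1": "--bfile", "plink2": "--pfile"}


def plink_input_args(file_type, prefix):
    """Hypothetical helper: translate an explicit file type into plink2 args."""
    if file_type not in PLINK_FLAGS:
        allowed = ", ".join(sorted(PLINK_FLAGS))
        raise ValueError(
            f"Unknown file type '{file_type}'; expected one of: {allowed}"
        )
    return [PLINK_FLAGS[file_type], prefix]


def build_extract_command(file_type, prefix, snp_list, out_prefix):
    """Assemble an illustrative plink2 call that extracts a list of variants."""
    return [
        "plink2",
        *plink_input_args(file_type, prefix),
        "--extract", snp_list,
        "--make-bed",
        "--out", out_prefix,
    ]
```

With this layout an unexpected value fails immediately instead of silently falling back to one format, so the priority rule lives in a single place (main.nf).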
This PR addresses issue #104 by extending the pipeline to support PLINK2 format datasets (.pgen/.pvar/.psam) in addition to the existing PLINK1 (.bed/.bim/.fam) format. Employing dosages (when possible) should improve fine-mapping and allow the use of SAFE-LD-generated files.
The implementation is designed to:
Key changes
1. Input validation (validate_columns.py)
2. Dataset detection in main.nf
After INPUT_COLUMNS_VALIDATION, PLINK files are now assigned a file_type (plink1 or plink2), which is passed downstream together with the PLINK files for:
Downstream channel structure has been updated accordingly.