Overview
The malariagen_data Python package provides powerful tools for accessing and analyzing MalariaGEN datasets. While the API is well designed for researchers familiar with genomic data workflows, new contributors and interdisciplinary users may benefit from structured, reproducible, end-to-end example pipelines.
This issue proposes the addition of guided analysis workflows and reproducible templates that demonstrate common genomic data analysis tasks using malariagen_data.
Problem Statement
Currently, users may face the following challenges:
Unclear path from raw dataset access → analysis → interpretation
Few reproducible notebooks at a beginner-to-intermediate level
Limited example workflows for common population genomics tasks
Difficult onboarding for users from computational or data science backgrounds
There is an opportunity to improve usability and accessibility without modifying core functionality.
Proposed Solution
Introduce a new “Reproducible Workflows” section in the documentation, including:
1️⃣ Beginner-Friendly Notebooks
Accessing MalariaGEN datasets
Filtering variants
Basic allele frequency computation
Visualization of genomic variation
Simple statistical summaries
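To give a sense of scale, a beginner notebook could open with something like the sketch below. It is an illustration only: it uses a tiny hand-written mock genotype array and plain NumPy rather than any malariagen_data call, and the function name `alt_allele_frequency` and the 0.2 frequency threshold are assumptions made for this example.

```python
import numpy as np

# Mock diploid genotype calls, shape (n_variants, n_samples, ploidy).
# 0 = reference allele, 1 = alternate allele, -1 = missing call.
genotypes = np.array([
    [[0, 0], [0, 1], [1, 1], [0, 1]],
    [[0, 0], [0, 0], [0, 1], [-1, -1]],
    [[1, 1], [1, 1], [0, 1], [1, 1]],
])


def alt_allele_frequency(gt):
    """Return the alternate allele frequency per variant, ignoring missing calls."""
    called = gt >= 0
    n_called = called.sum(axis=(1, 2))
    n_alt = (gt == 1).sum(axis=(1, 2))
    # Variants with no called alleles get NaN instead of a division error.
    return np.divide(
        n_alt,
        n_called,
        out=np.full(gt.shape[0], np.nan),
        where=n_called > 0,
    )


frequencies = alt_allele_frequency(genotypes)

# Simple variant filter: keep sites whose minor allele frequency is >= 0.2
# (the threshold is arbitrary and chosen only for this demo).
maf = np.minimum(frequencies, 1 - frequencies)
keep = maf >= 0.2

print("alt allele frequency:", np.round(frequencies, 3))
print("variants kept:", keep)
```

A real notebook would replace the mock array with data loaded through the package and add a plot of the frequency distribution.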
2️⃣ Intermediate Analysis Pipelines
Population structure analysis
PCA-based clustering
Linkage disequilibrium exploration
Genotype-phenotype association examples (mock/demo scale)
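As a rough illustration of two of the intermediate tasks, the sketch below runs a PCA and a pairwise linkage disequilibrium estimate on simulated data. Everything here is an assumption for demonstration: the data are random, the function names `pca` and `ld_r2` are hypothetical, PCA is done with a plain SVD, and LD is approximated by the squared Pearson correlation between alternate allele counts, a common proxy for unphased genotypes. It does not reproduce the package's own analysis functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two mock populations whose allele frequencies differ.
n_variants, n_per_pop = 200, 10
p_a = rng.uniform(0.05, 0.95, n_variants)
p_b = np.clip(p_a + rng.normal(0, 0.3, n_variants), 0.01, 0.99)

# Alternate allele counts (0, 1 or 2), shape (n_variants, n_samples).
gn = np.concatenate(
    [
        rng.binomial(2, p_a[:, None], (n_variants, n_per_pop)),
        rng.binomial(2, p_b[:, None], (n_variants, n_per_pop)),
    ],
    axis=1,
).astype(float)


def pca(gn, n_components=2):
    """Project samples onto principal components via SVD."""
    # Drop variants with no variation, then centre each variant.
    gn = gn[gn.std(axis=1) > 0]
    centered = gn - gn.mean(axis=1, keepdims=True)
    # Rows of centered.T are samples, columns are variants.
    u, s, _ = np.linalg.svd(centered.T, full_matrices=False)
    coords = u[:, :n_components] * s[:n_components]
    explained = s**2 / (s**2).sum()
    return coords, explained[:n_components]


def ld_r2(a, b):
    """Squared correlation between the allele counts of two variants."""
    r = np.corrcoef(a, b)[0, 1]
    return r**2


coords, explained = pca(gn)
print("sample coordinates shape:", coords.shape)
print("variance explained by PC1, PC2:", np.round(explained, 3))
```

Plotting `coords[:, 0]` against `coords[:, 1]`, coloured by population label, would be the natural next cell in a notebook.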
3️⃣ Reproducibility Enhancements
Provide environment.yml / requirements.txt
Example Docker container setup
Clear dataset download + caching instructions
Structured pipeline scripts (CLI-based execution optional)
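A CLI-driven pipeline script could start from a skeleton like the one below. This is a sketch under stated assumptions: the flag names (`--sample-set`, `--region`, `--cache-dir`, `--output-dir`, `--seed`) and the `run_params.json` manifest are hypothetical and not part of malariagen_data. The idea is only that each run records its parameters next to its outputs so it can be repeated.

```python
import argparse
import json
from pathlib import Path


def build_parser():
    """Build the argument parser for a hypothetical workflow script."""
    parser = argparse.ArgumentParser(
        description="Run an example analysis workflow (illustrative sketch)."
    )
    parser.add_argument("--sample-set", required=True,
                        help="Identifier of the sample set to analyse.")
    parser.add_argument("--region", required=True,
                        help="Genomic region to analyse.")
    parser.add_argument("--cache-dir", type=Path, default=Path("data_cache"),
                        help="Directory used to cache downloaded data.")
    parser.add_argument("--output-dir", type=Path, default=Path("results"),
                        help="Directory for results and the run manifest.")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed, recorded for reproducibility.")
    return parser


def run(argv=None):
    """Parse arguments, write a parameter manifest, and return the parameters."""
    args = build_parser().parse_args(argv)
    args.output_dir.mkdir(parents=True, exist_ok=True)
    params = {
        "sample_set": args.sample_set,
        "region": args.region,
        "cache_dir": str(args.cache_dir),
        "seed": args.seed,
    }
    # Record the exact parameters alongside the outputs.
    manifest = args.output_dir / "run_params.json"
    manifest.write_text(json.dumps(params, indent=2))
    # The analysis steps themselves would be called here.
    return params
```

A real script would call `run()` under an `if __name__ == "__main__":` guard, and the sample set and region values would come from the dataset documentation.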
4️⃣ Performance & Memory Usage Guide
Best practices for working with large genomic datasets
Chunked loading examples
Dask or parallel processing suggestions (if applicable)
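The chunked-loading idea can be sketched as follows. This is a minimal illustration using an in-memory NumPy array as a stand-in for a large genotype array; the helper names `iter_chunks` and `chunked_alt_counts` and the chunk size are assumptions made for this example. The same loop should in principle work with array-likes that only load data when sliced, such as Zarr or Dask arrays.

```python
import numpy as np


def iter_chunks(n_rows, chunk_size):
    """Yield slices that cover range(n_rows) in blocks of at most chunk_size."""
    for start in range(0, n_rows, chunk_size):
        yield slice(start, min(start + chunk_size, n_rows))


def chunked_alt_counts(genotypes, chunk_size):
    """Count alternate alleles per variant, one block of variants at a time."""
    n_variants = genotypes.shape[0]
    counts = np.empty(n_variants, dtype=np.int64)
    for sl in iter_chunks(n_variants, chunk_size):
        # Only this block is materialised in memory at any one time.
        block = np.asarray(genotypes[sl])
        counts[sl] = (block == 1).sum(axis=(1, 2))
    return counts


# Mock stand-in for a large array: (n_variants, n_samples, ploidy).
rng = np.random.default_rng(1)
mock = rng.integers(0, 2, size=(1000, 50, 2), dtype=np.int8)

counts = chunked_alt_counts(mock, chunk_size=128)
print("first five alt allele counts:", counts[:5])
```

Peak memory is bounded by the chunk size rather than by the full array, which is the main point a performance guide would want to demonstrate.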
Suggested Implementation Approach
Create an examples/ or workflows/ directory
Add Jupyter notebooks with narrative explanations
Include small mock datasets for demonstration
Integrate documentation links between API reference and workflow examples
Optional:
Add automated notebook testing (CI)
Add benchmark example for large-scale dataset usage
Impact
This enhancement would:
Improve accessibility for students and early researchers
Support reproducible research practices
Encourage interdisciplinary adoption
Reduce onboarding friction
Increase educational value of the project