[Enhancement] Add Reproducible Analysis Pipelines & Beginner-Friendly Example Workflows #1025

@ayushk687

Overview

The malariagen_data Python package provides powerful tools for accessing and analyzing MalariaGEN datasets. While the API is well-designed for researchers familiar with genomic data workflows, new contributors and interdisciplinary users may benefit from structured, end-to-end reproducible example pipelines.

This issue proposes the addition of guided analysis workflows and reproducible templates that demonstrate common genomic data analysis tasks using malariagen_data.

Problem Statement

Currently, users may face the following challenges:

Understanding how to move from raw dataset access → analysis → interpretation

Lack of beginner-to-intermediate level reproducible notebooks

Limited example workflows for common population genomics tasks

Onboarding difficulty for users from computational or data science backgrounds who are new to genomic data

There is an opportunity to improve usability and accessibility without modifying core functionality.

Proposed Solution

Introduce a new “Reproducible Workflows” section in the documentation, including:

1️⃣ Beginner-Friendly Notebooks

Accessing MalariaGEN datasets

Filtering variants

Basic allele frequency computation

Visualization of genomic variation

Simple statistical summaries
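As a rough illustration of the scale a beginner notebook could start at, here is a minimal sketch of variant filtering and alternate-allele frequency computation. It uses only the Python standard library and a tiny hand-written mock genotype table. It does not call `malariagen_data`, and the names `MOCK_GENOTYPES`, `filter_variants`, and `alt_allele_frequency` are hypothetical, chosen for this example only.

```python
# Mock diploid genotype calls, laid out as variants x samples x ploidy.
# 0 = reference allele, 1 = alternate allele, -1 = missing call.
# This is illustrative data, not real MalariaGEN data.
MOCK_GENOTYPES = [
    [(0, 0), (0, 1), (1, 1), (0, 0)],
    [(0, 1), (-1, -1), (0, 0), (0, 0)],
    [(1, 1), (1, 1), (0, 1), (1, 1)],
    [(-1, -1), (-1, -1), (-1, -1), (-1, -1)],
]


def filter_variants(genotypes, min_call_rate=0.5):
    """Return indices of variants whose fraction of fully called samples
    is at least min_call_rate."""
    kept = []
    for index, calls in enumerate(genotypes):
        n_called = sum(1 for call in calls if all(a >= 0 for a in call))
        if n_called / len(calls) >= min_call_rate:
            kept.append(index)
    return kept


def alt_allele_frequency(variant_calls):
    """Alternate-allele frequency over non-missing alleles at one variant.
    Returns None when every allele is missing."""
    called = [a for call in variant_calls for a in call if a >= 0]
    if not called:
        return None
    return sum(1 for a in called if a == 1) / len(called)


kept = filter_variants(MOCK_GENOTYPES)
frequencies = {i: alt_allele_frequency(MOCK_GENOTYPES[i]) for i in kept}
for index, freq in frequencies.items():
    print(f"variant {index}: alt allele frequency = {freq:.3f}")
```

A real notebook would presumably replace the mock table with data loaded through the package and then apply the same filter-then-summarise pattern.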

2️⃣ Intermediate Analysis Pipelines

Population structure analysis

PCA-based clustering

Linkage disequilibrium exploration

Genotype-phenotype association examples (mock/demo scale)
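For the linkage disequilibrium item, a demo-scale notebook might begin with something like the sketch below. It estimates LD between two biallelic variants as the squared Pearson correlation of alternate-allele counts (0, 1, or 2 per sample), a common genotype-based approximation that does not require phased haplotypes. The function name `genotype_r_squared` and the mock dosage vectors are assumptions made for illustration and are not part of `malariagen_data`.

```python
def genotype_r_squared(dosage_a, dosage_b):
    """Squared Pearson correlation between alternate-allele counts at two
    variants. Samples with a missing value (None) at either variant are
    skipped. Returns None if fewer than two samples remain or if either
    variant shows no variation among the remaining samples."""
    pairs = [
        (a, b)
        for a, b in zip(dosage_a, dosage_b)
        if a is not None and b is not None
    ]
    n = len(pairs)
    if n < 2:
        return None
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    return (cov * cov) / (var_a * var_b)


# Mock alternate-allele counts per sample (illustrative only).
site_1 = [0, 1, 2, 0, 1, 2]
site_2 = [2, 1, 0, 2, 1, 0]  # mirror image of site_1
site_3 = [0, 0, 0, 0, 0, 0]  # no variation

r2_linked = genotype_r_squared(site_1, site_2)
r2_unlinked = genotype_r_squared([0, 0, 2, 2], [0, 2, 0, 2])
r2_undefined = genotype_r_squared(site_1, site_3)
```

Returning `None` for variants without variation is a design choice for this sketch. A workflow notebook could instead filter such variants out before computing LD.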

3️⃣ Reproducibility Enhancements

Provide environment.yml / requirements.txt

Example Docker container setup

Clear dataset download + caching instructions

Structured pipeline scripts (CLI-based execution optional)
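One possible shape for a CLI-based pipeline script is sketched below, with an explicit random seed so that repeated runs give identical output. Everything here is hypothetical: the `--seed` and `--n-samples` flags, the `build_parser` and `run_pipeline` names, and the simulated data stand in for whatever a real workflow would load and compute.

```python
import argparse
import random


def build_parser():
    """Command-line interface for a hypothetical demo workflow."""
    parser = argparse.ArgumentParser(description="Demo reproducible workflow")
    parser.add_argument("--seed", type=int, default=0,
                        help="random seed, recorded alongside the results")
    parser.add_argument("--n-samples", type=int, default=10,
                        help="number of mock samples to simulate (must be > 0)")
    return parser


def run_pipeline(args):
    """Simulate alternate-allele counts with a seeded generator and return
    the parameters together with the result, so a run can be repeated."""
    rng = random.Random(args.seed)  # local generator, no global state
    dosages = [rng.choice([0, 1, 2]) for _ in range(args.n_samples)]
    return {
        "seed": args.seed,
        "n_samples": args.n_samples,
        "alt_allele_frequency": sum(dosages) / (2 * args.n_samples),
    }


# Parse an explicit argument list here; a script would typically call
# parse_args() with no arguments to read from the command line.
args = build_parser().parse_args(["--seed", "42", "--n-samples", "20"])
first = run_pipeline(args)
second = run_pipeline(args)
print(first)
```

Storing the seed and other parameters in the output record means a result can be traced back to the exact invocation that produced it.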

4️⃣ Performance & Memory Usage Guide

Best practices for working with large genomic datasets

Chunked loading examples

Dask or parallel processing suggestions (if applicable)
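The chunked loading idea can be sketched without any particular backend. The example below streams mock rows of alternate-allele counts through a fixed-size chunk iterator and accumulates a running summary, so only one chunk is held in memory at a time. It uses only the standard library. It does not use Dask or `malariagen_data`, and `iter_chunks`, `mock_variant_stream`, and `mean_alt_count_chunked` are hypothetical names for this illustration.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of up to chunk_size items from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def mock_variant_stream(n_variants, n_samples):
    """Lazily generate deterministic mock alternate-allele counts, one row
    per variant, standing in for data read from disk or cloud storage."""
    for v in range(n_variants):
        yield [(v + s) % 3 for s in range(n_samples)]


def mean_alt_count_chunked(rows, chunk_size):
    """Mean alternate-allele count over all rows, computed chunk by chunk.
    Returns None for an empty input."""
    total = 0
    n_values = 0
    for chunk in iter_chunks(rows, chunk_size):
        total += sum(sum(row) for row in chunk)
        n_values += sum(len(row) for row in chunk)
    return total / n_values if n_values else None


result = mean_alt_count_chunked(mock_variant_stream(1000, 6), chunk_size=128)
chunk_sizes = [len(c) for c in iter_chunks(range(10), 4)]
```

The same accumulate-per-chunk pattern carries over to array libraries that expose lazy or blocked arrays. The chunk size then becomes the main knob for trading memory against overhead.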

Suggested Implementation Approach

Create an examples/ or workflows/ directory

Add Jupyter notebooks with narrative explanations

Include small mock datasets for demonstration

Integrate documentation links between API reference and workflow examples

Optional:

Add automated notebook testing (CI)

Add benchmark example for large-scale dataset usage

Impact

This enhancement would:

Improve accessibility for students and early researchers

Support reproducible research practices

Encourage interdisciplinary adoption

Reduce onboarding friction

Increase educational value of the project
