Exploring GSoC FASTQ-based taxon classifier – initial setup and questions #905

muien5080 · 2026-02-22T11:51:56Z

I’ve set up malariagen-data-python locally and am currently running through the tests and example workflows to understand how the API interacts with cloud-hosted genomic datasets.

I’m interested in developing a proposal around the FASTQ-based taxon classifier project. My current thinking is to prototype a lightweight feature extraction pipeline (e.g., streaming k-mer frequency encoding from FASTQ reads) that could operate without full genotyping, followed by a simple baseline classifier to evaluate feasibility and memory footprint.
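To make the streaming k-mer idea concrete, here is a minimal sketch of what I have in mind. It uses only the standard library and assumes plain four-line FASTQ records (no wrapped sequence lines). The function names `iter_fastq_sequences` and `kmer_frequency_vector` are hypothetical, not part of malariagen_data, and the two reads are made-up toy input.

```python
import io
from collections import Counter
from itertools import islice, product


def iter_fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record, one at a time."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            break  # end of file (or a truncated final record)
        yield record[1].strip().upper()


def kmer_frequency_vector(sequences, k=3):
    """Return normalised k-mer frequencies in a fixed ACGT-lexicographic order."""
    valid = set("ACGT")
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= valid:  # skip k-mers containing N or other codes
                counts[kmer] += 1
    total = sum(counts.values())
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[m] / total if total else 0.0 for m in vocab]


# Toy input: two reads, the second containing an ambiguous base.
example = io.StringIO("@r1\nACGTAC\n+\nIIIIII\n@r2\nACGNAC\n+\nIIIIII\n")
vector = kmer_frequency_vector(iter_fastq_sequences(example), k=2)
```

Because reads are consumed one at a time and the counter holds at most 4^k entries, memory use depends on k rather than on file size, which is the property I would want to measure first.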

Before drafting a more detailed outline, I’d like to clarify:

Are there preferred existing reference panels or datasets within the current ecosystem that would be appropriate for benchmarking such a classifier?

Is the intended integration point within malariagen_data as an auxiliary utility module, or as part of a higher-level workflow guiding resource selection?

I plan to submit a small PR shortly after finishing local exploration to contribute to the existing codebase before moving further with the proposal.

Thanks for your time.

jonbrenas (Maintainer) · 2026-02-23T12:38:49Z

Hi @muien5080,

> Are there preferred existing reference panels or datasets within the current ecosystem that would be appropriate for benchmarking such a classifier?

Yes, we have access to several thousand FASTQ files with known taxon labels, which can be used to benchmark any classifier.

> Is the intended integration point within malariagen_data as an auxiliary utility module, or as part of a higher-level workflow guiding resource selection?

The classifier would be used upstream of the API, i.e., during data generation. The API will most likely be used during the project solely to access the metadata.

I hope these answers help.

muien5080 (Author) · 2026-02-23T12:57:51Z

Thank you for the clarification.

I’ll begin exploring the FASTQ dataset structure and design a small benchmarking prototype to understand:

Class balance across taxa

Read length variability

Feasible feature extraction strategies (e.g., k-mer frequency vs alignment-based)
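For the first two points, a rough sketch of the summaries I would compute is below. It assumes the taxon labels arrive as a simple list (for example from the sample metadata) and that reads are available as strings. The helper names and the `taxon_A`/`taxon_B` labels are placeholders, not real data.

```python
from collections import Counter
from statistics import mean, pstdev


def class_balance(labels):
    """Return the fraction of samples carrying each taxon label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def read_length_summary(sequences):
    """Summarise read lengths: count, range, mean and population std dev."""
    lengths = [len(s) for s in sequences]
    return {
        "n": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "sd": pstdev(lengths),
    }


# Placeholder labels and reads, only to show the shape of the output.
balance = class_balance(["taxon_A", "taxon_A", "taxon_A", "taxon_B"])
summary = read_length_summary(["ACGT", "ACGTAC", "AC"])
```

A strongly skewed `balance` would argue for stratified subsampling and per-class metrics rather than plain accuracy.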

I’ll first implement a lightweight baseline classifier on a subset to establish reference performance before proposing a full pipeline.
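One candidate for that baseline is a nearest-centroid classifier over k-mer frequency vectors. This is my own choice for illustration, not something prescribed by the project, and the sketch below uses invented two-dimensional vectors and placeholder labels purely to show the mechanics.

```python
import math
from collections import defaultdict


def fit_centroids(vectors, labels):
    """Average the feature vectors of each label into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Invented toy features: two samples per placeholder taxon.
train_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["taxon_A", "taxon_A", "taxon_B", "taxon_B"]
centroids = fit_centroids(train_vectors, train_labels)
prediction = predict(centroids, [0.7, 0.3])
```

The model stores one vector per taxon, so its memory footprint is trivial, which makes it a convenient reference point before trying anything heavier.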

If there are any preferred preprocessing constraints (e.g., quality trimming, subsampling, paired-end handling), please let me know so I can align with existing practices.
