Replies: 2 comments
Hi @muien5080,
Yes, we have access to several thousand FASTQ files for which the taxon is available, and these can be used to benchmark any classifier.
The classifier would be used upstream of the API, i.e., during data generation. During the project, the API will most likely be used solely to access the metadata. I hope these answers help.
Thank you for the clarification. I’ll begin exploring the FASTQ dataset structure and design a small benchmarking prototype to understand:

- Class balance across taxa
- Read length variability
- Feasible feature extraction strategies (e.g., k-mer frequency vs. alignment-based)

I’ll first implement a lightweight baseline classifier on a subset to establish reference performance before proposing a full pipeline.

If there are any preferred preprocessing constraints (e.g., quality trimming, subsampling, paired-end handling), please let me know so I can align with existing practices.
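A first pass over class balance and read-length variability could look like the sketch below. It assumes plain, uncompressed four-line FASTQ records and a simple list of per-sample taxon labels. The function names and the placeholder labels (`taxon_A`, `taxon_B`) are hypothetical and are not part of malariagen_data.

```python
import io
import statistics
from collections import Counter
from itertools import islice


def read_lengths(handle):
    """Return the length of every read in a four-line-per-record FASTQ stream."""
    lengths = []
    while True:
        record = list(islice(handle, 4))  # header, sequence, '+', qualities
        if len(record) < 4:
            break
        lengths.append(len(record[1].strip()))
    return lengths


def length_summary(lengths):
    """Summarise read-length variability for one FASTQ file."""
    if not lengths:
        return {"n_reads": 0}
    return {
        "n_reads": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


def class_balance(taxon_labels):
    """Return the fraction of samples assigned to each taxon."""
    counts = Counter(taxon_labels)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Toy inputs for illustration only (not real data).
demo_fastq = io.StringIO("@r1\nACGTA\n+\nIIIII\n@r2\nACG\n+\nIII\n")
demo_labels = ["taxon_A", "taxon_A", "taxon_B", "taxon_A"]

summary = length_summary(read_lengths(demo_fastq))
balance = class_balance(demo_labels)
```

On real data, the same summaries would be computed per sample and then aggregated per taxon, which also shows whether subsampling is needed to keep the benchmark balanced.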
Hi @jonbrenas and @tristanpwdennis,
I’ve set up malariagen-data-python locally and am currently running through the tests and example workflows to understand how the API interacts with cloud-hosted genomic datasets.
I’m interested in developing a proposal around the FASTQ-based taxon classifier project. My current thinking is to prototype a lightweight feature extraction pipeline (e.g., streaming k-mer frequency encoding from FASTQ reads) that could operate without full genotyping, followed by a simple baseline classifier to evaluate feasibility and memory footprint.
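A minimal sketch of the streaming k-mer idea, assuming plain, uncompressed four-line FASTQ records. The function names and the choice of `k` are illustrative assumptions, not an existing malariagen_data API. Reads are consumed one record at a time, so memory is bounded by the number of distinct k-mers (at most 4^k) and does not grow with file size.

```python
import io
from collections import Counter
from itertools import islice

VALID_BASES = frozenset("ACGT")


def iter_fastq_sequences(handle):
    """Yield the sequence line of each four-line FASTQ record, one at a time."""
    while True:
        record = list(islice(handle, 4))  # header, sequence, '+', qualities
        if len(record) < 4:
            return
        yield record[1].strip().upper()


def kmer_counts(handle, k=4):
    """Count k-mers across all reads without loading the file into memory."""
    counts = Counter()
    for seq in iter_fastq_sequences(handle):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if VALID_BASES.issuperset(kmer):
                counts[kmer] += 1
    return counts


def kmer_frequencies(counts):
    """Normalise counts to frequencies so samples of different depth are comparable."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Toy input for illustration only (not real data).
demo = io.StringIO("@r1\nACGTA\n+\nIIIII\n@r2\nACNT\n+\nIIII\n")
counts = kmer_counts(demo, k=2)
freqs = kmer_frequencies(counts)
```

The resulting frequency vector (one entry per k-mer) could then be fed to any simple baseline classifier. Compressed input would only need the file opened with `gzip.open(path, "rt")` in place of a plain text handle.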
Before drafting a more detailed outline, I’d like to clarify:
Are there preferred existing reference panels or datasets within the current ecosystem that would be appropriate for benchmarking such a classifier?
Is the intended integration point within malariagen_data as an auxiliary utility module, or as part of a higher-level workflow guiding resource selection?
I plan to submit a small PR shortly after finishing local exploration to contribute to the existing codebase before moving further with the proposal.
Thanks for your time.