
Add mean and median sequence length to Basic Statistics (#203) #204

Closed
ewels wants to merge 1 commit into s-andrews:master from ewels:issue-203-mean-median-length

Conversation

@ewels

@ewels ewels commented May 29, 2026

Contributor

Closes #203.

The Basic Statistics module currently reports only the range of read lengths (min-max). This adds two rows so users get accurate mean and median read lengths straight from the source, rather than having downstream tools like MultiQC estimate them from the binned length distribution:

Sequence length          16
Mean sequence length     16.00
Median sequence length   16
%GC                      50

Implementation

  • Mean = total bases / total sequences, formatted to 2 decimal places (Locale.ROOT, so the decimal separator is locale-independent).
  • Median = derived from a per-length histogram maintained over non-filtered sequences (the same set used for the existing min/max range). If the number of reads is even, the two central values are averaged and the result is rounded up to a whole number.
  • Both values are added to the ResultsTable model, so they appear consistently in the interactive results panel, the HTML report, and fastqc_data.txt.
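
For reviewers who want the arithmetic spelled out, here is a minimal sketch of the two calculations described above. It is not the code in this PR: the class and method names (`LengthStatsSketch`, `formatMean`, `median`) are hypothetical, and the behaviour for an empty input is an assumption.

```java
import java.util.Locale;

// Illustrative sketch only; names are hypothetical and not taken from the FastQC source.
final class LengthStatsSketch {

    /** Mean read length to 2 d.p.; Locale.ROOT keeps '.' as the decimal separator. */
    static String formatMean(long totalBases, long totalSequences) {
        if (totalSequences == 0) {
            return "0.00"; // assumption: how an empty file is reported is not stated in the PR
        }
        return String.format(Locale.ROOT, "%.2f", (double) totalBases / totalSequences);
    }

    /** Median read length, where counts[len] is the number of reads of length len. */
    static long median(long[] counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0; // assumption: no reads means a median of 0
        }

        // 1-based ranks of the two central reads; they coincide when total is odd.
        long lowerRank = (total + 1) / 2;
        long upperRank = total / 2 + 1;

        long lower = -1;
        long upper = -1;
        long seen = 0;
        for (int len = 0; len < counts.length; len++) {
            seen += counts[len];
            if (lower < 0 && seen >= lowerRank) {
                lower = len;
            }
            if (seen >= upperRank) {
                upper = len;
                break;
            }
        }
        // Average of the two central values, with a .5 result rounded up.
        return (lower + upper + 1) / 2;
    }
}
```

With two reads of lengths 10 and 13, the central values average to 11.5, so this sketch reports a median of 12.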

The histogram grows on demand using the same idiom as SequenceLengthDistribution.
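
As a rough illustration of what "grows on demand" means here, the sketch below resizes a count array whenever a read is longer than anything seen so far. It is a generic version of the idea, not the code from SequenceLengthDistribution or from this PR, and the names (`GrowingHistogramSketch`, `add`, `countAt`, `capacity`) are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical and not taken from the FastQC source.
final class GrowingHistogramSketch {

    // counts[len] holds the number of reads of length len.
    private long[] counts = new long[0];

    /** Records one read, enlarging the array if the length does not fit yet. */
    void add(int length) {
        if (length >= counts.length) {
            // copyOf keeps existing counts and zero-fills the new slots
            counts = Arrays.copyOf(counts, length + 1);
        }
        counts[length]++;
    }

    /** Number of reads seen with exactly this length (0 if beyond the array). */
    long countAt(int length) {
        return length < counts.length ? counts[length] : 0;
    }

    /** Current array size, i.e. the longest length seen so far plus one. */
    int capacity() {
        return counts.length;
    }
}
```

The array never needs a maximum read length up front, at the cost of one index per possible length up to the longest read.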

Testing

  • Built with ant and ran the compiled FastQC on the minimal and complex test files; confirmed the new rows appear correctly in both fastqc_data.txt and the HTML report.
  • Updated the FileContentsTest approved files (data + HTML, for minimal and complex). The HTML snapshots add only the two new table rows; the embedded chart images are unchanged.

🤖 Generated with Claude Code

Implements s-andrews#203. The Basic Statistics module currently reports only the
range of read lengths; this adds "Mean sequence length" and "Median
sequence length" rows so users (and downstream tools such as MultiQC) get
accurate values straight from the source rather than estimating them.

The mean is total bases / total sequences (2 d.p.); the median is derived
from a per-length histogram over non-filtered sequences (for an even count,
the two central values averaged and rounded up). Both are added to the
ResultsTable model, so they appear in the interactive results panel, the
HTML report, and fastqc_data.txt together.

Integration test approved files updated accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ewels

ewels commented May 29, 2026

Contributor Author

Hah, you beat me to it in 54b336e 😆

Comparing implementations now..

@ewels

ewels commented May 29, 2026

Contributor Author

Closing to build on what you pushed already instead. See #205

@ewels ewels closed this May 29, 2026
@ewels ewels deleted the issue-203-mean-median-length branch June 26, 2026 08:22

Development

Successfully merging this pull request may close these issues.

Mean + median read length

1 participant