Add a fragment count column by marcellevstek · Pull Request #407 · genialis/resdk

marcellevstek · 2026-05-18T11:44:42Z

in QCTables general_fastq field output

Previously, raw and trimmed read count columns in QCTables reported the sum across all lanes and mates of FASTQ files per sample. This resulted in inflated number for samples sequenced using paired-end sequencing, since the value was a sum of read counts of both mates. The code now sums across lanes of each mate and then averages the number across both mates. The code assumes that both mates are properly paired. A different aggregation function, for example minimum, could also be used.

in QCTables ``general_fastq`` field output

mstajdohar · 2026-06-18T08:57:34Z

What’s the latest on this PR? Should we close it? Otherwise, what’s blocking a merge?

gregorjerse

Adversarial review — High/Medium findings (Low-severity items and nits omitted). Per-line comments below.

Two items not tied to a specific changed line:

[Medium] No tests / no fixture. The new branching (_is_paired_end, the per-mate _1/_2 block) is unexercised — the diff touches no tests/, and there is no multiqc_general_stats fixture in the repo, so the central _1/_2 naming assumption is unvalidated. A multi-lane paired-end fixture asserting the exact fragment count is exactly what would catch the naming issue flagged inline.
[Medium] CHANGELOG. docs/CHANGELOG.rst has no Unreleased entry for the two new user-visible columns and the changed paired-end read-count semantics.

Credit: the _filter_and_rename_columns rewrite to a fresh result DataFrame is correct and now load-bearing — with two slugs (total_read_count_* and estimated_fragment_count_*) sharing one source column, the old in-place df[name] *= ... + rename_map approach would double-scale and collide column names. Please don't revert it to in-place mutation.

gregorjerse · 2026-06-19T10:01:21Z


-def general_multiqc_parser(file_object, name, column_map):
+def _is_paired_end(sample) -> Optional[bool]:
+    reads = list(sample.data.filter(type="data:reads:fastq:"))


[High] Blocking, synchronous network call inside the asyncio download loop. sample.data.filter(...) materialized with list() runs ResolweQuery._fetch → self.api.get(...) — a blocking slumber/requests GET. It executes inside general_multiqc_parser → _parse_file, which is called from the async def _download_file coroutine gathered by asyncio.gather (base.py), with no executor / to_thread. So it stalls the event loop and serializes the very downloads the chunked async path exists to parallelize, adding a round-trip per parsed file. It is also uncached (.filter() clones with an empty cache; no memoization on _is_paired_end).

Suggest resolving library strategy once per sample off the event loop (e.g. precompute a {sample_id: Optional[bool]} map) and passing the boolean into the parser.

gregorjerse · 2026-06-19T10:01:21Z

-def general_multiqc_parser(file_object, name, column_map):
+def _is_paired_end(sample) -> Optional[bool]:
+    reads = list(sample.data.filter(type="data:reads:fastq:"))
+    origin_read_obj = reads[-1] if reads else None


[Medium] reads[-1] is non-deterministic and origin_read_obj is misleading. No ordering= is set, so the "last" element depends on undefined server ordering, yet the variable is named origin_read_obj while taking the last item. The codebase idiom (shortcuts/sample.py get_reads) pins ordering='-id' and takes [0]. Combined with the strict endswith just below, which reads object is picked actually matters. Pin an explicit order (or tie selection to the reads upstream of the MultiQC Data) and rename to match what is selected.

The ordering is determined by the backend, so this is not entirely applicable.

gregorjerse · 2026-06-19T10:01:21Z

+def _is_paired_end(sample) -> Optional[bool]:
+    reads = list(sample.data.filter(type="data:reads:fastq:"))
+    origin_read_obj = reads[-1] if reads else None
+    if origin_read_obj is None:


[Medium] None (unknown strategy) is silently treated as single-end. The call site is if _is_paired_end(sample):, so a None return falls into the single-end branch, which sums across both mates — re-introducing exactly the 2× inflation this PR fixes, with no warning, for any genuinely paired-end sample whose reads object can't be resolved here. Handle the three states explicitly (e.g. warn or skip the adjustment on None), consistent with the UserWarning pattern already used in base.py.

gregorjerse · 2026-06-19T10:01:21Z

+    origin_read_obj = reads[-1] if reads else None
+    if origin_read_obj is None:
+        return None
+    if origin_read_obj._process._type.endswith("paired:"):


[High] Strict endswith('paired:') / ('single:') raises on processed reads, and the failure takes down unrelated tables. Upload types like data:reads:fastq:paired: match, but a trimmed object — e.g. data:reads:fastq:single:cutadapt: (this exact shape is in tests/unit/test_sample.py) — ends in cutadapt: and hits the else: raise ValueError. Since reads[-1] often resolves to the higher-id (trimmed) object, this is plausible rather than exotic. And because _is_paired_end runs for ~15 data types via general_multiqc_parser while asyncio.gather is called without return_exceptions=True, one such sample aborts parsing of picard / flagstat / bowtie2 / etc. for the whole chunk — coupling that did not exist before this PR.

Match the library token as a path segment instead of a strict suffix (e.g. 'paired' in type.split(':')), and make the unknown case non-fatal.

This is an insightful comment, as the code was written with assumptions of the user running the General RNA-seq pipeline. Will modify to make it more general.

gregorjerse · 2026-06-19T10:01:21Z

+    ]
+    fragment_columns = df.columns.intersection(fragment_metrics)
+
+    if _is_paired_end(sample):


[Medium] _is_paired_end is called unconditionally, even when there are no fragment columns. fragment_columns is computed just above but not used to gate this call. For the ~14 non-FASTQ data types routed through this parser, fragment_columns is always empty, so the per-mate loops are a no-op — yet the blocking reads query (and the ValueError risk above) is still paid for every sample. A single guard removes the waste and confines the crash surface to the one table that needs it:

if not fragment_columns.empty: paired = _is_paired_end(sample) ... # the whole paired/single block, only here

gregorjerse · 2026-06-19T10:01:21Z

+    if _is_paired_end(sample):
+        for col in fragment_columns:
+            for s in ["_1", "_2"]:
+                mate_mask = df.index.str.endswith(s) & df[col].notna()


[High] The _1/_2 mate-detection assumption is unvalidated and fails silently for common naming; it also crashes on non-string indexes.

(1) endswith('_1') / ('_2') matches SRA-style sample_1 / sample_2 but not Illumina sample_R1_001 / sample_R2_001 or _R1 / _R2 — 'sample_R1'.endswith('_1') is False, and MQC_CLEANED_SFXS strips .fastq / .gz but not the _R1_001 tail. When neither matches, both masks are empty, continue fires, and _aggregate_df averages all 2L lane×mate rows → the reported value is sum / 2L instead of sum / 2, i.e. 1/L of correct (under-reporting that scales with lane count). A single-lane sample returns the right number by coincidence, so a naive fixture won't catch it. Please confirm what suffix the Genialis MultiQC config actually emits; safest is to derive mate identity from the fastq Data's R1/R2 files, or at minimum match _(?:R)?[12](?:_\d+)?$ and warn/raise when a paired sample matches neither mate instead of silently averaging.

(2) df.index.str.endswith(...) raises AttributeError: Can only use .str accessor with string values when sample names are purely numeric (read_csv(index_col=0).convert_dtypes() yields an integer index). The single-end branch isn't affected, so the failure is asymmetric. Use df.index.astype(str).str.endswith(s).

Will test further

gregorjerse · 2026-06-19T10:01:21Z

+                mate_mask = df.index.str.endswith(s) & df[col].notna()
+                if not mate_mask.any():
+                    continue
+                total = df.loc[mate_mask, col].sum(min_count=1)


[Medium] "Mean of the two mate totals" is an estimator that degrades silently. When only one mate's FastQC row is present, the other mate's mask is empty and mean over [mate1_total, NA…] returns the single-mate count, not a pair count. For the trimmed column, mates are trimmed independently and can legitimately differ, so the mean is a fractional value that is neither mate's count nor a true pair count (and disagrees by design with total_read_count_*, which still sums everything). Consider defining fragment count as R1's lane-summed total (== pair count by construction, immune to mate-2 asymmetry), or document the estimator and warn when |m1 − m2| is large or a mate is missing.

Add a fragment count column

3ba6d7c

in QCTables ``general_fastq`` field output

gregorjerse reviewed Jun 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a fragment count column#407

Add a fragment count column#407
marcellevstek wants to merge 1 commit into
genialis:masterfrom
marcellevstek:add/fragment_count_reporting

marcellevstek commented May 18, 2026

Uh oh!

mstajdohar commented Jun 18, 2026

Uh oh!

gregorjerse left a comment

Uh oh!

gregorjerse Jun 19, 2026

Uh oh!

gregorjerse Jun 19, 2026

Uh oh!

marcellevstek Jun 23, 2026

Uh oh!

gregorjerse Jun 19, 2026

Uh oh!

gregorjerse Jun 19, 2026

Uh oh!

marcellevstek Jun 23, 2026

Uh oh!

gregorjerse Jun 19, 2026

Uh oh!

gregorjerse Jun 19, 2026

Uh oh!

marcellevstek Jun 23, 2026

Uh oh!

gregorjerse Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

marcellevstek commented May 18, 2026

Uh oh!

mstajdohar commented Jun 18, 2026

Uh oh!

gregorjerse left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants