Add a fragment count column#407
Conversation
in QCTables ``general_fastq`` field output
|
What’s the latest on this PR? Should we close it? Otherwise, what’s blocking a merge? |
gregorjerse
left a comment
There was a problem hiding this comment.
Adversarial review — High/Medium findings (Low-severity items and nits omitted). Per-line comments below.
Two items not tied to a specific changed line:
- [Medium] No tests / no fixture. The new branching (
_is_paired_end, the per-mate_1/_2block) is unexercised — the diff touches notests/, and there is nomultiqc_general_statsfixture in the repo, so the central_1/_2naming assumption is unvalidated. A multi-lane paired-end fixture asserting the exact fragment count is exactly what would catch the naming issue flagged inline. - [Medium] CHANGELOG.
docs/CHANGELOG.rsthas noUnreleasedentry for the two new user-visible columns and the changed paired-end read-count semantics.
Credit: the _filter_and_rename_columns rewrite to a fresh result DataFrame is correct and now load-bearing — with two slugs (total_read_count_* and estimated_fragment_count_*) sharing one source column, the old in-place df[name] *= ... + rename_map approach would double-scale and collide column names. Please don't revert it to in-place mutation.
|
|
||
| def general_multiqc_parser(file_object, name, column_map): | ||
| def _is_paired_end(sample) -> Optional[bool]: | ||
| reads = list(sample.data.filter(type="data:reads:fastq:")) |
There was a problem hiding this comment.
[High] Blocking, synchronous network call inside the asyncio download loop. sample.data.filter(...) materialized with list() runs ResolweQuery._fetch → self.api.get(...) — a blocking slumber/requests GET. It executes inside general_multiqc_parser → _parse_file, which is called from the async def _download_file coroutine gathered by asyncio.gather (base.py), with no executor / to_thread. So it stalls the event loop and serializes the very downloads the chunked async path exists to parallelize, adding a round-trip per parsed file. It is also uncached (.filter() clones with an empty cache; no memoization on _is_paired_end).
Suggest resolving library strategy once per sample off the event loop (e.g. precompute a {sample_id: Optional[bool]} map) and passing the boolean into the parser.
| def general_multiqc_parser(file_object, name, column_map): | ||
| def _is_paired_end(sample) -> Optional[bool]: | ||
| reads = list(sample.data.filter(type="data:reads:fastq:")) | ||
| origin_read_obj = reads[-1] if reads else None |
There was a problem hiding this comment.
[Medium] reads[-1] is non-deterministic and origin_read_obj is misleading. No ordering= is set, so the "last" element depends on undefined server ordering, yet the variable is named origin_read_obj while taking the last item. The codebase idiom (shortcuts/sample.py get_reads) pins ordering='-id' and takes [0]. Combined with the strict endswith just below, which reads object is picked actually matters. Pin an explicit order (or tie selection to the reads upstream of the MultiQC Data) and rename to match what is selected.
There was a problem hiding this comment.
The ordering is determined by the backend, so this is not entirely applicable.
| def _is_paired_end(sample) -> Optional[bool]: | ||
| reads = list(sample.data.filter(type="data:reads:fastq:")) | ||
| origin_read_obj = reads[-1] if reads else None | ||
| if origin_read_obj is None: |
There was a problem hiding this comment.
[Medium] None (unknown strategy) is silently treated as single-end. The call site is if _is_paired_end(sample):, so a None return falls into the single-end branch, which sums across both mates — re-introducing exactly the 2× inflation this PR fixes, with no warning, for any genuinely paired-end sample whose reads object can't be resolved here. Handle the three states explicitly (e.g. warn or skip the adjustment on None), consistent with the UserWarning pattern already used in base.py.
| origin_read_obj = reads[-1] if reads else None | ||
| if origin_read_obj is None: | ||
| return None | ||
| if origin_read_obj._process._type.endswith("paired:"): |
There was a problem hiding this comment.
[High] Strict endswith('paired:') / ('single:') raises on processed reads, and the failure takes down unrelated tables. Upload types like data:reads:fastq:paired: match, but a trimmed object — e.g. data:reads:fastq:single:cutadapt: (this exact shape is in tests/unit/test_sample.py) — ends in cutadapt: and hits the else: raise ValueError. Since reads[-1] often resolves to the higher-id (trimmed) object, this is plausible rather than exotic. And because _is_paired_end runs for ~15 data types via general_multiqc_parser while asyncio.gather is called without return_exceptions=True, one such sample aborts parsing of picard / flagstat / bowtie2 / etc. for the whole chunk — coupling that did not exist before this PR.
Match the library token as a path segment instead of a strict suffix (e.g. 'paired' in type.split(':')), and make the unknown case non-fatal.
There was a problem hiding this comment.
This is an insightful comment, as the code was written with assumptions of the user running the General RNA-seq pipeline. Will modify to make it more general.
| ] | ||
| fragment_columns = df.columns.intersection(fragment_metrics) | ||
|
|
||
| if _is_paired_end(sample): |
There was a problem hiding this comment.
[Medium] _is_paired_end is called unconditionally, even when there are no fragment columns. fragment_columns is computed just above but not used to gate this call. For the ~14 non-FASTQ data types routed through this parser, fragment_columns is always empty, so the per-mate loops are a no-op — yet the blocking reads query (and the ValueError risk above) is still paid for every sample. A single guard removes the waste and confines the crash surface to the one table that needs it:
if not fragment_columns.empty:
paired = _is_paired_end(sample)
... # the whole paired/single block, only here| if _is_paired_end(sample): | ||
| for col in fragment_columns: | ||
| for s in ["_1", "_2"]: | ||
| mate_mask = df.index.str.endswith(s) & df[col].notna() |
There was a problem hiding this comment.
[High] The _1/_2 mate-detection assumption is unvalidated and fails silently for common naming; it also crashes on non-string indexes.
(1) endswith('_1') / ('_2') matches SRA-style sample_1 / sample_2 but not Illumina sample_R1_001 / sample_R2_001 or _R1 / _R2 — 'sample_R1'.endswith('_1') is False, and MQC_CLEANED_SFXS strips .fastq / .gz but not the _R1_001 tail. When neither matches, both masks are empty, continue fires, and _aggregate_df averages all 2L lane×mate rows → the reported value is sum / 2L instead of sum / 2, i.e. 1/L of correct (under-reporting that scales with lane count). A single-lane sample returns the right number by coincidence, so a naive fixture won't catch it. Please confirm what suffix the Genialis MultiQC config actually emits; safest is to derive mate identity from the fastq Data's R1/R2 files, or at minimum match _(?:R)?[12](?:_\d+)?$ and warn/raise when a paired sample matches neither mate instead of silently averaging.
(2) df.index.str.endswith(...) raises AttributeError: Can only use .str accessor with string values when sample names are purely numeric (read_csv(index_col=0).convert_dtypes() yields an integer index). The single-end branch isn't affected, so the failure is asymmetric. Use df.index.astype(str).str.endswith(s).
There was a problem hiding this comment.
Will test further
| mate_mask = df.index.str.endswith(s) & df[col].notna() | ||
| if not mate_mask.any(): | ||
| continue | ||
| total = df.loc[mate_mask, col].sum(min_count=1) |
There was a problem hiding this comment.
[Medium] "Mean of the two mate totals" is an estimator that degrades silently. When only one mate's FastQC row is present, the other mate's mask is empty and mean over [mate1_total, NA…] returns the single-mate count, not a pair count. For the trimmed column, mates are trimmed independently and can legitimately differ, so the mean is a fractional value that is neither mate's count nor a true pair count (and disagrees by design with total_read_count_*, which still sums everything). Consider defining fragment count as R1's lane-summed total (== pair count by construction, immune to mate-2 asymmetry), or document the estimator and warn when |m1 − m2| is large or a mate is missing.
in QCTables
general_fastqfield outputPreviously, raw and trimmed read count columns in QCTables reported the sum across all lanes and mates of FASTQ files per sample. This resulted in inflated number for samples sequenced using paired-end sequencing, since the value was a sum of read counts of both mates. The code now sums across lanes of each mate and then averages the number across both mates. The code assumes that both mates are properly paired. A different aggregation function, for example minimum, could also be used.