Skip to content

Add a fragment count column#407

Open
marcellevstek wants to merge 1 commit into
genialis:masterfrom
marcellevstek:add/fragment_count_reporting
Open

Add a fragment count column#407
marcellevstek wants to merge 1 commit into
genialis:masterfrom
marcellevstek:add/fragment_count_reporting

Conversation

@marcellevstek

Copy link
Copy Markdown
Contributor

in QCTables general_fastq field output

Previously, raw and trimmed read count columns in QCTables reported the sum across all lanes and mates of FASTQ files per sample. This resulted in inflated number for samples sequenced using paired-end sequencing, since the value was a sum of read counts of both mates. The code now sums across lanes of each mate and then averages the number across both mates. The code assumes that both mates are properly paired. A different aggregation function, for example minimum, could also be used.

in QCTables ``general_fastq`` field output
@mstajdohar

Copy link
Copy Markdown
Member

What’s the latest on this PR? Should we close it? Otherwise, what’s blocking a merge?

@gregorjerse gregorjerse left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial review — High/Medium findings (Low-severity items and nits omitted). Per-line comments below.

Two items not tied to a specific changed line:

  • [Medium] No tests / no fixture. The new branching (_is_paired_end, the per-mate _1/_2 block) is unexercised — the diff touches no tests/, and there is no multiqc_general_stats fixture in the repo, so the central _1/_2 naming assumption is unvalidated. A multi-lane paired-end fixture asserting the exact fragment count is exactly what would catch the naming issue flagged inline.
  • [Medium] CHANGELOG. docs/CHANGELOG.rst has no Unreleased entry for the two new user-visible columns and the changed paired-end read-count semantics.

Credit: the _filter_and_rename_columns rewrite to a fresh result DataFrame is correct and now load-bearing — with two slugs (total_read_count_* and estimated_fragment_count_*) sharing one source column, the old in-place df[name] *= ... + rename_map approach would double-scale and collide column names. Please don't revert it to in-place mutation.

Comment thread src/resdk/tables/qc.py

def general_multiqc_parser(file_object, name, column_map):
def _is_paired_end(sample) -> Optional[bool]:
reads = list(sample.data.filter(type="data:reads:fastq:"))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[High] Blocking, synchronous network call inside the asyncio download loop. sample.data.filter(...) materialized with list() runs ResolweQuery._fetch → self.api.get(...) — a blocking slumber/requests GET. It executes inside general_multiqc_parser → _parse_file, which is called from the async def _download_file coroutine gathered by asyncio.gather (base.py), with no executor / to_thread. So it stalls the event loop and serializes the very downloads the chunked async path exists to parallelize, adding a round-trip per parsed file. It is also uncached (.filter() clones with an empty cache; no memoization on _is_paired_end).

Suggest resolving library strategy once per sample off the event loop (e.g. precompute a {sample_id: Optional[bool]} map) and passing the boolean into the parser.

Comment thread src/resdk/tables/qc.py
def general_multiqc_parser(file_object, name, column_map):
def _is_paired_end(sample) -> Optional[bool]:
reads = list(sample.data.filter(type="data:reads:fastq:"))
origin_read_obj = reads[-1] if reads else None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] reads[-1] is non-deterministic and origin_read_obj is misleading. No ordering= is set, so the "last" element depends on undefined server ordering, yet the variable is named origin_read_obj while taking the last item. The codebase idiom (shortcuts/sample.py get_reads) pins ordering='-id' and takes [0]. Combined with the strict endswith just below, which reads object is picked actually matters. Pin an explicit order (or tie selection to the reads upstream of the MultiQC Data) and rename to match what is selected.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ordering is determined by the backend, so this is not entirely applicable.

Comment thread src/resdk/tables/qc.py
def _is_paired_end(sample) -> Optional[bool]:
reads = list(sample.data.filter(type="data:reads:fastq:"))
origin_read_obj = reads[-1] if reads else None
if origin_read_obj is None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] None (unknown strategy) is silently treated as single-end. The call site is if _is_paired_end(sample):, so a None return falls into the single-end branch, which sums across both mates — re-introducing exactly the 2× inflation this PR fixes, with no warning, for any genuinely paired-end sample whose reads object can't be resolved here. Handle the three states explicitly (e.g. warn or skip the adjustment on None), consistent with the UserWarning pattern already used in base.py.

Comment thread src/resdk/tables/qc.py
origin_read_obj = reads[-1] if reads else None
if origin_read_obj is None:
return None
if origin_read_obj._process._type.endswith("paired:"):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[High] Strict endswith('paired:') / ('single:') raises on processed reads, and the failure takes down unrelated tables. Upload types like data:reads:fastq:paired: match, but a trimmed object — e.g. data:reads:fastq:single:cutadapt: (this exact shape is in tests/unit/test_sample.py) — ends in cutadapt: and hits the else: raise ValueError. Since reads[-1] often resolves to the higher-id (trimmed) object, this is plausible rather than exotic. And because _is_paired_end runs for ~15 data types via general_multiqc_parser while asyncio.gather is called without return_exceptions=True, one such sample aborts parsing of picard / flagstat / bowtie2 / etc. for the whole chunk — coupling that did not exist before this PR.

Match the library token as a path segment instead of a strict suffix (e.g. 'paired' in type.split(':')), and make the unknown case non-fatal.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an insightful comment, as the code was written with assumptions of the user running the General RNA-seq pipeline. Will modify to make it more general.

Comment thread src/resdk/tables/qc.py
]
fragment_columns = df.columns.intersection(fragment_metrics)

if _is_paired_end(sample):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] _is_paired_end is called unconditionally, even when there are no fragment columns. fragment_columns is computed just above but not used to gate this call. For the ~14 non-FASTQ data types routed through this parser, fragment_columns is always empty, so the per-mate loops are a no-op — yet the blocking reads query (and the ValueError risk above) is still paid for every sample. A single guard removes the waste and confines the crash surface to the one table that needs it:

if not fragment_columns.empty:
    paired = _is_paired_end(sample)
    ...  # the whole paired/single block, only here

Comment thread src/resdk/tables/qc.py
if _is_paired_end(sample):
for col in fragment_columns:
for s in ["_1", "_2"]:
mate_mask = df.index.str.endswith(s) & df[col].notna()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[High] The _1/_2 mate-detection assumption is unvalidated and fails silently for common naming; it also crashes on non-string indexes.

(1) endswith('_1') / ('_2') matches SRA-style sample_1 / sample_2 but not Illumina sample_R1_001 / sample_R2_001 or _R1 / _R2'sample_R1'.endswith('_1') is False, and MQC_CLEANED_SFXS strips .fastq / .gz but not the _R1_001 tail. When neither matches, both masks are empty, continue fires, and _aggregate_df averages all 2L lane×mate rows → the reported value is sum / 2L instead of sum / 2, i.e. 1/L of correct (under-reporting that scales with lane count). A single-lane sample returns the right number by coincidence, so a naive fixture won't catch it. Please confirm what suffix the Genialis MultiQC config actually emits; safest is to derive mate identity from the fastq Data's R1/R2 files, or at minimum match _(?:R)?[12](?:_\d+)?$ and warn/raise when a paired sample matches neither mate instead of silently averaging.

(2) df.index.str.endswith(...) raises AttributeError: Can only use .str accessor with string values when sample names are purely numeric (read_csv(index_col=0).convert_dtypes() yields an integer index). The single-end branch isn't affected, so the failure is asymmetric. Use df.index.astype(str).str.endswith(s).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will test further

Comment thread src/resdk/tables/qc.py
mate_mask = df.index.str.endswith(s) & df[col].notna()
if not mate_mask.any():
continue
total = df.loc[mate_mask, col].sum(min_count=1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] "Mean of the two mate totals" is an estimator that degrades silently. When only one mate's FastQC row is present, the other mate's mask is empty and mean over [mate1_total, NA…] returns the single-mate count, not a pair count. For the trimmed column, mates are trimmed independently and can legitimately differ, so the mean is a fractional value that is neither mate's count nor a true pair count (and disagrees by design with total_read_count_*, which still sums everything). Consider defining fragment count as R1's lane-summed total (== pair count by construction, immune to mate-2 asymmetry), or document the estimator and warn when |m1 − m2| is large or a mate is missing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants