Skip to content

DI-2978 eggd_pytmb v1.0.0 release#4

Open
Aisha-D wants to merge 21 commits into
mainfrom
DI-2978_eggd_pyTMB_v1.0.0_release
Open

DI-2978 eggd_pytmb v1.0.0 release#4
Aisha-D wants to merge 21 commits into
mainfrom
DI-2978_eggd_pyTMB_v1.0.0_release

Conversation

@Aisha-D

@Aisha-D Aisha-D commented Jun 19, 2026

Copy link
Copy Markdown

This change is Reviewable

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a complete Tumour Mutational Burden (TMB) calculation application accepting VEP-annotated VCF, BAM files, and optional BED/GTF inputs with configurable filtering parameters (VAF, MAF, depth thresholds).
  • Documentation

    • Updated documentation with usage examples, input/output specifications, and typical workflow instructions.
  • Chores

    • Configured asset dependencies and environment settings for application deployment.

@Aisha-D Aisha-D requested a review from RSWilson1 June 19, 2026 10:16
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

Transforms a DNAnexus app template into a fully implemented eggd_pyTMB app. dxapp.json is rewritten with pyTMB-specific inputs/outputs and pinned asset dependencies; src/code.sh becomes a complete pipeline driver running mosdepth-backed effective genome size computation and TMB calculation; pyTMB YAML configs, conda environment, ignore files, and README are all added.

Changes

eggd_pyTMB App Implementation

Layer / File(s) Summary
App manifest and asset declarations
dxapp.json, assets/mosdepth/dxasset.json, assets/pytmb/dxasset.json
dxapp.json is rewritten with pyTMB-specific inputSpec/outputSpec, mosdepth and pytmb pinned asset dependencies, and instance type upgraded to mem1_ssd1_v2_x4; dxasset.json manifests added for both assets.
pyTMB variant caller and VEP configs
resources/home/dnanexus/pytmb/config/tnhaplotyper.yml, resources/home/dnanexus/pytmb/config/vep.yml
tnhaplotyper.yml defines freq/depth/altDepth field mappings; vep.yml configures CSQ tag parsing, consequence classification, splicing terms, and gnomad population database subfield indices.
Main Bash pipeline driver
src/code.sh
Rewritten to download inputs, derive sample name, relocate BAM index, run pytmb.cli.run_effgenomesize extracting EFF_GENOME_SIZE, run pytmb.cli.run_tmb with computed size and config paths, validate the output report, and upload via dx-upload-all-outputs.
Conda environment, ignore files, and README
resources/home/dnanexus/environment.yml, .dxignore, resources/home/dnanexus/.dxignore, .gitignore, README.md
environment.yml pins Python, cyvcf2, pandas, mosdepth, and other runtime/test dependencies; ignore files exclude build artefacts and bytecode; README fully rewritten with inputs, outputs, and a dx run example.

Sequence Diagram

sequenceDiagram
  participant User
  participant code.sh
  participant run_effgenomesize as pytmb.cli.run_effgenomesize
  participant run_tmb as pytmb.cli.run_tmb
  participant DNAnexus as DNAnexus Platform

  User->>DNAnexus: dx run eggd_pyTMB (VCF, BAM, BAI, BED, filters)
  DNAnexus->>code.sh: execute with downloaded inputs
  code.sh->>code.sh: derive SAMPLE from vcf_prefix, mv bam_bai to ${bam_path}.bai
  code.sh->>run_effgenomesize: BAM + BED → compute effective genome size
  run_effgenomesize-->>code.sh: stdout/stderr containing EFF_GENOME_SIZE
  code.sh->>code.sh: grep/awk extract + validate EFF_GENOME_SIZE
  code.sh->>run_tmb: VCF + configs + EFF_GENOME_SIZE + vaf/maf/depth filters
  run_tmb-->>code.sh: ${SAMPLE}_TMB.txt report
  code.sh->>code.sh: validate report non-empty
  code.sh->>DNAnexus: dx-upload-all-outputs (tmb_report)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • eastgenomics/eggd_pyTMB#1: Introduces the same core src/code.sh linear pipeline implementing run_effgenomesizerun_tmb with BAM index handling.
  • eastgenomics/eggd_pyTMB#3: Modifies dxapp.json with the same bed/gtf input adjustments and instance type upgrade to mem1_ssd1_v2_x4.
  • eastgenomics/eggd_pyTMB#2: Updates README.md to the same eggd_pyTMB-specific documentation with matching inputs/outputs and dx run example.

Suggested reviewers

  • RSWilson1
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the primary objective: a v1.0.0 release of eggd_pyTMB, which aligns with the substantial changes across configuration files, documentation, and implementation scripts.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch DI-2978_eggd_pyTMB_v1.0.0_release

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
README.md (1)

43-53: ⚡ Quick win

Add language specifier to code fence.

The fenced code block lacks a language identifier. Add bash to improve syntax highlighting and conformance with markdown best practices.

Proposed fix
-```
+```bash
 dx run eggd_pyTMB \
   -ivcf="sample.vep_annotated.vcf.gz" \
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 43 - 53, The code fence for the dx run command
example is missing a language specifier, which prevents proper syntax
highlighting and violates markdown best practices. Add the bash language
identifier to the opening triple backticks (change ``` to ```bash) on the line
immediately before the dx run eggd_pyTMB command to enable syntax highlighting
and conform to markdown standards.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/code.sh`:
- Around line 37-47: The EFF_GENOME_SIZE extraction pipeline is vulnerable when
pipefail is enabled because grep exits with code 1 when no match is found,
causing premature script termination before validation can occur. Additionally,
the empty check on line 39 does not validate whether the parsed value is
numeric, allowing non-numeric strings to reach the arithmetic comparison on line
44. Fix this by either: allowing grep to return empty output without triggering
pipefail (such as using grep with a fallback pattern or piping through a
non-failing command), and then enhance the validation logic to check both that
EFF_GENOME_SIZE is non-empty AND matches a numeric pattern before performing the
arithmetic comparison with -eq. This ensures the script handles missing matches,
empty results, and non-numeric values gracefully.
- Around line 17-21: The error checking happens after the move operation, but
with set -e enabled, a failing mv command will exit the script immediately,
bypassing the custom error message. Add validation checks BEFORE the mv command
instead of after it. Verify that the source file bam_bai_path exists and that
the destination BAI_PATH conditions are valid BEFORE attempting the move
operation. This ensures proper guards are in place to prevent the mv command
from failing with a hard exit, allowing your custom error handling to work as
intended.

---

Nitpick comments:
In `@README.md`:
- Around line 43-53: The code fence for the dx run command example is missing a
language specifier, which prevents proper syntax highlighting and violates
markdown best practices. Add the bash language identifier to the opening triple
backticks (change ``` to ```bash) on the line immediately before the dx run
eggd_pyTMB command to enable syntax highlighting and conform to markdown
standards.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c9843ae1-be11-4bb4-8e46-3cf31d6902bb

📥 Commits

Reviewing files that changed from the base of the PR and between 773f61a and 9bc5acd.

📒 Files selected for processing (12)
  • .dxignore
  • .gitignore
  • README.md
  • assets/mosdepth/dxasset.json
  • assets/pytmb/dxasset.json
  • dxapp.json
  • resources/home/dnanexus/.dxignore
  • resources/home/dnanexus/environment.yml
  • resources/home/dnanexus/python_script.py
  • resources/home/dnanexus/pytmb/config/tnhaplotyper.yml
  • resources/home/dnanexus/pytmb/config/vep.yml
  • src/code.sh

Comment thread src/code.sh
Comment on lines +17 to +21
mv "${bam_bai_path}" "${BAI_PATH}"
if [[ ! -f "${BAI_PATH}" ]]; then
echo "Error: BAM index not found at ${BAI_PATH}" >&2
exit 1
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if src/code.sh exists and read the relevant section
if [[ -f src/code.sh ]]; then
    echo "=== Full file content ===" 
    wc -l src/code.sh
    echo "=== Lines 1-30 (context around lines 17-21) ==="
    head -30 src/code.sh | cat -n
else
    echo "File not found: src/code.sh"
    find . -name "code.sh" -type f
fi

Repository: eastgenomics/eggd_pyTMB

Length of output: 1278


Guard index placement before moving the file.

With set -e enabled, a failing mv exits the script immediately, bypassing the custom error message. The same-file move case also triggers a hard failure.

Suggested patch
 BAI_PATH="${bam_path}.bai"
-mv "${bam_bai_path}" "${BAI_PATH}"
+if [[ ! -f "${bam_bai_path}" ]]; then
+    echo "Error: BAM index input not found at ${bam_bai_path}" >&2
+    exit 1
+fi
+if [[ "${bam_bai_path}" != "${BAI_PATH}" ]]; then
+    cp -f "${bam_bai_path}" "${BAI_PATH}"
+fi
 if [[ ! -f "${BAI_PATH}" ]]; then
     echo "Error: BAM index not found at ${BAI_PATH}" >&2
     exit 1
 fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
mv "${bam_bai_path}" "${BAI_PATH}"
if [[ ! -f "${BAI_PATH}" ]]; then
echo "Error: BAM index not found at ${BAI_PATH}" >&2
exit 1
fi
if [[ ! -f "${bam_bai_path}" ]]; then
echo "Error: BAM index input not found at ${bam_bai_path}" >&2
exit 1
fi
if [[ "${bam_bai_path}" != "${BAI_PATH}" ]]; then
cp -f "${bam_bai_path}" "${BAI_PATH}"
fi
if [[ ! -f "${BAI_PATH}" ]]; then
echo "Error: BAM index not found at ${BAI_PATH}" >&2
exit 1
fi
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 17-17: bam_bai_path is referenced but not assigned.

(SC2154)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/code.sh` around lines 17 - 21, The error checking happens after the move
operation, but with set -e enabled, a failing mv command will exit the script
immediately, bypassing the custom error message. Add validation checks BEFORE
the mv command instead of after it. Verify that the source file bam_bai_path
exists and that the destination BAI_PATH conditions are valid BEFORE attempting
the move operation. This ensures proper guards are in place to prevent the mv
command from failing with a hard exit, allowing your custom error handling to
work as intended.

Comment thread src/code.sh
Comment on lines +37 to +47
EFF_GENOME_SIZE=$(echo "${EFF_OUTPUT}" | grep "Effective Genome Size" | awk '{print $(NF-1)}')

if [[ -z "${EFF_GENOME_SIZE}" ]]; then
echo "Error: pyEffGenomeSize did not return an effective genome size. Check logs above." >&2
exit 1
fi

if [[ "${EFF_GENOME_SIZE}" -eq 0 ]]; then
echo "Error: Effective genome size is 0 bp" >&2
exit 1
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check if file exists and examine the relevant lines
if [[ -f "src/code.sh" ]]; then
    echo "=== File exists ==="
    wc -l src/code.sh
    echo ""
    echo "=== Lines 1-50 to see context ==="
    sed -n '1,50p' src/code.sh
else
    echo "File src/code.sh not found"
    git ls-files | grep -i "code\\.sh"
fi

Repository: eastgenomics/eggd_pyTMB

Length of output: 1640


Make EFF_GENOME_SIZE parsing resilient under pipefail.

The pipeline on line 37 fails prematurely: if grep finds no match, it exits with code 1, triggering script termination before the validation block ever runs. Additionally, non-numeric values bypass the empty check and crash the arithmetic comparison on line 44.

Suggested patch
-EFF_GENOME_SIZE=$(echo "${EFF_OUTPUT}" | grep "Effective Genome Size" | awk '{print $(NF-1)}')
+EFF_GENOME_SIZE="$(awk '/Effective Genome Size/ {print $(NF-1)}' <<< "${EFF_OUTPUT}" | tail -n1 || true)"
 
 if [[ -z "${EFF_GENOME_SIZE}" ]]; then
     echo "Error: pyEffGenomeSize did not return an effective genome size. Check logs above." >&2
     exit 1
 fi
 
+if [[ ! "${EFF_GENOME_SIZE}" =~ ^[0-9]+$ ]]; then
+    echo "Error: Effective genome size is not a valid integer: ${EFF_GENOME_SIZE}" >&2
+    exit 1
+fi
+
 if [[ "${EFF_GENOME_SIZE}" -eq 0 ]]; then
     echo "Error: Effective genome size is 0 bp" >&2
     exit 1
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/code.sh` around lines 37 - 47, The EFF_GENOME_SIZE extraction pipeline is
vulnerable when pipefail is enabled because grep exits with code 1 when no match
is found, causing premature script termination before validation can occur.
Additionally, the empty check on line 39 does not validate whether the parsed
value is numeric, allowing non-numeric strings to reach the arithmetic
comparison on line 44. Fix this by either: allowing grep to return empty output
without triggering pipefail (such as using grep with a fallback pattern or
piping through a non-failing command), and then enhance the validation logic to
check both that EFF_GENOME_SIZE is non-empty AND matches a numeric pattern
before performing the arithmetic comparison with -eq. This ensures the script
handles missing matches, empty results, and non-numeric values gracefully.

@RSWilson1 RSWilson1 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RSWilson1 reviewed 12 files and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on Aisha-D).


src/code.sh line 31 at r1 (raw file):

  --minCoverage 100 \
  --minMapq 50 \
  --oprefix "/home/dnanexus/out/${SAMPLE}" 2>&1); then

should it be --oprefix ?


src/code.sh line 68 at r1 (raw file):

fi

dx-upload-all-outputs

could add --parallel here but probably won't make much difference

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants