Context
We're exploring ways to reduce the serialization/deserialization overhead when passing large TRAPI messages between shepherd workers. Each worker in the pipeline currently does a full decode (zstandard decompress + JSON/msgpack parse) of the entire message from Redis, processes it, then re-encodes and writes it back. For large messages this adds up across the pipeline.
We evaluated several options:
- Codec swap (orjson → msgspec msgpack) — Phase 1 is on branch
claude/jolly-bohr-KLfa9
- Typed deserialization using TOM (translator_tom) Pydantic objects or msgspec Structs
- Split Redis storage by message section so workers only decode what they need
This issue captures the analysis to date as a discussion point.
Performance Impact of Using TOM Objects in Shepherd
The Pipeline Cost Per Query
A typical Aragorn query passes through roughly 8 workers in sequence, each doing a get+save cycle against Redis. Here's the serialization budget for a 5,000-element message (a realistic large query with 5K nodes, 5K edges, 5K results):
| Step |
Current (plain dicts) |
With TOM Objects |
| Decompress + decode to dicts |
~83ms |
~83ms |
Pydantic model_validate |
— |
~250ms (est. ~3× decode time) |
| Worker logic |
varies |
same (but richer API) |
| Encode + compress |
~6ms |
~6ms |
| Per-worker overhead |
~89ms |
~339ms |
| 8-worker pipeline total |
~712ms |
~2,712ms |
The validation cost comes from constructing ~200,000 model instances per decode (one per node, edge, result, attribute, etc.) multiplied by 8 workers in the pipeline.
Message Size Breakdown
At the 5K scale, the knowledge graph dominates the message:
| Section |
Size |
Share |
knowledge_graph |
2.96 MB |
75% |
results |
0.81 MB |
20% |
auxiliary_graphs |
0.14 MB |
3% |
query_graph |
<0.01 MB |
<1% |
What Each Worker Actually Touches
Not every worker needs every section. Here's the access pattern across the pipeline:
| Worker |
Reads |
Modifies |
Ser/deser cycles |
Notes |
| merge_message |
KG + results + aux (×3 messages) |
all |
3 gets + 1 save |
Semantic merge, dedup, edge remapping — richest logic |
| aragorn_score |
full message |
results.analyses.score |
1+1 |
Sets one field per analysis |
| aragorn_omnicorp |
KG + results |
node/edge attributes |
1+1 |
Appends attributes |
| filter_results_top_n |
results only |
truncates list |
1+1 |
results[:N] |
| filter_kgraph_orphans |
full message |
removes KG entries |
1+1 |
Set operations on keys |
| sort_results_score |
results |
reorders list |
1+1 |
Sort by score |
| filter_analyses_top_n |
results |
truncates analyses |
1+1 |
Similar to filter_results |
| finish_query |
reads for callback |
none |
1 get |
Read-only |
| aragorn/bte (entry) |
query_graph |
none |
1 get |
Reads 2 fields |
| aragorn_lookup/bte_lookup |
query_graph + params |
params |
1 get |
Small message portion |
Where TOM Objects Add Value vs. Overhead
TOM provides a well-designed set of helpers for TRAPI operations — update() for merging knowledge graphs, normalize() for hash-based edge ID remapping, hash-based result deduplication, constraint checking, and more. These are genuinely useful for complex semantic operations.
However, in the shepherd pipeline, most workers perform straightforward dict operations (append to a list, sort by a key, truncate, set a field). For these workers, the Pydantic validation cost of constructing typed objects outweighs the benefit — there's no complex logic that benefits from the richer API.
The one worker where TOM's helpers would clearly pay for themselves is merge_message, which does real semantic operations (merge KGs, deduplicate results, remap edge IDs). The question is whether that one worker's benefit justifies the validation cost, or whether it's better to use TOM selectively.
Scaling Behavior
For reference, here's how serialization cost scales with message size:
| Scale |
Raw Size |
Compressed |
Decode Time |
Encode Time |
Round-trip |
| Small (100 elements) |
0.1 MB |
5 KB |
0.8ms |
0.2ms |
1.0ms |
| Medium (1K) |
0.8 MB |
32 KB |
12ms |
2ms |
13ms |
| Large (5K) |
3.9 MB |
135 KB |
103ms |
12ms |
115ms |
| XL (10K) |
7.8 MB |
250 KB |
350ms |
23ms |
373ms |
| XXL (50K) |
39.7 MB |
1.1 MB |
2,365ms |
119ms |
2,484ms |
Decode dominates encode by ~10×. This is because decoding must construct hundreds of thousands of Python objects (dicts, lists, strings), while encoding just walks the existing object tree.
Possible Approaches (Not Mutually Exclusive)
1. Split Redis storage by message section
Store {id}:kg, {id}:results, {id}:aux, {id}:qg as separate keys. Workers fetch only the sections they need. Since the knowledge graph is 75% of the message and only 5 of 16 workers access it, this could reduce decode time by ~40–60% for most workers with no new dependencies.
2. Use TOM selectively
Use TOM objects in merge_message (and potentially other workers that benefit from semantic helpers) while keeping plain dicts for simple filter/sort workers. This gets the best of both worlds but means two code styles in the repo.
3. msgspec msgpack as the wire format (Phase 1 — done)
Branch claude/jolly-bohr-KLfa9 swaps the codec from orjson to msgspec.msgpack. Performance is roughly neutral for untyped dicts, but it establishes the binary wire format. Includes backward-compatible decoding for in-flight JSON messages during rolling deploys.
4. Typed deserialization (future)
If/when typed Structs or similar become viable with the constraints we need (extra field preservation, hash-based identity), typed decode could skip Python dict construction entirely. This remains the theoretical ceiling but has practical constraints today — see discussion in the analysis above.
Open Questions
- Is split storage worth the added complexity in Redis key management and TTL handling?
- For
merge_message specifically — would it be cleaner to convert dicts → TOM objects only inside that worker, or to pass TOM objects through the pipeline?
- Are there other workers where TOM's helpers (constraint checking, normalization) would justify the validation overhead?
- What's the typical message size distribution in production? The cost/benefit shifts significantly between the 1K and 10K scales.
Feedback welcome from anyone familiar with the TRAPI pipeline or TOM internals.
Context
We're exploring ways to reduce the serialization/deserialization overhead when passing large TRAPI messages between shepherd workers. Each worker in the pipeline currently does a full decode (zstandard decompress + JSON/msgpack parse) of the entire message from Redis, processes it, then re-encodes and writes it back. For large messages this adds up across the pipeline.
We evaluated several options:
claude/jolly-bohr-KLfa9This issue captures the analysis to date as a discussion point.
Performance Impact of Using TOM Objects in Shepherd
The Pipeline Cost Per Query
A typical Aragorn query passes through roughly 8 workers in sequence, each doing a get+save cycle against Redis. Here's the serialization budget for a 5,000-element message (a realistic large query with 5K nodes, 5K edges, 5K results):
model_validateThe validation cost comes from constructing ~200,000 model instances per decode (one per node, edge, result, attribute, etc.) multiplied by 8 workers in the pipeline.
Message Size Breakdown
At the 5K scale, the knowledge graph dominates the message:
knowledge_graphresultsauxiliary_graphsquery_graphWhat Each Worker Actually Touches
Not every worker needs every section. Here's the access pattern across the pipeline:
results.analyses.scoreresults[:N]Where TOM Objects Add Value vs. Overhead
TOM provides a well-designed set of helpers for TRAPI operations —
update()for merging knowledge graphs,normalize()for hash-based edge ID remapping, hash-based result deduplication, constraint checking, and more. These are genuinely useful for complex semantic operations.However, in the shepherd pipeline, most workers perform straightforward dict operations (append to a list, sort by a key, truncate, set a field). For these workers, the Pydantic validation cost of constructing typed objects outweighs the benefit — there's no complex logic that benefits from the richer API.
The one worker where TOM's helpers would clearly pay for themselves is merge_message, which does real semantic operations (merge KGs, deduplicate results, remap edge IDs). The question is whether that one worker's benefit justifies the validation cost, or whether it's better to use TOM selectively.
Scaling Behavior
For reference, here's how serialization cost scales with message size:
Decode dominates encode by ~10×. This is because decoding must construct hundreds of thousands of Python objects (dicts, lists, strings), while encoding just walks the existing object tree.
Possible Approaches (Not Mutually Exclusive)
1. Split Redis storage by message section
Store
{id}:kg,{id}:results,{id}:aux,{id}:qgas separate keys. Workers fetch only the sections they need. Since the knowledge graph is 75% of the message and only 5 of 16 workers access it, this could reduce decode time by ~40–60% for most workers with no new dependencies.2. Use TOM selectively
Use TOM objects in
merge_message(and potentially other workers that benefit from semantic helpers) while keeping plain dicts for simple filter/sort workers. This gets the best of both worlds but means two code styles in the repo.3. msgspec msgpack as the wire format (Phase 1 — done)
Branch
claude/jolly-bohr-KLfa9swaps the codec from orjson tomsgspec.msgpack. Performance is roughly neutral for untyped dicts, but it establishes the binary wire format. Includes backward-compatible decoding for in-flight JSON messages during rolling deploys.4. Typed deserialization (future)
If/when typed Structs or similar become viable with the constraints we need (extra field preservation, hash-based identity), typed decode could skip Python dict construction entirely. This remains the theoretical ceiling but has practical constraints today — see discussion in the analysis above.
Open Questions
merge_messagespecifically — would it be cleaner to convert dicts → TOM objects only inside that worker, or to pass TOM objects through the pipeline?Feedback welcome from anyone familiar with the TRAPI pipeline or TOM internals.