Discussion: Serialization strategy for TRAPI messages — TOM objects, split storage, and codec options

## Context

We're exploring ways to reduce the serialization/deserialization overhead when passing large TRAPI messages between shepherd workers. Each worker in the pipeline currently does a full decode (zstandard decompress + JSON/msgpack parse) of the entire message from Redis, processes it, then re-encodes and writes it back. For large messages this adds up across the pipeline.

We evaluated several options:
- **Codec swap** (orjson → msgspec msgpack) — Phase 1 is on branch `claude/jolly-bohr-KLfa9`
- **Typed deserialization** using [TOM (translator_tom)](https://github.com/NCATSTranslator/TRAPIObjectModeling/) Pydantic objects or msgspec Structs
- **Split Redis storage** by message section so workers only decode what they need

This issue captures the analysis to date as a discussion point.

---

## Performance Impact of Using TOM Objects in Shepherd

### The Pipeline Cost Per Query

A typical Aragorn query passes through roughly 8 workers in sequence, each doing a get+save cycle against Redis. Here's the serialization budget for a **5,000-element message** (a realistic large query with 5K nodes, 5K edges, 5K results):

| Step | Current (plain dicts) | With TOM Objects |
|---|---|---|
| Decompress + decode to dicts | ~83ms | ~83ms |
| Pydantic `model_validate` | — | ~250ms (est. ~3× decode time) |
| Worker logic | varies | same (but richer API) |
| Encode + compress | ~6ms | ~6ms |
| **Per-worker overhead** | **~89ms** | **~339ms** |
| **8-worker pipeline total** | **~712ms** | **~2,712ms** |

The validation cost comes from constructing ~200,000 model instances per decode (one per node, edge, result, attribute, etc.) multiplied by 8 workers in the pipeline.

### Message Size Breakdown

At the 5K scale, the knowledge graph dominates the message:

| Section | Size | Share |
|---|---|---|
| `knowledge_graph` | 2.96 MB | 75% |
| `results` | 0.81 MB | 20% |
| `auxiliary_graphs` | 0.14 MB | 3% |
| `query_graph` | <0.01 MB | <1% |

### What Each Worker Actually Touches

Not every worker needs every section. Here's the access pattern across the pipeline:

| Worker | Reads | Modifies | Ser/deser cycles | Notes |
|---|---|---|---|---|
| **merge_message** | KG + results + aux (×3 messages) | all | 3 gets + 1 save | Semantic merge, dedup, edge remapping — richest logic |
| **aragorn_score** | full message | `results.analyses.score` | 1+1 | Sets one field per analysis |
| **aragorn_omnicorp** | KG + results | node/edge attributes | 1+1 | Appends attributes |
| **filter_results_top_n** | results only | truncates list | 1+1 | `results[:N]` |
| **filter_kgraph_orphans** | full message | removes KG entries | 1+1 | Set operations on keys |
| **sort_results_score** | results | reorders list | 1+1 | Sort by score |
| **filter_analyses_top_n** | results | truncates analyses | 1+1 | Similar to filter_results |
| **finish_query** | reads for callback | none | 1 get | Read-only |
| **aragorn/bte** (entry) | query_graph | none | 1 get | Reads 2 fields |
| **aragorn_lookup/bte_lookup** | query_graph + params | params | 1 get | Small message portion |

### Where TOM Objects Add Value vs. Overhead

TOM provides a well-designed set of helpers for TRAPI operations — `update()` for merging knowledge graphs, `normalize()` for hash-based edge ID remapping, hash-based result deduplication, constraint checking, and more. These are genuinely useful for complex semantic operations.

However, in the shepherd pipeline, most workers perform straightforward dict operations (append to a list, sort by a key, truncate, set a field). For these workers, the Pydantic validation cost of constructing typed objects outweighs the benefit — there's no complex logic that benefits from the richer API.

The one worker where TOM's helpers would clearly pay for themselves is **merge_message**, which does real semantic operations (merge KGs, deduplicate results, remap edge IDs). The question is whether that one worker's benefit justifies the validation cost, or whether it's better to use TOM selectively.

### Scaling Behavior

For reference, here's how serialization cost scales with message size:

| Scale | Raw Size | Compressed | Decode Time | Encode Time | Round-trip |
|---|---|---|---|---|---|
| Small (100 elements) | 0.1 MB | 5 KB | 0.8ms | 0.2ms | 1.0ms |
| Medium (1K) | 0.8 MB | 32 KB | 12ms | 2ms | 13ms |
| Large (5K) | 3.9 MB | 135 KB | 103ms | 12ms | 115ms |
| XL (10K) | 7.8 MB | 250 KB | 350ms | 23ms | 373ms |
| XXL (50K) | 39.7 MB | 1.1 MB | 2,365ms | 119ms | 2,484ms |

Decode dominates encode by ~10×. This is because decoding must construct hundreds of thousands of Python objects (dicts, lists, strings), while encoding just walks the existing object tree.

---

## Possible Approaches (Not Mutually Exclusive)

### 1. Split Redis storage by message section
Store `{id}:kg`, `{id}:results`, `{id}:aux`, `{id}:qg` as separate keys. Workers fetch only the sections they need. Since the knowledge graph is 75% of the message and only 5 of 16 workers access it, this could reduce decode time by ~40–60% for most workers with no new dependencies.

### 2. Use TOM selectively
Use TOM objects in `merge_message` (and potentially other workers that benefit from semantic helpers) while keeping plain dicts for simple filter/sort workers. This gets the best of both worlds but means two code styles in the repo.

### 3. msgspec msgpack as the wire format (Phase 1 — done)
Branch `claude/jolly-bohr-KLfa9` swaps the codec from orjson to `msgspec.msgpack`. Performance is roughly neutral for untyped dicts, but it establishes the binary wire format. Includes backward-compatible decoding for in-flight JSON messages during rolling deploys.

### 4. Typed deserialization (future)
If/when typed Structs or similar become viable with the constraints we need (extra field preservation, hash-based identity), typed decode could skip Python dict construction entirely. This remains the theoretical ceiling but has practical constraints today — see discussion in the analysis above.

---

## Open Questions

1. Is split storage worth the added complexity in Redis key management and TTL handling?
2. For `merge_message` specifically — would it be cleaner to convert dicts → TOM objects only inside that worker, or to pass TOM objects through the pipeline?
3. Are there other workers where TOM's helpers (constraint checking, normalization) would justify the validation overhead?
4. What's the typical message size distribution in production? The cost/benefit shifts significantly between the 1K and 10K scales.

Feedback welcome from anyone familiar with the TRAPI pipeline or TOM internals.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Discussion: Serialization strategy for TRAPI messages — TOM objects, split storage, and codec options #99

Context

Performance Impact of Using TOM Objects in Shepherd

The Pipeline Cost Per Query

Message Size Breakdown

What Each Worker Actually Touches

Where TOM Objects Add Value vs. Overhead

Scaling Behavior

Possible Approaches (Not Mutually Exclusive)

1. Split Redis storage by message section

2. Use TOM selectively

3. msgspec msgpack as the wire format (Phase 1 — done)

4. Typed deserialization (future)

Open Questions

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Step	Current (plain dicts)	With TOM Objects
Decompress + decode to dicts	~83ms	~83ms
Pydantic `model_validate`	—	~250ms (est. ~3× decode time)
Worker logic	varies	same (but richer API)
Encode + compress	~6ms	~6ms
Per-worker overhead	~89ms	~339ms
8-worker pipeline total	~712ms	~2,712ms

Section	Size	Share
`knowledge_graph`	2.96 MB	75%
`results`	0.81 MB	20%
`auxiliary_graphs`	0.14 MB	3%
`query_graph`	<0.01 MB	<1%

Worker	Reads	Modifies	Ser/deser cycles	Notes
merge_message	KG + results + aux (×3 messages)	all	3 gets + 1 save	Semantic merge, dedup, edge remapping — richest logic
aragorn_score	full message	`results.analyses.score`	1+1	Sets one field per analysis
aragorn_omnicorp	KG + results	node/edge attributes	1+1	Appends attributes
filter_results_top_n	results only	truncates list	1+1	`results[:N]`
filter_kgraph_orphans	full message	removes KG entries	1+1	Set operations on keys
sort_results_score	results	reorders list	1+1	Sort by score
filter_analyses_top_n	results	truncates analyses	1+1	Similar to filter_results
finish_query	reads for callback	none	1 get	Read-only
aragorn/bte (entry)	query_graph	none	1 get	Reads 2 fields
aragorn_lookup/bte_lookup	query_graph + params	params	1 get	Small message portion

Scale	Raw Size	Compressed	Decode Time	Encode Time	Round-trip
Small (100 elements)	0.1 MB	5 KB	0.8ms	0.2ms	1.0ms
Medium (1K)	0.8 MB	32 KB	12ms	2ms	13ms
Large (5K)	3.9 MB	135 KB	103ms	12ms	115ms
XL (10K)	7.8 MB	250 KB	350ms	23ms	373ms
XXL (50K)	39.7 MB	1.1 MB	2,365ms	119ms	2,484ms

Uh oh!

Discussion: Serialization strategy for TRAPI messages — TOM objects, split storage, and codec options #99

Description

Context

Performance Impact of Using TOM Objects in Shepherd

The Pipeline Cost Per Query

Message Size Breakdown

What Each Worker Actually Touches

Where TOM Objects Add Value vs. Overhead

Scaling Behavior

Possible Approaches (Not Mutually Exclusive)

1. Split Redis storage by message section

2. Use TOM selectively

3. msgspec msgpack as the wire format (Phase 1 — done)

4. Typed deserialization (future)

Open Questions

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions