The transform pipeline stage applies caller-declared
transformations to the extracted field groups after every other
LLM stage (extract, judge, judge_escalation) and before rules /
assemble. It lets you push deduplication, normalisation, role
classification, language translation and any other post-processing
that operates on extracted data into the IDP itself — rather than
re-implementing it in every consumer.

What this doc covers: the two transformation types (`entity_resolution`, `llm`), how scope works, chaining, and the add-a-type recipe. When to read it: while declaring `options.transformations[]`. Where else to look:
- HTTP request shape: `payload-reference.md` § 9.
- Pipeline integration: `pipeline.md` (`transform` stage).

Two transformation types ship in-tree. Adding more types is a
single-line union extension plus a new branch in
TransformationEngine; the public API does not change.
| Type | Cost | When to use |
|---|---|---|
| `entity_resolution` | Free, ms-scale | Deduplicate rows that refer to the same entity. Bridges accent variants, partial names, formatting differences in identifiers. |
| `llm` | One LLM call per group | Anything the declarative types cannot express: role buckets, summarisation, free-text normalisation, schema migration. |

{
  "options": {
    "stages": { "transform": true },
    "transformations": [ /* see below */ ]
  }
}

`transformations` is always a list and is applied in declared
order, so you can chain transformations against the same target — a
common pattern is entity_resolution first (cheap, deterministic)
followed by an llm step that operates on the deduped survivors.
The list can be empty: the stage is silently a no-op even with the
toggle on. Failures of individual transformations are caught by the
engine and logged; the surrounding pipeline never fails because one
transformation misbehaved.
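
The ordering and failure-isolation behaviour described above can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: `apply_in_order` and the three helper transformations are hypothetical names, and passing the previous rows through unchanged after a failure is an assumption (the text only says failures are caught and logged).

```python
import logging
from typing import Callable

logger = logging.getLogger("transform")

Rows = list[dict]
# Hypothetical shape: a transformation takes rows and returns new rows.
Transformation = Callable[[Rows], Rows]


def apply_in_order(rows: Rows, transformations: list[Transformation]) -> Rows:
    """Feed each transformation the output of the previous one.

    A transformation that raises is logged and skipped (assumption: the
    rows it received are handed to the next step unchanged).
    """
    current = rows
    for index, transform in enumerate(transformations):
        try:
            current = transform(current)
        except Exception:
            logger.exception("transformation %d failed; keeping previous rows", index)
    return current


def dedupe_by_dni(rows: Rows) -> Rows:
    """Cheap deterministic step: keep the first row per dni."""
    seen, kept = set(), []
    for row in rows:
        if row["dni"] not in seen:
            seen.add(row["dni"])
            kept.append(row)
    return kept


def always_fails(rows: Rows) -> Rows:
    """Stand-in for a misbehaving transformation."""
    raise RuntimeError("boom")


def uppercase_cargo(rows: Rows) -> Rows:
    """Later step: operates on whatever survived the earlier steps."""
    return [{**row, "cargo": row["cargo"].upper()} for row in rows]


rows = [
    {"dni": "1", "cargo": "admin"},
    {"dni": "1", "cargo": "admin"},
    {"dni": "2", "cargo": "vocal"},
]
result = apply_in_order(rows, [dedupe_by_dni, always_fails, uppercase_cargo])
```

Here `uppercase_cargo` sees two rows, not three, and the failing middle step does not abort the chain. An empty list returns the input untouched, which matches the no-op behaviour.
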

{
  "options": {
    "stages": { "transform": true },
    "transformations": [
      {
        "type": "entity_resolution",
        "target_group": "personas",
        "match_by": ["dni", "nombre"],
        "scope": "request"
      },
      {
        "type": "llm",
        "target_group": "personas",
        "intention": "Classify each cargo into a closed taxonomy."
      }
    ]
  }
}

The LLM in the second entry sees the deduped rows produced by the first one, not the originals.

Every transformation declares a scope:
- `task` (default): runs once per `(segment, document_type)` task and mutates that task's groups in place. Right for single-document transformations.
- `request`: concatenates the matching `target_group` across every task in the request, applies the transformation once over the consolidated rows, and emits the result as a new entry under `result.request_transformations`. Per-task groups are left untouched. Right for cross-document entity resolution: the same person mentioned in five deeds collapses into a single canonical row.
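
The difference between the two scopes can be sketched as follows. The data shapes are assumptions made for illustration: a task is modelled as a dict with a `groups` mapping, and `run_scoped` and `dedupe_by_nombre` are hypothetical names, not the engine's API.

```python
from typing import Callable

Rows = list[dict]


def run_scoped(
    tasks: list[dict],
    target_group: str,
    scope: str,
    transform: Callable[[Rows], Rows],
) -> list[dict]:
    """Return request-level entries; task scope mutates tasks and returns []."""
    if scope == "task":
        # One run per task, result written back into that task's group.
        for task in tasks:
            if target_group in task["groups"]:
                task["groups"][target_group] = transform(task["groups"][target_group])
        return []
    if scope == "request":
        # One run over the rows of every task; per-task groups stay untouched.
        combined = [row for task in tasks for row in task["groups"].get(target_group, [])]
        return [{"group": target_group, "rows": transform(combined)}]
    raise ValueError(f"unknown scope: {scope!r}")


def dedupe_by_nombre(rows: Rows) -> Rows:
    """Toy transformation: keep the first row per nombre."""
    seen, kept = set(), []
    for row in rows:
        if row["nombre"] not in seen:
            seen.add(row["nombre"])
            kept.append(row)
    return kept


tasks = [
    {"groups": {"personas": [{"nombre": "Ana"}, {"nombre": "Ana"}]}},
    {"groups": {"personas": [{"nombre": "Ana"}, {"nombre": "Luis"}]}},
]
request_entries = run_scoped(tasks, "personas", "request", dedupe_by_nombre)
```

With `scope="request"`, "Ana" appears once in `request_entries` even though she is mentioned in both tasks, while each task still holds its original rows. With `scope="task"`, each task would be deduplicated on its own and "Ana" would survive once per task.
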
Deterministic two-phase matcher:
- DNI / identifier match. Rows whose normalised value of a given field (`dni`, `cif`, …) collide are merged unconditionally. The normaliser strips formatting (`07.549.861-L` → `07549861L`) so document-to-document variants line up.
- Name-variant match. Rows that lack a DNI fall back to NFKD-fold + token-subset matching. Two rows match when one name's token set is a subset of the other's AND they share at least `min_shared_tokens` tokens. The token floor (default `2`) blocks collapsing strangers who happen to share a single first name.

Canonical-row selection picks the most complete value per sub-field — longest string wins for names; first non-empty wins for other types.

{
  "options": {
    "stages": { "transform": true },
    "transformations": [
      {
        "type": "entity_resolution",
        "target_group": "personas",
        "match_by": ["dni", "nombre"],
        "min_shared_tokens": 2,
        "scope": "request"
      }
    ]
  }
}

Given personas across multiple deeds:

| nombre | dni |
|---|---|
| Andrés Contreras | |
| Andres Contreras Guillen | |
| Joaquín Sevilla | 07549861L |
| Joaquín Sevilla Rodríguez | 07.549.861-L |

→ `result.request_transformations[0].fields[0].value` will be:

| nombre | dni |
|---|---|
| Andres Contreras Guillen | |
| Joaquín Sevilla Rodríguez | 07549861L |
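
The two-phase matcher and the canonical-row rule can be reproduced on the personas example with a short sketch. It is an approximation written from the description in this section, not the shipped transformer: the function names are hypothetical, clustering is a simple greedy pass, and treating two rows that both carry an identifier as decided by the identifier alone is an assumption.

```python
import re
import unicodedata


def normalise_id(value: str) -> str:
    """Strip formatting and upper-case: '07.549.861-L' becomes '07549861L'."""
    return re.sub(r"[^0-9A-Za-z]", "", value).upper()


def name_tokens(name: str) -> frozenset[str]:
    """NFKD-fold a name (drop accents, case-fold) and split it into tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return frozenset(folded.casefold().split())


def same_entity(a: dict, b: dict, min_shared_tokens: int = 2) -> bool:
    id_a = normalise_id(a.get("dni") or "")
    id_b = normalise_id(b.get("dni") or "")
    if id_a and id_b:
        # Phase 1 (assumption: when both rows have an identifier, it decides).
        return id_a == id_b
    # Phase 2: one token set must contain the other, with enough shared tokens.
    ta, tb = name_tokens(a.get("nombre") or ""), name_tokens(b.get("nombre") or "")
    return (ta <= tb or tb <= ta) and len(ta & tb) >= min_shared_tokens


def canonical(cluster: list[dict]) -> dict:
    """Longest string wins for the name; first non-empty wins elsewhere."""
    merged = {}
    for key in cluster[0]:  # assumes every row has the same sub-fields
        values = [row[key] for row in cluster if row.get(key)]
        if not values:
            merged[key] = ""
        elif key == "nombre":
            merged[key] = max(values, key=len)
        else:
            merged[key] = values[0]
    return merged


def resolve(rows: list[dict], min_shared_tokens: int = 2) -> list[dict]:
    clusters: list[list[dict]] = []
    for row in rows:
        for cluster in clusters:
            if any(same_entity(row, member, min_shared_tokens) for member in cluster):
                cluster.append(row)
                break
        else:
            clusters.append([row])
    return [canonical(cluster) for cluster in clusters]


personas = [
    {"nombre": "Andrés Contreras", "dni": ""},
    {"nombre": "Andres Contreras Guillen", "dni": ""},
    {"nombre": "Joaquín Sevilla", "dni": "07549861L"},
    {"nombre": "Joaquín Sevilla Rodríguez", "dni": "07.549.861-L"},
]
resolved = resolve(personas)
```

`resolved` holds the two rows of the output table above. Raising `min_shared_tokens` to `3` would keep the two Contreras rows apart, because they share only two folded tokens.
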

| Field | Type | Notes |
|---|---|---|
| `target_group` | `string` | `FieldGroup.name` the transformation operates on. No-op if no such group is found in the task. |
| `match_by` | `list[string]` | Field names to consider for matching, in priority order. The DNI-style field comes first; the name field is the fallback. |
| `min_shared_tokens` | `int` (default `2`) | Minimum shared tokens for a name-variant match. `1` is rarely safe; `2` bridges accent + partial-name variants without false merges. |
| `output_group` | `string \| null` | `null` = mutate the original group in place. Set a name to keep the original AND append the deduped view as a new group. |
| `scope` | `"task" \| "request"` (default `"task"`) | See scope section above. |

The escape hatch. Caller supplies a one-sentence intention; the
engine renders a focused prompt against the target group's rows and
expects the LLM to return rows in the same shape. The response
replaces (or, with output_group, augments) the original group.
Use this for:
- Role classification. "Map each cargo to a closed bucket {administrador_unico, consejero, apoderado, otros}."
- Language translation. "Translate every value to English while preserving keys."
- Schema migration. "Rename `participacion` to `equity_pct` and emit a numeric percent."
- Anonymisation. "Replace each `nombre` with a stable token of the form `PERSON_NNN`."
- Summarisation. "Collapse the list into one summary row per distinct `entity_cif`."

{
  "options": {
    "stages": { "transform": true },
    "transformations": [
      {
        "type": "llm",
        "target_group": "personas",
        "intention": "Normalize each cargo to a closed taxonomy: administrador_unico, consejero, apoderado, otros. Keep all other fields untouched.",
        "scope": "task"
      }
    ]
  }
}

| Field | Type | Notes |
|---|---|---|
| `target_group` | `string` | Source `FieldGroup.name`. |
| `intention` | `string` | One-sentence goal in any language. The LLM is prompt-engineered to be conservative: when in doubt it preserves the input. |
| `prompt_id` | `string \| null` | Optional named prompt template id from the catalog. When omitted, the default transform prompt renders the intention into a generic shell. |
| `output_group` | `string \| null` | Same semantics as the declarative type. |
| `scope` | enum | Same as above. |

The LLM is instructed to emit { "rows": [...] } where each row is a
JSON object whose keys match the input row's sub-field names (unless
the intention explicitly asks to add, rename or remove keys). The
engine materialises each returned row back into an ExtractedField
with sub-fields, preserving page anchors and bbox metadata from the
original row when key names match.
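
A rough sketch of that materialisation step is shown below. The row shape (a dict of sub-field name to `value` / `page` / `bbox`) and the `materialise` helper are hypothetical stand-ins for `ExtractedField`, and pairing returned rows with original rows by position is an assumption the text does not confirm.

```python
import json


def materialise(response_text: str, original_rows: list[dict]) -> list[dict]:
    """Rebuild rows from an LLM reply of the form {"rows": [...]}.

    Metadata (page, bbox) is copied from the original row at the same
    index when the sub-field name also exists there; new keys get None.
    """
    returned = json.loads(response_text)["rows"]
    rebuilt_rows = []
    for index, row in enumerate(returned):
        source = original_rows[index] if index < len(original_rows) else {}
        rebuilt = {}
        for key, value in row.items():
            anchor = source.get(key, {})
            rebuilt[key] = {
                "value": value,
                "page": anchor.get("page"),
                "bbox": anchor.get("bbox"),
            }
        rebuilt_rows.append(rebuilt)
    return rebuilt_rows


original = [
    {"cargo": {"value": "Administrador único", "page": 3, "bbox": [10, 20, 200, 40]}}
]
reply = '{"rows": [{"cargo": "administrador_unico", "origen": "llm"}]}'
rows = materialise(reply, original)
```

In this example `cargo` keeps its page and bounding box because the key name is unchanged, while the added `origen` key has no anchor to inherit.
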
Each LLM transformation is one structured-output call against the
default model. Token usage is included in the request's
usage.breakdown under transform.{transformation_id[:8]}. Default
timeout per call is 600 s (override with
FLYDOCS_TRANSFORM_TIMEOUT_S).
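
For orientation, the usage key and the timeout override could be derived like this. Both helpers are hypothetical; in particular, how the service actually parses `FLYDOCS_TRANSFORM_TIMEOUT_S` is not documented here, so reading it as a plain number of seconds is an assumption.

```python
import os
from typing import Mapping

DEFAULT_TRANSFORM_TIMEOUT_S = 600.0


def transform_timeout_s(env: Mapping[str, str] = os.environ) -> float:
    """Per-call timeout in seconds; falls back to 600 when the variable is unset."""
    raw = env.get("FLYDOCS_TRANSFORM_TIMEOUT_S", "").strip()
    return float(raw) if raw else DEFAULT_TRANSFORM_TIMEOUT_S


def usage_key(transformation_id: str) -> str:
    """Key under usage.breakdown: 'transform.' plus the first 8 id characters."""
    return f"transform.{transformation_id[:8]}"


key = usage_key("3f2a9c1e-7b44-4d0a-9f10-aabbccddeeff")
timeout = transform_timeout_s({"FLYDOCS_TRANSFORM_TIMEOUT_S": "120"})
```
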
The DTO uses a Pydantic discriminated union keyed on type. A new
declarative transformation is three steps:
- Add a Pydantic model under `interfaces/dtos/transformation.py` with a unique `type: Literal[...]` discriminator and the fields the caller will populate.
- Add the new model to the `Transformation` union at the bottom of the same file.
- Implement a new transformer service under `core/services/transformations/` and add a branch to `TransformationEngine._dispatch`.
No changes to the orchestrator or the public API are required.
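
The three steps might look roughly like this, assuming Pydantic v2. Every class name, the `uppercase` type and the free-standing `dispatch` function are invented for illustration; the real models in `interfaces/dtos/transformation.py` and the real `TransformationEngine._dispatch` will differ.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter


# Simplified stand-ins for the two in-tree models.
class EntityResolutionTransformation(BaseModel):
    type: Literal["entity_resolution"] = "entity_resolution"
    target_group: str
    match_by: list[str]
    min_shared_tokens: int = 2


class LlmTransformation(BaseModel):
    type: Literal["llm"] = "llm"
    target_group: str
    intention: str


# Step 1: a new model with its own discriminator value (hypothetical type).
class UppercaseTransformation(BaseModel):
    type: Literal["uppercase"] = "uppercase"
    target_group: str
    field_names: list[str]


# Step 2: extend the discriminated union with the new model.
Transformation = Annotated[
    Union[EntityResolutionTransformation, LlmTransformation, UppercaseTransformation],
    Field(discriminator="type"),
]


# Step 3: a dispatcher branch for the new type (stand-in for _dispatch).
def dispatch(transformation: BaseModel, rows: list[dict]) -> list[dict]:
    if isinstance(transformation, UppercaseTransformation):
        wanted = set(transformation.field_names)
        return [
            {k: v.upper() if k in wanted and isinstance(v, str) else v for k, v in row.items()}
            for row in rows
        ]
    raise NotImplementedError(type(transformation).__name__)


# The caller's JSON is routed to the right model by its "type" value.
parsed = TypeAdapter(Transformation).validate_python(
    {"type": "uppercase", "target_group": "personas", "field_names": ["cargo"]}
)
out = dispatch(parsed, [{"cargo": "consejero", "nombre": "Ana"}])
```

Because the union is keyed on `type`, existing payloads keep validating against their own models and only requests that name the new type reach the new branch.
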
... → judge → judge_escalation → transform → rules → assemble
The placement is intentional:
- After judge so transformations operate on graded data — you can route only PASS-graded rows through the LLM transformer by pre-filtering them in your transformation prompt.
- Before rules so the business-rule DAG can branch on the transformed entities.
- Before assemble so the final `ExtractionResult` reflects the transformations.

- `docs/pipeline.md`: full stage table and DAG construction.
- `src/flydocs/interfaces/dtos/transformation.py`: DTO source.
- `src/flydocs/core/services/transformations/`: implementation.
- `tests/unit/test_entity_resolution_transformer.py`: declarative tests.
- `tests/unit/test_transformation_engine.py`: dispatcher + scope tests.