fix(memorize): silent memory item drops in XML parser and LLM-mode retrieval by Meur3ault · Pull Request #421 · NevaMind-AI/memU

Meur3ault · 2026-05-15T05:17:16Z

📝 Pull Request Summary

This PR fixes two parser bugs in src/memu/app/memorize.py that silently drop memory items during memorize(), and adds a follow-up change so the recovered items stay reachable from LLM-mode retrieval.

✅ What does this PR do?

Three atomic commits in src/memu/app/memorize.py:

1. Preserve memory items with empty <categories> (cbfac4b)

_parse_memory_element required both <content> and <categories> to be truthy:

if memory_dict.get("content") and memory_dict.get("categories"):
    return memory_dict
return None

But the prompts at prompts/memory_type/{profile,event,knowledge,behavior}.py all state "categories": [...can be empty]. Memories that don't fit a pre-configured category were therefore dropped with no log line. Now only <content> is required; an empty or missing <categories> defaults to [].
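A minimal sketch of the relaxed check, assuming memory_dict is a plain dict already built from the parsed element. The function name keep_memory and the dict shape are illustrative only and are not copied from memorize.py:

```python
from typing import Any, Optional


def keep_memory(memory_dict: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Hypothetical stand-in for the final check in _parse_memory_element."""
    # Content is still mandatory: an item without text carries no memory.
    if not memory_dict.get("content"):
        return None
    # Categories are optional: normalise a missing or empty value to [].
    memory_dict["categories"] = memory_dict.get("categories") or []
    return memory_dict
```

Under this sketch an item with content and no categories is kept with categories set to [], while an item with empty content is still rejected.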

2. Expand the XML root-tag whitelist (e116f4d)

_find_xml_boundaries recognised only ["item", "profile", "behaviors", "events", "knowledge", "skills"], while MemoryType in database/models.py also includes tool and the singular forms behavior, event, and skill. The whitelist is extended to cover both forms of every MemoryType. The bug is latent today because every built-in prompt wraps its output in <item>, but any custom prompt using a semantic root tag returned [] with only a "Could not find valid root tag" warning.
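One way to keep such a whitelist from drifting again is to derive it from the type names rather than hand-writing it. The sketch below is an assumption about how that could look, not the code in this PR: MEMORY_TYPES is a hand-copied list of the names mentioned above, plurals are formed by naively appending "s", and find_root_tag is a hypothetical simplification of _find_xml_boundaries.

```python
from typing import Optional

# Assumed MemoryType values, taken from the names mentioned in this PR.
MEMORY_TYPES = ["profile", "event", "knowledge", "behavior", "skill", "tool"]


def build_root_tags(memory_types: list[str]) -> list[str]:
    """Return "item" plus the singular and naive "+s" plural of each type."""
    tags = ["item"]
    for name in memory_types:
        for candidate in (name, name + "s"):
            if candidate not in tags:
                tags.append(candidate)
    return tags


def find_root_tag(raw: str, tags: list[str]) -> Optional[str]:
    """Return the first whitelisted tag that has both an opening and a closing tag."""
    for tag in tags:
        if f"<{tag}>" in raw and f"</{tag}>" in raw:
            return tag
    return None
```

With this list, a response wrapped in <event> or <tool> is recognised, while an unknown root still yields None so the caller can log the warning.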

3. Link uncategorized items to a fallback category (8284ff1)

After fix 1, items with no matched category land in the DB but with no CategoryItem relation. LLM-mode retrieval joins items via that table (retrieve.py:_format_items_for_llm), so those items remained unreachable from the LLM path. (RAG-mode was unaffected because its recall_items does a global vector search.)
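The reachability gap can be illustrated with a toy model. Everything below is hypothetical (plain dicts and tuples instead of the real repositories, and invented ids); it only shows why a join through the relation table misses an item that a global search still sees.

```python
# Stored items, keyed by item id (invented sample data).
items = {"i1": "Prefers dark mode", "i2": "Owns a vintage typewriter"}

# (category_id, item_id) relation rows; i2 matched no category, so it has no row.
category_items = [("c_prefs", "i1")]


def items_via_categories(category_ids: list[str]) -> list[str]:
    """LLM-mode style lookup: only items linked to the chosen categories."""
    wanted = set(category_ids)
    return [items[item_id] for cat_id, item_id in category_items if cat_id in wanted]


def items_via_global_search() -> list[str]:
    """RAG-mode style lookup: every stored item is a candidate."""
    return list(items.values())
```

Here i2 is stored but never returned by items_via_categories, whichever categories are selected, because no relation row points at it.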

An uncategorized category is now auto-created at category init (memorize_config.enable_uncategorized_fallback, default True). Items whose extracted category names match nothing get linked to it. The category is seeded with a static summary at creation (equal to its description) because retrieve.py:_rank_categories_by_summary filters on cat.summary; the seeding goes through update_category() so it persists on SQLite and Postgres backends where get_or_create_category returns a detached or copied instance. Dynamic summary updates skip the fallback since aggregating heterogeneous items into one summary is noisy and wastes tokens.
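The linking rule described above can be sketched as follows. The helper name resolve_category_ids and the name_to_id mapping are assumptions made for illustration; only the flag name enable_uncategorized_fallback and the category name uncategorized come from this PR.

```python
FALLBACK_NAME = "uncategorized"


def resolve_category_ids(
    extracted_names: list[str],
    name_to_id: dict[str, str],
    enable_uncategorized_fallback: bool = True,
) -> list[str]:
    """Map extracted category names to ids, using the fallback only when nothing matches."""
    matched = [name_to_id[name] for name in extracted_names if name in name_to_id]
    if matched:
        return matched
    # Nothing matched: link to the fallback category if it is enabled and exists.
    if enable_uncategorized_fallback and FALLBACK_NAME in name_to_id:
        return [name_to_id[FALLBACK_NAME]]
    return []
```

An item with at least one matching category is never linked to the fallback in this sketch, and with the flag set to False an unmatched item gets no links at all.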

🤔 Why is this change needed?

The first two bugs cause silent data loss during the core memorize() flow — extracted facts simply never reach storage and there is no log to attribute the loss to. The third change closes the gap so that data preserved by fix 1 is also retrievable through the LLM-mode path, not just RAG-mode.

Related discussion: the long-term memory drift reported in #381 is made strictly worse by silent drops, so fixing this is upstream of any drift work.

🔍 Type of Change

✅ PR Quality Checklist

- PR title follows the conventional format (fix: ×3 commits)
- Changes are limited in scope and easy to review (three atomic commits, ~50 source lines, ~330 test lines)
- Documentation updated where applicable (config field has a description=; no public API docs needed)
- No breaking changes: the new enable_uncategorized_fallback flag defaults to True and adds one uncategorized entry to list_memory_categories(); setting it to False restores the old behaviour exactly
- Related issues or discussions linked (see "Why" section)

📌 Optional

Edge cases considered:
- User pre-configures a CategoryConfig(name="uncategorized") → not duplicated, user's config wins
- User's existing uncategorized has a summary → not overwritten (not cat.summary guard)
- Reinforcement skip path: items being reinforced keep their original category links
- All three DB backends (in-memory, SQLite, Postgres) verified via update_category() flow
Follow-up tasks (left for separate issues):
- MemoryCategory.embedding is set at init but never consumed by retrieval (route_category re-embeds cat.summary instead)
- Cold-start retrieve returns [] from _rank_categories_by_summary when no category has been summarised yet — the fallback's seeded summary partially mitigates this
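The first two edge cases listed above (a user-defined uncategorized category and an existing summary) can be sketched as below. The Category dataclass, ensure_fallback, and the placeholder description text are hypothetical; the real code goes through get_or_create_category and update_category(), which are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional

FALLBACK_NAME = "uncategorized"
FALLBACK_DESCRIPTION = "Items that matched no configured category."  # placeholder text


@dataclass
class Category:
    name: str
    description: str
    summary: Optional[str] = None


def ensure_fallback(categories: list[Category]) -> Category:
    """Return the fallback category, creating it only if the user has not defined one."""
    for cat in categories:
        if cat.name == FALLBACK_NAME:
            break  # the user's own category wins; no duplicate is created
    else:
        cat = Category(FALLBACK_NAME, FALLBACK_DESCRIPTION)
        categories.append(cat)
    # Seed the summary from the description, but never overwrite an existing one.
    if not cat.summary:
        cat.summary = cat.description
    return cat
```

Seeding only when the summary is empty is what keeps a user-written summary intact while still giving a freshly created fallback something for summary-based ranking to match against.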

23 new unit tests across tests/test_xml_parser.py and tests/test_uncategorized_fallback.py, no API key needed. The fallback tests use the in-memory backend with a deterministic stub LLM client, so they exercise the real category creation, item persistence, relation linking, and summary-update code paths end-to-end. make test reports 100 passed, 1 skipped; make check passes pre-commit, mypy, and deptry (the pre-existing uv.lock 1.5.0 vs pyproject.toml 1.5.1 mismatch on main is not touched by this PR).

The _parse_memory_element check required both <content> and <categories> to be non-empty, but the memory-type prompts (profile, event, knowledge, behavior) explicitly state that the categories field may be empty. Memory items that did not match any pre-configured category were silently dropped during memorize(). Now only <content> is required; an empty or missing <categories> defaults to [].

The _find_xml_boundaries whitelist matched only "behaviors"/"events"/"skills" (plural-only) and was missing "tool" entirely. Custom prompts using singular semantic root tags (e.g. <event>, <skill>) or the <tool> root caused the entire LLM response to be discarded. The whitelist now covers both singular and plural forms of every MemoryType value.

…M-mode retrieval

Memory items whose LLM-extracted categories match none of the configured ones were left without any CategoryItem relation. LLM-mode retrieval joins items through that relation table (_format_items_for_llm), so those items were unreachable from the LLM path even though they sat in the database with a usable embedding.

An 'uncategorized' category is now auto-created at category init (memorize_config.enable_uncategorized_fallback, default True). Items with no matching configured category are linked to it. Summary updates skip the fallback category since aggregating unrelated facts produces noise and wastes tokens; its static description embedding is enough for route_category to score it when nothing else matches.

Meur3ault added 3 commits May 15, 2026 14:09

Meur3ault changed the title ~~Fix/xml parser data loss~~ fix(memorize): silent memory item drops in XML parser and LLM-mode retrieval May 15, 2026

Meur3ault wants to merge 3 commits into NevaMind-AI:main from Meur3ault:fix/xml-parser-data-loss
