fix(memorize): silent memory item drops in XML parser and LLM-mode retrieval #421

Open
Meur3ault wants to merge 3 commits into NevaMind-AI:main from Meur3ault:fix/xml-parser-data-loss


📝 Pull Request Summary

Two parser bugs in src/memu/app/memorize.py that silently drop memory items during memorize(), plus a follow-up so the recovered items stay reachable from LLM-mode retrieval.


✅ What does this PR do?

Three atomic commits in src/memu/app/memorize.py:

1. Preserve memory items with empty <categories> (cbfac4b)

_parse_memory_element required both <content> and <categories> to be truthy:

if memory_dict.get("content") and memory_dict.get("categories"):
    return memory_dict
return None

But the prompts at prompts/memory_type/{profile,event,knowledge,behavior}.py all state "categories": [...can be empty]. Memories that did not fit a pre-configured category were therefore dropped with no log line. Now only <content> is required; an empty or missing <categories> defaults to [].
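A minimal sketch of the relaxed check, for reviewers who want to see the intended behaviour in isolation. The function name parse_memory_element, the <memory> wrapper, and the <category> child tags are assumptions made for illustration; the real _parse_memory_element in memorize.py may structure its input differently.

```python
import xml.etree.ElementTree as ET
from typing import Any, Optional


def parse_memory_element(xml_text: str) -> Optional[dict[str, Any]]:
    """Hypothetical stand-in for the fixed parser: only <content> is required."""
    element = ET.fromstring(xml_text)

    # <content> must be present and non-blank, otherwise the item is rejected.
    content = (element.findtext("content") or "").strip()
    if not content:
        return None

    # <categories> is optional: missing or empty both yield an empty list.
    categories: list[str] = []
    categories_el = element.find("categories")
    if categories_el is not None:
        # Assumed layout: one <category> child per category name.
        for child in categories_el.findall("category"):
            name = (child.text or "").strip()
            if name:
                categories.append(name)

    return {"content": content, "categories": categories}
```

Under the old check, the first two inputs in a call such as parse_memory_element("<memory><content>Likes tea</content><categories></categories></memory>") or the same element without <categories> would have returned None; in this sketch both return the item with "categories": [].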

2. Expand the XML root-tag whitelist (e116f4d)

_find_xml_boundaries recognised only ["item", "profile", "behaviors", "events", "knowledge", "skills"]. MemoryType in database/models.py includes tool and singular behavior/event/skill. The whitelist is extended to cover both forms of every MemoryType. The bug is latent today because every built-in prompt wraps its output in <item>, but any custom prompt using semantic root tags returned [] with only a "Could not find valid root tag" warning.
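The whitelist idea can be sketched as follows. This is an illustration, not the code in the commit: the MEMORY_TYPES tuple is inferred from the description above, the naive "append s" pluralisation is an assumption, and find_xml_boundaries is a hypothetical stand-in for _find_xml_boundaries.

```python
import re
from typing import Optional

# Assumed MemoryType values, inferred from the PR description.
MEMORY_TYPES = ("profile", "event", "knowledge", "behavior", "skill", "tool")

# "item" plus singular and (naively pluralised) plural forms of every type.
ROOT_TAGS = ("item",) + MEMORY_TYPES + tuple(f"{t}s" for t in MEMORY_TYPES)


def find_xml_boundaries(text: str) -> Optional[tuple[int, int]]:
    """Return (start, end) of the outermost whitelisted root element, or None."""
    best: Optional[tuple[int, int]] = None
    for tag in ROOT_TAGS:
        # Match <tag> or <tag attr="...">, but not a longer name such as <tags>.
        open_match = re.search(rf"<{re.escape(tag)}(\s[^>]*)?>", text)
        if open_match is None:
            continue
        close = f"</{tag}>"
        close_at = text.rfind(close)
        if close_at < open_match.start():
            continue  # no closing tag after the opening tag
        candidate = (open_match.start(), close_at + len(close))
        # Prefer the element that starts earliest, i.e. the outermost one.
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best
```

With a plural-only whitelist, a response wrapped in <event> or <tool> would find no root tag; here both are located and the surrounding chatter is excluded from the returned span.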

3. Link uncategorized items to a fallback category (8284ff1)

After fix 1, items with no matched category land in the DB but with no CategoryItem relation. LLM-mode retrieval joins items via that table (retrieve.py:_format_items_for_llm), so those items remained unreachable from the LLM path. (RAG-mode was unaffected because its recall_items does a global vector search.)
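To make the reachability gap concrete, here is a toy model of a retrieval path that reaches items only through category relations. The function and its arguments are hypothetical and much simpler than _format_items_for_llm; it only shows why an item without a relation row never surfaces.

```python
def items_reachable_via_categories(
    item_ids: list[str],
    category_items: list[tuple[str, str]],
    selected_category_ids: set[str],
) -> list[str]:
    """Keep only items linked to a selected category (toy relation join)."""
    # Each relation row is (item_id, category_id), standing in for CategoryItem.
    linked = {
        item_id
        for item_id, category_id in category_items
        if category_id in selected_category_ids
    }
    return [item_id for item_id in item_ids if item_id in linked]
```

An item that is stored but has no (item_id, category_id) row is filtered out regardless of which categories are selected, which is the situation fix 1 would otherwise create for uncategorized items.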

An uncategorized category is now auto-created at category init (memorize_config.enable_uncategorized_fallback, default True). Items whose extracted category names match nothing get linked to it. The category is seeded with a static summary at creation (equal to its description) because retrieve.py:_rank_categories_by_summary filters on cat.summary; the seeding goes through update_category() so it persists on SQLite and Postgres backends where get_or_create_category returns a detached or copied instance. Dynamic summary updates skip the fallback since aggregating heterogeneous items into one summary is noisy and wastes tokens.
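The linking rule can be sketched as a small pure function. The helper name resolve_category_ids, the name_to_id mapping, and the lower-casing of names are assumptions for illustration; only the category name uncategorized and the flag name enable_uncategorized_fallback come from this PR.

```python
FALLBACK_CATEGORY = "uncategorized"  # name used by this PR


def resolve_category_ids(
    extracted_names: list[str],
    name_to_id: dict[str, int],
    enable_uncategorized_fallback: bool = True,
) -> list[int]:
    """Map extracted category names to ids, falling back when nothing matches."""
    matched: list[int] = []
    for name in extracted_names:
        key = name.strip().lower()  # assumed normalisation
        if key in name_to_id and name_to_id[key] not in matched:
            matched.append(name_to_id[key])
    if matched:
        return matched

    # Nothing matched: link to the fallback category if it is enabled and exists.
    if enable_uncategorized_fallback and FALLBACK_CATEGORY in name_to_id:
        return [name_to_id[FALLBACK_CATEGORY]]
    return []
```

With the flag set to False the function returns an empty list for unmatched items, which mirrors the previous behaviour of leaving them without any category relation.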


🤔 Why is this change needed?

The first two bugs cause silent data loss during the core memorize() flow — extracted facts simply never reach storage and there is no log to attribute the loss to. The third change closes the gap so that data preserved by fix 1 is also retrievable through the LLM-mode path, not just RAG-mode.

Related discussion: the long-term memory drift reported in #381 is made strictly worse by silent drops, so fixing this is upstream of any drift work.


🔍 Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor / cleanup
  • Other

✅ PR Quality Checklist

  • PR title follows the conventional format (fix: ×3 commits)
  • Changes are limited in scope and easy to review (three atomic commits, ~50 source lines, ~330 test lines)
  • Documentation updated where applicable (config field has a description=; no public API docs needed)
  • No breaking changes — the new enable_uncategorized_fallback flag defaults to True and adds one uncategorized entry to list_memory_categories(); setting it to False restores the old behaviour exactly
  • Related issues or discussions linked (see "Why" section)

📌 Optional

  • Edge cases considered:
    • User pre-configures a CategoryConfig(name="uncategorized") → not duplicated, user's config wins
    • User's existing uncategorized has a summary → not overwritten (not cat.summary guard)
    • Reinforcement skip path: items being reinforced keep their original category links
    • All three DB backends (in-memory, SQLite, Postgres) verified via update_category() flow
  • Follow-up tasks (left for separate issues):
    • MemoryCategory.embedding is set at init but never consumed by retrieval (route_category re-embeds cat.summary instead)
    • Cold-start retrieve returns [] from _rank_categories_by_summary when no category has been summarised yet — the fallback's seeded summary partially mitigates this

23 new unit tests across tests/test_xml_parser.py and tests/test_uncategorized_fallback.py, no API key needed. The fallback tests use the in-memory backend with a deterministic stub LLM client, so they exercise the real category creation, item persistence, relation linking, and summary-update code paths end-to-end. make test reports 100 passed, 1 skipped; make check passes pre-commit, mypy, and deptry (the pre-existing uv.lock 1.5.0 vs pyproject.toml 1.5.1 mismatch on main is not touched by this PR).

Meur3ault added 3 commits May 15, 2026 14:09

The _parse_memory_element check required both <content> and <categories>
to be non-empty, but the memory-type prompts (profile, event, knowledge,
behavior) explicitly state that the categories field may be empty.
Memory items that did not match any pre-configured category were silently
dropped during memorize(). Now only <content> is required; an empty or
missing <categories> defaults to [].

The _find_xml_boundaries whitelist matched only "behaviors"/"events"/
"skills" (plural-only) and was missing "tool" entirely. Custom prompts
using singular semantic root tags (e.g. <event>, <skill>) or the <tool>
root caused the entire LLM response to be discarded. The whitelist now
covers both singular and plural forms of every MemoryType value.

…M-mode retrieval

Memory items whose LLM-extracted categories match none of the configured
ones were left without any CategoryItem relation. LLM-mode retrieval
joins items through that relation table (_format_items_for_llm), so
those items were unreachable from the LLM path even though they sat in
the database with a usable embedding.

An 'uncategorized' category is now auto-created at category init
(memorize_config.enable_uncategorized_fallback, default True). Items
with no matching configured category are linked to it. Summary updates
skip the fallback category since aggregating unrelated facts produces
noise and wastes tokens; its static description embedding is enough
for route_category to score it when nothing else matches.
@Meur3ault Meur3ault changed the title Fix/xml parser data loss fix(memorize): silent memory item drops in XML parser and LLM-mode retrieval May 15, 2026