fix: resilience improvements for retry logic, context trimming, blob listing, and logging #139

Merged
Dongbumlee merged 10 commits into v2 from v2-external-memory on Mar 19, 2026

Dongbumlee commented Mar 13, 2026

Purpose

  • Add Qdrant-backed shared memory for cross-agent context sharing across workflow steps (analysis -> design -> convert -> documentation)
  • Fix critical blob listing bug causing intermittent "files not found" failures across all steps
  • Fix singleton cache bug causing memory store to be closed/uninitialized on subsequent runs
  • Improve Azure OpenAI retry resilience for rate limits, transient errors, and context-length overflows
  • Reduce LLM context window usage by keeping only 15 recent messages (down from 30/50)
  • Convert all runtime print() to structured logging
  • Externalize ResultGenerator prompts from inline code to text files

Does this introduce a breaking change?

  • Yes
  • No

Golden Path Validation

  • I have tested the primary workflows (the "golden path") to ensure they function correctly without errors.

Deployment Validation

  • I have validated the deployment process successfully and all services are running as expected with this change.

What to Check

Verify that the following are valid

  • Blob listing fix: Agents can consistently list and find source/converted files (no more intermittent "folder is empty" errors)
  • Memory store persistence: "[MEMORY] Stored memory from" log lines appear across all steps, and "_flush_memory skipped — memory store not initialized" does not appear
  • Cross-step memory injection: "[MEMORY] Injecting N memories for {Agent}" appears in the design/convert/documentation steps
  • Memory cleanup: "[MEMORY] Workflow complete — closing memory store (N memories)" appears with N > 0
  • All 4 steps complete with PASS sign-offs on consecutive runs (not just first run)
  • No [MEMORY] _embed skipped — client=None warnings
  • Retry behavior: [AOAI_RETRY] logs show exponential backoff with 5s base delay
  • Step timings show single yaml key (no duplicate yaml_conversion)
  • Documentation reports use inline hyperlinks, not Markdown footnotes
  • SHARED_MEMORY_ENABLED=false falls back to original v2 behavior without errors

Key Changes

Critical Bug Fixes

  • Fix list_blobs_in_container trailing-slash bug — Non-recursive listing returned empty when folder_path lacked trailing /. Both name_starts_with prefix and relative_path computation now use normalized path with trailing /
  • Fix singleton instance cache for QdrantMemoryStore: the AppContext._instances cache held the old closed store from previous runs, causing _initialized=False on all subsequent runs. The cache is now cleared before re-registering
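
The trailing-slash fix described above can be illustrated with a minimal sketch. It does not call the Azure SDK; it filters an in-memory list of blob names, and the helper names normalize_folder_prefix and list_direct_children are hypothetical, not the functions in this PR. The point it shows is that the same normalized prefix has to be used both for matching and for computing the relative path.

```python
from typing import Iterable, List


def normalize_folder_prefix(folder_path: str) -> str:
    """Return folder_path with exactly one trailing '/' (empty stays empty)."""
    stripped = folder_path.rstrip("/")
    return f"{stripped}/" if stripped else ""


def list_direct_children(blob_names: Iterable[str], folder_path: str) -> List[str]:
    """Non-recursive listing over plain blob names (stand-in for a real container)."""
    prefix = normalize_folder_prefix(folder_path)
    children = []
    for name in blob_names:
        # Stand-in for the name_starts_with filter of a real listing call.
        if not name.startswith(prefix):
            continue
        # Slice with the normalized prefix; slicing with "proc/source"
        # (no slash) would leave "/a.yaml", which looks nested and is dropped.
        relative_path = name[len(prefix):]
        if relative_path and "/" not in relative_path:
            children.append(relative_path)
    return children
```

With blobs named proc/source/a.yaml, proc/source/sub/b.yaml and proc/source-old/c.yaml, both "proc/source" and "proc/source/" list only a.yaml, and the sibling folder proc/source-old is no longer matched by the prefix.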

Retry and Resilience

  • Retry config: 8 retries, 5s base delay, 120s max delay (previously 5 retries, 2s base, 30s max)
  • Cooldown delays on context-trim retries
  • Retry transient errors: empty messages, 5xx server errors
  • Embedding retry (3 retries with exponential backoff) in QdrantMemoryStore
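
As a rough illustration of the retry settings listed above, the sketch below computes a capped exponential delay and wraps a call in a retry loop. The doubling factor, the absence of jitter, and the names backoff_delay, call_with_retry and is_retryable are assumptions made for the example; the PR only states 8 retries, a 5s base, a 120s cap, and exponential backoff.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 5.0, cap: float = 120.0) -> float:
    """Delay before retry number `attempt` (0-based), doubling up to `cap`."""
    return min(cap, base * (2 ** attempt))


def call_with_retry(
    fn: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_retries: int = 8,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying retryable failures up to max_retries times."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as exc:
            # Give up on non-retryable errors or when retries are exhausted.
            if attempt >= max_retries or not is_retryable(exc):
                raise
            sleep(backoff_delay(attempt))
            attempt += 1
```

Under these assumptions the eight delays are 5, 10, 20, 40, 80, 120, 120 and 120 seconds. Passing sleep as a parameter keeps the loop testable without real waiting.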

Context Window

  • Remove tool-result truncation; only summarize blob writes
  • keep_last_messages 30 -> 15; disable per-message truncation
  • Protect last message from truncation
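
The history trimming above can be sketched as follows. The message shape (dicts with "role" and "content") and the function name trim_history are assumptions for illustration, as is the choice to always keep system messages, which mirrors the note elsewhere in this PR that workspace context in system instructions is never trimmed. Message contents are left untouched, in line with per-message truncation being disabled.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(messages: List[Message], keep_last_messages: int = 15) -> List[Message]:
    """Keep all system messages plus the most recent non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # rest[-0:] would return the whole list, so handle zero explicitly.
    recent = rest[-keep_last_messages:] if keep_last_messages > 0 else []
    return system + recent
```

For a history of one system message followed by 40 turns, this returns 16 messages: the system message and the last 15 turns, each with its full content.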

Logging and Telemetry

  • Fix duplicate yaml_conversion/yaml telemetry key
  • Convert all print() to logger across 8+ files; remove text2art
  • Add diagnostic logging for memory storage failures (_embed, _flush_memory, invoked, flush)
  • Clear client cache between processes
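
A minimal sketch of the print() to logger conversion, using only the standard logging module. The logger name, the function report_memory_count, and the exact message text are illustrative assumptions, not the log lines emitted by this PR.

```python
import logging

logger = logging.getLogger("memory_demo")  # hypothetical logger name


def report_memory_count(step: str, count: int) -> None:
    """Log a memory count for a workflow step."""
    # Before: print(f"[MEMORY] {step} complete: {count} memories")
    # Lazy %-style arguments skip formatting when INFO is disabled.
    logger.info("[MEMORY] %s complete: %d memories", step, count)


def report_embed_failure(exc: Exception) -> None:
    """Log a failure with its traceback instead of printing it."""
    logger.error("[MEMORY] embedding failed: %s", exc, exc_info=exc)
```

Unlike print(), these calls respect the configured log level and handlers, so the same messages can be routed to a console, a file, or a telemetry sink without touching the call sites.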

Other

  • Prohibit Markdown footnotes in documentation reports
  • Workspace context injected into agent system instructions

Other Information

Configuration:

  • Enabled by default (SHARED_MEMORY_ENABLED=true)
  • Requires AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME in App Config (added to Bicep)
  • Gracefully degrades: if embedding deployment is missing, workflow runs without memory
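
The configuration rules above could be checked with something like the sketch below. The two setting names come from this PR; the function name shared_memory_active, the accepted "false" spellings, and reading both values from a single mapping are assumptions for the example.

```python
import os
from typing import Mapping, Optional


def shared_memory_active(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when shared memory should be used for this run."""
    env = os.environ if env is None else env
    # Enabled by default; only an explicit opt-out disables it.
    flag = env.get("SHARED_MEMORY_ENABLED", "true").strip().lower()
    if flag in {"false", "0", "no", "off"}:
        return False
    # Graceful degradation: without an embedding deployment, run without memory.
    deployment = env.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME", "").strip()
    return bool(deployment)
```

Returning False in both the opt-out and the missing-deployment case lets the caller take one code path for "no memory", which matches the fallback to the original v2 behavior.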

- QdrantMemoryStore: in-process Qdrant embedded vector store, per-process isolation
- SharedMemoryContextProvider: ContextProvider that reads/writes shared memories
  - invoking(): queries Qdrant for relevant memories before each LLM call
  - invoked(): stores agent responses into shared memory after each LLM call
- OrchestratorBase: auto-initializes memory store + attaches provider to expert agents
- Enabled by default, controlled via SHARED_MEMORY_ENABLED env var
- Requires AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME for embedding generation
- mem0_async_memory: reduced max_tokens from 100K to 4K for extraction calls
- All 77 existing tests pass
… tests

- MigrationProcessor: creates QdrantMemoryStore at workflow start, disposes in finally
- Memory persists across all 4 steps (analysis→design→convert→documentation)
- OrchestratorBase: resolves memory from AppContext instead of creating its own
- SharedMemoryContextProvider: fix duck typing for isinstance checks
- 18 tests for QdrantMemoryStore (init, add, search, workflow lifecycle)
- 20 tests for SharedMemoryContextProvider (invoking, invoked, edge cases)
- All 115 tests pass (77 existing + 38 new)
- SharedMemoryContextProvider: log inject count + stored content per agent turn
- MigrationProcessor: log total memory count after each step completes
- Enables real-time monitoring of memory flow across workflow steps
- Fix: use get_bearer_token_provider() instead of async variant (AzureCliCredential await error)
- Add print() statements for memory init diagnostics (embedding deployment found/missing/failed)
- Tested locally: 20 memories across 4 steps, workflow completed successfully in 19m 25s
- Workspace context injected into agent system instructions (never trimmed)
- keep_last_messages reduced 50→20, max_total_chars 600K→400K
- ResultGenerator prompts moved to prompt_resultgenerator.txt (4 steps)
- Step transition phase shows 'Initializing {Step}' instead of step name
- flush_agent_memories() fixed: use agent.context_provider.providers
- Guard against uninitialized store in _flush_memory()
- Same-step memory skip (only search cross-step memories)
- Buffered storage (only last response per agent stored)
- Debug log for memory store resolution per step
- Tested: 17m 23s with keep_last_messages=20, all 4 steps PASS
…WARE ROUTING

- Added rule 6: route to YAML Expert if their sign-off is PENDING before Chief Architect finalizes
- Same pattern as Chief Architect PENDING fix in design coordinator
… UI fixes

- Bicep: add text-embedding-3-large model deployment (capacity 500) alongside GPT5.1
- Bicep: add AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME to App Config keys
- mem0_async_memory: replace hardcoded endpoints with env vars (AZURE_OPENAI_*)
- keep_last_messages adjusted 20→30 for analysis step stability
- Analysis executor: phase shows 'Initializing Analysis' instead of 'Analysis'
- Test assertions updated for new phase name
…fix, logging

- Fix list_blobs_in_container trailing-slash bug causing intermittent 'files not found'
- Remove tool-result truncation; only summarize save_content_to_blob writes
- Protect last message from per-message truncation
- Increase retry config: 8 retries, 5s base, 120s max with exponential backoff
- Add cooldown delay on context-trim retries to avoid triggering 429s
- Retry transient errors: empty messages, 5xx server errors
- Add embedding retry logic (3 retries) in QdrantMemoryStore
- Reduce keep_last_messages 30->15; disable per-message truncation
- Fix duplicate yaml_conversion/yaml telemetry key
- Clear OrchestratorBase._client_cache between processes
- Convert all runtime print() to logger.info/error/warning
- Remove text2art dependency
- Add debug logging to SharedMemoryContextProvider invoked/flush
- Prohibit Markdown footnotes in documentation reports
- Add diagnostic logging for _embed and _flush_memory failures
Dongbumlee changed the title from "V2 external memory" to "fix: resilience improvements for retry logic, context trimming, blob listing, and logging" on Mar 19, 2026
Dongbumlee merged commit c09aa8c into v2 on Mar 19, 2026
3 of 4 checks passed