fix: resilience improvements for retry logic, context trimming, blob listing, and logging by Dongbumlee · Pull Request #139 · microsoft/Container-Migration-Solution-Accelerator

Dongbumlee · 2026-03-13T21:43:53Z

Purpose

Add Qdrant-backed shared memory for cross-agent context sharing across workflow steps (analysis -> design -> convert -> documentation)
Fix critical blob listing bug causing intermittent "files not found" failures across all steps
Fix singleton cache bug causing memory store to be closed/uninitialized on subsequent runs
Improve Azure OpenAI retry resilience for rate limits, transient errors, and context-length overflows
Reduce LLM context window usage by keeping only 15 recent messages (down from 30/50)
Convert all runtime print() to structured logging
Externalize ResultGenerator prompts from inline code to text files

Does this introduce a breaking change?

Yes
No

Golden Path Validation

I have tested the primary workflows (the "golden path") to ensure they function correctly without errors.

Deployment Validation

I have validated the deployment process successfully and all services are running as expected with this change.

What to Check

Verify that the following are valid

Blob listing fix: Agents can consistently list and find source/converted files (no more intermittent "folder is empty" errors)
Memory store persistence: [MEMORY] Stored memory from logs appear across all steps, NOT _flush_memory skipped — memory store not initialized
Cross-step memory injection: [MEMORY] Injecting N memories for {Agent} in design/convert/documentation steps
Memory cleanup: [MEMORY] Workflow complete — closing memory store (N memories) with N > 0
All 4 steps complete with PASS sign-offs on consecutive runs (not just first run)
No [MEMORY] _embed skipped — client=None warnings
Retry behavior: [AOAI_RETRY] logs show exponential backoff with 5s base delay
Step timings show single yaml key (no duplicate yaml_conversion)
Documentation reports use inline hyperlinks, not Markdown footnotes
SHARED_MEMORY_ENABLED=false falls back to original v2 behavior without errors

Key Changes

Critical Bug Fixes

Fix list_blobs_in_container trailing-slash bug — Non-recursive listing returned empty when folder_path lacked trailing /. Both name_starts_with prefix and relative_path computation now use normalized path with trailing /
Fix singleton instance cache for QdrantMemoryStore — AppContext._instances cache held the old closed store from previous runs, causing _initialized=False on all subsequent runs. Now cleared before re-registering

Retry and Resilience

Retry config: 8 retries, 5s base, 120s max (was 5/2s/30s)
Cooldown delays on context-trim retries
Retry transient errors: empty messages, 5xx server errors
Embedding retry (3 retries with exponential backoff) in QdrantMemoryStore

Context Window

Remove tool-result truncation; only summarize blob writes
keep_last_messages 30 -> 15; disable per-message truncation
Protect last message from truncation

Logging and Telemetry

Fix duplicate yaml_conversion/yaml telemetry key
Convert all print() to logger across 8+ files; remove text2art
Add diagnostic logging for memory storage failures (_embed, _flush_memory, invoked, flush)
Clear client cache between processes

Other

Prohibit Markdown footnotes in documentation reports
Workspace context injected into agent system instructions

Other Information

Configuration:

Enabled by default (SHARED_MEMORY_ENABLED=true)
Requires AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME in App Config (added to Bicep)
Gracefully degrades: if embedding deployment is missing, workflow runs without memory

- QdrantMemoryStore: in-process Qdrant embedded vector store, per-process isolation - SharedMemoryContextProvider: ContextProvider that reads/writes shared memories - invoking(): queries Qdrant for relevant memories before each LLM call - invoked(): stores agent responses into shared memory after each LLM call - OrchestratorBase: auto-initializes memory store + attaches provider to expert agents - Enabled by default, controlled via SHARED_MEMORY_ENABLED env var - Requires AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME for embedding generation - mem0_async_memory: reduced max_tokens from 100K to 4K for extraction calls - All 77 existing tests pass

… tests - MigrationProcessor: creates QdrantMemoryStore at workflow start, disposes in finally - Memory persists across all 4 steps (analysis→design→convert→documentation) - OrchestratorBase: resolves memory from AppContext instead of creating its own - SharedMemoryContextProvider: fix duck typing for isinstance checks - 18 tests for QdrantMemoryStore (init, add, search, workflow lifecycle) - 20 tests for SharedMemoryContextProvider (invoking, invoked, edge cases) - All 115 tests pass (77 existing + 38 new)

- SharedMemoryContextProvider: log inject count + stored content per agent turn - MigrationProcessor: log total memory count after each step completes - Enables real-time monitoring of memory flow across workflow steps

- Fix: use get_bearer_token_provider() instead of async variant (AzureCliCredential await error) - Add print() statements for memory init diagnostics (embedding deployment found/missing/failed) - Tested locally: 20 memories across 4 steps, workflow completed successfully in 19m 25s

- Workspace context injected into agent system instructions (never trimmed) - keep_last_messages reduced 50→20, max_total_chars 600K→400K - ResultGenerator prompts moved to prompt_resultgenerator.txt (4 steps) - Step transition phase shows 'Initializing {Step}' instead of step name - flush_agent_memories() fixed: use agent.context_provider.providers - Guard against uninitialized store in _flush_memory() - Same-step memory skip (only search cross-step memories) - Buffered storage (only last response per agent stored) - Debug log for memory store resolution per step - Tested: 17m 23s with keep_last_messages=20, all 4 steps PASS

…WARE ROUTING - Added rule 6: route to YAML Expert if their sign-off is PENDING before Chief Architect finalizes - Same pattern as Chief Architect PENDING fix in design coordinator

… UI fixes - Bicep: add text-embedding-3-large model deployment (capacity 500) alongside GPT5.1 - Bicep: add AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME to App Config keys - mem0_async_memory: replace hardcoded endpoints with env vars (AZURE_OPENAI_*) - keep_last_messages adjusted 20→30 for analysis step stability - Analysis executor: phase shows 'Initializing Analysis' instead of 'Analysis' - Test assertions updated for new phase name

…fix, logging - Fix list_blobs_in_container trailing-slash bug causing intermittent 'files not found' - Remove tool-result truncation; only summarize save_content_to_blob writes - Protect last message from per-message truncation - Increase retry config: 8 retries, 5s base, 120s max with exponential backoff - Add cooldown delay on context-trim retries to avoid triggering 429s - Retry transient errors: empty messages, 5xx server errors - Add embedding retry logic (3 retries) in QdrantMemoryStore - Reduce keep_last_messages 30->15; disable per-message truncation - Fix duplicate yaml_conversion/yaml telemetry key - Clear OrchestratorBase._client_cache between processes - Convert all runtime print() to logger.info/error/warning - Remove text2art dependency - Add debug logging to SharedMemoryContextProvider invoked/flush - Prohibit Markdown footnotes in documentation reports - Add diagnostic logging for _embed and _flush_memory failures

Dongbumlee added 9 commits March 12, 2026 22:57

chore: add INFO-level memory tracking logs

4260c42

- SharedMemoryContextProvider: log inject count + stored content per agent turn - MigrationProcessor: log total memory count after each step completes - Enables real-time monitoring of memory flow across workflow steps

fix: resolve ruff E402 import ordering in migration_processor

3a6a793

fix: rename ambiguous variable 'l' to 'line' in mcp_mermaid.py (E741)

662b8b3

fix: add YAML Expert PENDING detection in convert coordinator STATE-A…

ae1300e

…WARE ROUTING - Added rule 6: route to YAML Expert if their sign-off is PENDING before Chief Architect finalizes - Same pattern as Chief Architect PENDING fix in design coordinator

Dongbumlee requested review from Avijit-Microsoft, Prajwal-Microsoft, Roopan-Microsoft, Vinay-Microsoft, aniaroramsft, dgp10801, nchandhi, sethsteenken and toherman-msft as code owners March 13, 2026 21:43

Dongbumlee changed the title ~~V2 external memory~~ fix: resilience improvements for retry logic, context trimming, blob listing, and logging Mar 19, 2026

Dongbumlee merged commit c09aa8c into v2 Mar 19, 2026
3 of 4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: resilience improvements for retry logic, context trimming, blob listing, and logging#139

fix: resilience improvements for retry logic, context trimming, blob listing, and logging#139
Dongbumlee merged 10 commits intov2from
v2-external-memory

Dongbumlee commented Mar 13, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Dongbumlee commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Does this introduce a breaking change?

Golden Path Validation

Deployment Validation

What to Check

Key Changes

Critical Bug Fixes

Retry and Resilience

Context Window

Logging and Telemetry

Other

Other Information

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Dongbumlee commented Mar 13, 2026 •

edited

Loading