This document describes the RAG (Retrieval-Augmented Generation) query/response process of the chat-core system. This system provides a pipeline that finds relevant documents for user queries via Vector DB search and generates answers using an LLM.
- Multi-turn Conversation Support: Context-based responses by saving conversation history in MongoDB.
- Query Rewriting: Rewriting queries as standalone questions considering the conversation context.
- Semantic Caching: Performance optimization through caching search results.
- Forbidden Word Check: Security policy-based query filtering.
- Streaming Response: Real-time response via Server-Sent Events (SSE).
- MCP Integration: Support for external tool calls (optional).
```mermaid
graph TB
    Client[Client]
    API[FastAPI Endpoint]
    ChatManager[MultiturnStreamingChatManager]
    subgraph DM[Conversation Management]
        MongoDB[MongoDB]
        PromptLoader[Prompt Loader]
    end
    subgraph QP[Query Processing]
        QueryManager[QueryAndCacheManager]
        ForbiddenChecker[Forbidden Word Checker]
    end
    subgraph SS[Search System]
        CacheClient[Semantic Cache]
        VectorDB[Vector DB]
        MCPManager[MCP Manager]
    end
    subgraph RG[Response Generation]
        LLM[LLM]
        StreamFilter[Stream Filter]
    end
    Client -->|POST Request| API
    API --> ChatManager
    ChatManager --> MongoDB
    ChatManager --> PromptLoader
    ChatManager --> QueryManager
    ChatManager --> ForbiddenChecker
    ChatManager --> CacheClient
    ChatManager --> VectorDB
    ChatManager --> MCPManager
    ChatManager --> LLM
    LLM --> StreamFilter
    StreamFilter -->|SSE Stream| Client
    style Client fill:#e1f5ff
    style API fill:#fff4e1
    style ChatManager fill:#ffe1e1
    style LLM fill:#e1ffe1
```
```mermaid
sequenceDiagram
    participant Client as Client
    participant API as FastAPI
    participant Manager as ChatManager
    participant MongoDB as MongoDB
    participant QueryMgr as Query Manager
    participant Forbidden as Forbidden Word Checker
    participant Cache as Cache Service
    participant VectorDB as Vector DB
    participant LLM as LLM
    Client->>API: POST /v2/rag/query
    API->>Manager: Process request
    Manager->>MongoDB: Load conversation history
    MongoDB-->>Manager: Return history
    Manager->>MongoDB: Add system prompt
    Manager->>Manager: Extract conversation messages
    Manager->>QueryMgr: Request query rewrite
    QueryMgr->>LLM: Context-based rewrite
    LLM-->>QueryMgr: Rewritten query list
    QueryMgr-->>Manager: Return sub_queries
    opt Forbidden Word Check Enabled
        Manager->>Forbidden: Execute forbidden word check
        Forbidden->>VectorDB: Qdrant similarity search
        VectorDB-->>Forbidden: Detection results
        opt LLM Verification Mode
            Forbidden->>LLM: Request LLM verification
            LLM-->>Forbidden: Determine block/allow
        end
        Forbidden-->>Manager: Check result
        alt Forbidden Word Detected
            Manager-->>Client: Send search restriction notice
        end
    end
    loop For each rewritten query
        Manager->>Cache: Check cache
        alt Cache Hit
            Cache-->>Manager: Cached results
        else Cache Miss
            Manager->>VectorDB: Perform vector search
            VectorDB-->>Manager: Search results
            Manager->>Cache: Cache results
        end
    end
    Manager->>Manager: Integrate search results
    Manager->>LLM: RAG Prompt + Documents
    LLM-->>Manager: Streaming response
    Manager->>Manager: Filter SOURCES_USED
    loop Response Tokens
        Manager-->>Client: Send SSE token
    end
    Manager->>MongoDB: Save conversation block
    Manager->>MongoDB: Update statistics
    Manager-->>Client: Completion message
```
`MultiturnStreamingChatManager` is the core class that manages conversation sessions.
Key Responsibilities:
- Conversation history management (MongoDB)
- System prompt loading (Redis → MongoDB → File)
- Query rewriting and search execution
- LLM response generation and streaming
- Statistics collection
Key Methods:
| Method | Description |
|---|---|
| `__init__` | Session initialization, OpenAI client and DB connection setup |
| `add_room_system_prompt` | Adds the system prompt to the chat room |
| `get_history` | Loads conversation history from MongoDB |
| `stream_response_with_openai_streaming_RAG` | Generates the RAG streaming response (main process) |
| `search_documents` | Executes document search (rewriting, forbidden word check, vector search) |
| `execute_tool_call` | Executes a tool call (local_search, MCP, etc.) |
| `filter_streaming_sources_used_realtime` | Filters the SOURCES_USED marker |
| `statistics` | Saves token usage statistics |
`QueryAndCacheManager` is the class responsible for rewriting queries.
Features:
- Context-aware query rewriting
- Utilizes the last 4 conversation histories
- Generates a standalone question or a list of sub-queries
Prompt Template: `rewrite_query` (loaded from Redis/MongoDB/File)

Rewrite Example:
- Input: "Who is he?" (previous history: "Tell me about Chul-soo Kim")
- Output: `["Who is Chul-soo Kim?"]`
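The rewriting step can be sketched as follows. This is a minimal illustration, not the actual implementation: the helper name and template literal are assumptions, while the "last 4 turns" window matches the note above. The real template is the `rewrite_query` prompt loaded from Redis/MongoDB/File.

```python
# Hypothetical sketch of rewrite-prompt assembly (template text is an
# assumption; the production template is loaded, not hard-coded).
REWRITE_TEMPLATE = (
    "[Conversation History]\n{history}\n\n"
    "[Current Question]\n{query}\n\n"
    "[Instructions]\n"
    "Considering the conversation context above, rewrite the current "
    "question into a standalone question. If necessary, break it down "
    "into multiple sub-queries."
)

def build_rewrite_prompt(history: list, query: str, last_n: int = 4) -> str:
    """Format the last `last_n` turns plus the current query for the LLM."""
    lines = [f"{m['role']}: {m['content']}" for m in history[-last_n:]]
    return REWRITE_TEMPLATE.format(history="\n".join(lines), query=query)
```

The resulting prompt is sent to the LLM, which returns the sub-query list shown in the example above.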
2-Step Verification Process:
1. Qdrant Vector Search
   - Similarity search within the forbidden word collection
   - Detects forbidden words when similarity exceeds the threshold (default 0.3)
   - Match types: exact (0.95+), synonym (0.80+), contextual (0.50+)
2. LLM Verification (optional, `forbidden_check_mode=llm`)
   - The LLM re-verifies the Qdrant detection results
   - Block criteria:
     - Standalone use of a forbidden word (e.g., "person", "executive")
     - Forbidden word + inquiry keyword (e.g., "Who is the executive?", "How much is the salary?")
   - Allow criteria:
     - Forbidden word + procedure/method keyword (e.g., "Executive appointment procedure", "How to apply for salary")
     - False positive (morphological match only, different context)

Behavior upon Detection:
- Blocks the search
- Generates a `forbidden_guidance` message
- Sends a restriction notice to the client
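The decision logic above can be illustrated with a small sketch. The threshold bands mirror the match types listed; the keyword list is a crude stand-in for what the real LLM verification decides semantically, so treat both functions as assumptions, not the production rules.

```python
# Illustrative two-step forbidden-word decision (keyword list is a stand-in
# for the LLM's semantic judgment).
def classify_match(score: float):
    """Map a Qdrant similarity score to the match types listed above."""
    if score >= 0.95:
        return "exact"
    if score >= 0.80:
        return "synonym"
    if score >= 0.50:
        return "contextual"
    return None

PROCEDURE_KEYWORDS = ("procedure", "how to apply", "method")

def verification_verdict(query: str, detected: bool) -> str:
    """Crude stand-in for the LLM block/allow decision."""
    if not detected:
        return "allow"
    if any(k in query.lower() for k in PROCEDURE_KEYWORDS):
        return "allow"  # forbidden word + procedure/method keyword
    return "block"      # standalone use or inquiry keyword
```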
CacheServiceClient:
- Communicates with the KDB Manager's cache API
- Caches search results based on vector similarity
- Cache hits for identical or similar questions
Operation Flow:
```mermaid
graph LR
    A[Query Input] --> B{Check Cache}
    B -->|Hit| C[Return Cache]
    B -->|Miss| D[Vector Search]
    D --> E[Cache Results]
    E --> F[Return Results]
    C --> F
```
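The flow above reduces to a cache-aside pattern. In the real system the cache is the KDB Manager's semantic cache keyed by vector similarity; in this minimal sketch a plain dict keyed by the exact query string stands in:

```python
# Cache-aside sketch of the semantic cache flow (dict stands in for the
# similarity-keyed KDB cache service).
def search_with_cache(query: str, cache: dict, vector_search):
    """Return cached results on a hit; otherwise search and populate."""
    if query in cache:              # Hit -> Return Cache
        return cache[query]
    results = vector_search(query)  # Miss -> Vector Search
    cache[query] = results          # Cache Results
    return results
```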
`neural_search_rerank_paragraph_v1`:
- Calls the KDB API `/api/v1/document/search_rerank`
- Parameters:
  - `collection_name`: Collection name
  - `query`: Search query
  - `use_paragraph`: Whether to search by paragraph
  - `metadata_filter_key`: Metadata filter key
  - `match_values`: List of filter values
  - `top_k`: Maximum number of search results (default 100)
Search Result Format:
```json
[
  {
    "p_id": "paragraph_id",
    "context": "Document content",
    "score": 0.85,
    "metadatas": {
      "file_name": "DocumentName.pdf",
      "class": "Category",
      "paragraph_type": "normal"
    }
  }
]
```

RAG Prompt Structure:

```
[System Prompt]
<system message>

[User Query]
Please answer the question using the references below.

[References]
Reference [1] : [Document Name]
<Document Content>

Reference [2] : [Document Name]
<Document Content>

[Question]
<user query>
```
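As a sketch, the reference section can be assembled from search results in the format shown above. The builder name and exact spacing are illustrative assumptions, not the production implementation:

```python
# Hypothetical RAG-prompt builder matching the structure shown above.
def build_rag_prompt(query: str, results: list) -> str:
    """Number each search result as 'Reference [n]' and append the question."""
    refs = []
    for i, r in enumerate(results, start=1):
        refs.append(
            f"Reference [{i}] : {r['metadatas']['file_name']}\n{r['context']}"
        )
    return (
        "[User Query]\n"
        "Please answer the question using the references below.\n\n"
        "[References]\n" + "\n\n".join(refs) + "\n\n"
        f"[Question]\n{query}"
    )
```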
Streaming Processing:
- OpenAI API `stream=True` mode
- Responses in SSE (Server-Sent Events) format
- Filters out the `SOURCES_USED:` marker
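A sketch of the marker filter follows. The class name is an assumption; the tail-buffer idea matches the buffer-management note later in this document (up to 30 characters are withheld so the marker is caught even when split across chunks):

```python
# Streaming filter sketch: withhold a small tail so "SOURCES_USED:" never
# leaks to the client even when the marker is split across chunks.
MARKER = "SOURCES_USED:"

class SourcesFilter:
    def __init__(self, tail: int = 30):
        self.tail = tail
        self.buf = ""
        self.seen_marker = False

    def feed(self, chunk: str) -> str:
        """Return the portion of the stream that is safe to send now."""
        self.buf += chunk
        if self.seen_marker:
            return ""  # everything after the marker is reference numbers
        idx = self.buf.find(MARKER)
        if idx != -1:
            out, self.buf = self.buf[:idx], self.buf[idx + len(MARKER):]
            self.seen_marker = True
            return out
        out, self.buf = self.buf[:-self.tail], self.buf[-self.tail:]
        return out

    def finish(self) -> tuple:
        """Flush the buffer; parse reference numbers if a marker was seen."""
        if self.seen_marker:
            refs = [int(n) for n in self.buf.replace(",", " ").split()
                    if n.isdigit()]
            return "", refs
        return self.buf, []
```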
Client request model:
```python
{
    "collection_name": str,              # Vector DB collection
    "query": str,                        # User query
    "room_id": str,                      # Chat room ID
    "user_id": str,                      # User ID
    "search_type": str,                  # "rag" | "gpt" | "pa"
    "mcp_org": str,                      # MCP organization (default "GO_RAG")
    "use_paragraph": bool,               # Use paragraph search
    "metadata_filter_key": str,          # Metadata filter key
    "metadata_filter_values": List[str], # Filter values
    "top_k": int,                        # Number of search results (default 100)
    "temperature": float,                # LLM temperature (default 0.2)
    "is_test": bool                      # Test mode
}
```

Conversation block (saved in MongoDB):
```python
{
    "version": "v3",
    "status": "done" | "error" | "idle",
    "block_classes": List[str],          # Filter classes
    "search_type": "rag" | "gpt" | "pa",
    "conversations": [
        {
            "role": "user" | "assistant" | "system" | "tool",
            "content": str,
            "tool_calls": List[tool],
            "tool_call_id": str,
            "name": str,
            "created_at": str            # ISO 8601
        }
    ],
    "room_id": str,
    "chat_id": str,
    "user_id": str,
    "is_tool_block": bool,
    "is_error": bool,
    "result": {
        "type": "done" | "error",
        "final_response": str,
        "references": List[dict],
        "usage": {
            "prompt_tokens": int,
            "completion_tokens": int,
            "total_tokens": int
        },
        "search_decision": {
            "rewritten_query": List[str]
        }
    }
}
```

SSE event types:

| Type | Description | Example |
|---|---|---|
| `step` | Process step | `{"type": "step", "id": "uuid", "content": "Searching for relevant documents..."}` |
| `debug` | Debug info | `{"type": "debug", "id": "uuid", "content": "Query rewriting complete"}` |
| `think` | Reasoning content | `{"type": "think", "id": "uuid", "content": "Thinking..."}` |
| `token` | Response token | `{"type": "token", "content": "Hello"}` |
| `done` | Completion | `{"type": "done", "final_response": "...", "usage": {...}, "references": [...]}` |
| `error` | Error | `{"type": "error", "content": "Error message", "detail": "..."}` |
```
data: {"type": "step", "id": "...", "content": "Understanding the intent of the user's question..."}
data: {"type": "debug", "id": "...", "content": "Query rewriting complete: \"What is a RAG system?\""}
data: {"type": "step", "id": "...", "content": "Searching for relevant documents..."}
data: {"type": "step", "id": "...", "content": "Generating an answer..."}
data: {"type": "token", "content": "The "}
data: {"type": "token", "content": "RAG system"}
...
data: {"type": "done", "search_type": "rag", "final_response": "...", "usage": {...}, "references": [...]}
```
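A client might consume such a stream as sketched below. This covers parsing only; in practice the `data:` lines would arrive from an HTTP response opened with streaming enabled, and the handler name is hypothetical:

```python
# Minimal SSE-line handler: fold "token" events into the answer text and
# capture the final "done" payload.
import json

def handle_sse_lines(lines):
    tokens, final = [], None
    for line in lines:
        if not line.startswith("data: "):
            continue
        event = json.loads(line[len("data: "):])
        if event.get("type") == "token":
            tokens.append(event["content"])
        elif event.get("type") == "done":
            final = event
    return "".join(tokens), final
```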
```mermaid
graph TD
    A[User Query Input] --> B{History Exists?}
    B -->|No| C[Return Original Query]
    B -->|Yes| D{Simple Greeting/Closing?}
    D -->|Yes| C
    D -->|No| E[Extract Last 4 Conversations]
    E --> F[Generate Rewrite Prompt]
    F --> G[Call LLM]
    G --> H{Rewrite Successful?}
    H -->|No| C
    H -->|Yes| I[Extract Sub-query List]
    I --> J{Empty List?}
    J -->|Yes| C
    J -->|No| K[Return Rewritten Queries]
```
Rewrite Prompt Structure:

```
[Conversation History]
user: Previous Question 1
assistant: Previous Answer 1
user: Previous Question 2
assistant: Previous Answer 2

[Current Question]
<user query>

[Instructions]
Considering the conversation context above, rewrite the current question into a standalone question.
If necessary, break it down into multiple sub-queries.
```

Response Format:

```json
{"queries": ["Rewritten Query 1", "Rewritten Query 2", ...]}
```
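Given that response format, a defensive parser with the fallback behavior from the flow above (any failure or an empty list returns the original query) might look like this; the function name is an assumption:

```python
# Hypothetical rewrite-response parser: every failure mode falls back to
# the original query, matching the flowchart above.
import json

def parse_rewrite_response(raw: str, original_query: str) -> list:
    try:
        queries = json.loads(raw).get("queries", [])
    except (json.JSONDecodeError, AttributeError):
        return [original_query]  # malformed or non-object response
    return queries if queries else [original_query]  # empty list fallback
```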
```mermaid
graph TD
    A["Query Input (Single/Multiple)"] --> B[Qdrant Vector Search]
    B --> C{Similarity >= Threshold?}
    C -->|No| D[Safe Query]
    C -->|Yes| E[Forbidden Word Detected]
    E --> F{LLM Verification Enabled?}
    F -->|No| G[Confirm Forbidden Word]
    F -->|Yes| H[Generate LLM Verification Prompt]
    H --> I[Call LLM]
    I --> J{LLM Decision}
    J -->|Block| K[Restrict Search]
    J -->|Allow| L[Allow Search]
    K --> M[Generate forbidden_guidance]
    M --> N[Return Restriction Notice]
```
LLM Verification Logic:
1. Step 1: Verify Qdrant Detection
   - Check for false positives (extremely limited)
   - Block decision:
     - Standalone use of a forbidden word
     - Forbidden word + inquiry keyword
   - Allow decision:
     - Forbidden word + procedure/method keyword
2. Step 2: Verify Personal Sensitive Information (if Step 1 passes)
   - Block patterns:
     - Possessive + sensitive info inquiry
     - Standalone sensitive info inquiry
   - Allow patterns:
     - Sensitive info + method/procedure inquiry
```mermaid
graph TD
    A[Rewritten Query List] --> B[Iterate Each Query]
    B --> C{Check Cache}
    C -->|Hit| D[Use Cached Result]
    C -->|Miss| E{MCP Enabled?}
    E -->|Yes| F[Call MCP Search Tool]
    E -->|No| G[Call Local Search Function]
    F --> H[Receive Search Results]
    G --> H
    H --> I[Cache Results]
    I --> J[Add to Integrated Result List]
    D --> J
    J --> K{Next Query Exists?}
    K -->|Yes| B
    K -->|No| L[Return Final Search Results]
```
Determining the Search Tool:
- If MCP is connected: MCP tools such as `neural_search`
- If MCP is not connected: `local_search` (built-in function)
Automatic Injection of Search Parameters:
```json
{
  "query": "Rewritten query",
  "collection_name": "Session collection",
  "metadata_filter_key": "Session filter key",
  "match_values": "Session filter values",
  "top_k": "Session top_k",
  "use_paragraph": "Session setting"
}
```

```mermaid
graph TD
    A[Receive Search Results] --> B{FAQ Type?}
    B -->|Yes| C[Return FAQ Answer Directly]
    B -->|No| D[Format References]
    D --> E[Generate RAG Prompt]
    E --> F[Call LLM Streaming]
    F --> G[Receive Chunk]
    G --> H{Contains SOURCES_USED?}
    H -->|Yes| I[Filter and Parse]
    H -->|No| J[Send Token]
    I --> K[Extract Reference Numbers]
    J --> L{Next Chunk?}
    K --> L
    L -->|Yes| G
    L -->|No| M[Final Response Complete]
    M --> N[Map Reference Information]
    N --> O[Send Completion Message]
```
Parsing SOURCES_USED:

LLM Response: `"...Answer content... SOURCES_USED: 1, 3, 5"`
- Filtering: `"...Answer content..."` (sent to the client)
- Parsing: `[1, 3, 5]` (reference numbers)
- Mapping: includes the documents at indices 1, 3, 5 from the search result list in `references`
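The parse-and-map steps can be sketched as below. Treating the numbers as 1-based (matching the `Reference [n]` labels in the RAG prompt) is an assumption consistent with the example above; the function names are hypothetical:

```python
# Hypothetical parse + map helpers for the SOURCES_USED tail.
def parse_sources_used(tail: str) -> list:
    """'1, 3, 5' -> [1, 3, 5]"""
    return [int(n) for n in tail.replace(",", " ").split() if n.isdigit()]

def map_references(numbers: list, search_results: list) -> list:
    """Select the documents the LLM actually cited (1-based, bounds-checked)."""
    return [search_results[n - 1] for n in numbers
            if 1 <= n <= len(search_results)]
```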
```mermaid
graph TD
    A[History Load Request] --> B["MongoDB Query (room_id)"]
    B --> C[Iterate Each Conversation Block]
    C --> D{Error Block?}
    D -->|Yes| E[Skip]
    D -->|No| F{search_type Matches?}
    F -->|No| E
    F -->|Yes| G{user_id Matches?}
    G -->|No| E
    G -->|Yes| H{Metadata Filter Exists?}
    H -->|No| I[Add to History]
    H -->|Yes| J{Filter Values Match?}
    J -->|No| E
    J -->|Yes| I
    I --> K{Next Block Exists?}
    K -->|Yes| C
    K -->|No| L[History Load Complete]
```
Filtering Conditions:
- `is_error == False`
- `search_type` matches (rag/gpt/pa)
- `user_id` matches
- `metadata_filter_values` is empty, or intersects the block's `block_classes`
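These conditions collapse into a single predicate over a conversation block. The field names follow the block schema shown earlier; the function name is an assumption:

```python
# Filtering predicate matching the four conditions above.
def block_passes(block: dict, search_type: str, user_id: str,
                 metadata_filter_values: list) -> bool:
    if block.get("is_error"):
        return False
    if block.get("search_type") != search_type:
        return False
    if block.get("user_id") != user_id:
        return False
    if not metadata_filter_values:
        return True  # no filter requested: every surviving block passes
    # require an intersection with the block's classes
    return bool(set(metadata_filter_values) & set(block.get("block_classes", [])))
```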
```mermaid
graph TD
    A[Save Conversation Block Request] --> B{Tool Block?}
    B -->|Yes| C[Remove Tool Message from Memory]
    C --> D[Delete Tool Message from MongoDB]
    D --> E[Add Block to History]
    B -->|No| E
    E --> F[Save Block to MongoDB]
    F --> G[Save Complete]
```
Tool Block Processing:
- Removes `role=tool` messages and `assistant` messages carrying `tool_calls` from the in-memory `history`
- Similarly removes them from MongoDB using the `$pull` operator
- Keeps only the final assistant response
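The in-memory half of that cleanup is a simple filter; the `$pull` filter shape shown alongside it is illustrative of the equivalent MongoDB update, not the exact production query:

```python
# Drop role=="tool" messages and assistant messages that carry tool_calls,
# keeping only the final plain answer.
def strip_tool_messages(history: list) -> list:
    return [
        m for m in history
        if m.get("role") != "tool" and not m.get("tool_calls")
    ]

# Illustrative shape of the matching MongoDB update (an assumption):
PULL_UPDATE = {
    "$pull": {
        "conversations": {
            "$or": [{"role": "tool"}, {"tool_calls": {"$exists": True}}]
        }
    }
}
```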
```python
{
    "prompt_tokens": int,       # Prompt tokens
    "completion_tokens": int,   # Response tokens
    "total_tokens": int         # Total tokens
}
```

Accumulation Points:
- LLM call for query rewriting
- LLM verification call for forbidden words (Optional)
- LLM call for RAG response generation
Collection: {STATISTIC_DB}.{search_type}
Document Structure:
```python
{
    "user_id": str,
    "room_id": str,
    "date": str,                # YYYY-MM-DD
    "model": str,               # rag/gpt/pa
    "usage": {
        "prompt_tokens": int,
        "completion_tokens": int,
        "total_tokens": int,
        "type": "usage"
    }
}
```

Update Method:
- Prevents duplicates per day (`upsert`)
- Updates `usage` for the same date/user/room/model
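A sketch of the daily upsert document construction follows (the real write would then be something like `collection.update_one(flt, update, upsert=True)`). Whether `usage` is accumulated with `$inc` or replaced with `$set` is an assumption; `$inc` is shown as one plausible way to accumulate within a day:

```python
# Hypothetical construction of the per-day statistics upsert.
def build_stats_upsert(user_id: str, room_id: str, model: str,
                       date: str, usage: dict):
    # One document per day/user/room/model, per the update method above.
    flt = {"user_id": user_id, "room_id": room_id,
           "model": model, "date": date}
    update = {
        "$inc": {f"usage.{k}": usage[k]
                 for k in ("prompt_tokens", "completion_tokens",
                           "total_tokens")},
        "$set": {"usage.type": "usage"},
    }
    return flt, update
```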
| Category | Setting | Environment Variable | Default Value | Description |
|---|---|---|---|---|
| LLM | Base URL | `LLM__BASE_URL` | `http://10.10.1.60:10222/v1` | LLM API endpoint |
| LLM | Model | `LLM__MODEL` | `openai/gpt-oss-120b` | LLM model name |
| LLM | Temperature | `LLM__TEMPERATURE` | `0.2` | Response diversity |
| LLM | Max Tokens | `LLM__MAX_COMPLETION_TOKENS` | `10000` | Maximum response tokens |
| RAG | Forbidden Check Enabled | `RAG__FORBIDDEN_CHECK_ENABLED` | `true` | Whether to use the forbidden word check |
| RAG | Forbidden Threshold | `RAG__FORBIDDEN_CHECK_THRESHOLD` | `0.3` | Similarity threshold (0.0~1.0) |
| RAG | Forbidden Check Mode | `RAG__FORBIDDEN_CHECK_MODE` | `llm` | `llm` or `qdrant` |
| KDB | URL | `KDB__URL` | `http://10.30.1.196:28101` | Vector DB API endpoint |
| MCP | Gateway | `MCP__GATEWAY` | `http://10.10.1.25:8502` | MCP gateway |
| MCP | Enabled | `MCP__MCP_ENABLED` | `false` | Whether to use MCP |
| DB | Host | `DB__HOST` | `10.30.1.196` | MongoDB host |
| DB | Port | `DB__PORT` | `27017` | MongoDB port |
| Error Situation | Handling Method |
|---|---|
| LLM API Error | Sends SSE error event, includes error details |
| Vector DB Connection Error | Fallback to local search (404: Returns empty result) |
| Cache Service Error | Logs and skips caching, continues search |
| Forbidden Word Check Error | System error notice, blocks search |
| MongoDB Connection Error | FastAPI 500 Error Response |
| MCP Connection Failure | Fallback to local search |
| Query Rewrite Failure | Uses original query |
Error During Streaming:
```json
{
  "type": "error",
  "content": "An error occurred: <error message>",
  "detail": "<traceback>",
  "final_response": "<partial response>",
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}
```

Error Before Streaming:

```json
{
  "type": "error",
  "content": "A server error occurred: <error message>",
  "detail": "<traceback>"
}
```

RAG Query/Response API (Streaming)
Request:
```json
{
  "collection_name": "kac_law",
  "query": "What is Article 1 of the Aviation Act?",
  "room_id": "room_123",
  "user_id": "user_456",
  "search_type": "rag",
  "metadata_filter_key": "class",
  "metadata_filter_values": ["Law", "Regulation"],
  "top_k": 100,
  "temperature": 0.2
}
```

Response: SSE stream

```
data: {"type": "step", "id": "...", "content": "..."}
data: {"type": "token", "content": "..."}
data: {"type": "done", "final_response": "...", "usage": {...}, "references": [...]}
```
GPT Query/Response API (No RAG)
Same Request Format
Differences:
- No vector search
- Uses only conversation history
- Uses `stream_response_with_openai_streaming_GPT`
PA (Planning Agent) Query/Response API
Same Request Format
Differences:
- MCP connection enabled
- Can use external tools
- RAG + MCP integration mode
Retrieves the accumulated token count for the current chat room.
Query Parameters:
- `room_id`: Chat room ID
- `user_id`: User ID
Response:
```json
{
  "tokens": 1234
}
```

- Semantic Cache
  - Vector similarity-based caching
  - Reuses search results for identical/similar queries
  - TTL: 15 minutes (KDB Cache Service setting)
- Prompt Cache
  - Redis → MongoDB → File hierarchical structure
  - Caches system prompts and RAG templates
- Multiple Query Search
  - Sequentially processes multiple rewritten queries
  - Integrates each search result
- Parallel Forbidden Word Checking
  - Parallel execution of Qdrant searches for multiple queries
  - Utilizes `aiohttp.ClientSession`
- Chunk-Level Transmission
  - Token-level streaming via SSE
  - Enables real-time client rendering
- Buffer Management
  - Maintains a tail buffer of at most 30 characters when filtering SOURCES_USED
  - Memory-efficient processing
- Blocks sensitive information inquiries
- Filters terms restricted for business use
- Precision blocking through LLM verification
- History filtering based on `user_id`
- Chat room isolation based on `room_id`
- Document access control based on `metadata_filter_values`
- API level authentication (Requires implementation)
- User-specific collection access rights management
- Statistics information access control
- External tool integration (Search, calculation, files, etc.)
- Dynamic tool loading
- Automatic injection of search parameters
- Vision API integration (Separate endpoint)
- Image-based query and response
- Utilization of VLM models
- Hybrid search (Vector + BM25)
- Graph search (Knowledge Graph)
- VOC search (Voice of Customer)
```
chat-core/
├── routes/v2/
│   └── query.py                    # API endpoints
├── utils/
│   ├── query_cache_manager.py      # Query rewriting & cache client
│   └── multimcpmanager.py          # MCP integrated management
├── tools/
│   ├── rag_retriever.py            # Vector search function
│   └── forbidden_words_checker.py  # Forbidden word check
├── prompts/
│   └── __init__.py                 # Prompt loader
├── models/
│   └── __init__.py                 # Pydantic models
├── config/
│   └── __init__.py                 # Configuration management
└── global_variables.py             # Global variables
```