Citation-first Japanese RAG core for internal knowledge, support operations, and retrieval evaluation.
This repository is built around one practical requirement:
return grounded answers with inspectable evidence, and fail conservatively when evidence is weak.
It is not a generic chatbot demo. It is a Japanese RAG implementation focused on retrieval quality, citation integrity, fallback safety, and reproducible evaluation.
Internal support and operations teams need answers they can trust.
In Japanese enterprise documents, relevant evidence is often fragmented across:
- short glossary entries
- FAQ fragments
- procedure sections
- policy/spec headings
- table rows
- code-like identifiers and mixed-script terms
This project provides a production-oriented retrieval stack that keeps citations first, makes fallback behavior explicit, and exposes enough trace information to debug retrieval and reranking failures.
Many RAG demos stop at “answer generation works”.
This repository goes further:
- citation-first answer path
- hybrid retrieval with Japanese-aware reranking
- guard and extractive fallback instead of opaque guessing
- doc-type-aware Japanese chunking
- traceable QA path for retrieval/rerank debugging
- repo-native evaluation, including retrieval-mode comparison
This makes the repository useful not only as an answer service, but also as a testbed for improving Japanese retrieval behavior in a controlled, reproducible way.
Japanese enterprise retrieval has several recurring failure modes:
- Mixed scripts (kanji, kana, alnum IDs) weaken naive lexical matching.
- Important terms are often very short (`PR2`, `請求書ID`, `承認フラグ`) and easy to confuse with supersets (`PR20`).
- Procedural evidence is distributed across neighboring lines and sections.
- Parent/child document structure matters for retrieval quality and answer readability.
- Table-like content is often semi-structured and not searchable as-is.
- Short but valid lookup queries are easy to over-block as “too general”.
This repository explicitly targets those failure modes.
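To make the short-ID confusion concrete, here is a minimal, repo-independent sketch of why naive substring matching treats `PR2` as a hit inside `PR20` in mixed-script text, and one boundary-aware fix (illustration only, not code from this repository):

```python
import re

text = "PR20の承認フラグを更新してから、PR2の請求書IDを確認する。"

# Naive substring matching counts the "PR2" inside "PR20" as a hit.
print(len(re.findall("PR2", text)))  # 2 (one hit is a false positive from PR20)

# Boundary-aware matching: reject hits flanked by other ASCII alphanumerics.
# A plain \b is unreliable here because Japanese characters also count as
# word characters in Python's Unicode regex, so "PR2の" has no \b after "2".
exact = re.compile(r"(?<![A-Za-z0-9])PR2(?![A-Za-z0-9])")
print(len(exact.findall(text)))  # 1 (only the true PR2 mention)
```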
- FastAPI endpoints for chat (`/chat`) and retrieval inspection (`/search`)
- Citation-first grounded answer flow with guard and extractive fallback
- Hybrid retrieval (keyword + vector + fusion)
- Japanese-aware reranking for:
  - short lookup queries
  - quoted terms
  - code-like identifiers
  - metadata-aware matches (title / section / alias / question)
- Parent/child chunk design for retrieval-vs-context separation
- Doc-type-aware Japanese chunk construction
- Lightweight deterministic smoke evaluation
- Retrieval-aware evaluation across `bm25_only`, `dense_only`, `hybrid`, and `hybrid_rerank`
These parts are already central to the repository and are intended to stay stable:
- citation-first answer path
- hybrid retrieval flow
- guard / fallback behavior
- FastAPI endpoints
- deterministic smoke evaluation workflow
- internal traceable QA path used by evaluation
These parts are already usable for controlled comparison and analysis:
- retrieval-aware eval runner
- labeled retrieval comparison cases
- gold doc / chunk labels
- abstain-labeled cases
- per-query JSONL output
- mode-level summary JSON output
These parts are intentionally still treated as tuning knobs:
- chunk target sizes and heuristics
- parent expansion thresholds
- reranker boost strengths
- larger real-world benchmark coverage
- broader corpus realism beyond the small fixed smoke corpus
The answer pipeline is intentionally layered.
Hybrid retrieval is used to gather candidate evidence from:
- keyword retrieval
- vector retrieval
- fused candidate ranking
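For intuition, the fusion step can be sketched as reciprocal rank fusion (RRF) over the two rankings. RRF here is an assumption for illustration; the repository's actual fusion formula may differ:

```python
from collections import defaultdict

def rrf_fuse(keyword_ranked, vector_ranked, k=60):
    """Fuse two ranked lists of chunk IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranked in (keyword_ranked, vector_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A chunk ranked by both retrievers outranks chunks seen by only one.
print(rrf_fuse(["c1", "c2", "c3"], ["c2", "c4"]))  # ['c2', 'c1', 'c4', 'c3']
```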
Candidates are reranked using Japanese-aware lexical and metadata signals such as:
- quoted code-like terms
- alnum IDs
- katakana terms
- kanji terms
- short lookup cores
- title / section / alias / FAQ-question metadata
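A minimal sketch of this kind of signal-based boosting, with placeholder weights (the repository explicitly lists reranker boost strengths as tuning knobs, and its real signal extraction is richer than this):

```python
import re

QUOTED = re.compile(r"「([^」]+)」")           # quoted terms, e.g. 「承認フラグ」
ALNUM_ID = re.compile(r"[A-Za-z]+\d+")         # code-like IDs, e.g. PR2
KATAKANA = re.compile(r"[\u30A0-\u30FF]{2,}")  # katakana runs

def rerank_score(query, chunk_text, metadata_text, base_score):
    """Add lexical/metadata boosts on top of a fused base score (sketch)."""
    score = base_score
    if any(t in chunk_text for t in QUOTED.findall(query)):
        score += 2.0   # exact quoted-term hit
    if any(t in chunk_text for t in ALNUM_ID.findall(query)):
        score += 1.5   # exact alnum-ID hit
    if any(t in chunk_text for t in KATAKANA.findall(query)):
        score += 0.8   # katakana term hit
    if any(t and t in metadata_text for t in query.split()):
        score += 1.2   # title / section / alias / FAQ-question match
    return score
```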
Retrieval units and answer-context units are not always the same.
- child chunks are useful for ranking precision
- parent chunks can be expanded later for grounded answer context
This allows the repository to keep retrieval precise while preserving enough context for citation-first answers.
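A rough sketch of that split, assuming the `parent_chunk_id` and `display_text` fields from the chunk schema described later plus a hypothetical `chunk_id` field (the real parent expansion thresholds are documented as tuning knobs):

```python
def expand_to_answer_context(ranked_children, chunks_by_id, max_units=3):
    """Rank on child chunks, then swap in parents for answer context."""
    seen, context = set(), []
    for child in ranked_children:
        parent_id = child.get("parent_chunk_id")
        # Use the parent as the answer-context unit when one exists.
        unit = chunks_by_id.get(parent_id, child) if parent_id else child
        unit_id = unit.get("chunk_id") or id(unit)
        if unit_id in seen:
            continue  # avoid citing the same parent twice
        seen.add(unit_id)
        context.append(unit["display_text"])
        if len(context) >= max_units:
            break
    return context
```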
The final answer path:
- builds grounded evidence blocks
- generates citation-tagged output
- validates output shape
- uses extractive fallback when needed
- keeps fallback explicit instead of pretending confidence
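As a mental model, the validation-plus-fallback step behaves roughly like the sketch below. The `[n]` citation tag format and the validation rule are assumptions for illustration, not the repository's actual implementation:

```python
import re

CITATION = re.compile(r"\[(\d+)\]")  # assumed tag shape, e.g. "... [1]"

def finalize_answer(generated: str, evidence_blocks: list[str]) -> dict:
    """Accept only answers whose citations all point at real evidence."""
    cited = {int(i) for i in CITATION.findall(generated)}
    valid = set(range(1, len(evidence_blocks) + 1))
    if cited and cited <= valid:
        return {"answer": generated, "used_fallback": False}
    # Extractive fallback: quote top evidence verbatim and flag it,
    # instead of returning an unsupported generated answer.
    if evidence_blocks:
        return {"answer": evidence_blocks[0] + " [1]", "used_fallback": True}
    return {"answer": "", "used_fallback": True}
```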
A core design goal of this repository is that retrieval failures should be inspectable.
The internal QA path exposes trace data that evaluation can reuse, including:
- before-rerank candidates
- after-rerank candidates
- after-parent-expansion candidates
- selected final answer context
- guard reason
- fallback usage
- rewritten / augmented query forms
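As a rough picture, one trace record might look like this (field names and values are illustrative, not the repo's exact schema):

```json
{
  "query": "PR2の承認フラグ",
  "rewritten_queries": ["PR2 承認フラグ"],
  "candidates_before_rerank": ["chunk_012", "chunk_007", "chunk_031"],
  "candidates_after_rerank": ["chunk_007", "chunk_012", "chunk_031"],
  "candidates_after_parent_expansion": ["chunk_005", "chunk_031"],
  "final_answer_context": ["chunk_005"],
  "guard_reason": null,
  "used_fallback": false
}
```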
This makes it possible to analyze where a result degraded:
- retrieval stage
- rerank stage
- grounding stage
- guard/fallback stage
That traceability is a major part of the repo’s value.
The repository includes `rag_core/chunking_ja.py` with document-type-aware chunk construction.
Supported chunking styles:
- FAQ / glossary
  - short chunks
  - one Q&A or one term-definition per unit
- Procedure / how-to
  - medium chunks
  - preserves prerequisites, steps, and notes
- Policy / spec
  - heading/section-oriented chunks
  - avoids brittle over-splitting of clauses
- Table-like text
  - flattened row chunks
  - keeps table title/header context in searchable form
Target chunk sizes:

- FAQ / glossary: 80-300 chars
- Procedure / how-to: 300-900 chars
- Policy / spec: 400-1200 chars
- Table-like text: 80-500 chars
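Expressed as data, the size bands above amount to something like the following sketch. The key names are illustrative (only `procedure` appears as a `--doc-type` value in the ingestion example later), and the repository treats these targets as tuning knobs:

```python
# Doc-type -> (min_chars, max_chars), taken from the bands listed above.
CHUNK_SIZE_TARGETS = {
    "faq": (80, 300),
    "glossary": (80, 300),
    "procedure": (300, 900),
    "policy": (400, 1200),
    "table": (80, 500),
}

def within_target(doc_type: str, chunk_text: str) -> bool:
    """Check whether a chunk falls inside its doc-type size band."""
    lo, hi = CHUNK_SIZE_TARGETS.get(doc_type, (80, 1200))
    return lo <= len(chunk_text) <= hi
```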
Chunk records may include:
`doc_type`, `title`, `section_path`, `chunk_role`, `parent_chunk_id`, `child_chunk_ids`, `searchable_text`, `display_text`
plus backward-compatible fields such as:
`doc_id`, `source_doc`, `source_pages`, `chunk_index`, `type`, `quality`, `searchable`
`searchable_text` is used for indexing and retrieval when present, while `display_text` is preserved for answer presentation.
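For example, a child chunk record could look like this; only the field names come from the schema above, and the values are invented for illustration:

```json
{
  "doc_id": "ops_manual_001",
  "doc_type": "procedure",
  "title": "運用手順書",
  "section_path": "第3章 > バックアップ手順",
  "chunk_role": "child",
  "parent_chunk_id": "ops_manual_001-parent-07",
  "child_chunk_ids": [],
  "searchable_text": "バックアップ手順 前提条件 日次バックアップを実行",
  "display_text": "バックアップ手順\n前提条件を確認し、日次バックアップを実行する。"
}
```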
- `/chat` returns grounded answers with citations, guard behavior, and fallback handling. Use it when you want the full answer pipeline.
- `/search` returns retrieval-oriented information and is useful for inspecting candidate evidence. Use it when you want to inspect retrieval results directly.
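Once the server is running (see the setup commands below), both endpoints can be exercised directly. The request body shape here is an assumption; check the FastAPI-generated docs at `/docs` for the actual schema:

```bash
curl -s -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "PR2の承認フラグはどこで確認できますか"}'

curl -s -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "請求書ID"}'
```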
Setup and run:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn webapi.main:app --reload
```

Legacy fixed-window PDF chunking still works. Japanese doc-type-aware chunking is available as an additive option.
Example:
```bash
PYTHONPATH=. .venv/bin/python scripts/pdf_to_canonical_jsonl.py \
  --pdf pdfs/your_doc.pdf \
  --out index/your_doc.jsonl \
  --doc-type procedure \
  --title "運用手順書"
```

Then ingest:
```bash
PYTHONPATH=. .venv/bin/python scripts/ingest_canonical_jsonl.py index/your_doc.jsonl --reset
```

`eval.runner` supports two distinct modes:
- deterministic local-friendly regression mode
- retrieval-aware comparison mode
Default deterministic behavior:
- generation is stubbed
- vector retrieval is stubbed empty unless `--real-vector` is enabled
- keyword retrieval remains active
This mode is useful for regression checks, rerank movement checks, and guard/fallback consistency.
```bash
PYTHONPATH=. .venv/bin/python -m eval.runner \
  --cases eval/cases/smoke_cases.jsonl \
  --chunks-jsonl eval/cases/smoke_chunks.jsonl \
  --output runs/eval/smoke_results.json
```

Retrieval-aware evaluation compares baseline modes and saves:
- per-query rows as JSONL
- mode-level aggregate summary as JSON
Supported modes:
`bm25_only`, `dense_only`, `hybrid`, `hybrid_rerank`
```bash
PYTHONPATH=. .venv/bin/python -m eval.runner \
  --retrieval-aware \
  --cases eval/cases/retrieval_cases.jsonl \
  --chunks-jsonl eval/cases/smoke_chunks.jsonl \
  --modes bm25_only,dense_only,hybrid,hybrid_rerank \
  --per-query-output runs/eval/retrieval_rows.jsonl \
  --summary-output runs/eval/retrieval_summary.json \
  --eval-k 5
```

Per-query rows include signals such as:
`gold_doc_hit`, `gold_chunk_hit`, `best_rank_before_rerank`, `best_rank_after_rerank`, `rerank_gain`, `guard_reason`, `used_fallback`, `expected_abstain`, `abstain_correct`
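An illustrative per-query row (the signal names follow the list above; `case_id` and `mode` are assumed wrapper fields, and every value is a placeholder, not a measured result):

```json
{
  "case_id": "retrieval_case_01",
  "mode": "hybrid_rerank",
  "gold_doc_hit": true,
  "gold_chunk_hit": true,
  "best_rank_before_rerank": 4,
  "best_rank_after_rerank": 1,
  "rerank_gain": 3,
  "guard_reason": null,
  "used_fallback": false,
  "expected_abstain": false,
  "abstain_correct": true
}
```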
Mode-level summary includes:
`gold_chunk_cases`, `gold_chunk_hits`, `gold_doc_cases`, `gold_doc_hits`, `abstain_labeled_cases`, `abstain_expected_cases`, `abstain_passes`, `mean_mrr_at_k`, `mean_ndcg_at_k`
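And an illustrative mode-level summary entry (same caveat: field names from the list above, placeholder values only, not actual evaluation results):

```json
{
  "mode": "hybrid_rerank",
  "gold_chunk_cases": 20,
  "gold_chunk_hits": 17,
  "gold_doc_cases": 20,
  "gold_doc_hits": 19,
  "abstain_labeled_cases": 5,
  "abstain_expected_cases": 3,
  "abstain_passes": 3,
  "mean_mrr_at_k": 0.72,
  "mean_ndcg_at_k": 0.78
}
```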
Case sets:
- `eval/cases/smoke_cases.jsonl`: lightweight regression checks
- `eval/cases/retrieval_cases.jsonl`: labeled retrieval comparison cases (gold IDs + abstain labels)
Without `--real-vector`, vector retrieval is stubbed empty and dense-only results are not suitable for dense-quality conclusions.
Use `--real-vector` when evaluating actual vector contribution:
```bash
PYTHONPATH=. .venv/bin/python -m eval.runner \
  --retrieval-aware \
  --cases eval/cases/retrieval_cases.jsonl \
  --chunks-jsonl eval/cases/smoke_chunks.jsonl \
  --modes dense_only,hybrid,hybrid_rerank \
  --per-query-output runs/eval/retrieval_rows_real_vector.jsonl \
  --summary-output runs/eval/retrieval_summary_real_vector.json \
  --eval-k 5 \
  --real-vector
```

- The retrieval comparison corpus is intentionally small and repo-native.
- Dense retrieval conclusions require `--real-vector`; deterministic mode is primarily for regression safety.
- Chunking and reranker settings are still an active tuning surface.
See LICENSE.