Build a wiki-generation layer on top of the current ChatDKU ingestion pipeline so that data is not only chunked for retrieval, but also compiled into a persistent, human-readable knowledge layer.
The goal is not to replace vector search. The goal is to add a second layer that:
- organizes source content into stable wiki pages,
- accumulates cross-source knowledge over time,
- reduces repeated synthesis at query time,
- makes important entities, concepts, and policies easier to inspect and maintain.
Why Borrow from llm-wiki-agent
Karpathy proposed this idea several weeks ago in https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
An implementation is available at https://github.com/SamurAIGPT/llm-wiki-agent
The key idea worth borrowing is not the exact implementation, but the compiled knowledge layer:
- source files remain immutable,
- ingestion produces structured wiki pages,
- pages are linked and indexed ahead of time,
- contradictions and coverage gaps are surfaced during ingestion instead of only during retrieval.
This fits ChatDKU well because current ingestion already has:
- stable file-level metadata such as file_path, file_name, timestamps, and access fields,
- a node-generation stage in update_data.py,
- downstream vector stores in Chroma and Postgres.
So the missing piece is a wiki layer between raw documents and retrieval.
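As a rough sketch of where such a layer could sit, the Python below groups already-generated nodes by their file-level metadata and emits one page per source file. The node shape (a dict with `text` and `metadata` keys) and the functions `page_slug` and `compile_pages` are assumptions for illustration, not the actual interface of update_data.py. A real wiki layer would presumably synthesize the body with an LLM; here the chunks are simply concatenated.

```python
import hashlib
import re
from collections import defaultdict


def page_slug(file_path: str, file_name: str) -> str:
    """Derive a stable page identifier from file-level metadata."""
    stem = re.sub(r"[^a-z0-9]+", "-", file_name.lower()).strip("-")
    # Hashing the path keeps slugs unique when two files share a name.
    digest = hashlib.sha1(file_path.encode("utf-8")).hexdigest()[:8]
    return f"{stem}-{digest}"


def compile_pages(nodes: list[dict]) -> dict[str, str]:
    """Group nodes by source file and render one page per file."""
    grouped: dict[tuple[str, str], list[str]] = defaultdict(list)
    for node in nodes:
        meta = node["metadata"]
        grouped[(meta["file_path"], meta["file_name"])].append(node["text"])

    pages: dict[str, str] = {}
    for (file_path, file_name), texts in grouped.items():
        body = "\n\n".join(texts)
        # Each page records its provenance so it can be traced to the source.
        pages[page_slug(file_path, file_name)] = (
            f"# {file_name}\n\nSource: {file_path}\n\n{body}\n"
        )
    return pages


# Usage with invented placeholder nodes.
nodes = [
    {"text": "First chunk.",
     "metadata": {"file_path": "/data/a.pdf", "file_name": "a.pdf"}},
    {"text": "Second chunk.",
     "metadata": {"file_path": "/data/a.pdf", "file_name": "a.pdf"}},
]
for slug, page in compile_pages(nodes).items():
    print(slug)
    print(page)
```

Keying pages on file_path means re-ingesting the same file overwrites its page instead of creating a duplicate, while the same nodes can still be sent to the vector stores unchanged.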