
Create an LLM wiki during ingestion #263

@zhiwei531

Build a wiki-generation layer on top of the current ChatDKU ingestion pipeline so that data is not only chunked for retrieval, but also compiled into a persistent, human-readable knowledge layer.

The goal is not to replace vector search. The goal is to add a second layer that:

  • organizes source content into stable wiki pages,
  • accumulates cross-source knowledge over time,
  • reduces repeated synthesis at query time,
  • makes important entities, concepts, and policies easier to inspect and maintain.

Why Borrow from llm-wiki-agent

Karpathy proposed this idea several weeks ago in https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f

Available repo: https://github.com/SamurAIGPT/llm-wiki-agent

The key idea worth borrowing is not the exact implementation, but the compiled knowledge layer:

  • source files remain immutable,
  • ingestion produces structured wiki pages,
  • pages are linked and indexed ahead of time,
  • contradictions and coverage gaps are surfaced during ingestion instead of only during retrieval.
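To make the last point concrete, here is a minimal sketch of how a conflict could be flagged at ingestion time instead of at query time. Everything in it is an assumption for illustration: `WikiPage`, `ingest_claim`, and the key/value claim model are hypothetical names, not taken from llm-wiki-agent or ChatDKU, and a real version would presumably use an LLM to extract and compare claims.

```python
from dataclasses import dataclass, field


@dataclass
class WikiPage:
    """Hypothetical compiled page; field names are illustrative only."""
    title: str
    claims: dict = field(default_factory=dict)   # claim key -> claim value
    sources: dict = field(default_factory=dict)  # claim key -> source file
    links: set = field(default_factory=set)      # titles of linked pages


def ingest_claim(wiki, title, key, value, source, conflicts):
    """Add one claim to a page; record a conflict instead of overwriting."""
    page = wiki.setdefault(title, WikiPage(title=title))
    if key in page.claims and page.claims[key] != value:
        # Keep the first-seen value and surface the disagreement for review.
        conflicts.append((title, key, page.sources[key], source))
        return
    page.claims[key] = value
    page.sources[key] = source


# Placeholder data, made up for the example.
wiki, conflicts = {}, []
ingest_claim(wiki, "Example Policy", "deadline", "value-1", "doc_a.md", conflicts)
ingest_claim(wiki, "Example Policy", "deadline", "value-2", "doc_b.md", conflicts)
```

After the two calls, `conflicts` holds one entry naming both source files, and the page still carries the first-seen value. Source files are never modified; only the compiled page and the conflict list change.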

This fits ChatDKU well because current ingestion already has:

  • stable file-level metadata such as file_path, file_name, timestamps, and access fields,
  • a node-generation stage in update_data.py,
  • downstream vector stores in Chroma and Postgres.

So the missing piece is a wiki layer between raw documents and retrieval.
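As a rough sketch of where such a layer could hook in, the snippet below groups chunk-level nodes by their source file and builds one page stub per file. It assumes nodes are plain dicts with `text` and `metadata` keys, which may not match the actual node objects produced in update_data.py; `group_nodes_by_file` and `build_page_stub` are hypothetical helpers, and the LLM summarization step is deliberately left out.

```python
from collections import defaultdict


def group_nodes_by_file(nodes):
    """Group chunk-level nodes by source file_path, preserving first-seen order."""
    grouped = defaultdict(list)
    for node in nodes:
        grouped[node["metadata"]["file_path"]].append(node)
    return dict(grouped)


def build_page_stub(file_path, nodes):
    """Build a minimal page record for one source file (no LLM call here)."""
    meta = nodes[0]["metadata"]
    return {
        "title": meta.get("file_name", file_path),
        "source": file_path,
        "chunk_count": len(nodes),
        # A real wiki layer would synthesize this body; here chunks are just joined.
        "body": "\n\n".join(n["text"] for n in nodes),
    }


def build_wiki(nodes):
    """Return one page stub per source file, keyed by file_path."""
    return {
        path: build_page_stub(path, file_nodes)
        for path, file_nodes in group_nodes_by_file(nodes).items()
    }
```

Keying pages by `file_path` is only one option. It keeps the mapping to the existing metadata trivial, but entity or policy pages that span several files would need a separate merge step on top of these stubs.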
