Conversation
I'm fine with the long lines. Regarding the questions:
IMO in metrics. For me, the user experience is: first, the user evaluates whether this library can do the job (what is the purpose of the lib, what does it actually do), i.e. what metrics to expect in the first place (are some metrics the user needs missing, are there metrics the user doesn't need, and what input should be provided for these metrics). Only then, if the user finds the library suitable and decides to use it, are they interested in the actual output schema (this can be skipped if the user starts using the lib directly and explores the outputs; if something is unclear, they can check the docs).
Currently, we mention the aggregates in both output.md and metrics.md, which I think is fine.
I'm fine with the current approach (a list). We could also use JSON Schema, though some readers may find it harder to read. The benefit is that the schema can be used to test the outputs for compliance. It would look something like this:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Evaluation Output Schema",
  "description": "A list of objects representing evaluated questions from a reference Q&A dataset.",
  "type": "array",
  "items": {
    "type": "object",
    "required": ["template_id", "question_id", "question_text", "status"],
    "properties": {
      "template_id": { "type": "string", "description": "The template id" },
      "question_id": { "type": "string", "description": "The question id" },
      "question_text": { "type": "string", "description": "The natural language query" },
      "status": {
        "type": "string",
        "enum": ["success", "error"],
        "description": "Indicates whether the evaluation succeeded"
      },
      "reference_steps": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "matches": {
              "type": "boolean",
              "description": "Added to steps which are matched during evaluation"
            }
          }
        },
        "description": "Copy of the expected steps in the Q&A dataset"
      },
      "reference_answer": { "type": "string", "description": "Copy of the expected answer" },
      "actual_answer": { "type": "string", "description": "Copy of the response text in the evaluation target" },
      "answer_reference_claims_count": { "type": "integer", "minimum": 0 },
      "answer_actual_claims_count": { "type": "integer", "minimum": 0 },
      "answer_matching_claims_count": { "type": "integer", "minimum": 0 },
      "answer_recall": { "type": "number", "minimum": 0, "maximum": 1 },
      "answer_precision": { "type": "number", "minimum": 0, "maximum": 1 },
      "answer_correctness_reason": {
        "type": "string",
        "description": "LLM reasoning for claim extraction and matching"
      },
      "answer_eval_error": { "type": "string" },
      "answer_f1": { "type": "number", "minimum": 0, "maximum": 1 },
      "answer_relevance": { "type": "number", "minimum": 0, "maximum": 1 },
      "answer_relevance_error": { "type": "string" },
      "actual_steps": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "retrieval_answer_recall": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_answer_recall_error": { "type": "string" },
            "retrieval_answer_precision": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_answer_precision_error": { "type": "string" },
            "retrieval_answer_f1": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_context_recall": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_context_recall_error": { "type": "string" },
            "retrieval_context_precision": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_context_precision_error": { "type": "string" },
            "retrieval_context_f1": { "type": "number", "minimum": 0, "maximum": 1 }
          }
        }
      },
      "steps_score": { "type": "number", "minimum": 0, "maximum": 1 },
      "input_tokens": { "type": "integer", "minimum": 0 },
      "output_tokens": { "type": "integer", "minimum": 0 },
      "total_tokens": { "type": "integer", "minimum": 0 },
      "elapsed_sec": { "type": "number", "minimum": 0 }
    }
  }
}
```
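As a sketch of the compliance test mentioned above, here is a minimal, hand-rolled check of one output record against the schema's required keys and score ranges. The sample record is hypothetical; a real test suite would instead run a full JSON Schema validator (e.g. the third-party `jsonschema` package) against the schema itself.

```python
# Minimal compliance check for one evaluation-output record, mirroring the
# schema above: required string keys, the status enum, and [0, 1] score ranges.
# Hypothetical sample data; a full test would use a JSON Schema validator.

REQUIRED = ("template_id", "question_id", "question_text", "status")
UNIT_SCORES = ("answer_recall", "answer_precision", "answer_f1",
               "answer_relevance", "steps_score")

def check_record(rec: dict) -> list[str]:
    """Return a list of human-readable violations (empty list = compliant)."""
    errors = []
    for key in REQUIRED:
        if not isinstance(rec.get(key), str):
            errors.append(f"{key}: required string is missing or not a string")
    if rec.get("status") not in ("success", "error"):
        errors.append("status: must be 'success' or 'error'")
    for key in UNIT_SCORES:
        if key in rec:
            value = rec[key]
            if not isinstance(value, (int, float)) or not 0 <= value <= 1:
                errors.append(f"{key}: must be a number in [0, 1]")
    return errors

sample = {  # hypothetical record
    "template_id": "t1",
    "question_id": "q1",
    "question_text": "Which datasets mention X?",
    "status": "success",
    "answer_recall": 0.8,
    "answer_precision": 0.75,
}
print(check_record(sample))  # an empty list means the record is compliant
```

The same loop can be run over the whole output array, which is exactly the compliance test the schema would make unnecessary to hand-write.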
I like the idea of being consistent. "agent" is my preferred option.
docs/usage.md
Outdated
To evaluate answers and/or steps:

1. Install this package ([§ Installation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/install.md))
1. Format the dataset of questions and reference answers and/or steps ([§ Reference Q&A data](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/input.md#reference-qa-data))
1. Format the answers and/or steps to evaluate: [§ Target responses to evaluate](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/input.md#target-responses-to-evaluate)
1. To evaluate metrics that require an LLM ([§ LLM use in evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)):
   1. Create a reference dataset and target dataset (output from the target system) with the relevant keys ([§ Inputs](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/input.md)):
      1. For `answer_relevance` ([§ Metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)):
         1. Include `actual_answer` in the reference dataset
      1. For answer correctness metrics ([§ Metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)):
         1. Include `reference_answer` in the reference dataset and `actual_answer` in the target data to evaluate
      1. For custom metrics ([§ Custom evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md)):
         1. Define the metrics ([§ Configuration](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/config.md))
         1. Include reference and target inputs used by the metrics
   1. Configure the LLM ([§ Configuration](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/config.md))
   1. Set the environment variable for your LLM provider (e.g., `OPENAI_API_KEY`) to hold your LLM access key
1. To evaluate steps ([§ Steps score](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/steps.md)):
   1. Include `reference_steps` in the reference data and `actual_steps` in the target data
1. Call the evaluation function with the reference data and target data: [§ Example code](#example-code)
1. Call the aggregation function with the evaluation results: sections [Example code](#example-code), [Aggregate metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/output.md#aggregate-metrics) and [Example aggregate output](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/examples/aggregates.yaml)
Thanks! This bug was introduced by a recent reformatting. Fixed.
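The usage checklist quoted above boils down to assembling two aligned records per question before calling the library. A hypothetical sketch of their shape (field names are taken from the docs quoted above; the values and the step structure are invented for illustration, and the actual call is shown in the linked "Example code" section):

```python
# Hypothetical minimal inputs for one question, following the field names in
# the usage checklist. Values and step contents are invented for illustration.

reference_item = {
    "template_id": "t1",
    "question_id": "q1",
    "question_text": "Which datasets mention X?",
    "reference_answer": "Datasets A and B mention X.",  # for answer correctness
    "reference_steps": [{"name": "sparql_query"}],      # for the steps score
}

target_item = {
    "question_id": "q1",
    "actual_answer": "Datasets A and B mention X.",     # for answer correctness
    "actual_steps": [{"name": "sparql_query"}],         # for the steps score
}

# The evaluation function is called with lists of such reference and target
# records, and the aggregation function with the per-question results
# (see the "Example code" section in docs/usage.md for the actual API).
print(sorted(reference_item))
```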

Changes
docs/. Benefits:Tests
README.mdanddocs/*Open questions