
TTYG-178 Improve README #51

Open

pgan002 wants to merge 135 commits into main from TTYG-178

Conversation

@pgan002 (Collaborator) commented Feb 7, 2026

Changes

  • Move most of the documentation from the README file into separate files in a new directory docs/. Benefits:
    1. The main page (README) is shorter, and so more welcoming
    2. The main page loads faster
    3. The sections are shorter and so easier to read
    4. The directory helps to understand the contents
  • Major additions for completeness
  • Major edits for clarity
  • Links to documentation sections from README and from other sections
  • Consistent section heading case: sentence case
  • Join each paragraph into a single line for easier MD editing

Tests

  • Spelling, grammar, typos: copy-pasted text into word processor
  • Links: manually followed each link in README.md and docs/*
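Manually following each link works, but part of this test could be automated. A minimal sketch that extracts markdown link targets so they can be checked in bulk (extraction only; actually resolving the targets would need filesystem or network access; the regex, function name, and sample string are illustrative, not part of the repository):

```python
# Sketch: collect markdown link targets from a document for later checking.
# Handles inline links of the form [text](target); not reference-style links.
import re

LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")

def extract_links(markdown: str) -> list[str]:
    """Return all inline link targets found in a markdown string."""
    return [target for _, target in LINK_RE.findall(markdown)]

sample = "See [§ Installation](docs/install.md) and [§ Metrics](docs/metrics.md)."
print(extract_links(sample))  # → ['docs/install.md', 'docs/metrics.md']
```

Feeding each file in README.md and docs/* through such a helper would turn the manual link check into a repeatable script.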

Open questions

  • How to format the key definitions in § Output: as a list or a table?
  • Move section "Aggregate metrics" to a separate file or into section Metrics?

@pgan002 pgan002 closed this Mar 14, 2026
@pgan002 pgan002 reopened this Mar 14, 2026
@pgan002 pgan002 requested review from atagarev and nelly-hateva and removed request for atagarev and nelly-hateva March 14, 2026 02:51
@pgan002 pgan002 self-assigned this Mar 14, 2026
@pgan002 pgan002 requested a review from nelly-hateva March 30, 2026 08:09
@nelly-hateva (Collaborator) commented
> @nelly-hateva what do you think about the questions in the PR description?
>
> I re-joined each paragraph into a single line because:
>
>   • Absolute links are long, so broken lines would be even harder to read
>   • It is natural for paragraphs to be joined in lines, as in word processors
>   • Joined lines make it easier to see bullet points

I'm fine with the long lines. Regarding the questions:

> Where to define metrics: in Metrics, in Output or both?

IMO in Metrics. For me the user experience is: first, the user evaluates whether this library can do the job (what is the purpose of the library, what does it actually do), i.e. what metrics to expect in the first place (are some metrics the user needs missing, are there metrics the user doesn't need, what input must be provided for these metrics). Only then, if the user finds the library suitable and decides to use it, is he/she/it interested in the actual output schema (this can even be skipped if the user starts using the library directly and explores the outputs; if something is unclear, he/she/it can check the docs).

> Move section "Aggregate metrics" to a separate file or into section Metrics?

Currently, we mention the aggregates in both output.md and metrics.md, which I think is fine.

> How to format the list of key definitions in Output?

I'm fine with the current approach (a list). We could also use JSON Schema, but for some readers it may be too difficult to read. The benefit is that we could use the schema to test the outputs for compliance. It would look something like this:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Evaluation Output Schema",
  "description": "A list of objects representing evaluated questions from a reference Q&A dataset.",
  "type": "array",
  "items": {
    "type": "object",
    "required": [
      "template_id",
      "question_id",
      "question_text",
      "status"
    ],
    "properties": {
      "template_id": {
        "type": "string",
        "description": "The template id"
      },
      "question_id": {
        "type": "string",
        "description": "The question id"
      },
      "question_text": {
        "type": "string",
        "description": "The natural language query"
      },
      "status": {
        "type": "string",
        "enum": ["success", "error"],
        "description": "Indicates whether the evaluation succeeded"
      },
      "reference_steps": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "matches": {
              "type": "boolean",
              "description": "Added to steps which are matched during evaluation"
            }
          }
        },
        "description": "Copy of the expected steps in the Q&A dataset"
      },
      "reference_answer": {
        "type": "string",
        "description": "Copy of the expected answer"
      },
      "actual_answer": {
        "type": "string",
        "description": "Copy of the response text in the evaluation target"
      },
      "answer_reference_claims_count": {
        "type": "integer",
        "minimum": 0
      },
      "answer_actual_claims_count": {
        "type": "integer",
        "minimum": 0
      },
      "answer_matching_claims_count": {
        "type": "integer",
        "minimum": 0
      },
      "answer_recall": {
        "type": "number",
        "minimum": 0,
        "maximum": 1
      },
      "answer_precision": {
        "type": "number",
        "minimum": 0,
        "maximum": 1
      },
      "answer_correctness_reason": {
        "type": "string",
        "description": "LLM reasoning for claim extraction and matching"
      },
      "answer_eval_error": {
        "type": "string"
      },
      "answer_f1": {
        "type": "number",
        "minimum": 0,
        "maximum": 1
      },
      "answer_relevance": {
        "type": "number",
        "minimum": 0,
        "maximum": 1
      },
      "answer_relevance_error": {
        "type": "string"
      },
      "actual_steps": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "retrieval_answer_recall": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_answer_recall_error": { "type": "string" },
            "retrieval_answer_precision": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_answer_precision_error": { "type": "string" },
            "retrieval_answer_f1": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_context_recall": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_context_recall_error": { "type": "string" },
            "retrieval_context_precision": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_context_precision_error": { "type": "string" },
            "retrieval_context_f1": { "type": "number", "minimum": 0, "maximum": 1 }
          }
        }
      },
      "steps_score": {
        "type": "number",
        "minimum": 0,
        "maximum": 1
      },
      "input_tokens": {
        "type": "integer",
        "minimum": 0
      },
      "output_tokens": {
        "type": "integer",
        "minimum": 0
      },
      "total_tokens": {
        "type": "integer",
        "minimum": 0
      },
      "elapsed_sec": {
        "type": "number",
        "minimum": 0
      }
    }
  }
}

> Can we refer consistently to "agent" or "question-answering system" or "chat bot"? Which one?

I like the idea of being consistent. "agent" is my preferred option.

@pgan002 pgan002 force-pushed the TTYG-178 branch 2 times, most recently from a4af844 to 98395e8 Compare April 6, 2026 16:11
@pgan002 pgan002 force-pushed the TTYG-178 branch 2 times, most recently from 9cf0ba4 to 2176182 Compare April 6, 2026 16:13
@pgan002 pgan002 requested a review from nelly-hateva April 6, 2026 16:14
docs/usage.md Outdated
Comment on lines +5 to +23
To evaluate answers and/or steps:
1. Install this package ([§ Installation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/install.md))
1. Format the dataset of questions and reference answers and/or steps ([§ Reference Q&A data](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/input.md#reference-qa-data))
1. Format the answers and/or steps to evaluate: [§ Target responses to evaluate](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/input.md#target-responses-to-evaluate)
1. To evaluate metrics that require an LLM ([§ LLM use in evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)):
1. Create a reference dataset and target dataset (output from the target system) with the relevant keys ([§ Inputs](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/input.md)):
1. For `answer_relevance` ([§ Metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)):
1. Include `actual_answer` in the reference dataset
1. For answer correctness metrics ([§ Metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)):
1. Include `reference_answer` in the reference dataset and `actual_answer` in the target data to evaluate
1. For custom metrics ([§ Custom evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md)):
1. Define the metrics ([§ Configuration](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/config.md))
1. Include reference and target inputs used by the metrics
1. Configure the LLM ([§ Configuration](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/config.md))
1. Set the environment variable for your LLM provider (e.g., `OPENAI_API_KEY`) to hold your LLM access key
1. To evaluate steps ([§ Steps score](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/steps.md)):
1. Include `reference_steps` in the reference data and `actual_steps` in target data
1. Call the evaluation function with the reference data and target data: [§ Example code](#example-code)
1. Call the aggregation function with the evaluation results: sections [Example code](#example-code), [Aggregate metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/output.md#aggregate-metrics) and [Example aggregate output](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/examples/aggregates.yaml)
Collaborator

Rendered as

[Image]

which is confusing, because steps 2 and 3 are indistinguishable from step 5, and point 4 looks unfinished. Perhaps reformatting is needed for proper rendering.

Collaborator Author

Thanks! This bug was introduced by a recent reformatting. Fixed.

@pgan002 pgan002 requested a review from nelly-hateva April 10, 2026 04:10