Conversation
I'm fine with the long lines. Regarding the questions:
IMO in metrics. For me, the user experience is: first, the user evaluates whether this library can do the job (what is the purpose of the lib, what does it actually do), i.e. what metrics to expect in the first place (are some metrics the user needs missing, are there metrics the user doesn't need, and what input should be provided for these metrics). Only then, if the user finds the library suitable and decides to use it, are they interested in the actual output schema (this can be skipped if the user starts using the lib directly and explores the outputs; if something is unclear, they can check the docs).
Currently, we mention the aggregates in both output.md and metrics.md, which I think is fine.
I'm fine with the current approach (a list). We could also use JSON Schema, though some readers may find it harder to read. The benefit is that the schema can be used to test the outputs for compliance. It would look something like this:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Evaluation Output Schema",
  "description": "A list of objects representing evaluated questions from a reference Q&A dataset.",
  "type": "array",
  "items": {
    "type": "object",
    "required": ["template_id", "question_id", "question_text", "status"],
    "properties": {
      "template_id": { "type": "string", "description": "The template id" },
      "question_id": { "type": "string", "description": "The question id" },
      "question_text": { "type": "string", "description": "The natural language query" },
      "status": {
        "type": "string",
        "enum": ["success", "error"],
        "description": "Indicates whether the evaluation succeeded"
      },
      "reference_steps": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "matches": {
              "type": "boolean",
              "description": "Added to steps which are matched during evaluation"
            }
          }
        },
        "description": "Copy of the expected steps in the Q&A dataset"
      },
      "reference_answer": { "type": "string", "description": "Copy of the expected answer" },
      "actual_answer": { "type": "string", "description": "Copy of the response text in the evaluation target" },
      "answer_reference_claims_count": { "type": "integer", "minimum": 0 },
      "answer_actual_claims_count": { "type": "integer", "minimum": 0 },
      "answer_matching_claims_count": { "type": "integer", "minimum": 0 },
      "answer_recall": { "type": "number", "minimum": 0, "maximum": 1 },
      "answer_precision": { "type": "number", "minimum": 0, "maximum": 1 },
      "answer_correctness_reason": {
        "type": "string",
        "description": "LLM reasoning for claim extraction and matching"
      },
      "answer_eval_error": { "type": "string" },
      "answer_f1": { "type": "number", "minimum": 0, "maximum": 1 },
      "answer_relevance": { "type": "number", "minimum": 0, "maximum": 1 },
      "answer_relevance_error": { "type": "string" },
      "actual_steps": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "retrieval_answer_recall": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_answer_recall_error": { "type": "string" },
            "retrieval_answer_precision": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_answer_precision_error": { "type": "string" },
            "retrieval_answer_f1": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_context_recall": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_context_recall_error": { "type": "string" },
            "retrieval_context_precision": { "type": "number", "minimum": 0, "maximum": 1 },
            "retrieval_context_precision_error": { "type": "string" },
            "retrieval_context_f1": { "type": "number", "minimum": 0, "maximum": 1 }
          }
        }
      },
      "steps_score": { "type": "number", "minimum": 0, "maximum": 1 },
      "input_tokens": { "type": "integer", "minimum": 0 },
      "output_tokens": { "type": "integer", "minimum": 0 },
      "total_tokens": { "type": "integer", "minimum": 0 },
      "elapsed_sec": { "type": "number", "minimum": 0 }
    }
  }
}
```
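As a sketch of the compliance test mentioned above, here is a minimal, hand-rolled check of one output record against the schema's required keys and score ranges. The sample record is hypothetical; a real test suite would instead run a full JSON Schema validator (e.g. the third-party `jsonschema` package) against the schema itself.

```python
# Minimal compliance check for one evaluation-output record, mirroring the
# schema above: required string keys, the status enum, and [0, 1] score ranges.
# Hypothetical sample data; a full test would use a JSON Schema validator.

REQUIRED = ("template_id", "question_id", "question_text", "status")
UNIT_SCORES = ("answer_recall", "answer_precision", "answer_f1",
               "answer_relevance", "steps_score")

def check_record(rec: dict) -> list[str]:
    """Return a list of human-readable violations (empty list = compliant)."""
    errors = []
    for key in REQUIRED:
        if not isinstance(rec.get(key), str):
            errors.append(f"{key}: required string is missing or not a string")
    if rec.get("status") not in ("success", "error"):
        errors.append("status: must be 'success' or 'error'")
    for key in UNIT_SCORES:
        if key in rec:
            value = rec[key]
            if not isinstance(value, (int, float)) or not 0 <= value <= 1:
                errors.append(f"{key}: must be a number in [0, 1]")
    return errors

sample = {  # hypothetical record
    "template_id": "t1",
    "question_id": "q1",
    "question_text": "Which datasets mention X?",
    "status": "success",
    "answer_recall": 0.8,
    "answer_precision": 0.75,
}
print(check_record(sample))  # an empty list means the record is compliant
```

The same loop can be run over the whole output array, which is exactly the compliance test the schema would make unnecessary to hand-write.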
I like the idea of being consistent. "agent" is my preferred option.
docs/usage.md
Outdated
To evaluate answers and/or steps:

1. Install this package ([§ Installation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/install.md))
1. Format the dataset of questions and reference answers and/or steps ([§ Reference Q&A data](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/input.md#reference-qa-data))
1. Format the answers and/or steps to evaluate: [§ Target responses to evaluate](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/input.md#target-responses-to-evaluate)
1. To evaluate metrics that require an LLM ([§ LLM use in evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)):
   1. Create a reference dataset and target dataset (output from the target system) with the relevant keys ([§ Inputs](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/input.md)):
      1. For `answer_relevance` ([§ Metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)):
         1. Include `actual_answer` in the reference dataset
      1. For answer correctness metrics ([§ Metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)):
         1. Include `reference_answer` in the reference dataset and `actual_answer` in the target data to evaluate
      1. For custom metrics ([§ Custom evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md)):
         1. Define the metrics ([§ Configuration](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/config.md))
         1. Include reference and target inputs used by the metrics
   1. Configure the LLM ([§ Configuration](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/config.md))
   1. Set the environment variable for your LLM provider (e.g., `OPENAI_API_KEY`) to hold your LLM access key
1. To evaluate steps ([§ Steps score](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/steps.md)):
   1. Include `reference_steps` in the reference data and `actual_steps` in the target data
1. Call the evaluation function with the reference data and target data: [§ Example code](#example-code)
1. Call the aggregation function with the evaluation results: sections [Example code](#example-code), [Aggregate metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/output.md#aggregate-metrics) and [Example aggregate output](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/examples/aggregates.yaml)
Thanks! This bug was introduced by a recent reformatting. Fixed.
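The usage checklist quoted above boils down to assembling two aligned records per question before calling the library. A hypothetical sketch of their shape (field names are taken from the docs quoted above; the values and the step structure are invented for illustration, and the actual call is shown in the linked "Example code" section):

```python
# Hypothetical minimal inputs for one question, following the field names in
# the usage checklist. Values and step contents are invented for illustration.

reference_item = {
    "template_id": "t1",
    "question_id": "q1",
    "question_text": "Which datasets mention X?",
    "reference_answer": "Datasets A and B mention X.",  # for answer correctness
    "reference_steps": [{"name": "sparql_query"}],      # for the steps score
}

target_item = {
    "question_id": "q1",
    "actual_answer": "Datasets A and B mention X.",     # for answer correctness
    "actual_steps": [{"name": "sparql_query"}],         # for the steps score
}

# The evaluation function is called with lists of such reference and target
# records, and the aggregation function with the per-question results
# (see the "Example code" section in docs/usage.md for the actual API).
print(sorted(reference_item))
```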

Changes
docs/. Benefits:Tests
README.mdanddocs/*Open questions