diff --git a/README.md b/README.md
index 5120c07..79a2786 100644
--- a/README.md
+++ b/README.md
@@ -1,162 +1,141 @@
-# Python project template
+# Agentic GoT
 
-A simple template of Python projects, with a rigid file structure, and predisposition for unit testing and release on PyPi.
+Agentic GoT (**Graph of Thought**) is a LangChain/LangGraph-based reasoning agent that solves problems by building and traversing a graph of intermediate reasoning, tool-call, scoring, and backtracking nodes rather than following a single linear chain of thought. It ships with:
 
-## Relevant features
+- A **runtime reasoning graph** (`GoT/core/runtime_graph.py`, `GoT/core/graph_model.py`) with typed nodes (`GoalNode`, `ReasoningNode`, `ToolNode`, `TestNode`, `CraftingNode`, `BacktrackNode`, `CompletitionNode`, `ResponseNode`) and Mermaid export for visualizing a run.
+- A pluggable **tool belt**: arithmetic (`GoT/agent_tools/math_tool.py`), web/knowledge lookup via Wikipedia and arXiv (`GoT/agent_tools/web_tool.py`), a sandboxed Python executor, and a **tool-crafting tool** that lets the agent write and persist brand-new tools for itself at runtime (`GoT/agent_tools/craft_tool.py`).
+- **Benchmark harnesses** for GSM8K, GPQA (diamond), Hendrycks MATH, and GAIA, wired into [`lm-eval-harness`](https://github.com/EleutherAI/lm-evaluation-harness) (`GoT/experiments/`), so the graph agent (and a plain baseline agent) can be scored automatically.
+- **MLflow** autologging for OpenAI/Gemini/LangChain calls, so every run is traced and inspectable.
 
-- All your project code into a single main package (`GoT/`)
-- All your project tests into a single test package (`test/`)
-- Unit testing support via [`unittest`](https://docs.python.org/3/library/unittest.html)
-- Automatic testing on all branches via GitHub Actions
-- Semi-automatic versioning via Git
-- Packaging support via [`setuptools`](https://setuptools.pypa.io/en/latest/setuptools.html)
-- Automatic release on [PyPi](https://pypi.org/) via GitHub Actions and [`semantic-release`](https://semantic-release.gitbook.io)
-- Automatic dependencies updates via [Renovate](https://docs.renovatebot.com/)
+## Requirements
 
-## Project structure
-
-Overview:
-```bash
-<root directory>
-├── GoT/             # main package (should be named after your project)
-│   ├── __init__.py         # python package marker
-│   └── __main__.py         # application entry point
-├── tests/                  # test package (should contain unit tests)
-├── .github/                # configuration of GitHub CI
-│   └── workflows/          # configuration of GitHub Workflows
-│       ├── check.yml       # runs tests on multiple OS and versions of Python
-│       └── deploy.yml      # if check succeeds, and the current branch is one of {main, master}, triggers automatic releas on PyPi
-├── LICENSE                 # license file (Apache 2.0 by default)
-├── pyproject.toml          # project configuration file as prescribed by Poetry
-├── renovate.json           # configuration of Renovate bot, for automatic dependency updates
-├── requirements.txt        # only declares a dependency on Poetry. DO NOT EDIT THIS FILE
-└── release.config.js       # script to release on PyPi, and GitHub via semantic-release
-```
-
-## TODO-list for template usage
-
-1. Use this template to create a new GitHub repository, say `GoT`
-    - this name will also be used to identify the package on PyPi
-        + so, we suggest choosing a name which has not been used on PyPi, yet
-        + we also suggest choosing a name which is a valid Python package name (i.e. `using_snake_case`)
+| Tool | Version | Notes |
+|---|---|---|
+| Python | `>=3.10, <3.14` | CI tests on 3.10–3.13, on Ubuntu/Windows/macOS |
+| [Poetry](https://python-poetry.org/) | `^2.2` | dependency & venv management |
+| [Ollama](https://ollama.com/) | any recent release | optional; only needed to run local Ollama models |
 
-2. Clone the `GoT` repository
+## Quick start
 
-3. Open a shell into your local `GoT` directory and run
-    ```bash
-    ./rename-template.sh GoT
-    ```
-    
-    This will coherently rename the template's project name with the one chosen by you (i.e. `GoT`, in this example)
-
-    * __Remark__: this step is now automatic thanks to the `init.yml` workflow which is triggered when using this template to create a new repository
-
-4. Commit & push
-
-5. Ensure you like the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html). If you don't, change the content of the `LICENSE` file
-
-6. Ensure the versions-range of Python reported in `pyproject.toml` fits the versions you want to support
-    + currently defaults to `>= 3.9`
-    + if you change this, please also change the versions of Python tests should be run on in CI, by looking the file `.github/workflows/check.yml`
-
-7. Check the Python version and OS tests should be run on in CI, by looking the file `.github/workflows/check.yml`
+```bash
+# 1. Clone
+git clone https://github.com/MarkRagg/GoT.git
+cd GoT
 
-8. Add your runtime, development, and build dependencies to `pyproject.toml`
+# 2. Install Poetry (pinned version; pip installs it into the currently active Python environment)
+pip install -r requirements.txt
 
-9. Check the other metadata in `pyproject.toml`
+# 3. Install project + dev dependencies (creates an in-project .venv, see poetry.toml)
+poetry install
 
-10. Change the assignee for pull-requests for automatic dependency updates by editing `renovate.json`
-    + currently defaults to @gciatto
+# 4. Configure environment variables (see below)
+cp .env.example .env   # if present — otherwise just create .env, see next section
+$EDITOR .env
 
-11. Add your `PYPI_TOKEN` token as secrets of the GitHub repository
-    - this may require you to register on PyPi first
+# 5. Run the test suite to confirm everything is wired correctly
+poetry run poe test
 
-12. Generate a GitHub token and add it as a secret of the GitHub repository, named `RELEASE_TOKEN`
-    - cf. <https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens#creating-a-personal-access-token-classic>
-    - the token must allow pushing to the repository
+# 6. Run the agent on a custom prompt in graph mode
+poetry run python -m GoT --benchmark custom --mode graph --prompt "Compute the square root of 144, then look up who introduced the radical sign."
+```
 
-13. Put your main (resp. test) code in `GoT/` (resp. `test/`)
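+> Tip: Poetry 2.x no longer bundles the `poetry shell` command. To drop the `poetry run` prefix for the rest of the session, activate the virtualenv with `eval $(poetry env activate)` (bash/zsh), or install the `poetry-plugin-shell` plugin to restore `poetry shell`.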
 
-## How to do stuff
+## Environment variables
 
-### Restore dev dependencies
+GoT loads environment variables from a `.env` file at import time via `python-dotenv` (see `GoT/__init__.py` and `GoT/core/llm.py`). Create a `.env` file in the repository root:
 
-1. Install Poetry if you don't have it yet
-    ```bash
-    pip install -r requirements.txt
-    ```
+```dotenv
+# Required — Gemini is the default remote LLM backend for every agent role
+# (standard reasoning, structured/graph reasoning, tool crafting, and scoring).
+# Get a key at https://aistudio.google.com/app/apikey
+GEMINI_API_KEY=your-gemini-api-key
 
-2. Install the project's dependencies
-    ```bash
-    poetry install
-    ```
+# Required only if you run benchmarks that pull gated Hugging Face datasets
+# (currently GPQA and GAIA). Get a token at https://huggingface.co/settings/tokens
+# and make sure your HF account has accepted the dataset's access terms.
+HF_TOKEN=your-huggingface-token
+```
 
-### Run Tests
-  Execute the test suite using `pytest`:
-  ```bash
-  poetry run poe test
-  ```
+| Variable | Required | Used by | Purpose |
+|---|---|---|---|
+| `GEMINI_API_KEY` | Yes (for any Gemini-backed run — the default) | `GoT/core/llm.py` | Authenticates the four `ChatGoogleGenerativeAI` roles (`remote_standard`, `remote_response_format`, `remote_score_format`, `remote_crafter`) that power reasoning, response formatting, scoring, and tool crafting. |
+| `HF_TOKEN` | Only for `--benchmark gpqa` / `--benchmark gaia` | `GoT/experiments/hf_formatter.py` | Downloads gated benchmark datasets from the Hugging Face Hub. `gsm8k` and `hendrycks_math` do not require it. |
 
-### Run Tests with Coverage
-  Execute the test suite with coverage reporting:
-  ```bash
-  poetry run poe coverage
-  ```
-  and generate a report with `poe coverage-report` or `poe coverage-html`
+### Optional / no setup needed
 
+- **Local Ollama model** — `GoT/core/llm.py` also instantiates an `ollamaLLM` pointed at `http://localhost:11434/v1` with model `ministral-3:8b`, using the dummy API key `"dummy"` (Ollama's OpenAI-compatible endpoint doesn't check it). This path is only exercised if your own code selects it; it's not required for the default Gemini-backed CLI flows. If you want to use it: install [Ollama](https://ollama.com/download), then run `ollama pull ministral-3:8b` and make sure `ollama serve` is running before invoking GoT.
+- **MLflow** — tracing is enabled automatically (`mlflow.set_experiment("marcoraggini-experiment")` plus autolog for OpenAI/Gemini/LangChain) and writes to a local `./mlruns` directory by default. Point it at a remote tracking server instead by exporting `MLFLOW_TRACKING_URI` before running GoT — no code changes needed.
+- `.env` is already covered by `.gitignore` — never commit real API keys.
 
-### Run Static Checks
-  Perform static code analysis using both `mypy` and `ruff`:
-  ```bash
-  poetry run poe static-checks
-  ```
+## Usage
 
-### Format Code
-  Format your code using `ruff`:
-  ```bash
-  poetry run poe format
-  ```
+The package entry point (`GoT/__main__.py` → `GoT.main()`) parses CLI args via `GoT/cli/parse_args.py`:
 
-> Note: you can enter a Poetry shell via `poetry shell` to avoid prefixing commands with `poetry run`.
+```bash
+poetry run python -m GoT --benchmark <gsm8k|gpqa|hendrycks_math|gaia|custom> --mode <graph|standard> [options]
+```
 
-> Tests are automatically run in CI, on all pushes on all branches.
-> There, tests are executed on multiple OS (Win, Mac, Ubuntu) and on multiple Python versions.
+| Flag | Required | Values | Description |
+|---|---|---|---|
+| `--benchmark` | Yes | `gsm8k`, `gpqa`, `hendrycks_math`, `gaia`, `custom` | Which benchmark (or ad-hoc prompt) to run. |
+| `--mode` | Yes | `graph`, `standard` | `graph` runs the full Graph-of-Thought reasoning pipeline; `standard` runs a single-pass baseline agent. |
+| `--prompt` | Only for `custom` | free text | The prompt to run when `--benchmark custom` is selected. |
+| `--max_run` | No (default `1`) | int | Number of benchmark samples/iterations to run. |
+| `--category` | No (default `algebra`) | `algebra`, `counting_and_probability`, `geometry`, `intermediate_algebra`, `number_theory`, `precalculus`, `prealgebra` | Math subject filter, only used with `--benchmark hendrycks_math`. |
 
-### Run your code as an application
+Examples:
 
-This will execute the `__main__.py` file in the `GoT` package:
 ```bash
-poetry run python -m GoT
-```
-
-the latter is possible because of the script defined in the `pyproject.toml` file.
+# Ad-hoc question, full graph reasoning
+poetry run python -m GoT --benchmark custom --mode graph --prompt "Explain and solve: integral of x^2 dx from 0 to 3"
 
-### Release a new version on PyPi
+# Baseline (non-graph) agent on 10 GSM8K problems
+poetry run python -m GoT --benchmark gsm8k --mode standard --max_run 10
 
-New versions are automatically released on PyPi via GitHub Actions, when a push is made on the `main` or `master` branch.
+# Graph agent on Hendrycks MATH, geometry category
+poetry run python -m GoT --benchmark hendrycks_math --mode graph --category geometry --max_run 5
+```
 
-The version number is updated automatically by the `semantic-release` tool, which uses the commit messages to infer the type of the release (major, minor, patch).
+Results are written as JSON in the working directory (e.g. `graph_benchmark_results.json`, `test_benchmark_results.json`, `<model_name>_eval_results.json`), and every run is traced in MLflow.
 
-It is paramount that the commit messages follow the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) specification,
-in order for `semantic-release` to compute version numbers correctly.
+## Development
 
-## Automatic updates of dependencies (via Renovate)
+```bash
+poetry install                    # install runtime + dev dependencies
 
-The project is configured to use [Renovate](https://docs.renovatebot.com/) to automatically open pull-requests
-to update dependencies declared in `pyproject.toml`.
+poetry run poe test               # run the pytest suite
+poetry run poe coverage           # run tests with coverage
+poetry run poe coverage-report    # print coverage summary
+poetry run poe coverage-html      # generate an HTML coverage report (htmlcov/)
 
-By default, Renovate will assign such pull-requests to the user who created the repository from this template.
+poetry run poe static-checks      # ruff check + mypy
+poetry run poe format             # auto-format with ruff
+poetry run poe format-check       # check formatting without modifying files
+poetry run poe compile            # byte-compile the package and tests (syntax check)
+```
 
-If the project has tests (which is the case for this template), Renovate will only merge such pull-requests
-if all tests pass.
+CI (`.github/workflows/check.yml`) runs the same static checks, formatting check, and coverage on every push/PR, then runs the test suite across Python 3.10–3.13 on Ubuntu, Windows, and macOS.
 
-When some test fails, Renovate will leave a comment on the pull-request, so that you can fix the issue manually.
+## Project structure
 
-To make Renovate work, you need to enable it for your repository.
-To do so, please follow the instruction at <https://docs.renovatebot.com/getting-started/installing-onboarding/#hosted-githubcom-app>
+```text
+GoT/
+├── GoT/
+│   ├── __main__.py            # `python -m GoT` entry point
+│   ├── cli/parse_args.py      # argparse CLI definition
+│   ├── core/
+│   │   ├── llm.py             # LLM roles (Gemini remote + local Ollama), tool wiring
+│   │   ├── graph_model.py     # LangGraph graph definition / orchestration
+│   │   └── runtime_graph.py   # Reasoning-graph node types + Mermaid export
+│   ├── agent_tools/           # math_tool, web_tool (Wikipedia/arXiv), craft_tool, runtime_graph_tool, ai_tool (crafted tools land here)
+│   ├── experiments/           # lm-eval-harness wrappers + per-benchmark dataset formatters
+│   └── utils/utils.py         # answer parsing/normalization helpers
+├── tests/                     # unit tests (pytest)
+├── pyproject.toml             # Poetry config, dependencies, poe tasks
+└── .github/workflows/         # CI (check.yml) and release (deploy.yml)
+```
 
-Finally, please remember to enable PR auto-merging in your repository settings, otherwise Renovate will not be able to merge
-the pull-requests it opens, even if all tests pass.
-To do so, please follow the instructions available [here](https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-auto-merge-for-pull-requests-in-your-repository#managing-auto-merge).
+## License
 
-> Notice that the combination between Renovate, and Semantic Release may lead to a number of releases being created automatically.
+See [`LICENSE`](./LICENSE).
diff --git a/tasks.txt b/tasks.txt
deleted file mode 100644
index 323f6d2..0000000
--- a/tasks.txt
+++ /dev/null
@@ -1,292 +0,0 @@
-gsm8k
-
-### BIGBENCH ###
-- bigbench_abstract_narrative_understanding_generate_until
-  - bigbench_abstract_narrative_understanding_multiple_choice
-  - bigbench_anachronisms_generate_until
-  - bigbench_anachronisms_multiple_choice
-  - bigbench_analogical_similarity_generate_until
-  - bigbench_analogical_similarity_multiple_choice
-  - bigbench_analytic_entailment_generate_until
-  - bigbench_analytic_entailment_multiple_choice
-  - bigbench_arithmetic_generate_until
-  - bigbench_arithmetic_multiple_choice
-  - bigbench_ascii_word_recognition_generate_until
-  - bigbench_authorship_verification_generate_until
-  - bigbench_authorship_verification_multiple_choice
-  - bigbench_auto_categorization_generate_until
-  - bigbench_auto_debugging_generate_until
-  - bigbench_bbq_lite_json_generate_until
-  - bigbench_bbq_lite_json_multiple_choice
-  - bigbench_bridging_anaphora_resolution_barqa_generate_until
-  - bigbench_causal_judgment_generate_until
-  - bigbench_causal_judgment_multiple_choice
-  - bigbench_cause_and_effect_generate_until
-  - bigbench_cause_and_effect_multiple_choice
-  - bigbench_checkmate_in_one_generate_until
-  - bigbench_checkmate_in_one_multiple_choice
-  - bigbench_chess_state_tracking_generate_until
-  - bigbench_chinese_remainder_theorem_generate_until
-  - bigbench_cifar10_classification_generate_until
-  - bigbench_cifar10_classification_multiple_choice
-  - bigbench_code_line_description_generate_until
-  - bigbench_code_line_description_multiple_choice
-  - bigbench_codenames_generate_until
-  - bigbench_color_generate_until
-  - bigbench_color_multiple_choice
-  - bigbench_common_morpheme_generate_until
-  - bigbench_common_morpheme_multiple_choice
-  - bigbench_conceptual_combinations_generate_until
-  - bigbench_conceptual_combinations_multiple_choice
-  - bigbench_conlang_translation_generate_until
-  - bigbench_contextual_parametric_knowledge_conflicts_generate_until      
-  - bigbench_contextual_parametric_knowledge_conflicts_multiple_choice     
-  - bigbench_crash_blossom_generate_until
-  - bigbench_crash_blossom_multiple_choice
-  - bigbench_crass_ai_generate_until
-  - bigbench_crass_ai_multiple_choice
-  - bigbench_cryobiology_spanish_generate_until
-  - bigbench_cryobiology_spanish_multiple_choice
-  - bigbench_cryptonite_generate_until
-  - bigbench_cs_algorithms_generate_until
-  - bigbench_cs_algorithms_multiple_choice
-  - bigbench_dark_humor_detection_generate_until
-  - bigbench_dark_humor_detection_multiple_choice
-  - bigbench_date_understanding_generate_until
-  - bigbench_date_understanding_multiple_choice
-  - bigbench_disambiguation_qa_generate_until
-  - bigbench_disambiguation_qa_multiple_choice
-  - bigbench_discourse_marker_prediction_generate_until
-  - bigbench_discourse_marker_prediction_multiple_choice
-  - bigbench_disfl_qa_generate_until
-  - bigbench_dyck_languages_generate_until
-  - bigbench_dyck_languages_multiple_choice
-  - bigbench_elementary_math_qa_generate_until
-  - bigbench_elementary_math_qa_multiple_choice
-  - bigbench_emoji_movie_generate_until
-  - bigbench_emoji_movie_multiple_choice
-  - bigbench_emojis_emotion_prediction_generate_until
-  - bigbench_emojis_emotion_prediction_multiple_choice
-  - bigbench_empirical_judgments_generate_until
-  - bigbench_empirical_judgments_multiple_choice
-  - bigbench_english_proverbs_generate_until
-  - bigbench_english_proverbs_multiple_choice
-  - bigbench_english_russian_proverbs_generate_until
-  - bigbench_english_russian_proverbs_multiple_choice
-  - bigbench_entailed_polarity_generate_until
-  - bigbench_entailed_polarity_hindi_generate_until
-  - bigbench_entailed_polarity_hindi_multiple_choice
-  - bigbench_entailed_polarity_multiple_choice
-  - bigbench_epistemic_reasoning_generate_until
-  - bigbench_epistemic_reasoning_multiple_choice
-  - bigbench_evaluating_information_essentiality_generate_until
-  - bigbench_evaluating_information_essentiality_multiple_choice
-  - bigbench_fact_checker_generate_until
-  - bigbench_fact_checker_multiple_choice
-  - bigbench_fantasy_reasoning_generate_until
-  - bigbench_fantasy_reasoning_multiple_choice
-  - bigbench_few_shot_nlg_generate_until
-  - bigbench_figure_of_speech_detection_generate_until
-  - bigbench_figure_of_speech_detection_multiple_choice
-  - bigbench_formal_fallacies_syllogisms_negation_generate_until
-  - bigbench_formal_fallacies_syllogisms_negation_multiple_choice
-  - bigbench_gem_generate_until
-  - bigbench_gender_inclusive_sentences_german_generate_until
-  - bigbench_general_knowledge_generate_until
-  - bigbench_general_knowledge_multiple_choice
-  - bigbench_generate_until
-  - bigbench_geometric_shapes_generate_until
-  - bigbench_geometric_shapes_multiple_choice
-  - bigbench_goal_step_wikihow_generate_until
-  - bigbench_goal_step_wikihow_multiple_choice
-  - bigbench_gre_reading_comprehension_generate_until
-  - bigbench_gre_reading_comprehension_multiple_choice
-  - bigbench_hhh_alignment_generate_until
-  - bigbench_hhh_alignment_multiple_choice
-  - bigbench_hindi_question_answering_generate_until
-  - bigbench_hindu_knowledge_generate_until
-  - bigbench_hindu_knowledge_multiple_choice
-  - bigbench_hinglish_toxicity_generate_until
-  - bigbench_hinglish_toxicity_multiple_choice
-  - bigbench_human_organs_senses_generate_until
-  - bigbench_human_organs_senses_multiple_choice
-  - bigbench_hyperbaton_generate_until
-  - bigbench_hyperbaton_multiple_choice
-  - bigbench_identify_math_theorems_generate_until
-  - bigbench_identify_math_theorems_multiple_choice
-  - bigbench_identify_odd_metaphor_generate_until
-  - bigbench_identify_odd_metaphor_multiple_choice
-  - bigbench_implicatures_generate_until
-  - bigbench_implicatures_multiple_choice
-  - bigbench_implicit_relations_generate_until
-  - bigbench_implicit_relations_multiple_choice
-  - bigbench_intent_recognition_generate_until
-  - bigbench_intent_recognition_multiple_choice
-  - bigbench_international_phonetic_alphabet_nli_generate_until
-  - bigbench_international_phonetic_alphabet_nli_multiple_choice
-  - bigbench_international_phonetic_alphabet_transliterate_generate_until  
-  - bigbench_intersect_geometry_generate_until
-  - bigbench_intersect_geometry_multiple_choice
-  - bigbench_irony_identification_generate_until
-  - bigbench_irony_identification_multiple_choice
-  - bigbench_kanji_ascii_generate_until
-  - bigbench_kanji_ascii_multiple_choice
-  - bigbench_kannada_generate_until
-  - bigbench_kannada_multiple_choice
-  - bigbench_key_value_maps_generate_until
-  - bigbench_key_value_maps_multiple_choice
-  - bigbench_known_unknowns_generate_until
-  - bigbench_known_unknowns_multiple_choice
-  - bigbench_language_games_generate_until
-  - bigbench_language_identification_generate_until
-  - bigbench_language_identification_multiple_choice
-  - bigbench_linguistic_mappings_generate_until
-  - bigbench_linguistics_puzzles_generate_until
-  - bigbench_list_functions_generate_until
-  - bigbench_logic_grid_puzzle_generate_until
-  - bigbench_logic_grid_puzzle_multiple_choice
-  - bigbench_logical_args_generate_until
-  - bigbench_logical_args_multiple_choice
-  - bigbench_logical_deduction_generate_until
-  - bigbench_logical_deduction_multiple_choice
-  - bigbench_logical_fallacy_detection_generate_until
-  - bigbench_logical_fallacy_detection_multiple_choice
-  - bigbench_logical_sequence_generate_until
-  - bigbench_logical_sequence_multiple_choice
-  - bigbench_mathematical_induction_generate_until
-  - bigbench_mathematical_induction_multiple_choice
-  - bigbench_matrixshapes_generate_until
-  - bigbench_metaphor_boolean_generate_until
-  - bigbench_metaphor_boolean_multiple_choice
-  - bigbench_metaphor_understanding_generate_until
-  - bigbench_metaphor_understanding_multiple_choice
-  - bigbench_minute_mysteries_qa_generate_until
-  - bigbench_misconceptions_generate_until
-  - bigbench_misconceptions_multiple_choice
-  - bigbench_misconceptions_russian_generate_until
-  - bigbench_misconceptions_russian_multiple_choice
-  - bigbench_mnist_ascii_generate_until
-  - bigbench_mnist_ascii_multiple_choice
-  - bigbench_modified_arithmetic_generate_until
-  - bigbench_moral_permissibility_generate_until
-  - bigbench_moral_permissibility_multiple_choice
-  - bigbench_movie_dialog_same_or_different_generate_until
-  - bigbench_movie_dialog_same_or_different_multiple_choice
-  - bigbench_movie_recommendation_generate_until
-  - bigbench_movie_recommendation_multiple_choice
-  - bigbench_mult_data_wrangling_generate_until
-  - bigbench_multiemo_generate_until
-  - bigbench_multiemo_multiple_choice
-  - bigbench_multiple_choice_a
-  - bigbench_multiple_choice_b
-  - bigbench_natural_instructions_generate_until
-  - bigbench_navigate_generate_until
-  - bigbench_navigate_multiple_choice
-  - bigbench_nonsense_words_grammar_generate_until
-  - bigbench_nonsense_words_grammar_multiple_choice
-  - bigbench_novel_concepts_generate_until
-  - bigbench_novel_concepts_multiple_choice
-  - bigbench_object_counting_generate_until
-  - bigbench_odd_one_out_generate_until
-  - bigbench_odd_one_out_multiple_choice
-  - bigbench_operators_generate_until
-  - bigbench_paragraph_segmentation_generate_until
-  - bigbench_parsinlu_qa_generate_until
-  - bigbench_parsinlu_qa_multiple_choice
-  - bigbench_parsinlu_reading_comprehension_generate_until
-  - bigbench_penguins_in_a_table_generate_until
-  - bigbench_penguins_in_a_table_multiple_choice
-  - bigbench_periodic_elements_generate_until
-  - bigbench_periodic_elements_multiple_choice
-  - bigbench_persian_idioms_generate_until
-  - bigbench_persian_idioms_multiple_choice
-  - bigbench_phrase_relatedness_generate_until
-  - bigbench_phrase_relatedness_multiple_choice
-  - bigbench_physical_intuition_generate_until
-  - bigbench_physical_intuition_multiple_choice
-  - bigbench_physics_generate_until
-  - bigbench_physics_multiple_choice
-  - bigbench_physics_questions_generate_until
-  - bigbench_play_dialog_same_or_different_generate_until
-  - bigbench_play_dialog_same_or_different_multiple_choice
-  - bigbench_polish_sequence_labeling_generate_until
-  - bigbench_presuppositions_as_nli_generate_until
-  - bigbench_presuppositions_as_nli_multiple_choice
-  - bigbench_qa_wikidata_generate_until
-  - bigbench_question_selection_generate_until
-  - bigbench_question_selection_multiple_choice
-  - bigbench_real_or_fake_text_generate_until
-  - bigbench_real_or_fake_text_multiple_choice
-  - bigbench_reasoning_about_colored_objects_generate_until
-  - bigbench_reasoning_about_colored_objects_multiple_choice
-  - bigbench_repeat_copy_logic_generate_until
-  - bigbench_rephrase_generate_until
-  - bigbench_riddle_sense_generate_until
-  - bigbench_riddle_sense_multiple_choice
-  - bigbench_ruin_names_generate_until
-  - bigbench_ruin_names_multiple_choice
-  - bigbench_salient_translation_error_detection_generate_until
-  - bigbench_salient_translation_error_detection_multiple_choice
-  - bigbench_scientific_press_release_generate_until
-  - bigbench_semantic_parsing_in_context_sparc_generate_until
-  - bigbench_semantic_parsing_spider_generate_until
-  - bigbench_sentence_ambiguity_generate_until
-  - bigbench_sentence_ambiguity_multiple_choice
-  - bigbench_similarities_abstraction_generate_until
-  - bigbench_similarities_abstraction_multiple_choice
-  - bigbench_simp_turing_concept_generate_until
-  - bigbench_simple_arithmetic_json_generate_until
-  - bigbench_simple_arithmetic_json_multiple_choice_generate_until
-  - bigbench_simple_arithmetic_json_subtasks_generate_until
-  - bigbench_simple_arithmetic_multiple_targets_json_generate_until        
-  - bigbench_simple_ethical_questions_generate_until
-  - bigbench_simple_ethical_questions_multiple_choice
-  - bigbench_simple_text_editing_generate_until
-  - bigbench_snarks_generate_until
-  - bigbench_snarks_multiple_choice
-  - bigbench_social_iqa_generate_until
-  - bigbench_social_iqa_multiple_choice
-  - bigbench_social_support_generate_until
-  - bigbench_social_support_multiple_choice
-  - bigbench_sports_understanding_generate_until
-  - bigbench_sports_understanding_multiple_choice
-  - bigbench_strange_stories_generate_until
-  - bigbench_strange_stories_multiple_choice
-  - bigbench_strategyqa_generate_until
-  - bigbench_strategyqa_multiple_choice
-  - bigbench_sufficient_information_generate_until
-  - bigbench_suicide_risk_generate_until
-  - bigbench_suicide_risk_multiple_choice
-  - bigbench_swahili_english_proverbs_generate_until
-  - bigbench_swahili_english_proverbs_multiple_choice
-  - bigbench_swedish_to_german_proverbs_generate_until
-  - bigbench_swedish_to_german_proverbs_multiple_choice
-  - bigbench_symbol_interpretation_generate_until
-  - bigbench_symbol_interpretation_multiple_choice
-  - bigbench_temporal_sequences_generate_until
-  - bigbench_temporal_sequences_multiple_choice
-  - bigbench_tense_generate_until
-  - bigbench_timedial_generate_until
-  - bigbench_timedial_multiple_choice
-  - bigbench_topical_chat_generate_until
-  - bigbench_tracking_shuffled_objects_generate_until
-  - bigbench_tracking_shuffled_objects_multiple_choice
-  - bigbench_understanding_fables_generate_until
-  - bigbench_understanding_fables_multiple_choice
-  - bigbench_undo_permutation_generate_until
-  - bigbench_undo_permutation_multiple_choice
-  - bigbench_unit_conversion_generate_until
-  - bigbench_unit_conversion_multiple_choice
-  - bigbench_unit_interpretation_generate_until
-  - bigbench_unit_interpretation_multiple_choice
-  - bigbench_unnatural_in_context_learning_generate_until
-  - bigbench_vitaminc_fact_verification_generate_until
-  - bigbench_vitaminc_fact_verification_multiple_choice
-  - bigbench_what_is_the_tao_generate_until
-  - bigbench_what_is_the_tao_multiple_choice
-  - bigbench_which_wiki_edit_generate_until
-  - bigbench_which_wiki_edit_multiple_choice
-  - bigbench_winowhy_generate_until
-  - bigbench_winowhy_multiple_choice
-  - bigbench_word_sorting_generate_until
-  - bigbench_word_unscrambling_generate_until
\ No newline at end of file