diff --git a/README.md b/README.md
index 5120c07..79a2786 100644
--- a/README.md
+++ b/README.md
@@ -1,162 +1,141 @@
-# Python project template
+# Agentic GoT
-A simple template of Python projects, with a rigid file structure, and predisposition for unit testing and release on PyPi.
+Agentic GoT (**Graph of Thought**) is a LangChain / LangGraph based reasoning agent that solves problems by building and traversing a graph of intermediate reasoning, tool-call, scoring, and backtracking nodes, instead of a single linear chain-of-thought. It ships with:
-## Relevant features
+- A **runtime reasoning graph** (`GoT/core/runtime_graph.py`, `GoT/core/graph_model.py`) with typed nodes (`GoalNode`, `ReasoningNode`, `ToolNode`, `TestNode`, `CraftingNode`, `BacktrackNode`, `CompletitionNode`, `ResponseNode`) and Mermaid export for visualizing a run.
+- A pluggable **tool belt**: arithmetic (`agent_tools/math_tool.py`), web/knowledge lookup via Wikipedia and arXiv (`agent_tools/web_tool.py`), a sandboxed Python executor, and a **tool-crafting tool** that lets the agent write and persist brand-new tools for itself at runtime (`agent_tools/craft_tool.py`).
+- **Benchmark harnesses** for GSM8K, GPQA (diamond), Hendrycks MATH, and GAIA, wired into [`lm-eval-harness`](https://github.com/EleutherAI/lm-evaluation-harness) (`GoT/experiments/`), so the graph agent (and a plain baseline agent) can be scored automatically.
+- **MLflow** autologging for OpenAI/Gemini/LangChain calls, so every run is traced and inspectable.
-- All your project code into a single main package (`GoT/`)
-- All your project tests into a single test package (`test/`)
-- Unit testing support via [`unittest`](https://docs.python.org/3/library/unittest.html)
-- Automatic testing on all branches via GitHub Actions
-- Semi-automatic versioning via Git
-- Packaging support via [`setuptools`](https://setuptools.pypa.io/en/latest/setuptools.html)
-- Automatic release on [PyPi](https://pypi.org/) via GitHub Actions and [`semantic-release`](https://semantic-release.gitbook.io)
-- Automatic dependencies updates via [Renovate](https://docs.renovatebot.com/)
+## Requirements
-## Project structure
-
-Overview:
-```bash
-
-├── GoT/ # main package (should be named after your project)
-│ ├── __init__.py # python package marker
-│ └── __main__.py # application entry point
-├── tests/ # test package (should contain unit tests)
-├── .github/ # configuration of GitHub CI
-│ └── workflows/ # configuration of GitHub Workflows
-│ ├── check.yml # runs tests on multiple OS and versions of Python
-│ └── deploy.yml # if check succeeds, and the current branch is one of {main, master}, triggers automatic releas on PyPi
-├── LICENSE # license file (Apache 2.0 by default)
-├── pyproject.toml # project configuration file as prescribed by Poetry
-├── renovate.json # configuration of Renovate bot, for automatic dependency updates
-├── requirements.txt # only declares a dependency on Poetry. DO NOT EDIT THIS FILE
-└── release.config.js # script to release on PyPi, and GitHub via semantic-release
-```
-
-## TODO-list for template usage
-
-1. Use this template to create a new GitHub repository, say `GoT`
- this name will also be used to identify the package on PyPi
- + so, we suggest choosing a name which has not been used on PyPi, yet
- + we also suggest choosing a name which is a valid Python package name (i.e. `using_snake_case`)
+| Tool | Version | Notes |
+|---|---|---|
+| Python | `>=3.10, <3.14` | CI tests on 3.10–3.13, on Ubuntu/Windows/macOS |
+| [Poetry](https://python-poetry.org/) | `^2.2` | dependency & venv management |
+| [Ollama](https://ollama.com/) | any recent | optional, only needed for running local Ollama models|
-2. Clone the `GoT` repository
+## Quick start
-3. Open a shell into your local `GoT` directory and run
- ```bash
- ./rename-template.sh GoT
- ```
-
- This will coherently rename the template's project name with the one chosen by you (i.e. `GoT`, in this example)
-
- * __Remark__: this step is now automatic thanks to the `init.yml` workflow which is triggered when using this template to create a new repository
-
-4. Commit & push
-
-5. Ensure you like the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html). If you don't, change the content of the `LICENSE` file
-
-6. Ensure the versions-range of Python reported in `pyproject.toml` fits the versions you want to support
- + currently defaults to `>= 3.9`
- + if you change this, please also change the versions of Python tests should be run on in CI, by looking the file `.github/workflows/check.yml`
-
-7. Check the Python version and OS tests should be run on in CI, by looking the file `.github/workflows/check.yml`
+```bash
+# 1. Clone
+git clone https://github.com/MarkRagg/GoT.git
+cd GoT
-8. Add your runtime, development, and build dependencies to `pyproject.toml`
+# 2. Install Poetry (pinned version, isolated from your system Python)
+pip install -r requirements.txt
-9. Check the other metadata in `pyproject.toml`
+# 3. Install project + dev dependencies (creates an in-project .venv, see poetry.toml)
+poetry install
-10. Change the assignee for pull-requests for automatic dependency updates by editing `renovate.json`
- + currently defaults to @gciatto
+# 4. Configure environment variables (see below)
+cp .env.example .env # if present — otherwise just create .env, see next section
+$EDITOR .env
-11. Add your `PYPI_TOKEN` token as secrets of the GitHub repository
- this may require you to register on PyPi first
+# 5. Run the test suite to confirm everything is wired correctly
+poetry run poe test
-12. Generate a GitHub token and add it as a secret of the GitHub repository, named `RELEASE_TOKEN`
- cf.
- the token must allow pushing to the repository
+# 6. Run the agent on a custom prompt in graph mode
+poetry run python -m GoT --benchmark custom --mode graph --prompt "What is the square root of 144, then look up who proved it?"
+```
-13. Put your main (resp. test) code in `GoT/` (resp. `test/`)
+> Tip: run `poetry shell` once to activate the virtualenv, so you can drop the `poetry run` prefix for the rest of the session.
-## How to do stuff
+## Environment variables
-### Restore dev dependencies
+GoT loads environment variables from a `.env` file at import time via `python-dotenv` (see `GoT/__init__.py` and `GoT/core/llm.py`). Create a `.env` file in the repository root:
-1. Install Poetry if you don't have it yet
- ```bash
- pip install -r requirements.txt
- ```
+```dotenv
+# Required — Gemini is the default remote LLM backend for every agent role
+# (standard reasoning, structured/graph reasoning, tool crafting, and scoring).
+# Get a key at https://aistudio.google.com/app/apikey
+GEMINI_API_KEY=your-gemini-api-key
-2. Install the project's dependencies
- ```bash
- poetry install
- ```
+# Required only if you run benchmarks that pull gated Hugging Face datasets
+# (currently GPQA and GAIA). Get a token at https://huggingface.co/settings/tokens
+# and make sure your HF account has accepted the dataset's access terms.
+HF_TOKEN=your-huggingface-token
+```
-### Run Tests
- Execute the test suite using `pytest`:
- ```bash
- poetry run poe test
- ```
+| Variable | Required | Used by | Purpose |
+|---|---|---|---|
+| `GEMINI_API_KEY` | Yes (for any Gemini-backed run — the default) | `GoT/core/llm.py` | Authenticates the four `ChatGoogleGenerativeAI` roles (`remote_standard`, `remote_response_format`, `remote_score_format`, `remote_crafter`) that power reasoning, response formatting, scoring, and tool crafting. |
+| `HF_TOKEN` | Only for `--benchmark gpqa` / `--benchmark gaia` | `GoT/experiments/hf_formatter.py` | Downloads gated benchmark datasets from the Hugging Face Hub. `gsm8k` and `hendrycks_math` do not require it. |
-### Run Tests with Coverage
- Execute the test suite with coverage reporting:
- ```bash
- poetry run poe coverage
- ```
- and generate a report with `poe coverage-report` or `poe coverage-html`
+### Optional / no setup needed
+- **Local Ollama model** — `GoT/core/llm.py` also instantiates an `ollamaLLM` pointed at `http://localhost:11434/v1` with model `ministral-3:8b`, using the dummy API key `"dummy"` (Ollama's OpenAI-compatible endpoint doesn't check it). This path is only exercised if your own code selects it; it's not required for the default Gemini-backed CLI flows. If you want to use it: install [Ollama](https://ollama.com/download), then run `ollama pull ministral-3:8b` and make sure `ollama serve` is running before invoking GoT.
+- **MLflow** — tracing is enabled automatically (`mlflow.set_experiment("marcoraggini-experiment")` plus autolog for OpenAI/Gemini/LangChain) and writes to a local `./mlruns` directory by default. Point it at a remote tracking server instead by exporting `MLFLOW_TRACKING_URI` before running GoT — no code changes needed.
+- `.env` is already covered by `.gitignore` — never commit real API keys.
-### Run Static Checks
- Perform static code analysis using both `mypy` and `ruff`:
- ```bash
- poetry run poe static-checks
- ```
+## Usage
-### Format Code
- Format your code using `ruff`:
- ```bash
- poetry run poe format
- ```
+The package entry point (`GoT/__main__.py` → `GoT.main()`) parses CLI args via `GoT/cli/parse_args.py`:
-> Note: you can enter a Poetry shell via `poetry shell` to avoid prefixing commands with `poetry run`.
+```bash
+poetry run python -m GoT --benchmark --mode [options]
+```
-> Tests are automatically run in CI, on all pushes on all branches.
-> There, tests are executed on multiple OS (Win, Mac, Ubuntu) and on multiple Python versions.
+| Flag | Required | Values | Description |
+|---|---|---|---|
+| `--benchmark` | Yes | `gsm8k`, `gpqa`, `hendrycks_math`, `gaia`, `custom` | Which benchmark (or ad-hoc prompt) to run. |
+| `--mode` | Yes | `graph`, `standard` | `graph` runs the full Graph-of-Thought reasoning pipeline; `standard` runs a single-pass baseline agent. |
+| `--prompt` | Only for `custom` | free text | The prompt to run when `--benchmark custom` is selected. |
+| `--max_run` | No (default `1`) | int | Number of benchmark samples/iterations to run. |
+| `--category` | No (default `algebra`) | `algebra`, `counting_and_probability`, `geometry`, `intermediate_algebra`, `number_theory`, `precalculus`, `prealgebra` | Math subject filter, only used with `--benchmark hendrycks_math`. |
-### Run your code as an application
+Examples:
-This will execute the `__main__.py` file in the `GoT` package:
 ```bash
-poetry run python -m GoT
-```
-
-the latter is possible because of the script defined in the `pyproject.toml` file.
+# Ad-hoc question, full graph reasoning
+poetry run python -m GoT --benchmark custom --mode graph --prompt "Explain and solve: integral of x^2 dx from 0 to 3"
-### Release a new version on PyPi
+# Baseline (non-graph) agent on 10 GSM8K problems
+poetry run python -m GoT --benchmark gsm8k --mode standard --max_run 10
-New versions are automatically released on PyPi via GitHub Actions, when a push is made on the `main` or `master` branch.
+# Graph agent on Hendrycks MATH, geometry category
+poetry run python -m GoT --benchmark hendrycks_math --mode graph --category geometry --max_run 5
+```
-The version number is updated automatically by the `semantic-release` tool, which uses the commit messages to infer the type of the release (major, minor, patch).
+Results are written as JSON in the working directory (e.g. `graph_benchmark_results.json`, `test_benchmark_results.json`, `_eval_results.json`), and every run is traced in MLflow.
-It is paramount that the commit messages follow the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) specification,
-in order for `semantic-release` to compute version numbers correctly.
+## Development
-## Automatic updates of dependencies (via Renovate)
+```bash
+poetry install # install runtime + dev dependencies
-The project is configured to use [Renovate](https://docs.renovatebot.com/) to automatically open pull-requests
-to update dependencies declared in `pyproject.toml`.
+poetry run poe test # run the pytest suite
+poetry run poe coverage # run tests with coverage
+poetry run poe coverage-report # print coverage summary
+poetry run poe coverage-html # generate an HTML coverage report (htmlcov/)
-By default, Renovate will assign such pull-requests to the user who created the repository from this template.
+poetry run poe static-checks # ruff check + mypy
+poetry run poe format # auto-format with ruff
+poetry run poe format-check # check formatting without modifying files
+poetry run poe compile # byte-compile the package and tests (syntax check)
+```
-If the project has tests (which is the case for this template), Renovate will only merge such pull-requests
-if all tests pass.
+CI (`.github/workflows/check.yml`) runs the same static checks, formatting check, and coverage on every push/PR, then runs the test suite across Python 3.10–3.13 on Ubuntu, Windows, and macOS.
-When some test fails, Renovate will leave a comment on the pull-request, so that you can fix the issue manually.
+## Project structure
-To make Renovate work, you need to enable it for your repository.
-To do so, please follow the instruction at
+```
+GoT/
+├── GoT/
+│ ├── __main__.py # `python -m GoT` entry point
+│ ├── cli/parse_args.py # argparse CLI definition
+│ ├── core/
+│ │ ├── llm.py # LLM roles (Gemini remote + local Ollama), tool wiring
+│ │ ├── graph_model.py # LangGraph graph definition / orchestration
+│ │ └── runtime_graph.py # Reasoning-graph node types + Mermaid export
+│ ├── agent_tools/ # math_tool, web_tool (Wikipedia/arXiv), craft_tool, runtime_graph_tool, ai_tool (crafted tools land here)
+│ ├── experiments/ # lm-eval-harness wrappers + per-benchmark dataset formatters
+│ └── utils/utils.py # answer parsing/normalization helpers
+├── tests/ # unit tests (pytest)
+├── pyproject.toml # Poetry config, dependencies, poe tasks
+└── .github/workflows/ # CI (check.yml) and release (deploy.yml)
+```
-Finally, please remember to enable PR auto-merging in your repository settings, otherwise Renovate will not be able to merge
-the pull-requests it opens, even if all tests pass.
-To do so, please follow the instructions available [here](https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-auto-merge-for-pull-requests-in-your-repository#managing-auto-merge).
+## License
-> Notice that the combination between Renovate, and Semantic Release may lead to a number of releases being created automatically.
+See [`LICENSE`](./LICENSE).
diff --git a/tasks.txt b/tasks.txt
deleted file mode 100644
index 323f6d2..0000000
--- a/tasks.txt
+++ /dev/null
@@ -1,292 +0,0 @@
-gsm8k
-
-### BIGBENCH ###
-- bigbench_abstract_narrative_understanding_generate_until
- - bigbench_abstract_narrative_understanding_multiple_choice
- - bigbench_anachronisms_generate_until
- - bigbench_anachronisms_multiple_choice
- - bigbench_analogical_similarity_generate_until
- - bigbench_analogical_similarity_multiple_choice
- - bigbench_analytic_entailment_generate_until
- - bigbench_analytic_entailment_multiple_choice
- - bigbench_arithmetic_generate_until
- - bigbench_arithmetic_multiple_choice
- - bigbench_ascii_word_recognition_generate_until
- - bigbench_authorship_verification_generate_until
- - bigbench_authorship_verification_multiple_choice
- - bigbench_auto_categorization_generate_until
- - bigbench_auto_debugging_generate_until
- - bigbench_bbq_lite_json_generate_until
- - bigbench_bbq_lite_json_multiple_choice
- - bigbench_bridging_anaphora_resolution_barqa_generate_until
- - bigbench_causal_judgment_generate_until
- - bigbench_causal_judgment_multiple_choice
- - bigbench_cause_and_effect_generate_until
- - bigbench_cause_and_effect_multiple_choice
- - bigbench_checkmate_in_one_generate_until
- - bigbench_checkmate_in_one_multiple_choice
- - bigbench_chess_state_tracking_generate_until
- - bigbench_chinese_remainder_theorem_generate_until
- - bigbench_cifar10_classification_generate_until
- - bigbench_cifar10_classification_multiple_choice
- - bigbench_code_line_description_generate_until
- - bigbench_code_line_description_multiple_choice
- - bigbench_codenames_generate_until
- - bigbench_color_generate_until
- - bigbench_color_multiple_choice
- - bigbench_common_morpheme_generate_until
- - bigbench_common_morpheme_multiple_choice
- - bigbench_conceptual_combinations_generate_until
- - bigbench_conceptual_combinations_multiple_choice
- - bigbench_conlang_translation_generate_until
- - bigbench_contextual_parametric_knowledge_conflicts_generate_until
- - bigbench_contextual_parametric_knowledge_conflicts_multiple_choice
- - bigbench_crash_blossom_generate_until
- - bigbench_crash_blossom_multiple_choice
- - bigbench_crass_ai_generate_until
- - bigbench_crass_ai_multiple_choice
- - bigbench_cryobiology_spanish_generate_until
- - bigbench_cryobiology_spanish_multiple_choice
- - bigbench_cryptonite_generate_until
- - bigbench_cs_algorithms_generate_until
- - bigbench_cs_algorithms_multiple_choice
- - bigbench_dark_humor_detection_generate_until
- - bigbench_dark_humor_detection_multiple_choice
- - bigbench_date_understanding_generate_until
- - bigbench_date_understanding_multiple_choice
- - bigbench_disambiguation_qa_generate_until
- - bigbench_disambiguation_qa_multiple_choice
- - bigbench_discourse_marker_prediction_generate_until
- - bigbench_discourse_marker_prediction_multiple_choice
- - bigbench_disfl_qa_generate_until
- - bigbench_dyck_languages_generate_until
- - bigbench_dyck_languages_multiple_choice
- - bigbench_elementary_math_qa_generate_until
- - bigbench_elementary_math_qa_multiple_choice
- - bigbench_emoji_movie_generate_until
- - bigbench_emoji_movie_multiple_choice
- - bigbench_emojis_emotion_prediction_generate_until
- - bigbench_emojis_emotion_prediction_multiple_choice
- - bigbench_empirical_judgments_generate_until
- - bigbench_empirical_judgments_multiple_choice
- - bigbench_english_proverbs_generate_until
- - bigbench_english_proverbs_multiple_choice
- - bigbench_english_russian_proverbs_generate_until
- - bigbench_english_russian_proverbs_multiple_choice
- - bigbench_entailed_polarity_generate_until
- - bigbench_entailed_polarity_hindi_generate_until
- - bigbench_entailed_polarity_hindi_multiple_choice
- - bigbench_entailed_polarity_multiple_choice
- - bigbench_epistemic_reasoning_generate_until
- - bigbench_epistemic_reasoning_multiple_choice
- - bigbench_evaluating_information_essentiality_generate_until
- - bigbench_evaluating_information_essentiality_multiple_choice
- - bigbench_fact_checker_generate_until
- - bigbench_fact_checker_multiple_choice
- - bigbench_fantasy_reasoning_generate_until
- - bigbench_fantasy_reasoning_multiple_choice
- - bigbench_few_shot_nlg_generate_until
- - bigbench_figure_of_speech_detection_generate_until
- - bigbench_figure_of_speech_detection_multiple_choice
- - bigbench_formal_fallacies_syllogisms_negation_generate_until
- - bigbench_formal_fallacies_syllogisms_negation_multiple_choice
- - bigbench_gem_generate_until
- - bigbench_gender_inclusive_sentences_german_generate_until
- - bigbench_general_knowledge_generate_until
- - bigbench_general_knowledge_multiple_choice
- - bigbench_generate_until
- - bigbench_geometric_shapes_generate_until
- - bigbench_geometric_shapes_multiple_choice
- - bigbench_goal_step_wikihow_generate_until
- - bigbench_goal_step_wikihow_multiple_choice
- - bigbench_gre_reading_comprehension_generate_until
- - bigbench_gre_reading_comprehension_multiple_choice
- - bigbench_hhh_alignment_generate_until
- - bigbench_hhh_alignment_multiple_choice
- - bigbench_hindi_question_answering_generate_until
- - bigbench_hindu_knowledge_generate_until
- - bigbench_hindu_knowledge_multiple_choice
- - bigbench_hinglish_toxicity_generate_until
- - bigbench_hinglish_toxicity_multiple_choice
- - bigbench_human_organs_senses_generate_until
- - bigbench_human_organs_senses_multiple_choice
- - bigbench_hyperbaton_generate_until
- - bigbench_hyperbaton_multiple_choice
- - bigbench_identify_math_theorems_generate_until
- - bigbench_identify_math_theorems_multiple_choice
- - bigbench_identify_odd_metaphor_generate_until
- - bigbench_identify_odd_metaphor_multiple_choice
- - bigbench_implicatures_generate_until
- - bigbench_implicatures_multiple_choice
- - bigbench_implicit_relations_generate_until
- - bigbench_implicit_relations_multiple_choice
- - bigbench_intent_recognition_generate_until
- - bigbench_intent_recognition_multiple_choice
- - bigbench_international_phonetic_alphabet_nli_generate_until
- - bigbench_international_phonetic_alphabet_nli_multiple_choice
- - bigbench_international_phonetic_alphabet_transliterate_generate_until
- - bigbench_intersect_geometry_generate_until
- - bigbench_intersect_geometry_multiple_choice
- - bigbench_irony_identification_generate_until
- - bigbench_irony_identification_multiple_choice
- - bigbench_kanji_ascii_generate_until
- - bigbench_kanji_ascii_multiple_choice
- - bigbench_kannada_generate_until
- - bigbench_kannada_multiple_choice
- - bigbench_key_value_maps_generate_until
- - bigbench_key_value_maps_multiple_choice
- - bigbench_known_unknowns_generate_until
- - bigbench_known_unknowns_multiple_choice
- - bigbench_language_games_generate_until
- - bigbench_language_identification_generate_until
- - bigbench_language_identification_multiple_choice
- - bigbench_linguistic_mappings_generate_until
- - bigbench_linguistics_puzzles_generate_until
- - bigbench_list_functions_generate_until
- - bigbench_logic_grid_puzzle_generate_until
- - bigbench_logic_grid_puzzle_multiple_choice
- - bigbench_logical_args_generate_until
- - bigbench_logical_args_multiple_choice
- - bigbench_logical_deduction_generate_until
- - bigbench_logical_deduction_multiple_choice
- - bigbench_logical_fallacy_detection_generate_until
- - bigbench_logical_fallacy_detection_multiple_choice
- - bigbench_logical_sequence_generate_until
- - bigbench_logical_sequence_multiple_choice
- - bigbench_mathematical_induction_generate_until
- - bigbench_mathematical_induction_multiple_choice
- - bigbench_matrixshapes_generate_until
- - bigbench_metaphor_boolean_generate_until
- - bigbench_metaphor_boolean_multiple_choice
- - bigbench_metaphor_understanding_generate_until
- - bigbench_metaphor_understanding_multiple_choice
- - bigbench_minute_mysteries_qa_generate_until
- - bigbench_misconceptions_generate_until
- - bigbench_misconceptions_multiple_choice
- - bigbench_misconceptions_russian_generate_until
- - bigbench_misconceptions_russian_multiple_choice
- - bigbench_mnist_ascii_generate_until
- - bigbench_mnist_ascii_multiple_choice
- - bigbench_modified_arithmetic_generate_until
- - bigbench_moral_permissibility_generate_until
- - bigbench_moral_permissibility_multiple_choice
- - bigbench_movie_dialog_same_or_different_generate_until
- - bigbench_movie_dialog_same_or_different_multiple_choice
- - bigbench_movie_recommendation_generate_until
- - bigbench_movie_recommendation_multiple_choice
- - bigbench_mult_data_wrangling_generate_until
- - bigbench_multiemo_generate_until
- - bigbench_multiemo_multiple_choice
- - bigbench_multiple_choice_a
- - bigbench_multiple_choice_b
- - bigbench_natural_instructions_generate_until
- - bigbench_navigate_generate_until
- - bigbench_navigate_multiple_choice
- - bigbench_nonsense_words_grammar_generate_until
- - bigbench_nonsense_words_grammar_multiple_choice
- - bigbench_novel_concepts_generate_until
- - bigbench_novel_concepts_multiple_choice
- - bigbench_object_counting_generate_until
- - bigbench_odd_one_out_generate_until
- - bigbench_odd_one_out_multiple_choice
- - bigbench_operators_generate_until
- - bigbench_paragraph_segmentation_generate_until
- - bigbench_parsinlu_qa_generate_until
- - bigbench_parsinlu_qa_multiple_choice
- - bigbench_parsinlu_reading_comprehension_generate_until
- - bigbench_penguins_in_a_table_generate_until
- - bigbench_penguins_in_a_table_multiple_choice
- - bigbench_periodic_elements_generate_until
- - bigbench_periodic_elements_multiple_choice
- - bigbench_persian_idioms_generate_until
- - bigbench_persian_idioms_multiple_choice
- - bigbench_phrase_relatedness_generate_until
- - bigbench_phrase_relatedness_multiple_choice
- - bigbench_physical_intuition_generate_until
- - bigbench_physical_intuition_multiple_choice
- - bigbench_physics_generate_until
- - bigbench_physics_multiple_choice
- - bigbench_physics_questions_generate_until
- - bigbench_play_dialog_same_or_different_generate_until
- - bigbench_play_dialog_same_or_different_multiple_choice
- - bigbench_polish_sequence_labeling_generate_until
- - bigbench_presuppositions_as_nli_generate_until
- - bigbench_presuppositions_as_nli_multiple_choice
- - bigbench_qa_wikidata_generate_until
- - bigbench_question_selection_generate_until
- - bigbench_question_selection_multiple_choice
- - bigbench_real_or_fake_text_generate_until
- - bigbench_real_or_fake_text_multiple_choice
- - bigbench_reasoning_about_colored_objects_generate_until
- - bigbench_reasoning_about_colored_objects_multiple_choice
- - bigbench_repeat_copy_logic_generate_until
- - bigbench_rephrase_generate_until
- - bigbench_riddle_sense_generate_until
- - bigbench_riddle_sense_multiple_choice
- - bigbench_ruin_names_generate_until
- - bigbench_ruin_names_multiple_choice
- - bigbench_salient_translation_error_detection_generate_until
- - bigbench_salient_translation_error_detection_multiple_choice
- - bigbench_scientific_press_release_generate_until
- - bigbench_semantic_parsing_in_context_sparc_generate_until
- - bigbench_semantic_parsing_spider_generate_until
- - bigbench_sentence_ambiguity_generate_until
- - bigbench_sentence_ambiguity_multiple_choice
- - bigbench_similarities_abstraction_generate_until
- - bigbench_similarities_abstraction_multiple_choice
- - bigbench_simp_turing_concept_generate_until
- - bigbench_simple_arithmetic_json_generate_until
- - bigbench_simple_arithmetic_json_multiple_choice_generate_until
- - bigbench_simple_arithmetic_json_subtasks_generate_until
- - bigbench_simple_arithmetic_multiple_targets_json_generate_until
- - bigbench_simple_ethical_questions_generate_until
- - bigbench_simple_ethical_questions_multiple_choice
- - bigbench_simple_text_editing_generate_until
- - bigbench_snarks_generate_until
- - bigbench_snarks_multiple_choice
- - bigbench_social_iqa_generate_until
- - bigbench_social_iqa_multiple_choice
- - bigbench_social_support_generate_until
- - bigbench_social_support_multiple_choice
- - bigbench_sports_understanding_generate_until
- - bigbench_sports_understanding_multiple_choice
- - bigbench_strange_stories_generate_until
- - bigbench_strange_stories_multiple_choice
- - bigbench_strategyqa_generate_until
- - bigbench_strategyqa_multiple_choice
- - bigbench_sufficient_information_generate_until
- - bigbench_suicide_risk_generate_until
- - bigbench_suicide_risk_multiple_choice
- - bigbench_swahili_english_proverbs_generate_until
- - bigbench_swahili_english_proverbs_multiple_choice
- - bigbench_swedish_to_german_proverbs_generate_until
- - bigbench_swedish_to_german_proverbs_multiple_choice
- - bigbench_symbol_interpretation_generate_until
- - bigbench_symbol_interpretation_multiple_choice
- - bigbench_temporal_sequences_generate_until
- - bigbench_temporal_sequences_multiple_choice
- - bigbench_tense_generate_until
- - bigbench_timedial_generate_until
- - bigbench_timedial_multiple_choice
- - bigbench_topical_chat_generate_until
- - bigbench_tracking_shuffled_objects_generate_until
- - bigbench_tracking_shuffled_objects_multiple_choice
- - bigbench_understanding_fables_generate_until
- - bigbench_understanding_fables_multiple_choice
- - bigbench_undo_permutation_generate_until
- - bigbench_undo_permutation_multiple_choice
- - bigbench_unit_conversion_generate_until
- - bigbench_unit_conversion_multiple_choice
- - bigbench_unit_interpretation_generate_until
- - bigbench_unit_interpretation_multiple_choice
- - bigbench_unnatural_in_context_learning_generate_until
- - bigbench_vitaminc_fact_verification_generate_until
- - bigbench_vitaminc_fact_verification_multiple_choice
- - bigbench_what_is_the_tao_generate_until
- - bigbench_what_is_the_tao_multiple_choice
- - bigbench_which_wiki_edit_generate_until
- - bigbench_which_wiki_edit_multiple_choice
- - bigbench_winowhy_generate_until
- - bigbench_winowhy_multiple_choice
- - bigbench_word_sorting_generate_until
- - bigbench_word_unscrambling_generate_until
\ No newline at end of file