Merged
95 commits
a9bf853
chore: add libraries
MarkRagg Feb 3, 2026
8fa0f12
feat: implement first version of custom compiled graph
MarkRagg Feb 9, 2026
8856bee
chore: add lm-eval
MarkRagg Feb 9, 2026
4fb3050
fix: fix judge messages, added the first question
MarkRagg Feb 9, 2026
07f31eb
chore: add multiply
MarkRagg Feb 10, 2026
5454477
chore: create model folder
MarkRagg Feb 10, 2026
fad35d1
fix: judge agent now return an object
MarkRagg Feb 10, 2026
4dc9660
chore: add Score BaseModel
MarkRagg Feb 10, 2026
9ee5193
chore: add util methods
MarkRagg Feb 10, 2026
e711cee
feat: implement lm eval benchmark testing
MarkRagg Feb 10, 2026
433277b
chore: resolve mypy problems
MarkRagg Feb 11, 2026
3f5b531
chore; add tasks.txt, remove unused import
MarkRagg Feb 11, 2026
a26cf5c
style: ruff format
MarkRagg Feb 11, 2026
1820825
chore: add a node for mermaid graph
MarkRagg Feb 11, 2026
f32a2ff
feat: TestNode now check the tools used
MarkRagg Feb 11, 2026
05bb4a7
feat: implement mermaid print for runtime graph
MarkRagg Feb 11, 2026
845d442
chore: improve and refine runtime graph
MarkRagg Feb 12, 2026
8852fbc
chore: add description on tools
MarkRagg Feb 13, 2026
3f2f514
feat: add Response format and feedback on llm backtracking
MarkRagg Feb 13, 2026
9ee7f94
chore: improve compiled graph
MarkRagg Feb 13, 2026
09451d9
chore: add clear of runtime graph
MarkRagg Feb 16, 2026
bfcc707
chore: improve prompt
MarkRagg Feb 16, 2026
36d7570
style: ruff format
MarkRagg Feb 16, 2026
9216693
fix: correct mypy errors
MarkRagg Feb 16, 2026
8c71858
feat: update python versions
MarkRagg Feb 16, 2026
7b7c943
fix: fix little errors
MarkRagg Feb 16, 2026
9404a98
chore: create two function for benchmark or custom test
MarkRagg Feb 16, 2026
407de96
style: ruff format
MarkRagg Feb 16, 2026
4befe10
chore: add boto3 for mlflow
MarkRagg Feb 18, 2026
0e93f06
fix: load env in the right place
MarkRagg Feb 18, 2026
e85ebde
chore: change temperature for formatted output
MarkRagg Feb 18, 2026
17bfacc
chore: add reset of Runtime id
MarkRagg Feb 18, 2026
014b1f9
chore: parsing output to filter correctly the response
MarkRagg Feb 18, 2026
73f9d31
chore: add normalize number
MarkRagg Feb 18, 2026
beba81c
chore: change method of graph call
MarkRagg Feb 18, 2026
163c5bd
feat: introduce lm eval test benchmark
MarkRagg Feb 18, 2026
33441ad
style: ruff format
MarkRagg Feb 18, 2026
8135df1
chore: add print function for bench testing
MarkRagg Feb 19, 2026
ee5b32f
feat: implementing reasoning node
MarkRagg Feb 19, 2026
ea4b665
style: ruff format
MarkRagg Feb 19, 2026
7efc991
chore: improve response format
MarkRagg Feb 21, 2026
5be77bb
fix: clea graph in case of exception
MarkRagg Feb 21, 2026
4c88edd
chore: update gitignore
MarkRagg Feb 21, 2026
d066977
chore: add division and removing useless sum tools
MarkRagg Feb 23, 2026
ed00256
chore: improve math tools
MarkRagg Feb 23, 2026
8bd1077
chore: force output format on node
MarkRagg Feb 24, 2026
cd17f09
fix: id node now are correct
MarkRagg Feb 26, 2026
a2c8894
style: ruff format
MarkRagg Feb 26, 2026
9eee32d
chore: fix for mypy
MarkRagg Feb 27, 2026
2faa227
chore: uncomment logger
MarkRagg Feb 27, 2026
ad7bfe7
chore: fix benchmark test
MarkRagg Mar 4, 2026
7796d64
chore: implement crafting tools
MarkRagg Mar 4, 2026
6e25816
feat: add crafting node to the state graph
MarkRagg Mar 4, 2026
1b2602a
chore: removing reasoning and add remote llm
MarkRagg Mar 7, 2026
318f42a
chore: add filter on benchmark print
MarkRagg Mar 7, 2026
981a8a2
chore: fix node bugs
MarkRagg Mar 9, 2026
038954b
chore: save tool in files
MarkRagg Mar 10, 2026
444eaec
chore: add filter in print
MarkRagg Mar 10, 2026
4b1bdd5
chore: add score threshold
MarkRagg Mar 11, 2026
2c5cbd4
feat: now LLM can create and dynamic add a tool
MarkRagg Mar 12, 2026
fb903fb
feat: add gemini apis
MarkRagg Mar 16, 2026
b6a3050
feat: LLM can use runtime crafted tools
MarkRagg Mar 16, 2026
4be202b
chore: improve agent prompts
MarkRagg Mar 16, 2026
26b1824
style: ruff format
MarkRagg Mar 16, 2026
b8dde84
chore: fix mypy
MarkRagg Mar 16, 2026
55edd84
chore: comment unused line
MarkRagg Mar 16, 2026
11c572c
chore: remove warning
MarkRagg Mar 16, 2026
ce7a1aa
style: ruff format
MarkRagg Mar 16, 2026
2272033
feat: implement gemini llms
MarkRagg Mar 24, 2026
7ed7da3
chore: temporarly suspend pypi publication
MarkRagg Mar 24, 2026
637ba71
chore: improve prompt and models
MarkRagg Mar 25, 2026
b141e8a
fix: fix prompt for not raise exception
MarkRagg Mar 26, 2026
131ad0a
chore: print loglikehood benchmark
MarkRagg Mar 26, 2026
f1ec74f
chore: add lm eval math
MarkRagg Mar 26, 2026
4abc304
core: use 2.5 gemini flash
MarkRagg Mar 26, 2026
3b492e8
chore: use gpqa diamond as benchmark
MarkRagg Mar 26, 2026
e729552
style: poe format
MarkRagg Mar 26, 2026
bdbf353
chore: fix for mypy
MarkRagg Mar 26, 2026
50acaa5
feat: create function to run gpqa hf dataset
MarkRagg Mar 27, 2026
637d454
chore: improve evaluation
MarkRagg Mar 27, 2026
45ffef0
chore: extract output also for gemini
MarkRagg Mar 27, 2026
bba6cea
chore: control syntax of crafted functions
MarkRagg Mar 31, 2026
20f7cc1
chore: tests goes on also if one test raise an exception
MarkRagg Mar 31, 2026
fe84c94
chore: add use_benchmark
MarkRagg Mar 31, 2026
460ced9
feat: add parser in order to launch from terminal
MarkRagg Mar 31, 2026
3492209
chore: change init for parse args
MarkRagg Mar 31, 2026
bbffc72
chore: add hendrycks math
MarkRagg Apr 8, 2026
adbe91d
fix: fix hendrycks parsing response and correct solution
MarkRagg Apr 9, 2026
24ce2a7
chore: add more control on crafting
MarkRagg Apr 9, 2026
a4ce18c
fix: fix args names
MarkRagg Apr 9, 2026
4abfccb
feat: add divide thought reasoning and prompt complexity
MarkRagg Apr 10, 2026
22a423e
style: ruff format
MarkRagg Apr 10, 2026
0a90482
chore: fix mypy
MarkRagg Apr 10, 2026
aa3f3bf
fix: fix json saving for ResultEval
MarkRagg Apr 10, 2026
0cc6d38
Merge branch 'main' into develop
MarkRagg Apr 14, 2026
2 changes: 1 addition & 1 deletion .github/workflows/deploy.yml
@@ -60,7 +60,7 @@ jobs:
 npm install
 npx semantic-release
 env:
-PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
+# PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
 GITHUB_TOKEN: ${{ secrets.RELEASE_TOKEN }}
 RELEASE_TEST_PYPI: ${{ github.event.repository.is_template || contains(github.repository, 'template') }}
 # dry run if not on main/master branch, or if initial commit
3 changes: 3 additions & 0 deletions .gitignore
@@ -142,3 +142,6 @@ dmypy.json
 
 # Pyre type checker
 .pyre/
+
+# lm eval cache
+hf_cache/
56 changes: 46 additions & 10 deletions GoT/__init__.py
@@ -1,39 +1,75 @@
 import json
 import logging
+from dotenv import load_dotenv
 
 from lm_eval import evaluator, tasks
-from GoT.model.graph_model import invoke_graph, set_prompt
-from GoT.model.lm_wrapper import LangGraphLMWrapper
+from GoT.model.graph_model import call_graph
+from GoT.model.lm_wrapper import LangGraphBigBenchWrapper, TestBigBenchWrapper
+from GoT.model.utils.parse_args import call_benchmark, defining_and_parse_args
+from GoT.model.utils.utils import (
+    print_benchmark_result,
+    print_benchmark_result_loglikehood,
+)
 
 logging.basicConfig(level=logging.DEBUG)
 logger = logging.getLogger("GoT")
 
+load_dotenv()
 
-def lm_eval_benchmark():
-    task_list = ["gsm8k"]
-    lm = LangGraphLMWrapper()
+# Possible filter = "flexible", "none", "strict"
+
+
+def lm_eval_test_benchmark():
+    task_name = "gpqa_diamond_zeroshot"
+    task_list = [task_name]
+    test_lm = TestBigBenchWrapper()
     task_dict = tasks.get_task_dict(task_list)
 
     results = evaluator.evaluate(
+        lm=test_lm,
+        task_dict=task_dict,
+        limit=2,  # Limit the number of samples
+        log_samples=True,
+        # samples={task_name: [20, 25, 100]},
+    )
+
+    # Save results to a JSON file
+    with open("test_benchmark_results.json", "w") as f:
+        json.dump(results["samples"], f, indent=2)
+
+    print_benchmark_result(results, task_name, filter="strict-match")
+
+
+def lm_eval_graph_benchmark():
+    # hendrycks_math_geometry
+    task_name = "gpqa_diamond_zeroshot"
+    task_list = [task_name]
+    lm = LangGraphBigBenchWrapper()
+    task_dict = tasks.get_task_dict(task_list)
+
+    results = evaluator.evaluate(
         lm=lm,
+        # limit=1,
         task_dict=task_dict,
-        limit=5,  # Limit to 2 samples for quick testing
+        samples={task_name: [20, 25]},
         log_samples=True,
     )
 
     # Save results to a JSON file
     with open("graph_benchmark_results.json", "w") as f:
-        json.dump(results, f, indent=2)
+        json.dump(results, f, indent=2, default=str)
+
+    print_benchmark_result_loglikehood(results, task_name, filter_val="none")
 
 
 def custom_test():
-    set_prompt("What is 4726621 + 2 * 392 - 3432?")
-    invoke_graph()
+    call_graph("Solve this integral ∫x2⋅ex2dx")
 
 
 def main():
     # It could be changed with custom_test() to test a custom problem instead of the benchmark
-    lm_eval_benchmark()
+    args = defining_and_parse_args()
+    call_benchmark(args)
 
 
 # let this be the last line of this file
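
Note on the `json.dump(results, f, indent=2, default=str)` change in `lm_eval_graph_benchmark`: the sketch below shows what `default=str` does in general. It assumes the results dict holds at least one value the `json` module cannot serialize natively. The diff does not show which types lm-eval actually returns, so the `results` dict here is a hypothetical stand-in, not real benchmark output.

```python
import json
from pathlib import Path

# Hypothetical stand-in for evaluator output; the real structure is not shown in the diff.
results = {
    "task": "gpqa_diamond_zeroshot",
    "cache_dir": Path("hf_cache"),  # Path objects are not JSON serializable by default
    "acc": 0.5,
}

# Without a fallback, json raises TypeError on the first unsupported value.
try:
    json.dumps(results, indent=2)
    failed_without_default = False
except TypeError:
    failed_without_default = True

# With default=str, any unsupported value is passed through str() before encoding.
serialized = json.dumps(results, indent=2, default=str)
print(serialized)
```

The trade-off is that `default=str` is lossy: values that went through `str()` come back as plain strings when the file is reloaded, so anything that needs to round-trip should be converted explicitly before dumping.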