Releases · IBM/unitxt
Unitxt 1.26.9
What's Changed
- lazy import of scipy by @assaftibm in #1959
- Fix duplicate-column sorting issue in Text2SQL evaluation utils by @oktie in #1954
- Update version to 1.26.9 by @elronbandel in #1961
Full Changelog: 1.26.8...1.26.9
Unitxt 1.26.8
What's Changed
- add ollama classification engine by @lilacheden in #1955
- make OllamaInferenceEngine handle return_meta_data by @lilacheden in #1956
- lazy import of evaluate by @assaftibm in #1957
- Update version to 1.26.8 by @elronbandel in #1958
Full Changelog: 1.26.7...1.26.8
Unitxt 1.26.7
What's Changed
- Fix examples by @elronbandel in #1907
- Fix inference tests by @elronbandel in #1912
- Minor text2sql metric fixes by @oktie in #1913
- fix mtrag by @dafnapension in #1918
- fix xlam's schema issues by @dafnapension in #1917
- Add ReflectionToolCallingMetricSyntactic for reference-less evaluation of tool call predictions by @korenLazar in #1923
- Divide biggen bench into multilingual and non-multilingual by @martinscooper in #1922
- remove redundant split from airbench2024 by @dafnapension in #1928
- Revert BigGen Benchmark partition by @martinscooper in #1924
- fixed split names in wiki_bio by @dafnapension in #1925
- Fix erroneous prompts in evaluation tasks (and clean some json-schema-wise) by @dafnapension in #1920
- fix the only 4 erroneous global_mmlu cards that do not pass _source_to_dataset by @dafnapension in #1916
- Normalize llm judge bench target variable by @martinscooper in #1933
- Improved multi turn evaluation to be self contained and use LLM as judge by @yoavkatz in #1929
- Add more RAG judges by @arielge in #1934
- Add ReflectionToolCallingMetric and update related metrics by @korenLazar in #1931
- potential fix for preparation file: prepare/cards/mtrag.py by @dafnapension in #1938
- Lazy load vectara hhem model because it is gated in HF by @yoavkatz in #1946
- Fixed missing sampling_seed in DiverseLabelsSampler by @yoavkatz in #1941
- Correct reflection-based tool calling metrics so valid results score 1 by @yoavkatz in #1940
- Rag metric update again by @dafnapension in #1948
- fix gpt-oss classification inference engines by @lilacheden in #1952
- Update version to 1.26.7 by @elronbandel in #1953
New Contributors
- @korenLazar made their first contribution in #1923
Full Changelog: 1.26.6...1.26.7
Unitxt 1.26.6
What's Changed
- Update pearsonr tests by @elronbandel in #1890
- return source_to_recipe to performance evaluation, once 403 is fixed by bnayahu by @dafnapension in #1891
- remove a card whose preprocess_steps do not match the contents of the loaded dataset by @dafnapension in #1893
- fix an ineffective setting of max size of loader_cache by @dafnapension in #1892
- Fix compatibility with datasets 4.0 by @elronbandel in #1861
- Improve speed in mmlu global by @elronbandel in #1895
- Remove the need for datasets<4.0.0 by @elronbandel in #1897
- Refresh README by @elronbandel in #1898
- Update Readme by @elronbandel in #1899
- Update README by @elronbandel in #1900
- Update README by @elronbandel in #1901
- Fix docs and example of how to use benchmark by @elronbandel in #1903
- Refine condition for avoiding the Benchmark wrapper by @bnayahu in #1904
- Complete transition to datasets 4.0.0 in preparation tests by @dafnapension in #1902
- Make sacrebleu faster and more efficient by @elronbandel in #1906
- Implements LogProbEngine on CrossInference and adds more granite guardian models by @martinscooper in #1905
- Remove IBM GenAI support and move legacy GenAI metrics to use CrossProviderInferenceEngine by @yoavkatz in #1508
- GPT on rits and minor llm judge criteria changes by @martinscooper in #1909
- The special installation of networkx can be removed as well by @dafnapension in #1908
- Update version to 1.26.6 by @elronbandel in #1911
Full Changelog: 1.26.5...1.26.6
Unitxt 1.26.5
What's Changed
- For load_dataset, use_cache default value is taken from settings by @eladven in #1880
- Support watsonx.ai on-prem credentials by @pratapkishorevarma in #1883
- extend condition to also filter by field exists or not by @dafnapension in #1879
- fix performance test by @dafnapension in #1884
- Add support for inline-defined templates in the UI by @Chemafiz in #1886
- Mitigate HTTP 403 errors in pandas by @bnayahu in #1888
- Biggen benchmark and pearson correlation metric by @martinscooper in #1887
- Update version to 1.26.5 by @elronbandel in #1889
New Contributors
- @pratapkishorevarma made their first contribution in #1883
- @Chemafiz made their first contribution in #1886
Full Changelog: 1.26.4...1.26.5
Unitxt 1.26.4
What's Changed
- Add more Judgebench benchmarks by @martinscooper in #1869
- Make sqlite3 not an optional dependency by @elronbandel in #1871
- Removed legacy topicality, idk, and groundness metrics that worked only on BAM by @yoavkatz in #1875
- Bench and models by @martinscooper in #1872
- Handle a case in ToolCallPostProcessor where prediction is an empty list of tools by @yoavkatz in #1874
- Update version to 1.26.4 by @elronbandel in #1876
Full Changelog: 1.26.3...1.26.4
Unitxt 1.26.3
What's Changed
- LLM Judge: Improve context/prediction fields parsing by @martinscooper in #1856
- Fixed bug in tool inference by @yoavkatz in #1868
- Added a new MetricBasedNer that allows calculating entity similarity using any Unitxt metric by @yoavkatz in #1860
- Update version to 1.26.3 by @elronbandel in #1870
Full Changelog: 1.26.2...1.26.3
Unitxt 1.26.2
What's Changed
- Add tot dataset by @elronbandel in #1865
- Add tokenizer_name to base huggingface inference engines by @elronbandel in #1862
- Add hf to cross provider inference engine by @yoavkatz in #1866
- Update version to 1.26.2 by @elronbandel in #1867
Full Changelog: 1.26.1...1.26.2
Unitxt 1.26.1
Lock datasets dependency to <4.0.0
The latest datasets v4.0.0 release removes support for loading datasets with trust_remote_code=True. This change breaks compatibility with many datasets currently in the Unitxt catalog, as several datasets require this feature to load properly.
This patch restricts the datasets version to below 4.0.0 until we can find or develop replacements for affected datasets.
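As a hedged illustration (this check is not part of Unitxt itself), one way to confirm that an environment respects the pin is to verify at runtime that the installed datasets package predates the trust_remote_code removal:

```python
# Illustrative check only (not Unitxt code): verify that the installed `datasets`
# package is below 4.0.0, since 4.0.0 dropped trust_remote_code=True, which several
# Unitxt catalog datasets still rely on.
import datasets
from packaging.version import Version

if Version(datasets.__version__) >= Version("4.0.0"):
    raise RuntimeError("datasets>=4.0.0 is unsupported here; install datasets<4.0.0")
```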
Unitxt 1.26.0 - Multi Threading
Main changes:
- Made Unitxt Thread-Safe so it can run in multi-threaded environments.
- Added an option to set a sampling seed for demos (in-context examples) via demos_sampling_seed, which allows running the same dataset with different demo examples (see the sketch after this list).
- Improved printouts of instance scores with to_markdown() and summary. For example:
  results = evaluate(predictions=predictions, data=dataset)
  print(results.instance_scores.summary)
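A minimal sketch of how the new demos_sampling_seed option might be passed through the recipe API; the card, template, and demo-pool values below are illustrative assumptions, not taken from these release notes:

```python
# Sketch only: the card/template names and sizes here are assumptions for illustration.
from unitxt import load_dataset

dataset = load_dataset(
    card="cards.sst2",  # assumed catalog card
    template="templates.classification.multi_class.default",  # assumed template
    num_demos=3,             # number of in-context demos per instance
    demos_pool_size=50,      # pool the demos are sampled from
    demos_sampling_seed=42,  # new in 1.26.0: fixes which demos are sampled
    split="test",
)
```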
All changes:
- Add to_markdown() to InstanceScores to pretty print output by @yoavkatz in #1846
- Improved InstanceScores summary to be readable and in decent width by @yoavkatz in #1847
- Improve multi turn tool calling example by @elronbandel in #1848
- Add metrics documentation including range, directionality and references by @elronbandel in #1850
- Fix sacrebleu documentation by @elronbandel in #1851
- Add F1 score documentation to F1Fast metric class by @elronbandel in #1852
- Add more llmjudge benchmarks by @martinscooper in #1804
- Fix llama scout name and url on rits by @martinscooper in #1857
- Add demos_sampling_seed to recipe api by @elronbandel in #1858
- Add comprehensive multi threading support and tests by @elronbandel in #1853
- Update BlueBench to match the original implementation by @bnayahu in #1855
Full Changelog: 1.25.0...1.26.0