ForzaEmbed is a Python framework for benchmarking text embedding models and text processing strategies. It performs a grid search over a variety of parameters, including chunking strategies, embedding models, and similarity metrics, to find a suitable configuration for a given dataset.
- Grid Search: Systematically tests combinations of chunking strategies, chunk sizes, overlaps, embedding models, and similarity metrics.
- Model Support: Supports various embedding models through a configuration file, including API-based services and local models from Hugging Face and FastEmbed.
- Chunking Strategies: Includes multiple chunking methods: `langchain`, `raw`, `semchunk`, `nltk`, and `spacy`.
- Similarity Metrics: Evaluates performance using the `cosine`, `euclidean`, `manhattan`, `dot_product`, and `chebyshev` metrics (illustrated in the snippet after this list).
- Caching: Caches generated embeddings to accelerate subsequent runs.
- Resumable Workflows: Can resume interrupted grid searches.
- Reporting: Generates reports, including text heatmaps and CSV files, to visualize and compare the performance of different parameter combinations.
- Command-Line Interface: Provides a CLI to run the pipeline, manage the database, and generate reports.
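As a quick illustration of what these metrics compute (this is an independent NumPy sketch, not ForzaEmbed's own code), the snippet below evaluates each of the five listed measures between two toy embedding vectors:

```python
import numpy as np

# Two toy vectors standing in for an embedded chunk and an embedded query.
a = np.array([0.1, 0.3, 0.5])
b = np.array([0.2, 0.1, 0.4])

metrics = {
    "cosine": float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))),
    "euclidean": float(np.linalg.norm(a - b)),
    "manhattan": float(np.abs(a - b).sum()),
    "dot_product": float(a @ b),
    "chebyshev": float(np.abs(a - b).max()),
}
print(metrics)
```

Note that `cosine` and `dot_product` are similarities (higher means closer), while `euclidean`, `manhattan`, and `chebyshev` are distances (lower means closer); the configuration groups all five under similarity metrics.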
ForzaEmbed follows a systematic process to evaluate embedding configurations:
- Data Loading: Loads text data from a directory of markdown files.
- Grid Search: Iterates through a predefined grid of parameters from the `config.yml` file (see the sketch after this list).
- Processing: For each combination of parameters, the tool processes the text, generates embeddings, and calculates similarity scores.
- Database Storage: All results are stored in a SQLite database.
- Report Generation: After the grid search is complete, ForzaEmbed generates reports and visualizations.
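Conceptually, the grid-search step expands the configured parameter lists into every possible combination. The following minimal Python sketch shows that expansion with `itertools.product`; the parameter names and values are illustrative assumptions, not ForzaEmbed's internal API:

```python
from itertools import product

# Hypothetical parameter grid, mirroring the kind of options listed in config.yml.
grid = {
    "chunking_strategy": ["langchain", "semchunk"],
    "chunk_size": [256, 512],
    "similarity_metric": ["cosine", "euclidean"],
}

# Each element of the Cartesian product is one configuration to evaluate.
keys = list(grid)
for values in product(*(grid[k] for k in keys)):
    combo = dict(zip(keys, values))
    print(combo)
```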
- Clone the repository:

  ```bash
  git clone https://github.com/berangerthomas/ForzaEmbed.git
  cd ForzaEmbed
  ```

- Install dependencies: This project uses `uv` for package management.

  ```bash
  pip install uv
  uv sync
  ```
ForzaEmbed is controlled via the command line.
To run the full grid search and reporting pipeline, use the `--run` flag:

```bash
python main.py --run
```

This command will resume the grid search if it was previously interrupted. To start from scratch, use the `--no-resume` flag.

To regenerate reports from a completed grid search, use the `--generate-reports` flag:

```bash
python main.py --generate-reports
```

You can limit the comparison charts to the top N models using the `--top-n` argument:

```bash
python main.py --generate-reports --top-n 10
```

| Argument | Description |
|---|---|
| `--db-path` | Path to the SQLite database file. |
| `--config-path` | Path to the YAML configuration file. |
| `--data-source` | Path to the directory containing markdown files. |
| `--run` | Run the full grid search and reporting pipeline. |
| `--generate-reports` | Only generate reports from existing data. |
| `--no-resume` | Start the grid search from scratch. |
| `--clear-db` | Clear the main database before running. |
| `--clear-cache` | Clear the embedding cache before running. |
| `--top-n` | Limit comparison charts to the top N models. |
| `--refresh-metrics` | Refresh evaluation metrics for all existing runs. |
The behavior of ForzaEmbed is controlled by the `config.yml` file, which is divided into several sections:

- `grid_search_params`: Defines the parameters for the grid search, such as `chunk_size`, `chunk_overlap`, `chunking_strategy`, and `similarity_metrics`.
- `models_to_test`: A list of embedding models to be evaluated. You can specify the `type` (e.g., `api`, `fastembed`, `huggingface`), `name`, and other model-specific parameters.
- `general_settings`: General configuration options, such as the `similarity_threshold` and `output_dir`.
- `multiprocessing`: Settings to configure multiprocessing.
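To make the structure concrete, here is a small Python sketch that builds a hypothetical configuration with these four sections and writes it out as YAML. The specific values, keys inside each section, and model entries are illustrative assumptions, not defaults shipped with ForzaEmbed; consult the repository's `config.yml` for the exact schema.

```python
import yaml  # provided by the PyYAML package

# Hypothetical config mirroring the four sections described above.
# The keys and values below are examples, not an exhaustive schema.
config = {
    "grid_search_params": {
        "chunking_strategy": ["langchain", "semchunk"],
        "chunk_size": [256, 512],
        "chunk_overlap": [0, 50],
        "similarity_metrics": ["cosine", "euclidean"],
    },
    "models_to_test": [
        {"type": "fastembed", "name": "BAAI/bge-small-en-v1.5"},
        {"type": "huggingface", "name": "sentence-transformers/all-MiniLM-L6-v2"},
    ],
    "general_settings": {"similarity_threshold": 0.5, "output_dir": "results"},
    "multiprocessing": {"enabled": True, "workers": 4},
}

with open("config.example.yml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```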
Contributions are welcome. If you have suggestions for improvements or find any issues, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
