ForzaEmbed is a Python framework for benchmarking text embedding models and text processing strategies. It performs a grid search over a variety of parameters, including chunking strategies, embedding models, and similarity metrics, to find a suitable configuration for a given dataset.
- Grid Search: Systematically tests combinations of chunking strategies, chunk sizes, overlaps, embedding models, and similarity metrics.
- Model Support: Supports various embedding models through a configuration file, including API-based services and local models from Hugging Face and FastEmbed.
- Chunking Strategies: Includes multiple chunking methods: `langchain`, `raw`, `semchunk`, `nltk`, and `spacy`.
- Similarity Metrics: Evaluates performance using the `cosine`, `euclidean`, `manhattan`, `dot_product`, and `chebyshev` metrics (illustrated in the snippet after this list).
- Caching: Caches generated embeddings to accelerate subsequent runs.
- Resumable Workflows: Can resume interrupted grid searches.
- Reporting: Generates reports, including text heatmaps and CSV files, to visualize and compare the performance of different parameter combinations.
- Command-Line Interface: Provides a CLI to run the pipeline, manage the database, and generate reports.
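As a quick illustration of what these metrics compute (this is an independent NumPy sketch, not ForzaEmbed's own code), the snippet below evaluates each of the five listed measures between two toy embedding vectors:

```python
import numpy as np

# Two toy vectors standing in for an embedded chunk and an embedded query.
a = np.array([0.1, 0.3, 0.5])
b = np.array([0.2, 0.1, 0.4])

metrics = {
    "cosine": float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))),
    "euclidean": float(np.linalg.norm(a - b)),
    "manhattan": float(np.abs(a - b).sum()),
    "dot_product": float(a @ b),
    "chebyshev": float(np.abs(a - b).max()),
}
print(metrics)
```

Note that `cosine` and `dot_product` are similarities (higher means closer), while `euclidean`, `manhattan`, and `chebyshev` are distances (lower means closer); the configuration groups all five under similarity metrics.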
ForzaEmbed follows a systematic process to evaluate embedding configurations:
- Data Loading: Loads text data from a directory of markdown files.
- Grid Search: Iterates through a predefined grid of parameters from the `config.yml` file (see the sketch after this list).
- Processing: For each combination of parameters, the tool processes the text, generates embeddings, and calculates similarity scores.
- Database Storage: All results are stored in a SQLite database.
- Report Generation: After the grid search is complete, ForzaEmbed generates reports and visualizations.
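Conceptually, the grid-search step expands the configured parameter lists into every possible combination. The following minimal Python sketch shows that expansion with `itertools.product`; the parameter names and values are illustrative assumptions, not ForzaEmbed's internal API:

```python
from itertools import product

# Hypothetical parameter grid, mirroring the kind of options listed in config.yml.
grid = {
    "chunking_strategy": ["langchain", "semchunk"],
    "chunk_size": [256, 512],
    "similarity_metric": ["cosine", "euclidean"],
}

# Each element of the Cartesian product is one configuration to evaluate.
keys = list(grid)
for values in product(*(grid[k] for k in keys)):
    combo = dict(zip(keys, values))
    print(combo)
```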
- Clone the repository:

  ```bash
  git clone https://github.com/berangerthomas/ForzaEmbed.git
  cd ForzaEmbed
  ```

- Install dependencies: This project uses `uv` for package management.

  ```bash
  pip install uv
  uv sync
  ```
ForzaEmbed is controlled via the command line.
To run the full grid search and reporting pipeline, use the `--run` flag:

```bash
python main.py --run
```

This command will resume the grid search if it was previously interrupted. To start from scratch, use the `--no-resume` flag.

To regenerate reports from a completed grid search, use the `--generate-reports` flag:

```bash
python main.py --generate-reports
```

You can limit the comparison charts to the top N models using the `--top-n` argument:

```bash
python main.py --generate-reports --top-n 10
```

| Argument | Description |
|---|---|
| `--db-path` | Path to the SQLite database file. |
| `--config-path` | Path to the YAML configuration file. |
| `--data-source` | Path to the directory containing markdown files. |
| `--run` | Run the full grid search and reporting pipeline. |
| `--generate-reports` | Only generate reports from existing data. |
| `--no-resume` | Start the grid search from scratch. |
| `--clear-db` | Clear the main database before running. |
| `--clear-cache` | Clear the embedding cache before running. |
| `--top-n` | Limit comparison charts to the top N models. |
| `--refresh-metrics` | Refresh evaluation metrics for all existing runs. |
The behavior of ForzaEmbed is controlled by the `config.yml` file, which is divided into several sections:

- `grid_search_params`: Defines the parameters for the grid search, such as `chunk_size`, `chunk_overlap`, `chunking_strategy`, and `similarity_metrics`.
- `models_to_test`: A list of embedding models to be evaluated. You can specify the `type` (e.g., `api`, `fastembed`, `huggingface`), `name`, and other model-specific parameters.
- `general_settings`: General configuration options, such as the `similarity_threshold` and `output_dir`.
- `multiprocessing`: Settings to configure multiprocessing.
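To make the structure concrete, here is a small Python sketch that builds a hypothetical configuration with these four sections and writes it out as YAML. The specific values, keys inside each section, and model entries are illustrative assumptions, not defaults shipped with ForzaEmbed; consult the repository's `config.yml` for the exact schema.

```python
import yaml  # provided by the PyYAML package

# Hypothetical config mirroring the four sections described above.
# The keys and values below are examples, not an exhaustive schema.
config = {
    "grid_search_params": {
        "chunking_strategy": ["langchain", "semchunk"],
        "chunk_size": [256, 512],
        "chunk_overlap": [0, 50],
        "similarity_metrics": ["cosine", "euclidean"],
    },
    "models_to_test": [
        {"type": "fastembed", "name": "BAAI/bge-small-en-v1.5"},
        {"type": "huggingface", "name": "sentence-transformers/all-MiniLM-L6-v2"},
    ],
    "general_settings": {"similarity_threshold": 0.5, "output_dir": "results"},
    "multiprocessing": {"enabled": True, "workers": 4},
}

with open("config.example.yml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```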
Contributions are welcome. If you have suggestions for improvements or find any issues, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
