🐝 BeeConsensus Framework

BeeConsensus is a swarm-intelligence framework designed to mitigate LLM hallucinations through multi-persona deliberation, semantic clustering, and consensus-based truth filtering.

Inspired by the collective decision-making of honeybees, this system orchestrates multiple "Scout" agents to explore different perspectives on a query before reaching a unified, factual consensus.

🛠️ How it Works

The framework follows the biological process of a bee colony selecting a new hive location:

graph TD
    UserQuery[User Query] --> Scouts[Scout Agents Generate Answers]
    Scouts --> DanceFloor[The Dance Floor: Semantic Clustering]
    DanceFloor --> Clustering[DBSCAN Clustering]
    Clustering --> QuorumCheck{Quorum Reached?}
    QuorumCheck -- YES --> FinalAnswer[Final Answer + Waggle Dance Signal]
    QuorumCheck -- NO --> ReDeliberate[Re-Deliberation Round]
    ReDeliberate --> Scouts
    FinalAnswer --> Output([Truthful Response])

Scouting: Diverse agents (Factual, Cautious, Devil's Advocate) generate independent candidate answers.
The Dance Floor: Answers are converted to semantic embeddings and grouped using DBSCAN.
Waggle Dance Signal: A confidence score is calibrated based on how strongly the majority cluster agrees.
Quorum Gate: If the dominant cluster doesn't meet the threshold, agents review each other's reasoning and try again.
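
The four stages above can be sketched end to end on toy data. This is a minimal illustration, not the code in beeconsensus.py: it assumes hand-written 2-D vectors in place of real sentence embeddings, a simplified pure-Python DBSCAN with cosine distance, and the dominant cluster's share of agents as a stand-in for the Waggle Dance confidence signal. The function names and the `eps` and `min_samples` values are hypothetical, chosen only for this example.

```python
import math
from collections import Counter

NOISE = -1  # DBSCAN convention: points that belong to no cluster


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def dbscan(points, eps=0.1, min_samples=2):
    """Simplified DBSCAN; returns one cluster label per point (-1 = noise)."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points))
                if cosine_distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:        # border point reached from a core point
                labels[j] = cluster
            if labels[j] is not None:     # already assigned, nothing to expand
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_samples:  # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels


def quorum_gate(labels, quorum=0.60):
    """Return (reached, dominant_label, share) for the largest real cluster."""
    counts = Counter(label for label in labels if label != NOISE)
    if not counts:
        return False, None, 0.0
    label, size = counts.most_common(1)[0]
    share = size / len(labels)  # assumption: share doubles as the confidence
    return share >= quorum, label, share


# Toy "embeddings" for five scout answers: four point roughly the same way,
# and the last one disagrees.
embeddings = [(1.0, 0.0), (0.98, 0.05), (0.99, -0.03), (0.97, 0.08), (0.0, 1.0)]
labels = dbscan(embeddings)
reached, dominant, share = quorum_gate(labels)
print(labels, reached, dominant, share)
```

With these toy vectors, four of the five answers land in one cluster, so the default quorum of 0.60 is met. If the gate returned `False`, the framework would trigger another deliberation round instead of answering.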

🚀 Execution Guide

1. Local Setup

# Clone the repository
git clone https://github.com/khaled-kk/BeeConsensus.git
cd BeeConsensus

# Install dependencies
pip install -r requirements.txt

# Run a quick demo
python evaluate.py --mode demo

2. Kaggle/Colab Deployment

To run the full TruthfulQA benchmark (817 questions), a cloud GPU is strongly recommended.

Environment Setup:

# 1. Install required libraries
!pip install -q transformers accelerate bitsandbytes sentence-transformers datasets

# 2. Log in to Hugging Face (required for Llama-3.1 access)
from huggingface_hub import login
login("YOUR_HF_TOKEN")

# 3. Copy your uploaded script files into the working directory
!find /kaggle/input -name "*.py" -exec cp {} . \;

Run Benchmark:

python evaluate.py --mode truthfulqa --model_id "meta-llama/Llama-3.1-8B-Instruct" --limit 817

⚙️ Advanced Configuration

| Flag | Default | Description |
| --- | --- | --- |
| `--mode` | `demo` | Evaluation mode (`demo` or `truthfulqa`) |
| `--model_id` | `meta-llama/Llama-3.1-8B-Instruct` | Hugging Face model ID |
| `--quorum` | `0.60` | Fraction of agents required for consensus (0.1 to 1.0) |
| `--openvino` | `False` | Enable OpenVINO acceleration for Intel hardware |
| `--limit` | `5` | Maximum number of questions to evaluate |
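
For readers who want to see how these flags fit together, here is a minimal `argparse` sketch that mirrors the table. It is not the parser in evaluate.py: the argument types, the use of `store_true` for `--openvino`, and the range check on `--quorum` are assumptions inferred from the defaults and descriptions listed above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags table (types are assumptions)."""
    parser = argparse.ArgumentParser(description="BeeConsensus evaluation (sketch)")
    parser.add_argument("--mode", choices=["demo", "truthfulqa"], default="demo",
                        help="Evaluation mode")
    parser.add_argument("--model_id", default="meta-llama/Llama-3.1-8B-Instruct",
                        help="Hugging Face model ID")
    parser.add_argument("--quorum", type=float, default=0.60,
                        help="Fraction of agents required for consensus")
    parser.add_argument("--openvino", action="store_true",
                        help="Enable OpenVINO acceleration for Intel hardware")
    parser.add_argument("--limit", type=int, default=5,
                        help="Maximum number of questions to evaluate")
    return parser


def parse_args(argv=None):
    """Parse arguments and enforce the documented 0.1 to 1.0 quorum range."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not 0.1 <= args.quorum <= 1.0:
        parser.error("--quorum must be between 0.1 and 1.0")
    return args


# Equivalent to: python evaluate.py --mode truthfulqa --limit 817
args = parse_args(["--mode", "truthfulqa", "--limit", "817"])
print(args.mode, args.model_id, args.quorum, args.openvino, args.limit)
```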

📊 Benchmark Results (TruthfulQA)

Full evaluation on 817 questions using meta-llama/Llama-3.1-8B-Instruct.

| Method | Accuracy | Avg Latency | Avg Confidence |
| --- | --- | --- | --- |
| Single LLM (Baseline) | 44.2% | 16.0s | N/A |
| Self-Consistency (N=5) | 43.2% | 79.8s | N/A |
| BeeConsensus (Swarm) | 48.2% | 50.8s | 0.662 |

Key Findings:

Hallucination Reduction: BeeConsensus scored 48.2% versus 44.2% for the single-model baseline, a gain of 4.0 percentage points.
Superior to Self-Consistency: The swarm logic outperformed simple majority voting (43.2%) by 5.0 percentage points, suggesting that persona diversity adds value beyond repeated sampling.
Improved Efficiency: BeeConsensus averaged 50.8s per question versus 79.8s for standard Self-Consistency, about 36% faster, which is attributed to the quorum gates. It remains roughly 3.2x slower than a single LLM call (16.0s).
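
The headline figures follow directly from the results table. The short calculation below reproduces them from the published numbers, so the accuracy deltas and the latency comparison can be checked by hand; the variable names are illustrative only.

```python
# Figures copied from the benchmark results table.
accuracy = {"single": 44.2, "self_consistency": 43.2, "beeconsensus": 48.2}
latency_s = {"single": 16.0, "self_consistency": 79.8, "beeconsensus": 50.8}

# Accuracy differences are in percentage points (absolute, not relative).
gain_vs_single = accuracy["beeconsensus"] - accuracy["single"]
gain_vs_sc = accuracy["beeconsensus"] - accuracy["self_consistency"]

# Relative latency reduction against Self-Consistency, and the cost
# multiplier against a single LLM call.
faster_than_sc = 1 - latency_s["beeconsensus"] / latency_s["self_consistency"]
slower_than_single = latency_s["beeconsensus"] / latency_s["single"]

print(f"Gain over single LLM:       {gain_vs_single:.1f} points")
print(f"Gain over Self-Consistency: {gain_vs_sc:.1f} points")
print(f"Faster than Self-Consistency by {faster_than_sc:.1%}")
print(f"Slower than single LLM by a factor of {slower_than_single:.1f}")
```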

🔄 Resuming Progress

BeeConsensus is built for reliability. If your cloud session times out:

The script automatically saves results to benchmark_progress.csv after every question.
Upon restart, it detects the existing CSV, reconstructs accuracy stats, and skips already answered questions.
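
The save-and-resume behavior described above can be sketched with the standard `csv` module. This is an illustration under stated assumptions rather than the logic in evaluate.py: the column names `question_id` and `correct` and the helper names are hypothetical, and the real benchmark_progress.csv may store more fields.

```python
import csv
import os
import tempfile


def load_progress(path):
    """Return (answered_ids, n_correct) from an existing progress CSV, if any."""
    answered, n_correct = set(), 0
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                answered.add(row["question_id"])
                n_correct += int(row["correct"])
    return answered, n_correct


def append_result(path, question_id, correct):
    """Append one row, writing the header only when the file is new."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["question_id", "correct"])
        if is_new:
            writer.writeheader()
        writer.writerow({"question_id": question_id, "correct": int(correct)})


def run(path, questions, answer_fn):
    """Evaluate questions, skipping any already recorded in the CSV."""
    answered, n_correct = load_progress(path)
    for qid, question in questions:
        if qid in answered:
            continue                       # answered in an earlier session
        correct = answer_fn(question)
        append_result(path, qid, correct)  # saved after every question
        answered.add(qid)
        n_correct += int(correct)
    return n_correct / len(answered) if answered else 0.0


# Simulate a session that stops after two questions, then a restart.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "benchmark_progress.csv")
    questions = [("q1", "a"), ("q2", "b"), ("q3", "c")]
    run(path, questions[:2], lambda q: True)
    calls = []  # records which questions the second session actually evaluates
    accuracy = run(path, questions, lambda q: calls.append(q) or False)
    print(calls, accuracy)
```

Writing one row per question keeps the amount of lost work after a timeout to at most one question, at the cost of reopening the file on every iteration.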

📊 Insights & Visualization

BeeConsensus includes a dedicated visualization suite to analyze swarm performance and confidence calibration.

1. Generate Plots

If you have pandas and matplotlib installed, you can generate performance charts from your latest benchmark results:

python visualize_results.py

2. Analysis Overview

The script writes three plots to the plots/ directory:

Running Accuracy: Tracks how the swarm stabilizes performance compared to baselines over hundreds of queries.
Latency Distribution: A comparative boxplot showing the inference efficiency of BeeConsensus vs. Self-Consistency.
Confidence Calibration: Analyzes the "Waggle Dance" signal to correlate agent confidence with factual correctness.
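
To make the calibration idea concrete, the sketch below bins answers by confidence and reports accuracy per bin; a well-calibrated signal shows accuracy rising with confidence. It is not taken from visualize_results.py: the function name, the bin count, and the records are made up for illustration, and the real script may compute calibration differently.

```python
def calibration_table(records, n_bins=5):
    """Group (confidence, correct) pairs into equal-width confidence bins.

    Returns a list of (bin_low, bin_high, count, accuracy) for non-empty bins.
    """
    bins = [[] for _ in range(n_bins)]
    for confidence, correct in records:
        # Clamp so that confidence == 1.0 falls into the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append(bool(correct))
    table = []
    for i, outcomes in enumerate(bins):
        if outcomes:
            table.append((i / n_bins, (i + 1) / n_bins,
                          len(outcomes), sum(outcomes) / len(outcomes)))
    return table


# Made-up records for illustration only: (confidence, was the answer correct?)
records = [(0.95, True), (0.90, True), (0.85, False),
           (0.55, True), (0.50, False), (0.45, False),
           (0.15, False)]
table = calibration_table(records)
for low, high, count, acc in table:
    print(f"{low:.1f}-{high:.1f}: n={count}, accuracy={acc:.2f}")
```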

🗺️ Roadmap

Hybrid Evaluation: Integration of BLEU/ROUGE scoring alongside semantic similarity.
Dynamic Personas: Automatically adjusting agent system prompts based on query complexity.
Multi-Modal Swarms: Expanding the consensus logic to image-to-text models.

📄 License

Distributed under the MIT License. See LICENSE for more information.

Developed with 🐝 by Khaled Walid
