This repository contains the material related to the TER (Travail d'Études et de Recherche) carried out during the academic year 2025–2026.
- Degree: Master 1 – Data Science
- Supervisor: Nadjib Lazaar
- Institution: Université Paris-Saclay
- Project Overview
- Project Status
- Architecture
- Quick Start with Docker
- Local Development Setup
- CLI Usage
- Configuration
- Repository Content
The TER focuses on the generation of synthetic datasets for pattern mining using generative AI, in particular Large Language Models (LLMs).
The main deliverable is a Hybrid Synthetic Transactional Data Generator combining:
- LLM semantic framing — generates realistic item names, co-occurrence pairs, and ground-truth patterns
- Statistical distribution control — power-law item frequencies, right-skewed transaction lengths
The generator supports both Frequent Itemset Mining (FIM) and High-Utility Itemset Mining (HUIM) modes, and ships with a REST API backend and a React web interface.
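The statistical side of the generator can be illustrated with a few lines of standard-library Python. This is a minimal sketch and not the code of statistical_model.py: it assumes item weights proportional to rank^(-α) and log-normal transaction lengths, neither of which is confirmed by this README, and the function names are hypothetical.

```python
import math
import random
import statistics


def power_law_weights(num_items: int, alpha: float) -> list[float]:
    """Normalised item weights proportional to rank ** (-alpha)."""
    raw = [rank ** -alpha for rank in range(1, num_items + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def top_share(weights: list[float], fraction: float = 0.2) -> float:
    """Share of total weight held by the heaviest `fraction` of items."""
    k = max(1, round(len(weights) * fraction))
    return sum(sorted(weights, reverse=True)[:k])


def skewed_lengths(count, avg_length, num_items, sigma=0.5, rng=None):
    """Right-skewed lengths drawn from a log-normal (assumed distribution)."""
    rng = rng or random.Random()
    # Choosing mu this way keeps the mean at avg_length before rounding/clipping.
    mu = math.log(avg_length) - sigma ** 2 / 2
    lengths = []
    for _ in range(count):
        value = round(rng.lognormvariate(mu, sigma))
        lengths.append(max(1, min(num_items, value)))  # clip to [1, num_items]
    return lengths


rng = random.Random(0)
weights = power_law_weights(num_items=100, alpha=1.5)
rng.shuffle(weights)  # decouple item id from popularity rank
lengths = skewed_lengths(5000, avg_length=8, num_items=100, rng=rng)

print(f"top 20% share of weight: {top_share(weights):.2f}")
print(f"mean length {statistics.mean(lengths):.2f}, "
      f"median {statistics.median(lengths)}")
```

A mean above the median is the usual signature of a right-skewed length distribution, and the top-20% share is one simple way to express the 80/20 check.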
- Hybrid generation pipeline (FIM + HUIM)
- LLM-based item name and co-occurrence generation
- LLM-based ground-truth pattern generation (automatic, with manual override priority)
- Statistical model with power-law item frequencies and right-skewed transaction lengths
- Pattern embedding with transaction pool exhaustion check
- Four-step validation pipeline:
  - Step 1: Structural validation (|D|, |I|, l, ρ)
  - Step 2: Statistical validation (power-law α, skewness, 80/20 rule)
  - Step 3: Ground-truth validation (support accuracy + Recall ≥ 0.80 via FP-Growth)
  - Step 4: OFAT experimental study
- HUIM utility structure (profit table, quantity assignment, decoupling)
- SPMF integration (FP-Growth, HUI-Miner)
- FastAPI REST backend with job lifecycle management and SSE streaming
- React web interface with real-time monitoring
- Docker packaging (backend + frontend containers)
- HUIM validation edge cases still under investigation
- OFAT curves for HUIM dimensions not yet fully generated
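Step 4 relies on a one-factor-at-a-time (OFAT) design: a single parameter is swept while all others stay at their baseline values. The sketch below illustrates that idea only. It is not the logic of experiments/run_ofat.py; the helper name ofat_configs and the baseline values are hypothetical, while the parameter names come from the configuration table.

```python
import copy
from typing import Any, Iterator


def ofat_configs(
    baseline: dict[str, Any],
    sweeps: dict[str, list[Any]],
) -> Iterator[tuple[str, Any, dict[str, Any]]]:
    """Yield (parameter, value, config) with exactly one parameter changed."""
    for name, values in sweeps.items():
        if name not in baseline:
            raise KeyError(f"unknown parameter: {name}")
        for value in values:
            config = copy.deepcopy(baseline)  # never mutate the baseline
            config[name] = value
            yield name, value, config


# Hypothetical baseline; real defaults live in the project's config files.
baseline = {"num_transactions": 1000, "num_items": 100, "noise_ratio": 0.2}
sweeps = {"num_transactions": [500, 2000], "noise_ratio": [0.0, 0.4]}

runs = list(ofat_configs(baseline, sweeps))
for name, value, config in runs:
    print(f"{name}={value}: {config}")
```

Each generated config differs from the baseline in one key, so any change in a metric can be attributed to that key alone.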
ter-mining_ter/
├── README.md
├── requirements.txt
├── TER_Report.pdf
├── literature/ # Academic references
├── guidance/ # Guidance materials
├── tests/ # Test suite
└── Hybrid-model/
    ├── docker-compose.yml # Orchestrates backend + frontend
    ├── Dockerfile.backend # Python 3.11 + Java runtime
    ├── Dockerfile.frontend # Node 22 build + nginx runtime
    ├── DOCKER.md # Docker packaging notes
    │
    ├── generator/
    │   ├── generate.py # CLI entry point
    │   ├── validate.py # Validation entry point
    │   ├── config.py # Config loading and dataclasses
    │   ├── llm_client.py # OpenAI / DeepSeek / Gemini client
    │   ├── semantic_engine.py # LLM: item names, co-occurrence, patterns
    │   ├── statistical_model.py # Power-law frequencies + skewed lengths
    │   ├── pattern_embedder.py # Ground-truth pattern embedding
    │   ├── transaction_assembler.py # Core assembly engine
    │   ├── huim_generator.py # HUIM utility structure
    │   ├── validator.py # Four-step validation logic
    │   └── utils/
    │       ├── distributions.py # Power-law, skewness, 80/20 utilities
    │       ├── spmf_formatter.py # SPMF format read/write
    │       ├── spmf_runner.py # SPMF algorithm runner + recall
    │       └── plotting.py # Validation plots
    │
    ├── backend/
    │   ├── main.py # FastAPI app
    │   ├── schemas.py # Pydantic request/response models
    │   ├── routers/ # generate, jobs, files, ofat
    │   └── services/ # job_manager, ofat_runner
    │
    ├── frontend/
    │   ├── src/
    │   │   ├── api/client.ts # HTTP + SSE wrapper
    │   │   ├── components/ # React components
    │   │   └── types/api.ts # TypeScript interfaces
    │   ├── nginx.conf # Production reverse proxy config
    │   └── vite.config.ts # Vite + Tailwind + API proxy
    │
    ├── configs/
    │   ├── api_keys.yaml # LLM API keys (fill before use)
    │   ├── api_keys.yaml.template # Template for api_keys.yaml
    │   ├── config_fim_example.json # FIM example config
    │   └── config_huim_example.json # HUIM example config
    │
    ├── experiments/
    │   ├── run_ofat.py # OFAT experiment script
    │   ├── run_fim.sh # Automated FIM OFAT experiments
    │   ├── run_huim.sh # Automated HUIM OFAT experiments
    │   ├── collect_metrics.py # Metric collection
    │   └── results/
    │       ├── figures/ # Trend curves (PNG)
    │       └── tables/ # CSV results
    │
    └── lib/
        └── spmf.jar # SPMF library (FP-Growth, HUI-Miner)
config.json
↓
[1] SemanticEngine → LLM generates item names + co-occurrence pairs
↓
[1.5] generate_patterns()
      → If groundtruth=[] and generate_patterns=true:
        LLM generates patterns from item names
        (skipped if groundtruth is manually set)
↓
[2] StatisticalModel → Power-law frequencies (shuffled) + skewed lengths
↓
[3] PatternEmbedder → Assigns transactions to patterns (pool check)
↓
[4] TransactionAssembler → Generates transactions by sampling
↓
[5] embed_patterns() → Forces pattern items into assigned transactions
↓
[6] HUIMGenerator → (HUIM only) Adds profits + quantities + decoupling
↓
.dat + profits.txt + groundtruth.json + metadata.json
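Steps [3] and [5] can be pictured as reserving a block of transactions for each pattern and then forcing the pattern items into them. The sketch below is a simplified illustration, not the code of pattern_embedder.py. It assumes that each transaction hosts at most one pattern, that noise_ratio sets aside transactions that receive no pattern, and that the pool check fails when the target supports need more transactions than remain. The function names mirror the diagram, but their signatures are hypothetical.

```python
import random


def assign_transactions(patterns, num_transactions, noise_ratio, rng=None):
    """Map each pattern index to a disjoint, sorted list of transaction ids."""
    rng = rng or random.Random()
    pool_size = round(num_transactions * (1 - noise_ratio))
    needed = [round(p["target_support"] * num_transactions) for p in patterns]
    if sum(needed) > pool_size:
        # Pool exhaustion: the target supports cannot all be met.
        raise ValueError(
            f"patterns need {sum(needed)} transactions, pool has {pool_size}"
        )
    pool = rng.sample(range(num_transactions), pool_size)
    assignment, start = {}, 0
    for index, count in enumerate(needed):
        assignment[index] = sorted(pool[start:start + count])
        start += count
    return assignment


def embed_patterns(transactions, patterns, assignment):
    """Force every item of a pattern into each transaction assigned to it."""
    for index, tids in assignment.items():
        items = set(patterns[index]["itemset"])
        for tid in tids:
            transactions[tid] = sorted(set(transactions[tid]) | items)
    return transactions


patterns = [
    {"itemset": [3, 17, 42], "target_support": 0.25},
    {"itemset": [5, 99], "target_support": 0.10},
]
assignment = assign_transactions(patterns, 100, noise_ratio=0.6,
                                 rng=random.Random(1))
base = [[1] for _ in range(100)]  # placeholder transactions
dataset = embed_patterns(base, patterns, assignment)
print({index: len(tids) for index, tids in assignment.items()})
```

With 100 transactions and noise_ratio 0.6 the pool holds 40 transactions, enough for the 25 + 10 required here; raising noise_ratio to 0.7 would trigger the ValueError.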
Docker is the recommended way to run the full stack (backend + frontend) without any local Python or Node.js setup.
- Docker ≥ 24
- Docker Compose v2 (bundled with Docker Desktop)
All commands should be run from the Hybrid-model/ directory:
cd Hybrid-model/
docker compose build
docker compose up
Once running:
| Service | URL |
|---|---|
| Web UI (frontend) | http://localhost:8080 |
| Backend API | http://localhost:8000 |
| Swagger docs | http://localhost:8000/docs |
| Health check | http://localhost:8000/api/health |
Generated datasets are written to ./datasets/jobs/ on the host via a volume mount. The frontend waits for the backend to pass its health check before starting.
To run in detached mode (background):
docker compose up -d
The LLM API key is stored in Hybrid-model/configs/api_keys.yaml. Paste it into the generation form in the web UI to enable LLM calls; the URL and model name fields can be left empty.
configs/api_keys.yaml is excluded from Docker builds to keep secrets out of images.
Option 1 (mock mode, no LLM): enable "Skip LLM calls (mock)" in the web UI. No API key needed.
Option 2 (mount your key file at runtime): create Hybrid-model/docker-compose.override.yml:
services:
  backend:
    volumes:
      - ./datasets:/app/backend/datasets
      - ./configs/api_keys.yaml:/app/configs/api_keys.yaml:ro
Then run as usual:
docker compose up
Option 3 (enter the API key in the UI): the generation form accepts a custom API key per request.
To stop and remove the containers:
docker compose down
To also remove built images:
docker compose down --rmi local
Requirements: Python ≥ 3.11, Java JRE (for SPMF).
cd Hybrid-model/
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # macOS/Linux
# .venv\Scripts\activate # Windows
# Install dependencies
pip install -r ../requirements.txt
# Fill in your LLM API key
cp configs/api_keys.yaml.template configs/api_keys.yaml
# edit configs/api_keys.yaml
# Start the backend
uvicorn backend.main:app --reload --port 8000
The API is available at http://localhost:8000, with Swagger UI at http://localhost:8000/docs.
Requirements: Node.js ≥ 18. The backend must be running on port 8000.
cd Hybrid-model/frontend
npm install
npm run dev
The app is available at http://localhost:5173.
All commands are run from Hybrid-model/.
LLM generates ground-truth patterns automatically (groundtruth is empty in config):
python generator/generate.py --config configs/config_fim_example.json
With a custom output path:
python generator/generate.py --config configs/config_fim_example.json --output datasets/fim/my_dataset
Without LLM (mock mode, for testing):
python generator/generate.py --config configs/config_fim_example.json --mock
HUIM mode:
python generator/generate.py --config configs/config_huim_example.json
# With SPMF (full validation including Recall)
python generator/validate.py \
--dataset datasets/fim/test/my_dataset.dat \
--config configs/config_fim_example.json \
--spmf lib/spmf.jar
# Without SPMF (Step 1 + Step 2 + support accuracy only)
python generator/validate.py \
--dataset datasets/fim/test/my_dataset.dat \
--config configs/config_fim_example.json
# Run all FIM OFAT experiments
./experiments/run_fim.sh
# Run all HUIM OFAT experiments
./experiments/run_huim.sh
# Run a single dimension
python experiments/run_ofat.py --mode fim --dimension density
Mode A (automatic): the LLM generates patterns because the groundtruth list is empty:
{
  "groundtruth": [],
  "llm": {
    "generate_names": true,
    "generate_patterns": true,
    "num_patterns": 5,
    "print_patterns_with_names": true
  }
}
Mode B (manual): patterns are listed explicitly and the LLM is not used for pattern generation:
{
  "groundtruth": [
    {"itemset": [3, 17, 42], "target_support": 0.25},
    {"itemset": [5, 99], "target_support": 0.10}
  ],
  "llm": {
    "generate_names": true,
    "generate_patterns": false
  }
}
Manual groundtruth always takes priority over LLM generation.
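Each ground-truth entry pairs an itemset with a target_support, which is what Step 3 of the validation pipeline checks. The following sketch shows one plausible way to measure support accuracy and recall. The exact definitions used by validator.py are not given in this README, so the absolute-error metric, the exact-match recall, and the function names are all assumptions.

```python
def observed_support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    target = set(itemset)
    hits = sum(1 for t in transactions if target <= set(t))
    return hits / len(transactions)


def support_errors(transactions, groundtruth):
    """Absolute gap between observed and target support, per pattern."""
    return [
        abs(observed_support(transactions, p["itemset"]) - p["target_support"])
        for p in groundtruth
    ]


def recall(groundtruth, mined_itemsets):
    """Share of ground-truth itemsets found verbatim in the miner output."""
    if not groundtruth:
        return 1.0
    mined = {frozenset(m) for m in mined_itemsets}
    found = sum(1 for p in groundtruth if frozenset(p["itemset"]) in mined)
    return found / len(groundtruth)


# Toy data: four transactions and a made-up miner output.
transactions = [[3, 17, 42], [3, 5, 17, 42], [5, 99], [1, 2]]
groundtruth = [
    {"itemset": [3, 17, 42], "target_support": 0.50},
    {"itemset": [5, 99], "target_support": 0.10},
]
mined = [[42, 3, 17], [1, 2]]

print(support_errors(transactions, groundtruth))
print(recall(groundtruth, mined))
```

In this toy example the first pattern meets its target exactly, the second overshoots it, and only one of the two patterns appears in the mined set, so recall would fall below the 0.80 threshold.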
| Parameter | Description | Typical range |
|---|---|---|
| num_transactions | Number of transactions | 100 – 100 000 |
| num_items | Vocabulary size | 20 – 5 000 |
| avg_length | Average transaction length | 2 – 30 |
| density | ρ = total items / (\|D\| × \|I\|) | 0.001 – 0.4 |
| noise_ratio | Fraction of transactions with no pattern | 0 – 0.6 |
| pattern_overlap | Fraction of shared items across patterns | 0 = disjoint |
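The density definition can be checked directly on a generated dataset. The sketch below assumes the .dat file follows the plain SPMF transaction format (one transaction per line, item ids separated by spaces); the function names are illustrative and not part of the repository. By this definition ρ equals the average transaction length divided by |I|, so avg_length, num_items and density cannot all be chosen independently.

```python
def parse_spmf_lines(lines):
    """Parse space-separated item ids, one transaction per line."""
    transactions = []
    for line in lines:
        line = line.strip()
        # SPMF readers skip blank lines and lines starting with #, % or @.
        if not line or line[0] in "#%@":
            continue
        transactions.append([int(token) for token in line.split()])
    return transactions


def structural_stats(transactions, num_items):
    """Compute |D|, observed |I|, average length l and density rho."""
    if not transactions or num_items <= 0:
        raise ValueError("need at least one transaction and one item")
    total_items = sum(len(t) for t in transactions)
    num_tx = len(transactions)
    return {
        "num_transactions": num_tx,
        "distinct_items": len({item for t in transactions for item in t}),
        "avg_length": total_items / num_tx,
        "density": total_items / (num_tx * num_items),
    }


# Toy dataset with a vocabulary of 10 items.
lines = ["# toy example", "1 2 3", "2 3", "1 4 5 6"]
stats = structural_stats(parse_spmf_lines(lines), num_items=10)
print(stats)
```

Here 9 item occurrences over 3 transactions and 10 items give l = 3 and ρ = 0.3, which matches l / |I|.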
| Path | Description |
|---|---|
| Hybrid-model/ | Main application (generator, backend, frontend) |
| Hybrid-model/DOCKER.md | Docker packaging notes |
| Hybrid-model/backend/README.md | REST API reference |
| Hybrid-model/frontend/README.md | Frontend component guide |
| literature/ | Academic references and papers |
| guidance/ | TER guidance materials |
| tests/ | Test suite |
| TER_Report.pdf | Final TER report |