TER – Data Science 2025/2026

Synthetic Data Generation for Pattern Mining

This repository contains the material related to the TER (Travail d'Études et de Recherche) carried out during the academic year 2025–2026.

Degree: Master 1 – Data Science
Supervisor: Nadjib Lazaar
Institution: Université Paris-Saclay

Project Overview

The TER focuses on the generation of synthetic datasets for pattern mining using generative AI, in particular Large Language Models (LLMs).

The main deliverable is a Hybrid Synthetic Transactional Data Generator combining:

LLM semantic framing — generates realistic item names, co-occurrence pairs, and ground-truth patterns
Statistical distribution control — power-law item frequencies, right-skewed transaction lengths

The generator supports both Frequent Itemset Mining (FIM) and High-Utility Itemset Mining (HUIM) modes, and ships with a REST API backend and a React web interface.
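
The statistical side can be illustrated with a minimal sketch that uses only the Python standard library. It is not the code in statistical_model.py: the function names, the exponent `alpha`, and the choice of a log-normal law for transaction lengths are assumptions made for this example.

```python
import math
import random


def power_law_probabilities(num_items, alpha=1.0, rng=None):
    """Return item probabilities proportional to 1 / rank**alpha.

    The list is shuffled so that popularity is not tied to the item id
    (hypothetical helper, not the project's API).
    """
    rng = rng or random.Random()
    weights = [rank ** -float(alpha) for rank in range(1, num_items + 1)]
    total = sum(weights)
    probs = [w / total for w in weights]
    rng.shuffle(probs)
    return probs


def skewed_lengths(num_transactions, avg_length, num_items, sigma=0.5, rng=None):
    """Draw right-skewed transaction lengths from a log-normal law.

    The mean of a log-normal is exp(mu + sigma**2 / 2), so mu is chosen
    such that the expected length equals avg_length (before rounding
    and clipping to the range 1..num_items).
    """
    rng = rng or random.Random()
    mu = math.log(avg_length) - sigma ** 2 / 2
    lengths = []
    for _ in range(num_transactions):
        raw = rng.lognormvariate(mu, sigma)
        lengths.append(min(max(round(raw), 1), num_items))
    return lengths
```

A larger `alpha` concentrates probability mass on fewer items, and a larger `sigma` lengthens the right tail of the length distribution.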

Project Status

Completed

Hybrid generation pipeline (FIM + HUIM)
LLM-based item name and co-occurrence generation
LLM-based ground-truth pattern generation (automatic, with manual override priority)
Statistical model with power-law item frequencies and right-skewed transaction lengths
Pattern embedding with transaction pool exhaustion check
Four-step validation pipeline:
- Step 1: Structural validation (|D|, |I|, l, ρ)
- Step 2: Statistical validation (power-law α, skewness, 80/20 rule)
- Step 3: Ground-truth validation (support accuracy + Recall ≥ 0.80 via FP-Growth)
- Step 4: OFAT experimental study
HUIM utility structure (profit table, quantity assignment, decoupling)
SPMF integration (FP-Growth, HUI-Miner)
FastAPI REST backend with job lifecycle management and SSE streaming
React web interface with real-time monitoring
Docker packaging (backend + frontend containers)
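
Two of the Step 2 statistics listed above (skewness and the 80/20 rule) can be computed as in the following minimal sketch. The function names are hypothetical and no acceptance thresholds are implied; validator.py may compute these quantities differently.

```python
import math


def skewness(values):
    """Sample skewness g1 = m3 / m2**1.5 (positive means right-skewed)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def top_share(item_counts, top_fraction=0.2):
    """Share of all item occurrences held by the most frequent items.

    With top_fraction=0.2, a value near 0.8 matches the 80/20 rule.
    """
    counts = sorted(item_counts, reverse=True)
    k = max(1, math.ceil(top_fraction * len(counts)))
    return sum(counts[:k]) / sum(counts)
```

Here `skewness` would be applied to the list of transaction lengths and `top_share` to the per-item occurrence counts.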

Known Limitations / In Progress

HUIM validation edge cases still under investigation
OFAT curves for HUIM dimensions not yet fully generated

Architecture

Directory Structure

ter-mining_ter/
├── README.md
├── requirements.txt
├── TER_Report.pdf
├── literature/                      # Academic references
├── guidance/                        # Guidance materials
├── tests/                           # Test suite
└── Hybrid-model/
    ├── docker-compose.yml           # Orchestrates backend + frontend
    ├── Dockerfile.backend           # Python 3.11 + Java runtime
    ├── Dockerfile.frontend          # Node 22 build + nginx runtime
    ├── DOCKER.md                    # Docker packaging notes
    │
    ├── generator/
    │   ├── generate.py              # CLI entry point
    │   ├── validate.py              # Validation entry point
    │   ├── config.py                # Config loading and dataclasses
    │   ├── llm_client.py            # OpenAI / DeepSeek / Gemini client
    │   ├── semantic_engine.py       # LLM: item names, co-occurrence, patterns
    │   ├── statistical_model.py     # Power-law frequencies + skewed lengths
    │   ├── pattern_embedder.py      # Ground-truth pattern embedding
    │   ├── transaction_assembler.py # Core assembly engine
    │   ├── huim_generator.py        # HUIM utility structure
    │   ├── validator.py             # Four-step validation logic
    │   └── utils/
    │       ├── distributions.py     # Power-law, skewness, 80/20 utilities
    │       ├── spmf_formatter.py    # SPMF format read/write
    │       ├── spmf_runner.py       # SPMF algorithm runner + recall
    │       └── plotting.py          # Validation plots
    │
    ├── backend/
    │   ├── main.py                  # FastAPI app
    │   ├── schemas.py               # Pydantic request/response models
    │   ├── routers/                 # generate, jobs, files, ofat
    │   └── services/                # job_manager, ofat_runner
    │
    ├── frontend/
    │   ├── src/
    │   │   ├── api/client.ts        # HTTP + SSE wrapper
    │   │   ├── components/          # React components
    │   │   └── types/api.ts         # TypeScript interfaces
    │   ├── nginx.conf               # Production reverse proxy config
    │   └── vite.config.ts           # Vite + Tailwind + API proxy
    │
    ├── configs/
    │   ├── api_keys.yaml            # LLM API keys (fill before use)
    │   ├── api_keys.yaml.template   # Template for api_keys.yaml
    │   ├── config_fim_example.json  # FIM example config
    │   └── config_huim_example.json # HUIM example config
    │
    ├── experiments/
    │   ├── run_ofat.py              # OFAT experiment script
    │   ├── run_fim.sh               # Automated FIM OFAT experiments
    │   ├── run_huim.sh              # Automated HUIM OFAT experiments
    │   ├── collect_metrics.py       # Metric collection
    │   └── results/
    │       ├── figures/             # Trend curves (PNG)
    │       └── tables/              # CSV results
    │
    └── lib/
        └── spmf.jar                 # SPMF library (FP-Growth, HUI-Miner)

Generation Pipeline

config.json
    ↓
[1] SemanticEngine      →  LLM generates item names + co-occurrence pairs
    ↓
[1.5] generate_patterns()
                        →  If groundtruth=[] and generate_patterns=true:
                           LLM generates patterns from item names
                           (skipped if groundtruth is manually set)
    ↓
[2] StatisticalModel    →  Power-law frequencies (shuffled) + skewed lengths
    ↓
[3] PatternEmbedder     →  Assigns transactions to patterns (pool check)
    ↓
[4] TransactionAssembler → Generates transactions by sampling
    ↓
[5] embed_patterns()    →  Forces pattern items into assigned transactions
    ↓
[6] HUIMGenerator       →  (HUIM only) Adds profits + quantities + decoupling
    ↓
.dat + profits.txt + groundtruth.json + metadata.json
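
Steps [3] and [5] above can be sketched as follows. This is a hedged illustration, not the code in pattern_embedder.py: it assumes that each transaction hosts at most one pattern, that a `noise_ratio` fraction of transactions stays pattern-free, and that pool exhaustion is reported with a ValueError. The function names are hypothetical; the `itemset` and `target_support` keys follow the configuration examples in this README.

```python
import random


def assign_pool(num_transactions, patterns, noise_ratio=0.0, rng=None):
    """Give each pattern a disjoint list of transaction indices.

    Assumption: a pattern with target_support s needs round(s * |D|)
    transactions, and noise transactions are never assigned.
    """
    rng = rng or random.Random()
    pool = list(range(num_transactions))
    rng.shuffle(pool)
    available = num_transactions - int(noise_ratio * num_transactions)
    needed = [round(p["target_support"] * num_transactions) for p in patterns]
    if sum(needed) > available:
        # Pool check: the requested supports do not fit in the dataset.
        raise ValueError(
            f"patterns need {sum(needed)} transactions, only {available} available"
        )
    assignment, start = [], 0
    for count in needed:
        assignment.append(pool[start:start + count])
        start += count
    return assignment


def embed_into(transactions, patterns, assignment):
    """Force every item of a pattern into its assigned transactions (in place)."""
    for pattern, indices in zip(patterns, assignment):
        for idx in indices:
            merged = set(transactions[idx]) | set(pattern["itemset"])
            transactions[idx] = sorted(merged)
    return transactions
```

Under these assumptions, a pattern's support after embedding is at least its target, and can exceed it when random sampling happens to produce the same items elsewhere.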

Quick Start with Docker

Docker is the recommended way to run the full stack (backend + frontend) without any local Python or Node.js setup.

Prerequisites

Docker ≥ 24
Docker Compose v2 (bundled with Docker Desktop)

Build and Run

All commands should be run from the Hybrid-model/ directory:

cd Hybrid-model/
docker compose build
docker compose up

Once running:

| Service | URL |
|---|---|
| Web UI (frontend) | http://localhost:8080 |
| Backend API | http://localhost:8000 |
| Swagger docs | http://localhost:8000/docs |
| Health check | http://localhost:8000/api/health |

Generated datasets are written to ./datasets/jobs/ on the host via a volume mount. The frontend waits for the backend to pass its health check before starting.
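
A readiness check against the health endpoint can be scripted as in the sketch below. It assumes the endpoint answers with HTTP 200 once the backend is ready (the response body is not documented here); the function names, retry count, and delay are illustrative.

```python
import time
import urllib.error
import urllib.request


def backend_is_healthy(url="http://localhost:8000/api/health", timeout=2.0):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_for_backend(check=backend_is_healthy, retries=30, delay=1.0, sleep=time.sleep):
    """Poll `check` until it succeeds or `retries` attempts are used up."""
    for _ in range(retries):
        if check():
            return True
        sleep(delay)
    return False
```

Calling `wait_for_backend()` after `docker compose up -d` blocks until the backend responds or the retries run out.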

To run in detached mode (background):

docker compose up -d

LLM API Keys in Docker

The LLM API key is stored in Hybrid-model/configs/api_keys.yaml. To use the LLM from the web UI, copy the key from that file into the generation form; the URL and model name fields can be left empty.

configs/api_keys.yaml is excluded from Docker builds to keep secrets out of images.

Option 1 — Mock mode (no LLM): enable "Skip LLM calls (mock)" in the web UI. No API key needed.

Option 2 — Mount your key file at runtime: create Hybrid-model/docker-compose.override.yml:

services:
  backend:
    volumes:
      - ./datasets:/app/backend/datasets
      - ./configs/api_keys.yaml:/app/configs/api_keys.yaml:ro

Then run as usual:

docker compose up

Option 3 — Enter API key in the UI: the generation form accepts a custom API key per request.

Stop Containers

docker compose down

To also remove built images:

docker compose down --rmi local

Local Development Setup

Backend

Requirements: Python ≥ 3.11, Java JRE (for SPMF).

cd Hybrid-model/

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate          # macOS/Linux
# .venv\Scripts\activate           # Windows

# Install dependencies
pip install -r ../requirements.txt

# Fill in your LLM API key
cp configs/api_keys.yaml.template configs/api_keys.yaml
# edit configs/api_keys.yaml

# Start the backend
uvicorn backend.main:app --reload --port 8000

API available at http://localhost:8000 — Swagger UI at http://localhost:8000/docs.

Frontend

Requirements: Node.js ≥ 18. The backend must be running on port 8000.

cd Hybrid-model/frontend

npm install
npm run dev

The app is available at http://localhost:5173.

CLI Usage

All commands are run from Hybrid-model/.

Generate FIM Dataset

LLM generates ground-truth patterns automatically (groundtruth is empty in config):

python generator/generate.py --config configs/config_fim_example.json

With a custom output path:

python generator/generate.py --config configs/config_fim_example.json --output datasets/fim/my_dataset

Without LLM (mock mode, for testing):

python generator/generate.py --config configs/config_fim_example.json --mock

Generate HUIM Dataset

python generator/generate.py --config configs/config_huim_example.json
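
For reference, the standard HUIM definition of utility is: the utility of an itemset in a transaction is the sum of unit profit times quantity over its items, and its utility in the dataset is the sum over all transactions that contain it. The sketch below works on in-memory dictionaries and makes no assumption about the on-disk layout of the .dat and profits.txt files; the function name is hypothetical.

```python
def itemset_utility(itemset, transactions, profits):
    """Total utility of `itemset` over a dataset.

    transactions: list of {item: quantity} dictionaries
    profits:      {item: unit profit} dictionary
    """
    items = set(itemset)
    total = 0
    for quantities in transactions:
        # Only transactions containing every item of the itemset count.
        if items.issubset(quantities):
            total += sum(profits[i] * quantities[i] for i in items)
    return total
```

An itemset is a high-utility itemset when this total reaches a user-chosen minimum utility threshold.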

Validate Dataset

# With SPMF (full validation including Recall)
python generator/validate.py \
    --dataset datasets/fim/test/my_dataset.dat \
    --config configs/config_fim_example.json \
    --spmf lib/spmf.jar

# Without SPMF (Step 1 + Step 2 + support accuracy only)
python generator/validate.py \
    --dataset datasets/fim/test/my_dataset.dat \
    --config configs/config_fim_example.json
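
The two ground-truth checks mentioned above (support accuracy and recall) can be sketched as follows. The helper names are hypothetical, and the list of mined itemsets is assumed to have already been parsed from the FP-Growth output; validator.py and spmf_runner.py may proceed differently.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items.issubset(t))
    return hits / len(transactions)


def support_errors(groundtruth, transactions):
    """Absolute gap between observed and target support, per pattern."""
    return [
        abs(support(p["itemset"], transactions) - p["target_support"])
        for p in groundtruth
    ]


def recall(groundtruth, mined_itemsets):
    """Fraction of ground-truth patterns found among the mined itemsets."""
    mined = {frozenset(m) for m in mined_itemsets}
    found = sum(1 for p in groundtruth if frozenset(p["itemset"]) in mined)
    return found / len(groundtruth)
```

With these helpers, the Step 3 criterion stated earlier corresponds to `recall(...) >= 0.80`.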

OFAT Experiments

# Run all FIM OFAT experiments
./experiments/run_fim.sh

# Run all HUIM OFAT experiments
./experiments/run_huim.sh

# Run a single dimension
python experiments/run_ofat.py --mode fim --dimension density
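
OFAT (one-factor-at-a-time) varies a single parameter while every other parameter stays at its baseline value. A minimal sketch, assuming a flat dictionary of parameters named as in the Key Parameters table; run_ofat.py may organize its configurations differently, and the function name is hypothetical.

```python
import copy


def ofat_configs(baseline, dimension, values):
    """Return one config per value, changing only `dimension`."""
    if dimension not in baseline:
        raise KeyError(f"unknown dimension: {dimension}")
    configs = []
    for value in values:
        config = copy.deepcopy(baseline)  # leave the baseline untouched
        config[dimension] = value
        configs.append(config)
    return configs
```

Each returned configuration would then be passed to the generator, and the resulting metrics plotted against the varied parameter to obtain a trend curve.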

Configuration

Ground-truth Patterns

Mode A — LLM generates patterns automatically (groundtruth list is empty):

{
  "groundtruth": [],
  "llm": {
    "generate_names": true,
    "generate_patterns": true,
    "num_patterns": 5,
    "print_patterns_with_names": true
  }
}

Mode B — Manual patterns (LLM ignored for pattern generation):

{
  "groundtruth": [
    {"itemset": [3, 17, 42], "target_support": 0.25},
    {"itemset": [5, 99],     "target_support": 0.10}
  ],
  "llm": {
    "generate_names": true,
    "generate_patterns": false
  }
}

Manual groundtruth always takes priority over LLM generation.
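
The priority rule can be summarized in a few lines. This is an illustrative sketch of the decision only; the function name and the returned labels are hypothetical and do not come from config.py.

```python
def pattern_source(config):
    """Decide where ground-truth patterns come from.

    A non-empty "groundtruth" list always wins (Mode B). Otherwise the
    LLM is used if "llm.generate_patterns" is true (Mode A).
    """
    if config.get("groundtruth"):
        return "manual"
    if config.get("llm", {}).get("generate_patterns", False):
        return "llm"
    return "none"
```

Note that a config with manual patterns and `generate_patterns` set to true still resolves to the manual patterns.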

Key Parameters

| Parameter | Description | Typical range |
|---|---|---|
| `num_transactions` | Number of transactions | 100 – 100 000 |
| `num_items` | Vocabulary size | 20 – 5 000 |
| `avg_length` | Average transaction length | 2 – 30 |
| `density` | ρ = total items / (\|D\| × \|I\|) | 0.001 – 0.4 |
| `noise_ratio` | Fraction of transactions with no pattern | 0 – 0.6 |
| `pattern_overlap` | Fraction of shared items across patterns | 0 = disjoint |
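
The density formula ties the parameters together: since the total number of item occurrences equals |D| times the average transaction length, ρ is also the average length divided by |I|. A small sketch that computes these quantities from an in-memory list of transactions (the function name is hypothetical):

```python
def structural_summary(transactions, num_items):
    """Compute |D|, the average length, and the density ρ of a dataset."""
    num_transactions = len(transactions)
    total_items = sum(len(t) for t in transactions)
    return {
        "num_transactions": num_transactions,
        "avg_length": total_items / num_transactions,
        # ρ = total items / (|D| × |I|) = avg_length / |I|
        "density": total_items / (num_transactions * num_items),
    }
```

For example, an average length of 10 over a vocabulary of 100 items corresponds to a density of 0.1.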

Repository Content

| Path | Description |
|---|---|
| `Hybrid-model/` | Main application (generator, backend, frontend) |
| `Hybrid-model/DOCKER.md` | Docker packaging notes |
| `Hybrid-model/backend/README.md` | REST API reference |
| `Hybrid-model/frontend/README.md` | Frontend component guide |
| `literature/` | Academic references and papers |
| `guidance/` | TER guidance materials |
| `tests/` | Test suite |
| `TER_Report.pdf` | Final TER report |
