maxhykw/ter-mining

TER – Data Science 2025/2026

Synthetic Data Generation for Pattern Mining

This repository contains the material related to the TER (Travail d'Études et de Recherche) carried out during the academic year 2025–2026.

  • Degree: Master 1 – Data Science
  • Supervisor: Nadjib Lazaar
  • Institution: Université Paris-Saclay

Table of Contents

  1. Project Overview
  2. Project Status
  3. Architecture
  4. Quick Start with Docker
  5. Local Development Setup
  6. CLI Usage
  7. Configuration
  8. Repository Content

Project Overview

The TER focuses on the generation of synthetic datasets for pattern mining using generative AI, in particular Large Language Models (LLMs).

The main deliverable is a Hybrid Synthetic Transactional Data Generator combining:

  • LLM semantic framing — generates realistic item names, co-occurrence pairs, and ground-truth patterns
  • Statistical distribution control — power-law item frequencies, right-skewed transaction lengths

The generator supports both Frequent Itemset Mining (FIM) and High-Utility Itemset Mining (HUIM) modes, and ships with a REST API backend and a React web interface.
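To make the statistical side concrete, here is a minimal sketch of power-law item frequencies and right-skewed transaction lengths using only the standard library. It is not the code in statistical_model.py: the function names, the exponent alpha, and the choice of a log-normal for the lengths are all assumptions made for illustration.

```python
import math
import random
from typing import Optional


def power_law_weights(num_items: int, alpha: float = 1.2) -> list:
    """Zipf-like weights: the item of rank r gets probability proportional to r**(-alpha)."""
    raw = [rank ** -alpha for rank in range(1, num_items + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_length(rng: random.Random, avg_length: float,
                  sigma: float = 0.5, max_len: Optional[int] = None) -> int:
    """Draw a right-skewed length from a log-normal (assumed) whose mean is avg_length."""
    # The mean of a log-normal is exp(mu + sigma^2 / 2), so solve for mu.
    mu = math.log(avg_length) - sigma ** 2 / 2
    length = max(1, round(rng.lognormvariate(mu, sigma)))
    return min(length, max_len) if max_len is not None else length


def sample_transaction(rng: random.Random, weights: list, length: int) -> list:
    """Pick `length` distinct item ids, favouring high-weight (frequent) items."""
    length = min(length, len(weights))  # cannot pick more distinct items than exist
    items = set()
    while len(items) < length:
        items.add(rng.choices(range(len(weights)), weights=weights, k=1)[0])
    return sorted(items)


if __name__ == "__main__":
    rng = random.Random(42)
    weights = power_law_weights(num_items=100)
    for _ in range(3):
        n = sample_length(rng, avg_length=8.0, max_len=100)
        print(sample_transaction(rng, weights, n))
```

Item id 0 is the most frequent item in this sketch; shuffling the weights before sampling would remove that ordering.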


Project Status

Completed

  • Hybrid generation pipeline (FIM + HUIM)
  • LLM-based item name and co-occurrence generation
  • LLM-based ground-truth pattern generation (automatic, with manual override priority)
  • Statistical model with power-law item frequencies and right-skewed transaction lengths
  • Pattern embedding with transaction pool exhaustion check
  • Four-step validation pipeline:
    • Step 1: Structural validation (|D|, |I|, l, ρ)
    • Step 2: Statistical validation (power-law α, skewness, 80/20 rule)
    • Step 3: Ground-truth validation (support accuracy + Recall ≥ 0.80 via FP-Growth)
    • Step 4: OFAT experimental study
  • HUIM utility structure (profit table, quantity assignment, decoupling)
  • SPMF integration (FP-Growth, HUI-Miner)
  • FastAPI REST backend with job lifecycle management and SSE streaming
  • React web interface with real-time monitoring
  • Docker packaging (backend + frontend containers)
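The checks behind Steps 2 and 3 of the validation pipeline listed above can be illustrated with two small functions. This is a sketch under assumptions, not the logic in validator.py or spmf_runner.py: the names top_share and recall are hypothetical, and recall is taken here to mean the fraction of ground-truth itemsets that appear among the mined itemsets.

```python
import math


def top_share(item_counts: list, top_fraction: float = 0.2) -> float:
    """Share of all item occurrences covered by the most frequent `top_fraction` of items."""
    ordered = sorted(item_counts, reverse=True)
    k = max(1, math.ceil(top_fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)


def recall(ground_truth: list, mined: list) -> float:
    """Fraction of ground-truth itemsets found among the mined itemsets (order-insensitive)."""
    if not ground_truth:
        return 1.0
    mined_sets = {frozenset(m) for m in mined}
    found = sum(1 for pattern in ground_truth if frozenset(pattern) in mined_sets)
    return found / len(ground_truth)


def passes_recall(ground_truth: list, mined: list, threshold: float = 0.80) -> bool:
    """Apply the Recall >= 0.80 criterion stated in Step 3."""
    return recall(ground_truth, mined) >= threshold


if __name__ == "__main__":
    print(top_share([80, 5, 5, 5, 5]))
    print(recall([[1, 2], [3]], [[2, 1], [4]]))
```

A top_share close to 0.8 for the top 20% of items corresponds to the 80/20 rule mentioned in Step 2.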

Known Limitations / In Progress

  • HUIM validation edge cases still under investigation
  • OFAT curves for HUIM dimensions not yet fully generated

Architecture

Directory Structure

ter-mining_ter/
├── README.md
├── requirements.txt
├── TER_Report.pdf
├── literature/                      # Academic references
├── guidance/                        # Guidance materials
├── tests/                           # Test suite
└── Hybrid-model/
    ├── docker-compose.yml           # Orchestrates backend + frontend
    ├── Dockerfile.backend           # Python 3.11 + Java runtime
    ├── Dockerfile.frontend          # Node 22 build + nginx runtime
    ├── DOCKER.md                    # Docker packaging notes
    │
    ├── generator/
    │   ├── generate.py              # CLI entry point
    │   ├── validate.py              # Validation entry point
    │   ├── config.py                # Config loading and dataclasses
    │   ├── llm_client.py            # OpenAI / DeepSeek / Gemini client
    │   ├── semantic_engine.py       # LLM: item names, co-occurrence, patterns
    │   ├── statistical_model.py     # Power-law frequencies + skewed lengths
    │   ├── pattern_embedder.py      # Ground-truth pattern embedding
    │   ├── transaction_assembler.py # Core assembly engine
    │   ├── huim_generator.py        # HUIM utility structure
    │   ├── validator.py             # Four-step validation logic
    │   └── utils/
    │       ├── distributions.py     # Power-law, skewness, 80/20 utilities
    │       ├── spmf_formatter.py    # SPMF format read/write
    │       ├── spmf_runner.py       # SPMF algorithm runner + recall
    │       └── plotting.py          # Validation plots
    │
    ├── backend/
    │   ├── main.py                  # FastAPI app
    │   ├── schemas.py               # Pydantic request/response models
    │   ├── routers/                 # generate, jobs, files, ofat
    │   └── services/                # job_manager, ofat_runner
    │
    ├── frontend/
    │   ├── src/
    │   │   ├── api/client.ts        # HTTP + SSE wrapper
    │   │   ├── components/          # React components
    │   │   └── types/api.ts         # TypeScript interfaces
    │   ├── nginx.conf               # Production reverse proxy config
    │   └── vite.config.ts           # Vite + Tailwind + API proxy
    │
    ├── configs/
    │   ├── api_keys.yaml            # LLM API keys (fill before use)
    │   ├── api_keys.yaml.template   # Template for api_keys.yaml
    │   ├── config_fim_example.json  # FIM example config
    │   └── config_huim_example.json # HUIM example config
    │
    ├── experiments/
    │   ├── run_ofat.py              # OFAT experiment script
    │   ├── run_fim.sh               # Automated FIM OFAT experiments
    │   ├── run_huim.sh              # Automated HUIM OFAT experiments
    │   ├── collect_metrics.py       # Metric collection
    │   └── results/
    │       ├── figures/             # Trend curves (PNG)
    │       └── tables/              # CSV results
    │
    └── lib/
        └── spmf.jar                 # SPMF library (FP-Growth, HUI-Miner)

Generation Pipeline

config.json
    ↓
[1] SemanticEngine      →  LLM generates item names + co-occurrence pairs
    ↓
[1.5] generate_patterns()
                        →  If groundtruth=[] and generate_patterns=true:
                           LLM generates patterns from item names
                           (skipped if groundtruth is manually set)
    ↓
[2] StatisticalModel    →  Power-law frequencies (shuffled) + skewed lengths
    ↓
[3] PatternEmbedder     →  Assigns transactions to patterns (pool check)
    ↓
[4] TransactionAssembler → Generates transactions by sampling
    ↓
[5] embed_patterns()    →  Forces pattern items into assigned transactions
    ↓
[6] HUIMGenerator       →  (HUIM only) Adds profits + quantities + decoupling
    ↓
.dat + profits.txt + groundtruth.json + metadata.json
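Stages [3] and [5] of the pipeline above can be sketched as follows. The real PatternEmbedder may work differently; this version assumes that each pattern receives its own disjoint slice of a transaction pool, that the pool excludes the noise_ratio share of transactions, and that "pool check" means failing early when the patterns request more transactions than the pool holds. All function names are hypothetical.

```python
import random


def assign_transactions(num_transactions: int, patterns: list,
                        noise_ratio: float, rng: random.Random) -> dict:
    """Reserve transaction ids for each (itemset, target_support) pattern."""
    pool = list(range(num_transactions))
    rng.shuffle(pool)
    # Assumption: a noise_ratio share of transactions never carries a pattern.
    pool = pool[: round(num_transactions * (1 - noise_ratio))]
    assignments = {}
    for idx, (_itemset, target_support) in enumerate(patterns):
        needed = round(target_support * num_transactions)
        if needed > len(pool):
            raise ValueError(
                f"pool exhausted: pattern {idx} needs {needed}, {len(pool)} left"
            )
        assignments[idx], pool = pool[:needed], pool[needed:]
    return assignments


def embed_patterns(transactions: list, patterns: list, assignments: dict) -> list:
    """Force every pattern's items into the transactions reserved for it."""
    for idx, tx_ids in assignments.items():
        itemset = set(patterns[idx][0])
        for t in tx_ids:
            transactions[t] = sorted(set(transactions[t]) | itemset)
    return transactions


if __name__ == "__main__":
    rng = random.Random(7)
    patterns = [([1, 2], 0.5), ([3], 0.3)]
    base = [[9] for _ in range(10)]
    plan = assign_transactions(10, patterns, noise_ratio=0.2, rng=rng)
    print(embed_patterns(base, patterns, plan))
```

Checking the pool before any transaction is generated means an infeasible combination of target supports and noise ratio is reported up front rather than producing a dataset that silently misses its targets.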

Quick Start with Docker

Docker is the recommended way to run the full stack (backend + frontend) without any local Python or Node.js setup.

Prerequisites

Build and Run

All commands should be run from the Hybrid-model/ directory:

cd Hybrid-model/
docker compose build
docker compose up

Once running:

| Service | URL |
| --- | --- |
| Web UI (frontend) | http://localhost:8080 |
| Backend API | http://localhost:8000 |
| Swagger docs | http://localhost:8000/docs |
| Health check | http://localhost:8000/api/health |

Generated datasets are written to ./datasets/jobs/ on the host via a volume mount. The frontend waits for the backend to pass its health check before starting.

To run in detached mode (background):

docker compose up -d

LLM API Keys in Docker

The LLM API key is stored in Hybrid-model/configs/api_keys.yaml. To use the LLM from the web UI, enter the key in the page; the URL and model name fields can be left blank.

configs/api_keys.yaml is excluded from Docker builds to keep secrets out of images.

Option 1 — Mock mode (no LLM): enable "Skip LLM calls (mock)" in the web UI. No API key needed.

Option 2 — Mount your key file at runtime: create Hybrid-model/docker-compose.override.yml:

services:
  backend:
    volumes:
      - ./datasets:/app/backend/datasets
      - ./configs/api_keys.yaml:/app/configs/api_keys.yaml:ro

Then run as usual:

docker compose up

Option 3 — Enter API key in the UI: the generation form accepts a custom API key per request.

Stop Containers

docker compose down

To also remove built images:

docker compose down --rmi local

Local Development Setup

Backend

Requirements: Python ≥ 3.11, Java JRE (for SPMF).

cd Hybrid-model/

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate          # macOS/Linux
# .venv\Scripts\activate           # Windows

# Install dependencies
pip install -r ../requirements.txt

# Fill in your LLM API key
cp configs/api_keys.yaml.template configs/api_keys.yaml
# edit configs/api_keys.yaml

# Start the backend
uvicorn backend.main:app --reload --port 8000

API available at http://localhost:8000 — Swagger UI at http://localhost:8000/docs.

Frontend

Requirements: Node.js ≥ 18. The backend must be running on port 8000.

cd Hybrid-model/frontend

npm install
npm run dev

The app is available at http://localhost:5173.


CLI Usage

All commands are run from Hybrid-model/.

Generate FIM Dataset

The LLM generates ground-truth patterns automatically when groundtruth is empty in the config:

python generator/generate.py --config configs/config_fim_example.json

With a custom output path:

python generator/generate.py --config configs/config_fim_example.json --output datasets/fim/my_dataset

Without LLM (mock mode, for testing):

python generator/generate.py --config configs/config_fim_example.json --mock

Generate HUIM Dataset

python generator/generate.py --config configs/config_huim_example.json
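For readers new to HUIM, the sketch below shows how a profit table and per-transaction quantities combine into an itemset's utility, using the standard definition: the utility of an item in a transaction is its quantity times its unit profit, and an itemset's utility is summed over the transactions that contain all of its items. The data layout (one dict of item to quantity per transaction) is an assumption for illustration, not the format written by huim_generator.py, and the decoupling step is not covered here.

```python
def itemset_utility(itemset: list, transactions: list, profits: dict) -> int:
    """Total utility of `itemset` over all transactions that contain every item in it."""
    total = 0
    for tx in transactions:  # tx maps item id -> purchased quantity
        if all(item in tx for item in itemset):
            total += sum(tx[item] * profits[item] for item in itemset)
    return total


if __name__ == "__main__":
    profits = {1: 5, 2: 2, 3: 1}          # unit profit per item
    transactions = [
        {1: 2, 2: 1},                      # 2 units of item 1, 1 unit of item 2
        {1: 1, 3: 4},
        {2: 3},
    ]
    print(itemset_utility([1], transactions, profits))     # 2*5 + 1*5 = 15
    print(itemset_utility([1, 2], transactions, profits))  # 2*5 + 1*2 = 12
```

Unlike support, utility is not monotone: adding an item can raise or lower the total, because fewer transactions contain the larger itemset but each contributes more.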

Validate Dataset

# With SPMF (full validation including Recall)
python generator/validate.py \
    --dataset datasets/fim/test/my_dataset.dat \
    --config configs/config_fim_example.json \
    --spmf lib/spmf.jar

# Without SPMF (Step 1 + Step 2 + support accuracy only)
python generator/validate.py \
    --dataset datasets/fim/test/my_dataset.dat \
    --config configs/config_fim_example.json
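The support-accuracy part of validation can be reproduced by hand. The sketch below assumes the FIM .dat file follows the plain SPMF transaction format (one transaction per line, integer item ids separated by spaces, with lines starting with #, % or @ treated as comments or metadata) and compares observed support with the target_support values from the config. The function names are hypothetical, and HUIM files, which carry utility fields, are out of scope.

```python
def parse_spmf_transactions(text: str) -> list:
    """Parse SPMF-style transaction text into a list of frozensets of item ids."""
    transactions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#%@":  # skip blanks, comments, metadata
            continue
        transactions.append(frozenset(int(token) for token in line.split()))
    return transactions


def support(itemset: list, transactions: list) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    return sum(1 for tx in transactions if wanted <= tx) / len(transactions)


def support_errors(groundtruth: list, transactions: list) -> list:
    """Absolute gap between observed and target support for each ground-truth pattern."""
    return [
        abs(support(p["itemset"], transactions) - p["target_support"])
        for p in groundtruth
    ]


if __name__ == "__main__":
    sample = "# toy dataset\n1 2 3\n1 2\n2 3\n4\n"
    txs = parse_spmf_transactions(sample)
    truth = [{"itemset": [1, 2], "target_support": 0.25}]
    print(support([1, 2], txs), support_errors(truth, txs))
```

To run it on a generated file, read the file contents with open(path).read() and pass the text to parse_spmf_transactions.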

OFAT Experiments

# Run all FIM OFAT experiments
./experiments/run_fim.sh

# Run all HUIM OFAT experiments
./experiments/run_huim.sh

# Run a single dimension
python experiments/run_ofat.py --mode fim --dimension density
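OFAT (one factor at a time) means varying a single parameter across a range while every other parameter stays at its baseline value. The sketch below illustrates the idea only; it does not reflect how run_ofat.py is written, and the flat config layout and the sweep values are assumptions.

```python
import copy


def ofat_configs(baseline: dict, sweeps: dict):
    """Yield (dimension, value, config), changing exactly one parameter per config."""
    for dimension, values in sweeps.items():
        for value in values:
            config = copy.deepcopy(baseline)  # never mutate the shared baseline
            config[dimension] = value
            yield dimension, value, config


if __name__ == "__main__":
    baseline = {"num_transactions": 1000, "num_items": 100, "density": 0.05}
    sweeps = {
        "density": [0.01, 0.05, 0.1],
        "num_transactions": [500, 1000, 5000],
    }
    for dimension, value, config in ofat_configs(baseline, sweeps):
        print(dimension, value, config)
```

Because only one factor moves per run, any trend in the resulting metric can be attributed to that factor, at the cost of not observing interactions between factors.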

Configuration

Ground-truth Patterns

Mode A — LLM generates patterns automatically (groundtruth list is empty):

{
  "groundtruth": [],
  "llm": {
    "generate_names": true,
    "generate_patterns": true,
    "num_patterns": 5,
    "print_patterns_with_names": true
  }
}

Mode B — Manual patterns (LLM ignored for pattern generation):

{
  "groundtruth": [
    {"itemset": [3, 17, 42], "target_support": 0.25},
    {"itemset": [5, 99],     "target_support": 0.10}
  ],
  "llm": {
    "generate_names": true,
    "generate_patterns": false
  }
}

Manual groundtruth always takes priority over LLM generation.
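The priority rule between the two modes can be summarised in a few lines. This is a sketch of the rule as described here, not the code in config.py; the function name and the returned labels are hypothetical.

```python
def resolve_pattern_source(config: dict) -> str:
    """Decide where ground-truth patterns come from for a given config."""
    if config.get("groundtruth"):
        return "manual"  # a non-empty list always wins, whatever the llm flags say
    if config.get("llm", {}).get("generate_patterns", False):
        return "llm"     # empty list and generation enabled: ask the LLM
    return "none"        # no patterns are embedded


if __name__ == "__main__":
    mode_a = {"groundtruth": [], "llm": {"generate_patterns": True}}
    mode_b = {
        "groundtruth": [{"itemset": [3, 17, 42], "target_support": 0.25}],
        "llm": {"generate_patterns": True},
    }
    print(resolve_pattern_source(mode_a))  # llm
    print(resolve_pattern_source(mode_b))  # manual
```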

Key Parameters

| Parameter | Description | Typical range |
| --- | --- | --- |
| num_transactions | Number of transactions | 100 – 100 000 |
| num_items | Vocabulary size | 20 – 5 000 |
| avg_length | Average transaction length | 2 – 30 |
| density | ρ = total items / (\|D\| × \|I\|) | 0.001 – 0.4 |
| noise_ratio | Fraction of transactions with no pattern | 0 – 0.6 |
| pattern_overlap | Fraction of shared items across patterns | 0 = disjoint |
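The density formula ties three of these parameters together: since the total number of item occurrences equals |D| times the average length, ρ is also avg_length / |I|. A short sketch that computes the structural quantities from a list of transactions follows; the function name is hypothetical, and the vocabulary size is taken as the number of distinct items observed unless it is passed explicitly.

```python
from typing import Optional


def structural_metrics(transactions: list, num_items: Optional[int] = None) -> dict:
    """Compute |D|, |I|, average length and density for a transaction list."""
    num_transactions = len(transactions)
    if num_items is None:
        num_items = len({item for tx in transactions for item in tx})
    total_items = sum(len(tx) for tx in transactions)
    return {
        "num_transactions": num_transactions,
        "num_items": num_items,
        "avg_length": total_items / num_transactions,
        "density": total_items / (num_transactions * num_items),
    }


if __name__ == "__main__":
    toy = [[1, 2, 3], [1, 2], [4]]
    # 6 item occurrences, |D| = 3, |I| = 4 -> density = 6 / 12 = 0.5
    print(structural_metrics(toy))
```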

Repository Content

| Path | Description |
| --- | --- |
| Hybrid-model/ | Main application (generator, backend, frontend) |
| Hybrid-model/DOCKER.md | Docker packaging notes |
| Hybrid-model/backend/README.md | REST API reference |
| Hybrid-model/frontend/README.md | Frontend component guide |
| literature/ | Academic references and papers |
| guidance/ | TER guidance materials |
| tests/ | Test suite |
| TER_Report.pdf | Final TER report |
