
GSMA Dataset Creation (PRJ-41)

Pipelines for creating QA triplets from GSMA data for the Open Telecom LLMs project.

Overview

This repository contains end-to-end data processing pipelines that transform GSMA technical specifications and reports into high-quality synthetic question-answer datasets for training telecom-focused language models. The pipeline processes hundreds of documents through stages including document conversion, semantic chunking, synthetic Q&A generation using large language models, similarity analysis, quality filtering, and LLM-based validation. The resulting datasets are published to HuggingFace Hub in both contrastive learning format (for embedding models) and Q&A format (for retrieval-augmented generation). The repository includes three main pipelines: PRD (technical specifications), Discover (reports and whitepapers), and Annotation (human validation workflows with domain expert workspaces).


Notes

This project was developed fairly rapidly, and some design decisions made at the outset we now consider sub-optimal. Owing to time constraints, and an unwillingness to recreate time-consuming and expensive steps (such as question creation and validation), there is some technical debt that would ideally be resolved in a longer project. In retrospect, dvc was not a good fit for this workflow, as it involves very time-consuming and expensive processes that are unlikely to be reproduced. Iterative (the dvc developers) now have a tool called datachain that is worth evaluating as a better fit: it was designed to resolve many of the shortcomings we experienced with dvc, which is otherwise well suited to creating reproducible AI (machine learning) pipelines.

Only re-run the whole pipeline if a new dataset is required

Running the complete dvc pipelines will recreate the questions data and then the validation, both of which take a considerable amount of time. Owing to the non-deterministic nature of LLMs, the questions created will also differ from the data delivered to the Open Telco project so far. For this reason I (@ivyleavedtoadflax) recommend not re-running the whole pipeline, unless the intention is to completely recreate the datasets that were produced for the project, which is probably not desirable.

In addition, the existing PRD pipeline will register as needing reproduction if you run dvc status or dvc repro --dry. This is because we changed pipeline components as we went, but did not recreate the PRD pipeline from the start, as this would have invalidated results that had already been created and annotated.

Task Order

Some tasks (such as assigning a working group) would be better completed as part of the pipeline and incorporated at an early stage. We did not do this because it would have required recreating the questions data: we did not receive the working group mapping until after that data had been produced.

Data Format

Initially we worked with a simple JSON file format with one document per chunk. This later became a bottleneck, and we switched to parquet files. To avoid re-running the question creation task, we did not implement parquet at the beginning of the pipeline, so you will see the initial stages of the pipeline using JSON files and the later stages using parquet. Ideally we would have used parquet throughout.

Binary classifiers

The filter stages of the workflows were implemented with binary classifiers distilled from larger models, using the sieves framework. Sieves is moving rapidly, so, to avoid adding instability to this project's dependencies, we did not implement the distillation process in this repository. Instead we used a sieves script to train the models externally in their own virtual environment, and simply loaded the models in the pipeline for inference.

An example script is included in the examples/ folder.
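At inference time, a filter stage reduces to scoring each chunk with the loaded classifier and keeping those above a threshold. A sketch of that shape, where `classify` is a stand-in for the real distilled model (any callable returning a keep-probability), and the field names are illustrative:

```python
# Sketch of a filter stage. `classify` stands in for the distilled
# binary classifier; here it is any callable mapping text to the
# probability that a chunk should be kept.
from typing import Callable, Iterable


def apply_chunk_filter(
    chunks: Iterable[dict],
    classify: Callable[[str], float],
    min_probability: float = 0.5,
) -> list[dict]:
    """Keep chunks whose keep-probability meets the threshold."""
    kept = []
    for chunk in chunks:
        p = classify(chunk["text"])
        if p >= min_probability:
            kept.append({**chunk, "keep_probability": p})
    return kept
```

Keeping the model behind a plain callable is what allowed the distillation itself to live outside this repository.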

Limitations of OpenRouter

In the validation stage we need to send several hundred thousand requests to qwen for validation, which is slow and expensive. I recommend pinning the provider to Cerebras, which is optimised for high-throughput inference at reasonable cost. Running with 50 concurrent tasks works without issue; running with 100 generated many 429 errors. Somewhere in between may be optimal.
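The concurrency cap described above can be sketched with an asyncio semaphore plus backoff on 429s. This is illustrative only: `send_request` is a placeholder for the real OpenRouter call, and the retry numbers are assumptions, not the project's actual settings.

```python
# Sketch: cap in-flight validation requests with a semaphore and back
# off on HTTP 429. `send_request` is a placeholder for the real call.
import asyncio
import random


async def call_with_limit(sem, send_request, payload, max_retries=5):
    async with sem:  # at most N requests in flight at once
        for attempt in range(max_retries):
            status, body = await send_request(payload)
            if status != 429:
                return body
            # exponential backoff with jitter before retrying
            await asyncio.sleep(2**attempt + random.random())
        raise RuntimeError("rate limited after retries")


async def validate_all(payloads, send_request, max_concurrent=50):
    sem = asyncio.Semaphore(max_concurrent)
    tasks = [call_with_limit(sem, send_request, p) for p in payloads]
    return await asyncio.gather(*tasks)
```

Tuning `max_concurrent` between 50 and 100 is where the sweet spot mentioned above would be found.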

Managing the validation step

The validation step requires several hundred thousand API calls. To manage this process effectively, we implemented an SQLite job tracker that records the success of each API call in an ephemeral database stored in .dvc/.tmp/validation_checkpoints/requests.db. This ensures we can identify successful and failed tasks for repetition, something which was not achievable with dvc alone.
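The core of such a tracker is small: record each request's outcome, and on the next run skip anything already marked successful. A minimal sketch (the table and column names here are illustrative, not the actual schema in requests.db):

```python
# Sketch of an SQLite checkpoint tracker. Schema is illustrative.
import sqlite3


class JobTracker:
    def __init__(self, db_path: str = ":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS requests ("
            "id TEXT PRIMARY KEY, status TEXT NOT NULL)"
        )

    def mark(self, request_id: str, status: str) -> None:
        """Record a request as 'success' or 'failed'."""
        self.conn.execute(
            "INSERT OR REPLACE INTO requests VALUES (?, ?)",
            (request_id, status),
        )
        self.conn.commit()

    def pending(self, all_ids: list[str]) -> list[str]:
        """Return IDs not yet successful, to retry on the next run."""
        done = {
            row[0]
            for row in self.conn.execute(
                "SELECT id FROM requests WHERE status = 'success'"
            )
        }
        return [i for i in all_ids if i not in done]
```

Because the database persists between runs, repeated invocations of the stage converge on an empty queue, which dvc's stage-level caching cannot express on its own.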

If you wish to run this step with dvc the approach is to:

  1. Delete the database prior to running the validate_requests stage for the first time. Setting the --force parameter on uv run gsma validation validate-requests will have the same effect.
  2. Set a request limit (50,000 is reasonable) to reduce memory overhead
  3. Run the validate_requests stage multiple times until no further requests remain in the queue. If you run the stage with dvc, it will show the stage as completed after the first run, since dvc has no knowledge of the checkpoints database, so you will need to run dvc repro -sf pipelines/prd/validate_requests to force re-running that stage. Passing the -i parameter makes the run interactive, allowing you to confirm it prior to execution.
  4. You can monitor progress of the job by running uv run scripts/check_validation_progress.py

Setup

  1. Install dependencies:
uv sync
  2. Configure AWS credentials (if using remote DVC storage):
# Set up your Mantis AWS key
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret

# This may fail for some artefacts. See note below for DVC limitations.
  3. Pull latest data:
dvc pull

Usage

Running the Complete Pipeline

dvc repro

Data Structure

data/
├── raw/                    # Original source documents (DOCX, PDF)
├── prd/                    # PRD pipeline outputs
│   ├── processed/          # Markdown files from DOCX conversion
│   ├── chunks_*/           # Chunked data at different token sizes (500, 1000, 2000, 3000, 4000)
│   ├── questions_*/        # Generated Q&A pairs per chunk size
│   ├── combined/           # Merged chunks + questions with working group classification
│   ├── similarity/         # Similarity analysis results (hashes, rankings, overlaps)
│   ├── exploded/           # Question-centric format with positive/negative chunks
│   ├── filtered/           # Quality-filtered questions and chunks
│   └── validation/         # LLM validation results and final datasets
├── discover/               # Discover pipeline outputs (similar structure to PRD)
└── gsma_prd_synthetic_with_subgroups/  # Annotated dataset with subgroup classifications

Pipeline Stages

PRD Pipeline (pipelines/prd/dvc.yaml)

End-to-end pipeline for technical specifications:

  1. process_documents:

    • Converts DOCX → Markdown
    • Removes GSMA template boilerplate
    • Input: data/raw/ → Output: data/prd/processed/
  2. create_late_chunks (5 stages):

    • Creates late chunks at 500/1000/2000/3000/4000 tokens
    • Uses sentence-transformers/all-MiniLM-L6-v2 embeddings
    • Output: data/prd/chunks_{size}/
  3. generate_questions (5 stages):

    • Generates 5/10/20/30/40 synthetic Q&A pairs per chunk size
    • Uses Cerebras GPT-OSS-120B via OpenRouter
    • Output: data/prd/questions_{size}/
  4. data_combiner:

    • Merges all chunks + questions with working group classification
    • Output: data/prd/combined/
  5. similarity_hasher:

    • Adds SHA-256 content hashes for deduplication
    • Output: data/prd/similarity/hashed/
  6. similarity_ranker:

    • FAISS IVFFlat similarity search (top-K=20, threshold=0.3)
    • Output: data/prd/similarity/ranked/
  7. overlap_detector:

    • Character offset-based text overlap detection (min 50 chars)
    • Output: data/prd/similarity/overlaps/
  8. explode_questions:

    • Question-centric format (min-similarity: 0.35, max: 0.95)
    • Output: data/prd/exploded/
  9. apply_question_filter:

    • External reference classifier (filters unavailable content)
    • Output: data/prd/filtered/questions/
  10. apply_chunk_filter:

    • Procedures classifier + keyword exclusion
    • Filters: legal/procedural content, "prd@gsma.com" boilerplate
    • Output: data/prd/filtered/chunks/
  11. filter_questions_by_chunk_quality:

    • Combined quality filtering (min probability: 0.5)
    • Output: data/prd/filtered/combined/
  12. validate_requests:

    • LLM validation with Qwen 235B via Cerebras
    • 50 concurrent requests, 50k question limit
    • Output: data/prd/validation/validated/
  13. create_validation_dataset:

    • Dual format: embedding (contrastive) + QA (RAG)
    • Max 3 positives/negatives per question
    • Output: data/prd/validation/datasets/
  14. upload_embedding_dataset:

    • Uploads to HuggingFace: mantisnlp/gsma_prd_synthetic_embedding
  15. upload_qa_dataset:

    • Uploads to HuggingFace: mantisnlp/gsma_prd_synthetic_qa
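One of the stages above, overlap_detector, reduces to an interval intersection: two chunks from the same document overlap if their character ranges share at least 50 characters. A sketch under that reading, with illustrative field names:

```python
# Sketch of character-offset overlap detection: chunks from the same
# document overlap if their character ranges intersect by at least
# `min_chars`. Field names are illustrative.
def detect_overlaps(chunks: list[dict], min_chars: int = 50) -> list[tuple]:
    """Return (id_a, id_b, overlap_length) pairs per document."""
    overlaps = []
    for i, a in enumerate(chunks):
        for b in chunks[i + 1 :]:
            if a["doc"] != b["doc"]:
                continue
            shared = min(a["end"], b["end"]) - max(a["start"], b["start"])
            if shared >= min_chars:
                overlaps.append((a["id"], b["id"], shared))
    return overlaps
```

Detecting such overlaps matters downstream: a "negative" chunk that textually overlaps a positive one would poison contrastive training pairs.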

Discover Pipeline (pipelines/discover/dvc.yaml)

Similar structure for reports/whitepapers (304 PDF/DOCX documents):

  • Includes web scraping with Playwright
  • PDF processing via PyMuPDF
  • Same chunking → validation → dataset creation workflow
  • Outputs: mantisnlp/gsma_discover_synthetic_embedding and mantisnlp/gsma_discover_synthetic_qa

Annotation Pipeline (pipelines/annotation/dvc.yaml)

Human validation workflow with subgroup-based tasks:

  1. add_subgroups: Adds working group/subgroup classifications to datasets
  2. upload_*_annotation: Creates Argilla workspaces for domain experts (TSG, FASG, NG, RCS, eSIM)

Running Tests

uv run pytest tests/ -v

Install in Development Mode

uv sync

CLI Commands

The gsma CLI provides comprehensive tools for document processing, question generation, validation, filtering, and annotation management.

Document Processing

# Convert DOCX to Markdown
uv run gsma process <input_dir> <output_dir>

# Remove duplicate files
uv run gsma deduplicate <directory> [--execute]

# Create chunks from Markdown files
uv run gsma chunk <input_dir> <output_dir> [--chunker late] [--chunk-size 500]
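For intuition on what the chunk command produces, here is a deliberately simplified fixed-size splitter. The real pipeline uses late chunking with sentence-transformers embeddings; this sketch merely approximates tokens as whitespace-separated words:

```python
# Simplified illustration of fixed-size chunking. The actual pipeline
# uses "late chunking" with embeddings; tokens are approximated here
# by whitespace-separated words.
def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```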

Question Generation

# Generate synthetic Q&A pairs from chunks
uv run gsma questions generate-from-chunks <input_path> <output_path> \
  --num-questions 5 \
  --model cerebras/llama3.1-70b

# Combine questions with chunks
uv run gsma questions combine-questions <questions_path> <chunks_path> <output_path>

Similarity Analysis

# Combine data with working group classification
uv run gsma similarity combine <chunks_dir> <questions_dir> <output_path>

# Add SHA-256 content hashes
uv run gsma similarity hash <input_path> <output_path>

# FAISS similarity ranking
uv run gsma similarity rank <input_path> <output_path> --top-k 20

# Detect text overlaps
uv run gsma similarity detect-overlaps <input_path> <output_path>

Quality Filtering

# Apply chunk quality filter (procedures classifier)
uv run gsma filters apply-chunk-filter <input_path> <output_path>

# Apply question filter (external reference classifier)
uv run gsma filters apply-question-filter <input_path> <output_path>

# Filter questions by chunk quality
uv run gsma filters filter-questions-by-chunk-quality <input_path> <output_path>

Validation

# Explode questions to question-centric format
uv run gsma validation explode-questions <input_path> <output_path>

# Validate Q&A pairs with LLM
uv run gsma validation validate-requests <input_path> <output_path> \
  --model cerebras/qwen-2.5-235b \
  --max-concurrent 50

Dataset Creation

# Create datasets from validation results
uv run gsma datasets create-from-validation <input_path> <output_dir>

# Upload to HuggingFace Hub
uv run gsma datasets upload <dataset_path> <repo_name>

Argilla Annotation Management

# Upload dataset for annotation
uv run gsma argilla upload --dataset-path <path> -w <workspace>

# Upload by subgroup
uv run gsma argilla upload-by-subgroup \
  --dataset-path <path> \
  --subgroup TSG \
  --sample-size 1000

# User management
uv run gsma argilla add-users -w TSG --count 10 --output-csv users.csv
uv run gsma argilla add-user -u alice -p secret123 -w TSG -w FASG
uv run gsma argilla list-users -w TSG
uv run gsma argilla list-workspaces
uv run gsma argilla list-datasets -w TSG

# Track annotation progress
uv run gsma argilla track-progress -w TSG

# Download annotated results
uv run gsma argilla download <dataset_name> --output-path <path> -w <workspace>

# Cleanup
uv run gsma argilla delete-user -u username
uv run gsma argilla delete-workspace TSG

Subgroup Classification

# Add subgroup classifications to dataset
uv run gsma add-subgroup-to-dataset \
  --dataset-repo mantisnlp/gsma_prd_synthetic \
  --working-groups data/working_groups_mapping.json \
  --output data/gsma_prd_synthetic_with_subgroups
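Conceptually this command is a lookup join: each record gains a subgroup label drawn from the working-group mapping file. A sketch, assuming a flat JSON mapping and illustrative field names (the actual mapping structure may differ):

```python
# Sketch: attach a subgroup label to each record from a working-group
# mapping file. Mapping keys and record fields are illustrative.
import json


def add_subgroups(records: list[dict], mapping_path: str) -> list[dict]:
    with open(mapping_path) as f:
        mapping = json.load(f)  # e.g. {"document_id": "TSG", ...}
    return [
        {**r, "subgroup": mapping.get(r["document_id"], "UNKNOWN")}
        for r in records
    ]
```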

Help

# Get help for any command
uv run gsma --help
uv run gsma <command> --help
uv run gsma <command> <subcommand> --help

About

GSMA telecom documentation dataset creation pipeline with hard negative generation for embedding training. Features concurrent LLM validation, semantic similarity ranking, and DVC-based reproducible data processing.
