Skip to content

snowmuffin/convmerge

convmerge

PyPI Python versions License: MIT CI PyPI downloads Contributor Covenant

Convert Alpaca, ShareGPT, and mixed chat datasets into a unified messages JSONL for LLM supervised fine-tuning.
Fetch from HuggingFace or GitHub, normalize messy Parquet / JSON / JSONL, convert between Alpaca / ShareGPT / chat schemas, weighted-mix multiple domain sources, and deduplicate — all in one CLI pipeline.

convmerge is a data-preparation CLI and library for LLM supervised fine-tuning (SFT). It takes heterogeneous instruction-tuning datasets — Alpaca, ShareGPT, raw chat JSONL, Parquet dumps — and produces a single clean JSONL file in the standard messages format (or back to alpaca shape) that any fine-tuning framework can consume directly.

It is intentionally scoped to the pre-training-loop step: no model loading, no inference, no labeling, no training orchestration. See Out of scope below.

Repository: github.com/snowmuffin/convmerge
Status: pre-1.0; APIs and CLI may change between minor versions until 1.0.

Install

pip install convmerge                    # core: convert, dedupe, turns; normalize for .json/.jsonl
pip install "convmerge[all]"             # full CLI: fetch (HF+GitHub), parquet, YAML presets

Granular extras:

pip install "convmerge[fetch]"           # YAML manifests + GitHub (PyYAML)
pip install "convmerge[fetch-all]"       # fetch + HuggingFace (``datasets``)
pip install "convmerge[fetch-hf]"        # same dependencies as ``fetch-all`` (backward-compatible name)
pip install "convmerge[parquet]"         # Parquet input for ``normalize``
pip install "convmerge[preset]"          # YAML convert presets (`--preset`, `preset validate`)
Command / feature Extra
convert, dedupe, turns (core)
normalize on .parquet [parquet]
fetch with YAML manifest or GitHub [fetch]
fetch with HuggingFace manifest entries [fetch-all] or [fetch-hf]
convert --preset, preset [preset]
Everything above [all]

Or from a clone:

git clone https://github.com/snowmuffin/convmerge.git
cd convmerge
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,all]"

The four use cases

1. fetch — pull raw data from HF + GitHub via a YAML manifest

HuggingFace entries delegate to datasets.load_dataset(...).to_json(...), i.e. the output is a JSONL dump of the selected split. GitHub entries support a single raw URL, recursive Trees API fetch with an extension filter, or git clone (with optional git lfs pull). fetch is a reproducible downloader, not a mirror of HuggingFace's Arrow cache.

# manifest.yaml
version: 1
defaults: { output_root: ./raw, resume: true }
auth:     { hf_token_env: HF_TOKEN, github_token_env: GITHUB_TOKEN }
datasets:
  - { name: alpaca-ko, hf: MarkrAI/KoCommercial-Dataset, split: train }
  - { name: orca-raw,
      url: https://raw.githubusercontent.com/org/repo/main/data/train.jsonl }
  - { name: repo-tree,
      url: https://github.com/org/example-repo, ext: [".jsonl"] }
  - { name: big-lfs,
      url: https://github.com/org/big-lfs-repo, mode: clone, lfs: true }
convmerge fetch manifest.yaml -o ./raw
# or one-shot shortcuts:
convmerge fetch hf://org/dataset -o ./raw --split train
convmerge fetch https://github.com/org/repo -o ./raw --ext .jsonl

Tokens resolve in order CLI flag → file → env var, and are redacted from logs. See docs/fetch.md for the full schema.

2. normalize — reshape parquet / messy JSON into clean JSONL

convmerge normalize -i ./raw -o ./jsonl

Handles parquet (streamed via pyarrow), top-level JSON arrays, concatenated single-line JSON ({...}{...}{...}), and already-valid JSONL. A directory input is walked recursively and mirrored under the output directory.

3. convert — adapter + emitter pipeline

convmerge convert -i ./jsonl/alpaca.jsonl -o ./train/alpaca.messages.jsonl \
  --from alpaca --format messages

convmerge convert -i ./jsonl/mixed.jsonl -o ./train/mixed.messages.jsonl \
  --from auto --format messages         # auto-detecting chat adapter

# Optional: YAML preset (pip install "convmerge[preset]")
convmerge preset init -o convert_preset.yaml
convmerge preset validate convert_preset.yaml
convmerge convert -i ./jsonl/mixed.jsonl -o ./out.jsonl --preset convert_preset.yaml

Adapters: alpaca, sharegpt, chat (alias auto).
Emitters: messages, alpaca.

Presets and team-specific tuning: docs/custom_presets.md.

chat / auto is a heuristic adapter: it inspects the keys of each input record (messages, conversation(s), text, conversation_a/_b, instruction/input/output, …) and routes to the right branch with a configurable role map. For unusual schemas, pin an explicit adapter (alpaca, sharegpt) or override keys programmatically — see docs/format.md.

4. mix — domain-controlled weighted merge

# Inline weights
convmerge mix \
  -i ./train/code.messages.jsonl:0.4 \
     ./train/math.messages.jsonl:0.3 \
     ./train/general.messages.jsonl:0.3 \
  -o ./train/mixed.jsonl --total 100000 --seed 42

# Or via a config file (YAML requires convmerge[preset])
convmerge mix mix.yaml
# mix.yaml
seed: 42
total: 100000
output: ./train/mixed.jsonl
sources:
  - { path: ./train/code.messages.jsonl,    weight: 0.4 }
  - { path: ./train/math.messages.jsonl,    weight: 0.3 }
  - { path: ./train/general.messages.jsonl, weight: 0.3 }

Weights are normalized automatically and need not sum to 1.0. When a source has fewer records than its allocation it is clipped; pass --oversample to sample with replacement instead. A sidecar .mix.json is written alongside the output recording the exact seed, weights, and per-source counts for full reproducibility. Omit --total to merge all records from every source.

5. dedupe / turns — final cleanup + train/eval split hook

convmerge dedupe -i ./train/mixed.jsonl -o ./train/mixed.dedup.jsonl
convmerge turns  -i ./train/mixed.dedup.jsonl \
  --single-out ./train/single.jsonl \
  --multi-out  ./train/multi.jsonl

See docs/format.md for adapter / emitter schemas and docs/fetch.md for manifest details.

Out of scope

To keep the package lean and dependency-free at its core, convmerge does not include — and has no plans to include — the following:

  • Model loading / inference / training. No PyTorch, Transformers, vLLM, or similar runtime is imported by the core or any shipped extra.
  • Automatic labeling or classification of samples (e.g. topic tagging, quality scoring, safety classification). These are left to upstream tools or private pipelines.
  • RLHF / DPO / preference-dataset construction beyond passing through existing pairwise rows via the chat adapter's pairwise_mode.
  • Training-job orchestration (SkyPilot, RunPod, Modal, K8s operators).
  • Prompt templating / chat-template rendering for specific model families. Output JSONL uses the standard messages / alpaca shapes; downstream trainers apply their own template.
  • Tokenizer-aware length filtering, packing, or curriculum scheduling. Those live in the training stack, not here.
  • Scraping HTML pages or running browser automation. Structured JSON / JSONL / Parquet inputs only.

If any of these are important to your workflow, wire convmerge in as one step of a larger pipeline rather than expecting it to grow into those areas.

Development

See CONTRIBUTING.md for the full guide — setup, local checks, code conventions, and a walkthrough for adding a new adapter / emitter. CI runs Ruff + pytest on Python 3.10 – 3.12.

pip install -e ".[dev,all]"
ruff check src tests
ruff format --check src tests
pytest -q

Participation in this project is governed by the Contributor Covenant Code of Conduct.

Good first PRs: new adapters / emitters for public dataset schemas, new fetch backends (GitLab / Zenodo / Kaggle), recipe examples under examples/, and docs improvements. Browse the good first issue label for concrete starting points.

PyPI release (maintainers)

Releases run from .github/workflows/publish.yml on pushing a v* tag. Publishing authenticates via the PYPI_API_TOKEN GitHub Actions secret.

  1. Create an API token on pypi.org.
    • If the project already exists on PyPI, scope the token to the convmerge project (principle of least privilege).
    • For the very first upload (project not yet registered), PyPI does not allow project-scoped tokens — use Entire account scope for the first release, then rotate to a project-scoped token afterwards and revoke the original.
  2. In the GitHub repo, Settings → Secrets and variables → Actions → New repository secret, add PYPI_API_TOKEN with the token value.
  3. Tag and push: git tag vX.Y.Z && git push origin vX.Y.Z.

Changelog

CHANGELOG.md

License

MIT

About

Merge heterogeneous chat/text sources into a single LLM training format (JSONL)

Topics

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages