Convert Alpaca, ShareGPT, and mixed chat datasets into a unified
messagesJSONL for LLM supervised fine-tuning.
Fetch from HuggingFace or GitHub, normalize messy Parquet / JSON / JSONL, convert between Alpaca / ShareGPT / chat schemas, weighted-mix multiple domain sources, and deduplicate — all in one CLI pipeline.
convmerge is a data-preparation CLI and library for LLM supervised fine-tuning (SFT).
It takes heterogeneous instruction-tuning datasets — Alpaca, ShareGPT, raw chat JSONL,
Parquet dumps — and produces a single clean JSONL file in the standard messages format
(or back to alpaca shape) that any fine-tuning framework can consume directly.
It is intentionally scoped to the pre-training-loop step: no model loading, no inference, no labeling, no training orchestration. See Out of scope below.
Repository: github.com/snowmuffin/convmerge
Status: pre-1.0; APIs and CLI may change between minor versions until 1.0.
pip install convmerge # core: convert, dedupe, turns; normalize for .json/.jsonl
pip install "convmerge[all]" # full CLI: fetch (HF+GitHub), parquet, YAML presetsGranular extras:
pip install "convmerge[fetch]" # YAML manifests + GitHub (PyYAML)
pip install "convmerge[fetch-all]" # fetch + HuggingFace (``datasets``)
pip install "convmerge[fetch-hf]" # same dependencies as ``fetch-all`` (backward-compatible name)
pip install "convmerge[parquet]" # Parquet input for ``normalize``
pip install "convmerge[preset]" # YAML convert presets (`--preset`, `preset validate`)| Command / feature | Extra |
|---|---|
convert, dedupe, turns |
(core) |
normalize on .parquet |
[parquet] |
fetch with YAML manifest or GitHub |
[fetch] |
fetch with HuggingFace manifest entries |
[fetch-all] or [fetch-hf] |
convert --preset, preset |
[preset] |
| Everything above | [all] |
Or from a clone:
git clone https://github.com/snowmuffin/convmerge.git
cd convmerge
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,all]"HuggingFace entries delegate to
datasets.load_dataset(...).to_json(...), i.e. the output is a JSONL dump of the selected split. GitHub entries support a single raw URL, recursive Trees API fetch with an extension filter, orgit clone(with optionalgit lfs pull).fetchis a reproducible downloader, not a mirror of HuggingFace's Arrow cache.
# manifest.yaml
version: 1
defaults: { output_root: ./raw, resume: true }
auth: { hf_token_env: HF_TOKEN, github_token_env: GITHUB_TOKEN }
datasets:
- { name: alpaca-ko, hf: MarkrAI/KoCommercial-Dataset, split: train }
- { name: orca-raw,
url: https://raw.githubusercontent.com/org/repo/main/data/train.jsonl }
- { name: repo-tree,
url: https://github.com/org/example-repo, ext: [".jsonl"] }
- { name: big-lfs,
url: https://github.com/org/big-lfs-repo, mode: clone, lfs: true }convmerge fetch manifest.yaml -o ./raw
# or one-shot shortcuts:
convmerge fetch hf://org/dataset -o ./raw --split train
convmerge fetch https://github.com/org/repo -o ./raw --ext .jsonlTokens resolve in order CLI flag → file → env var, and are redacted from logs. See docs/fetch.md for the full schema.
convmerge normalize -i ./raw -o ./jsonlHandles parquet (streamed via pyarrow), top-level JSON arrays, concatenated
single-line JSON ({...}{...}{...}), and already-valid JSONL. A directory
input is walked recursively and mirrored under the output directory.
convmerge convert -i ./jsonl/alpaca.jsonl -o ./train/alpaca.messages.jsonl \
--from alpaca --format messages
convmerge convert -i ./jsonl/mixed.jsonl -o ./train/mixed.messages.jsonl \
--from auto --format messages # auto-detecting chat adapter
# Optional: YAML preset (pip install "convmerge[preset]")
convmerge preset init -o convert_preset.yaml
convmerge preset validate convert_preset.yaml
convmerge convert -i ./jsonl/mixed.jsonl -o ./out.jsonl --preset convert_preset.yamlAdapters: alpaca, sharegpt, chat (alias auto).
Emitters: messages, alpaca.
Presets and team-specific tuning: docs/custom_presets.md.
chat/autois a heuristic adapter: it inspects the keys of each input record (messages,conversation(s),text,conversation_a/_b,instruction/input/output, …) and routes to the right branch with a configurable role map. For unusual schemas, pin an explicit adapter (alpaca,sharegpt) or override keys programmatically — see docs/format.md.
# Inline weights
convmerge mix \
-i ./train/code.messages.jsonl:0.4 \
./train/math.messages.jsonl:0.3 \
./train/general.messages.jsonl:0.3 \
-o ./train/mixed.jsonl --total 100000 --seed 42
# Or via a config file (YAML requires convmerge[preset])
convmerge mix mix.yaml# mix.yaml
seed: 42
total: 100000
output: ./train/mixed.jsonl
sources:
- { path: ./train/code.messages.jsonl, weight: 0.4 }
- { path: ./train/math.messages.jsonl, weight: 0.3 }
- { path: ./train/general.messages.jsonl, weight: 0.3 }Weights are normalized automatically and need not sum to 1.0. When a source
has fewer records than its allocation it is clipped; pass --oversample to
sample with replacement instead. A sidecar .mix.json is written alongside
the output recording the exact seed, weights, and per-source counts for full
reproducibility. Omit --total to merge all records from every source.
convmerge dedupe -i ./train/mixed.jsonl -o ./train/mixed.dedup.jsonl
convmerge turns -i ./train/mixed.dedup.jsonl \
--single-out ./train/single.jsonl \
--multi-out ./train/multi.jsonlSee docs/format.md for adapter / emitter schemas and docs/fetch.md for manifest details.
To keep the package lean and dependency-free at its core, convmerge does
not include — and has no plans to include — the following:
- Model loading / inference / training. No PyTorch, Transformers, vLLM, or similar runtime is imported by the core or any shipped extra.
- Automatic labeling or classification of samples (e.g. topic tagging, quality scoring, safety classification). These are left to upstream tools or private pipelines.
- RLHF / DPO / preference-dataset construction beyond passing through
existing pairwise rows via the
chatadapter'spairwise_mode. - Training-job orchestration (SkyPilot, RunPod, Modal, K8s operators).
- Prompt templating / chat-template rendering for specific model
families. Output JSONL uses the standard
messages/alpacashapes; downstream trainers apply their own template. - Tokenizer-aware length filtering, packing, or curriculum scheduling. Those live in the training stack, not here.
- Scraping HTML pages or running browser automation. Structured JSON / JSONL / Parquet inputs only.
If any of these are important to your workflow, wire convmerge in as one
step of a larger pipeline rather than expecting it to grow into those areas.
See CONTRIBUTING.md for the full guide — setup, local checks, code conventions, and a walkthrough for adding a new adapter / emitter. CI runs Ruff + pytest on Python 3.10 – 3.12.
pip install -e ".[dev,all]"
ruff check src tests
ruff format --check src tests
pytest -qParticipation in this project is governed by the Contributor Covenant Code of Conduct.
Good first PRs: new adapters / emitters for public dataset schemas, new
fetch backends (GitLab / Zenodo / Kaggle), recipe examples under
examples/, and docs improvements. Browse the
good first issue
label for concrete starting points.
Releases run from .github/workflows/publish.yml
on pushing a v* tag. Publishing authenticates via the PYPI_API_TOKEN
GitHub Actions secret.
- Create an API token on pypi.org.
- If the project already exists on PyPI, scope the token to the
convmergeproject (principle of least privilege). - For the very first upload (project not yet registered), PyPI does not allow project-scoped tokens — use Entire account scope for the first release, then rotate to a project-scoped token afterwards and revoke the original.
- If the project already exists on PyPI, scope the token to the
- In the GitHub repo, Settings → Secrets and variables → Actions → New
repository secret, add
PYPI_API_TOKENwith the token value. - Tag and push:
git tag vX.Y.Z && git push origin vX.Y.Z.
MIT