convmerge

Convert Alpaca, ShareGPT, and mixed chat datasets into a unified messages JSONL for LLM supervised fine-tuning.
Fetch from HuggingFace or GitHub, normalize messy Parquet / JSON / JSONL, convert between Alpaca / ShareGPT / chat schemas, weighted-mix multiple domain sources, and deduplicate — all in one CLI pipeline.

convmerge is a data-preparation CLI and library for LLM supervised fine-tuning (SFT). It takes heterogeneous instruction-tuning datasets — Alpaca, ShareGPT, raw chat JSONL, Parquet dumps — and produces a single clean JSONL file in the standard messages format (or back to alpaca shape) that any fine-tuning framework can consume directly.

It is intentionally scoped to the pre-training-loop step: no model loading, no inference, no labeling, no training orchestration. See Out of scope below.

Repository: github.com/snowmuffin/convmerge
Status: pre-1.0; APIs and CLI may change between minor versions until 1.0.

Install

pip install convmerge                    # core: convert, dedupe, turns; normalize for .json/.jsonl
pip install "convmerge[all]"             # full CLI: fetch (HF+GitHub), parquet, YAML presets

Granular extras:

pip install "convmerge[fetch]"           # YAML manifests + GitHub (PyYAML)
pip install "convmerge[fetch-all]"       # fetch + HuggingFace (``datasets``)
pip install "convmerge[fetch-hf]"        # same dependencies as ``fetch-all`` (backward-compatible name)
pip install "convmerge[parquet]"         # Parquet input for ``normalize``
pip install "convmerge[preset]"          # YAML convert presets (`--preset`, `preset validate`)

Command / feature	Extra
`convert`, `dedupe`, `turns`	(core)
`normalize` on `.parquet`	`[parquet]`
`fetch` with YAML manifest or GitHub	`[fetch]`
`fetch` with HuggingFace manifest entries	`[fetch-all]` or `[fetch-hf]`
`convert --preset`, `preset`	`[preset]`
Everything above	`[all]`

Or from a clone:

git clone https://github.com/snowmuffin/convmerge.git
cd convmerge
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,all]"

The four use cases

1. `fetch` — pull raw data from HF + GitHub via a YAML manifest

HuggingFace entries delegate to datasets.load_dataset(...).to_json(...), i.e. the output is a JSONL dump of the selected split. GitHub entries support a single raw URL, recursive Trees API fetch with an extension filter, or git clone (with optional git lfs pull). fetch is a reproducible downloader, not a mirror of HuggingFace's Arrow cache.

# manifest.yaml
version: 1
defaults: { output_root: ./raw, resume: true }
auth:     { hf_token_env: HF_TOKEN, github_token_env: GITHUB_TOKEN }
datasets:
  - { name: alpaca-ko, hf: MarkrAI/KoCommercial-Dataset, split: train }
  - { name: orca-raw,
      url: https://raw.githubusercontent.com/org/repo/main/data/train.jsonl }
  - { name: repo-tree,
      url: https://github.com/org/example-repo, ext: [".jsonl"] }
  - { name: big-lfs,
      url: https://github.com/org/big-lfs-repo, mode: clone, lfs: true }

convmerge fetch manifest.yaml -o ./raw
# or one-shot shortcuts:
convmerge fetch hf://org/dataset -o ./raw --split train
convmerge fetch https://github.com/org/repo -o ./raw --ext .jsonl

Tokens resolve in order CLI flag → file → env var, and are redacted from logs. See docs/fetch.md for the full schema.

2. `normalize` — reshape parquet / messy JSON into clean JSONL

convmerge normalize -i ./raw -o ./jsonl

Handles parquet (streamed via pyarrow), top-level JSON arrays, concatenated single-line JSON ({...}{...}{...}), and already-valid JSONL. A directory input is walked recursively and mirrored under the output directory.

3. `convert` — adapter + emitter pipeline

convmerge convert -i ./jsonl/alpaca.jsonl -o ./train/alpaca.messages.jsonl \
  --from alpaca --format messages

convmerge convert -i ./jsonl/mixed.jsonl -o ./train/mixed.messages.jsonl \
  --from auto --format messages         # auto-detecting chat adapter

# Optional: YAML preset (pip install "convmerge[preset]")
convmerge preset init -o convert_preset.yaml
convmerge preset validate convert_preset.yaml
convmerge convert -i ./jsonl/mixed.jsonl -o ./out.jsonl --preset convert_preset.yaml

Adapters: alpaca, sharegpt, chat (alias auto).
Emitters: messages, alpaca.

Presets and team-specific tuning: docs/custom_presets.md.

chat / auto is a heuristic adapter: it inspects the keys of each input record (messages, conversation(s), text, conversation_a/_b, instruction/input/output, …) and routes to the right branch with a configurable role map. For unusual schemas, pin an explicit adapter (alpaca, sharegpt) or override keys programmatically — see docs/format.md.

4. `mix` — domain-controlled weighted merge

# Inline weights
convmerge mix \
  -i ./train/code.messages.jsonl:0.4 \
     ./train/math.messages.jsonl:0.3 \
     ./train/general.messages.jsonl:0.3 \
  -o ./train/mixed.jsonl --total 100000 --seed 42

# Or via a config file (YAML requires convmerge[preset])
convmerge mix mix.yaml

# mix.yaml
seed: 42
total: 100000
output: ./train/mixed.jsonl
sources:
  - { path: ./train/code.messages.jsonl,    weight: 0.4 }
  - { path: ./train/math.messages.jsonl,    weight: 0.3 }
  - { path: ./train/general.messages.jsonl, weight: 0.3 }

Weights are normalized automatically and need not sum to 1.0. When a source has fewer records than its allocation it is clipped; pass --oversample to sample with replacement instead. A sidecar .mix.json is written alongside the output recording the exact seed, weights, and per-source counts for full reproducibility. Omit --total to merge all records from every source.

5. `dedupe` / `turns` — final cleanup + train/eval split hook

convmerge dedupe -i ./train/mixed.jsonl -o ./train/mixed.dedup.jsonl
convmerge turns  -i ./train/mixed.dedup.jsonl \
  --single-out ./train/single.jsonl \
  --multi-out  ./train/multi.jsonl

See docs/format.md for adapter / emitter schemas and docs/fetch.md for manifest details.

Out of scope

To keep the package lean and dependency-free at its core, convmerge does not include — and has no plans to include — the following:

Model loading / inference / training. No PyTorch, Transformers, vLLM, or similar runtime is imported by the core or any shipped extra.
Automatic labeling or classification of samples (e.g. topic tagging, quality scoring, safety classification). These are left to upstream tools or private pipelines.
RLHF / DPO / preference-dataset construction beyond passing through existing pairwise rows via the chat adapter's pairwise_mode.
Training-job orchestration (SkyPilot, RunPod, Modal, K8s operators).
Prompt templating / chat-template rendering for specific model families. Output JSONL uses the standard messages / alpaca shapes; downstream trainers apply their own template.
Tokenizer-aware length filtering, packing, or curriculum scheduling. Those live in the training stack, not here.
Scraping HTML pages or running browser automation. Structured JSON / JSONL / Parquet inputs only.

If any of these are important to your workflow, wire convmerge in as one step of a larger pipeline rather than expecting it to grow into those areas.

Development

See CONTRIBUTING.md for the full guide — setup, local checks, code conventions, and a walkthrough for adding a new adapter / emitter. CI runs Ruff + pytest on Python 3.10 – 3.12.

pip install -e ".[dev,all]"
ruff check src tests
ruff format --check src tests
pytest -q

Participation in this project is governed by the Contributor Covenant Code of Conduct.

Good first PRs: new adapters / emitters for public dataset schemas, new fetch backends (GitLab / Zenodo / Kaggle), recipe examples under examples/, and docs improvements. Browse the good first issue label for concrete starting points.

PyPI release (maintainers)

Releases run from .github/workflows/publish.yml on pushing a v* tag. Publishing authenticates via the PYPI_API_TOKEN GitHub Actions secret.

Create an API token on pypi.org.
- If the project already exists on PyPI, scope the token to the convmerge project (principle of least privilege).
- For the very first upload (project not yet registered), PyPI does not allow project-scoped tokens — use Entire account scope for the first release, then rotate to a project-scoped token afterwards and revoke the original.
In the GitHub repo, Settings → Secrets and variables → Actions → New repository secret, add PYPI_API_TOKEN with the token value.
Tag and push: git tag vX.Y.Z && git push origin vX.Y.Z.

Changelog

CHANGELOG.md

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.github		.github
docs		docs
examples		examples
src/convmerge		src/convmerge
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

convmerge

Install

The four use cases

1. `fetch` — pull raw data from HF + GitHub via a YAML manifest

2. `normalize` — reshape parquet / messy JSON into clean JSONL

3. `convert` — adapter + emitter pipeline

4. `mix` — domain-controlled weighted merge

5. `dedupe` / `turns` — final cleanup + train/eval split hook

Out of scope

Development

PyPI release (maintainers)

Changelog

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

convmerge

Install

The four use cases

1. fetch — pull raw data from HF + GitHub via a YAML manifest

2. normalize — reshape parquet / messy JSON into clean JSONL

3. convert — adapter + emitter pipeline

4. mix — domain-controlled weighted merge

5. dedupe / turns — final cleanup + train/eval split hook

Out of scope

Development

PyPI release (maintainers)

Changelog

License

About

Topics

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. `fetch` — pull raw data from HF + GitHub via a YAML manifest

2. `normalize` — reshape parquet / messy JSON into clean JSONL

3. `convert` — adapter + emitter pipeline

4. `mix` — domain-controlled weighted merge

5. `dedupe` / `turns` — final cleanup + train/eval split hook

Packages