HARP — Hierarchical Active Region Pruning

Efficient, train-based data selection for finetuning large language models.

HARP organizes the training pool into a node–leaf hierarchy, finetunes only a few representative leaves, infers the unmeasured leaf utilities with empirical-Bayes posteriors, and then greedily selects data under two complementary first-order envelopes — HARP-C and HARP-E.

Paper: HARP: Efficient Data Selection for Finetuning Large Language Models (arXiv:2606.07690).

Method

HARP estimates a per-task main effect phi[g, d] for every leaf g and evaluation domain d (how much finetuning on leaf g changes utility on domain d), using a leaf-restricted, first-order utility approximation. Selection then optimizes one of two envelopes, where phi+ = max(phi, 0), phi- = max(-phi, 0), and b_d is the null (base-model) score for domain d:

Variant	Predicted domain utility `ŷ_d(S)`	Intuition
HARP-C (conservative)	`clip[0,1]( b_d + max_{g∈S} phi+_g(d) − Σ_{g∈S} phi-_g(d) )`	credits only the strongest positive leaf per domain while accumulating all harms → controls redundancy
HARP-E (expansive)	`clip[0,1]( b_d + Σ_{g∈S} phi_g(d) )`	sums positive contributions (and subtracts harms) → recovers multiple complementary regions

Both envelopes are optimized greedily and cut with the best-prefix rule: rank leaves by greedy marginal gain, then keep the prefix that maximizes the cumulative gain (argmax of the running sum). Both point selections are then capped at the budget TOP_K_POINTS.

Pipeline

Stage 1a — Hierarchy. Build a large → parent → leaf ANN hierarchy over the feature matrix X_feat (build_parent_leaf_hierarchy).
Stage 1b — Main effects. Finetune a LoRA adapter on representative leaves and evaluate on the proxy set; infer unmeasured leaves with empirical-Bayes shrinkage (bootstrap_parent_representative_independent_per_task).
Selection. Filter active tasks, then run the conservative (HARP-C) and expansive (HARP-E) greedy selections (greedy_taskaware) and apply the best-prefix cut.

This repository implements the first-order method from the paper. Main effects are aggregated additively; there are no pairwise / second-order interaction terms.

Install

pip install -r requirements.txt
# optional, recommended:
#   pip install faiss-gpu     # fast ANN leaf grouping (falls back to scikit-learn KMeans)
#   pip install vllm          # fast batched inner-eval during Stage 1b

Python ≥ 3.10.

Usage

from harp import harp

result = harp(
    X_feat=X_feat,                      # (n, d) features, one row per train example
    base_model_name="meta-llama/Llama-3.1-8B",
    train_examples=train_examples,      # tokenised train pool (list of dicts)
    eval_examples=eval_examples,        # tokenised proxy eval set
    tokenizer=tokenizer,
    eval_indices=eval_indices,          # or eval_indices_path="eval_indices_accuracy.npy"
    representatives_path="representatives.jsonl",
    eval_metric="accuracy",
    TOP_K_POINTS=10_000,                # point budget for the HARP-E selection
    lora_config={"r": 16, "lora_alpha": 32},
)

harp_c_idx = result.harp_c   # np.ndarray[int] — selected train-example indices (HARP-C)
harp_e_idx = result.harp_e   # np.ndarray[int] — selected train-example indices (HARP-E)
info       = result.info     # diagnostics: phi, task filtering, leaf orders, cut indices, ...

Inputs

Argument	Description
`X_feat`	`(n, d)` L2-normalisable feature/embedding matrix, one row per train example.
`train_examples` / `eval_examples`	Tokenised examples (`input_ids`, `attention_mask`, `labels`, …).
`tokenizer`	HF tokenizer for `base_model_name`.
`eval_indices` / `eval_indices_path`	Proxy eval-set indices, or a path to a `.npy` of them (default `eval_indices_<metric>.npy`).
`representatives_path`	JSONL of per-subject base-model scores (`acc01`); supplies the null scores `b_d` and the task grouping.
`v0`	Base-model utility; if `None`, computed from `representatives_path`.
`TOP_K_POINTS`	Point budget applied to the HARP-E selection.
`checkpoint_dir`	If set, enables Stage-1b resume-on-preempt.

Output — `HarpResult`

Field	Description
`harp_c`	`np.ndarray[int]` of selected train-example indices under HARP-C (best-prefix cut, capped at `TOP_K_POINTS`).
`harp_e`	`np.ndarray[int]` of selected train-example indices under HARP-E (best-prefix cut, capped at `TOP_K_POINTS`).
`info`	Diagnostics: per-task main effects `phi`, active-task filtering, full conservative/expansive leaf orders, cut indices, and the bootstrap summary. The per-envelope selections are also under `info["harp_c"]` / `info["harp_e"]`.

Layout

harp/
  selector.py                 # the algorithm: hierarchy, bootstrap, harp(), greedy_taskaware()
  knn_grouping.py             # ANN / KMeans leaf grouping
  data_preprocess.py          # task-aware response scoring
  vllm_inner_eval.py          # optional vLLM inner-eval server
  vllm_batched_inner_eval.py  # optional vLLM batched inner-eval

Tests

A tiny CPU-only smoke test (no GPU, no model download, no data files) covers the import + public API, the harp() entry point, the HARP-C / HARP-E selection semantics, and the Stage-1a hierarchy:

python tests/test_smoke.py        # or: pytest tests/test_smoke.py

Citation

@article{harp2026,
  title  = {HARP: Efficient Data Selection for Finetuning Large Language Models},
  year   = {2026},
  eprint = {2606.07690},
  archivePrefix = {arXiv}
}

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
harp		harp
tests		tests
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HARP — Hierarchical Active Region Pruning

Method

Pipeline

Install

Usage

Inputs

Output — `HarpResult`

Layout

Tests

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

HARP — Hierarchical Active Region Pruning

Method

Pipeline

Install

Usage

Inputs

Output — HarpResult

Layout

Tests

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Output — `HarpResult`

Packages