[Proposal] Boundary conventions for a clean balance ↔ svy collaboration #1

@talgalili

Hi Mamadou 👋

I'm Tal Galili, one of the maintainers of balance (Meta's OSS package for reweighting non-probability samples against a target: IPW, CBPS, rake, poststratify, plus ASMD / plots / design-effect diagnostics). balance is pandas-native; svy is Polars-native and Rust-backed. The two libraries cover disjoint capability territory: balance owns propensity weighting against a target frame, while svy owns design-consistent variance, calibration (GREG), replicate weights, and SAS/SPSS/Stata I/O. Neither library plans to build the other's core competency natively.

We'd like to ship (in balance) a pair of thin interop adapters so balance users can hand their adjusted weights to svy for design-consistent CIs (Taylor / BRR / bootstrap / jackknife / SDR), and svy users can get first-class IPW/CBPS against a target via balance. The items below are the concrete conventions that would make those adapters clean to build and stable to maintain; full verified code references for each are inline in the sections.

Why we're filing this before your 1.0: svy's README says "code is under finalization", and a few of the items below are SemVer-breaking if we wait. Every request below is small (most are docs or one-line defaults); the value is that we catch them while they're still cheap for you to change.

This is one bundled issue because the items cross-reference one another; resolving them together keeps the tradeoffs visible. Happy to split into separate issues if you'd prefer; just let me know.


1. Sample class-name collision between balance.Sample and svy.Sample

Problem

Both libraries export a top-level Sample class:

from balance import Sample as BalanceSample; from svy import Sample as SvySample works, but from svy import *; from balance import * silently rebinds Sample to the later import, so svy's class is no longer reachable by that name. Users mixing both in a Jupyter notebook will hit this.
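To make the shadowing concrete without importing either real package, here is a minimal sketch using two stand-in modules. The names fake_svy and fake_balance are hypothetical placeholders, and the sketch assumes both real packages list Sample in what a star import exposes.

```python
import sys
import types


def make_fake_package(name: str) -> types.ModuleType:
    """Build a stand-in module that exports a top-level `Sample` class."""
    module = types.ModuleType(name)
    # Each stand-in gets its own, distinct Sample class.
    module.Sample = type("Sample", (), {"__module__": name})
    module.__all__ = ["Sample"]
    sys.modules[name] = module  # make it importable by name
    return module


# Hypothetical stand-ins; the real svy and balance are NOT imported here.
fake_svy = make_fake_package("fake_svy")
fake_balance = make_fake_package("fake_balance")

# Run the two star imports in a fresh namespace, as a notebook cell would.
namespace: dict = {}
exec("from fake_svy import *\nfrom fake_balance import *", namespace)

# The second star import rebinds `Sample`; the first binding is gone.
surviving = namespace["Sample"]
print(surviving.__module__)  # fake_balance
```

No warning or error is raised at any point, which is why an explicit alias is the safer path.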

Proposed change

Pick one of:

Option A (preferred): add an alias in svy/__init__.py:

# after line 42 (`Sample,`)
from svy.core import Sample as SvySample  # noqa: E402

and extend __all__ (line 129+) with "SvySample". Zero runtime cost, gives downstream users an unambiguous import path.

Option B: add a short note to the svy README and to the balance README with a recommended import style:

from balance import Sample as BalanceSample
from svy import Sample  # or: from svy import SvySample

I'd take Option A because it gives a forever-stable import alias. Balance maintainers are happy to mirror with a README note.


2. Sample.clone(data=...) is a footgun for the "swap one column" use case

Problem

svy/core/sample.py#L1275-L1323, clone(*, data=..., design=..., rep_wgts=..., catalog=...) -> Sample:

s = Sample(new_data, new_design, catalog=src_catalog)

i.e. clone(data=...) constructs a new Sample, which runs the full __init__.

Consequence: for a caller trying to do "I have a new weight column in new_data, swap the active weight to it", this fails in several ways:

  1. If new_data is missing a design-referenced column (stratum/psu/wgt), clone raises before .update_design(wgt=new_col) ever runs.
  2. If the new data has a different row count (e.g. balance's IPW with na_action="drop" dropped rows), existing rep_wgts column values are silently misaligned, because the Rust engine assumes row alignment.
  3. SVY_ROW_INDEX identity is lost, which breaks downstream joins.

The intended user-facing path today is Sample.use_weight(wgt) (sample.py#L1227), which only works if the new column is already in _data. So there's no clean public way to "add a new weight column and activate it" without touching _data directly.

The balance-side adapter currently works around this by shallow-copying and mutating private state (see PLAN.md §3/B2), which is fragile across svy versions.
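To illustrate failure mode 2, here is a rough sketch of the re-alignment an adapter has to do when weights come back for only a subset of rows. The function align_weights_to_rows and its parameters are hypothetical (not part of balance or svy); it assumes each row carries a stable identifier comparable to SVY_ROW_INDEX.

```python
from typing import Hashable, List, Optional, Sequence


def align_weights_to_rows(
    all_row_ids: Sequence[Hashable],
    kept_row_ids: Sequence[Hashable],
    weights: Sequence[float],
    fill: Optional[float] = None,
) -> List[Optional[float]]:
    """Re-expand weights computed on a row subset to the full row order.

    Rows that were dropped upstream (e.g. by na_action="drop") get `fill`.
    """
    if len(kept_row_ids) != len(weights):
        raise ValueError(
            f"got {len(weights)} weights for {len(kept_row_ids)} kept rows"
        )
    by_id = dict(zip(kept_row_ids, weights))
    if len(by_id) != len(kept_row_ids):
        raise ValueError("kept_row_ids contains duplicates")
    unknown = set(by_id) - set(all_row_ids)
    if unknown:
        raise ValueError(f"row ids not present in the sample: {sorted(unknown)}")
    # Walk the full row order so the output lines up positionally.
    return [by_id.get(row_id, fill) for row_id in all_row_ids]


# Row 1 was dropped upstream, so only three weights came back.
aligned = align_weights_to_rows(
    all_row_ids=[0, 1, 2, 3],
    kept_row_ids=[0, 2, 3],
    weights=[1.5, 0.5, 2.0],
)
print(aligned)  # [1.5, None, 0.5, 2.0]
```

Joining by identifier rather than by position is what keeps the new weights lined up with the existing replicate-weight rows.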

Proposed change

2a. Add a public helper at svy/core/sample.py:

def with_new_weight(self, name: str, values: np.ndarray | pl.Series | Sequence[float]) -> "Sample":
    """Return a new Sample that has `name` added as a weight column and set as
    the active weight on the Design. Row alignment is required: len(values)
    must equal self._data.height.

    Unlike .clone(data=...), this does not re-run __init__; it preserves
    SVY_ROW_INDEX, internal concat columns, singleton state, and MetadataStore.
    """
    if len(values) != self._data.height:
        raise ValueError(
            f"with_new_weight: got {len(values)} values, expected "
            f"{self._data.height} (Sample row count)."
        )
    if name in self._data.columns:
        raise ValueError(f"with_new_weight: column {name!r} already exists.")
    new_data = self._data.with_columns(
        pl.Series(name=name, values=values, dtype=pl.Float64)
    )
    new_sample = copy.copy(self)  # shallow: shares _metadata, _internal_design
    new_sample._data = new_data
    new_sample._design = new_sample._design.update(wgt=name)
    return new_sample

Test to add under packages/svy/tests/ (or wherever Sample tests live):

def test_with_new_weight_preserves_internal_state():
    df = pl.DataFrame({"x": [1.0, 2.0, 3.0], "stratum": ["a","a","b"],
                       "w": [1.0, 1.0, 1.0]})
    s = svy.Sample(df, svy.Design(stratum="stratum", wgt="w"))
    original_row_idx = s._data[SVY_ROW_INDEX].to_list()
    s2 = s.with_new_weight("new_w", [2.0, 2.0, 2.0])

    assert s2._design.wgt == "new_w"
    assert s2._data[SVY_ROW_INDEX].to_list() == original_row_idx  # identity preserved
    assert "new_w" in s2._data.columns
    # original untouched
    assert s._design.wgt == "w"
    assert "new_w" not in s._data.columns
    # internal concat cols preserved (present if and only if stratum is str)
    # ...


def test_with_new_weight_rejects_row_count_mismatch():
    df = pl.DataFrame({"x": [1.0, 2.0, 3.0], "w": [1.0, 1.0, 1.0]})
    s = svy.Sample(df, svy.Design(wgt="w"))
    with pytest.raises(ValueError, match="2 values, expected 3"):
        s.with_new_weight("new_w", [2.0, 2.0])

2b. Add a docstring warning to Sample.clone (line ~1275) explicitly noting "this re-runs __init__; for simple weight-column swaps use use_weight (if the column exists) or with_new_weight (if you're adding one)."

Why this is 1.0-critical

Without with_new_weight, every downstream adapter (ours and everyone else's) reaches into sample._data / sample._design. That's private-API coupling right at the 1.0 boundary. A small helper here saves the whole ecosystem from it.


3. Stability statement for the public API surface we depend on

Problem

svy's __init__.py#L129-L244 already mass-exports a stable-looking surface, but there's no explicit 1.0 stability commitment. balance would pin svy>=1.0,<2.0 in a balance[survey] extra; that pin is only meaningful if certain symbols don't churn inside 1.x.
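For concreteness, the pin would translate into a runtime guard along these lines on the balance side. This is a minimal sketch: parse_release and is_supported_svy are hypothetical names, and the parsing deliberately ignores pre-release suffixes (a production version would more likely lean on the packaging library's specifier handling).

```python
import re
from typing import Tuple


def parse_release(version: str) -> Tuple[int, int, int]:
    """Parse the leading 'X[.Y[.Z]]' release numbers of a version string.

    Pre-release or local suffixes (e.g. 'rc1') are ignored in this sketch.
    """
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(part) if part else 0 for part in match.groups())
    return major, minor, patch


def is_supported_svy(version: str) -> bool:
    """True when `version` falls inside the proposed svy>=1.0,<2.0 pin."""
    return (1, 0, 0) <= parse_release(version) < (2, 0, 0)


for candidate in ["0.17.1", "1.0.0", "1.4.2", "2.0.0"]:
    print(candidate, is_supported_svy(candidate))
```

The guard only has value if the symbols behind it stay put within 1.x, which is what the stability statement below is meant to secure.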

Proposed change

In svy's CHANGELOG.md for 1.0, tag the following as stable for 1.x (SemVer-breaking changes go to 2.0):

Core classes / constructors (svy/core/design.py, svy/core/sample.py):

  • svy.Design(row_index=, stratum=, wgt=, prob=, hit=, mos=, psu=, ssu=, pop_size=, wr=, rep_wgts=): all kwargs stable.
  • svy.PopSize(psu, ssu).
  • svy.RepWeights(method, prefix, n_reps, fay_coef=, df=, padding=).
  • svy.Sample(data, design, *, catalog=, questionnaire=).

Sample lifecycle helpers (svy/core/sample.py):

  • Sample.use_weight(wgt) (L1227).
  • Sample.update_design(**kwargs) (L1218).
  • Sample.clone(*, data=, design=, rep_wgts=, catalog=) (L1275).
  • Sample.with_new_weight(name, values), once added per §2.
  • The .data, .design, .rep_wgts, .weighting, .estimation properties.

Weighting facet (svy/weighting/base.py):

  • sample.weighting.calibrate(*, controls, by, scale, bounded, wgt_name, update_design_wgts, ignore_reps, strict, trimming) (L303).
  • sample.weighting.{rake, poststratify, adjust, normalize, trim, create_brr_wgts, create_jk_wgts, create_bs_wgts, create_sdr_wgts, calibrate_matrix}.

Estimation facet (svy/estimation/base.py):

  • sample.estimation.{mean, total, prop, ratio, median}(y, *, by, where, method, deff, fay_coef, as_factor, alpha, drop_nulls) and ci_method on prop.

If there are specific kwargs you already know might change, calling them out explicitly (e.g. "variance_center is experimental") is ideal.

Why this is cheap

It's a documentation statement; it doesn't constrain future work, it just commits to following SemVer on a scoped surface. This is what lets balance ship balance[survey] without it becoming a compatibility nightmare.


4. Shared column-name convention (opt-in, mostly docs)

Problem

Neither library enforces column names today. Consequences for adapters:

  • Without convention: every interop call needs an explicit mapping, e.g. to_svy_sample(s, design_columns={"stratum": "strat_col", "psu": "cluster_col", "fpc": "N_hh", "wgt": "wt"}).
  • With convention: to_svy_sample(s) just works.

Proposed change

4a. In svy docs (packages/svy/README.md + docs/), add a "Recommended column names" section:

weight        - the active sampling / design weight
stratum       - the stratum column
psu, ssu      - primary / secondary sampling unit IDs
fpc           - finite population correction
repweight_N   - replicate weights (repweight_1, repweight_2, ...)

Update svy example datasets to use these names where practical, or note in each example "the column hhweight corresponds to the recommended weight".
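On the adapter side, the convention could be consumed roughly as follows. This is a sketch under assumptions: resolve_design_columns and RECOMMENDED_NAMES are hypothetical names, and the role keys mirror the design_columns mapping shown in the Problem section.

```python
from typing import Dict, Iterable, Mapping, Optional

# Design role -> recommended column name, following the list above.
RECOMMENDED_NAMES: Dict[str, str] = {
    "wgt": "weight",
    "stratum": "stratum",
    "psu": "psu",
    "ssu": "ssu",
    "fpc": "fpc",
}


def resolve_design_columns(
    available: Iterable[str],
    overrides: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Map design roles to columns: explicit overrides win, then the convention."""
    columns = set(available)
    overrides = dict(overrides or {})
    unknown_roles = set(overrides) - set(RECOMMENDED_NAMES)
    if unknown_roles:
        raise ValueError(f"unknown design roles: {sorted(unknown_roles)}")

    resolved: Dict[str, str] = {}
    for role, default_name in RECOMMENDED_NAMES.items():
        name = overrides.get(role, default_name)
        if name in columns:
            resolved[role] = name
        elif role in overrides:
            # An explicit override must point at a real column.
            raise KeyError(f"override for {role!r} names missing column {name!r}")
        # A conventional name that is simply absent is skipped silently.
    return resolved


print(resolve_design_columns(["weight", "stratum", "psu", "income"]))
# {'wgt': 'weight', 'stratum': 'stratum', 'psu': 'psu'}
```

Datasets that follow the convention need no mapping at all, while datasets with legacy names pass only the roles that differ.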

4b. Default RepWeights.prefix to "repweight_" (currently a required kwarg).

File: svy/core/design.py#L93 β€” change signature:

class RepWeights(msgspec.Struct, frozen=True):
    method: _EstimationMethod | str
    prefix: str = "repweight_"   # was: no default, required
    n_reps: int = 0              # needs a default too: msgspec rejects a required field after a defaulted one unless kw_only=True

and the corresponding make_rep_weights(...) factory at design.py#L311.

Existing validation (non-empty, whitespace-stripped) already handles the string. The only thing to check in __post_init__ is any rule that relies on prefix being required; giving it a default doesn't break any call that passes it explicitly.

Test to add under packages/svy/tests/:

def test_rep_weights_default_prefix_is_repweight():
    rw = svy.RepWeights(method="bootstrap", n_reps=500)
    assert rw.prefix == "repweight_"
    # and columns generate as repweight_1, repweight_2, ...
    assert rw._generate_columns(padding=0)[:2] == ["repweight_1", "repweight_2"]

Why this is 1.0-aligned

The default-prefix change is cheap now, SemVer-breaking after 1.0. The naming convention is pure documentation but works best if endorsed publicly in v1 docs.


5. MetadataStore sidecar serialization spec

Problem

svy's MetadataStore (svy/metadata/variable_meta.py) holds variable labels, value labels, measurement types, missing-value kinds, and na_as_level flags: all real statistical metadata that svy-io materializes from SAS/SPSS/Stata files.

pandas has no equivalent container. When a svy.Sample is round-tripped through pandas for interop (e.g. sample.data.to_pandas() into balance), all of this metadata silently vanishes. This is a statistical-correctness issue: missing-value kinds (e.g. "don't know" vs "refused") disappear, ordered categorical ordering disappears, variable labels disappear.

Proposed change

A small, stable serialization spec: MetadataStore.to_dict() -> dict and MetadataStore.from_dict(d) -> MetadataStore, producing a JSON-roundtrippable shape. That lets adapters attach the serialized form as a sidecar (e.g. pandas.DataFrame.attrs["svy_metadata"] = store.to_dict()) and rebuild on the way back.

I'm not asking for balance integration here, just a stable, documented shape so any adapter has something to lean on. One stable contract at 1.0 avoids everyone reinventing their own encoding.

Rough sketch (you know the internals better):

class MetadataStore:
    def to_dict(self) -> dict:
        return {
            "version": 1,
            "variables": {
                name: meta.to_dict()  # VariableMeta already has fields
                for name, meta in self._store.items()
            },
            "catalog": self._catalog.to_dict() if self._catalog else None,
        }

    @classmethod
    def from_dict(cls, d: dict, *, catalog: LabellingCatalog | None = None) -> "MetadataStore":
        # Validate version, rebuild VariableMeta entries, return store.
        ...

Tests under packages/svy/tests/:

def test_metadata_roundtrip_json():
    s = svy.Sample(pl.DataFrame({"q1": [1, 2, 3]}))
    s.set_var_label("q1", "How satisfied are you?")
    s.set_value_labels("q1", {1: "low", 2: "med", 3: "high"})
    s.set_missing("q1", dont_know=[-99], refused=[-98])

    d = s.meta.to_dict()
    serialized = json.dumps(d)   # must be JSON-round-trippable
    restored = MetadataStore.from_dict(json.loads(serialized))

    resolved = restored.resolve_labels("q1")
    assert resolved.var_label == "How satisfied are you?"
    assert resolved.value_labels[2] == "med"

If this is more than you want to take on pre-1.0, a documented promise that the shape will be stabilized by 1.1 also works for our purposes.
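One detail the round-trip test above depends on: JSON object keys are always strings, so a value-label mapping keyed by integer codes does not survive a naive json.dumps / json.loads cycle. A minimal sketch of one possible encoding follows; encode_value_labels and decode_value_labels are hypothetical helpers, not existing svy functions.

```python
import json
from typing import Any, Dict, List


def encode_value_labels(labels: Dict[Any, str]) -> List[List[Any]]:
    """Encode {value: label} as [[value, label], ...] so key types survive JSON."""
    return [[value, label] for value, label in labels.items()]


def decode_value_labels(pairs: List[List[Any]]) -> Dict[Any, str]:
    """Rebuild the {value: label} mapping from its pair encoding."""
    return {value: label for value, label in pairs}


labels = {1: "low", 2: "med", 3: "high"}

# Naive round trip: the integer codes come back as strings.
naive = json.loads(json.dumps(labels))
print(naive)  # {'1': 'low', '2': 'med', '3': 'high'}

# Pair encoding keeps the integer codes intact.
payload = json.dumps(encode_value_labels(labels))
restored = decode_value_labels(json.loads(payload))
print(restored[2])  # med
```

Whatever shape you settle on, the spec should state how non-string codes are represented, otherwise lookups like value_labels[2] will fail after a JSON hop.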


6. Row-alignment guard for rep_wgts

Problem (soft ask)

If a user mutates Sample._data in a way that changes row count but the replicate weight columns are still present with the right count, _validate_design (sample.py#L648-L678) won't catch the misalignment: it only checks that the expected column names exist, not that the non-rep data rows line up with rep weight rows.

This isn't our bug to hit (the with_new_weight helper in §2 makes it impossible on our side), but it's a sharp edge for anyone doing in-place with_columns on _data.

Proposed change

Optional: add a length check to _validate_design. The rep weight columns are always the same length as _data.height by construction, so this costs no extra work. A hashed-row-identity check against the original construction could also be added, but that's overkill. A length check plus a clearly documented invariant ("the user must not modify _data in ways that change row count after construction") is probably enough.
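For reference, here is roughly what the two levels of guard look like side by side. This is only a sketch: row_fingerprint and check_row_alignment are hypothetical names, and it assumes the row identifiers recorded at construction are available as a plain sequence.

```python
import hashlib
from typing import Hashable, Sequence


def row_fingerprint(row_ids: Sequence[Hashable]) -> str:
    """Order-sensitive digest of the row identifiers."""
    digest = hashlib.sha256()
    for row_id in row_ids:
        digest.update(repr(row_id).encode("utf-8"))
        digest.update(b"\x00")  # separator so [1, 23] and [12, 3] differ
    return digest.hexdigest()


def check_row_alignment(
    expected_height: int,
    expected_fingerprint: str,
    row_ids: Sequence[Hashable],
) -> None:
    """Raise ValueError if rows were added, dropped, or reordered."""
    # Cheap guard: the length check.
    if len(row_ids) != expected_height:
        raise ValueError(
            f"row count changed: expected {expected_height}, got {len(row_ids)}"
        )
    # Heavier guard: the row-identity check (likely overkill in practice).
    if row_fingerprint(row_ids) != expected_fingerprint:
        raise ValueError("row identity changed since construction")


# Recorded once at construction time.
original_ids = [0, 1, 2, 3]
height, fingerprint = len(original_ids), row_fingerprint(original_ids)

check_row_alignment(height, fingerprint, [0, 1, 2, 3])  # passes silently
```

The length check alone catches dropped rows; only the fingerprint would catch a reorder that keeps the count unchanged.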

This is the lowest-priority item in the list; happy to drop it if it's noise.


7. Summary: the ask

| # | Item | Location | 1.0-critical? | Effort |
|---|------|----------|---------------|--------|
| 1 | SvySample alias + README note | svy/__init__.py + README | yes | trivial |
| 2 | Sample.with_new_weight(name, values) + clone docstring warning | svy/core/sample.py | yes | small method |
| 3 | Stability statement in CHANGELOG | CHANGELOG.md | yes | docs only |
| 4 | RepWeights.prefix = "repweight_" default + doc convention | svy/core/design.py + docs | yes | one-line default |
| 5 | MetadataStore.to_dict / from_dict spec | svy/metadata/variable_meta.py | ideal (1.1 ok) | small |
| 6 | Row-alignment guard | svy/core/sample.py | no | small |

What balance will do in parallel

  • Ship balance.diagnostics.standalone((df, weights, targets)), which decouples ASMD / love-plot from the BalanceFrame lifecycle so diagnostics survive the interop round-trip. (Pure balance work; no svy dependency.)
  • Reserve balance/interop/svy.py on a feature branch (~80 lines), gated on svy hitting PyPI.
  • Add a "Design-based inference" section to the balance README pointing users at svy.
  • Add a CI matrix entry svy: ["latest-pypi"] in balance that becomes live automatically the day svy ships.
  • Propose svy.interop.nonprob.ipw_to_target (or a separate svy-nonprob package, depending on your preference) for the reverse direction, at svy 1.0.

Context

I've verified the claims here against packages/svy/src/svy/ at v0.17.1 (pyproject.toml: version = "0.17.1"). If any symbol has moved on main, I'd be grateful for a correction.

Thanks for the work you're doing on svy, especially the R-survey parity target, the svy-io Rust ReadStat implementation, and the replicate-weight padding auto-detection. These are all things balance users will benefit from downstream. Looking forward to coordinating on the conventions before 1.0.
