This repository was archived by the owner on Mar 10, 2026. It is now read-only.

Improved handling for zero-vectors representing proportional responses in TaylorEstimator #70

@iamchrisearle

Description

In some survey edge cases I end up with an all-zero response vector for proportion data. As a toy example, take the survey question "Do you have a PhD?" where every respondent in the sample answers 0.

import pandas as pd
from samplics.estimation import TaylorEstimator
from samplics.utils.types import PopParam

# Setup data
test_df = pd.DataFrame(
    {
        "stratum": [
            "province_a",
            "province_a",
            "province_a",
            "province_b",
            "province_b",
            "province_b",
        ],
        "var": [0, 0, 0, 0, 0, 0],  # NOTE: All the same value of 0
        "wcol": [0.2, 0.3, 0.7, 0.2, 0.1, 0.7],
        "domain": ["dom_1", "dom_1", "dom_2", "dom_3", "dom_3", "dom_3"],
        "psu": [1, 2, 3, 4, 5, 6],
    }
)

# inspect
test_df

# NOTE: set up with PopParam.prop
te = TaylorEstimator(param=PopParam.prop, alpha=0.95)
te.estimate(
    y=test_df["var"],
    samp_weight=test_df["wcol"],
    stratum=test_df["stratum"],
    domain=test_df["domain"],
    psu=test_df["psu"]
)

te.to_dataframe()
          _param  _domain  _level  _estimate  _stderror  _lci  _uci  _cv
0  PopParam.prop    dom_1       0          1          0     1     1    0
1  PopParam.prop    dom_2       0          1          0     1     1    0
2  PopParam.prop    dom_3       0          1          0     1     1    0

So the point estimate for a zero vector is 1, because the PopParam.prop branch uses pd.get_dummies to build a boolean vector for each category present in the input. With 0 as the only observed category, the single dummy column is all True:

# Breakpoint at `y_dummies = pd.get_dummies(y)` in expansion.py to recreate
>>> y_dummies
      0
0  True
1  True
2  True
3  True
4  True
5  True
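
The effect on the point estimate can be reproduced outside samplics with a minimal sketch. It assumes the proportion for each level is simply the weighted mean of that level's dummy column, which is a simplification and not a copy of the library's code; weighted_props is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def weighted_props(y, weights):
    """Weighted proportion per observed level (hypothetical helper, not samplics API)."""
    dummies = pd.get_dummies(pd.Series(y))  # one column per level observed in y
    w = np.asarray(weights, dtype=float)
    # Assumption: the estimate is the weighted mean of each dummy column
    props = np.average(dummies.to_numpy(dtype=float), axis=0, weights=w)
    return dict(zip(dummies.columns.tolist(), props.tolist()))


props = weighted_props([0, 0, 0, 0, 0, 0], [0.2, 0.3, 0.7, 0.2, 0.1, 0.7])
print(props)  # only level 0 is present; there is no entry for level 1
```

Because level 1 never appears in y, no column is created for it, so the only proportion that can be reported is the one for level 0.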

Using PopParam.mean with as_factor=True still goes through this get_dummies block, resulting in the same boolean vector.

If I switch to PopParam.mean to avoid the dummy encoding, the zero guards present in the PopParam.prop branch are missing:

# This catches the edge case in the `PopParam.prop` branch nicely
# however, at this point the incorrect point estimate has already been made
        if point_est1[level] == 0:
            lower_ci[level] = 0
            upper_ci[level] = 0
            coef_var[level] = 0

...

# But in `PopParam.mean` (when domain is not None)
# this fails with a zero-division error, because the (correct) self.point_est[key] is 0
        self.coef_var[key] = (
            math.sqrt(self.variance[key]) / self.point_est[key]
        )

Thoughts on adding a guard like if self.point_est[key] == 0 to the PopParam.mean coef_var calculation? Something like this works for my use case:

                    if self.point_est[key] == 0:
                        self.coef_var[key] = 0.0
                    else:
                        self.coef_var[key] = (
                            math.sqrt(self.variance[key]) / self.point_est[key]
                        )

But this requires using PopParam.mean to get the correct point estimate.
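
For illustration, the guard proposed above could be factored into a small standalone function. This is only a sketch: safe_coef_var is a hypothetical name, not part of samplics, and returning 0.0 mirrors what the PopParam.prop branch does rather than a settled convention (the coefficient of variation is undefined when the estimate is 0, so NaN would be another defensible choice).

```python
import math


def safe_coef_var(variance, point_est):
    """Coefficient of variation with a zero guard (hypothetical helper, not samplics API)."""
    if point_est == 0:
        # Assumption: report 0.0 to match the PopParam.prop branch; NaN is an alternative
        return 0.0
    return math.sqrt(variance) / point_est


cv_zero = safe_coef_var(0.0, 0.0)      # guarded: no division is attempted
cv_regular = safe_coef_var(0.04, 0.5)  # regular path: sqrt(variance) / estimate
```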

Or any ideas about how to handle get_dummies producing a non-zero point estimate for all-zero inputs? Since pandas is already used for the dummies, I was thinking y could be allowed to be passed as a pd.Series that retains its categories:

import numpy as np
import pandas as pd

y = np.array([0,0,0,0])
y_series = pd.Series(y, dtype="category")
y_series = y_series.cat.set_categories([0, 1])

print(pd.get_dummies(y_series).to_markdown())
   0  1
0  1  0
1  1  0
2  1  0
3  1  0
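
Building on the categorical idea above, here is a sketch of how fixed categories would carry through to the estimate. It assumes binary levels [0, 1] and a plain weighted mean per dummy column; weighted_props_fixed and its levels parameter are hypothetical and not part of samplics.

```python
import numpy as np
import pandas as pd


def weighted_props_fixed(y, weights, levels=(0, 1)):
    """Weighted proportion per declared level (hypothetical helper, not samplics API)."""
    # Declaring the categories up front keeps a column for levels absent from y
    y_cat = pd.Series(pd.Categorical(y, categories=list(levels)))
    dummies = pd.get_dummies(y_cat)
    w = np.asarray(weights, dtype=float)
    # Assumption: the estimate is the weighted mean of each dummy column
    props = np.average(dummies.to_numpy(dtype=float), axis=0, weights=w)
    return dict(zip(dummies.columns.tolist(), props.tolist()))


props_fixed = weighted_props_fixed([0, 0, 0, 0], [0.2, 0.3, 0.7, 0.2])
print(props_fixed)  # level 1 now has its own entry, with a proportion of 0.0
```

With the levels declared explicitly, the proportion for level 1 is available and equals 0 for an all-zero input, which is the value the toy example expects.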
