Non-credible sample from a categorical variable #2486

@QuantAkt

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDV: 1.19.0
Python: 3.9.22
Platform: linux-x86_64 (WSL2, Ubuntu 24.04.2)

Error Description

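When I sample a categorical variable with GaussianCopulaSynthesizer, the synthetic value counts are wildly inconsistent with the training data. The input has four categories at roughly 2,500 rows each, but a synthetic sample of the same size yields counts such as 1: 5691, 3: 3725, 2: 584, 0: 0. See the example below.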

Steps to reproduce

from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer

from IPython.display import display
import numpy as np
import pandas as pd

# size of sample
nsamp = 10_000
# since SDV-API does not expose its RNG I can only fix the input sample
rng = np.random.default_rng(seed=4711)

# create categorical
df = pd.DataFrame(pd.Categorical([0, 1, 2, 3]))

# create sample
smp_orig = df.sample(n=nsamp, weights=[0.25, 0.25, 0.25, 0.25],
                     replace=True, random_state=rng, ignore_index=True)
# single "category" variable
display(smp_orig.dtypes)
# value counts look credible ~2500 each
display(smp_orig.value_counts())

# create synthetic sample
metadata = Metadata.detect_from_dataframe(
    data=smp_orig,
    table_name="smp_orig")
# single categorical variable detected => looks OK
print(metadata)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(smp_orig)
smp_synth = synthesizer.sample(num_rows=nsamp)
# looks formally OK on first sight
display(smp_synth.head())

# but the value counts are far off, e.g. 1: 5691, 3: 3725, 2: 584, 0: 0
display(smp_synth.value_counts())

# version information
import sys
import sysconfig
from sdv import version
print(version.public)            # SDV: 1.19.0
print(sys.version)               # python: 3.9.22
print(sysconfig.get_platform())  # 'linux-x86_64' for WSL2 Ubuntu 24.04.2
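
To put a number on "non-credible", here is a minimal sketch of a chi-square goodness-of-fit check of the reported synthetic counts against the uniform distribution used for the input. It assumes the counts quoted in the comment above (0: 0, 1: 5691, 2: 584, 3: 3725). The helper names `chi_square_uniform` and `chi2_sf_3dof` are illustrative and not part of SDV. The p-value uses the closed-form survival function of the chi-square distribution with 3 degrees of freedom, so only the standard library is needed.

```python
import math


def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected counts."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((obs - expected) ** 2 / expected for obs in counts)


def chi2_sf_3dof(x):
    """Survival function P(X > x) for chi-square with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Counts reported for the synthetic sample, ordered by category 0, 1, 2, 3.
synthetic_counts = [0, 5691, 584, 3725]

stat = chi_square_uniform(synthetic_counts)
p_value = chi2_sf_3dof(stat)  # 4 categories -> 3 degrees of freedom
print(f"chi-square = {stat:.1f}, p-value = {p_value:.3g}")
```

For comparison, the 5% critical value for 3 degrees of freedom is about 7.8, so a statistic in the thousands means these counts are not plausible as a draw from four equally likely categories.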

Labels: bug (Something isn't working); under discussion (Issue is currently being discussed)
