
csvstats: calculated number of bins is too high for the data/random-data.csv dataset #31

@Notgnoshi

Description

```console
# This generates a NaN as the first delta
$ csvdelta -i -c timestamp data/random-data.csv
$ cargo run --bin csvstats -- -c timestamp-deltas data/random-data.csv -H
Stats for column "timestamp-deltas":
    count: 1173
    filtered: 1 (total: 1174)
    Q1: 0.0010528564453125
    median: 0.0010530948638916016
    Q3: 0.0010530948638916016
    min: 0.0010089874267578125 at index: 0
    max: 0.0010700225830078125 at index: 1172
    mean: 0.0010526164006902329
    stddev: 0.000029847184932617655

2025-03-10T00:00:38.348237Z  INFO csvizmo::plot: Using 1350 bins with width 0.0000
```

1350 is more bins than there are samples. Either I have a bug in my Freedman-Diaconis rule calculation, or the rule doesn't give the kind of results I want for this dataset.
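
Plugging the Q1/Q3/min/max/count values from the output above into the textbook Freedman-Diaconis formula does reproduce roughly 1350 bins, because the IQR (~2.4e-7) is tiny compared to the min-to-max range (~6.1e-5). A minimal sketch of that formula (not necessarily what csvizmo::plot actually does):

```rust
/// Standard Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3),
/// number of bins = ceil((max - min) / width).
fn freedman_diaconis_bins(q1: f64, q3: f64, min: f64, max: f64, n: usize) -> usize {
    let iqr = q3 - q1;
    let width = 2.0 * iqr / (n as f64).cbrt();
    ((max - min) / width).ceil() as usize
}

fn main() {
    // Values copied from the csvstats output above.
    let bins = freedman_diaconis_bins(
        0.0010528564453125,    // Q1
        0.0010530948638916016, // Q3
        0.0010089874267578125, // min
        0.0010700225830078125, // max
        1173,                  // count
    );
    // Prints roughly 1350: the IQR (~2.4e-7) is much smaller than the range
    // (~6.1e-5), so the rule asks for more bins than there are samples.
    println!("bins = {bins}");
}
```

So the calculation may well be correct, and the rule just degenerates when most of the mass sits inside a very narrow IQR with long tails.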

A dataset like this isn't close to normal, so I think the KDE estimate isn't quite right either, and a histogram isn't the most useful way of visualizing this data.
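
Similarly, if the KDE bandwidth comes from a normal-reference rule such as Silverman's rule of thumb (an assumption on my part; I haven't checked what the plot code actually uses), the same tiny IQR drags the bandwidth down to roughly 4e-8, far narrower than the spread of the data:

```rust
/// Silverman's rule-of-thumb bandwidth for a Gaussian KDE:
/// h = 0.9 * min(stddev, IQR / 1.34) * n^(-1/5).
fn silverman_bandwidth(stddev: f64, iqr: f64, n: usize) -> f64 {
    0.9 * stddev.min(iqr / 1.34) * (n as f64).powf(-0.2)
}

fn main() {
    // Values copied from the csvstats output above.
    let stddev = 0.000029847184932617655;
    let iqr = 0.0010530948638916016 - 0.0010528564453125;
    let h = silverman_bandwidth(stddev, iqr, 1173);
    // Roughly 4e-8, versus a min-to-max range of ~6.1e-5, i.e. the kernel
    // is far narrower than the spread of the data.
    println!("bandwidth = {h:e}");
}
```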

(Screenshot: histogram/KDE plot of the timestamp-deltas column.)

Labels: bug (Something isn't working), csvstats (A gizmo to calculate summary statistics and plot a histogram from a CSV)
