Add native Polars DataFrame support #99
stewjb wants to merge 10 commits into spotfiresoftware:main from
Conversation
…tetime, scatter compat

- Fix Categorical/Enum dtype: was incorrectly trying to recurse into dtype.categories (which doesn't exist on the dtype object); now casts the series to Utf8 and maps to SBDF_STRINGTYPEID directly
- Add Enum dtype support (previously raised SBDFError)
- Warn on UInt64 export: values above the Int64 max will overflow silently
- Warn on timezone-aware Datetime export: tz info is not preserved in SBDF
- Warn on Decimal export: marked experimental, precision may be lost
- Fix scatter() compatibility: add an AttributeError fallback to set_at_idx() for older Polars versions within the supported range
- Add tests for all of the above

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add polars to test_requirements_default.txt so SbdfPolarsTest is actually executed in CI (previously skipped due to missing import)
- Add a spotfire[polars] row to the extras table in README
- Add a usage note explaining that Spotfire's bundled Python lacks Polars and that SPKs bundling Polars will be ~44 MB larger than typical packages

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Raise SBDFError for unknown output_format values (previously fell through silently to Pandas)
- Emit SBDFWarning when Categorical/Enum columns are exported as String, consistent with the existing UInt64 and timezone warnings
- Add test_invalid_output_format: verifies that a bad output_format raises
- Add test_write_polars_empty: verifies that an empty DataFrame exports cleanly
- Add test_write_polars_series_nulls: verifies null preservation in Series
- Add test_polars_categorical_warns: verifies the Categorical warning fires

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A Polars Series of [None, None, None] has dtype pl.Null (no type can be inferred). Previously this raised SBDFError with "unknown dtype". Now it exports as an all-invalid String column, consistent with how all-None Pandas columns are handled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CI static analysis runs mypy without polars installed; add type: ignore[import-not-found] so mypy skips the missing stub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Explain non-obvious choices that would otherwise prompt review questions:

- Why dtype.__class__.__name__ is used instead of isinstance()
- Why the scatter()/set_at_idx() try/except exists and which versions it covers
- Why the is_object_numpy_type() cpdef wrapper is needed for a cdef attribute
- Why the output_format polars path short-circuits before pd.concat
- Why the Null dtype path returns a placeholder array

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…olars versions (>= 0.20)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Adds native Polars support to Spotfire’s SBDF import/export layer to avoid Pandas conversions (improving memory usage and performance for large datasets), and wires it up as an optional extra.
Changes:
- Add `polars` as an optional dependency (`spotfire[polars]`) and enable it in dev/test setups.
- Extend `sbdf.export_data()` to accept `polars.DataFrame` / `polars.Series` directly, with dtype→SBDF mapping.
- Extend `sbdf.import_data()` with `output_format` to optionally construct a native `polars.DataFrame` without creating a Pandas DataFrame.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| spotfire/sbdf.pyx | Implements Polars import/export paths and dtype mappings; adds output_format to import_data(). |
| spotfire/sbdf.pyi | Updates type stub for import_data() to include output_format. |
| spotfire/test/test_sbdf.py | Adds Polars-focused unit tests for export/import/roundtrip + warnings. |
| pyproject.toml | Adds polars extra and includes it in dev extra. |
| test_requirements_default.txt | Installs Polars for test runs. |
| README.md | Documents spotfire[polars] and the new import/export behavior. |
| .gitignore | Ignores .venv, uv.lock, and .claude. |
    context.set_valuetype_id(_export_infer_valuetype_from_polars_dtype(series.dtype, f"column '{col}'"))
    invalids = series.is_null().to_numpy()
    context.set_arrays(_export_polars_series_to_numpy(context, series), invalids)
    column_metadata.append({})
In the Polars export path, invalids are derived from series.is_null(), which does not mark floating-point NaN values as invalid. In the existing Pandas path pd.isnull() treats NaN as missing, so exporting a Polars float column containing NaN will write NaNs as real values instead of SBDF invalids (behavior mismatch vs Pandas and likely incorrect for Spotfire missing-values semantics). Consider treating NaN as invalid for Float32/Float64 columns (e.g., combine is_null() with is_nan() when applicable).
Added series.is_null() combined with series.is_nan() for float columns so NaN is treated as missing, like Pandas does.
    if na_value is not None:
        return np.asarray(series.fill_null(na_value).to_numpy(allow_copy=True),
                          dtype=context.get_numpy_dtype())
    else:
_export_polars_series_to_numpy converts to an object ndarray when na_value is None. For Polars Datetime / Duration series, to_numpy() already produces datetime64 / timedelta64 arrays that the existing SBDF exporters can handle, so forcing dtype=object will box scalars and create an unnecessary copy (hurting the performance goal of this PR). Consider special-casing datetime/timespan to keep the native NumPy dtype (ideally normalized to the SBDF-supported resolution) instead of casting to object.
    else:
        # For Datetime/Duration, keep native NumPy datetime64/timedelta64 dtypes instead of boxing to object.
        if dtype_name in ("Datetime", "Duration"):
            return series.to_numpy(allow_copy=True)
Datetime and Duration now convert to NumPy early, keeping their native dtypes.
- Move output_format validation to the top of import_data() for fail-fast behaviour before the file is opened
- Raise SBDFError in the _import_polars_dtype fallback instead of silently returning Utf8 for unknown SBDF type IDs
- Treat NaN as invalid (missing) for Float32/Float64 columns, matching Pandas pd.isnull() behaviour; add test_write_polars_float_nan
- Keep native datetime64/timedelta64 arrays for Datetime/Duration columns instead of boxing to object dtype (avoids an unnecessary copy)
- Add @overload signatures to sbdf.pyi so callers get pd.DataFrame for the default output_format="pandas" and Any for output_format="polars"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
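The fail-fast validation can be sketched like this (a simplified stand-in, not the actual sbdf.pyx code; ValueError stands in for SBDFError here):

```python
def import_data(sbdf_file, output_format="pandas"):
    # Validate before any file I/O so a typo fails immediately,
    # not after the SBDF file has been opened and partially read.
    if output_format not in ("pandas", "polars"):
        raise ValueError(f"unknown output_format: {output_format!r}")
    # ... open and parse sbdf_file here ...
    return output_format  # placeholder for the real DataFrame result

try:
    import_data("example.sbdf", output_format="arrow")
except ValueError as exc:
    print(exc)  # unknown output_format: 'arrow'
```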
@vrane-tibco @bbassett-tibco @mpanke-tibco thanks for considering this PR. Let me know if you all have thoughts.
Closes #98
Summary
- `export_data()` now accepts `polars.DataFrame` and `polars.Series` directly, mapping Polars dtypes to SBDF types without any Pandas intermediary. Supported types: Boolean, Int8/16/32, Int64, Float32/64, Utf8/String, Date, Datetime, Duration, Time, Binary, Decimal, Categorical.
- `import_data()` gains an `output_format` parameter (default `"pandas"` for backwards compatibility). When `output_format="polars"`, a `polars.DataFrame` is built directly from the raw NumPy arrays — no Pandas DataFrame is created at any point.
- Polars is added as an optional extra (`spotfire[polars]`), following the same pattern as `spotfire[geo]` and `spotfire[plot]`.

Performance benefit
The previous workaround required `polars_df.to_pandas()` before export, which doubles peak memory usage and adds 2–5 seconds of conversion time at 10M rows. The native path eliminates this entirely for export.

Spotfire data function context
When running inside a Spotfire data function, SBDF import and export happen automatically via `data_function.py` — users never call `import_data` or `export_data` directly. This has two implications:

Export (output variables): Full benefit. A user can build a `polars.DataFrame` in their script and return it as an output variable — `export_data()` handles it natively with no conversion.

Import (input variables): No benefit from `import_data(output_format="polars")`. Input data is always loaded by the framework via `sbdf.import_data(self._file)` (no `output_format` argument), so input variables always arrive in the script as `pd.DataFrame`. Users who want Polars for processing would still need to call `pl.from_pandas(input_df)` themselves. Fixing this properly would require changes to `data_function.py` and a mechanism for users to declare their preference — out of scope for this PR.
In short: the `output_format` parameter on `import_data` is primarily useful outside the Spotfire data function context (e.g. standalone scripts using the `spotfire` package directly). Inside a data function, only the export side benefits.

Test plan
- test_write_polars_basic — export a DataFrame with common types, re-import as Pandas and verify data
- test_write_polars_nulls — null values are preserved through the roundtrip
- test_write_polars_series — Polars Series export works
- test_import_as_polars — import with output_format="polars" returns a native polars.DataFrame
- test_polars_roundtrip — full Polars → SBDF → Polars roundtrip

🤖 Generated with Claude Code