Add native Polars DataFrame support #99
stewjb wants to merge 10 commits into spotfiresoftware:main from
Conversation
…tetime, scatter compat

- Fix Categorical/Enum dtype: was incorrectly trying to recurse into dtype.categories (which doesn't exist on the dtype object); now casts the series to Utf8 and maps to SBDF_STRINGTYPEID directly
- Add Enum dtype support (previously raised SBDFError)
- Warn on UInt64 export: values above the Int64 max will overflow silently
- Warn on timezone-aware Datetime export: tz info is not preserved in SBDF
- Warn on Decimal export: marked experimental, precision may be lost
- Fix scatter() compatibility: add an AttributeError fallback to set_at_idx() for older Polars versions within the supported range
- Add tests for all of the above

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add polars to test_requirements_default.txt so SbdfPolarsTest is actually executed in CI (previously skipped due to missing import)
- Add a spotfire[polars] row to the extras table in README
- Add a usage note explaining that Spotfire's bundled Python lacks Polars and that SPKs bundling Polars will be ~44 MB larger than typical packages

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Raise SBDFError for unknown output_format values (previously fell through silently to Pandas)
- Emit SBDFWarning when Categorical/Enum columns are exported as String, consistent with the existing UInt64 and timezone warnings
- Add test_invalid_output_format: verifies that a bad output_format raises
- Add test_write_polars_empty: verifies that an empty DataFrame exports cleanly
- Add test_write_polars_series_nulls: verifies null preservation in Series
- Add test_polars_categorical_warns: verifies the Categorical warning fires

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A Polars Series of [None, None, None] has dtype pl.Null (no type can be inferred). Previously this raised SBDFError with "unknown dtype". Now it exports as an all-invalid String column, consistent with how all-None Pandas columns are handled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CI static analysis runs mypy without polars installed; add type: ignore[import-not-found] so mypy skips the missing stub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Explain non-obvious choices that would otherwise prompt review questions:

- Why dtype.__class__.__name__ is used instead of isinstance()
- Why the scatter()/set_at_idx() try/except exists and which versions it covers
- Why the is_object_numpy_type() cpdef wrapper is needed for a cdef attribute
- Why the output_format polars path short-circuits before pd.concat
- Why the Null dtype path returns a placeholder array

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…olars versions (>= 0.20)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Adds native Polars support to Spotfire’s SBDF import/export layer to avoid Pandas conversions (improving memory usage and performance for large datasets), and wires it up as an optional extra.
Changes:
- Add `polars` as an optional dependency (`spotfire[polars]`) and enable it in dev/test setups.
- Extend `sbdf.export_data()` to accept `polars.DataFrame` / `polars.Series` directly, with dtype→SBDF mapping.
- Extend `sbdf.import_data()` with `output_format` to optionally construct a native `polars.DataFrame` without creating a Pandas DataFrame.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| spotfire/sbdf.pyx | Implements Polars import/export paths and dtype mappings; adds output_format to import_data(). |
| spotfire/sbdf.pyi | Updates type stub for import_data() to include output_format. |
| spotfire/test/test_sbdf.py | Adds Polars-focused unit tests for export/import/roundtrip + warnings. |
| pyproject.toml | Adds polars extra and includes it in dev extra. |
| test_requirements_default.txt | Installs Polars for test runs. |
| README.md | Documents spotfire[polars] and the new import/export behavior. |
| .gitignore | Ignores .venv, uv.lock, and .claude. |
    context.set_valuetype_id(_export_infer_valuetype_from_polars_dtype(series.dtype, f"column '{col}'"))
    invalids = series.is_null().to_numpy()
    context.set_arrays(_export_polars_series_to_numpy(context, series), invalids)
    column_metadata.append({})
In the Polars export path, invalids are derived from series.is_null(), which does not mark floating-point NaN values as invalid. In the existing Pandas path pd.isnull() treats NaN as missing, so exporting a Polars float column containing NaN will write NaNs as real values instead of SBDF invalids (behavior mismatch vs Pandas and likely incorrect for Spotfire missing-values semantics). Consider treating NaN as invalid for Float32/Float64 columns (e.g., combine is_null() with is_nan() when applicable).
Added series.is_null() combined with series.is_nan() for float columns so NaN is treated as missing, like Pandas does.
    if na_value is not None:
        return np.asarray(series.fill_null(na_value).to_numpy(allow_copy=True),
                          dtype=context.get_numpy_dtype())
    else:
_export_polars_series_to_numpy converts to an object ndarray when na_value is None. For Polars Datetime / Duration series, to_numpy() already produces datetime64 / timedelta64 arrays that the existing SBDF exporters can handle, so forcing dtype=object will box scalars and create an unnecessary copy (hurting the performance goal of this PR). Consider special-casing datetime/timespan to keep the native NumPy dtype (ideally normalized to the SBDF-supported resolution) instead of casting to object.
    else:
        # For Datetime/Duration, keep native NumPy datetime64/timedelta64 dtypes instead of boxing to object.
        if dtype_name in ("Datetime", "Duration"):
            return series.to_numpy(allow_copy=True)
Datetime and Duration now convert to NumPy early, keeping their native dtypes.
- Move output_format validation to the top of import_data() for fail-fast behaviour before the file is opened
- Raise SBDFError in the _import_polars_dtype fallback instead of silently returning Utf8 for unknown SBDF type IDs
- Treat NaN as invalid (missing) for Float32/Float64 columns, matching Pandas pd.isnull() behaviour; add test_write_polars_float_nan
- Keep native datetime64/timedelta64 arrays for Datetime/Duration columns instead of boxing to object dtype (avoids an unnecessary copy)
- Add @overload signatures to sbdf.pyi so callers get pd.DataFrame for the default output_format="pandas" and Any for output_format="polars"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
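The fail-fast validation can be sketched like this (a simplified stand-in, not the actual sbdf.pyx code; ValueError stands in for SBDFError here):

```python
def import_data(sbdf_file, output_format="pandas"):
    # Validate before any file I/O so a typo fails immediately,
    # not after the SBDF file has been opened and partially read.
    if output_format not in ("pandas", "polars"):
        raise ValueError(f"unknown output_format: {output_format!r}")
    # ... open and parse sbdf_file here ...
    return output_format  # placeholder for the real DataFrame result

try:
    import_data("example.sbdf", output_format="arrow")
except ValueError as exc:
    print(exc)  # unknown output_format: 'arrow'
```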
@vrane-tibco @bbassett-tibco @mpanke-tibco thanks for considering this PR. Let me know if you all have thoughts.
Closes #98
Summary
- `export_data()` now accepts `polars.DataFrame` and `polars.Series` directly, mapping Polars dtypes to SBDF types without any Pandas intermediary. Supported types: Boolean, Int8/16/32, Int64, Float32/64, Utf8/String, Date, Datetime, Duration, Time, Binary, Decimal, Categorical.
- `import_data()` gains an `output_format` parameter (default `"pandas"` for backwards compatibility). When `output_format="polars"`, a `polars.DataFrame` is built directly from the raw NumPy arrays — no Pandas DataFrame is created at any point.
- Polars is added as an optional extra (`spotfire[polars]`), following the same pattern as `spotfire[geo]` and `spotfire[plot]`.

Performance benefit
The previous workaround required `polars_df.to_pandas()` before export, which doubles peak memory usage and adds 2–5 seconds of conversion time at 10M rows. The native path eliminates this entirely for export.

Spotfire data function context
When running inside a Spotfire data function, SBDF import and export happen automatically via `data_function.py` — users never call `import_data` or `export_data` directly. This has two implications:

Export (output variables): Full benefit. A user can build a `polars.DataFrame` in their script and return it as an output variable — `export_data()` handles it natively with no conversion.

Import (input variables): No benefit from `import_data(output_format="polars")`. Input data is always loaded by the framework via `sbdf.import_data(self._file)` (no `output_format` argument), so input variables always arrive in the script as `pd.DataFrame`. Users who want Polars for processing would still need to call `pl.from_pandas(input_df)` themselves. Fixing this properly would require changes to `data_function.py` and a mechanism for users to declare their preference — out of scope for this PR.
In short: the `output_format` parameter on `import_data` is primarily useful outside the Spotfire data function context (e.g. standalone scripts using the `spotfire` package directly). Inside a data function, only the export side benefits.

Test plan
- test_write_polars_basic — export a DataFrame with common types, re-import as Pandas and verify data
- test_write_polars_nulls — null values are preserved through the roundtrip
- test_write_polars_series — Polars Series export works
- test_import_as_polars — import with output_format="polars" returns a native polars.DataFrame
- test_polars_roundtrip — full Polars → SBDF → Polars roundtrip

🤖 Generated with Claude Code