Skip to content

feat: add support for generic DataFrame apis like polars, cuDF, modin etc✨#207

Draft
KarelZe wants to merge 5 commits into
mainfrom
dataframe-api-isolated
Draft

feat: add support for generic DataFrame apis like polars, cuDF, modin etc✨#207
KarelZe wants to merge 5 commits into
mainfrom
dataframe-api-isolated

Conversation

@KarelZe
Copy link
Copy Markdown
Owner

@KarelZe KarelZe commented May 5, 2026

better DataFrame support

task

  • I'd like to provide support for a wide range of dataframe apis, so that tclf can be used with Polars, cuDF, pandas etc.
  • I'd propose to use narwhals for it.
  • Previously, I wanted to implement using the __dataframe__ protocol (see here feat: start with support for __dataframe__ api🆕 #103), which seems to be fallen out of favour.

background

  • The ecosystem is shifting away from the __dataframe__ interchange protocol due to its complexity and brittleness. Notably:
  • Scikit-learn and skrub are currently discussing or implementing Narwhals to simplify DataFrame input/output handling.
  • Polars has de-emphasized the interchange protocol in favor of the Arrow PyCapsule Interface. As of version 1.23.0, Polars only uses the old protocol as a fallback, citing implementation inconsistencies and bugs in external libraries (like pandas < 2.2) that led to unreliable results.
  • Narwhals provides a lightweight, transparent wrapper that allows tclf to treat various DataFrames as a unified object without adding heavy dependencies or sacrificing performance.
    references
  • RFC adopt narwhals for dataframe support scikit-learn/scikit-learn#31049
  • DISCUSSION - Replacing the dataframe API with narwhals skrub-data/skrub#1818

Potential Implementation

# https://github.com/scikit-learn/scikit-learn/pull/31127/changes#diff-9ae0e7df7811fcd30e802947898d26c1d83267255d7abf79b20a25233f86cabeR370
import narwhals.stable.v2 as nw

        if nw.dependencies.is_into_dataframe(X):
            X = nw.from_native(X)
            dtypes = X.schema.dtypes()

tasks

  • remove pandas and maybe even numpy as a runtime dependency
  • update implementation in ClassicalClassifier to use narwhals
  • add additional tests for polars and maybe cuDF.
  • update documentation
  • run benchmarks before and after test

@KarelZe KarelZe self-assigned this May 5, 2026
@KarelZe KarelZe added the enhancement New feature or request label May 5, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.50%. Comparing base (d3d193e) to head (075680e).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/tclf/classical_classifier.py 87.50% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #207      +/-   ##
==========================================
- Coverage   98.46%   96.50%   -1.96%     
==========================================
  Files           2        2              
  Lines         130      143      +13     
  Branches       10       12       +2     
==========================================
+ Hits          128      138      +10     
- Misses          0        1       +1     
- Partials        2        4       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request outlines a plan to support multiple DataFrame libraries via Narwhals and updates workspace spell-check settings. Feedback points out incorrect API usage in the proposed code snippet and provides recommendations for dependency management, specifically regarding the necessity of numpy and the potential to make pandas optional.

Comment thread better-df-support.md Outdated
Comment thread better-df-support.md Outdated
@KarelZe KarelZe force-pushed the dataframe-api-isolated branch from dca61b6 to ec13e82 Compare May 5, 2026 06:56
@KarelZe
Copy link
Copy Markdown
Owner Author

KarelZe commented May 5, 2026

@gemini-code-assist please review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request transitions the project to be DataFrame-agnostic by replacing the pandas dependency with narwhals, enabling support for Polars, cuDF, and other compatible libraries. Key changes include the introduction of an internal _Frame class for numpy-backed column access and updates to the fit and predict methods to handle diverse input types. A critical issue was identified in the predict method where input DataFrames are converted to numpy arrays without ensuring that column alignment matches the training data, which could lead to incorrect predictions.

Comment thread src/tclf/classical_classifier.py Outdated
@KarelZe
Copy link
Copy Markdown
Owner Author

KarelZe commented May 5, 2026

@gemini-code-assist final round of review :-D

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request migrates the library to be DataFrame-agnostic by replacing the pandas dependency with narwhals, enabling support for Polars, cuDF, and other compatible formats. It introduces an internal _Frame class to maintain column-based access for numpy arrays and updates the ClassicalClassifier's fit and predict methods accordingly. Feedback identifies a potential crash when fitting numpy arrays without explicit feature names and recommends replacing assert statements with robust error handling to ensure reliability in optimized Python environments.

Comment on lines +484 to +489
self.columns_ = self.features
if _is_frame(X):
nw_X = nw.from_native(X, eager_only=True)
if self.columns_ is None:
self.columns_ = nw_X.columns
X = nw_X.select(self.columns_).to_numpy()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When X is a non-DataFrame input (like a numpy array) and self.features is None, self.columns_ will remain None. This will cause a failure later in predict when initializing _Frame. It is better to validate this early in fit with a clear error message.

        self.columns_ = self.features
        if _is_frame(X):
            nw_X = nw.from_native(X, eager_only=True)
            if self.columns_ is None:
                self.columns_ = nw_X.columns
            X = nw_X.select(self.columns_).to_numpy()
        elif self.columns_ is None:
            raise ValueError(
                "The 'features' parameter must be provided when fitting on a non-DataFrame "
                "input (e.g., a numpy array) so that column names can be associated with "
                "the data."
            )

Comment thread src/tclf/classical_classifier.py Outdated
npt.NDArray: Predicted traget values for X.
"""
check_is_fitted(self)
assert self.columns_ is not None # set during fit
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using assert for input validation or state checking in a library is generally discouraged as assertions can be optimized away with the -O flag. Since self.columns_ should be set during fit, consider relying on check_is_fitted(self) or raising a explicit ValueError if the state is invalid.

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 5, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant