feat: add support for generic DataFrame apis like polars, cuDF, modin etc✨#207
feat: add support for generic DataFrame apis like polars, cuDF, modin etc✨#207KarelZe wants to merge 5 commits into
polars, cuDF, modin etc✨#207Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #207 +/- ##
==========================================
- Coverage 98.46% 96.50% -1.96%
==========================================
Files 2 2
Lines 130 143 +13
Branches 10 12 +2
==========================================
+ Hits 128 138 +10
- Misses 0 1 +1
- Partials 2 4 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Code Review
This pull request outlines a plan to support multiple DataFrame libraries via Narwhals and updates workspace spell-check settings. Feedback points out incorrect API usage in the proposed code snippet and provides recommendations for dependency management, specifically regarding the necessity of numpy and the potential to make pandas optional.
dca61b6 to
ec13e82
Compare
|
@gemini-code-assist please review |
There was a problem hiding this comment.
Code Review
This pull request transitions the project to be DataFrame-agnostic by replacing the pandas dependency with narwhals, enabling support for Polars, cuDF, and other compatible libraries. Key changes include the introduction of an internal _Frame class for numpy-backed column access and updates to the fit and predict methods to handle diverse input types. A critical issue was identified in the predict method where input DataFrames are converted to numpy arrays without ensuring that column alignment matches the training data, which could lead to incorrect predictions.
|
@gemini-code-assist final round of review :-D |
There was a problem hiding this comment.
Code Review
This pull request migrates the library to be DataFrame-agnostic by replacing the pandas dependency with narwhals, enabling support for Polars, cuDF, and other compatible formats. It introduces an internal _Frame class to maintain column-based access for numpy arrays and updates the ClassicalClassifier's fit and predict methods accordingly. Feedback identifies a potential crash when fitting numpy arrays without explicit feature names and recommends replacing assert statements with robust error handling to ensure reliability in optimized Python environments.
| self.columns_ = self.features | ||
| if _is_frame(X): | ||
| nw_X = nw.from_native(X, eager_only=True) | ||
| if self.columns_ is None: | ||
| self.columns_ = nw_X.columns | ||
| X = nw_X.select(self.columns_).to_numpy() |
There was a problem hiding this comment.
When X is a non-DataFrame input (like a numpy array) and self.features is None, self.columns_ will remain None. This will cause a failure later in predict when initializing _Frame. It is better to validate this early in fit with a clear error message.
self.columns_ = self.features
if _is_frame(X):
nw_X = nw.from_native(X, eager_only=True)
if self.columns_ is None:
self.columns_ = nw_X.columns
X = nw_X.select(self.columns_).to_numpy()
elif self.columns_ is None:
raise ValueError(
"The 'features' parameter must be provided when fitting on a non-DataFrame "
"input (e.g., a numpy array) so that column names can be associated with "
"the data."
)| npt.NDArray: Predicted traget values for X. | ||
| """ | ||
| check_is_fitted(self) | ||
| assert self.columns_ is not None # set during fit |
There was a problem hiding this comment.
|



better DataFrame support
task
narwhalsfor it.__dataframe__protocol (see here feat: start with support for __dataframe__ api🆕 #103), which seems to be fallen out of favour.background
__dataframe__interchange protocol due to its complexity and brittleness. Notably:tclfto treat various DataFrames as a unified object without adding heavy dependencies or sacrificing performance.references
Potential Implementation
tasks
pandasand maybe evennumpyas a runtime dependencyClassicalClassifierto usenarwhalspolarsand maybecuDF.