233 feature implementation of pcr#254
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new Principal Component Regression (PCR) estimator to chemotools.models, with a comprehensive sklearn-compatibility test suite.
Changes:
- Introduces
PrincipalComponentRegression(PCA + LinearRegression) as a sklearn-style estimator/transformer. - Exposes PCR from
chemotools.modelsvia__init__.py. - Adds extensive tests comparing predictions/attributes/scores vs an equivalent sklearn pipeline.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
chemotools/models/_principal_component_regression.py |
New PCR estimator implementation (fit/transform/predict + sklearn param validation). |
chemotools/models/__init__.py |
Exports PrincipalComponentRegression from the chemotools.models public API. |
tests/models/test_principal_component_regression.py |
New tests for sklearn estimator compliance and parity with a PCA+LinearRegression pipeline. |
Comments suppressed due to low confidence (1)
chemotools/models/init.py:16
chemotools.modelsemits aFutureWarningon every import, but this PR addsPrincipalComponentRegressionto this module. That means users importing PCR via the new public API (from chemotools.models import PrincipalComponentRegression) will now see a warning message that only discussesPLSRegression, which is confusing and noisy. Consider moving the warning so it only triggers whenPLSRegressionis imported/used, or updating the warning text/scope to reflect the expanded contents ofchemotools.models.
from chemotools.models._cross_decomposition import PLSRegression
from ._principal_component_regression import PrincipalComponentRegression
__all__ = ["PLSRegression", "PrincipalComponentRegression"]
# Show deprecation notice on module import
warnings.warn(
"chemotools.models.PLSRegression extends sklearn's PLSRegression with "
"explained_x_variance_ratio_ and explained_y_variance_ratio_ attributes. "
"This feature is being contributed to scikit-learn (see PR #32722). "
"Once available in sklearn, this module may be deprecated. "
"Track progress at: https://github.com/scikit-learn/scikit-learn/pull/32722",
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Train PCA model | ||
| self.pca_ = PCA(n_components=self.n_components).fit(X) | ||
| x_scores = self.pca_.transform(X) | ||
|
|
||
| # Train linear regression model | ||
| self.lr_ = LinearRegression().fit(x_scores, y) | ||
|
|
There was a problem hiding this comment.
The estimator exposes a copy parameter, but it is only used in validate_data and is not passed through to the underlying PCA(copy=...) (nor documented). This makes copy behavior differ from sklearn's PCA(copy=...) and from what users might expect in a PCR wrapper. Consider either propagating copy to PCA (and possibly LinearRegression) or renaming/removing it to avoid implying sklearn-PCA semantics.
| from chemotools.models._principal_component_regression import ( | ||
| PrincipalComponentRegression, | ||
| ) |
There was a problem hiding this comment.
The tests import PrincipalComponentRegression from the private module chemotools.models._principal_component_regression, even though the PR adds it to chemotools.models.__init__ and __all__. To test the supported public API (and avoid coupling tests to internal module paths), import it from chemotools.models instead.
| from chemotools.models._principal_component_regression import ( | |
| PrincipalComponentRegression, | |
| ) | |
| from chemotools.models import PrincipalComponentRegression |
| ------- | ||
| x_scores : np.ndarray of shape (n_samples,n_components) | ||
| The transformed data. | ||
| """ |
There was a problem hiding this comment.
transform() calls self.pca_.transform(X) without verifying the estimator is fitted or validating feature count/dtype. This will raise an AttributeError instead of a scikit-learn NotFittedError, and skips validate_data(..., reset=False) checks that predict() already performs. Add check_is_fitted(self, ["pca_", "lr_"]) (or at least "pca_") and validate X similarly to predict() before transforming.
| """ | |
| """ | |
| check_is_fitted(self, ["pca_", "lr_"]) | |
| X = validate_data( | |
| self, | |
| X, | |
| ensure_2d=True, | |
| reset=False, | |
| copy=self.copy, | |
| dtype=np.float64, | |
| ) |
| """ | ||
| Description | ||
|
|
||
| Parameters | ||
| ---------- | ||
| n_components : int, default = 2 | ||
| The number of components used to calculate the PCA model | ||
| # add comments on parameter constraints | ||
|
|
There was a problem hiding this comment.
The class docstring has several issues that will surface in generated docs: (1) it states n_components default is 2 but __init__ defaults to None; (2) the example import uses chemotools.decomposition which does not exist in this repo (should be chemotools.models); and (3) it contains a leftover placeholder line # add comments on parameter constraints. Please align the docstring with the actual public API and remove the placeholder.
|
|
||
| def test_y_dimensions(self): | ||
| """ | ||
| Test that y dimesions predicted is the right one. |
There was a problem hiding this comment.
Several new test names/docstrings contain spelling mistakes that reduce readability/searchability (e.g., "dimesions" in the docstring). Please correct the spelling ("dimensions").
| Test that y dimesions predicted is the right one. | |
| Test that y dimensions predicted is the right one. |
| def test_attributs(self): | ||
| """ | ||
| Test all the attributes are presents | ||
| """ |
There was a problem hiding this comment.
Several new identifiers have typos that make the test suite harder to navigate (e.g., test_attributs). Please rename to test_attributes to match the wording used elsewhere in the repo (see tests/models/test_cross_decomposition.py).
| err_msg=f"Attribute {attr} should match sklearn exactly", | ||
| ) | ||
|
|
||
| def test_tranform_as_sklearn(self): |
There was a problem hiding this comment.
Typo in test name test_tranform_as_sklearn makes it harder to discover and is inconsistent with the test_same_transform_as_sklearn naming used elsewhere (e.g., PLS tests). Consider renaming it to test_transform_as_sklearn (or test_same_transform_as_sklearn).
| def test_tranform_as_sklearn(self): | |
| def test_transform_as_sklearn(self): |
| _parameter_constraints: dict = { | ||
| "n_components": [ | ||
| Interval(Integral, 0, None, closed="left"), | ||
| Interval(RealNotInt, 0, 1, closed="neither"), | ||
| StrOptions({"mle"}), | ||
| None, | ||
| ], | ||
| "copy": ["boolean"], | ||
| } | ||
|
|
||
| def __init__( | ||
| self, | ||
| n_components: int | None = None, | ||
| copy: bool = True, | ||
| ): | ||
| self.n_components = n_components | ||
| self.copy = copy |
There was a problem hiding this comment.
The parameter constraints indicate n_components may be an int, a float in (0, 1), the string "mle", or None, but the type hint in __init__ restricts it to int | None. This mismatch will confuse users/type-checkers and doesn't reflect the actual accepted values passed through to sklearn.decomposition.PCA. Consider widening the annotation (and updating the docstring) to match the supported types.
Adding Principal Component Regression (PCR) algorithm.