PyDimRed is a dimenstionality reduction (DR) library built around sklearn. It integrates three main functionalities: data transformation, model evaluation, and data plotting.
Use the package manager pip to install PyDimRed.
The dependencies for this library are joblib, seaborn, numpy, sci-kit learn, umap, pacmap, trimap, open t-SNE, and pandas.
To install PyDimRed via pip simply run the following:
pip install PyDimRedOr clone the repo, cd into it and run the following:
cd PyDimRed
pip install .PyDimRed documentation is made using sphinx and Read-The-Docs theme.
As stated previously, PyDimRed is centered around 3 functionalities and a DR utility module that will be explored through examples.
Data transformation is undertaken using the PyDimRed.transform module. A dimensionality reduction model is instanciated using the PyDimRed.transform.TransformWrapper class. As the name of the class indicates, it is a wrapper class that takes any object satisfying the definition of a Transform (an object that must have certain methods with specific signatures).
There are two ways of instanciating a TransformWrapper object:
- By specifying the
method : strparameter. This will create a built-in DR model from the libraries in the install section. - By passing a
base_modelparameter. Using thebase_modelargument allows for the use of a user defined DR model.
Here is a quick example on the use of TransformWrapper to transform the iris dataset to 2 dimensions with a built in DR method.
from PyDimRed.transform import TransformWrapper
from PyDimRed.plot import display
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
model = TransformWrapper(method = "PCA", d = 2)
X_reduced = model.fit_transform(X, y)
display(
X = X_reduced,
y = y,
title = "PCA reduction of iris data set to 2 dimensions",
x_label = "1st principal component",
y_label = "2nd principal component",
hue_label = "iris features",
)The model can be customized by passing a mode as argument for further customization.
from PyDimRed.transform import TransformWrapper
from PyDimRed.plot import display
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
X, y = load_iris(return_X_y=True)
model = TransformWrapper(base_model = TSNE(n_components=2, perplexity=30.0))
X_reduced = model.fit_transform(X, y)
display(
X = X_reduced,
y = y,
title = "PCA reduction of iris data set to 2 dimensions",
x_label = "1st principal component",
y_label = "2nd principal component",
hue_label = "iris features",
)The evaluation module PyDimRed.evaluation provides two main dimensionality reduction evalution methods in the class PyDimRed.evaluation.ModelEvaluator. The methods are PyDimRed.evaluation.ModelEvaluator.grid_search_dr_performance and PyDimRed.evaluation.ModelEvaluator.cross_validation.
cross_validationwraps aroundsklearn.model_selection.GridSearchCV. In cross validation, an estimator is trained on validation data and predictions are made on test data. This would be achieved via a call tofit()and thentransform()on different data sets. Some DR methods liketrimap.TRIMAPandsklearn.manifold.TSNEdo not support transforming non-training data. Consequently, standard cross validation cannot be used, and hence thegrid_search_dr_performanceis used.grid_search_dr_performanceis a work-around solution for DR methods that only implement afit_transform()method, which fits the model on a dataset and returns the transformed data. Models that only have this method are generally unable to transform other data points that were not part of the original train data. This method fits a DR model on the training data and transforms the training data. It then fits a new DR model on the validation data and transforms the validation data. Finally and estimator is used to obtain a performance value. For classification an sklearn pipeline with a Standard Scaler and 1-Nearest Neighbour classifier is used by default and for regression a 1-NN regressor can be used for example.
We start off by showing an example to display the use of grid_search_dr_performance with trimap.
from PyDimRed.plot import display_heatmap_df
from PyDimRed.evaluation import ModelEvaluator
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
params = [{"method" : ["TRIMAP"], "n_nbrs" : range(2,20,3), "n_outliers" : range(2,20,3)}]
estimator = Pipeline(steps=[("Scaler", StandardScaler()), ("OneNN", KNeighborsClassifier(1))])
K = 4
n_repeats = 15
model_eval = ModelEvaluator(
X=X,
y=y,
parameters=params,
estimator = estimator,
K=K,
n_repeats=n_repeats,
n_jobs=-1,
)
best_score, best_params, results = model_eval.grid_search_dr_performance()
print(f"The best accuracy is {best_score} with the following parameters {best_params}")
display_heatmap_df(
results,
"n_nbrs",
"n_outliers",
"score",
title=f"Accuracy of 1-NN as a function of number of inliers and outliers in TRIMAP with d=2. K={K} cross validation with {n_repeats} repetitions"
)
display_heatmap_df(
results,
"n_nbrs",
"n_outliers",
"variance",
title=f"Variance of 1-NN as a function of number of inliers and outliers in TRIMAP with d=2. K={K} cross validation with {n_repeats} repetitions"
)In this example we instantiate a ModelEvaluator object to then call grid_search_dr_performance. The data sets and parameters are passed during instantiation. The estimator parameter is explicitly passed as an argument, even if it is the default value, to highlight its importance. The estimator's purpose is to what defines the performance of a DR model. After the train and validation data has been transformed, it is trained on the train transformed data via fit() and evaluated on the transformed validation data via score(). It is convenient to use an sklearn.pipeline.Pipeline as it easily allows chaining (e.g scaling) of the data.
In comparison to what cross_validation returns, the results variable contains less data as in cross_validation the result of sklearn.model_selection.GridSearchCV is returned.
The method will create parameter combinations accoring to sklearn's ParameterGrid class. In short, a sort of 'cross product' is taken between all parameter values in a dictionary, and when different dictionaries are in a list a union of parameters is computed. Note that n_jobs refers to the number of jobs in joblib.
We now look at an example of cross validation.
from PyDimRed.plot import display_heatmap_df
from PyDimRed.evaluation import ModelEvaluator
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
params = [{"method" : ["PACMAP"]} , {"method" : ["TSNE", "UMAP", "PACMAP"], "n_nbrs" : range(2,10,1) }]
model_eval = ModelEvaluator(X,y,params,K=20, n_jobs=-1)
bestScore, bestParams, results = model_eval.cross_validation()
display_heatmap_df(results,'param_Transform__method','param_Transform__n_nbrs', 'mean_test_score')Plotting is built over the seaborn library. Methods in the module PyDimRed.plot have the goal to present transformed data or model evaluation results. The methods and the resulting plots are:
-
displayPlot a 2-D dataset on a seaborn relplot. -
display_groupPlot a grid-like arrangement of sub-plots. Can be used to qualitatively compare effectivenes of DR methods, or vary a parameter and see the effect of it visually. -
display_heatmapPlot a heatmap given a square$N \times N$ numpy.array. -
display_heatmap_dfPlot a heatmap given apandas.DataFramewhere every row corresponds to parameters value pairs. -
display_accuraciesPlot accuracies and method name on a bar graph with color scale proportional to accuracy. -
display_training_validationLine plot of accuracy vs. a common variable parameter for multiple methods.
The most flexible method for qualitative comparisons of reduced data is display_group. It can be used in three main different ways:
(1) To simply plot transformed data from the same source data set with heterogeneous models as below.
from PyDimRed.utils.dr_utils import reduce_data_with_params
from PyDimRed.plot import display_group
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y = True)
params = [{"method" : ["TSNE", "UMAP", "TRIMAP"] , "n_nbrs" : [10,20,40,80]} , {"method" : ["PACMAP"]}]
Xs_train_reduced, names = reduce_data_with_params(X,y,*params)
display_group(
names=names,
X_train_list=Xs_train_reduced,
y_train=y,
nbr_cols=4,
nbr_rows=4,
legend="auto",
title=f"Transformed data via different DR models on the MNIST data set"
)(2) To plot transformed data with different model parameters in a grid like fashion.
from PyDimRed.utils.dr_utils import reduce_data_with_params
from PyDimRed.plot import display_group
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y = True)
range1 = range(2,20,3)
range2 = range(2,20,3)
params = [{"method" : ["TRIMAP"], "n_nbrs" : range1, "n_outliers" : range2}]
Xreds, _ = reduce_data_with_params(X,y,*params)
cols = len(params[0]["n_nbrs"])
rows = len(params[0]["n_outliers"])
display_group(
names=None,
X_train_list=Xreds,
y_train=y,
nbr_cols=cols,
nbr_rows=rows,
marker_size=10,
grid_x_label=[f"n_inliers={i}" for i in range1],
grid_y_label=[f"n_outliers={i}" for i in range2],
title="TRIMAP with variation in number of inliers and number of outliers"
)(3) To plot training and test data on the same subplots to compare transformations.
from PyDimRed.utils.dr_utils import reduce_data_with_params
from PyDimRed.plot import display_group
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y = True)
range1 = range(2,20,3)
range2 = range(2,20,3)
params = [{"method" : ["TRIMAP"], "n_nbrs" : range1, "n_outliers" : range2}]
cols = len(params[0]["n_nbrs"])
rows = len(params[0]["n_outliers"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
XtrainReds, _ = reduce_data_with_params(X_train, y_train,*params)
XtestReds, _ = reduce_data_with_params(X_test, y_test,*params)
display_group(
names=None,
X_train_list=XtrainReds,
y_train=y_train,
X_test_list=XtestReds,
y_test=y_test,
nbr_cols=cols,
nbr_rows=rows,
marker_size=10,
grid_x_label=[f"n_inliers={i}" for i in range1],
grid_y_label=[f"n_outliers={i}" for i in range2],
title="TRIMAP train and test projections with variation in number of inliers and number of outliers",
)The two utility modules are PyDimRed.utils.data and PyDimRed.utils.dr_utils. The first module has purpose to load csv files to numpy.array / pandas.DataFrame while the latter concerns itself with batch processing.
The main method in dr_utils is reduce_data_with_params which given a dictionary with parameter value pairs transforms a given data set according to all passed methods.
To run the unit tests clone the repo make sure that pytest is installed. Then run
cd test
pytest test_*.pyTo contribute please fork, clone, and submit a pull request!