Scikit-learn Pipeline and GridSearchCV with the OPU

You can use OPUMap, the scikit-learn wrapper in lightonml.projections.sklearn, as a step in a Pipeline and, for example, run a grid search over the pipeline's parameters with GridSearchCV.

This notebook also shows how to use a simulated OPU in case you don’t have access to a real one.

[1]:
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display, HTML  # public import path; IPython.core.display is deprecated for this
display(HTML("<style>.container { width:100% !important; }</style>"))

from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier
import numpy as np

from lightonml.datasets import MNIST
[2]:
random_state = np.random.RandomState(1234)
[3]:
(X_train, y_train), (X_test, y_test) = MNIST()
X, y = np.concatenate([X_train, X_test]), np.concatenate([y_train, y_test])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10000,
                                                    random_state=42)

OPU pipeline

To define a flow of operations, we can conveniently use sklearn.pipeline.Pipeline. In this way, we can easily perform cross-validation on the hyperparameters of the model.
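
As a minimal sketch of the idea, the cell below builds a small CPU-only pipeline of the same shape as the one assembled in this notebook. It does not use the OPU: Binarizer and RBFSampler are stand-ins chosen for illustration (they are not equivalent to BinaryThresholdEncoder and OPUMap), and the synthetic data and the demo_* names are assumptions. The point is the step names, which give every hyperparameter an address of the form step__parameter.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

# Small synthetic dataset standing in for MNIST (assumption, for illustration only).
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 256, size=(200, 64)).astype(float)
y_demo = (X_demo[:, 0] > 127).astype(int)

# Each step is a (name, estimator) pair; the names are used to address parameters.
demo_pipe = Pipeline(steps=[
    ('encoder', Binarizer(threshold=24)),                       # stand-in for the encoder
    ('mapping', RBFSampler(n_components=100, random_state=0)),  # stand-in for the random mapping
    ('classifier', RidgeClassifier()),
])

# 'classifier__alpha' means: parameter `alpha` of the step named 'classifier'.
demo_pipe.set_params(classifier__alpha=0.1)
demo_pipe.fit(X_demo, y_demo)

demo_alpha = demo_pipe.named_steps['classifier'].alpha
demo_accuracy = demo_pipe.score(X_demo, y_demo)
print(list(demo_pipe.named_steps), demo_alpha)
```

The same step__parameter keys are what a parameter grid for GridSearchCV uses, which is why the step names chosen when building the pipeline matter.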

[4]:
from sklearn.pipeline import Pipeline
[5]:
pipeline_steps = []

Data and data encoding

[6]:
from lightonml.encoding.base import BinaryThresholdEncoder
[7]:
encoder = BinaryThresholdEncoder(threshold_enc=24)
print('Encoder threshold: ', encoder.threshold_enc)
Encoder threshold:  24
[8]:
pipeline_steps.append(('encoder', encoder))
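
To make the encoding step concrete, here is a NumPy sketch of what a binary threshold encoding does. It assumes that values strictly greater than the threshold become 1 and all others become 0; the exact comparison used by BinaryThresholdEncoder is not stated in this notebook, and binary_threshold_encode is a hypothetical helper, not part of lightonml.

```python
import numpy as np

def binary_threshold_encode(X, threshold):
    """Hypothetical helper: 1 where X > threshold, 0 elsewhere (assumed strict comparison)."""
    return (np.asarray(X) > threshold).astype(np.uint8)

# Two tiny "images" with grey levels in [0, 255].
pixels = np.array([[0, 24, 25, 255],
                   [12, 100, 3, 30]], dtype=np.uint8)

encoded = binary_threshold_encode(pixels, threshold=24)
print(encoded)
# [[0 0 1 1]
#  [0 1 0 1]]
```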

Random Mapping on the OPU

OPUMap can be initialized with simulated=True if you run the notebook without access to an OPU; in that case, set the simulated variable in the cell below to True. Note that the number of random projections must then be lowered, because the simulation is much slower than a real OPU.

[9]:
from lightonml.projections.sklearn import OPUMap
[10]:
# Change to True if you don't have a real OPU
# (the outputs recorded below were produced with simulated=True)
simulated = False

if simulated:
    random_mapping = OPUMap(n_components=3000,  # number of random projections
                            simulated=True, max_n_features=1000,
                            ndims=2)
else:
    random_mapping = OPUMap(n_components=10000,  # number of random projections
                            ndims=2)

pipeline_steps.append(('mapping', random_mapping))
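
To give an intuition for why the simulated mapping is costly, the sketch below computes random features in NumPy under the assumption that the transform is the squared modulus of a complex Gaussian random projection, y = |Rx|^2. This is only an illustration of that kind of operation, not the simulator shipped with lightonml; simulated_random_features, R and seed are hypothetical names.

```python
import numpy as np

def simulated_random_features(X_binary, n_components, seed=0):
    """Hypothetical helper: |X R|^2 with R a complex Gaussian matrix (assumed model)."""
    rng = np.random.RandomState(seed)
    n_features = X_binary.shape[1]
    # The dense random matrix has n_features * n_components complex entries.
    real = rng.normal(size=(n_features, n_components))
    imag = rng.normal(size=(n_features, n_components))
    R = (real + 1j * imag) / np.sqrt(2)
    # Squared modulus of the complex projection gives real, non-negative features.
    return np.abs(X_binary @ R) ** 2

# Five random binary vectors with 784 features, the size of a flattened MNIST image.
X_bits = (np.random.RandomState(1).rand(5, 784) > 0.5).astype(np.float64)
features = simulated_random_features(X_bits, n_components=3000)
print(features.shape)  # (5, 3000)
```

In this sketch both the memory for R and the cost of the matrix product grow with n_features * n_components, which is consistent with using fewer random projections when simulating.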

Decoding

Some encoders, like SeparatedBitPlanEncoder, need a specific decoder to decode the random features. In this case we don’t need one, so we can use the NoDecoding class or just skip this pipeline step.

[12]:
from lightonml.encoding.base import NoDecoding
[13]:
pipeline_steps.append(('decoding', NoDecoding()))
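
The claim that a no-op step can simply be skipped can be checked on a CPU-only toy pipeline. The sketch below uses scikit-learn's built-in 'passthrough' placeholder as the no-op step (an assumption made for illustration; it is not the NoDecoding class) and Binarizer as a stand-in encoder, then compares predictions with and without the step.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

# Small synthetic dataset (assumption, for illustration only).
rng = np.random.RandomState(0)
X_small = rng.randint(0, 256, size=(100, 16)).astype(float)
y_small = (X_small[:, 0] > 127).astype(int)

# Pipeline with an explicit no-op step in the 'decoding' slot.
with_noop = Pipeline([('encoder', Binarizer(threshold=24)),
                      ('decoding', 'passthrough'),
                      ('classifier', RidgeClassifier())])

# Same pipeline with the step left out entirely.
without_step = Pipeline([('encoder', Binarizer(threshold=24)),
                         ('classifier', RidgeClassifier())])

pred_a = with_noop.fit(X_small, y_small).predict(X_small)
pred_b = without_step.fit(X_small, y_small).predict(X_small)
same_predictions = np.array_equal(pred_a, pred_b)
print(same_predictions)
```

Keeping a named no-op step can still be convenient: the slot stays in place if a later experiment swaps in an encoder that does need a decoder.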

Model

[14]:
classifier = RidgeClassifier()
[15]:
pipeline_steps.append(('classifier', classifier))

Instantiate and run the pipeline

[16]:
pipe = Pipeline(steps=pipeline_steps)
[17]:
pipe.named_steps
[17]:
{'encoder': <lightonml.encoding.base.BinaryThresholdEncoder at 0x7f58e6db6050>,
 'mapping': OPUMap(max_n_features=1000, n_components=3000, ndims=2,
        opu=<lightonml.opu.OPU object at 0x7f58e6dc6550>, simulated=True,
        verbose_level=0),
 'decoding': <lightonml.encoding.base.NoDecoding at 0x7f58e6dcc090>,
 'classifier': RidgeClassifier()}

Opening the OPU

[18]:
print('Fitting the model...')
pipe.fit(X_train, y_train)
Fitting the model...
[18]:
Pipeline(steps=[('encoder',
                 <lightonml.encoding.base.BinaryThresholdEncoder object at 0x7f58e6db6050>),
                ('mapping',
                 OPUMap(max_n_features=1000, n_components=3000, ndims=2,
                        opu=<lightonml.opu.OPU object at 0x7f58e6dc6550>,
                        simulated=True, verbose_level=0)),
                ('decoding',
                 <lightonml.encoding.base.NoDecoding object at 0x7f58e6dcc090>),
                ('classifier', RidgeClassifier())])
[19]:
train_accuracy = pipe.score(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)

print('Train accuracy {:.2f}'.format(train_accuracy * 100))
print('Test accuracy {:.2f}'.format(test_accuracy * 100))
Train accuracy 96.99
Test accuracy 95.81
[20]:
from sklearn.model_selection import ShuffleSplit, GridSearchCV

# grid for the values of alpha
alpha_values = 10. ** np.arange(-1, 1)
# define the parameters grid
grid_parameters = [{'classifier__alpha': alpha_values}]

# build cross validation scheme
cv_scheme = ShuffleSplit(n_splits=2, test_size=0.15)

grid_search = GridSearchCV(pipe, grid_parameters, scoring="accuracy",
                           cv=cv_scheme, refit=False, return_train_score=True)
grid_search.fit(X_train, y_train)
[20]:
GridSearchCV(cv=ShuffleSplit(n_splits=2, random_state=None, test_size=0.15, train_size=None),
             estimator=Pipeline(steps=[('encoder',
                                        <lightonml.encoding.base.BinaryThresholdEncoder object at 0x7f58e6db6050>),
                                       ('mapping',
                                        OPUMap(max_n_features=1000,
                                               n_components=3000, ndims=2,
                                               opu=<lightonml.opu.OPU object at 0x7f58e6dc6550>,
                                               simulated=True,
                                               verbose_level=0)),
                                       ('decoding',
                                        <lightonml.encoding.base.NoDecoding object at 0x7f58e6dcc090>),
                                       ('classifier', RidgeClassifier())]),
             param_grid=[{'classifier__alpha': array([0.1, 1. ])}], refit=False,
             return_train_score=True, scoring='accuracy')
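
Because the search above is created with refit=False, no final model is retrained on the full training set. The sketch below, on a CPU-only toy pipeline with Binarizer as a stand-in encoder and synthetic data (both assumptions for illustration), shows what is still available in that case: with a single scoring metric, best_params_ and best_score_ are set, while best_estimator_ is not.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

# Small synthetic dataset (assumption, for illustration only).
rng = np.random.RandomState(0)
X_toy = rng.randint(0, 256, size=(300, 20)).astype(float)
y_toy = (X_toy[:, 0] > 127).astype(int)

toy_pipe = Pipeline([('encoder', Binarizer(threshold=24)),
                     ('classifier', RidgeClassifier())])

toy_search = GridSearchCV(
    toy_pipe,
    [{'classifier__alpha': 10. ** np.arange(-1, 1)}],  # alpha in {0.1, 1.0}
    scoring='accuracy',
    # Fixing random_state makes the two shuffled splits reproducible.
    cv=ShuffleSplit(n_splits=2, test_size=0.15, random_state=0),
    refit=False,
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
print(toy_search.best_score_)

# With refit=False there is no refitted model to predict with.
has_best_estimator = hasattr(toy_search, 'best_estimator_')
print(has_best_estimator)
```

To obtain a usable model afterwards, one option is to set the winning parameters on the pipeline with set_params(**best_params_) and fit it again on the training data.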
[21]:
import pandas as pd

pd.DataFrame.from_dict(grid_search.cv_results_)
[21]:
(table shown transposed: one column per row of the DataFrame)

                           0                            1
mean_fit_time              4.585277                     4.862778
std_fit_time               0.007682                     0.232286
mean_score_time            0.554305                     0.541315
std_score_time             0.011957                     0.009398
param_classifier__alpha    0.1                          1.0
params                     {'classifier__alpha': 0.1}   {'classifier__alpha': 1.0}
split0_test_score          0.958333                     0.958333
split1_test_score          0.958222                     0.958222
mean_test_score            0.958278                     0.958278
std_test_score             0.000056                     0.000056
rank_test_score            1                            1
split0_train_score         0.971373                     0.971333
split1_train_score         0.971216                     0.971157
mean_train_score           0.971294                     0.971245
std_train_score            0.000078                     0.000088