Scikit-learn Pipeline and GridSearchCV with the OPU¶
You can use the OPUMap wrapper for scikit-learn, found in lightonml.projections.sklearn, as a step in a Pipeline and, for example, run a grid search over the parameters with GridSearchCV.
This notebook also shows how to use a simulated OPU in case you don't have access to a real one.
[1]:
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier
import numpy as np
from lightonml.datasets import MNIST
[2]:
random_state = np.random.RandomState(1234)
[3]:
(X_train, y_train), (X_test, y_test) = MNIST()
X, y = np.concatenate([X_train, X_test]), np.concatenate([y_train, y_test])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10000,
                                                    random_state=42)
OPU pipeline¶
To define a flow of operations, we can conveniently use sklearn.pipeline.Pipeline. In this way, we can easily perform cross-validation on the hyperparameters of the model.
[4]:
from sklearn.pipeline import Pipeline
[5]:
pipeline_steps = []
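As a minimal sketch of how a Pipeline chains named steps, the example below uses synthetic data and a StandardScaler as a stand-in for the encoder and OPU mapping (both are assumptions made only so the sketch runs without lightonml or an OPU). It also shows the `<step name>__<parameter>` naming used to reach a parameter inside a step, which is what makes cross-validating hyperparameters straightforward.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data, used only to make the sketch runnable.
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 5)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Steps are (name, estimator) pairs; every step before the last one
# must be a transformer, and the last one can be a classifier.
demo_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),       # stand-in for encoder + OPU mapping
    ('classifier', RidgeClassifier()),
])

# A parameter of a step is addressed as "<step name>__<parameter>".
demo_pipe.set_params(classifier__alpha=0.5)

demo_pipe.fit(X_demo, y_demo)
demo_accuracy = demo_pipe.score(X_demo, y_demo)
print('Accuracy on the synthetic data: {:.2f}'.format(demo_accuracy * 100))
```

The step names chosen here ('scaler', 'classifier') are arbitrary; whatever names you pick become the prefixes used later in a parameter grid.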
Data and data encoding¶
[6]:
from lightonml.encoding.base import BinaryThresholdEncoder
[7]:
encoder = BinaryThresholdEncoder(threshold_enc=24)
print('Encoder threshold: ', encoder.threshold_enc)
Encoder threshold: 24
[8]:
pipeline_steps.append(('encoder', encoder))
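To make the role of the threshold concrete, here is a small NumPy sketch of binary threshold encoding. It assumes the encoder maps each value to 1 when it is strictly greater than the threshold and to 0 otherwise; the helper `binary_threshold` is hypothetical and the actual BinaryThresholdEncoder may treat the boundary value differently.

```python
import numpy as np

def binary_threshold(X, threshold):
    """Return a uint8 array holding 1 where X > threshold and 0 elsewhere.

    Hypothetical helper for illustration; not the lightonml implementation.
    """
    return (np.asarray(X) > threshold).astype(np.uint8)

# Five grey-level values, as found in an MNIST image.
pixels = np.array([[0, 10, 24, 25, 255]], dtype=np.uint8)
encoded = binary_threshold(pixels, threshold=24)
print(encoded)
```

With a threshold of 24, only the last two values in this example exceed the threshold, so the binary input contains ones only at those positions.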
Random Mapping on the OPU¶
OPUMap can be initialized with simulated=True if you run the notebook without access to an OPU; set the simulated variable below to True in that case. Note that the number of random projections must then be lowered, since the simulation is much slower than a real OPU.
[9]:
from lightonml.projections.sklearn import OPUMap
[10]:
simulated = False  # Change to True if you don't have a real OPU
if simulated:
    random_mapping = OPUMap(n_components=3000,  # number of random projections
                            simulated=True, max_n_features=1000,
                            ndims=2)
else:
    random_mapping = OPUMap(n_components=10000,  # number of random projections
                            ndims=2)
pipeline_steps.append(('mapping', random_mapping))
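The notebook does not spell out what the random mapping computes. As a rough sketch, the code below assumes a model often used to describe this kind of optical random projection: the squared modulus of a projection through a fixed complex Gaussian matrix. The function `simulated_random_features` and its `seed` parameter are hypothetical and are not the lightonml simulation code. The sketch also hints at why the simulated case uses fewer components, since the matrix product scales with the number of input features times the number of projections.

```python
import numpy as np

def simulated_random_features(X_binary, n_components, seed=0):
    """Project binary inputs through a complex Gaussian matrix and
    return the squared modulus (assumed model, for illustration only)."""
    rng = np.random.RandomState(seed)
    n_features = X_binary.shape[1]
    # Fixed random matrix with independent real and imaginary parts.
    R = (rng.randn(n_features, n_components)
         + 1j * rng.randn(n_features, n_components))
    # Cost of this product grows with n_features * n_components.
    return np.abs(X_binary @ R) ** 2

# Four random binary vectors with as many entries as a flattened MNIST image.
X_bits = (np.random.RandomState(1).rand(4, 784) > 0.5).astype(np.uint8)
features = simulated_random_features(X_bits, n_components=300)
print(features.shape)
```

The output has one row per input sample and one column per random projection, and every entry is non-negative because it is a squared modulus.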
Decoding¶
Some encoders, like SeparatedBitPlanEncoder, need a specific decoder to decode the random features. In this case we don’t need one, so we can use the NoDecoding class or just skip this pipeline step.
[12]:
from lightonml.encoding.base import NoDecoding
[13]:
pipeline_steps.append(('decoding', NoDecoding()))
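The two options mentioned above (an explicit do-nothing step, or no step at all) can be illustrated with plain scikit-learn. This sketch assumes that a no-op decoder behaves like an identity transform; it uses FunctionTransformer with no function, which scikit-learn documents as the identity, and the 'passthrough' placeholder that Pipeline accepts for a skipped step. Neither is the lightonml NoDecoding class itself.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

X_demo = np.arange(6, dtype=float).reshape(3, 2)

# Option 1: an explicit step that returns its input unchanged.
explicit = Pipeline(steps=[('decoding', FunctionTransformer())])

# Option 2: keep the step name but mark it as skipped.
skipped = Pipeline(steps=[('decoding', 'passthrough')])

out_explicit = explicit.fit_transform(X_demo)
out_skipped = skipped.fit_transform(X_demo)
print(np.array_equal(out_explicit, X_demo),
      np.array_equal(out_skipped, X_demo))
```

Keeping a named placeholder can be convenient when you later want to swap in a real decoder without changing the rest of the pipeline.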
Instantiate and run the pipeline¶
[16]:
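pipeline_steps.append(('classifier', RidgeClassifier()))  # final step, as listed in the output below
pipe = Pipeline(steps=pipeline_steps)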
[17]:
pipe.named_steps
[17]:
{'encoder': <lightonml.encoding.base.BinaryThresholdEncoder at 0x7f58e6db6050>,
'mapping': OPUMap(max_n_features=1000, n_components=3000, ndims=2,
opu=<lightonml.opu.OPU object at 0x7f58e6dc6550>, simulated=True,
verbose_level=0),
'decoding': <lightonml.encoding.base.NoDecoding at 0x7f58e6dcc090>,
'classifier': RidgeClassifier()}
Opening the OPU¶
[18]:
print('Fitting the model...')
pipe.fit(X_train, y_train)
Fitting the model...
[18]:
Pipeline(steps=[('encoder',
<lightonml.encoding.base.BinaryThresholdEncoder object at 0x7f58e6db6050>),
('mapping',
OPUMap(max_n_features=1000, n_components=3000, ndims=2,
opu=<lightonml.opu.OPU object at 0x7f58e6dc6550>,
simulated=True, verbose_level=0)),
('decoding',
<lightonml.encoding.base.NoDecoding object at 0x7f58e6dcc090>),
('classifier', RidgeClassifier())])
[19]:
train_accuracy = pipe.score(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print('Train accuracy {:.2f}'.format(train_accuracy * 100))
print('Test accuracy {:.2f}'.format(test_accuracy * 100))
Train accuracy 96.99
Test accuracy 95.81
[20]:
from sklearn.model_selection import ShuffleSplit, GridSearchCV
# grid for the values of alpha
alpha_values = 10. ** np.arange(-1, 1)
# define the parameters grid
grid_parameters = [{'classifier__alpha': alpha_values}]
# build cross validation scheme
cv_scheme = ShuffleSplit(n_splits=2, test_size=0.15)
grid_search = GridSearchCV(pipe, grid_parameters, scoring="accuracy", cv=cv_scheme, refit=False, return_train_score=True)
grid_search.fit(X_train, y_train)
[20]:
GridSearchCV(cv=ShuffleSplit(n_splits=2, random_state=None, test_size=0.15, train_size=None),
estimator=Pipeline(steps=[('encoder',
<lightonml.encoding.base.BinaryThresholdEncoder object at 0x7f58e6db6050>),
('mapping',
OPUMap(max_n_features=1000,
n_components=3000, ndims=2,
opu=<lightonml.opu.OPU object at 0x7f58e6dc6550>,
simulated=True,
verbose_level=0)),
('decoding',
<lightonml.encoding.base.NoDecoding object at 0x7f58e6dcc090>),
('classifier', RidgeClassifier())]),
param_grid=[{'classifier__alpha': array([0.1, 1. ])}], refit=False,
return_train_score=True, scoring='accuracy')
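Because the search above is run with refit=False, no final model is retrained on the full training set. The sketch below, on synthetic data with a StandardScaler standing in for the OPU steps (both assumptions), shows one way to read the selected parameters and refit the pipeline by hand. The fixed random_state on ShuffleSplit is an addition for reproducibility and is not part of the notebook's call.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data, used only to make the sketch runnable.
rng = np.random.RandomState(0)
X_demo = rng.randn(300, 8)
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)

demo_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                            ('classifier', RidgeClassifier())])

alpha_grid = 10. ** np.arange(-1, 1)  # the two values 0.1 and 1.0
cv = ShuffleSplit(n_splits=2, test_size=0.15, random_state=0)
search = GridSearchCV(demo_pipe, [{'classifier__alpha': alpha_grid}],
                      scoring='accuracy', cv=cv, refit=False)
search.fit(X_demo, y_demo)

# refit=False: best_params_ is available, best_estimator_ is not.
best_params = search.best_params_
print(best_params)

# Refit the pipeline manually with the selected parameters.
final_model = demo_pipe.set_params(**best_params).fit(X_demo, y_demo)
final_accuracy = final_model.score(X_demo, y_demo)
```

Setting refit=True instead would let GridSearchCV do this last step itself, at the cost of one extra fit on the whole training set.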
[21]:
import pandas as pd
pd.DataFrame.from_dict(grid_search.cv_results_)
[21]:
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_classifier__alpha | params | split0_test_score | split1_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.585277 | 0.007682 | 0.554305 | 0.011957 | 0.1 | {'classifier__alpha': 0.1} | 0.958333 | 0.958222 | 0.958278 | 0.000056 | 1 | 0.971373 | 0.971216 | 0.971294 | 0.000078 |
| 1 | 4.862778 | 0.232286 | 0.541315 | 0.009398 | 1.0 | {'classifier__alpha': 1.0} | 0.958333 | 0.958222 | 0.958278 | 0.000056 | 1 | 0.971333 | 0.971157 | 0.971245 | 0.000088 |
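The full results table is wide, so it can help to keep only a few columns. The sketch below rebuilds a small DataFrame from values copied out of the table above (in the notebook you would start from `pd.DataFrame.from_dict(grid_search.cv_results_)` instead) and adds a hypothetical `train_test_gap` column as a quick overfitting indicator.

```python
import pandas as pd

# Values copied from the results table above, for illustration only.
results = pd.DataFrame({
    'param_classifier__alpha': [0.1, 1.0],
    'mean_test_score': [0.958278, 0.958278],
    'std_test_score': [0.000056, 0.000056],
    'mean_train_score': [0.971294, 0.971245],
    'rank_test_score': [1, 1],
})

# Difference between train and test accuracy for each alpha.
results['train_test_gap'] = (results['mean_train_score']
                             - results['mean_test_score'])

# Keep the most informative columns, best-ranked rows first.
columns = ['param_classifier__alpha', 'mean_test_score',
           'std_test_score', 'train_test_gap', 'rank_test_score']
summary = results.sort_values('rank_test_score', kind='mergesort')[columns]
print(summary)
```

Both values of alpha share rank 1 in this run, so the gap column is the only figure in the summary that differs between them.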