Scikit-learn Pipeline and GridSearchCV with the OPU

The OPUMap wrapper for scikit-learn, found in lightonml.projections.sklearn, can be used as a step in a Pipeline; this in turn makes it possible to, for example, run a grid search over hyperparameters with GridSearchCV.

[1]:
import warnings
warnings.filterwarnings('ignore')
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier
import numpy as np

from lightonml.datasets import MNIST
[2]:
random_state = np.random.RandomState(1234)
[3]:
(X_train, y_train), (X_test, y_test) = MNIST()
X, y = np.concatenate([X_train, X_test]), np.concatenate([y_train, y_test])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10000,
                                                    random_state=42)
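
The full MNIST set (70,000 images) is concatenated and re-split into 60,000 training and 10,000 test samples. A quick sanity check is sketched below; the shapes in the comments are assumptions (28x28 uint8 images), consistent with the 2D mapping used later, rather than guaranteed output of the loader.

# Sanity-check sketch (shape and dtype values in the comments are assumptions):
# print(X_train.shape, X_test.shape)   # expected: (60000, 28, 28) (10000, 28, 28)
# print(X_train.dtype)                 # expected: uint8 grayscale pixels in [0, 255]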

OPU pipeline

To chain the processing steps, we can conveniently use sklearn.pipeline.Pipeline. This also makes it easy to cross-validate the hyperparameters of the model.
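
For reference, the steps assembled one by one in the cells below could equivalently be passed to Pipeline in a single call. The commented sketch below only previews the structure we are about to build; each object is introduced in its own section.

# Sketch of the final structure (built incrementally in the cells that follow):
# pipe = Pipeline([('encoder', BinaryThresholdEncoder()),
#                  ('mapping', OPUMap(n_components=10000, ndims=2)),
#                  ('decoding', NoDecoding()),
#                  ('classifier', RidgeClassifier())])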

[4]:
from sklearn.pipeline import Pipeline
[5]:
pipeline_steps = []

Data and data encoding

[6]:
from lightonml.encoding.base import BinaryThresholdEncoder
[7]:
encoder = BinaryThresholdEncoder()
print('Encoder threshold: ', encoder.threshold_enc)
Encoder threshold:  25
[8]:
pipeline_steps.append(('encoder', encoder))
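
The encoder turns grayscale pixel values into binary features by comparing them to the threshold printed above. A minimal standalone sketch follows; the exact output convention (values above the threshold mapping to 1) is an assumption to check against the lightonml encoding documentation.

# Standalone encoding sketch (assumes values above threshold_enc map to 1):
# toy = np.array([[0, 10, 25, 200]], dtype='uint8')
# print(encoder.transform(toy))   # expected: a binary array such as [[0 0 0 1]]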

Random Mapping on the OPU

[9]:
from lightonml.projections.sklearn import OPUMap
[10]:
random_mapping = OPUMap(n_components=10000,  # number of random projections
                        ndims=2)
[11]:
pipeline_steps.append(('mapping', random_mapping))
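
Conceptually, the OPU computes random features of the form |Rx|^2, where R is a fixed complex random matrix and x is the binary input. The NumPy sketch below is only meant to convey that idea on toy data; it does not use the OPU API and does not reproduce the hardware transform exactly.

# Conceptual NumPy sketch of |Rx|^2 random features (not the OPU API):
rng = np.random.RandomState(0)
x_toy = rng.randint(0, 2, size=(5, 784))               # 5 binary toy inputs
R = rng.randn(100, 784) + 1j * rng.randn(100, 784)     # fixed complex random matrix
features = np.abs(x_toy @ R.T) ** 2                    # non-negative random features
print(features.shape)                                  # (5, 100)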

Decoding

Some encoders, like SeparatedBitPlanEncoder, need a specific decoder to decode the random features. In this case we don’t need one, so we can use the NoDecoding class or just skip this pipeline step.

[12]:
from lightonml.encoding.base import NoDecoding
[13]:
pipeline_steps.append(('decoding', NoDecoding()))
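
For comparison, had we used SeparatedBitPlanEncoder, the matching decoder in lightonml is MixingBitPlanDecoder. The commented sketch below shows how that pairing would slot into the same pipeline; default settings are assumed to be compatible, so check the encoding documentation before relying on it.

# Alternative encoder/decoder pairing (commented sketch, defaults assumed compatible):
# from lightonml.encoding.base import SeparatedBitPlanEncoder, MixingBitPlanDecoder
# pipeline_steps = [('encoder', SeparatedBitPlanEncoder()),
#                   ('mapping', random_mapping),
#                   ('decoding', MixingBitPlanDecoder()),
#                   ('classifier', RidgeClassifier())]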

Model

[14]:
classifier = RidgeClassifier()
[15]:
pipeline_steps.append(('classifier', classifier))

Instantiate and run the pipeline

[16]:
pipe = Pipeline(steps=pipeline_steps)
[17]:
pipe.named_steps
[17]:
{'classifier': RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
                 max_iter=None, normalize=False, random_state=None,
                 solver='auto', tol=0.001),
 'decoding': <lightonml.encoding.base.NoDecoding at 0x7fe160e48278>,
 'encoder': <lightonml.encoding.base.BinaryThresholdEncoder at 0x7fe164136ba8>,
 'mapping': OPUMap(max_n_features=None, n_2d_features=None, n_components=10000, ndims=2,
        opu=<lightonopu.opu.OPU object at 0x7fe160e402e8>, packed=False,
        simulated=False, verbose_level=None)}
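
The step names double as parameter prefixes: every hyperparameter of a step is reachable as step__parameter, which is exactly the naming GridSearchCV relies on further down. A small sketch:

# Step names become parameter prefixes (this is what GridSearchCV uses below):
print(pipe.get_params()['classifier__alpha'])   # alpha of the 'classifier' step
# pipe.set_params(classifier__alpha=0.1)        # would update the same parameter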

Opening the OPU

[18]:
random_mapping.opu.open()
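
The OPU is a shared hardware resource, so it should be released with close() once the transforms are done, as is done near the end of this notebook. A defensive sketch, using only the calls already shown here:

# Defensive pattern (sketch): release the OPU even if a later step raises.
# try:
#     pipe.fit(X_train, y_train)
# finally:
#     random_mapping.opu.close()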
[19]:
print('Fitting the model...')
pipe.fit(X_train, y_train)
Fitting the model...
[19]:
Pipeline(memory=None,
         steps=[('encoder',
                 <lightonml.encoding.base.BinaryThresholdEncoder object at 0x7fe164136ba8>),
                ('mapping',
                 OPUMap(max_n_features=None, n_2d_features=None,
                        n_components=10000, ndims=2,
                        opu=<lightonopu.opu.OPU object at 0x7fe160e402e8>,
                        packed=False, simulated=False, verbose_level=None)),
                ('decoding',
                 <lightonml.encoding.base.NoDecoding object at 0x7fe160e48278>),
                ('classifier',
                 RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True,
                                 fit_intercept=True, max_iter=None,
                                 normalize=False, random_state=None,
                                 solver='auto', tol=0.001))],
         verbose=False)
[20]:
train_accuracy = pipe.score(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)

print('Train accuracy {:.2f}'.format(train_accuracy * 100))
print('Test accuracy {:.2f}'.format(test_accuracy * 100))
Train accuracy 98.59
Test accuracy 96.66
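
Predictions for individual samples go through the full chain (encoding, optical random mapping, decoding, classifier) transparently, as long as the OPU is still open:

# The whole pipeline is applied at prediction time as well:
y_pred = pipe.predict(X_test[:10])
print(y_pred)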
[21]:
from sklearn.model_selection import ShuffleSplit, GridSearchCV

# grid for the values of alpha
alpha_values = 10. ** np.arange(-1, 1)
# define the parameters grid
grid_parameters = [{'classifier__alpha': alpha_values}]

# build cross validation scheme
cv_scheme = ShuffleSplit(n_splits=2, test_size=0.15)

grid_search = GridSearchCV(pipe, grid_parameters, cv=cv_scheme, refit=False, return_train_score=True)
grid_search.fit(X_train, y_train)
[21]:
GridSearchCV(cv=ShuffleSplit(n_splits=2, random_state=None, test_size=0.15, train_size=None),
             error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('encoder',
                                        <lightonml.encoding.base.BinaryThresholdEncoder object at 0x7fe164136ba8>),
                                       ('mapping',
                                        OPUMap(max_n_features=None,
                                               n_2d_features=None,
                                               n_components=10000, ndims=2,
                                               opu=<lightonopu.opu.OPU...
                                        <lightonml.encoding.base.NoDecoding object at 0x7fe160e48278>),
                                       ('classifier',
                                        RidgeClassifier(alpha=1.0,
                                                        class_weight=None,
                                                        copy_X=True,
                                                        fit_intercept=True,
                                                        max_iter=None,
                                                        normalize=False,
                                                        random_state=None,
                                                        solver='auto',
                                                        tol=0.001))],
                                verbose=False),
             iid='warn', n_jobs=None,
             param_grid=[{'classifier__alpha': array([0.1, 1. ])}],
             pre_dispatch='2*n_jobs', refit=False, return_train_score=True,
             scoring=None, verbose=0)
[22]:
random_mapping.opu.close()
[23]:
import pandas as pd

pd.DataFrame.from_dict(grid_search.cv_results_)
[23]:
   mean_fit_time  mean_score_time  mean_test_score  mean_train_score param_classifier__alpha                      params  rank_test_score
0      60.480346         6.694998         0.967500          0.988422                     0.1  {'classifier__alpha': 0.1}                1
1      61.341593         6.723412         0.966778          0.988451                       1  {'classifier__alpha': 1.0}                2

   split0_test_score  split0_train_score  split1_test_score  split1_train_score  std_fit_time  std_score_time  std_test_score  std_train_score
0           0.967778            0.988275           0.967222            0.988569      0.206407        0.003927        0.000278         0.000147
1           0.968556            0.988902           0.965000            0.988000      0.768301        0.015832        0.001778         0.000451
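
Since refit=False, GridSearchCV does not retrain a best_estimator_ on the full training set, but the winning hyperparameters can still be read directly from cv_results_. A small sketch using only the keys shown in the table above:

# Pick the best alpha from the cross-validation results (refit=False, so no best_estimator_):
best_idx = int(np.argmax(grid_search.cv_results_['mean_test_score']))
print(grid_search.cv_results_['params'][best_idx])           # {'classifier__alpha': 0.1}
print(grid_search.cv_results_['mean_test_score'][best_idx])  # 0.9675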