{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Change point detection using the OPU online mode\n", "\n", "Some machine learning algorithms are developed to be used *online*: processing one sample at a time. In this case, the optical transform can be optimized using the *online* mode introduced in `lightonopu 1.2`.\n", "\n", "We are going to show how this work using NEWMA, an online change point detection method. For more information on the algorithm: https://arxiv.org/abs/1805.08061." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The data\n", "\n", "We prepare a time series where samples are drawn from a mixture of Guaussians that changes every `n` timesteps. You can skip the details of the data generation if you are not interested and go directly to the next section." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "@author: nkeriven, taken from https://github.com/lightonai/newma\n", "\"\"\"\n", "import numpy as np\n", "from sklearn import mixture\n", "\n", "\n", "def gmdraw(weights, mu, Sigma, n):\n", " k, d, p = weights.shape[0], mu.shape[1], np.cumsum(weights)\n", " label = np.random.rand(n)\n", " for i in range(n):\n", " label[i] = np.sum(label[i] > p)\n", " cSigma = np.zeros((k, d, d))\n", " for l in range(k):\n", " cSigma[l, :, :] = np.linalg.cholesky(Sigma[l, :, :])\n", " X = np.zeros((n, d))\n", " for i in range(n):\n", " j = int(label[i])\n", " X[i, :] = mu[j, :] + np.dot(np.random.randn(1, d), cSigma[j, :, :])\n", " return X, label\n", "\n", "\n", "def generateGMM(d=10, k=10, n=1000, std_mean=1, concentration_wishart=30, concentration_dirichlet=5):\n", " concentration_wishart = np.max((concentration_wishart, 3))\n", " weights = np.random.dirichlet(concentration_dirichlet*np.ones(k))\n", " mu = std_mean*k**(1/d)*np.random.randn(k, d)\n", " Sigma = np.zeros((k, d))\n", " for l in range(k):\n", " Sigma[l, :] = (concentration_wishart - 2)/np.sum(np.random.randn(int(concentration_wishart), d)**2, \n", " axis=0)\n", " clf = mixture.GaussianMixture(n_components=k, covariance_type='diag')\n", " clf.means_ = mu\n", " clf. covariances_ = Sigma\n", " clf.precisions_cholesky_ = mixture._gaussian_mixture._compute_precision_cholesky(\n", " Sigma, clf.covariance_type)\n", " clf.weights_ = weights\n", " X, label = clf.sample(n_samples=n)\n", " p = np.random.permutation(n)\n", " X, label = X[p, :], label[p]\n", " generated_data = {'data': X, 'weights': weights,\n", " 'means': mu, 'cov': Sigma,\n", " 'label': label, 'gmm': clf}\n", " return generated_data\n", "\n", "\n", "def stream_GMM(d=10, k=10, n=1000, nb_change=50, std_mean=0.2, \n", " concentration_wishart=30, concentration_dirichlet=5):\n", " X = np.zeros((n*(nb_change), d))\n", " ground_truth = np.zeros(n*(nb_change))\n", " for i in range(nb_change):\n", " GM = generateGMM(d=d, k=k, n=n, std_mean=std_mean, concentration_wishart=concentration_wishart,\n", " concentration_dirichlet=concentration_dirichlet)\n", " X[i*n:(i+1)*n, :] = GM['data']\n", " if i != 0:\n", " ground_truth[i*n] = 1\n", " return X, ground_truth" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "d = 5000\n", "n, nb_change = 250, 100\n", "\n", "k = 10\n", "std_mean = 0.15 # the bigger, the more change in means\n", "wishart = 3 # the bigger, the less change in diagonal variances\n", "\n", "X, ground_truth = stream_GMM(d=d, n=n, nb_change=nb_change, std_mean=std_mean, concentration_wishart=wishart,\n", " k=k)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# we binarize the data using 38 levels\n", "n_levels = 38\n", "Xencode = np.empty((X.shape[0], n_levels * X.shape[1]), dtype='uint8')\n", "mi, Ma = np.min(X), np.max(X) # rescale to 0 255\n", "X = 255 * ((X - mi) / (Ma - mi))\n", "X = X.astype('uint8')\n", "\n", "for i in range(n_levels):\n", " Xencode[:, i * X.shape[1]:(i + 1) * X.shape[1]] = X > 65 + i * 5\n", "del X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing the detector class" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import norm\n", "\n", "class NEWMA():\n", " def __init__(self, init_sample, forget_factor=0.05, forget_factor2=0.1,\n", " feat_func=lambda x: x, dist_func=lambda z1, z2: linalg.norm(z1 - z2),\n", " thresholding_method='adapt', thresholding_quantile=0.95,\n", " fixed_threshold=None, adapt_forget_factor=0.05,\n", " store_values=True):\n", " self.statistic = 0 # Current value of the detection statistic\n", " self.thresholding_method = thresholding_method\n", " self.thresholding_mult = norm.ppf(thresholding_quantile)\n", " self.fixed_threshold = fixed_threshold\n", " self.adapt_forget_factor = adapt_forget_factor\n", " # for adaptive threshold method\n", " self.adapt_forget_factor = adapt_forget_factor\n", " # Current estimated mean and moment of order 2 of the statistic squared\n", " # NOTE: the adaptive threshold is based on the assumption that the squared statistic is approximately gaussian.\n", " self.adapt_mean = 0\n", " self.adapt_second_moment = 0\n", " # history of statistic\n", " self.store_values = store_values\n", " self.stat_stored = []\n", " self.ewma = feat_func(init_sample) # current sketch\n", " self.ewma2 = feat_func(init_sample) # current skech2\n", " self.forget_factor = forget_factor # update coeff for sketch\n", " self.forget_factor2 = forget_factor2 # update coeff for sketch2\n", " self.feat_func = feat_func # mapping Psi (identity, random features...)\n", " self.dist_func = dist_func # function to compute the distance (may return an array for block distances)\n", "\n", " def apply_to_data(self, data):\n", " count = 0\n", " for d in data:\n", " self.update(d)\n", " count += 1\n", " if count % 5000 == 0:\n", " print(f\"{count}/{len(data)}\")\n", "\n", " def flag_sample(self):\n", " if self.thresholding_method == 'adapt':\n", " return self.statistic > np.sqrt(\n", " self.adapt_mean + self.thresholding_mult * np.sqrt(self.adapt_second_moment - self.adapt_mean ** 2))\n", " elif self.thresholding_method == 'fixed':\n", " return self.statistic > self.fixed_threshold\n", " else:\n", " return TypeError('Thresholding method not recognised.')\n", "\n", " def update(self, new_sample):\n", " self.statistic = self.update_stat(\n", " new_sample) # compute the new detection statistic b the user-implemented function\n", "\n", " # compute adaptive detection result\n", " self.adapt_mean = (\n", " 1 - self.adapt_forget_factor) * self.adapt_mean + self.adapt_forget_factor * self.statistic ** 2\n", " self.adapt_second_moment = (\n", " 1 - self.adapt_forget_factor) * self.adapt_second_moment + self.adapt_forget_factor * self.statistic ** 4\n", "\n", " res = self.flag_sample()\n", " # if history is stored\n", " if self.store_values:\n", " thres = np.sqrt(\n", " self.adapt_mean + self.thresholding_mult * np.sqrt(self.adapt_second_moment - self.adapt_mean ** 2))\n", " self.stat_stored.append((self.statistic, thres, res))\n", "\n", " return res # return the result\n", "\n", " def update_stat(self, new_sample):\n", " temp = self.feat_func(new_sample)\n", " # sketches\n", " self.ewma = (1 - self.forget_factor) * self.ewma + self.forget_factor * temp\n", " self.ewma2 = (1 - self.forget_factor2) * self.ewma2 + self.forget_factor2 * temp\n", " # distance\n", " return self.dist_func(self.ewma, self.ewma2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For hyperparameter selection:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from scipy import linalg\n", "from scipy import optimize as opt\n", "\n", "def convert_parameters(window_size, forget_factor):\n", " \"\"\"From the window_size and one forgetting factor, compute the other forgetting factor..\n", " \"\"\"\n", " w_ = window_size\n", " C = forget_factor * (1 - forget_factor) ** w_\n", "\n", " # educated guess for initialization\n", " if forget_factor > 1 / (w_ + 1):\n", " init = 1 / (2 * (w_ + 1))\n", " else:\n", " init = 2 / (w_ + 1)\n", "\n", " def func(x):\n", " return (x * (1 - x) ** w_ - C) ** 2\n", "\n", " def grad(x):\n", " return ((1 - x) ** w_ - w_ * x * (1 - x) ** (w_ - 1)) * 2 * (x * (1 - x) ** w_ - C)\n", "\n", " return opt.minimize(func, jac=grad, x0=init, bounds=((0, 1),), tol=1e-20).x[0]\n", "\n", "\n", "def select_optimal_parameters(window_size, grid_size=1000):\n", " \"\"\"From the window_size, give the best newma parameters, w.r.t. the error bound in the paper.\n", " \"\"\"\n", " def error_bound(L, l):\n", " numerator = (np.sqrt(L + l) + ((1 - l) ** (2 * window_size) - (1 - L) ** (2 * window_size)))\n", " denominator = ((1 - l) ** window_size - (1 - L) ** window_size)\n", " return numerator / denominator\n", "\n", " ax = np.exp(np.linspace(np.log(1.001 / (window_size + 1)), -0.01, grid_size))\n", " errors = np.zeros(grid_size)\n", " for ind, L in zip(range(grid_size), ax):\n", " l = convert_parameters(window_size, L)\n", " errors[ind] = error_bound(L, l)\n", " Lambda = (ax[np.argmin(errors)] + 1 / (window_size + 1)) / 2\n", " return Lambda, convert_parameters(window_size, Lambda)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Configuration of NEWMA" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Newma config\n", "B = 50 # window size\n", "big_Lambda, small_lambda = select_optimal_parameters(B) # forget factors chosen with heuristic in the paper\n", "thres_ff = small_lambda\n", "# number of random features is set automatically with this criterion\n", "m_OPU = 10 * int((1 / 4) / (small_lambda + big_Lambda) ** 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing the OPU in online mode\n", "\n", "To optimize the OPU for usage with one sample at a time, you just need to pass `online=True` when you call the [fit1d](../lightonml.opu.rst#lightonml.opu.OPU.fit1d) method." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from lightonml import OPU\n", "opu = OPU(n_components=m_OPU)\n", "opu.fit1d(n_features=Xencode.shape[1], online=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Detecting change points" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def feature_function(x):\n", " return opu.transform(x).astype('float32')" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5000/25000\n", "10000/25000\n", "15000/25000\n", "20000/25000\n", "25000/25000\n", "NEWMA Online took: 138.3667573928833\n" ] } ], "source": [ "# convert to float online to avoid memory error\n", "import time\n", "mult = 1.5\n", "detector = NEWMA(Xencode[0], forget_factor=big_Lambda, forget_factor2=small_lambda,\n", " feat_func=feature_function, adapt_forget_factor=thres_ff*mult,\n", " thresholding_quantile=0.95, dist_func=lambda z1, z2: np.linalg.norm(z1 - z2))\n", "start = time.time()\n", "detector.apply_to_data(Xencode)\n", "print('NEWMA Online took:', time.time() - start)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Performance of the algorithm:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def evaluate_detection(ground_truth, flagged):\n", " n = ground_truth.shape[0]\n", " if n != flagged.shape[0]:\n", " print('error', n, flagged.shape[0])\n", " cp = np.zeros(n, dtype=bool)\n", " for i in range(n-1):\n", " if not flagged[i] and flagged[i + 1]:\n", " cp[i] = 1\n", " EDD, not_detected, FA = 0, 0, 0\n", " num_change = int(ground_truth.sum())\n", " where_change = np.concatenate((np.argwhere(ground_truth).flatten(), np.array([n])))\n", " for i in range(num_change):\n", " begin_ind = where_change[i]\n", " end_ind = where_change[i + 1]\n", " middle_ind = int((begin_ind + end_ind) / 2)\n", " i = begin_ind\n", " while i <= middle_ind and not cp[i]:\n", " i = i+1\n", " if cp[i]:\n", " EDD += i - begin_ind\n", " else:\n", " not_detected += 1\n", " FA += cp[middle_ind:end_ind].sum()\n", " results = {'EDD': EDD / np.max((num_change - not_detected, 1)),\n", " 'not_detected': 100 * not_detected / num_change,\n", " 'false_alarm': FA / num_change, 'cp': cp}\n", " return results\n", "\n", "\n", "def compute_curves(ground_truth, dist,\n", " num_points=50,\n", " start_coeff=1.3, end_coeff=2,\n", " thres_values=np.array([np.nan]),\n", " thres_offset=0):\n", " if np.isnan(thres_values)[0]:\n", " thres_values = np.mean(dist)\n", " thres_levels = np.linspace(start_coeff, end_coeff, num_points)\n", " EDDs = np.zeros(num_points)\n", " FAs = np.zeros(num_points)\n", " NDs = np.zeros(num_points)\n", " for i in range(num_points):\n", " flagged_points = dist > thres_levels[i] * thres_values + thres_offset\n", " res = evaluate_detection(ground_truth, flagged_points)\n", " EDDs[i] = res['EDD']\n", " FAs[i] = res['false_alarm']\n", " NDs[i] = res['not_detected']\n", " return EDDs, FAs, NDs" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "detection_stat = np.array([i[0] for i in detector.stat_stored])[int(10 * n):] # padding\n", "online_th = np.array([i[1] for i in detector.stat_stored])[int(10 * n):]\n", "ground_truth = ground_truth[int(10 * n):]\n", "\n", "# display perf\n", "EDD, FA, ND = compute_curves(ground_truth, detection_stat, num_points=1,\n", " thres_values=online_th, start_coeff=1, end_coeff=1)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using thresholding quantile\n", "\n", "False alarms: 0.89\n", "Missed detections: 2.22 %\n", "Expected detection delay: 31.60 timesteps\n" ] } ], "source": [ "print(\"Using thresholding quantile\\n\")\n", "print(f\"False alarms: {FA[0]:.2f}\")\n", "print(f\"Missed detections: {ND[0]:.2f} %\")\n", "print(f\"Expected detection delay: {EDD[0]:.2f} timesteps\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "dev", "language": "python", "name": "dev" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }