Getting machines to learn +-×÷ (Part II)

One-hot encoding

In the final section of Part I we had a 3-feature dataset, where the first two features were the operands and the third feature encoded whether the operation was addition, subtraction, multiplication or division: 0 represented +, 1 represented -, 2 represented × and 3 represented ÷. While that specification is unambiguous, the representation can be misleading.

Here’s why. Interpreted qualitatively, the encoding does its job: it places +, -, × and ÷ in four distinct categories. Interpreted quantitatively, however, it implies that + (represented by 0) is further away from ÷ (represented by 3) than - (represented by 1) is, since 3-0=3 while 3-1=2, even though no such distance between the operations is meaningful.

To force the encoding into purely qualitative categories, we use one-hot encoding. Instead of representing +, -, × and ÷ with a single column containing the number 0, 1, 2 or 3, we now represent them with four columns, each containing either 0 or 1 (a short conversion sketch follows the list):

  • If a data point (or row) is +, the 3rd column will be 1 while columns 4, 5 and 6 will be zero.
  • If a data point (or row) is -, the 4th column will be 1 while columns 3, 5 and 6 will be zero.
  • If a data point (or row) is ×, the 5th column will be 1 while columns 3, 4 and 6 will be zero.
  • If a data point (or row) is ÷, the last column will be 1 while columns 3, 4 and 5 will be zero.
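
For a dataset that already uses the Part I single-column coding (0, 1, 2, 3), one convenient way to produce these four columns is to index into an identity matrix. The snippet below is only a minimal sketch of that conversion; X_labelled is a made-up four-row example, not the actual dataset, which we build by hand further down.

import numpy as np

# Hypothetical Part I-style rows: two operands plus one operation code
# (0 for +, 1 for -, 2 for ×, 3 for ÷).
X_labelled = np.array([[1, 2, 0],
                       [1, 2, 1],
                       [1, 2, 2],
                       [1, 2, 3]])

# Row k of the 4×4 identity matrix is exactly the one-hot vector for code k,
# so indexing np.eye(4) with the code column yields the four 0/1 columns.
onehot = np.eye(4, dtype=int)[X_labelled[:, 2]]
X_onehot = np.concatenate((X_labelled[:, :2], onehot), axis=1)
print(X_onehot)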

The Jupyter notebook for this exercise can be downloaded here.

We begin by recycling the same imports and functions we had from Part I:

import numpy as np
rs = 77               # seed value, also passed as random_state to the models below
np.random.seed(rs)

def fitscore(model):
    # fit on the small hand-made training set (X, y), score on the full grid (XX, yy)
    model.fit(X, y)
    score = model.score(XX, yy)
    print('test score = {:6.3f}'.format(score))

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_validate, GridSearchCV

def callfitscores():
    print('{:28s}'.format('LinearRegression'), end='')
    fitscore(LinearRegression(n_jobs=-1))
    print('{:28s}'.format('Ridge'), end='')
    fitscore(Ridge(random_state=rs))
    print('{:28s}'.format('Lasso'), end='')
    fitscore(Lasso(random_state=rs))
    print('{:28s}'.format('RandomForestRegressor'), end='')
    fitscore(RandomForestRegressor(random_state=rs))
    print('{:28s}'.format('GradientBoostingRegressor'), end='')
    fitscore(GradientBoostingRegressor(random_state=rs))

def cv(model):
    cvscores = cross_validate(model, XX, yy, cv=10, return_train_score=True)
    print('cross_validate test score = {:6.3f} {:6.3f} {:6.3f} {:.0e}'.format(
        cvscores['test_score'].mean(), cvscores['test_score'].min(),
        cvscores['test_score'].max(), np.std(cvscores['test_score'])))

def callcvs():
    print('{:28s}'.format('LinearRegression'), end='')
    cv(LinearRegression(n_jobs=-1))
    print('{:28s}'.format('Ridge'), end='')
    cv(Ridge(random_state=rs))
    print('{:28s}'.format('Lasso'), end='')
    cv(Lasso(random_state=rs))
    print('{:28s}'.format('RandomForestRegressor'), end='')
    cv(RandomForestRegressor(n_estimators=100, random_state=rs))
    print('{:28s}'.format('GradientBoostingRegressor'), end='')
    cv(GradientBoostingRegressor(random_state=rs))

def gsearchcv(model, param_grid):
    grid_search = GridSearchCV(model, param_grid=param_grid, cv=10, return_train_score=True)
    grid_search.fit(XX, yy)
    print('GridSearch best score = {:6.3f} {}'.format(grid_search.best_score_, grid_search.best_params_))

logrange = np.asarray([.001, .01, .1, 1, 10, 100])
zero2onerange = np.arange(0, 1.1, .1)
neighbourrange = np.arange(25, 300)
lrrange = np.asarray([.00001, .0001, .001, .01, .1])
one2fiverange = np.arange(1, 6)
ten2twothourange = np.arange(50, 1001, 50)

def callgsearchcvs():
    print('{:28s}'.format('Ridge'), end='')
    param_grid = {'alpha': logrange}
    gsearchcv(Ridge(random_state=rs), param_grid)

    print('{:28s}'.format('Lasso'), end='')
    param_grid = {'alpha': logrange, 'max_iter': 1000/logrange}
    gsearchcv(Lasso(random_state=rs), param_grid)

    print('{:28s}'.format('RandomForestRegressor'), end='')
    param_grid = {'n_estimators': neighbourrange}
    gsearchcv(RandomForestRegressor(n_jobs=-1, random_state=rs), param_grid)

    print('{:28s}'.format('GradientBoostingRegressor'), end='')
    param_grid = {'n_estimators': neighbourrange, 'learning_rate': lrrange, 'max_depth': one2fiverange}
    gsearchcv(GradientBoostingRegressor(random_state=rs), param_grid)

And we define our one-hot encoded input data:

X = np.array([[0, 0, 1, 0, 0, 0],   # 0 + 0
              [1, 2, 1, 0, 0, 0],   # 1 + 2
              [2, 1, 1, 0, 0, 0],   # 2 + 1
              [0, 0, 0, 1, 0, 0],   # 0 - 0
              [1, 2, 0, 1, 0, 0],   # 1 - 2
              [2, 1, 0, 1, 0, 0],   # 2 - 1
              [0, 0, 0, 0, 1, 0],   # 0 × 0
              [1, 2, 0, 0, 1, 0],   # 1 × 2
              [2, 1, 0, 0, 1, 0],   # 2 × 1
              [0, 1, 0, 0, 0, 1],   # 0 ÷ 1
              [1, 2, 0, 0, 0, 1],   # 1 ÷ 2
              [2, 1, 0, 0, 0, 1]],  # 2 ÷ 1
             dtype='int')
y = np.array([0, 3, 3,      # sums
              0, -1, 1,     # differences
              0, 2, 2,      # products
              0, .5, 2])    # quotients

XX = []
for a in range(10):
    for b in range(10):
        XX.append([a, b])
XX = np.asarray(XX, dtype='int')
yy_add = np.add(XX[:,0], XX[:,1])
yy_sub = np.subtract(XX[:,0], XX[:,1])
yy_mul = np.multiply(XX[:,0], XX[:,1])
XX_div = XX[XX[:,1]>0]                         # drop rows that would divide by zero
yy_div = np.divide(XX_div[:,0], XX_div[:,1])
XX = np.concatenate((XX, XX, XX, XX_div))
XX = np.concatenate((XX, np.zeros((XX.shape[0], 4))), axis=1)
XX[:100, 2] = 1       # first 100 rows: +
XX[100:200, 3] = 1    # next 100 rows: -
XX[200:300, 4] = 1    # next 100 rows: ×
XX[300:, 5] = 1       # remaining 90 rows: ÷
yy = np.concatenate((yy_add, yy_sub, yy_mul, yy_div))
i = np.arange(len(XX))
np.random.shuffle(i)
XX = XX[i]
yy = yy[i]
for n in range(5):
    print(XX[n], yy[n])

We get:

[7. 6. 0. 1. 0. 0.] 1.0 
[5. 6. 0. 0. 0. 1.] 0.8333333333333334
[5. 9. 0. 0. 0. 1.] 0.5555555555555556
[7. 7. 1. 0. 0. 0.] 14.0
[1. 8. 0. 1. 0. 0.] -7.0

Whether or not we one-hot encode, the practical difference is often small, but one-hot encoding is the more correct way to represent qualitative, categorical data.
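
One quick way to gauge the difference on this particular problem is to rebuild the Part I-style single code column from the one-hot columns and cross-validate the same model on both encodings. This is only a rough sketch, reusing the XX, yy, rs and imports defined above; XX_label is a name introduced here for the rebuilt matrix:

# Recover the single 0-3 operation code from the four one-hot columns:
# the argmax over columns 3-6 is 0 for +, 1 for -, 2 for × and 3 for ÷.
XX_label = np.concatenate((XX[:, :2], np.argmax(XX[:, 2:], axis=1).reshape(-1, 1)), axis=1)

for name, data in (('integer code', XX_label), ('one-hot', XX)):
    scores = cross_validate(GradientBoostingRegressor(random_state=rs), data, yy, cv=10)
    print('{:15s} mean test score = {:6.3f}'.format(name, scores['test_score'].mean()))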

When we run:

callfitscores()

We get:

LinearRegression            test score =  0.110
Ridge                       test score =  0.082
Lasso                       test score = -0.260
RandomForestRegressor       test score = -0.159
GradientBoostingRegressor   test score = -0.154

When we run:

callcvs()

we get:

LinearRegression            cross_validate test score =  0.539  0.445  0.667 6e-02
Ridge                       cross_validate test score =  0.539  0.446  0.665 6e-02
Lasso                       cross_validate test score =  0.438  0.303  0.533 7e-02
RandomForestRegressor       cross_validate test score =  0.988  0.981  0.993 4e-03
GradientBoostingRegressor   cross_validate test score =  0.977  0.954  0.988 1e-02

And when we run:

callgsearchcvs()

we get:

Ridge                       GridSearch best score =  0.539 {'alpha': 1.0}
Lasso                       GridSearch best score =  0.539 {'alpha': 0.001, 'max_iter': 1000000.0}
RandomForestRegressor       GridSearch best score =  0.990 {'n_estimators': 115}
GradientBoostingRegressor   GridSearch best score =  0.996 {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 299}

Extrapolation

So far our input data have been limited to the non-negative digits 0-9. What happens if we use a model trained on those digits to do arithmetic operations on negative integers? Let’s have a go.

Since Ridge and Lasso are variants of linear regression, for simplicity we take a single LinearRegression model to represent the linear family, and we compare it with GradientBoostingRegressor. We train the two models on the same dataset of digits as before:

model_linearregression = LinearRegression()
model_gradientboosting = GradientBoostingRegressor(learning_rate=.1, max_depth=4, n_estimators=299)
model_linearregression.fit(XX, yy)
model_gradientboosting.fit(XX, yy)

We generate some data for negative integers:

test_X = []
for a in range(-4, -1):
    for b in range(-4, -1):
        test_X.append([a, b])
test_X = np.asarray(test_X, dtype='int')

test_y_add = np.add(test_X[:,0], test_X[:,1])
test_y_sub = np.subtract(test_X[:,0], test_X[:,1])
test_y_mul = np.multiply(test_X[:,0], test_X[:,1])
test_y_div = np.divide(test_X[:,0], test_X[:,1])
test_y = np.concatenate((test_y_add, test_y_sub, test_y_mul, test_y_div))

test_X = np.concatenate((test_X, test_X, test_X, test_X))
test_X = np.concatenate((test_X, np.zeros((test_X.shape[0], 4))), axis=1)
test_X[:9, 2] = 1      # first 9 rows: +
test_X[9:18, 3] = 1    # next 9 rows: -
test_X[18:27, 4] = 1   # next 9 rows: ×
test_X[27:, 5] = 1     # last 9 rows: ÷

And we use the two models we just trained to make predictions on test_X:

p_linearregression = model_linearregression.predict(test_X)
p_gradientboosting = model_gradientboosting.predict(test_X)
for a, b, c, d in zip(test_X, test_y, p_linearregression, p_gradientboosting):
    print('{}{:15.3e}{:15.3e}{:15.3e}'.format(a, b, c, d))

%matplotlib inline
import matplotlib as mpl
mpl.rcParams['font.size'] = 20
mpl.rcParams['lines.markersize'] = 15
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 10))
plt.plot(test_y, 'r.')
plt.plot(p_linearregression, 'g^')
plt.plot(p_gradientboosting, 'bx', mew=4)
plt.legend(['ground truth', 'linear regression', 'gradient boosting'])
plt.savefig('extrapolate.png')

We get:

[-4. -4.  1.  0.  0.  0.]     -8.000e+00     -1.541e+01     -2.846e-02 
[-4. -3.  1.  0.  0.  0.]     -7.000e+00     -1.427e+01     -2.846e-02
[-4. -2.  1.  0.  0.  0.]     -6.000e+00     -1.314e+01     -2.846e-02
[-3. -4.  1.  0.  0.  0.]     -7.000e+00     -1.367e+01     -2.846e-02
[-3. -3.  1.  0.  0.  0.]     -6.000e+00     -1.253e+01     -2.846e-02
[-3. -2.  1.  0.  0.  0.]     -5.000e+00     -1.140e+01     -2.846e-02
[-2. -4.  1.  0.  0.  0.]     -6.000e+00     -1.193e+01     -2.846e-02
[-2. -3.  1.  0.  0.  0.]     -5.000e+00     -1.080e+01     -2.846e-02
[-2. -2.  1.  0.  0.  0.]     -4.000e+00     -9.663e+00     -2.846e-02
[-4. -4.  0.  1.  0.  0.]      0.000e+00     -2.441e+01     -9.938e-03
[-4. -3.  0.  1.  0.  0.]     -1.000e+00     -2.327e+01     -9.938e-03
[-4. -2.  0.  1.  0.  0.]     -2.000e+00     -2.214e+01     -9.938e-03
[-3. -4.  0.  1.  0.  0.]      1.000e+00     -2.267e+01     -9.938e-03
[-3. -3.  0.  1.  0.  0.]      0.000e+00     -2.153e+01     -9.938e-03
[-3. -2.  0.  1.  0.  0.]     -1.000e+00     -2.040e+01     -9.938e-03
[-2. -4.  0.  1.  0.  0.]      2.000e+00     -2.093e+01     -9.938e-03
[-2. -3.  0.  1.  0.  0.]      1.000e+00     -1.980e+01     -9.938e-03
[-2. -2.  0.  1.  0.  0.]      0.000e+00     -1.866e+01     -9.938e-03
[-4. -4.  0.  0.  1.  0.]      1.600e+01     -4.155e+00      7.232e-02
[-4. -3.  0.  0.  1.  0.]      1.200e+01     -3.023e+00      7.232e-02
[-4. -2.  0.  0.  1.  0.]      8.000e+00     -1.891e+00      7.232e-02
[-3. -4.  0.  0.  1.  0.]      1.200e+01     -2.416e+00      7.232e-02
[-3. -3.  0.  0.  1.  0.]      9.000e+00     -1.284e+00      7.232e-02
[-3. -2.  0.  0.  1.  0.]      6.000e+00     -1.522e-01      7.232e-02
[-2. -4.  0.  0.  1.  0.]      8.000e+00     -6.771e-01      7.232e-02
[-2. -3.  0.  0.  1.  0.]      6.000e+00      4.550e-01      7.232e-02
[-2. -2.  0.  0.  1.  0.]      4.000e+00      1.587e+00      7.232e-02
[-4. -4.  0.  0.  0.  1.]      1.000e+00     -2.356e+01     -9.392e-02
[-4. -3.  0.  0.  0.  1.]      1.333e+00     -2.242e+01     -9.392e-02
[-4. -2.  0.  0.  0.  1.]      2.000e+00     -2.129e+01     -9.392e-02
[-3. -4.  0.  0.  0.  1.]      7.500e-01     -2.182e+01     -9.392e-02
[-3. -3.  0.  0.  0.  1.]      1.000e+00     -2.069e+01     -9.392e-02
[-3. -2.  0.  0.  0.  1.]      1.500e+00     -1.955e+01     -9.392e-02
[-2. -4.  0.  0.  0.  1.]      5.000e-01     -2.008e+01     -9.392e-02
[-2. -3.  0.  0.  0.  1.]      6.667e-01     -1.895e+01     -9.392e-02
[-2. -2.  0.  0.  0.  1.]      1.000e+00     -1.781e+01     -9.392e-02

Figure 1. Extrapolation of models trained using Linear Regression and Gradient Boosting Regressor.

We immediately see GradientBoostingRegressor’s inability to extrapolate. When tested on data beyond the range of the data it was trained on, it predicts one constant value for all + operations, another constant for all - operations, another for all × operations, and another for all ÷ operations.

Poor performance on non-linear problems aside, LinearRegression is always able to extrapolate: it at least responds to variation in the input data instead of falling back to a flat constant.
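
To see this behaviour in isolation, here is a small sketch (independent of the arithmetic dataset, reusing the imports above) that trains both kinds of model on the line y = 2x for x from 0 to 9 and then predicts at x values well outside that range:

# Train on x = 0..9 with y = 2x, then predict far outside the training range.
x_train = np.arange(10).reshape(-1, 1)
y_train = 2 * x_train.ravel()

lin = LinearRegression().fit(x_train, y_train)
gbr = GradientBoostingRegressor(random_state=0).fit(x_train, y_train)

x_test = np.array([[-5], [20], [100]])
print(lin.predict(x_test))   # follows the line: approximately [-10, 40, 200]
print(gbr.predict(x_test))   # stuck near the edge targets: roughly [0, 18, 18]

Tree-based ensembles partition the input space they have seen and predict a constant within each region, so any query outside that space falls into one of the outermost regions and receives that region’s constant. A linear model, by contrast, simply extends its fitted line (or hyperplane) indefinitely.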
