Getting machines to learn +-×÷ (Part II)

One-hot encoding

In the final section of Part I we had a 3-feature dataset, where the first two features were the operands and the third feature encoded whether the operation was addition, subtraction, multiplication or division: 0 represented +, 1 represented -, 2 represented × and 3 represented ÷. While that specification is unambiguous, the representation can be misleading.

Here’s why. Interpreted qualitatively, the encoding does its job: it places +, -, × and ÷ in four distinct categories. Interpreted quantitatively, however, it implies that + (represented by 0) is further away from ÷ (represented by 3) than - (represented by 1) is, since 3-0=3 while 3-1=2, even though no such distance between the operations is meaningful.

To force the encoding into purely qualitative categories, we use one-hot encoding. Instead of representing +, -, × and ÷ with a single column containing the number 0, 1, 2 or 3, we now represent them with four columns, each containing either 0 or 1 (a short conversion sketch follows the list):

  • If a data point (or row) is +, the 3rd column will be 1 while columns 4, 5 and 6 will be zero.
  • If a data point (or row) is -, the 4th column will be 1 while columns 3, 5 and 6 will be zero.
  • If a data point (or row) is ×, the 5th column will be 1 while columns 3, 4 and 6 will be zero.
  • If a data point (or row) is ÷, the last column will be 1 while columns 3, 4 and 5 will be zero.
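
For a dataset that already uses the Part I single-column coding (0, 1, 2, 3), one convenient way to produce these four columns is to index into an identity matrix. The snippet below is only a minimal sketch of that conversion; X_labelled is a made-up four-row example, not the actual dataset, which we build by hand further down.

import numpy as np

# Hypothetical Part I-style rows: two operands plus one operation code
# (0 for +, 1 for -, 2 for ×, 3 for ÷).
X_labelled = np.array([[1, 2, 0],
                       [1, 2, 1],
                       [1, 2, 2],
                       [1, 2, 3]])

# Row k of the 4×4 identity matrix is exactly the one-hot vector for code k,
# so indexing np.eye(4) with the code column yields the four 0/1 columns.
onehot = np.eye(4, dtype=int)[X_labelled[:, 2]]
X_onehot = np.concatenate((X_labelled[:, :2], onehot), axis=1)
print(X_onehot)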

The Jupyter notebook for this exercise can be downloaded here.

We begin by recycling the same imports and functions we had from Part I:

import numpy as np
rs = 77               # seed value, also passed as random_state to the models below
np.random.seed(rs)

def fitscore(model):
    # fit on the small hand-made training set (X, y), score on the full grid (XX, yy)
    model.fit(X, y)
    score = model.score(XX, yy)
    print('test score = {:6.3f}'.format(score))

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_validate, GridSearchCV

def callfitscores():
    print('{:28s}'.format('LinearRegression'), end='')
    fitscore(LinearRegression(n_jobs=-1))
    print('{:28s}'.format('Ridge'), end='')
    fitscore(Ridge(random_state=rs))
    print('{:28s}'.format('Lasso'), end='')
    fitscore(Lasso(random_state=rs))
    print('{:28s}'.format('RandomForestRegressor'), end='')
    fitscore(RandomForestRegressor(random_state=rs))
    print('{:28s}'.format('GradientBoostingRegressor'), end='')
    fitscore(GradientBoostingRegressor(random_state=rs))

def cv(model):
    cvscores = cross_validate(model, XX, yy, cv=10, return_train_score=True)
    print('cross_validate test score = {:6.3f} {:6.3f} {:6.3f} {:.0e}'.format(
        cvscores['test_score'].mean(), cvscores['test_score'].min(),
        cvscores['test_score'].max(), np.std(cvscores['test_score'])))

def callcvs():
    print('{:28s}'.format('LinearRegression'), end='')
    cv(LinearRegression(n_jobs=-1))
    print('{:28s}'.format('Ridge'), end='')
    cv(Ridge(random_state=rs))
    print('{:28s}'.format('Lasso'), end='')
    cv(Lasso(random_state=rs))
    print('{:28s}'.format('RandomForestRegressor'), end='')
    cv(RandomForestRegressor(n_estimators=100, random_state=rs))
    print('{:28s}'.format('GradientBoostingRegressor'), end='')
    cv(GradientBoostingRegressor(random_state=rs))

def gsearchcv(model, param_grid):
    grid_search = GridSearchCV(model, param_grid=param_grid, cv=10, return_train_score=True)
    grid_search.fit(XX, yy)
    print('GridSearch best score = {:6.3f} {}'.format(grid_search.best_score_, grid_search.best_params_))

logrange = np.asarray([.001, .01, .1, 1, 10, 100])
zero2onerange = np.arange(0, 1.1, .1)
neighbourrange = np.arange(25, 300)
lrrange = np.asarray([.00001, .0001, .001, .01, .1])
one2fiverange = np.arange(1, 6)
ten2twothourange = np.arange(50, 1001, 50)

def callgsearchcvs():
    print('{:28s}'.format('Ridge'), end='')
    param_grid = {'alpha': logrange}
    gsearchcv(Ridge(random_state=rs), param_grid)

    print('{:28s}'.format('Lasso'), end='')
    param_grid = {'alpha': logrange, 'max_iter': 1000/logrange}
    gsearchcv(Lasso(random_state=rs), param_grid)

    print('{:28s}'.format('RandomForestRegressor'), end='')
    param_grid = {'n_estimators': neighbourrange}
    gsearchcv(RandomForestRegressor(n_jobs=-1, random_state=rs), param_grid)

    print('{:28s}'.format('GradientBoostingRegressor'), end='')
    param_grid = {'n_estimators': neighbourrange, 'learning_rate': lrrange, 'max_depth': one2fiverange}
    gsearchcv(GradientBoostingRegressor(random_state=rs), param_grid)

And we define our one-hot encoded input data:

X = np.array([[0, 0, 1, 0, 0, 0],   # 0 + 0
              [1, 2, 1, 0, 0, 0],   # 1 + 2
              [2, 1, 1, 0, 0, 0],   # 2 + 1
              [0, 0, 0, 1, 0, 0],   # 0 - 0
              [1, 2, 0, 1, 0, 0],   # 1 - 2
              [2, 1, 0, 1, 0, 0],   # 2 - 1
              [0, 0, 0, 0, 1, 0],   # 0 × 0
              [1, 2, 0, 0, 1, 0],   # 1 × 2
              [2, 1, 0, 0, 1, 0],   # 2 × 1
              [0, 1, 0, 0, 0, 1],   # 0 ÷ 1
              [1, 2, 0, 0, 0, 1],   # 1 ÷ 2
              [2, 1, 0, 0, 0, 1]],  # 2 ÷ 1
             dtype='int')
y = np.array([0, 3, 3,      # sums
              0, -1, 1,     # differences
              0, 2, 2,      # products
              0, .5, 2])    # quotients

XX = []
for a in range(10):
    for b in range(10):
        XX.append([a, b])
XX = np.asarray(XX, dtype='int')
yy_add = np.add(XX[:,0], XX[:,1])
yy_sub = np.subtract(XX[:,0], XX[:,1])
yy_mul = np.multiply(XX[:,0], XX[:,1])
XX_div = XX[XX[:,1]>0]                         # drop rows that would divide by zero
yy_div = np.divide(XX_div[:,0], XX_div[:,1])
XX = np.concatenate((XX, XX, XX, XX_div))
XX = np.concatenate((XX, np.zeros((XX.shape[0], 4))), axis=1)
XX[:100, 2] = 1       # first 100 rows: +
XX[100:200, 3] = 1    # next 100 rows: -
XX[200:300, 4] = 1    # next 100 rows: ×
XX[300:, 5] = 1       # remaining 90 rows: ÷
yy = np.concatenate((yy_add, yy_sub, yy_mul, yy_div))
i = np.arange(len(XX))
np.random.shuffle(i)
XX = XX[i]
yy = yy[i]
for n in range(5):
    print(XX[n], yy[n])

We get:

[7. 6. 0. 1. 0. 0.] 1.0 
[5. 6. 0. 0. 0. 1.] 0.8333333333333334
[5. 9. 0. 0. 0. 1.] 0.5555555555555556
[7. 7. 1. 0. 0. 0.] 14.0
[1. 8. 0. 1. 0. 0.] -7.0

Whether or not we one-hot encode, the practical difference is often small, but one-hot encoding is the more correct way to represent qualitative, categorical data.
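
One quick way to gauge the difference on this particular problem is to rebuild the Part I-style single code column from the one-hot columns and cross-validate the same model on both encodings. This is only a rough sketch, reusing the XX, yy, rs and imports defined above; XX_label is a name introduced here for the rebuilt matrix:

# Recover the single 0-3 operation code from the four one-hot columns:
# the argmax over columns 3-6 is 0 for +, 1 for -, 2 for × and 3 for ÷.
XX_label = np.concatenate((XX[:, :2], np.argmax(XX[:, 2:], axis=1).reshape(-1, 1)), axis=1)

for name, data in (('integer code', XX_label), ('one-hot', XX)):
    scores = cross_validate(GradientBoostingRegressor(random_state=rs), data, yy, cv=10)
    print('{:15s} mean test score = {:6.3f}'.format(name, scores['test_score'].mean()))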

When we run:

callfitscores()

We get:

LinearRegression            test score =  0.110
Ridge                       test score =  0.082
Lasso                       test score = -0.260
RandomForestRegressor       test score = -0.159
GradientBoostingRegressor   test score = -0.154

When we run:

callcvs()

we get:

LinearRegression            cross_validate test score =  0.539  0.445  0.667 6e-02
Ridge                       cross_validate test score =  0.539  0.446  0.665 6e-02
Lasso                       cross_validate test score =  0.438  0.303  0.533 7e-02
RandomForestRegressor       cross_validate test score =  0.988  0.981  0.993 4e-03
GradientBoostingRegressor   cross_validate test score =  0.977  0.954  0.988 1e-02

And when we run:

callgsearchcvs()

we get:

Ridge                       GridSearch best score =  0.539 {'alpha': 1.0}
Lasso                       GridSearch best score =  0.539 {'alpha': 0.001, 'max_iter': 1000000.0}
RandomForestRegressor       GridSearch best score =  0.990 {'n_estimators': 115}
GradientBoostingRegressor   GridSearch best score =  0.996 {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 299}

Extrapolation

So far our input data have been limited to the non-negative digits 0-9. What happens if we use a model trained on those digits to do arithmetic operations on negative integers? Let’s have a go.

Since Ridge and Lasso are variants of linear regression, for simplicity we take a single LinearRegression model to represent the linear family, and we compare it with GradientBoostingRegressor. We train the two models on the same dataset of digits as before:

model_linearregression = LinearRegression()
model_gradientboosting = GradientBoostingRegressor(learning_rate=.1, max_depth=4, n_estimators=299)
model_linearregression.fit(XX, yy)
model_gradientboosting.fit(XX, yy)

We generate some data for negative integers:

test_X = []
for a in range(-4, -1):
    for b in range(-4, -1):
        test_X.append([a, b])
test_X = np.asarray(test_X, dtype='int')

test_y_add = np.add(test_X[:,0], test_X[:,1])
test_y_sub = np.subtract(test_X[:,0], test_X[:,1])
test_y_mul = np.multiply(test_X[:,0], test_X[:,1])
test_y_div = np.divide(test_X[:,0], test_X[:,1])
test_y = np.concatenate((test_y_add, test_y_sub, test_y_mul, test_y_div))

test_X = np.concatenate((test_X, test_X, test_X, test_X))
test_X = np.concatenate((test_X, np.zeros((test_X.shape[0], 4))), axis=1)
test_X[:9, 2] = 1      # first 9 rows: +
test_X[9:18, 3] = 1    # next 9 rows: -
test_X[18:27, 4] = 1   # next 9 rows: ×
test_X[27:, 5] = 1     # last 9 rows: ÷

And we use the two models we just trained to make predictions on test_X:

p_linearregression = model_linearregression.predict(test_X)
p_gradientboosting = model_gradientboosting.predict(test_X)
for a, b, c, d in zip(test_X, test_y, p_linearregression, p_gradientboosting):
    print('{}{:15.3e}{:15.3e}{:15.3e}'.format(a, b, c, d))

%matplotlib inline
import matplotlib as mpl
mpl.rcParams['font.size'] = 20
mpl.rcParams['lines.markersize'] = 15
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 10))
plt.plot(test_y, 'r.')
plt.plot(p_linearregression, 'g^')
plt.plot(p_gradientboosting, 'bx', mew=4)
plt.legend(['ground truth', 'linear regression', 'gradient boosting'])
plt.savefig('extrapolate.png')

We get:

[-4. -4.  1.  0.  0.  0.]     -8.000e+00     -1.541e+01     -2.846e-02 
[-4. -3.  1.  0.  0.  0.]     -7.000e+00     -1.427e+01     -2.846e-02
[-4. -2.  1.  0.  0.  0.]     -6.000e+00     -1.314e+01     -2.846e-02
[-3. -4.  1.  0.  0.  0.]     -7.000e+00     -1.367e+01     -2.846e-02
[-3. -3.  1.  0.  0.  0.]     -6.000e+00     -1.253e+01     -2.846e-02
[-3. -2.  1.  0.  0.  0.]     -5.000e+00     -1.140e+01     -2.846e-02
[-2. -4.  1.  0.  0.  0.]     -6.000e+00     -1.193e+01     -2.846e-02
[-2. -3.  1.  0.  0.  0.]     -5.000e+00     -1.080e+01     -2.846e-02
[-2. -2.  1.  0.  0.  0.]     -4.000e+00     -9.663e+00     -2.846e-02
[-4. -4.  0.  1.  0.  0.]      0.000e+00     -2.441e+01     -9.938e-03
[-4. -3.  0.  1.  0.  0.]     -1.000e+00     -2.327e+01     -9.938e-03
[-4. -2.  0.  1.  0.  0.]     -2.000e+00     -2.214e+01     -9.938e-03
[-3. -4.  0.  1.  0.  0.]      1.000e+00     -2.267e+01     -9.938e-03
[-3. -3.  0.  1.  0.  0.]      0.000e+00     -2.153e+01     -9.938e-03
[-3. -2.  0.  1.  0.  0.]     -1.000e+00     -2.040e+01     -9.938e-03
[-2. -4.  0.  1.  0.  0.]      2.000e+00     -2.093e+01     -9.938e-03
[-2. -3.  0.  1.  0.  0.]      1.000e+00     -1.980e+01     -9.938e-03
[-2. -2.  0.  1.  0.  0.]      0.000e+00     -1.866e+01     -9.938e-03
[-4. -4.  0.  0.  1.  0.]      1.600e+01     -4.155e+00      7.232e-02
[-4. -3.  0.  0.  1.  0.]      1.200e+01     -3.023e+00      7.232e-02
[-4. -2.  0.  0.  1.  0.]      8.000e+00     -1.891e+00      7.232e-02
[-3. -4.  0.  0.  1.  0.]      1.200e+01     -2.416e+00      7.232e-02
[-3. -3.  0.  0.  1.  0.]      9.000e+00     -1.284e+00      7.232e-02
[-3. -2.  0.  0.  1.  0.]      6.000e+00     -1.522e-01      7.232e-02
[-2. -4.  0.  0.  1.  0.]      8.000e+00     -6.771e-01      7.232e-02
[-2. -3.  0.  0.  1.  0.]      6.000e+00      4.550e-01      7.232e-02
[-2. -2.  0.  0.  1.  0.]      4.000e+00      1.587e+00      7.232e-02
[-4. -4.  0.  0.  0.  1.]      1.000e+00     -2.356e+01     -9.392e-02
[-4. -3.  0.  0.  0.  1.]      1.333e+00     -2.242e+01     -9.392e-02
[-4. -2.  0.  0.  0.  1.]      2.000e+00     -2.129e+01     -9.392e-02
[-3. -4.  0.  0.  0.  1.]      7.500e-01     -2.182e+01     -9.392e-02
[-3. -3.  0.  0.  0.  1.]      1.000e+00     -2.069e+01     -9.392e-02
[-3. -2.  0.  0.  0.  1.]      1.500e+00     -1.955e+01     -9.392e-02
[-2. -4.  0.  0.  0.  1.]      5.000e-01     -2.008e+01     -9.392e-02
[-2. -3.  0.  0.  0.  1.]      6.667e-01     -1.895e+01     -9.392e-02
[-2. -2.  0.  0.  0.  1.]      1.000e+00     -1.781e+01     -9.392e-02

Figure 1. Extrapolation of models trained using Linear Regression and Gradient Boosting Regressor.

We immediately see GradientBoostingRegressor’s inability to extrapolate. When tested on data beyond the range of the data it was trained on, it predicts one constant value for all + operations, another constant for all - operations, another for all × operations, and another for all ÷ operations.

Poor performance on non-linear problems aside, LinearRegression is always able to extrapolate: it at least responds to variation in the input data instead of falling back to a flat constant.
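
To see this behaviour in isolation, here is a small sketch (independent of the arithmetic dataset, reusing the imports above) that trains both kinds of model on the line y = 2x for x from 0 to 9 and then predicts at x values well outside that range:

# Train on x = 0..9 with y = 2x, then predict far outside the training range.
x_train = np.arange(10).reshape(-1, 1)
y_train = 2 * x_train.ravel()

lin = LinearRegression().fit(x_train, y_train)
gbr = GradientBoostingRegressor(random_state=0).fit(x_train, y_train)

x_test = np.array([[-5], [20], [100]])
print(lin.predict(x_test))   # follows the line: approximately [-10, 40, 200]
print(gbr.predict(x_test))   # stuck near the edge targets: roughly [0, 18, 18]

Tree-based ensembles partition the input space they have seen and predict a constant within each region, so any query outside that space falls into one of the outermost regions and receives that region’s constant. A linear model, by contrast, simply extends its fitted line (or hyperplane) indefinitely.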
