One-hot encoding
In the final section of Part I we had a 3-feature dataset: the first two features were the operands and the third encoded whether the operation was addition, subtraction, multiplication or division. We used 0 to represent +, 1 to represent -, 2 to represent × and 3 to represent ÷. That specification was entirely correct, but the representation can be misleading.
Here’s why. Interpreted qualitatively, the numbers 0, 1, 2, 3 do their job: they place +, -, × and ÷ in four distinct categories. Interpreted quantitatively, however, they suggest that + (represented by 0) is further away from ÷ (represented by 3) than - (represented by 1) is, since 3-0=3 while 3-1=2, even though no such ordering or distance between the operations actually exists.
To force the encoding to be purely qualitative, we use one-hot encoding. Instead of representing +, -, × and ÷ with a single column containing 0, 1, 2 or 3, we represent them with four columns, each containing either 0 or 1 (a short sketch using scikit-learn's OneHotEncoder follows the list):
- If a data point (or row) is +, the 3rd column will be 1 while columns 4, 5 and 6 will be zero.
- If a data point (or row) is -, the 4th column will be 1 while columns 3, 5 and 6 will be zero.
- If a data point (or row) is ×, the 5th column will be 1 while columns 3, 4 and 6 will be zero.
- If a data point (or row) is ÷, the last column will be 1 while columns 3, 4 and 5 will be zero.
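For illustration only (the notebook below writes the one-hot columns out by hand), scikit-learn's OneHotEncoder performs exactly this conversion from the single 0–3 operation column of Part I; the small X_int array here is a made-up example, not data from the notebook:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Part I style rows: [operand_a, operand_b, operation], where operation is 0 (+), 1 (-), 2 (×), 3 (÷)
X_int = np.array([[1, 2, 0],
                  [1, 2, 1],
                  [1, 2, 2],
                  [1, 2, 3]])

# one-hot encode only the operation column and keep the two operands as they are
encoder = OneHotEncoder()
ops_onehot = encoder.fit_transform(X_int[:, 2:3]).toarray().astype(int)
X_onehot = np.concatenate((X_int[:, :2], ops_onehot), axis=1)
print(X_onehot)
# [[1 2 1 0 0 0]
#  [1 2 0 1 0 0]
#  [1 2 0 0 1 0]
#  [1 2 0 0 0 1]]
In a larger pipeline the same transformation is usually wired up with ColumnTransformer (or pandas.get_dummies), but for this small dataset we simply write the columns out manually.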
The Jupyter notebook for this exercise can be downloaded here.
We begin by recycling the same imports and functions we had from Part I:
import numpy as np

rs = 77                # random state shared by the models below
np.random.seed(rs)

def fitscore(model):
    # fit on the 12-row training set X, y and score on the full grid XX, yy
    model.fit(X, y)
    score = model.score(XX, yy)
    print('test score = {:6.3f}'.format(score))

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_validate, GridSearchCV

def callfitscores():
    print('{:28s}'.format('LinearRegression'), end='')
    fitscore(LinearRegression(n_jobs=-1))
    print('{:28s}'.format('Ridge'), end='')
    fitscore(Ridge(random_state=rs))
    print('{:28s}'.format('Lasso'), end='')
    fitscore(Lasso(random_state=rs))
    print('{:28s}'.format('RandomForestRegressor'), end='')
    fitscore(RandomForestRegressor(random_state=rs))
    print('{:28s}'.format('GradientBoostingRegressor'), end='')
    fitscore(GradientBoostingRegressor(random_state=rs))

def cv(model):
    # 10-fold cross-validation on XX, yy; report mean, min, max and standard deviation
    cvscores = cross_validate(model, XX, yy, cv=10, return_train_score=True)
    print('cross_validate test score = {:6.3f} {:6.3f} {:6.3f} {:.0e}'.format(
        cvscores['test_score'].mean(), cvscores['test_score'].min(),
        cvscores['test_score'].max(), np.std(cvscores['test_score'])))

def callcvs():
    print('{:28s}'.format('LinearRegression'), end='')
    cv(LinearRegression(n_jobs=-1))
    print('{:28s}'.format('Ridge'), end='')
    cv(Ridge(random_state=rs))
    print('{:28s}'.format('Lasso'), end='')
    cv(Lasso(random_state=rs))
    print('{:28s}'.format('RandomForestRegressor'), end='')
    cv(RandomForestRegressor(n_estimators=100, random_state=rs))
    print('{:28s}'.format('GradientBoostingRegressor'), end='')
    cv(GradientBoostingRegressor(random_state=rs))

def gsearchcv(model, param_grid):
    # 10-fold grid search on XX, yy; report the best score and the parameters that achieved it
    grid_search = GridSearchCV(model, param_grid=param_grid, cv=10, return_train_score=True)
    grid_search.fit(XX, yy)
    print('GridSearch best score = {:6.3f} {}'.format(grid_search.best_score_, grid_search.best_params_))

# parameter ranges recycled from Part I
logrange = np.asarray([.001, .01, .1, 1, 10, 100])
zero2onerange = np.arange(0, 1.1, .1)
neighbourrange = np.arange(25, 300)
lrrange = np.asarray([.00001, .0001, .001, .01, .1])
one2fiverange = np.arange(1, 6)
ten2twothourange = np.arange(50, 1001, 50)

def callgsearchcvs():
    print('{:28s}'.format('Ridge'), end='')
    param_grid = {'alpha': logrange}
    gsearchcv(Ridge(random_state=rs), param_grid)
    print('{:28s}'.format('Lasso'), end='')
    param_grid = {'alpha': logrange, 'max_iter': 1000/logrange}
    gsearchcv(Lasso(random_state=rs), param_grid)
    print('{:28s}'.format('RandomForestRegressor'), end='')
    param_grid = {'n_estimators': neighbourrange}
    gsearchcv(RandomForestRegressor(n_jobs=-1, random_state=rs), param_grid)
    print('{:28s}'.format('GradientBoostingRegressor'), end='')
    param_grid = {'n_estimators': neighbourrange, 'learning_rate': lrrange, 'max_depth': one2fiverange}
    gsearchcv(GradientBoostingRegressor(random_state=rs), param_grid)
And we define our one-hot encoded input data:
X = np.array([[0, 0, 1, 0, 0, 0],
[1, 2, 1, 0, 0, 0],
[2, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0],
[1, 2, 0, 1, 0, 0],
[2, 1, 0, 1, 0, 0],
[0, 0, 0, 0, 1, 0],
[1, 2, 0, 0, 1, 0],
[2, 1, 0, 0, 1, 0],
[0, 1, 0, 0, 0, 1],
[1, 2, 0, 0, 0, 1],
[2, 1, 0, 0, 0, 1]], dtype='int')
y = np.array([0,
3,
3,
0,
-1,
1,
0,
2,
2,
0,
.5,
2])
XX = []
for a in range(10):
    for b in range(10):
        XX.append([a, b])
XX = np.asarray(XX, dtype='int')
yy_add = np.add(XX[:,0], XX[:,1])
yy_sub = np.subtract(XX[:,0], XX[:,1])
yy_mul = np.multiply(XX[:,0], XX[:,1])
XX_div = XX[XX[:,1]>0]                         # drop rows that would divide by zero
yy_div = np.divide(XX_div[:,0], XX_div[:,1])
XX = np.concatenate((XX, XX, XX, XX_div))
XX = np.concatenate((XX, np.zeros((XX.shape[0], 4))), axis=1)
XX[:100, 2] = 1       # one-hot column for +
XX[100:200, 3] = 1    # one-hot column for -
XX[200:300, 4] = 1    # one-hot column for ×
XX[300:, 5] = 1       # one-hot column for ÷
yy = np.concatenate((yy_add, yy_sub, yy_mul, yy_div))
i = np.arange(len(XX))
np.random.shuffle(i)  # shuffle XX and yy with the same permutation
XX = XX[i]
yy = yy[i]
for n in range(5):
    print(XX[n], yy[n])
We get:
[7. 6. 0. 1. 0. 0.] 1.0
[5. 6. 0. 0. 0. 1.] 0.8333333333333334
[5. 9. 0. 0. 0. 1.] 0.5555555555555556
[7. 7. 1. 0. 0. 0.] 14.0
[1. 8. 0. 1. 0. 0.] -7.0
Whether or not we implement one-hot encoding rarely makes a dramatic practical difference, but it is the more correct way to represent qualitative, categorical data.
When we run:
callfitscores()
We get:
LinearRegression test score = 0.110
Ridge test score = 0.082
Lasso test score = -0.260
RandomForestRegressor test score = -0.159
GradientBoostingRegressor test score = -0.154
When we run:
callcvs()
we get:
LinearRegression cross_validate test score = 0.539 0.445 0.667 6e-02
Ridge cross_validate test score = 0.539 0.446 0.665 6e-02
Lasso cross_validate test score = 0.438 0.303 0.533 7e-02
RandomForestRegressor cross_validate test score = 0.988 0.981 0.993 4e-03
GradientBoostingRegressor cross_validate test score = 0.977 0.954 0.988 1e-02
And when we run:
callgsearchcvs()
we get:
Ridge GridSearch best score = 0.539 {'alpha': 1.0}
Lasso GridSearch best score = 0.539 {'alpha': 0.001, 'max_iter': 1000000.0}
RandomForestRegressor GridSearch best score = 0.990 {'n_estimators': 115}
GradientBoostingRegressor GridSearch best score = 0.996 {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 299}
Extrapolation
So far our input data have been limited to the non-negative digits 0–9. What happens if we use a model trained on those digits to do arithmetic on negative integers? Let's have a go.
Since Ridge and Lasso are variants of linear regression, for simplicity we take a single LinearRegression model to represent the linear family and compare it with GradientBoostingRegressor. We train the two models on the digit dataset as before:
model_linearregression = LinearRegression()
# use the best hyperparameters found by the grid search above
model_gradientboosting = GradientBoostingRegressor(learning_rate=.1, max_depth=4, n_estimators=299)
model_linearregression.fit(XX, yy)
model_gradientboosting.fit(XX, yy)
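As an aside, the hand-copied hyperparameters above could also be pulled out of the grid search programmatically. The gsearchcv helper defined earlier does not return its GridSearchCV object, so the variant below (gsearchcv_return is a hypothetical name, not part of the original notebook) is one way to do it:
# hypothetical variant of gsearchcv that returns the fitted GridSearchCV object
def gsearchcv_return(model, param_grid):
    grid_search = GridSearchCV(model, param_grid=param_grid, cv=10, return_train_score=True)
    grid_search.fit(XX, yy)
    return grid_search

gs = gsearchcv_return(GradientBoostingRegressor(random_state=rs),
                      {'n_estimators': neighbourrange,
                       'learning_rate': lrrange,
                       'max_depth': one2fiverange})
best_gbr = gs.best_estimator_   # already refit on XX, yy with the best parameters found
Because refit=True is the GridSearchCV default, best_estimator_ comes back already fitted, at the cost of re-running the (fairly slow) search.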
We generate some data for negative integers:
test_X = []
for a in range(-4, -1):
    for b in range(-4, -1):
        test_X.append([a, b])
test_X = np.asarray(test_X, dtype='int')
test_y_add = np.add(test_X[:,0], test_X[:,1])
test_y_sub = np.subtract(test_X[:,0], test_X[:,1])
test_y_mul = np.multiply(test_X[:,0], test_X[:,1])
test_y_div = np.divide(test_X[:,0], test_X[:,1])
test_y = np.concatenate((test_y_add, test_y_sub, test_y_mul, test_y_div))
test_X = np.concatenate((test_X, test_X, test_X, test_X))
test_X = np.concatenate((test_X, np.zeros((test_X.shape[0], 4))), axis=1)
test_X[:9, 2] = 1     # one-hot column for +
test_X[9:18, 3] = 1   # one-hot column for -
test_X[18:27, 4] = 1  # one-hot column for ×
test_X[27:, 5] = 1    # one-hot column for ÷
And we use the two models we just trained to make predictions on test_X:
p_linearregression = model_linearregression.predict(test_X)
p_gradientboosting = model_gradientboosting.predict(test_X)
for a, b, c, d in zip(test_X, test_y, p_linearregression, p_gradientboosting):
    print('{}{:15.3e}{:15.3e}{:15.3e}'.format(a, b, c, d))
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['font.size'] = 20
mpl.rcParams['lines.markersize'] = 15
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 10))
plt.plot(test_y, 'r.')
plt.plot(p_linearregression, 'g^')
plt.plot(p_gradientboosting, 'bx', mew=4)
plt.legend(['ground truth', 'linear regression', 'gradient boosting'])
plt.savefig('extrapolate.png')
We get:
[-4. -4. 1. 0. 0. 0.] -8.000e+00 -1.541e+01 -2.846e-02
[-4. -3. 1. 0. 0. 0.] -7.000e+00 -1.427e+01 -2.846e-02
[-4. -2. 1. 0. 0. 0.] -6.000e+00 -1.314e+01 -2.846e-02
[-3. -4. 1. 0. 0. 0.] -7.000e+00 -1.367e+01 -2.846e-02
[-3. -3. 1. 0. 0. 0.] -6.000e+00 -1.253e+01 -2.846e-02
[-3. -2. 1. 0. 0. 0.] -5.000e+00 -1.140e+01 -2.846e-02
[-2. -4. 1. 0. 0. 0.] -6.000e+00 -1.193e+01 -2.846e-02
[-2. -3. 1. 0. 0. 0.] -5.000e+00 -1.080e+01 -2.846e-02
[-2. -2. 1. 0. 0. 0.] -4.000e+00 -9.663e+00 -2.846e-02
[-4. -4. 0. 1. 0. 0.] 0.000e+00 -2.441e+01 -9.938e-03
[-4. -3. 0. 1. 0. 0.] -1.000e+00 -2.327e+01 -9.938e-03
[-4. -2. 0. 1. 0. 0.] -2.000e+00 -2.214e+01 -9.938e-03
[-3. -4. 0. 1. 0. 0.] 1.000e+00 -2.267e+01 -9.938e-03
[-3. -3. 0. 1. 0. 0.] 0.000e+00 -2.153e+01 -9.938e-03
[-3. -2. 0. 1. 0. 0.] -1.000e+00 -2.040e+01 -9.938e-03
[-2. -4. 0. 1. 0. 0.] 2.000e+00 -2.093e+01 -9.938e-03
[-2. -3. 0. 1. 0. 0.] 1.000e+00 -1.980e+01 -9.938e-03
[-2. -2. 0. 1. 0. 0.] 0.000e+00 -1.866e+01 -9.938e-03
[-4. -4. 0. 0. 1. 0.] 1.600e+01 -4.155e+00 7.232e-02
[-4. -3. 0. 0. 1. 0.] 1.200e+01 -3.023e+00 7.232e-02
[-4. -2. 0. 0. 1. 0.] 8.000e+00 -1.891e+00 7.232e-02
[-3. -4. 0. 0. 1. 0.] 1.200e+01 -2.416e+00 7.232e-02
[-3. -3. 0. 0. 1. 0.] 9.000e+00 -1.284e+00 7.232e-02
[-3. -2. 0. 0. 1. 0.] 6.000e+00 -1.522e-01 7.232e-02
[-2. -4. 0. 0. 1. 0.] 8.000e+00 -6.771e-01 7.232e-02
[-2. -3. 0. 0. 1. 0.] 6.000e+00 4.550e-01 7.232e-02
[-2. -2. 0. 0. 1. 0.] 4.000e+00 1.587e+00 7.232e-02
[-4. -4. 0. 0. 0. 1.] 1.000e+00 -2.356e+01 -9.392e-02
[-4. -3. 0. 0. 0. 1.] 1.333e+00 -2.242e+01 -9.392e-02
[-4. -2. 0. 0. 0. 1.] 2.000e+00 -2.129e+01 -9.392e-02
[-3. -4. 0. 0. 0. 1.] 7.500e-01 -2.182e+01 -9.392e-02
[-3. -3. 0. 0. 0. 1.] 1.000e+00 -2.069e+01 -9.392e-02
[-3. -2. 0. 0. 0. 1.] 1.500e+00 -1.955e+01 -9.392e-02
[-2. -4. 0. 0. 0. 1.] 5.000e-01 -2.008e+01 -9.392e-02
[-2. -3. 0. 0. 0. 1.] 6.667e-01 -1.895e+01 -9.392e-02
[-2. -2. 0. 0. 0. 1.] 1.000e+00 -1.781e+01 -9.392e-02

We immediately see GradientBoostingRegressor's inability to extrapolate. Tested on data outside the range it was trained on, it predicts a single constant value for each operation: one constant for all + rows, another for all - rows, another for all × rows and another for all ÷ rows.
Its poor performance on this non-linear problem aside, LinearRegression is always able to extrapolate: it at least responds to the variation in the input data rather than collapsing to a flat constant.
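To see the same behaviour in isolation, here is a minimal sketch (not from the original notebook) on a one-dimensional, perfectly linear problem: the tree-based model clamps its predictions to the target range it saw during training, while the linear model follows the trend beyond it.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

# train both models on y = 2x for x = 0..9
X_train = np.arange(10).reshape(-1, 1)
y_train = 2 * np.arange(10)

lin = LinearRegression().fit(X_train, y_train)
gbr = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# predict well outside the training range
X_out = np.array([[-5], [20]])
print(lin.predict(X_out))   # follows the fitted line: approximately [-10. 40.]
print(gbr.predict(X_out))   # stays near the boundary values of the training targets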