One-hot encoding using keras and scikit-learn

Jupyter notebook for this exercise is available here:

keras

Let us seed the random number generator, just for the sake of reproducibility, and then generate a dozen random non-negative integers:

import numpy as np
np.random.seed(77)
y_positives = np.random.randint(0, 10, 12)
print(y_positives)

Output:

[7 4 4 5 8 0 9 7 5 3 0 6]

Q: Is there any difference between keras.utils.to_categorical and keras.utils.np_utils.to_categorical?

A: No.

from keras import utils
onehot_utils = utils.to_categorical(y_positives)
onehot_nputils = utils.np_utils.to_categorical(y_positives)
np.array_equal(onehot_utils, onehot_nputils)

We get:

True

Let us list the data side-by-side before and after one-hot encoding:

for before, after in zip(y_positives, onehot_utils):
    print(before, after)

We get:

7 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] 
4 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
4 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
5 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
8 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
0 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
9 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
7 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
5 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
3 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
0 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
6 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
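
For readers without Keras at hand, the same encoding can be sketched in plain NumPy. This is only an illustration, not how Keras implements to_categorical; the helper name one_hot and its num_classes argument are assumptions made for this example.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """One-hot encode non-negative integer labels (hypothetical helper)."""
    labels = np.asarray(labels)
    if labels.min() < 0:
        raise ValueError("labels must be non-negative")
    if num_classes is None:
        # Width of the output is the largest label plus one.
        num_classes = int(labels.max()) + 1
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[labels]

y = np.array([7, 4, 4, 5, 8, 0, 9, 7, 5, 3, 0, 6])
encoded = one_hot(y)
print(encoded.shape)
# argmax along each row recovers the original labels.
decoded = encoded.argmax(axis=1)
print(decoded)
```

Passing num_classes explicitly is useful when a batch happens not to contain the largest label, since the width would otherwise depend on the data.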

Q: Can to_categorical handle negative integers?

A: No.

y_negatives = np.subtract(y_positives, 5)
print(y_negatives)

We get:

[ 2 -1 -1  0  3 -5  4  2  0 -2 -5  1]

Now we try to one-hot encode:

onehot_negatives = utils.to_categorical(y_negatives)
for before, after in zip(y_negatives, onehot_negatives):
    print('{:2d} {}'.format(before, after))

We get:

 2 [0. 0. 1. 0. 0.]
-1 [0. 0. 0. 0. 1.]
-1 [0. 0. 0. 0. 1.]
 0 [1. 0. 0. 0. 0.]
 3 [0. 0. 0. 1. 0.]
-5 [1. 0. 0. 0. 0.]
 4 [0. 0. 0. 0. 1.]
 2 [0. 0. 1. 0. 0.]
 0 [1. 0. 0. 0. 0.]
-2 [0. 0. 0. 1. 0.]
-5 [1. 0. 0. 0. 0.]
 1 [0. 1. 0. 0. 0.]

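Note how the negative entries are encoded incorrectly, and without any error. The total number of categories shrinks to 5, which is the maximum value (4) plus one, and the negative labels appear to be treated as positions counted from the end of the row. -1 after encoding, for example, is identical to 4 after encoding, and -5 is identical to 0.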
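
The collision can be reproduced in plain NumPy. The sketch below assumes that the encoder sizes its output from the maximum label and then sets entries by integer indexing; that is a guess at the mechanism, consistent with the output above, and not a copy of the Keras source.

```python
import numpy as np

y = np.array([2, -1, -1, 0, 3, -5, 4, 2, 0, -2, -5, 1])

# Only the maximum is used to size the output: 4 + 1 = 5 columns.
num_classes = int(y.max()) + 1
encoded = np.zeros((len(y), num_classes))

# Negative column indices count from the end of the row,
# so -1 lands in column 4 and -5 lands in column 0.
encoded[np.arange(len(y)), y] = 1

columns = encoded.argmax(axis=1)
for label, col in zip(y, columns):
    print('{:2d} -> column {}'.format(label, col))
```

With five columns, each label ends up in column label % 5, which is why distinct labels share an encoding.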

Q: So how do we one-hot encode data with negative integers?

A: Easy. Shift the labels so that the smallest one becomes 0, then encode.

y_shifted_negatives = y_negatives - y_negatives.min()
onehot_shifted_negatives = utils.to_categorical(y_shifted_negatives)
for before, after in zip(y_negatives, onehot_shifted_negatives):
    print('{:2d} {}'.format(before, after))

We get:

 2 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
-1 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
-1 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 0 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 3 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
-5 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 4 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 2 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 0 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
-2 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
-5 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 1 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
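
Shifting only helps if the offset is kept, because it is needed to map predictions back to the original labels. A minimal round-trip sketch in plain NumPy follows; encode_shifted and decode_shifted are hypothetical helper names, and np.eye stands in for to_categorical so that the example runs without Keras.

```python
import numpy as np

def encode_shifted(labels):
    """Shift labels so the minimum is 0, then one-hot encode (hypothetical helper)."""
    labels = np.asarray(labels)
    offset = int(labels.min())
    shifted = labels - offset
    onehot = np.eye(int(shifted.max()) + 1)[shifted]
    # Return the offset too; without it the encoding cannot be undone.
    return onehot, offset

def decode_shifted(onehot, offset):
    """Invert encode_shifted: column index plus the stored offset."""
    return onehot.argmax(axis=1) + offset

y = np.array([2, -1, -1, 0, 3, -5, 4, 2, 0, -2, -5, 1])
onehot, offset = encode_shifted(y)
restored = decode_shifted(onehot, offset)
print(offset, onehot.shape)
print(restored)
```

The offset should be computed once on the training labels and reused for any later data, otherwise the same label could end up in a different column.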

Scikit-learn

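scikit-learn's OneHotEncoder can one-hot encode the negative integers directly, with no shifting:

from sklearn.preprocessing import OneHotEncoder
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer versions.
enc = OneHotEncoder(categories='auto', sparse=False)
# The encoder expects a 2-D array with one column per feature.
onehot_sklearn = enc.fit_transform(y_negatives.reshape([-1, 1]))
for before, after in zip(y_negatives, onehot_sklearn):
    print('{:2d} {}'.format(before, after))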

We get eight columns rather than ten, because categories='auto' creates one column per distinct value found in the data (-5, -2, -1, 0, 1, 2, 3, 4), and -4 and -3 never occur:

 2 [0. 0. 0. 0. 0. 1. 0. 0.]
-1 [0. 0. 1. 0. 0. 0. 0. 0.]
-1 [0. 0. 1. 0. 0. 0. 0. 0.]
 0 [0. 0. 0. 1. 0. 0. 0. 0.]
 3 [0. 0. 0. 0. 0. 0. 1. 0.]
-5 [1. 0. 0. 0. 0. 0. 0. 0.]
 4 [0. 0. 0. 0. 0. 0. 0. 1.]
 2 [0. 0. 0. 0. 0. 1. 0. 0.]
 0 [0. 0. 0. 1. 0. 0. 0. 0.]
-2 [0. 1. 0. 0. 0. 0. 0. 0.]
-5 [1. 0. 0. 0. 0. 0. 0. 0.]
 1 [0. 0. 0. 0. 1. 0. 0. 0.]

Note: in scikit-learn versions that still have the legacy mode, omitting the categories='auto' option raises an error on negative integers:

ValueError: OneHotEncoder in legacy mode cannot handle categories encoded as negative integers. Please set categories='auto' explicitly to be able to use arbitrary integer values as category identifiers.
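
Encoding by distinct values and encoding by shifting are not quite the same thing when some values in the range never occur. The NumPy-only sketch below compares the two; it assumes, as the scikit-learn documentation describes for categories='auto', that one column is created per sorted distinct value. It is an illustration of that idea, not scikit-learn's implementation.

```python
import numpy as np

y = np.array([2, -1, -1, 0, 3, -5, 4, 2, 0, -2, -5, 1])

# One column per distinct value: np.unique returns the sorted classes and,
# for each label, the index of its class.
classes, inverse = np.unique(y, return_inverse=True)
onehot_unique = np.eye(len(classes))[inverse]

# One column per integer in [min, max]: the shift approach.
width = int(y.max() - y.min()) + 1
onehot_shifted = np.eye(width)[y - y.min()]

print(classes)
print(onehot_unique.shape, onehot_shifted.shape)

# The class array doubles as the lookup table for decoding.
restored = classes[onehot_unique.argmax(axis=1)]
print(restored)
```

Here -4 and -3 are absent from the data, so the distinct-value encoding is two columns narrower than the shifted one.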
