One-hot encoding using keras and scikit-learn

Jupyter notebook for this exercise is available here:

keras

Let us seed the random number generator for reproducibility, and then generate a dozen random, non-negative integers:

import numpy as np
np.random.seed(77) 
y_positives = np.random.randint(0, 10, 12)
print(y_positives)

Output:

[7 4 4 5 8 0 9 7 5 3 0 6]

Q: Is there any difference between keras.utils.to_categorical and keras.utils.np_utils.to_categorical?

A: No.

from keras import utils
onehot_utils = utils.to_categorical(y_positives)
onehot_nputils = utils.np_utils.to_categorical(y_positives)
np.array_equal(onehot_utils, onehot_nputils)

We get:

True

Let us list the data side-by-side before and after one-hot encoding:

for before, after in zip(y_positives, onehot_utils):
    print(before, after)

We get:

7 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] 
4 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] 
4 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] 
5 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] 
8 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.] 
0 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] 
9 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.] 
7 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] 
5 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] 
3 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] 
0 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] 
6 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]

Q: Can `to_categorical` handle negative integers?

A: No. It raises no error, but the encoding it returns is wrong.

y_negatives = np.subtract(y_positives, 5)
print(y_negatives)

We get:

[ 2 -1 -1  0  3 -5  4  2  0 -2 -5  1]

Now we try to one-hot encode:

onehot_negatives = utils.to_categorical(y_negatives)
for before, after in zip(y_negatives, onehot_negatives):
    print('{:2d}  {}'.format(before, after))

We get:

 2  [0. 0. 1. 0. 0.] 
-1  [0. 0. 0. 0. 1.] 
-1  [0. 0. 0. 0. 1.]  
 0  [1. 0. 0. 0. 0.] 
 3  [0. 0. 0. 1. 0.] 
-5  [1. 0. 0. 0. 0.]  
 4  [0. 0. 0. 0. 1.] 
 2  [0. 0. 1. 0. 0.] 
 0  [1. 0. 0. 0. 0.] 
-2  [0. 0. 0. 1. 0.] 
-5  [1. 0. 0. 0. 0.]
 1  [0. 1. 0. 0. 0.]

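Note how the negative entries are encoded incorrectly. The number of categories shrinks to 5, because it is inferred from the largest value (4) plus one, and each negative label is treated as an index counted from the end of the row. As a result, -1 is encoded identically to 4, -2 identically to 3, and -5 identically to 0.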
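To see where the wrap-around comes from, here is a minimal NumPy-only sketch. It assumes the encoder sizes its output as the largest label plus one and fills it with integer-array indexing; `naive_one_hot` is a hypothetical helper written for illustration, not the Keras source.

```python
import numpy as np

def naive_one_hot(labels):
    """Hypothetical stand-in for an encoder that sizes its output by max + 1."""
    labels = np.asarray(labels, dtype=int)
    num_classes = labels.max() + 1            # 5 when the largest label is 4
    encoded = np.zeros((labels.size, num_classes))
    # Integer-array indexing: a negative column index counts from the end,
    # so -1 lands in the last column and -5 in the first.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

y_negatives = np.array([2, -1, -1, 0, 3, -5, 4, 2, 0, -2, -5, 1])
encoded = naive_one_hot(y_negatives)

print(encoded.shape)                               # (12, 5)
print(np.array_equal(encoded[1], encoded[6]))      # rows for -1 and 4 collide: True
```

No exception is raised because every negative label here still falls inside the valid index range of a 5-column row.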

Q: So how do we one-hot encode data with negative integers?

A: Shift the data so that the smallest value becomes 0, then encode as before.

y_shifted_negatives = y_negatives - y_negatives.min()
onehot_shifted_negatives = utils.to_categorical(y_shifted_negatives)
for before, after in zip(y_negatives, onehot_shifted_negatives):
    print('{:2d}  {}'.format(before, after))

We get:

 2  [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] 
-1  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] 
-1  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] 
 0  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] 
 3  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.] 
-5  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] 
 4  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.] 
 2  [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] 
 0  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] 
-2  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] 
-5  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] 
 1  [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
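The shift has to be undone when we decode, so it helps to keep the offset around. The following is a small NumPy-only sketch of that round trip; it uses `np.eye` in place of `to_categorical` so that it runs without Keras, and the variable names are ours.

```python
import numpy as np

y_negatives = np.array([2, -1, -1, 0, 3, -5, 4, 2, 0, -2, -5, 1])

offset = y_negatives.min()            # -5 for this data
y_shifted = y_negatives - offset      # values now range from 0 to 9

# np.eye(n)[indices] picks one identity-matrix row per label,
# a NumPy-only way to one-hot encode non-negative integers.
onehot = np.eye(y_shifted.max() + 1)[y_shifted]

# Decode: the position of the 1 is the shifted label; add the offset back.
decoded = onehot.argmax(axis=1) + offset
print(np.array_equal(decoded, y_negatives))   # True
```

One thing to keep in mind: shifting reserves a column for every integer between the minimum and the maximum, including values such as -4 and -3 that never occur in this data.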

Scikit-learn

scikit-learn's OneHotEncoder accepts negative integers directly. With categories='auto' it creates one column per distinct value found in the data, so the eight distinct values here give eight columns rather than ten:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(categories='auto', sparse=False)
onehot_sklearn = enc.fit_transform(y_negatives.reshape([-1, 1]))
for before, after in zip(y_negatives, onehot_sklearn):
    print('{:2d}  {}'.format(before, after))

We get:

 2  [0. 0. 0. 0. 0. 1. 0. 0.]
-1  [0. 0. 1. 0. 0. 0. 0. 0.]
-1  [0. 0. 1. 0. 0. 0. 0. 0.]
 0  [0. 0. 0. 1. 0. 0. 0. 0.]
 3  [0. 0. 0. 0. 0. 0. 1. 0.]
-5  [1. 0. 0. 0. 0. 0. 0. 0.]
 4  [0. 0. 0. 0. 0. 0. 0. 1.]
 2  [0. 0. 0. 0. 0. 1. 0. 0.]
 0  [0. 0. 0. 1. 0. 0. 0. 0.]
-2  [0. 1. 0. 0. 0. 0. 0. 0.]
-5  [1. 0. 0. 0. 0. 0. 0. 0.]
 1  [0. 0. 0. 0. 1. 0. 0. 0.]

Note: in the scikit-learn version used here, omitting the categories='auto' option raises an error on negative integers:

ValueError: OneHotEncoder in legacy mode cannot handle categories encoded as negative integers. Please set categories='auto' explicitly to be able to use arbitrary integer values as category identifiers.
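If neither library is at hand, a compact encoding with one column per distinct value can be sketched in plain NumPy. This is an illustrative alternative under our own variable names, not a description of how either library is implemented.

```python
import numpy as np

y_negatives = np.array([2, -1, -1, 0, 3, -5, 4, 2, 0, -2, -5, 1])

# categories: sorted distinct labels; codes: position of each label in categories
categories, codes = np.unique(y_negatives, return_inverse=True)
onehot = np.eye(categories.size)[codes]

print(categories)      # [-5 -2 -1  0  1  2  3  4]
print(onehot.shape)    # (12, 8)

# Decode by looking the column index up in the category list.
decoded = categories[onehot.argmax(axis=1)]
print(np.array_equal(decoded, y_negatives))   # True
```

Because the columns follow the sorted distinct labels, no shifting is needed and no column is wasted on values that never occur.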
