Category Encoders: Catalog & Experiments (Part 1)

We shall explore 24 encoders from 4 libraries:

| library | one-hot encoders | other simple encoders | contrast encoders | target/Bayesian encoders |
|---|---|---|---|---|
| sklearn.preprocessing | OneHotEncoder | LabelEncoder, OrdinalEncoder, LabelBinarizer | | |
| category_encoders | OneHotEncoder | OrdinalEncoder, BinaryEncoder, BaseNEncoder, CountEncoder, HashingEncoder | HelmertEncoder, SumEncoder, BackwardDifferenceEncoder, PolynomialEncoder | TargetEncoder, MEstimateEncoder, WOEEncoder, JamesSteinEncoder, LeaveOneOutEncoder, CatBoostEncoder, GLMMEncoder |
| pandas | get_dummies | factorize | | |
| keras.utils | to_categorical | | | |

Encoders map the original categories (often dtype=string) to a set of numeric values that represent them (often dtype=int for simple encoders; dtype=float for target encoders). This notebook tours the encoders listed in the table, exploring each non-target encoder one by one and producing a comparison table at the end. Target encoders are explored in detail in a separate notebook.

Which encoder should be used for which problem? There is a good guide here: [Encode Smarter: How to Easily Integrate Categorical Encoding into Your Machine Learning Pipeline](https://innovation.alteryx.com/encode-smarter).

from sklearn import preprocessing
from category_encoders import OrdinalEncoder, OneHotEncoder, BinaryEncoder, BaseNEncoder, CountEncoder, HashingEncoder
from category_encoders import HelmertEncoder, SumEncoder, BackwardDifferenceEncoder, PolynomialEncoder
from category_encoders import TargetEncoder, MEstimateEncoder, WOEEncoder, JamesSteinEncoder, LeaveOneOutEncoder, CatBoostEncoder, GLMMEncoder
from keras import utils

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings, gc, time
warnings.simplefilter('ignore') # once | error | always | default | module

# We shall be compiling a summary table as we go along.
summary = pd.DataFrame({'inp2out_map': pd.Series(dtype=object),   # input-to-output map
                        'nunique'    : pd.Series(dtype=int),      # number of unique (or distinct) values in output
                        'unique'     : pd.Series(dtype=object),   # unique values in output
                        'shape'      : pd.Series(dtype=object),   # (rows, columns) of the input-to-output map
                        'tictoc'     : pd.Series(dtype=float)})   # computation time in seconds
summary.index.name = 'encoder'
# The grand summary is printed at the end of this notebook.

train = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/train.csv', index_col='id') # [['cat10', 'cat5', 'target']]
train.sample(5)

Is encoding optional?

Not always. Some packages can’t digest string-type data without encoding. ‘Donkey’, ‘horse’ and ‘mule’, for instance, would not work whereas 0, 1 and 2 would.

Even when a package can digest data without encoding, it sometimes learns better from encoded data.
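As a rough illustration of the point above, here is a minimal sketch on a toy column. The animal names come from the text; the variable names are made up for the example. Casting the strings straight to float fails, while integer codes go through.

```python
import numpy as np
import pandas as pd

animals = pd.Series(['donkey', 'horse', 'mule', 'horse'])

# Strings can't be cast to numbers directly ...
try:
    animals.to_numpy().astype(float)
    cast_ok = True
except ValueError:
    cast_ok = False

# ... whereas integer codes can be fed to numeric routines.
codes, categories = pd.factorize(animals)
print(cast_ok, codes.tolist(), list(categories))
```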

Encoder types: a broad-stroke scan

We’ve got 2 dozen encoders here. Let’s take an overview by trying to group them into families according to observable behaviors.

%%time
# Would the output differ whether or not we supply the target as input?
# Let's run a test with 8 encoders which optionally accept the target as input:
pick = train.columns[train.columns.str.startswith('cat')]
for ncoda in [OrdinalEncoder, HelmertEncoder, SumEncoder, OneHotEncoder, BinaryEncoder, BaseNEncoder, CountEncoder, BackwardDifferenceEncoder]:
    tis = ncoda().fit_transform(train[pick])
    tat = ncoda().fit_transform(train[pick], train['target']) 
#   Print 'True' if same; print 'False' otherwise
    print((tis==tat).all().all(), ncoda)

True <class 'category_encoders.ordinal.OrdinalEncoder'>
True <class 'category_encoders.helmert.HelmertEncoder'>
True <class 'category_encoders.sum_coding.SumEncoder'>
True <class 'category_encoders.one_hot.OneHotEncoder'>
True <class 'category_encoders.binary.BinaryEncoder'>
True <class 'category_encoders.basen.BaseNEncoder'>
True <class 'category_encoders.count.CountEncoder'>
True <class 'category_encoders.backward_difference.BackwardDifferenceEncoder'>
CPU times: user 3min 48s, sys: 2min 49s, total: 6min 38s
Wall time: 6min 36s

# Some encoders use the target for computing the output; they can't run without being given the target. These are the target encoders.
for ncoda in [TargetEncoder, MEstimateEncoder, WOEEncoder, JamesSteinEncoder, LeaveOneOutEncoder, CatBoostEncoder, GLMMEncoder]:
    try:
#       Run without train['target']:
        tis = ncoda().fit_transform(train[pick])
        print('Passed:', ncoda)    
    except Exception as complaint:
        print(complaint)
        print('See, told ya it was going to break:', ncoda)    
    gc.collect()

fit_transform() missing argument: y
See, told ya it was going to break: <class 'category_encoders.target_encoder.TargetEncoder'>
fit_transform() missing argument: y
See, told ya it was going to break: <class 'category_encoders.m_estimate.MEstimateEncoder'>
fit_transform() missing argument: y
See, told ya it was going to break: <class 'category_encoders.woe.WOEEncoder'>
fit_transform() missing argument: y
See, told ya it was going to break: <class 'category_encoders.james_stein.JamesSteinEncoder'>
fit_transform() missing argument: y
See, told ya it was going to break: <class 'category_encoders.leave_one_out.LeaveOneOutEncoder'>
fit_transform() missing argument: y
See, told ya it was going to break: <class 'category_encoders.cat_boost.CatBoostEncoder'>
fit_transform() missing argument: y
See, told ya it was going to break: <class 'category_encoders.glmm.GLMMEncoder'>

# Let's scan for encoders which output a column named 'intercept'. This suggests contrast encoding, which we will see in the last section.
for ncoda in [OrdinalEncoder, OneHotEncoder, BinaryEncoder, BaseNEncoder, CountEncoder, HashingEncoder,
              HelmertEncoder, SumEncoder, BackwardDifferenceEncoder, PolynomialEncoder]:
    out = ncoda().fit_transform(train[pick])
    if 'intercept' in out.columns:
        print(str(ncoda))

<class 'category_encoders.helmert.HelmertEncoder'>
<class 'category_encoders.sum_coding.SumEncoder'>
<class 'category_encoders.backward_difference.BackwardDifferenceEncoder'>
<class 'category_encoders.polynomial.PolynomialEncoder'>

One-to-one simple, target-independent encoders

# Let's zoom into a single column.
train['cat10'].nunique(), train['cat10'].unique()
# cat10 alone has 299 unique values altogether. This number is termed the 'cardinality'.
# This is an extreme case. Cardinalities are usually lower, e.g. exam grades = A, B, C, D, E would have cardinality=5.

(299,
 array(['LO', 'HJ', 'DJ', 'KV', 'DP', 'GE', 'HQ', 'HC', 'EK', 'GS', 'HG',
        'BY', 'HX', 'JK', 'FJ', 'LM', 'HK', 'MD', 'IG', 'JG', 'AN', 'AD',
        'MC', 'KW', 'CK', 'LF', 'CS', 'GK', 'DC', 'LB', 'FM', 'IH', 'LN',
        'IK', 'DF', 'IB', 'CB', 'LY', 'JW', 'FI', 'CR', 'IE', 'LE', 'HB',
        'HV', 'LG', 'BG', 'KP', 'LI', 'HL', 'BF', 'LU', 'O', 'GI', 'DQ',
        'IR', 'DV', 'HA', 'KB', 'FP', 'AT', 'IF', 'HN', 'GC', 'C', 'KC',
        'G', 'JA', 'CU', 'BC', 'AB', 'KF', 'MB', 'HE', 'BL', 'FQ', 'IA',
        'MJ', 'FO', 'V', 'JT', 'AU', 'IO', 'GQ', 'CC', 'JR', 'BM', 'HH',
        'AV', 'GT', 'I', 'IU', 'JN', 'EV', 'MV', 'EQ', 'LW', 'FN', 'IT',
        'AA', 'DK', 'IJ', 'GU', 'P', 'JH', 'CM', 'GA', 'R', 'LX', 'IX',
        'DY', 'D', 'FL', 'CP', 'GL', 'DI', 'CD', 'IV', 'FS', 'FR', 'J',
        'MP', 'MH', 'EL', 'JD', 'AP', 'AE', 'F', 'LC', 'BP', 'BI', 'MF',
        'DO', 'MG', 'MT', 'LD', 'CW', 'KS', 'BV', 'JV', 'BB', 'AM', 'KX',
        'FK', 'AH', 'LV', 'W', 'DU', 'FB', 'JX', 'KA', 'CO', 'AR', 'KR',
        'JI', 'T', 'JP', 'LQ', 'FX', 'FD', 'EY', 'Y', 'JO', 'EC', 'HM',
        'AC', 'DW', 'HU', 'FH', 'AY', 'AL', 'GD', 'GB', 'DS', 'FT', 'KH',
        'CG', 'JB', 'E', 'CN', 'BT', 'X', 'BX', 'HW', 'EI', 'ID', 'KT',
        'GR', 'L', 'KG', 'EA', 'HO', 'GX', 'K', 'AS', 'DM', 'AK', 'FC',
        'MS', 'HR', 'EU', 'ES', 'JY', 'HP', 'KL', 'FE', 'CY', 'EO', 'KJ',
        'CJ', 'CI', 'JL', 'IC', 'S', 'DH', 'GN', 'BS', 'AG', 'M', 'EW',
        'FA', 'LJ', 'GJ', 'KQ', 'HF', 'MR', 'BQ', 'ED', 'FG', 'LL', 'EG',
        'HY', 'EH', 'GW', 'BD', 'IQ', 'Q', 'DA', 'DD', 'GM', 'KN', 'MQ',
        'GY', 'KD', 'JJ', 'CL', 'IY', 'KU', 'CT', 'KK', 'DN', 'BO', 'IP',
        'LH', 'IM', 'DE', 'ME', 'EE', 'LT', 'LR', 'MI', 'CF', 'DR', 'EB',
        'KI', 'DX', 'DL', 'MW', 'FF', 'EF', 'EP', 'MU', 'MA', 'GG', 'CQ',
        'DT', 'FV', 'CH', 'AF', 'AJ', 'IN', 'JC', 'EN', 'JU', 'JE', 'ML',
        'AW', 'HI', 'MO', 'GF', 'MK', 'GH', 'FW', 'GV', 'JF', 'BA', 'LK',
        'IL', 'CX'], dtype=object))
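For comparison, a small sketch of how cardinality could be measured per column on a toy frame. The data and names here are invented for illustration; only `DataFrame.nunique` is relied upon.

```python
import pandas as pd

toy = pd.DataFrame({
    'grade':  ['A', 'C', 'B', 'A', 'E', 'D', 'C'],
    'animal': ['donkey', 'horse', 'mule', 'horse', 'mule', 'donkey', 'horse'],
})

# Cardinality = number of distinct values in each column.
cardinality = toy.nunique()
print(cardinality)
```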

%%time
for which in [preprocessing.LabelEncoder, preprocessing.OrdinalEncoder, OrdinalEncoder,  # Section 1
              preprocessing.OneHotEncoder, OneHotEncoder,                                # Section 2
              preprocessing.LabelBinarizer, BinaryEncoder, BaseNEncoder,                 # Section 3
              CountEncoder,                                                              # Section 4
              HelmertEncoder, SumEncoder, BackwardDifferenceEncoder]:                    # Section 5
    if which==preprocessing.OrdinalEncoder or which==preprocessing.OneHotEncoder: 
        inp = train['cat10'].values.reshape(-1, 1)
    else:
        inp = train['cat10']

    tic = time.time()
    if which==preprocessing.OneHotEncoder: 
#       Note: scikit-learn 1.2 renamed this parameter to sparse_output.
        out = which(sparse=False).fit_transform(inp)
    else:
        out = which().fit_transform(inp)
    tictoc = time.time() - tic

    inp2out_map = pd.concat([pd.DataFrame({'inp': train['cat10']}, columns=['inp']),
                             pd.DataFrame(out, index=train.index)], axis=1).drop_duplicates()
    inp2out_map.set_index('inp', inplace=True, drop=True)
    unik = np.unique(inp2out_map.values)
#   Build a readable label by stripping the "<class '...'>" wrapper:
    label = str(which).replace("<class '", "").replace("'>", "")
    if inp2out_map.isnull().any().any():
        print(label, "doesn't map one-to-one")
    summary.loc[label] = inp2out_map, len(unik), unik, inp2out_map.shape, tictoc
columns_show = ['nunique', 'unique', 'shape', 'tictoc']
summary[columns_show]

CPU times: user 29.3 s, sys: 8.26 s, total: 37.5 s
Wall time: 37.5 s

1. Label & Ordinal encoders

From the table we find that the first 3 rows:

sklearn.preprocessing._label.LabelEncoder
sklearn.preprocessing._encoders.OrdinalEncoder
category_encoders.ordinal.OrdinalEncoder

are rather similar to each other:

they all output 299 unique numbers, where 299 is the cardinality of the original input;
they all output a single column;
they basically do one-to-one mapping of the original input;
they run quickly compared to the rest.
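The one-to-one behaviour listed above can be sketched with a plain dictionary on a toy column. This is only an illustration of the idea, not how any of the three libraries implement it; each library may order or offset its codes differently.

```python
import pandas as pd

grades = pd.Series(['B', 'A', 'C', 'A', 'B'])

# Assign each distinct category an integer, in order of first appearance.
mapping = {cat: code for code, cat in enumerate(pd.unique(grades))}
encoded = grades.map(mapping)

print(mapping)
print(encoded.tolist())
```

The output is a single column whose number of unique values equals the cardinality of the input.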

1.1 LabelEncoder vs OrdinalEncoder

LabelEncoder encodes one variable at a time; it is meant for encoding target labels (as in classification problems).
OrdinalEncoder encodes multiple variables/columns at a time; it is meant for encoding features (plural).

Let’s see that in action:

try:
    out = preprocessing.LabelEncoder().fit_transform(train[['cat10', 'cat5']])
except Exception as complaint:
    print(complaint)
    print('See, told ya it was going to break.')

y should be a 1d array, got an array of shape (300000, 2) instead.
See, told ya it was going to break.

out = OrdinalEncoder().fit_transform(train[['cat10', 'cat5']])
# no complaints

1.2 pandas does label encoding too

def redressOutput(out):
    inp2out_map = pd.concat([pd.DataFrame({'inp': train['cat10']}, columns=['inp']),
                             pd.DataFrame(out, index=train.index)], axis=1).drop_duplicates()
    inp2out_map.set_index('inp', inplace=True, drop=True)
    unik = np.unique(inp2out_map.values)
    return inp2out_map, len(unik), unik, inp2out_map.shape

tic = time.time()
out = pd.factorize(train['cat10'])[0]
tictoc = time.time() - tic
summary.loc['pd.factorize'] = redressOutput(out) + (tictoc, )

labelordinal_encoders = ['sklearn.preprocessing._label.LabelEncoder',
                         'sklearn.preprocessing._encoders.OrdinalEncoder',
                         'category_encoders.ordinal.OrdinalEncoder',
                         'pd.factorize']
summary.loc[labelordinal_encoders, columns_show ]

# Like scikit-learn's LabelEncoder, pd.factorize can only handle one column at a time.
try:
    out = pd.factorize(train[['cat10', 'cat5']])
except Exception as complaint:
    print(complaint)
    print('See, told ya it was going to break.')

could not broadcast input array from shape (300000,2) into shape (300000)
See, told ya it was going to break.
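One possible workaround, sketched on invented toy data, is to call `pd.factorize` once per column via `DataFrame.apply`. The column names are borrowed from the dataset; the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({'cat10': ['LO', 'HJ', 'LO', 'DJ'],
                    'cat5':  ['BI', 'AB', 'BI', 'BI']})

# pd.factorize returns (codes, uniques); keep only the codes of each column.
encoded = toy.apply(lambda col: pd.factorize(col)[0])
print(encoded)
```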

2. One-hot encoders

2.1 by scikit-learn and category_encoders

summary.loc[ summary.index.str.contains('OneHot') , columns_show ]
# We've got two one-hot encoders so far. One from sklearn.preprocessing; another by category_encoders. Both work in a similar way. We can use either.

Compared to label and ordinal encoders, we find that with one-hot encoders:

nunique dropped from 299 to 2;
the number of columns increased from 1 to 299.

Let’s see how a one-hot encoder maps input to output:

inp2out_map = summary.loc['category_encoders.one_hot.OneHotEncoder', 'inp2out_map']
inp2out_map

One-hot encoding is so named because each row contains exactly one 1; all other columns must be zero. Let’s do a quick check:

for row_idx, row_data in inp2out_map.iterrows():
    vcount = row_data.value_counts().sort_index()
    if not (vcount==pd.Series({0: 298, 1: 1})).all():
        print('oopsy')
# Loop passes without any oopsy, confirming that each row had strictly 1 one and 298 zeros.

# Let's take the chance to visualise the input-to-output mapping.
plt.imshow(inp2out_map, cmap='gray'); plt.axis('equal'); _ = plt.axis('off')
# black = zero; white = one. We find strictly 1 one on each row, zero everywhere else.
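The same property can be reproduced on a toy column with plain NumPy. This is a minimal sketch, assuming integer codes from `pd.factorize` and an identity matrix as the lookup table; it is not the implementation used by either library above.

```python
import numpy as np
import pandas as pd

animals = pd.Series(['donkey', 'horse', 'mule', 'horse'])
codes, categories = pd.factorize(animals)

# Row i of the identity matrix is the one-hot vector for code i.
one_hot = np.eye(len(categories), dtype=int)[codes]
row_sums = one_hot.sum(axis=1)

print(one_hot)
print(row_sums)
```

Each row sums to 1, and the number of columns equals the cardinality of the input.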

2.2 by pandas

tic = time.time()
out = pd.get_dummies(train['cat10'])
tictoc = time.time() - tic
summary.loc['pd.get_dummies'] = redressOutput(out) + (tictoc, )

onehot_encoders = ['sklearn.preprocessing._encoders.OneHotEncoder',
                   'category_encoders.one_hot.OneHotEncoder',
                   'pd.get_dummies']
summary.loc[onehot_encoders, columns_show ]

2.3 by keras

keras.utils.to_categorical one-hot encodes too, but with numeric input only. cat10 is string, not numeric, so we would first need to convert it from string to numeric.

try:
    utils.to_categorical(train['cat10'])
except Exception as complaint:
    print(complaint)
    print('See, told ya it was going to break.')

invalid literal for int() with base 10: 'LO'
See, told ya it was going to break.

tic = time.time()
borrow = preprocessing.LabelEncoder().fit_transform(train['cat10'])
out = utils.to_categorical(borrow)
tictoc = time.time() - tic
summary.loc['utils.to_categorical'] = redressOutput(out) + (tictoc, )

onehot_encoders = ['sklearn.preprocessing._encoders.OneHotEncoder',
                   'category_encoders.one_hot.OneHotEncoder',
                   'pd.get_dummies',
                   'utils.to_categorical']
summary.loc[onehot_encoders, columns_show ]
# We have at our disposal 4 one-hot encoders by different libraries.

Warning

keras.utils.to_categorical doesn’t work as expected with negative input.

inp = [0, 1, 2, 3, 4]
out = utils.to_categorical(inp)
len(pd.DataFrame(out).drop_duplicates())
# All good: 5 unique values in, 5 unique values out.

5

inp = [-1, 0, 1, 2, 3]
out = utils.to_categorical(inp)
len(pd.DataFrame(out).drop_duplicates())
# 5 unique values in but just 4 out. What's happening here?

4

inp = [-2, -1, 0, 1, 2]
out = utils.to_categorical(inp)
len(pd.DataFrame(out).drop_duplicates())
# Now it's even worse: 5 unique values in, just 3 unique values out.

3

for before, after in zip(inp, out):
    print(before, after)
# This explains why. Negative values weren't mapped the way we expected:
# they appear to be treated as indices counted from the end, so
# -2 was mapped to the same output as 1, and
# -1 was mapped to the same output as 2.

-2 [0. 1. 0.]
-1 [0. 0. 1.]
0 [1. 0. 0.]
1 [0. 1. 0.]
2 [0. 0. 1.]
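The collisions printed above are what one would expect if the one-hot matrix is filled by using each label as a column index. The NumPy sketch below makes that assumption explicit; `index_based_one_hot` is a hypothetical helper written for this illustration, not a quote of the Keras source.

```python
import numpy as np

def index_based_one_hot(labels):
    """Hypothetical helper: one-hot by writing a 1 at column index `label`."""
    labels = np.asarray(labels, dtype=int)
    num_classes = labels.max() + 1            # assumes classes run 0..max
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1   # negative labels index from the end
    return out

labels = np.array([-2, -1, 0, 1, 2])
collided = index_based_one_hot(labels)

# Possible workaround: shift the labels so that the smallest becomes 0.
shifted = index_based_one_hot(labels - labels.min())

print(len(np.unique(collided, axis=0)), len(np.unique(shifted, axis=0)))
```

Under this assumption, shifting the labels to start at zero restores one distinct output row per distinct input.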

3. Binary & Base-N Encoders

Base-N encoding is a generalisation of

binary encoding (N=2);
one-hot encoding (N=1).

By default category_encoders.BaseNEncoder takes N=2; the output is therefore identical to that of category_encoders.BinaryEncoder:

summary.loc[ ['category_encoders.binary.BinaryEncoder', 'category_encoders.basen.BaseNEncoder'] ][ columns_show ]

tis = summary.loc['category_encoders.binary.BinaryEncoder', 'inp2out_map']
tat = summary.loc['category_encoders.basen.BaseNEncoder', 'inp2out_map']
(tis==tat).all().all()

True
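To illustrate the idea behind base-N encoding, the sketch below expands integer codes into their base-N digits, one digit per column. `to_base_n_digits` is a hypothetical helper written for this example; it is not the internal implementation of category_encoders.

```python
import numpy as np

def to_base_n_digits(codes, base, width):
    """Hypothetical helper: expand non-negative integer codes into `width` base-`base` digits."""
    codes = np.asarray(codes)
    digits = np.zeros((len(codes), width), dtype=int)
    # Fill from the least-significant digit (last column) backwards.
    for position in range(width - 1, -1, -1):
        digits[:, position] = codes % base
        codes = codes // base
    return digits

codes = [1, 2, 3, 4, 5]
binary = to_base_n_digits(codes, base=2, width=3)
ternary = to_base_n_digits(codes, base=3, width=2)

print(binary)
print(ternary)
```

A larger base needs fewer columns for the same number of categories, at the cost of more distinct values per column.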

But do we really need 10 columns?

2^8 = 256 and 2^9 = 512, and 256 < 299 ≤ 512, so we should only need 9 columns to binary-encode 299 categories.
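The arithmetic can be checked in a couple of lines using only the standard library; the variable names are illustrative.

```python
import math

cardinality = 299

# Smallest number of binary digits that can tell `cardinality` values apart.
bits_needed = math.ceil(math.log2(cardinality))

print(2**8, 2**9, bits_needed)   # prints: 256 512 9
```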

tis.apply(lambda x: np.unique(x))
# Column cat10_0 is all zeros and is therefore redundant.

cat10_0       [0]
cat10_1    [0, 1]
cat10_2    [0, 1]
cat10_3    [0, 1]
cat10_4    [0, 1]
cat10_5    [0, 1]
cat10_6    [0, 1]
cat10_7    [0, 1]
cat10_8    [0, 1]
cat10_9    [0, 1]
dtype: object

# We can pass the option drop_invariant=True to avoid that redundancy.
BaseNEncoder(drop_invariant=True).fit_transform(train['cat10'])
# Now the redundant column disappears; we get 9 columns instead of 10.

4. Count Encoder

summary.loc['category_encoders.count.CountEncoder', 'inp2out_map']
# The count encoder seems to output all sorts of integers.

# Let's take a look at where those values come from. As a sample we take the last 3 values and try to derive them.
out = CountEncoder().fit_transform(train['cat10'], train['target'])
train['cat10'].tail(3).tolist(), out.tail(3)

(['HC', 'BF', 'LM'],
         cat10
 id           
 499996   3011
 499997    565
 499999   5917)

# Where did 3011 come from?
(train['cat10']=='HC').sum()

3011

# Where did 565 come from?
(train['cat10']=='BF').sum()

565

# Where did 5917 come from?
(train['cat10']=='LM').sum()

5917
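The derivation above amounts to replacing each category by the number of rows in which it occurs. A minimal pandas sketch on invented toy data (not the CountEncoder implementation itself) could look like this:

```python
import pandas as pd

cat = pd.Series(['HC', 'BF', 'HC', 'LM', 'HC', 'LM'])

# Frequency of each category, then look it up row by row.
counts = cat.value_counts()
encoded = cat.map(counts)

print(encoded.tolist())
```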

5. Contrast Encoders

Contrast encoders are characterised by the presence of an intercept column in the output.

5.1 Helmert Encoder

summary.loc[ 'category_encoders.helmert.HelmertEncoder', 'inp2out_map' ]

5.2 Sum Encoder

inp2out_map = summary.loc[ 'category_encoders.sum_coding.SumEncoder', 'inp2out_map' ]
inp2out_map

column_sum = inp2out_map.sum()
column_sum
# This is the signature of sum encoding: every column except 'intercept' sums to zero.

intercept    299.0
cat10_0        0.0
cat10_1        0.0
cat10_2        0.0
cat10_3        0.0
             ...  
cat10_293      0.0
cat10_294      0.0
cat10_295      0.0
cat10_296      0.0
cat10_297      0.0
Length: 299, dtype: float64

column_sum[ column_sum!= 0 ]

intercept    299.0
dtype: float64
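The zero column sums seen above follow from the shape of a sum (deviation) contrast matrix. The sketch below builds one for a small number of levels, assuming the textbook layout in which the last level is coded as -1 in every contrast column; `sum_contrast_matrix` is a hypothetical helper, and the library may order its levels differently.

```python
import numpy as np

def sum_contrast_matrix(k):
    """Hypothetical helper: intercept plus k-1 sum-contrast columns for k levels."""
    # First k-1 levels get an identity row; the last level gets -1 everywhere.
    contrasts = np.vstack([np.eye(k - 1), -np.ones((1, k - 1))])
    intercept = np.ones((k, 1))
    return np.hstack([intercept, contrasts])

matrix = sum_contrast_matrix(4)
column_sums = matrix.sum(axis=0)

print(matrix)
print(column_sums)
```

With k levels the intercept column sums to k and every contrast column sums to zero, which matches the pattern of 299.0 followed by zeros in the output above.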

5.3 Backward-Difference Encoder

summary.loc[ 'category_encoders.backward_difference.BackwardDifferenceEncoder', 'inp2out_map' ]

Grand summary

summary[columns_show]

This notebook is getting a little long. We’ve covered simple, one-to-one mapping encoders. Let’s do target encoders in another notebook!