Category Encoders: Catalog & Experiments (Part 1)
We shall explore 24 encoders from 4 libraries:
| library | one-hot encoders | other simple encoders | contrast encoders | target/Bayesian encoders |
|---|---|---|---|---|
| sklearn | OneHotEncoder | LabelEncoder OrdinalEncoder LabelBinarizer | ||
| category_encoders | OneHotEncoder | OrdinalEncoder BinaryEncoder BaseNEncoder CountEncoder HashingEncoder | HelmertEncoder SumEncoder BackwardDifferenceEncoder PolynomialEncoder | TargetEncoder MEstimateEncoder WOEEncoder JamesSteinEncoder LeaveOneOutEncoder CatBoostEncoder GLMMEncoder |
| pandas | get_dummies | factorize | ||
| keras.utils | to_categorical | | | |
Encoders map the original categories (often dtype=string) to a set of representative values (often dtype=int for simple encoders; dtype=float for target encoders). This notebook tours the encoders listed in the table, exploring each non-target encoder one by one and producing a comparison table at the end. Target encoders are explored in detail in a separate notebook.
When to use which encoder to solve what problems? There is a good guide here: [Encode Smarter: How to Easily Integrate Categorical Encoding into Your Machine Learning Pipeline](https://innovation.alteryx.com/encode-smarter).
from sklearn import preprocessing
from category_encoders import OrdinalEncoder, OneHotEncoder, BinaryEncoder, BaseNEncoder, CountEncoder, HashingEncoder
from category_encoders import HelmertEncoder, SumEncoder, BackwardDifferenceEncoder, PolynomialEncoder
from category_encoders import TargetEncoder, MEstimateEncoder, WOEEncoder, JamesSteinEncoder, LeaveOneOutEncoder, CatBoostEncoder, GLMMEncoder
from keras import utils
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings, gc, time
warnings.simplefilter('ignore') # once | error | always | default | module
# We shall be compiling a summary table as we go along.
summary = pd.DataFrame({'inp2out_map': pd.Series(dtype=object), # input-to-output map
'nunique' : pd.Series(dtype=int), # number of unique (or distinct) values in output
'unique' : pd.Series(dtype='object'), # unique values in output
'shape' : pd.Series(dtype=int), # rows-by-columns of output array
'tictoc' : pd.Series(dtype=int)}) # computation time in seconds
summary.index.name = 'encoder'
# The grand summary is printed at the end of this notebook.
train = pd.read_csv('/kaggle/input/tabular-playground-series-mar-2021/train.csv', index_col='id') # [['cat10', 'cat5', 'target']]
train.sample(5)
Is encoding optional?¶
Not always. Some packages can’t digest string-type data without encoding. ‘Donkey’, ‘horse’ and ‘mule’, for instance, would not work whereas 0, 1 and 2 would.
Even when a package can digest unencoded data, it sometimes learns better from encoded data.
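As a minimal sketch of what "encoding" means here, the snippet below maps the three animal names to integers in plain Python. The toy list `animals` and the names `mapping` and `encoded` are made up for illustration; the library encoders explored later do the same job with more options.

```python
# Hypothetical toy data, not taken from the competition dataset.
animals = ["donkey", "horse", "mule", "horse", "donkey"]

# Assign each distinct category an integer code, in sorted order.
mapping = {category: code for code, category in enumerate(sorted(set(animals)))}

# Replace every category by its integer code.
encoded = [mapping[a] for a in animals]

print(mapping)
print(encoded)
```

Each distinct string gets exactly one integer, so the mapping is one-to-one and can be inverted if the original labels are needed again.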
Encoder types: a broad-stroke scan¶
We’ve got 2 dozen encoders here. Let’s take an overview by trying to group them into families according to observable behaviors.
%%time
# Would the output differ whether or not we supply the target as input?
# Let's run a test with 8 encoders which optionally accept the target as input:
pick = train.columns[train.columns.str.startswith('cat')]
for ncoda in [OrdinalEncoder, HelmertEncoder, SumEncoder, OneHotEncoder, BinaryEncoder, BaseNEncoder, CountEncoder, BackwardDifferenceEncoder]:
    tis = ncoda().fit_transform(train[pick])
    tat = ncoda().fit_transform(train[pick], train['target'])
    # Print 'True' if same; print 'False' otherwise
    print((tis==tat).all().all(), ncoda)
True <class 'category_encoders.ordinal.OrdinalEncoder'>
True <class 'category_encoders.helmert.HelmertEncoder'>
True <class 'category_encoders.sum_coding.SumEncoder'>
True <class 'category_encoders.one_hot.OneHotEncoder'>
True <class 'category_encoders.binary.BinaryEncoder'>
True <class 'category_encoders.basen.BaseNEncoder'>
True <class 'category_encoders.count.CountEncoder'>
True <class 'category_encoders.backward_difference.BackwardDifferenceEncoder'>
CPU times: user 3min 48s, sys: 2min 49s, total: 6min 38s
Wall time: 6min 36s
# Some encoders use the target for computing the output; they can't run without being given the target. These are the target encoders.
for ncoda in [TargetEncoder, MEstimateEncoder, WOEEncoder, JamesSteinEncoder, LeaveOneOutEncoder, CatBoostEncoder, GLMMEncoder]:
    try:
        # Run without train['target']:
        tis = ncoda().fit_transform(train[pick])
        print('Passed:', ncoda)
    except Exception as complaint:
        print(complaint)
        print('See, told ya it was going to break:', ncoda)
    gc.collect()
fit_transform() missing argument: y
See, told ya it was going to break: <class 'category_encoders.target_encoder.TargetEncoder'>
fit_transform() missing argument: y
See, told ya it was going to break: <class 'category_encoders.m_estimate.MEstimateEncoder'>
fit_transform() missing argument: y
See, told ya it was going to break: <class 'category_encoders.woe.WOEEncoder'>
fit_transform() missing argument: y
See, told ya it was going to break: <class 'category_encoders.james_stein.JamesSteinEncoder'>
fit_transform() missing argument: y
See, told ya it was going to break: <class 'category_encoders.leave_one_out.LeaveOneOutEncoder'>
fit_transform() missing argument: y
See, told ya it was going to break: <class 'category_encoders.cat_boost.CatBoostEncoder'>
fit_transform() missing argument: y
See, told ya it was going to break: <class 'category_encoders.glmm.GLMMEncoder'>
# Let's scan for encoders which output a column named 'intercept'. Its presence suggests contrast encoding, which we will see in the last section.
for ncoda in [OrdinalEncoder, OneHotEncoder, BinaryEncoder, BaseNEncoder, CountEncoder, HashingEncoder,
              HelmertEncoder, SumEncoder, BackwardDifferenceEncoder, PolynomialEncoder]:
    out = ncoda().fit_transform(train[pick])
    if 'intercept' in out.columns:
        print(str(ncoda))
<class 'category_encoders.helmert.HelmertEncoder'>
<class 'category_encoders.sum_coding.SumEncoder'>
<class 'category_encoders.backward_difference.BackwardDifferenceEncoder'>
<class 'category_encoders.polynomial.PolynomialEncoder'>
One-to-one simple, target-independent encoders¶
# Let's zoom into a single column.
train['cat10'].nunique(), train['cat10'].unique()
# cat10 alone has 299 unique values altogether. This value is termed 'cardinality'.
# This is an extreme case. Cardinalities are usually lower e.g. exam grades = A, B, C, D, E would have cardinality=5.
(299,
array(['LO', 'HJ', 'DJ', 'KV', 'DP', 'GE', 'HQ', 'HC', 'EK', 'GS', 'HG',
'BY', 'HX', 'JK', 'FJ', 'LM', 'HK', 'MD', 'IG', 'JG', 'AN', 'AD',
'MC', 'KW', 'CK', 'LF', 'CS', 'GK', 'DC', 'LB', 'FM', 'IH', 'LN',
'IK', 'DF', 'IB', 'CB', 'LY', 'JW', 'FI', 'CR', 'IE', 'LE', 'HB',
'HV', 'LG', 'BG', 'KP', 'LI', 'HL', 'BF', 'LU', 'O', 'GI', 'DQ',
'IR', 'DV', 'HA', 'KB', 'FP', 'AT', 'IF', 'HN', 'GC', 'C', 'KC',
'G', 'JA', 'CU', 'BC', 'AB', 'KF', 'MB', 'HE', 'BL', 'FQ', 'IA',
'MJ', 'FO', 'V', 'JT', 'AU', 'IO', 'GQ', 'CC', 'JR', 'BM', 'HH',
'AV', 'GT', 'I', 'IU', 'JN', 'EV', 'MV', 'EQ', 'LW', 'FN', 'IT',
'AA', 'DK', 'IJ', 'GU', 'P', 'JH', 'CM', 'GA', 'R', 'LX', 'IX',
'DY', 'D', 'FL', 'CP', 'GL', 'DI', 'CD', 'IV', 'FS', 'FR', 'J',
'MP', 'MH', 'EL', 'JD', 'AP', 'AE', 'F', 'LC', 'BP', 'BI', 'MF',
'DO', 'MG', 'MT', 'LD', 'CW', 'KS', 'BV', 'JV', 'BB', 'AM', 'KX',
'FK', 'AH', 'LV', 'W', 'DU', 'FB', 'JX', 'KA', 'CO', 'AR', 'KR',
'JI', 'T', 'JP', 'LQ', 'FX', 'FD', 'EY', 'Y', 'JO', 'EC', 'HM',
'AC', 'DW', 'HU', 'FH', 'AY', 'AL', 'GD', 'GB', 'DS', 'FT', 'KH',
'CG', 'JB', 'E', 'CN', 'BT', 'X', 'BX', 'HW', 'EI', 'ID', 'KT',
'GR', 'L', 'KG', 'EA', 'HO', 'GX', 'K', 'AS', 'DM', 'AK', 'FC',
'MS', 'HR', 'EU', 'ES', 'JY', 'HP', 'KL', 'FE', 'CY', 'EO', 'KJ',
'CJ', 'CI', 'JL', 'IC', 'S', 'DH', 'GN', 'BS', 'AG', 'M', 'EW',
'FA', 'LJ', 'GJ', 'KQ', 'HF', 'MR', 'BQ', 'ED', 'FG', 'LL', 'EG',
'HY', 'EH', 'GW', 'BD', 'IQ', 'Q', 'DA', 'DD', 'GM', 'KN', 'MQ',
'GY', 'KD', 'JJ', 'CL', 'IY', 'KU', 'CT', 'KK', 'DN', 'BO', 'IP',
'LH', 'IM', 'DE', 'ME', 'EE', 'LT', 'LR', 'MI', 'CF', 'DR', 'EB',
'KI', 'DX', 'DL', 'MW', 'FF', 'EF', 'EP', 'MU', 'MA', 'GG', 'CQ',
'DT', 'FV', 'CH', 'AF', 'AJ', 'IN', 'JC', 'EN', 'JU', 'JE', 'ML',
'AW', 'HI', 'MO', 'GF', 'MK', 'GH', 'FW', 'GV', 'JF', 'BA', 'LK',
'IL', 'CX'], dtype=object))
%%time
for which in [preprocessing.LabelEncoder, preprocessing.OrdinalEncoder, OrdinalEncoder, # Section 1
              preprocessing.OneHotEncoder, OneHotEncoder, # Section 2
              preprocessing.LabelBinarizer, BinaryEncoder, BaseNEncoder, # Section 3
              CountEncoder, # Section 4
              HelmertEncoder, SumEncoder, BackwardDifferenceEncoder]: # Section 5
    # The sklearn feature encoders expect a 2-D array, hence the reshape:
    if which==preprocessing.OrdinalEncoder or which==preprocessing.OneHotEncoder:
        inp = train['cat10'].values.reshape(-1, 1)
    else:
        inp = train['cat10']
    tic = time.time()
    if which==preprocessing.OneHotEncoder:
        # sparse=False returns a dense array (the parameter is named sparse_output in scikit-learn >= 1.2)
        out = which(sparse=False).fit_transform(inp)
    else:
        out = which().fit_transform(inp)
    tictoc = time.time() - tic
    # Pair each input category with its encoded row and keep the distinct pairs:
    inp2out_map = pd.concat([pd.DataFrame({'inp': train['cat10']}, columns=['inp']),
                             pd.DataFrame(out, index=train.index)], axis=1).drop_duplicates()
    inp2out_map.set_index('inp', inplace=True, drop=True)
    unik = np.unique(inp2out_map.values)
    # Grab the label, apply some minor cosmetics:
    label = str(which).replace("<class '", "").replace("'>", "")
    if inp2out_map.isnull().any().any():
        print(label, "doesn't map one-to-one")
    summary.loc[label] = inp2out_map, len(unik), unik, inp2out_map.shape, tictoc
columns_show = ['nunique', 'unique', 'shape', 'tictoc']
summary[columns_show]
CPU times: user 29.3 s, sys: 8.26 s, total: 37.5 s
Wall time: 37.5 s
1. Label & Ordinal encoders¶
From the table we find that the first 3 rows:
sklearn.preprocessing._label.LabelEncoder
sklearn.preprocessing._encoders.OrdinalEncoder
category_encoders.ordinal.OrdinalEncoder
are rather similar to each other:
they all output 299 unique numbers, where 299 is the cardinality of the original input;
they all output a single column;
they basically do one-to-one mapping of the original input;
they run quickly compared to the rest.
1.1 LabelEncoder vs OrdinalEncoder¶
LabelEncoder encodes one variable at a time; meant for encoding target labels (as in classification problems).
OrdinalEncoder encodes multiple variables/columns at a time; meant for encoding features (plural).
Let’s see that in action:
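try:
    # Caution: only the preprocessing module was imported, so the bare name LabelEncoder
    # raises a NameError here (see the message below). preprocessing.LabelEncoder() would
    # also fail on this two-column input, because it accepts only 1-D input.
    out = LabelEncoder().fit_transform(train[['cat10', 'cat5']])
except Exception as complaint:
    print(complaint)
    print('See, told ya it was going to break.')
name 'LabelEncoder' is not defined
See, told ya it was going to break.
out = OrdinalEncoder().fit_transform(train[['cat10', 'cat5']])
# no complaints
1.2 pandas does label encoding too¶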
def redressOutput(out):
    inp2out_map = pd.concat([pd.DataFrame({'inp': train['cat10']}, columns=['inp']),
                             pd.DataFrame(out, index=train.index)], axis=1).drop_duplicates()
    inp2out_map.set_index('inp', inplace=True, drop=True)
    unik = np.unique(inp2out_map.values)
    return inp2out_map, len(unik), unik, inp2out_map.shape
tic = time.time()
out = pd.factorize(train['cat10'])[0]
tictoc = time.time() - tic
summary.loc['pd.factorize'] = redressOutput(out) + (tictoc, )
labelordinal_encoders = ['sklearn.preprocessing._label.LabelEncoder',
'sklearn.preprocessing._encoders.OrdinalEncoder',
'category_encoders.ordinal.OrdinalEncoder',
'pd.factorize']
summary.loc[labelordinal_encoders, columns_show ]
# like scikit's LabelEncoder, pd.factorize can only handle one column at a time
try:
    out = pd.factorize(train[['cat10', 'cat5']])
except Exception as complaint:
    print(complaint)
    print('See, told ya it was going to break.')
could not broadcast input array from shape (300000,2) into shape (300000)
See, told ya it was going to break.
summary.loc[ summary.index.str.contains('OneHot') , columns_show ]
# We've got two one-hot encoders so far. One from sklearn.preprocessing; another by category_encoders. Both work in a similar way. We can use either.
Compared to label and ordinal encoders, we find that with one-hot encoders:
nunique dropped from 299 to 2;
the number of columns increased from 1 to 299.
Let’s see how a one-hot encoder maps input to output:
inp2out_map = summary.loc['category_encoders.one_hot.OneHotEncoder', 'inp2out_map']
inp2out_map
One-hot encoding is thus named because for each row there is strictly one 1; all other columns must be zero. Let’s do a quick check:
for row_idx, row_data in inp2out_map.iterrows():
    vcount = row_data.value_counts().sort_index()
    if not (vcount==pd.Series({0: 298, 1: 1})).all():
        print('oopsy')
# Loop passes without any oopsy, confirming that each row had strictly 1 one and 298 zeros.
# Let's take the chance to visualise the input-to-output mapping.
plt.imshow(inp2out_map, cmap='gray'); plt.axis('equal'); _ = plt.axis('off')
# black = zero; white = one. We find strictly 1 one on each row, zero everywhere else.
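To make the "strictly one 1 per row" property concrete, here is a small plain-Python sketch of one-hot encoding. The three-category list and the helper `one_hot` are hypothetical and only illustrate the idea; they are not how the library encoders are implemented.

```python
# Hypothetical 3-category example (the real cat10 column has 299 categories).
categories = ["LO", "HJ", "DJ"]

# Each category gets its own column position.
position = {category: i for i, category in enumerate(categories)}

def one_hot(value):
    """Return a row of zeros with a single 1 at the category's position."""
    row = [0] * len(categories)
    row[position[value]] = 1
    return row

rows = [one_hot(c) for c in categories]
for category, row in zip(categories, rows):
    print(category, row)

# Every row contains exactly one 1, mirroring the check above.
ones_per_row = [sum(row) for row in rows]
print(ones_per_row)
```

With k categories the output has k columns, which is why the column count for cat10 jumps from 1 to 299.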
2.2 by pandas¶
tic = time.time()
out = pd.get_dummies(train['cat10'])
tictoc = time.time() - tic
summary.loc['pd.get_dummies'] = redressOutput(out) + (tictoc, )
onehot_encoders = ['sklearn.preprocessing._encoders.OneHotEncoder',
'category_encoders.one_hot.OneHotEncoder',
'pd.get_dummies']
summary.loc[onehot_encoders, columns_show ]
2.3 by keras¶
keras.utils.to_categorical also one-hot encodes, but with numeric input only. cat10 is string, not numeric, so we would need to first convert it from string to numeric.
try:
    utils.to_categorical(train['cat10'])
except Exception as complaint:
    print(complaint)
    print('See, told ya it was going to break.')
invalid literal for int() with base 10: 'LO'
See, told ya it was going to break.
tic = time.time()
borrow = preprocessing.LabelEncoder().fit_transform(train['cat10'])
out = utils.to_categorical(borrow)
tictoc = time.time() - tic
summary.loc['utils.to_categorical'] = redressOutput(out) + (tictoc, )
onehot_encoders = ['sklearn.preprocessing._encoders.OneHotEncoder',
'category_encoders.one_hot.OneHotEncoder',
'pd.get_dummies',
'utils.to_categorical']
summary.loc[onehot_encoders, columns_show ]
# We have at our disposal 4 one-hot encoders by different libraries.
Warning¶
keras.utils.to_categorical doesn’t work with negative input.
inp = [0, 1, 2, 3, 4]
out = utils.to_categorical(inp)
len(pd.DataFrame(out).drop_duplicates())
# All good: 5 unique values in, 5 unique values out.
5
inp = [-1, 0, 1, 2, 3]
out = utils.to_categorical(inp)
len(pd.DataFrame(out).drop_duplicates())
# 5 unique values in but just 4 out. What's happening here?
4
inp = [-2, -1, 0, 1, 2]
out = utils.to_categorical(inp)
len(pd.DataFrame(out).drop_duplicates())
# Now it's even worse: 5 unique values in, just 3 unique values out.
3
for before, after in zip(inp, out):
    print(before, after)
# This explains why. Negative values weren't mapped the way we thought.
# -2 was mapped to the same outcome as 1.
# -1 got mapped to the same outcome as 2.
-2 [0. 1. 0.]
-1 [0. 0. 1.]
0 [1. 0. 0.]
1 [0. 1. 0.]
2 [0. 0. 1.]
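The collisions above are what one would expect if the output width is taken as max(label) + 1 and the 1 is written by array indexing, where a negative index counts from the end. The sketch below reproduces that behaviour in plain Python. The helper `to_one_hot_like` is hypothetical and only assumes this mechanism; it is not the keras implementation.

```python
def to_one_hot_like(labels):
    """Mimic the observed behaviour: width = max + 1, the 1 is placed by indexing."""
    num_classes = max(labels) + 1
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # a negative label indexes from the end of the row
        rows.append(row)
    return rows

labels = [-2, -1, 0, 1, 2]
rows = to_one_hot_like(labels)
for label, row in zip(labels, rows):
    print(label, row)

# Count the distinct output rows: fewer than the 5 distinct inputs.
n_distinct = len({tuple(row) for row in rows})
print(n_distinct)

# One possible workaround: shift the labels so that the smallest becomes 0.
shifted = [label - min(labels) for label in labels]
n_distinct_shifted = len({tuple(row) for row in to_one_hot_like(shifted)})
print(n_distinct_shifted)
```

Shifting the labels to start at 0 (or label-encoding them first, as done with cat10 above) keeps the mapping one-to-one.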
3. Binary & Base-N Encoders¶
Base-N encoding is the superset of
binary encoding (N=2);
one-hot encoding (N=1).
By default category_encoders.BaseNEncoder takes N=2; the output is therefore identical to that of category_encoders.BinaryEncoder:
summary.loc[ ['category_encoders.binary.BinaryEncoder', 'category_encoders.basen.BaseNEncoder'] ][ columns_show ]
tis = summary.loc['category_encoders.binary.BinaryEncoder', 'inp2out_map']
tat = summary.loc['category_encoders.basen.BaseNEncoder', 'inp2out_map']
(tis==tat).all().all()
True
But do we really need 10 columns?¶
2^8 = 256 and 2^9 = 512; since 256 < 299 <= 512, we should only need 9 columns to binary-encode 299 categories.
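The column count can be checked with a few lines of plain Python. This is a zero-based sketch with a hypothetical helper `binary_code`; the numbering that category_encoders uses internally may differ, which is one possible reason for the extra column.

```python
n_categories = 299

# Number of binary digits needed to tell apart indices 0 .. n_categories - 1.
n_bits = (n_categories - 1).bit_length()
print(n_bits)

def binary_code(index, width):
    """Return the index as a list of 0/1 digits, left-padded to the given width."""
    return [int(digit) for digit in format(index, f"0{width}b")]

# The largest index, 298, written with 9 binary digits.
last_code = binary_code(n_categories - 1, n_bits)
print(last_code)

# All 299 codes are distinct, so 9 columns are enough for a one-to-one mapping.
n_distinct_codes = len({tuple(binary_code(i, n_bits)) for i in range(n_categories)})
print(n_distinct_codes)
```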
tis.apply(lambda x: np.unique(x))
# Column cat10_0 is all zeros and is therefore redundant.
cat10_0 [0]
cat10_1 [0, 1]
cat10_2 [0, 1]
cat10_3 [0, 1]
cat10_4 [0, 1]
cat10_5 [0, 1]
cat10_6 [0, 1]
cat10_7 [0, 1]
cat10_8 [0, 1]
cat10_9 [0, 1]
dtype: object
# We can pass the option drop_invariant=True to avoid that redundancy.
BaseNEncoder(drop_invariant=True).fit_transform(train['cat10'])
# Now the redundant column disappears; we get 9 columns instead of 10.
4. Count Encoder¶
summary.loc['category_encoders.count.CountEncoder', 'inp2out_map']
# The count encoder seems to output all sorts of integers.
# Let's take a look at where those values come from. As a sample, we take the last 3 values and try to derive them.
out = CountEncoder().fit_transform(train['cat10'], train['target'])
# Note: inp still holds the list left over from the keras example above.
inp, out.tail(3)
([-2, -1, 0, 1, 2],
cat10
id
499996 3011
499997 565
499999 5917)
# Where did 3011 come from?
(train['cat10']=='HC').sum()
3011
# Where did 565 come from?
(train['cat10']=='BF').sum()
565
# Where did 5917 come from?
(train['cat10']=='LM').sum()
5917
5. Contrast Encoders¶
Contrast encoders are characterised by the presence of an intercept column in the output.
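As a concrete instance of a contrast coding with an intercept column, the sketch below builds the textbook sum (deviation) coding matrix for three levels in plain Python. The helper `sum_coding` is hypothetical; the column order and the choice of reference level in category_encoders may differ from this sketch.

```python
def sum_coding(k):
    """Textbook sum (deviation) coding for k levels, with a leading intercept column."""
    rows = []
    for level in range(k):
        if level < k - 1:
            # Levels 0 .. k-2 each get a single 1 in their own contrast column.
            contrasts = [1.0 if j == level else 0.0 for j in range(k - 1)]
        else:
            # The last level is coded as -1 in every contrast column.
            contrasts = [-1.0] * (k - 1)
        rows.append([1.0] + contrasts)  # the leading 1.0 is the intercept
    return rows

matrix = sum_coding(3)
for row in matrix:
    print(row)

# Column sums: the intercept sums to k, every contrast column sums to zero.
column_sums = [sum(column) for column in zip(*matrix)]
print(column_sums)
```

With k levels the matrix has one intercept column plus k - 1 contrast columns, so the 299 categories of cat10 would give 299 columns in total.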
5.1 Helmert Encoder¶
summary.loc[ 'category_encoders.helmert.HelmertEncoder', 'inp2out_map' ]
5.2 Sum Encoder¶
inp2out_map = summary.loc[ 'category_encoders.sum_coding.SumEncoder', 'inp2out_map' ]
inp2out_map
column_sum = inp2out_map.sum()
column_sum
# This is the signature of sum encoding: except for the intercept column, all columns sum to zero.
intercept 299.0
cat10_0 0.0
cat10_1 0.0
cat10_2 0.0
cat10_3 0.0
...
cat10_293 0.0
cat10_294 0.0
cat10_295 0.0
cat10_296 0.0
cat10_297 0.0
Length: 299, dtype: float64
column_sum[ column_sum!= 0 ]
intercept 299.0
dtype: float64
5.3 Backward-Difference Encoder¶
summary.loc[ 'category_encoders.backward_difference.BackwardDifferenceEncoder', 'inp2out_map' ]
Grand summary¶
summary[columns_show]
This notebook is getting a little long. We’ve covered simple, one-to-one mapping encoders. Let’s do target encoders in another notebook!