
Categorical Encoding

Types of Categorical Variables

  • Categorical Variables:
    • Nominal: no order associated, e.g., gender (male & female) → use a Label Encoder or a mapping dictionary (see the sketch after this list)
    • Ordinal: an order is associated, e.g., small → medium → large
    • Cyclical: values wrap around, e.g., Monday → Tuesday → … → Sunday
    • Binary: only takes two values, 0 and 1
    • Rare Category: a category that is not seen very often, or a new category that is not present in train
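
A minimal sketch of the mapping-dictionary approach for a nominal column (the df and "gender" names are assumptions for illustration):

# build the category → integer map from the training data only,
# then reuse the same map everywhere so the encoding stays consistent
gender_map = {"male": 0, "female": 1}
df["gender"] = df["gender"].map(gender_map)  # values missing from the map become NaN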

Rules of Thumb

  • Rule 1: Fill NA with a string → convert all values to strings
    • data[feat] = data[feat].fillna("Other").astype(str)
  • Rule 2: Separate the good & problematic categorical columns, since problematic ones will break the encoding procedure
    • For example: the unique values in the validation data differ from those in the train data → Solution: ensure the values in the validation data are a subset of the values in the train data
# Categorical columns in the training data
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely ordinal encoded
good_label_cols = [col for col in object_cols if
                   set(X_valid[col]).issubset(set(X_train[col]))]

# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))

print('Categorical columns that will be ordinal encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)
  • The simplest approach, however, is to drop the problematic categorical columns.
# Drop categorical columns that will not be encoded
X_train = X_train.drop(bad_label_cols, axis=1)
X_valid = X_valid.drop(bad_label_cols, axis=1)
  • Rule 3: Investigate Cardinality
    • Cardinality: the number of unique entries of a categorical variable
    • High-cardinality columns can either be dropped from the dataset or ordinal-encoded; one-hot encoding them would create too many new columns
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
  • Rule 4: A rare category is a category that is not seen very often, or a new category that is not present in train
    • Define a criterion for calling a value a "rare" category, so that any category whose count falls below that threshold during training can be mapped to a single "rare" category
# find the categories in column "ord_4" whose count is less than 2000 and map them all to the single category "RARE"
df.loc[df["ord_4"].map(df["ord_4"].value_counts()) < 2000, "ord_4"] = "RARE"
  • The table below maps out how to determine which encoder to use

Type of Encoders

| Encoder | Type of Variable | Supports High Cardinality | Handles Unseen | Task | Cons |
| --- | --- | --- | --- | --- | --- |
| Label Encoding | Nominal | Yes | No | | |
| Ordinal Encoding | Ordinal | Yes | Yes | | |
| One-Hot Encoding | Nominal | No | Yes | | Large dataset (n new columns) |
| Target Encoding / Leave One Out Encoding | Nominal | Yes | Yes | Only Classification | Target leakage & uneven category distribution |
| Count / Frequency Encoding | Nominal | Yes | Yes | | Similar encodings for categories with the same counts |
| Binary / BaseN Encoding | Nominal | Yes | Yes | | |
| Hash Encoding | Nominal | Yes | Yes | | Irreversible & information loss |

Label Encoding / Ordinal Encoding

  • Label Encoder is used for nominal categorical variables (categories without order, i.e., red, green, blue)
    • It only encodes one column at a time, so a separate label encoder must be initialized for each categorical column (see the sketch below)
  • Ordinal Encoder is used for ordinal categorical variables (categories with order, i.e., small, medium, large)
    • If the order is not specified when initialising the Ordinal Encoder, it is equivalent to a Label Encoder
    • For example: this Ordinal Encoder assumes an ordering of the categories: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3)
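
A minimal sketch of keeping one fitted LabelEncoder per column (object_cols, X_train and X_valid follow the rule-of-thumb code above; the encoders dictionary name is an assumption):

from sklearn.preprocessing import LabelEncoder

# one fitted encoder per categorical column, kept around for inverse_transform later
encoders = {}
for col in object_cols:
    encoders[col] = LabelEncoder()
    X_train[col] = encoders[col].fit_transform(X_train[col])
    X_valid[col] = encoders[col].transform(X_valid[col])  # valid values must be a subset of train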


df.loc[:, "ord_2"] = df["ord_2"].fillna("Other")
# Label Encoder Example
from sklearn.preprocessing import LabelEncoder
lbl_enc = LabelEncoder()
lbl_enc.fit(df["ord_2"].values)

df.loc[:,"ord_2"] = lbl_enc.transform(df["ord_2"].values)
# getting the mapping
lbl_mapping = dict(zip(lbl_enc.classes_, lbl_enc.transform(lbl_enc.classes_)))
# {'Boiling Hot': 0, 'Cold': 1, 'Freezing': 2, 'Hot': 3, 'Lava Hot': 4, 'Other': 5, 'Warm': 6}


# Ordinal Encoder Example
from sklearn.preprocessing import OrdinalEncoder

ordinal_enc = OrdinalEncoder(
    categories=[["Freezing", "Warm", "Cold", "Boiling Hot", "Hot", "Lava Hot"]], # specify the order 0,1,2,3,4,5
    handle_unknown="use_encoded_value",
    unknown_value=-1, # unknown values (e.g. "Other") will be mapped to -1
)
# OrdinalEncoder expects a 2D array, hence the reshape
ordinal_enc.fit(df["ord_2"].values.reshape(-1, 1))

df.loc[:, "ord_2"] = ordinal_enc.transform(df["ord_2"].values.reshape(-1, 1))

# getting the mapping
encoded_categories = ordinal_enc.transform(ordinal_enc.categories_[0].reshape(-1, 1))
ordinal_mapping = dict(zip(ordinal_enc.categories_[0], encoded_categories.squeeze()))
# {'Freezing': 0.0, 'Warm': 1.0, 'Cold': 2.0, 'Boiling Hot': 3.0, 'Hot': 4.0, 'Lava Hot': 5.0}

One Hot Encoding

  • Given that the cardinality (number of categories) is n, One-Hot Encoder encodes the data by creating n additional columns.
df.loc[:, "ord_2"] = df["ord_2"].fillna("Other")

from sklearn import preprocessing

ohe = preprocessing.OneHotEncoder(handle_unknown="ignore")
ohe.fit(df["ord_2"].values.reshape(-1, 1))

# transform returns a sparse matrix; convert it to a dense DataFrame
data_ohe = pd.DataFrame(ohe.transform(df[["ord_2"]].values).toarray())
data_ohe.columns = ohe.categories_[0]

#    Boiling Hot  Cold  Freezing  Hot  Lava Hot  Other  Warm
# 0          0.0   0.0       0.0  1.0       0.0    0.0   0.0
# 1          0.0   0.0       0.0  0.0       0.0    0.0   1.0
# 2          0.0   0.0       1.0  0.0       0.0    0.0   0.0

# for an unknown value, the transform results in an all-zeros vector
ohe.transform([['Unseen']]).toarray()
#     array([[0.,  0.,       0.,   0.,        0.,    0.,   0.]])
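
For quick experiments, pandas offers a similar one-liner via get_dummies; note that, unlike OneHotEncoder, it has no handle_unknown option, so train and test must be encoded together or have their columns re-aligned afterwards (a minimal sketch):

import pandas as pd

# one-hot encode with pandas; new columns are prefixed with the original column name
data_ohe = pd.get_dummies(df["ord_2"], prefix="ord_2")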

Target Encoding / Leave One Out Encoding

  • Target Encoding encodes each category as the mean of the (numerical) target variable for that category; the smoothed version can be viewed as a Bayesian posterior estimate of that mean.
  • There are two ways to implement target encoding
    • Mean Encoding: The encoded values are the mean of the target values with smoothing applied
    • Leave-One-Out Encoding: The encoded values are the mean of the target values except for the data point that we want to predict

Target (Mean) Encoding

  • Example: we want to encode the "Favorite Color" column using the target column "Loves Troll 2"

  • In this example, because less data supports the value Red (only 1 record), we are less confident in the encoded value for Red than in those for Blue and Green
  • The solution is a weighted mean with a smoothing factor m: encoded = (n × category mean + m × overall mean) / (n + m), where n is the category count (a manual sketch follows the code below)

  • Code Implementation
import category_encoders as ce

# Target (Mean) Encoding - fit on training data, transform test data
encoder = ce.TargetEncoder(cols=["ord_2"], smoothing=1.0)
df["ord_2_encoded"] = encoder.fit_transform(df["ord_2"], df["target"])

Cons of Target Encoder

  • Target Leakage: even with smoothing, target encoding may leak the target and overfit. Leave-One-Out Encoding and introducing Gaussian noise (see the sketch below) can be used to address the overfitting problem
  • Uneven Category Distribution: The category distribution can differ in train and validation/test data and result in categories being encoded with incorrect or extreme values
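
A minimal sketch of the noise idea, here adding Gaussian noise to the encoded training column rather than the raw target (the noise scale is an assumption to tune per dataset):

import numpy as np

rng = np.random.default_rng(0)
# perturb train-time encodings so the model cannot memorise exact category means;
# leave the validation/test encodings untouched
df["ord_2_encoded"] += rng.normal(0, 0.01, size=len(df))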

Leave One Out Encoding

  • In leave-one-out encoding, the current row's own target value is excluded when computing the category mean, to avoid leakage.
  • For example, we want to encode the column ord_2 and we have:
    • At index 0, ord_2 has the category Hot with target = 0
    • At index 9, ord_2 has the category Lava Hot with target = 1
      ord_2  ord_2_encoded  target
0       Hot       0.205179       0
9  Lava Hot       0.290751       1
  • For each category in the ord_2 column, let's calculate the count and the sum of the target:
df.groupby('ord_2')['target'].agg(['count', 'sum'])
             count    sum
ord_2
Boiling Hot  84790  20689
Cold         97822  14889
Freezing    142726  18876
Hot          67508  13851
Lava Hot     64840  18853
Other        18075   3373
Warm        124239  21792
  • For row_id = 0, Hot will be encoded as (13851 - 0) / (67508 - 1) = 0.205179
  • For row_id = 9, Lava Hot will be encoded as (18853 - 1) / (64840 - 1) = 0.290751
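
A minimal sketch reproducing this arithmetic with pandas, plus the equivalent category_encoders call (column names follow the example above):

# manual leave-one-out: (category target sum - own target) / (category count - 1)
grouped = df.groupby("ord_2")["target"]
df["ord_2_encoded"] = (grouped.transform("sum") - df["target"]) / (grouped.transform("count") - 1)

# equivalent with category_encoders
import category_encoders as ce
encoder = ce.LeaveOneOutEncoder(cols=["ord_2"])
df["ord_2_encoded"] = encoder.fit_transform(df["ord_2"], df["target"])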

Count / Frequency Encoding

  • Count and Frequency Encoding encodes categorical variables to the count of occurrences and frequency (normalized count) of occurrences respectively.
  • Cons: Similar encodings
    • Categories with the same count receive the same encoded value, so the model cannot distinguish them
import category_encoders as ce

# Count Encoding - fit on training data, transform test data
encoder = ce.CountEncoder(cols="type")
data_train["type_count_encoded"] = encoder.fit_transform(data_train["type"])
data_test["type_count_encoded"] = encoder.transform(data_test["type"])

# Frequency (normalized count) Encoding
encoder = ce.CountEncoder(cols="type", normalize=True)
data_train["type_frequency_encoded"] = encoder.fit_transform(data_train["type"])
data_test["type_frequency_encoded"] = encoder.transform(data_test["type"])

Binary / BaseN Encoding

  • Binary Encoding encodes categorical variables into integers, then converts those integers to binary code. The output is similar to One-Hot Encoding, but far fewer columns are created.
  • This addresses the drawback of One-Hot Encoding: a cardinality of n results in roughly log2(n) columns instead of n columns (see the sketch after the code below).
  • BaseN Encoding follows the same idea but uses a base N other than 2, resulting in roughly logN(n) columns.
  • Pros:
    • Nominal Variables: Binary and BaseN Encoders are used for nominal categorical variables
    • High Cardinality: Binary and BaseN encoding work well with a high number of categories
    • Missing or Unseen Values: Binary and BaseN Encoders can handle unseen categories by encoding them with 0s across all columns
import category_encoders as ce

# Binary Encoding - fit on training data, transform test data
encoder = ce.BinaryEncoder()
data_train_encoded = encoder.fit_transform(data_train["type"])
data_test_encoded = encoder.transform(data_test["type"])

# BaseN Encoding - fit on training data, transform test data
encoder = ce.BaseNEncoder(base=5)
data_train_encoded = encoder.fit_transform(data_train["type"])
data_test_encoded = encoder.transform(data_test["type"])
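
To make the log2 idea concrete, here is a minimal sketch of the underlying bit expansion (the exact bit order and column names differ from category_encoders' output):

import numpy as np
import pandas as pd

# integer-encode the categories, then split each code into bits:
# n categories need only ceil(log2(n + 1)) binary columns
codes = data_train["type"].astype("category").cat.codes.to_numpy() + 1  # reserve 0 for unseen
n_bits = int(np.ceil(np.log2(codes.max() + 1)))
bits = pd.DataFrame({f"type_bin_{i}": (codes >> i) & 1 for i in range(n_bits)})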

Hash Encoding

  • Hash Encoding encodes categorical variables into distinct hash values using a hash function. The output is similar to One-Hot Encoding, but you can choose the number of columns created.
  • Hash encoding can encode high-cardinality data to a fixed-sized array as the number of new columns is manually specified.
  • Pros:
    • Nominal Variables: Hash Encoder is used for nominal categorical variables
    • High Cardinality: Hash encoding works well with a high number of categories
    • Missing or Unseen Values: Hash Encoder can handle unseen categories, since any value can be hashed into the same fixed set of columns
  • Cons:
    • Irreversible: Hashing functions are one-direction such that the original input can be hashed into a hash value, but the original input cannot be retrieved from the hash value
    • Information Loss or Collision: If too few columns are created, hash encoding can lead to loss of information as multiple different inputs may result in the same output from the hash function
  • Hash encoding can be done with FeatureHasher from the scikit-learn package or with HashingEncoder from the category_encoders package.
from sklearn.feature_extraction import FeatureHasher

# Hash Encoding - FeatureHasher is stateless, so there is nothing to fit
encoder = FeatureHasher(n_features=2, input_type="string")
# with input_type="string", each sample must be an iterable of strings,
# so wrap every value in a single-element list
data_encoded = encoder.fit_transform(data_train["type"].apply(lambda v: [v])).toarray()

# Using category_encoders
import category_encoders as ce

# Hash Encoding - fit on training data, transform test data
encoder = ce.HashingEncoder(n_components=2)
data_encoded = encoder.fit_transform(data_train["type"])