Nominal: no order associated, like gender (male & female) → use a Label Encoder or a mapping dictionary
Ordinal: order associated
Cyclical: Monday → Tuesday → .. → Sunday
Binary: only has 0 and 1
Rare Category: a category which is not seen very often, or a new category that is not present in train
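Cyclical features are often handled with a sine/cosine transform rather than a plain encoder, so that the last value of the cycle ends up close to the first. A minimal sketch, assuming a hypothetical day-of-week column coded 0 (Monday) to 6 (Sunday):

```python
import numpy as np
import pandas as pd

# Hypothetical day-of-week feature: 0 = Monday, ..., 6 = Sunday
days = pd.DataFrame({"day": [0, 1, 5, 6]})

# Project each day onto a circle so Sunday (6) ends up next to Monday (0)
days["day_sin"] = np.sin(2 * np.pi * days["day"] / 7)
days["day_cos"] = np.cos(2 * np.pi * days["day"] / 7)
print(days)
```

With a plain integer encoding, Sunday (6) and Monday (0) would be maximally far apart; on the circle they are adjacent.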
Rules of Thumb
Rule 1: Fill NA with a string → convert all values to strings

```python
data[feat] = data[feat].fillna("Other").astype(str)
```
Rule 2: Filter Good & Problematic Categorical Columns which will affect Encoding Procedure
For example: the unique values in Train Data differ from the unique values in Valid Data → Solution: ensure the values in Valid Data are a subset of the values in Train Data
```python
# Categorical columns in the training data
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely ordinal encoded
good_label_cols = [col for col in object_cols
                   if set(X_valid[col]).issubset(set(X_train[col]))]

# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols) - set(good_label_cols))

print('Categorical columns that will be ordinal encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)
```
The simplest approach, however, is to drop the problematic categorical columns.
```python
# Drop categorical columns that will not be encoded
X_train = X_train.drop(bad_label_cols, axis=1)
X_valid = X_valid.drop(bad_label_cols, axis=1)
```
Rule 3: Investigating Cardinality
Cardinality: # of unique entries of a categorical variable
High cardinality columns can either be dropped from the dataset, or we can use ordinal encoding.
```python
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols) - set(low_cardinality_cols))
```
Rule 4: Handle rare categories (categories seen infrequently, or new categories not present in train)
Define a criterion for calling a value "rare": categories whose count during training falls below a certain threshold are all mapped to a single "RARE" category.

```python
# Find the categories in column "ord_4" with a count less than 2000
# and assign them all to the same category "RARE"
df.loc[df["ord_4"].map(df["ord_4"].value_counts()) < 2000, "ord_4"] = "RARE"
```
The table below summarizes how to determine which encoder to use.
Type of Encoders

| Encoder | Type of Variable | Supports High Cardinality | Handles Unseen | Task | Cons |
| --- | --- | --- | --- | --- | --- |
| Label Encoding | Nominal | Yes | No | | |
| Ordinal Encoding | Ordinal | Yes | Yes | | |
| One-Hot Encoding | Nominal | No | Yes | | Large dataset (many new columns) |
| Target Encoding / Leave One Out Encoding | Nominal | Yes | Yes | Only Classification | Target leakage & uneven category distribution |
| Count / Frequency Encoding | Nominal | Yes | Yes | | Similar encodings for categories with the same counts |
| Binary / BaseN Encoding | Nominal | Yes | Yes | | |
| Hash Encoding | Nominal | Yes | Yes | | Irreversible & information loss |
Label Encoding / Ordinal Encoding
Label Encoder is used for nominal categorical variables (categories without order, e.g., red, green, blue)
It encodes only one column at a time, so a separate label encoder must be initialized for each categorical column.
Ordinal Encoder is used for ordinal categorical variables (categories with order, e.g., small, medium, large).
- If the order is not specified when initialising the Ordinal Encoder, it is equivalent to a Label Encoder.
- For example: an Ordinal Encoder can assume an ordering of the categories: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3).
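A minimal sketch of both encoders with scikit-learn (the `sample` DataFrame and its column names are invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sample = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                       "freq": ["Never", "Every day", "Rarely", "Most days"]})

# LabelEncoder: one column at a time; integer codes follow alphabetical order
le = LabelEncoder()
sample["color_encoded"] = le.fit_transform(sample["color"])  # blue=0, green=1, red=2

# OrdinalEncoder: pass the explicit ordering for the ordinal column
oe = OrdinalEncoder(categories=[["Never", "Rarely", "Most days", "Every day"]])
sample["freq_encoded"] = oe.fit_transform(sample[["freq"]]).ravel()
print(sample)
```

Note that LabelEncoder expects a 1-D array while OrdinalEncoder expects a 2-D array and can encode several columns at once.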
One-Hot Encoding
Given that the cardinality (number of categories) is n, One-Hot Encoder encodes the data by creating n additional columns.
```python
df.loc[:, "ord_2"] = df["ord_2"].fillna("Other")

ohe = preprocessing.OneHotEncoder(handle_unknown='ignore')
ohe.fit(df['ord_2'].values.reshape(-1, 1))

# Transform into a separate DataFrame, one column per category
data_ohe = pd.DataFrame(ohe.transform(df[["ord_2"]].values).toarray())
data_ohe.columns = [col for col in ohe.categories_[0]]
#    Boiling Hot  Cold  Freezing  Hot  Lava Hot  Other  Warm
# 0          0.0   0.0       0.0  1.0       0.0    0.0   0.0
# 1          0.0   0.0       0.0  0.0       0.0    0.0   1.0
# 2          0.0   0.0       1.0  0.0       0.0    0.0   0.0

# An unknown value results in an all-zeros vector
ohe.transform([['Unseen']]).toarray()
# array([[0., 0., 0., 0., 0., 0., 0.]])
```
Target Encoding / Leave One Out Encoding
Target Encoding uses the Bayesian posterior probability to encode categorical variables as the mean of the (numerical) target variable.
There are two ways to implement target encoding
Mean Encoding: The encoded values are the mean of the target values with smoothing applied
Leave-One-Out Encoding: The encoded values are the mean of the target values except for the data point that we want to predict
Target (Mean) Encoding
Example: we want to encode "Favorite Color" column with the target column "Loves Troll 2"
In the example above, less data supports the value Red (only 1 record), so we are less confident in the encoded value for Red compared with Blue and Green.
The solution for this is a weighted mean with a smoothing factor m.
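A common convention for this weighted mean is encoded = (n · category_mean + m · global_mean) / (n + m), where n is the category count. A minimal pandas sketch on toy data (the `toy` frame and its values are invented):

```python
import pandas as pd

# Toy data mirroring the "Favorite Color" example, with a binary target
toy = pd.DataFrame({"color":  ["blue", "blue", "green", "green", "green", "red"],
                    "target": [1,      0,      1,       1,       0,       1]})

m = 2.0                                   # smoothing weight (a hyperparameter)
global_mean = toy["target"].mean()
stats = toy.groupby("color")["target"].agg(["count", "mean"])

# Weighted mean: categories with few rows are pulled toward the global mean
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
toy["color_encoded"] = toy["color"].map(smoothed)
print(smoothed)
```

Red (a single row) is pulled strongly from its raw mean of 1.0 toward the global mean, while Green (three rows) barely moves, which is exactly the confidence weighting described above.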
Code Implementation
```python
import category_encoders as ce

# Target (Mean) Encoding - fit on training data, transform test data
encoder = ce.TargetEncoder(cols=["ord_2"], smoothing=1.0)
df["ord_2_encoded"] = encoder.fit_transform(df["ord_2"], df["target"])
```
Cons of Target Encoder
Target Leakage: Even with smoothing, this may result in target leakage and overfitting. Leave-One-Out Encoding and introducing Gaussian noise in the target variable can be used to address the overfitting problem
Uneven Category Distribution: The category distribution can differ in train and validation/test data and result in categories being encoded with incorrect or extreme values
Leave One Out Encoding
In leave-one-out encoding, the current row's target value is excluded when computing the mean of the target for its category, to avoid leakage.
For example, we want to encode the column ord_2 and we have
At the index 0, the ord_2 has the category Hot corresponding to target = 0
At the index 9, the ord_2 has the category Lava Hot corresponding to target = 1
|   | ord_2 | ord_2_encoded | target |
| --- | --- | --- | --- |
| 0 | Hot | 0.205179 | 0 |
| 9 | Lava Hot | 0.290751 | 1 |
For each category in the ord_2 column, calculate the count and the target sum accordingly.
For row_id = 0, Hot will be encoded as (13851 - 0) / (67508-1) = 0.205179
For row_id = 9, Lava Hot will be encoded as (18853 - 1) / (64840-1) = 0.29075
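The computation above can be reproduced on a small toy frame (values invented) to make the formula concrete: encoded = (category target sum − own target) / (category count − 1).

```python
import pandas as pd

toy = pd.DataFrame({"ord_2":  ["Hot", "Hot", "Hot", "Cold", "Cold"],
                    "target": [0,     1,     1,     0,      1]})

grp = toy.groupby("ord_2")["target"]
cat_sum = toy["ord_2"].map(grp.sum())      # per-row: total target for its category
cat_count = toy["ord_2"].map(grp.count())  # per-row: size of its category

# Leave-one-out: subtract the current row's own target before averaging
toy["ord_2_loo"] = (cat_sum - toy["target"]) / (cat_count - 1)
print(toy)
```

Note that rows sharing a category but differing in target get different encodings, which is what breaks the direct leakage of the plain target mean.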
Count / Frequency Encoding
Count and Frequency Encoding encodes categorical variables to the count of occurrences and frequency (normalized count) of occurrences respectively.
Cons: Similar encodings
If all categories have similar counts, the encoded values will be the same
```python
import category_encoders as ce

# Count Encoding - fit on training data, transform test data
encoder = ce.CountEncoder(cols="type")
data_train["type_count_encoded"] = encoder.fit_transform(data_train["type"])
data_test["type_count_encoded"] = encoder.transform(data_test["type"])

# Frequency (normalized count) Encoding
encoder = ce.CountEncoder(cols="type", normalize=True)
data_train["type_frequency_encoded"] = encoder.fit_transform(data_train["type"])
data_test["type_frequency_encoded"] = encoder.transform(data_test["type"])
```
Binary / BaseN Encoding
Binary Encoding encodes categorical variables into integers, then converts them to binary code. The output is similar to One-Hot Encoding, but fewer columns are created.
This addresses the drawback of One-Hot Encoding: a cardinality of n results in roughly log2(n) columns rather than n columns.
BaseN Encoding follows the same idea but uses a base other than 2, resulting in roughly log_base(n) columns.
Pros:
Nominal Variables: Binary and BaseN Encoder are used for nominal categorical variables
High Cardinality: Binary and BaseN encoding works well with a high number of categories
Missing or Unseen Variables: Binary and BaseN Encoder can handle unseen variables by encoding them with 0 values across all columns
```python
import category_encoders as ce

# Binary Encoding - fit on training data, transform test data
encoder = ce.BinaryEncoder()
data_encoded = encoder.fit_transform(data_train["type"])
encoder.transform(data_test["type"])

# BaseN Encoding - fit on training data, transform test data
encoder = ce.BaseNEncoder(base=5)
data_encoded = encoder.fit_transform(data_train["type"])
encoder.transform(data_test["type"])
```
Hash Encoding
Hash Encoding encodes categorical variables into distinct hash values using a hash function. The output is similar to One-Hot Encoding, but you can choose the number of columns created.
Hash encoding can encode high-cardinality data to a fixed-sized array as the number of new columns is manually specified.
Pros:
Nominal Variables: Hash Encoder is used for nominal categorical variables
High Cardinality: Hash encoding works well with a high number of categories
Missing or Unseen Variables: Hash Encoder can handle unseen values, since any new value can still be hashed into the fixed set of output columns
Cons:
Irreversible: Hashing functions are one-direction such that the original input can be hashed into a hash value, but the original input cannot be retrieved from the hash value
Information Loss or Collision: If too few columns are created, hash encoding can lead to loss of information as multiple different inputs may result in the same output from the hash function
Hash encoding can be done with FeatureHasher from the sklearn package or with HashingEncoder from the category encoders package.
```python
from sklearn.feature_extraction import FeatureHasher

# Hash Encoding - fit on training data, transform test data
encoder = FeatureHasher(n_features=2, input_type="string")
data_encoded = encoder.fit_transform(data_train["type"]).toarray()

# Using category_encoders
import category_encoders as ce

# Hash Encoding - fit on training data, transform test data
encoder = ce.HashingEncoder(n_components=2)
data_encoded = encoder.fit_transform(data_train["type"])
```