Skip to content

Possible Bug with Asymmetric Loss Matrices in ginipred #74

@HanzhangRen

Description

@HanzhangRen

This issue may be related to #65. Like the author of that issue, I also encountered some xerror that are quite different from rel error when using an asymmetric loss matrix.

I think this might reflect a possible bug on Line 160 of gini.c

temp = prior[i] * loss[i * numclass + j];

It seems to me that the line is reading the loss matrix in the wrong order, but I'm not 100% sure. I barely know any C language. I used Claude a lot when diagnosing the problem. It also helped me write the rest of this issue : )

Here's a reproducible example:

library(rpart)

data(kyphosis)

# kyphosis$Kyphosis: "absent" (majority, n=64), "present" (minority, n=17)
# Factor levels are alphabetical: absent=1, present=2
n       <- nrow(kyphosis)
n_abs   <- sum(kyphosis$Kyphosis == "absent")   # 64
n_pres  <- sum(kyphosis$Kyphosis == "present")  # 17

# Inverse-frequency loss matrix
# L[absent -> present] = 1 (cheap), L[present -> absent] = n_abs/n_pres (expensive)
loss_mat <- matrix(c(0,             1,
                     n_abs/n_pres,  0),
                   nrow = 2, byrow = TRUE)

fit <- rpart(
  Kyphosis ~ Age + Number + Start,
  data    = kyphosis,
  method  = "class",
  parms   = list(loss = loss_mat),
  control = rpart.control(xval = n)   # LOOCV
)

# Note that with LOOCV and inverse-frequency loss, the expected behavior is
# that the model will never make a correct prediction for the test case.
# Leaving a majority case out will cause minority cases to be overrepresented 
# in the training fold, so the algorithm will predict the minority case and 
# make a mistake with the test case.
# Leaving a minority case out will cause majority cases to be overrepresented 
# in the training fold, so the algorithm will predict the minority case and 
# make a mistake with the test case.

cpt <- as.data.frame(printcp(fit))

# Under the ginipred transposition bug, the large loss n_abs/n_pres is
# applied to n_abs misclassified majority obs instead of n_pres minority obs,
# and the small loss 1 is applied to n_pres minority obs instead of n_abs.
# This yields:
#   xrisk_cv_with_bug  = n_abs * (n_abs/n_pres) + n_pres * 1  =  n_abs^2/n_pres + n_pres
#   root_risk = n_abs  (both predictions tie at the tipping point, doesn't 
#                       matter what we predict
#                       If we predict all present, we get n_abs mistakes with 
#                       loss 1, for total loss n_abs.
#                       If we predict all absent, we get n_pres mistakes with 
#                       loss n_abs/n_pres, for total loss n_abs)
#   xerror_with_bug    = xrisk_cv_with_bug / root_risk = n_abs/n_pres + n_pres/n_abs
#   xrisk_cv_correct    = n_abs * 1 + n_pres * (n_abs/n_pres) = 2*n_abs
#   xerror_correct      = xrisk_cv_correct / root_risk = 2

expected_xerror_bug     <- n_abs/n_pres + n_pres/n_abs   # ~4.03 for kyphosis
expected_xerror_correct <- 2 # expected under correct implementation

cat(sprintf("\nn_absent = %d, n_present = %d\n", n_abs, n_pres))
cat(sprintf("Observed  xerror at nsplit=0:        %.9f\n", cpt[1, "xerror"]))
cat(sprintf("Predicted xerror under bug:          %.9f\n", expected_xerror_bug))
cat(sprintf("Expected  xerror (correct behavior): ~%.1f\n", expected_xerror_correct))
cat(sprintf("Match: %s\n",
            isTRUE(all.equal(cpt[1, "xerror"], expected_xerror_bug, tolerance = 1e-9))))

Possible bug: ginipred indexes loss matrix with transposed indices, producing incorrect cross-validated error estimates

Summary

ginipred in src/gini.c indexes the loss matrix with swapped row/column indices relative to how the matrix is stored and how ginidev accesses it. This causes cross-validated error estimates in printcp() to silently apply the transpose of the user-supplied loss matrix during evaluation, while tree building uses the correct orientation. The result is that xerror and xstd in the cptable are incorrect whenever an asymmetric loss matrix is supplied.

How the loss matrix is stored

In giniinit, the loss matrix is populated in column-major order:

for (i = 0; i < numclass; i++)       // i = true class (row)
    for (j = 0; j < numclass; j++) { // j = predicted class (col)
        k = numclass * j + i;        // column-major: loss[pred * numclass + true]
        loss[k] = parm[numclass + k];
    }

So loss[pred * numclass + true] = L[true, pred].

How ginidev accesses it (correctly)

for (i = 0; i < numclass; i++)    /* i = predicted class */
    for (j = 0; j < numclass; j++)
        temp += freq[j] * loss[i * numclass + j] * prior[j];

Accesses loss[pred * numclass + true] = L[true, pred]. Correct.

How ginipred accesses it (incorrectly)

double ginipred(double *y, double *pred) {
    i = (int) y[0] - 1;       // i = true class
    j = (int) *pred - 1;      // j = predicted class
    temp = prior[i] * loss[i * numclass + j];

Accesses loss[true * numclass + pred] = L[pred, true]. This is the transpose of the intended lookup.

Proposed fix

Swap the index to match the storage convention used everywhere else:

temp = prior[i] * loss[j * numclass + i];

Observed consequence

With a 2-class asymmetric loss matrix and LOOCV (xval = nrow), the xerror at nsplit = 0 equals the ratio of majority to minority class observations — a value that grows arbitrarily large with class imbalance. Tracing through the LOOCV logic in xval.c shows this arises directly from the transposed loss lookup in ginipred: the large off-diagonal loss intended for minority misclassification is instead applied to majority misclassification, and vice versa.

This means the cptable cannot be used reliably for pruning decisions whenever an asymmetric loss matrix is provided, as the cross-validated error being minimized does not correspond to the user-specified costs.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions