Possible Bug with Asymmetric Loss Matrices in ginipred

This issue may be related to #65. Like the author of that issue, I also encountered some `xerror` that are quite different from `rel error` when using an asymmetric loss matrix.

I think this might reflect a possible bug on Line 160 of gini.c

https://github.com/bethatkinson/rpart/blob/6bd282ad284db1dd777c2b4624b2d6ceb779239e/src/gini.c#L160

It seems to me that the line is reading the loss matrix in the wrong order, but I'm not 100% sure. I barely know any C language. I used Claude a lot when diagnosing the problem. It also helped me write the rest of this issue : )

Here's a reproducible example:

```
library(rpart)

data(kyphosis)

# kyphosis$Kyphosis: "absent" (majority, n=64), "present" (minority, n=17)
# Factor levels are alphabetical: absent=1, present=2
n       <- nrow(kyphosis)
n_abs   <- sum(kyphosis$Kyphosis == "absent")   # 64
n_pres  <- sum(kyphosis$Kyphosis == "present")  # 17

# Inverse-frequency loss matrix
# L[absent -> present] = 1 (cheap), L[present -> absent] = n_abs/n_pres (expensive)
loss_mat <- matrix(c(0,             1,
                     n_abs/n_pres,  0),
                   nrow = 2, byrow = TRUE)

fit <- rpart(
  Kyphosis ~ Age + Number + Start,
  data    = kyphosis,
  method  = "class",
  parms   = list(loss = loss_mat),
  control = rpart.control(xval = n)   # LOOCV
)

# Note that with LOOCV and inverse-frequency loss, the expected behavior is
# that the model will never make a correct prediction for the test case.
# Leaving a majority case out will cause minority cases to be overrepresented 
# in the training fold, so the algorithm will predict the minority case and 
# make a mistake with the test case.
# Leaving a minority case out will cause majority cases to be overrepresented 
# in the training fold, so the algorithm will predict the minority case and 
# make a mistake with the test case.

cpt <- as.data.frame(printcp(fit))

# Under the ginipred transposition bug, the large loss n_abs/n_pres is
# applied to n_abs misclassified majority obs instead of n_pres minority obs,
# and the small loss 1 is applied to n_pres minority obs instead of n_abs.
# This yields:
#   xrisk_cv_with_bug  = n_abs * (n_abs/n_pres) + n_pres * 1  =  n_abs^2/n_pres + n_pres
#   root_risk = n_abs  (both predictions tie at the tipping point, doesn't 
#                       matter what we predict
#                       If we predict all present, we get n_abs mistakes with 
#                       loss 1, for total loss n_abs.
#                       If we predict all absent, we get n_pres mistakes with 
#                       loss n_abs/n_pres, for total loss n_abs)
#   xerror_with_bug    = xrisk_cv_with_bug / root_risk = n_abs/n_pres + n_pres/n_abs
#   xrisk_cv_correct    = n_abs * 1 + n_pres * (n_abs/n_pres) = 2*n_abs
#   xerror_correct      = xrisk_cv_correct / root_risk = 2

expected_xerror_bug     <- n_abs/n_pres + n_pres/n_abs   # ~4.03 for kyphosis
expected_xerror_correct <- 2 # expected under correct implementation

cat(sprintf("\nn_absent = %d, n_present = %d\n", n_abs, n_pres))
cat(sprintf("Observed  xerror at nsplit=0:        %.9f\n", cpt[1, "xerror"]))
cat(sprintf("Predicted xerror under bug:          %.9f\n", expected_xerror_bug))
cat(sprintf("Expected  xerror (correct behavior): ~%.1f\n", expected_xerror_correct))
cat(sprintf("Match: %s\n",
            isTRUE(all.equal(cpt[1, "xerror"], expected_xerror_bug, tolerance = 1e-9))))
```

## Possible bug: `ginipred` indexes loss matrix with transposed indices, producing incorrect cross-validated error estimates

### Summary

`ginipred` in `src/gini.c` indexes the loss matrix with swapped row/column indices relative to how the matrix is stored and how `ginidev` accesses it. This causes cross-validated error estimates in `printcp()` to silently apply the **transpose** of the user-supplied loss matrix during evaluation, while tree building uses the correct orientation. The result is that `xerror` and `xstd` in the `cptable` are incorrect whenever an asymmetric loss matrix is supplied.

### How the loss matrix is stored

In `giniinit`, the loss matrix is populated in **column-major** order:

```c
for (i = 0; i < numclass; i++)       // i = true class (row)
    for (j = 0; j < numclass; j++) { // j = predicted class (col)
        k = numclass * j + i;        // column-major: loss[pred * numclass + true]
        loss[k] = parm[numclass + k];
    }
```

So `loss[pred * numclass + true]` = L[true, pred].

### How `ginidev` accesses it (correctly)

```c
for (i = 0; i < numclass; i++)    /* i = predicted class */
    for (j = 0; j < numclass; j++)
        temp += freq[j] * loss[i * numclass + j] * prior[j];
```

Accesses `loss[pred * numclass + true]` = L[true, pred]. **Correct.**

### How `ginipred` accesses it (incorrectly)

```c
double ginipred(double *y, double *pred) {
    i = (int) y[0] - 1;       // i = true class
    j = (int) *pred - 1;      // j = predicted class
    temp = prior[i] * loss[i * numclass + j];
```

Accesses `loss[true * numclass + pred]` = L[pred, true]. **This is the transpose of the intended lookup.**

### Proposed fix

Swap the index to match the storage convention used everywhere else:

```c
temp = prior[i] * loss[j * numclass + i];
```

### Observed consequence

With a 2-class asymmetric loss matrix and LOOCV (`xval = nrow`), the `xerror` at `nsplit = 0` equals the ratio of majority to minority class observations — a value that grows arbitrarily large with class imbalance. Tracing through the LOOCV logic in `xval.c` shows this arises directly from the transposed loss lookup in `ginipred`: the large off-diagonal loss intended for minority misclassification is instead applied to majority misclassification, and vice versa.

This means the `cptable` cannot be used reliably for pruning decisions whenever an asymmetric loss matrix is provided, as the cross-validated error being minimized does not correspond to the user-specified costs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible Bug with Asymmetric Loss Matrices in ginipred #74

Possible bug: `ginipred` indexes loss matrix with transposed indices, producing incorrect cross-validated error estimates

Summary

How the loss matrix is stored

How `ginidev` accesses it (correctly)

How `ginipred` accesses it (incorrectly)

Proposed fix

Observed consequence

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Possible Bug with Asymmetric Loss Matrices in ginipred #74

Description

Possible bug: ginipred indexes loss matrix with transposed indices, producing incorrect cross-validated error estimates

Summary

How the loss matrix is stored

How ginidev accesses it (correctly)

How ginipred accesses it (incorrectly)

Proposed fix

Observed consequence

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Possible bug: `ginipred` indexes loss matrix with transposed indices, producing incorrect cross-validated error estimates

How `ginidev` accesses it (correctly)

How `ginipred` accesses it (incorrectly)