Hello,
As this reprex shows, when a predictor is an ordered factor (and possibly only when its level names are numbers), the splits reported by rpart appear to be incorrect.
In rpart(y ~ x_ordered), x_ordered=100 appears in the labels for nodes 9 and 5, where I don't think it belongs. tree(y ~ x_ordered) reports only 25 and 50 for those same nodes.
library(rpart)
library(rpart.plot)
library(tree)
x <- rep(c(2,5,10, 25, 50, 100), each = 10)
x_ordered <- as.ordered(x)
y <- x * 100
rpart(y ~ x)
#> n= 60
#>
#> node), split, n, deviance, yval
#> * denotes terminal node
#>
#> 1) root 60 711000000 3200.0000
#> 2) x< 75 50 156120000 1840.0000
#> 4) x< 37.5 40 31300000 1050.0000
#> 8) x< 17.5 30 3266667 566.6667 *
#> 9) x>=17.5 10 0 2500.0000 *
#> 5) x>=37.5 10 0 5000.0000 *
#> 3) x>=75 10 0 10000.0000 *
tree(y ~ x)
#> node), split, n, deviance, yval
#> * denotes terminal node
#>
#> 1) root 60 711000000 3200.0
#> 2) x < 75 50 156100000 1840.0
#> 4) x < 37.5 40 31300000 1050.0
#> 8) x < 17.5 30 3267000 566.7 *
#> 9) x > 17.5 10 0 2500.0 *
#> 5) x > 37.5 10 0 5000.0 *
#> 3) x > 75 10 0 10000.0 *
# rpart(y ~ x) |> rpart.plot::prp(type = 5)
rpart(y ~ x_ordered)
#> n= 60
#>
#> node), split, n, deviance, yval
#> * denotes terminal node
#>
#> 1) root 60 711000000 3200.0000
#> 2) x_ordered=2,5,10,25,50 50 156120000 1840.0000
#> 4) x_ordered=2,5,10,25 40 31300000 1050.0000
#> 8) x_ordered=2,5,10 30 3266667 566.6667 *
#> 9) x_ordered=25,50,100 10 0 2500.0000 *
#> 5) x_ordered=50,100 10 0 5000.0000 *
#> 3) x_ordered=100 10 0 10000.0000 *
tree(y ~ x_ordered)
#> node), split, n, deviance, yval
#> * denotes terminal node
#>
#> 1) root 60 711000000 3200.0
#> 2) x_ordered: 2,5,10,25,50 50 156100000 1840.0
#> 4) x_ordered: 2,5,10,25 40 31300000 1050.0
#> 8) x_ordered: 2,5,10 30 3267000 566.7 *
#> 9) x_ordered: 25 10 0 2500.0 *
#> 5) x_ordered: 50 10 0 5000.0 *
#> 3) x_ordered: 100 10 0 10000.0 *
# rpart(y ~ x_ordered) |> rpart.plot::prp(type = 5)
Created on 2025-04-18 with reprex v2.1.1
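One way to check whether the observations with x_ordered=100 really end up in those nodes, or whether only the printed labels are off, is to cross-tabulate the factor levels against the leaf each observation is assigned to. This is a minimal sketch that reuses the objects from the reprex; `fit` and `node_id` are names introduced here for illustration, and it relies on the `where` and `frame` components documented for rpart objects.

```r
library(rpart)

# Same data as in the reprex above
x <- rep(c(2, 5, 10, 25, 50, 100), each = 10)
x_ordered <- as.ordered(x)
y <- x * 100

fit <- rpart(y ~ x_ordered)

# fit$where holds, for each observation, the row of fit$frame for its leaf;
# the row names of fit$frame are the node numbers shown in the printout.
node_id <- as.integer(rownames(fit$frame))[fit$where]

# Rows are factor levels, columns are leaf node numbers
table(level = x_ordered, node = node_id)
```

If the level 100 only has counts in the column for node 3, the fitted tree itself agrees with tree() and the numeric fit, and the discrepancy would be limited to how the split labels are printed.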
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.0 (2024-04-24)
#> os Ubuntu 22.04.5 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en_CA:en
#> collate en_CA.UTF-8
#> ctype en_CA.UTF-8
#> tz America/Toronto
#> date 2025-04-18
#> pandoc 3.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.1)
#> digest 0.6.37 2024-08-19 [1] CRAN (R 4.4.1)
#> evaluate 1.0.1 2024-10-10 [1] CRAN (R 4.4.1)
#> fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.1)
#> fs 1.6.5 2024-10-30 [1] CRAN (R 4.4.1)
#> glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.1)
#> htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.1)
#> knitr 1.49 2024-11-08 [1] CRAN (R 4.4.1)
#> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.1)
#> reprex 2.1.1 2024-07-06 [1] RSPM
#> rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.1)
#> rmarkdown 2.29 2024-11-04 [1] CRAN (R 4.4.1)
#> rpart * 4.1.23 2023-12-05 [2] CRAN (R 4.4.0)
#> rpart.plot * 3.1.2 2024-02-26 [1] RSPM
#> rstudioapi 0.17.1 2024-10-22 [1] RSPM (R 4.4.0)
#> sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.4.0)
#> tree * 1.0-44 2024-12-11 [1] RSPM
#> withr 3.0.2 2024-10-28 [1] CRAN (R 4.4.1)
#> xfun 0.49 2024-10-31 [1] CRAN (R 4.4.1)
#> yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.1)
#>
#> [1] /home/mstruong/R/x86_64-pc-linux-gnu-library/4.4
#> [2] /opt/R/4.4.0/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────