Skip to content

TpT#80

Open
MathiasValla wants to merge 1 commit intosimonprovost:mainfrom
MathiasValla:TpT
Open

TpT#80
MathiasValla wants to merge 1 commit intosimonprovost:mainfrom
MathiasValla:TpT

Conversation

@MathiasValla
Copy link
Copy Markdown

@MathiasValla MathiasValla commented Dec 16, 2025

Hi Simon,

I’m happy to share that the TpT implementation is now finalized code-wise on my side. This MR introduces a working TpTDecisionTreeClassifier with a depth-first builder, compatible with scikit-lexicographical-trees / scikit-longitudinal.

This PR is to consider jointly with the Scikit Lexicographical Trees PR here: simonprovost/scikit-lexicographical-trees#1

References to cite

[1] Valla, M. Time-penalised trees (TpT): introducing a new tree-based data mining algorithm for time-varying covariates. Ann Math Artif Intell 92, 1609–1661 (2024). https://doi.org/10.1007/s10472-024-09950-w

[2] Mathias Valla, Xavier Milhaud. Time-penalized trees: consistency results and simulations. 2025. ⟨hal-05022929⟩ https://cnrs.hal.science/hal-05022929

Scope of this PR (minimal, functional)

  • Estimator: TpTDecisionTreeClassifier (classification only in this PR’s target scope).
  • Builder: DepthFirstTreeBuilder (no Best-First in this scope).
  • Criterion: Gini (others left out intentionally for the minimal version).
  • I/O and structure: Cython implementation aligned with the existing tree stack.
  • (No plot_tree adaptation included here; regular sklearn.tree.plot_tree is usable for quick inspection.)

Note: I also have a working regression path and more features locally, but to keep this PR focused and easy to land, I’m proposing only the minimal, agreed-upon surface.

Where the code lives

  • TpT estimator code: scikit_longitudinal/estimators/trees/TpT/
    (primary class: TpTDecisionTreeClassifier, splitter, builder, structs)
  • Long→Wide converter (temporary): scikit_longitudinal/estimators/trees/TpT/_preprocessing.py
    This should be moved under or near LongitudinalDataset per your design. I’d be grateful if you could drop it into the right place in a temporary branch; I’ll review once moved. It currently handles long → wide only (not the reverse), without TIDAL/polars. It’s a first step toward 💡 Feature Request - From Wide to Long and vice-versa Longitudinal data formatting, inspired from TIDAL #64 but does not fully solve it.

Quick example (concise)

import pandas as pd
from sklearn import tree
from scikit_longitudinal.estimators.trees import TpTDecisionTreeClassifier

# Example dataset from tests (long format):
# columns include: ["id", "time_point", "duration", "target", ... features ...]
df = pd.read_csv("path/to/stroke.csv", sep=";")

X = df.drop(columns=["target", "cholesterol"])  # minimal cleanup for this file
y = df["target"].astype(int)

clf = TpTDecisionTreeClassifier(
    gamma=0.1,              # time penalty (λ)
    criterion="gini",
    id_col="id",
    time_col="time_point",
    duration_col="duration",
    assume_long_format=True,
    time_step=1,
    max_horizon=1000,
    min_samples_split=2,
    max_depth=1000,
    random_state=42,
).fit(X, y)

# Optional: quick textual sanity checks
print("max_depth:", clf.tree_.max_depth)
print("node_count:", clf.tree_.node_count)
print("split_time_index[:10]:", clf.tree_.split_time_index[:10])

# Optional: quick visual check (works for fast inspection)
feature_names = getattr(clf, "_wide_feature_names_", None)
_ = tree.plot_tree(
    clf, filled=True, fontsize=10,
    feature_names=feature_names, tpt_time_scale=0.5
)

That’s intentionally minimal (no non-essential utilities, no external metrics). It just loads data, fits TpT, prints a few structural fields, and plots.

Dependencies

  • I temporarily removed ray from the dependencies due to local issues.
    Please feel free to restore it where appropriate (e.g., as an optional dependency / extra for parallelism or tests).

Ask / next steps

I’ve pushed this as far as I can right now. It would be ideal if you could manage the integration so TpT aligns perfectly with sklong’s patterns (API surface, dataset plumbing via LongitudinalDataset, docs structure, examples, CI, etc.). I’ll follow up with any fixes you need during review.

I’m also working on two additional papers involving TpT and will of course cite scikit-longitudinal as the reference implementation. For any future article specifically about the implementation, I’d be happy to include you as co-author for your guidance and help.

Thanks a lot, and please let me know how you’d like to proceed with the preprocessing relocation!

— Mathias

@simonprovost
Copy link
Copy Markdown
Owner

@MathiasValla Your work is amazing! Thanks so very much! Happy new year as well!

I will be busy in the coming weeks / months but as always as soon as I can I'll work on making this PR ready — Some news incoming within the year for Sklong (positive!); and TpT will definitely be useful :)

Wishing you well,

Cheers

@MathiasValla
Copy link
Copy Markdown
Author

@simonprovost Thanks, bets wishes to you too !

Great to hear.
I only have one constraint: with your blessing, I'd like to submit the TpT code in a journal in the first half of the year (at most !), so I guess it's better to wait your integration of TpT within the existing framework. If you agree (and if interested), I could prepare a draft (including you as co-author for all your help on that integration !) in that direction.

In the meantime, I still need to finalize 2 articles on TpT, where I'll refer to sklong, of course. I'm also working on an animation explaining how TpT works (animated with manim), it could be usefull to include it on the scikit-longitudinal website (if you also think it's a good idea).

Cheers,
Mathias

@simonprovost
Copy link
Copy Markdown
Owner

simonprovost commented Jan 14, 2026

@simonprovost Thanks, bets wishes to you too !

Great to hear. I only have one constraint: with your blessing, I'd like to submit the TpT code in a journal in the first half of the year (at most !), so I guess it's better to wait your integration of TpT within the existing framework. If you agree (and if interested), I could prepare a draft (including you as co-author for all your help on that integration !) in that direction.

In the meantime, I still need to finalize 2 articles on TpT, where I'll refer to sklong, of course. I'm also working on an animation explaining how TpT works (animated with manim), it could be usefull to include it on the scikit-longitudinal website (if you also think it's a good idea).

Cheers, Mathias

You are working so efficiently @MathiasValla ! Kind of jealous 🙏

That's fantastic to hear, and thank you for your kind gesture, mate! I am very interested! I would be pleased, as I have great trust in your research. If you want to continue the discussion by email regarding adding me as the (last) co-author of your manuscript, please do so 🫡.

I suppose I'll be more transparent so you're aware. In two weeks, I'll begin a Longitudinal ML-based research visit at the University of Edinburgh, and in the meantime, I'm applying here and there arounds for Post-Docs and trying to finish writing my Ph.D. thesis, so the faster I get, the more time I'll have to work on TpT integration.

This being said, given your journal wishes, I believe I could spend time on TpT in a month or so. Perhaps set yourself a reminder (I wish there was bots agents in Github Issues 👀 that could chase meself) and chase back, saying, "Hey Simon, you are late:)"? My head tends to be everywhere at once...

This is a fantastic idea the TpT animation. I had in mind last year to create a visualisation for each of Sklong's primitives at the top of their respective API reference page so that users could grasp them easily. Definitely include it, or give it to me and I'll include it in the API ref of TpT; it's a modern documentation notion that I plan (during my Post-Doc) to extend to others so be the first one with TpT :) PS: I adore Manim! Look here: Automated machine learning for longitudinal classification — AIDA (UKC) @ https://simonprovostdev.vercel.app/talks.

Let us keep each other transparently updated

Cheers,

@MathiasValla
Copy link
Copy Markdown
Author

Hey Simon !
I hope you're well. Quick ping for an update: do you have a rough timeline for reviewing/merging the TpT integration into scikit-longitudinal? Happy to help with that, of course. 😄

@simonprovost
Copy link
Copy Markdown
Owner

Hey @MathiasValla hoping all is well! Sorry for being a lil bit AWOL I am interning at the University of Edinburgh at the moment so things are a bit intense but I aim to work on your PR within the month. I wondered if you may rebase with main in order to be "up-to-date" please?

git checkout main
git pull
git checkout -
git rebase main

And solve potential conflicts! Once you'll do that you'll have to git push --force and then I'll be able to work on a clean updated plate

Do not hesitate to ask me anything prior if you do not feel comfortable with that!

Cheers

@MathiasValla
Copy link
Copy Markdown
Author

Hi Simon ! No worries at all, hope the internship is going well. 😄

I’ve rebased my TpT branch onto main, resolved the conflicts, and force-pushed the updated branch. Everything should now be clean and up-to-date for you to review.

Let me know if you’d like me to adjust anything else !

@simonprovost
Copy link
Copy Markdown
Owner

simonprovost commented Mar 4, 2026

Hi Simon ! No worries at all, hope the internship is going well. 😄

I’ve rebased my TpT branch onto main, resolved the conflicts, and force-pushed the updated branch. Everything should now be clean and up-to-date for you to review.

Let me know if you’d like me to adjust anything else !

I set a reminder in my diary, hope to spend time on your great work mate soon! Thanks for your promptness! more than brillant!

PS: I've made some general comments!

Copy link
Copy Markdown
Owner

@simonprovost simonprovost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @MathiasValla I made some general comments not directly about the algorithms but more about Python code quality, which we expect in the main production track of Sklong. This should give you a bit of work until I’m fully available to test your amazing contribution myself on my computer :)

PS: I’ll most probably include TpT in a toolkit I’m building for social and clinical researchers at Edinburgh! That’s going to be nice!

Cheers,

Comment thread .cursor/rules/tptrules.mdc Outdated
Comment thread scikit_longitudinal/estimators/trees/TpT/tests/PBCseq2.csv Outdated
Comment thread scikit_longitudinal/estimators/trees/TpT/tests/test_cython_fix.py Outdated
Comment thread scikit_longitudinal/estimators/trees/TpT/tests/test_cython_fix.py Outdated
Comment thread scikit_longitudinal/estimators/trees/TpT/tests/test_regression_tv.py Outdated
Comment thread scikit_longitudinal/estimators/trees/TpT/TpT_decision_tree_regressor.py Outdated
Comment thread scikit_longitudinal/estimators/trees/TpT/tests/test_cython_fix.py Outdated
Comment thread scikit_longitudinal/estimators/trees/TpT/_preprocessing.py
Comment thread README.md Outdated
Comment thread pyproject.toml
@MathiasValla
Copy link
Copy Markdown
Author

Hey Simon
Thanks a lot for the thorough review 🙏 Super helpful and totally fair.
I’ll tackle your comments later this week: remove the cursor config, trim the module/file docstrings + inline separators, rename the tests more cleanly / move what makes sense, and beef up the main class docstrings with proper refs + parameter docs.

Also yeah… I’m not a dev by training, so sorry if some bits look a bit clunky and handmade 😅 . But I’m happy to clean it up and improve !

@simonprovost
Copy link
Copy Markdown
Owner

Hey Simon Thanks a lot for the thorough review 🙏 Super helpful and totally fair. I’ll tackle your comments later this week: remove the cursor config, trim the module/file docstrings + inline separators, rename the tests more cleanly / move what makes sense, and beef up the main class docstrings with proper refs + parameter docs.

Also yeah… I’m not a dev by training, so sorry if some bits look a bit clunky and handmade 😅 . But I’m happy to clean it up and improve !

That's why we have Pull Request Review do not worry mate! You've done a grand job really! Plus, the first true contributor, we had others but not from a brand new algorithm so I might say this is better than anything else done thus far 🙏

Take your time and if you have questions, do not worry no stupid questions exist here, ask anything at anytime!

Looking forward to find time to test the algorithm out!

PS: I know you're not a dev by training but you doing as great as them, so take the pride dude 👌

Cheers

@MathiasValla
Copy link
Copy Markdown
Author

Simon,

I had trouble running the test_TpT file (now in the correct tests folder). I spent some time trying to understand what went wrong, but could not figure it out, so if you manage to test it at some point, it would be great.

Apart from that, I've been able to make all modifications.

Cheers,
Mathias

@simonprovost
Copy link
Copy Markdown
Owner

Simon,

I had trouble running the test_TpT file (now in the correct tests folder). I spent some time trying to understand what went wrong, but could not figure it out, so if you manage to test it at some point, it would be great.

Apart from that, I've been able to make all modifications.

Cheers, Mathias

Hi Mathias! Hoping all is well! Thanks for your prompt work as always!

Have you tried: https://scikit-longitudinal.readthedocs.io/latest/developers/#running-tests ?

Cheers

Copy link
Copy Markdown
Owner

@simonprovost simonprovost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing changes @MathiasValla ! Thanks much, here are a couple more in the meantime !

Comment thread PHASE1_PHASE2_TEST.ipynb Outdated
Comment thread tree_cost_results.csv Outdated
Comment thread tpt_synthetic_classification_100x24.csv Outdated
Comment thread tree_cost_vs_gamma.png Outdated
Comment thread pyproject.toml Outdated
Comment on lines +89 to +90
[tool.uv.sources]
scikit-lexicographical-trees = { path = "../scikit-lexicographical-trees" }
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep this comment open for the time being please.

@simonprovost, review: simonprovost/scikit-lexicographical-trees#1

Copy link
Copy Markdown
Owner

@simonprovost simonprovost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good @MathiasValla ! I'll have to simply try it out myself on my machine to make sure it's all working but that's nice, well done! One change is needed but that's in the README :) !

I'll test that out very soon, hopefully! But just to re-assure you, having looked at the code it's now brills! And thanks thanks thanks for that!

Comment thread README.md
Comment thread scikit_longitudinal/tests/test_TpT.py
Comment thread scikit_longitudinal/estimators/trees/TpT/_preprocessing.py
@simonprovost simonprovost added documentation Improvements or additions to documentation enhancement New feature or request good first issue Good for newcomers Classification Related content to classification tests labels Mar 12, 2026
@simonprovost
Copy link
Copy Markdown
Owner

@MathiasValla we are getting there!

Do not forget to git checkout main, git pull, git checkout -, git rebase main as we've merged multi-class support #83 !

Also note that I've worked on your TpT PR on Scikit Lexicographical Trees to accelerate things out 🥳

Cheers

@simonprovost
Copy link
Copy Markdown
Owner

Hey @MathiasValla! I have finally finalised the full rebranding of Sklong. See the new README and the new docs utilising the new state-of-the-art Rust doc generator, Zensical, at https://scikit-longitudinal.readthedocs.io/latest/! Can't wait to put TpT as part of the API reference :)

Now it is much easier to read through the library, to be honest.

Cheers!

Copy link
Copy Markdown
Owner

@simonprovost simonprovost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick one!

Comment thread scikit_longitudinal/estimators/trees/TpT/TpT_decision_tree_regressor.py Outdated
Comment thread scikit_longitudinal/estimators/trees/TpT/TpT_decision_tree.py Outdated
Copy link
Copy Markdown
Owner

@simonprovost simonprovost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good @MathiasValla ! Nearly there! Only the lexicographical trees need some adjustments 👌

Two little comments below though 👇

Cheers

Comment thread scikit_longitudinal/estimators/trees/__init__.py Outdated
Comment thread scikit_longitudinal/estimators/trees/__init__.py Outdated
@MathiasValla
Copy link
Copy Markdown
Author

Hi Simon,

I finally went through your review points carefully and I addressed them in both PRs. I'll copy/paste the same comment below oin both PRs:

On the scikit-lexicographical-trees side, I removed the leftover “New TpT” mentions, rewrote the relevant comments in proper English, and restored the generic threshold_gain validation to its original range so it stays aligned with the lexicographical estimators.

On the scikit-longitudinal side, I switched the tree exports back to absolute imports and kept the broader TpT gamma validation local to the TpT wrappers, so TpT still accepts larger gamma values without widening the generic tree API.
Some precision on that point: for TpT, we need a gamma hyperparameter. Since its role is relatively close to the "threshold_gain" you've introduced, I made the (questionnable) decision to recycle "threshold_gain" and use it for "gamma". In theory, TpT's gamma can be > 1 (explaining my modification of the parameter constraints), so for now, I've kept the original threshold_gain constaints but allowed gamma to be >= 1 if needed (in practice, that is almost never usefull, but in theory that is possible). If needed, I can absolutely leave "threshold_gain" as it was and define "gamma" as a distinct hyperparameter. It's better practice, I guess ?

I also reran focused checks after that: rebuilt the fork locally, confirmed TpT still fits with gamma>1, and confirmed the generic DecisionTreeClassifier now rejects threshold_gain > 1 again.

Sorry as well for the “New TpT” mentions that reappeared earlier. I think I got a bit lost in my local commits while iterating, and I’m still learning to keep that workflow cleaner.

Cheers,
Mathias

@simonprovost
Copy link
Copy Markdown
Owner

Hello, Hello @MathiasValla! Hoping all is well!

Happy to tell you that your TpT splitter is now merged and released on PyPi, see more in [1]. Soon after merging, I just thought that we had forgotten to look at the unit tests of your PR, which is entirely my fault for not paying much attention to refactoring Scikit lexicographical trees to be a great open-source project, with automation and all (now tracked at #86). Anyway, to the best of my surprise, I run the tests and a lot of errors were introduced. Do not worry, I just merged [2] and now it is all released on PyPI 🥳

I have now updated the Scikit-Longitudinal main track to be up to date with the new version of Scikit-Lexicographical-Trees. Could you please rebase your branch with main, and make sure (i) the unit tests go through, see more in [3] (PS: If you have some Ray errors, that's totally fine), and then (ii) run your own sort of basic experiment to check if TpT behaves like you wanted, please?

Once that is good, I'll do a last pass here in the changes tab and will finally accept and merge your amazing contrib!

Let's try to speed things up because of your submission ✅

Thanks once more for your work, brills! TOP-1!

Cheers

@simonprovost simonprovost force-pushed the main branch 2 times, most recently from ec52efb to d8d9927 Compare April 15, 2026 01:25
@MathiasValla
Copy link
Copy Markdown
Author

Hi Simon,

Thanks again, and great catch on the post-merge test issue.

I rebased the branch on the updated main, reran the tests, and pushed the rebased branch. On my side, everything goes through apart from the Ray-dependent parallel tests, which fail only because ray is not installed in the environment, so that seems consistent with your note.

I also reran a basic TpT sanity check on a toy dataset, and the behavior still looks right.

So from my side the branch should be ready for your last pass.

Cheers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Classification Related content to classification documentation Improvements or additions to documentation enhancement New feature or request good first issue Good for newcomers tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants