
Stump has no leaf_count inside dump_model() output #5962

Open
thatlittleboy opened this issue Jul 8, 2023 · 4 comments · May be fixed by #5964

Comments

@thatlittleboy

thatlittleboy commented Jul 8, 2023

Description

I'll preface this by saying I'm not sure whether this is a bug; it's more that the API is slightly inconsistent.

When we have a stump (a single-node tree), the .dump_model() dictionary output doesn't contain a leaf_count for that tree; its tree_structure only has a leaf_value. I'm wondering why that is. It should be possible to assign a count (i.e., the number of samples that were used to train the model) to the root node, or am I mistaken?
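For concreteness, here is the shape difference I mean (a sketch in plain Python dicts; the stump's leaf_count below is the proposed addition, not something LightGBM emits today):

```python
# tree_structure of a stump as dumped today: only the leaf value is recorded.
stump_today = {"leaf_value": -0.916290731874155}

# Proposed shape (hypothetical): also record how many training rows reached
# this single leaf, mirroring the 'leaf_count' key on leaves of deeper trees.
stump_proposed = {"leaf_value": -0.916290731874155, "leaf_count": 5}

assert "leaf_count" not in stump_today
assert stump_proposed["leaf_count"] == 5
```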

Reproducible example

(I'm intentionally creating a stump here.)

from lightgbm import LGBMClassifier
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=5, centers=3, n_features=3, random_state=42)
test_model = LGBMClassifier(n_estimators=5, boosting_type='gbdt', objective="multiclass", n_jobs=1)
test_model.fit(X, y)
print(test_model.booster_.dump_model())

The output is something like this:

{'name': 'tree',
 'version': 'v3',
 'num_class': 3,
 'num_tree_per_iteration': 3,
 'label_index': 0,
 'max_feature_idx': 2,
 'objective': 'multiclass num_class:3',
 'average_output': False,
 'feature_names': ['Column_0', 'Column_1', 'Column_2'],
 'monotone_constraints': [],
 'feature_infos': {},
 'tree_info': [{'tree_index': 0,
   'num_leaves': 1,
   'num_cat': 0,
   'shrinkage': 1,
   'tree_structure': {'leaf_value': -0.916290731874155}},
  {'tree_index': 1,
   'num_leaves': 1,
   'num_cat': 0,
   'shrinkage': 1,
   'tree_structure': {'leaf_value': -0.916290731874155}},
  {'tree_index': 2,
   'num_leaves': 1,
   'num_cat': 0,
   'shrinkage': 1,
   'tree_structure': {'leaf_value': -1.6094379124341003}}],
 'feature_importances': {},
 'pandas_categorical': None}

Note how there isn't a leaf_count inside the tree_info entries, which would normally be present if the tree were allowed to grow deeper. (Just bump n_samples up to, say, 5000 above and inspect the dump output to see what I mean.)
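To illustrate the inconsistency from a parser's point of view (a minimal sketch, not part of the LightGBM API; the helper name is mine), there is currently no count to read at the root of a stump:

```python
def root_sample_count(tree_structure):
    """Return the training-sample count at the root of a dumped tree,
    or None when the dump omits it (as it currently does for stumps)."""
    if "split_index" in tree_structure:
        # Internal node: 'internal_count' holds the number of rows routed here.
        return tree_structure.get("internal_count")
    # Bare leaf (stump): 'leaf_count' is missing from the current dump output.
    return tree_structure.get("leaf_count")

# With the stump tree_structure shown in the dump above:
assert root_sample_count({"leaf_value": -0.916290731874155}) is None
```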

Thank you!

Environment info

LightGBM version or commit hash: 3.3.5

Command(s) you used to install LightGBM

pip install lightgbm

python 3.10, MacOS 12.5.1

Additional Comments

This is a bit of an edge case, but I would still appreciate if you could somehow unify the API (dump_model output) slightly wherever reasonable.

The background is that I'm working on a bug fix for the shap package, and we are parsing the dump_model() output. Right now, if we receive a stump tree, we aren't able to recover the number of samples that were used to train the model (i.e., the count at the root node) unless we get the user to pass in their training data.
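For example, a parser fallback along these lines (a hypothetical helper, not shap's actual code) comes up empty for stumps:

```python
def tree_sample_count(node):
    """Sum 'leaf_count' over all leaves of a tree from dump_model() to
    recover how many training rows the tree saw (hypothetical fallback)."""
    if "left_child" in node:  # internal node in the dumped tree_structure
        return (tree_sample_count(node["left_child"])
                + tree_sample_count(node["right_child"]))
    return node.get("leaf_count", 0)  # stump leaves currently lack this key

# For a stump the fallback yields 0, so the true sample count is
# unrecoverable without the user supplying their training data.
assert tree_sample_count({"leaf_value": -0.9}) == 0
```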

@thatlittleboy thatlittleboy changed the title Stump has no internal_count inside dump_model() output Stump has no leaf_count inside dump_model() output Jul 8, 2023
@jameslamb
Collaborator

Thanks very much for the write-up! This request makes sense to me, and I'd support adding that information at the root node.

Are you interested in working on this and submitting a pull request? If not, we can take this as a feature request in the project's backlog; subscribing here will notify you when someone picks it up.

@thatlittleboy
Author

Great! Thank you @jameslamb, I'll take a stab at a PR and report back if it doesn't work out.

@thatlittleboy thatlittleboy linked a pull request Jul 9, 2023 that will close this issue
@jameslamb
Collaborator

@thatlittleboy Your post here led me to go look at the shap repo for the first time in a while, and I saw this comment from @connortann: shap/shap#2943 (comment)

... we are working on broadening the pool of maintainers ...

I'd be happy to come help out with some things if you need more assistance. shap is an important and heavily-used project, and we've really appreciated all of @slundberg 's help here in LightGBM over the years.

Feel free to @ me on issues there or to email me at the email in my profile here if you think there are areas where I could be helpful.

@thatlittleboy
Author

@jameslamb

Feel free to @ me on issues there or to email me at the email in my profile here if you think there are areas where I could be helpful.

Awesome, thanks for the offer! @connortann, @dsgibbons, and I have been working on getting shap back to a maintainable state (bug fixes, removing deprecation warnings, etc.) over the past month, and we just made a new release a couple of days ago. We'll be looking to make more regular and frequent releases moving forward.

Off the top of my head, I think we need additional expertise in C extensions (our _cext and _cext_gpu extensions need general maintenance, since those files haven't been updated in a long while), as well as perhaps some advice on how to test GPU extensions in CI/CD (right now we're skipping all of our GPU tests, which is not ideal).

If we get specific issues on these (as well as lightgbm-core, of course), we'll be sure to ping you. Thanks once again.
