Suggestion: Implement a .remove_data function for Results #533

Open
jpweytjens opened this issue May 31, 2023 · 0 comments

jpweytjens commented May 31, 2023

Description

Fitted results from linearmodels can be pickled with pickle.dump. These pickled files contain the estimated parameters alongside all the data that was used to estimate them. Keeping this data in the results is generally (always?) undesirable, as it substantially increases the size of the pickled files, and once the model has been fit, the estimated parameters no longer need these potentially large datasets in order to be displayed or processed.
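
For illustration, one quick way to see the effect is to measure the pickled size directly (a minimal sketch; res stands for any fitted linearmodels result):

import io
import pickle

# res is assumed to be a fitted result, e.g. PanelOLS(...).fit()
buffer = io.BytesIO()
pickle.dump(res, buffer)
print(f"pickled size: {buffer.getbuffer().nbytes / 1e6:.1f} MB")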

Example

My use case is as follows, with a large (N = 500,000, T = 123) panel dataset.

  • Create a list of all desired model specifications and comparisons
  • Estimate all the different models
  • Save different comparisons of these results with compare

In pseudocode

specifications = pd.DataFrame({"formulas": formulas, "criterium": criteria})

# estimate every specification
results = []
for formula in specifications["formulas"]:
    model = PanelOLS.from_formula(formula, data=data)
    res = model.fit()
    results.append(res)

specifications["results"] = results

# compare the results that share a criterium
for criterium in specifications["criterium"].unique():
    subset = specifications.query("criterium == @criterium")["results"]
    comparison = compare(list(subset))
    comparison.summary.as_latex()

As my dataset is very large, pickling the results or the specifications DataFrame takes multiple gigabytes to store just a small number of estimated parameters. Ideally, I would be able to store/pickle the results compactly. That way, I can separate estimating the models from comparing them. For example, this would allow someone to run the estimations during the night and kill the process once done.
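
Concretely, the workflow I'm after would look something like this (a sketch; the file name is arbitrary):

import pickle

# estimate overnight, then persist only the results
with open("results.pkl", "wb") as f:
    pickle.dump(results, f)

# later, in a fresh process, compare without re-estimating
with open("results.pkl", "rb") as f:
    results = pickle.load(f)
comparison = compare(results)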

Workaround

I created this hacky workaround to remove a lot of attributes from the model and result that aren't required if you're only interested in storing the results. With this, I can reduce the size of the pickled objects from ~50 GB to around 250 MB.

import functools

from linearmodels import PanelOLS


def fake_cov(_deferred_cov, *args, **kwargs):
    # Return the covariance that was computed before shrinking, ignoring any
    # arguments passed by the results object.
    return _deferred_cov


def shrink_mod_and_res(mod, res):
    """
    Remove any DataFrame and large objects that are unnecessarily stored in the model and results objects.
    """
    # Truncate the frames backing the model data to a single row and drop the
    # original inputs.
    mod.dependent._frame = mod.dependent._frame.head(1)
    mod.dependent._original = None
    mod.dependent._panel = None
    mod.exog._frame = mod.exog._frame.head(1)
    mod.exog._original = None
    mod.exog._panel = None
    mod.weights._frame = mod.weights._frame.head(1)
    mod.weights._original = None
    mod.weights._panel = None
    mod._cov_estimators = None
    mod._x = None
    mod._y = None
    mod._w = None
    mod._not_null = None
    mod._original_index = None

    # Drop the nobs-length arrays stored on the results.
    res._resids = None
    res._wresids = None
    res._original_index = None
    res._effects = None
    res._index = None
    res._fitted = None
    res._idiosyncratic = None
    res._not_null = None

    # Evaluate the deferred covariance once and freeze it, so the covariance
    # estimator (and the data it references) no longer needs to be kept.
    _deferred_cov = res._deferred_cov()
    res._deferred_cov = functools.partial(fake_cov, _deferred_cov=_deferred_cov)

    return mod, res


model = PanelOLS(y, x)
res = model.fit()
mod, res = shrink_mod_and_res(model, res)

It's not clear to me why the calculation of the covariance is deferred. I suppose that if one wants to retain the ability to change the covariance estimator after estimation, this hacky method would need to store all possible covariance estimates instead of the single one it freezes.

Suggestion

Implement a (cleaner) method to remove the large datasets contained in the Results, similar to the remove_data flag of the .save() method on statsmodels' results.
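
For reference, this is the statsmodels pattern the suggestion mirrors (assuming y and x are arrays suitable for OLS; the file name is arbitrary):

import statsmodels.api as sm

sm_res = sm.OLS(y, x).fit()
# drop the nobs-length arrays before pickling
sm_res.save("ols_results.pkl", remove_data=True)
# or strip the data in place
sm_res.remove_data()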

@bashtage bashtage self-assigned this Sep 21, 2023