Suggestion: Implement a .remove_data function for Results #533

Open
jpweytjens opened this issue May 31, 2023 · 0 comments

jpweytjens commented May 31, 2023

Description

Fitted results from linearmodels can be pickled with pickle.dump. These pickled files contain the estimated parameters alongside all the data that was used to estimate them. Keeping this data in the results is generally (always?) undesirable, as it substantially increases the size of the pickled files, and once the model has been fit, the estimated parameters no longer need these potentially large datasets in order to be displayed or processed.
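
For illustration, one quick way to see the effect is to measure the pickled size directly (a minimal sketch; res stands for any fitted linearmodels result):

import io
import pickle

# res is assumed to be a fitted result, e.g. PanelOLS(...).fit()
buffer = io.BytesIO()
pickle.dump(res, buffer)
print(f"pickled size: {buffer.getbuffer().nbytes / 1e6:.1f} MB")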

Example

My use case is as follows, with a large (N = 500,000, T = 123) panel dataset.

  • Create a list of all desired model specifications and comparisons
  • Estimate all the different models
  • Save different comparisons of these results with compare

In pseudocode

specifications = pd.DataFrame({"formulas": formulas, "criterium": criteria})

# estimate every specification
results = []
for formula in specifications["formulas"]:
    model = PanelOLS.from_formula(formula, data=data)
    res = model.fit()
    results.append(res)

specifications["results"] = results

# compare the results that share a criterium
for criterium in specifications["criterium"].unique():
    subset = specifications.query("criterium == @criterium")["results"]
    comparison = compare(list(subset))
    comparison.summary.as_latex()

As my dataset is very large, pickling the results or the specifications DataFrame takes multiple gigabytes to store just a small number of estimated parameters. Ideally, I would be able to store/pickle the results compactly. That way, I can separate estimating the models from comparing them. For example, this would allow someone to run the estimations during the night and kill the process once done.
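
Concretely, the workflow I'm after would look something like this (a sketch; the file name is arbitrary):

import pickle

# estimate overnight, then persist only the results
with open("results.pkl", "wb") as f:
    pickle.dump(results, f)

# later, in a fresh process, compare without re-estimating
with open("results.pkl", "rb") as f:
    results = pickle.load(f)
comparison = compare(results)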

Workaround

I created this hacky workaround to remove a lot of attributes from the model and result that aren't required if you're only interested in storing the results. With this, I can reduce the size of the pickled objects from ~50 GB to around 250 MB.

import functools

from linearmodels import PanelOLS


def fake_cov(_deferred_cov, *args, **kwargs):
    # Return the covariance that was computed before shrinking, ignoring any
    # arguments passed by the results object.
    return _deferred_cov


def shrink_mod_and_res(mod, res):
    """
    Remove any DataFrame and large objects that are unnecessarily stored in the model and results objects.
    """
    # Truncate the frames backing the model data to a single row and drop the
    # original inputs.
    mod.dependent._frame = mod.dependent._frame.head(1)
    mod.dependent._original = None
    mod.dependent._panel = None
    mod.exog._frame = mod.exog._frame.head(1)
    mod.exog._original = None
    mod.exog._panel = None
    mod.weights._frame = mod.weights._frame.head(1)
    mod.weights._original = None
    mod.weights._panel = None
    mod._cov_estimators = None
    mod._x = None
    mod._y = None
    mod._w = None
    mod._not_null = None
    mod._original_index = None

    # Drop the nobs-length arrays stored on the results.
    res._resids = None
    res._wresids = None
    res._original_index = None
    res._effects = None
    res._index = None
    res._fitted = None
    res._idiosyncratic = None
    res._not_null = None

    # Evaluate the deferred covariance once and freeze it, so the covariance
    # estimator (and the data it references) no longer needs to be kept.
    _deferred_cov = res._deferred_cov()
    res._deferred_cov = functools.partial(fake_cov, _deferred_cov=_deferred_cov)

    return mod, res


model = PanelOLS(y, x)
res = model.fit()
mod, res = shrink_mod_and_res(model, res)

It's not clear to me why the calculation of the covariance is deferred. I suppose that if one wants to retain the ability to change the covariance estimator after estimation, this hacky method would need to store all possible covariance estimates instead of the single one it freezes.

Suggestion

Implement a (cleaner) method to remove the large datasets contained in the Results, similar to the remove_data flag of the .save() method on statsmodels' results.
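
For reference, this is the statsmodels pattern the suggestion mirrors (assuming y and x are arrays suitable for OLS; the file name is arbitrary):

import statsmodels.api as sm

sm_res = sm.OLS(y, x).fit()
# drop the nobs-length arrays before pickling
sm_res.save("ols_results.pkl", remove_data=True)
# or strip the data in place
sm_res.remove_data()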

@bashtage bashtage self-assigned this Sep 21, 2023