Pivot missing categories breaks FeatureSet/AggregatedFeatureSet #247

Open
AlvaroMarquesAndrade opened this issue Sep 17, 2020 · 0 comments
Labels
bug Something isn't working good first issue Good for newcomers


Summary

When defining a feature set, the pivot is expected to produce every category, so that the resulting Source dataframe has all the columns the feature set transforms. When a category is missing, FeatureSet and AggregatedFeatureSet break.

Feature related:

Age: legacy

Estimated cost: investigation_needed

Type: documentation, coding and testing.

Description 📋

If a reader defines a pivot transformation, it's straightforward to declare the expected categories as features when instantiating a FeatureSet or AggregatedFeatureSet. If, for some reason, not all categories appear in the resulting Source dataframe (this can happen with a smaller time window, for instance), the feature set breaks because an expected column is not found.

In order to illustrate what's happening, suppose we have the following resulting dataframe from the Source:

    +---+---+-------+------+----+-----+
    | id| ts|balcony|fridge|oven| pool|
    +---+---+-------+------+----+-----+
    |  1|  1|   null| false|true|false|
    |  2|  2|  false|  null|null| null|
    |  1|  3|   null|  null|null| null|
    |  1|  4|   null|  null|null| true|
    |  1|  5|   true|  null|null| null|
    +---+---+-------+------+----+-----+

As a result, a possible AggregatedFeatureSet could be:

    # `functions` is assumed to be pyspark.sql.functions.
    aggregated_feature_set = AggregatedFeatureSet(
        name="example_agg_feature_set",
        entity="entity",
        description="Just a single example.",
        keys=[
            KeyFeature(
                name="id",
                description="House id.",
                dtype=DataType.BIGINT,
            )
        ],
        timestamp=TimestampFeature(from_column="ts"),
        features=[
            Feature(
                name="balcony_amenity",
                description="description",
                transformation=AggregatedTransform(
                    functions=[Function(functions.count, DataType.INTEGER)]
                ),
                from_column="balcony",
            ),
            Feature(
                name="fridge_amenity",
                description="description",
                transformation=AggregatedTransform(
                    functions=[Function(functions.count, DataType.INTEGER)]
                ),
                from_column="fridge",
            ),
            Feature(
                name="oven_amenity",
                description="description",
                transformation=AggregatedTransform(
                    functions=[Function(functions.count, DataType.INTEGER)]
                ),
                from_column="oven",
            ),
            Feature(
                name="pool_amenity",
                description="description",
                transformation=AggregatedTransform(
                    functions=[Function(functions.count, DataType.INTEGER)]
                ),
                from_column="pool",
            ),
        ],
    )

Now, if we take a different time window and, for some reason, there is no information regarding the pool amenity, we'd have a resulting Source dataframe like this:

    +---+---+-------+------+----+
    | id| ts|balcony|fridge|oven|
    +---+---+-------+------+----+
    |  1|  6|   null| false|true|
    |  2|  7|  false|  null|null|
    |  1|  8|   null|  null|null|
    |  1|  9|   null|  null|null|
    +---+---+-------+------+----+

Therefore, the pool_amenity feature would break, since there's no pool column anymore.
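
The mismatch can be reproduced without Spark by comparing the column each feature reads from against the columns actually present. The following is a minimal sketch in plain Python; `missing_source_columns` is a hypothetical helper, not part of the library, and the column lists are copied from the two dataframes above.

```python
from typing import Dict, List


def missing_source_columns(
    feature_columns: Dict[str, str], dataframe_columns: List[str]
) -> List[str]:
    """Return the source columns that features expect but the dataframe lacks.

    `feature_columns` maps a feature name to its `from_column`.
    """
    available = set(dataframe_columns)
    return sorted(
        {col for col in feature_columns.values() if col not in available}
    )


# Feature name -> from_column, as declared in the AggregatedFeatureSet above.
feature_columns = {
    "balcony_amenity": "balcony",
    "fridge_amenity": "fridge",
    "oven_amenity": "oven",
    "pool_amenity": "pool",
}

# Columns of the first and second Source dataframes shown in this issue.
full_window = ["id", "ts", "balcony", "fridge", "oven", "pool"]
short_window = ["id", "ts", "balcony", "fridge", "oven"]

print(missing_source_columns(feature_columns, full_window))   # []
print(missing_source_columns(feature_columns, short_window))  # ['pool']
```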

Impact 💣

We won't be able to use the pivot operation for incremental loads, since we can't be sure that every category will be present in each load.

Solution Hints

We could add a parameter that makes a given feature optional. The expected behavior would then be: if the column this feature depends on exists, we perform the transformations; otherwise, we simply treat the feature as null (and could raise a warning in these cases).
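
One way this behavior might look is sketched below in plain Python, independent of Spark. Everything here is an assumption made for illustration: `FeatureSpec`, its `optional` flag, and `split_features` are hypothetical names, not existing library API. Required features still fail loudly when their column is missing, so only explicitly optional features are allowed to fall back to null.

```python
import warnings
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class FeatureSpec:
    """Hypothetical stand-in for a feature definition."""

    name: str
    from_column: str
    optional: bool = False  # assumed new parameter, not an existing one


def split_features(
    features: Sequence[FeatureSpec], dataframe_columns: Sequence[str]
) -> Tuple[List[FeatureSpec], List[FeatureSpec]]:
    """Split features into those to compute and those to fill with null."""
    available = set(dataframe_columns)
    to_compute: List[FeatureSpec] = []
    to_null: List[FeatureSpec] = []
    for feature in features:
        if feature.from_column in available:
            to_compute.append(feature)
        elif feature.optional:
            # Optional feature with a missing column: warn and fill with null.
            warnings.warn(
                f"Column '{feature.from_column}' not found; "
                f"feature '{feature.name}' will be null."
            )
            to_null.append(feature)
        else:
            # Required feature: keep failing so real errors are not hidden.
            raise KeyError(
                f"Column '{feature.from_column}' required by "
                f"feature '{feature.name}' was not found."
            )
    return to_compute, to_null


features = [
    FeatureSpec("balcony_amenity", "balcony"),
    FeatureSpec("pool_amenity", "pool", optional=True),
]
to_compute, to_null = split_features(
    features, ["id", "ts", "balcony", "fridge", "oven"]
)
print([f.name for f in to_compute])  # ['balcony_amenity']
print([f.name for f in to_null])     # ['pool_amenity']
```

Keeping `optional` off by default means existing feature sets would behave exactly as they do today, and a missing column only passes silently (with a warning) where the author opted in.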

Observations 🤔

We should take care, when implementing this solution, to avoid hiding errors.
