Pivot missing categories breaks FeatureSet/AggregatedFeatureSet #247

AlvaroMarquesAndrade · 2020-09-17T14:37:37Z


Summary

When defining a feature set, we expect the pivot to produce every category, so that the resulting Source dataframe contains all the columns the feature set transforms. When a category is missing, FeatureSet and AggregatedFeatureSet break.

Feature related:

Age: legacy

Estimated cost: investigation_needed

Type: documentation, coding and testing.

Description 📋

When a reader defines a pivot transformation, it's straightforward to declare the expected categories as features when instantiating a FeatureSet or AggregatedFeatureSet. If, for some reason, not all categories appear in the dataframe produced by the Source (for instance, when we use a smaller time window), the feature set breaks because an expected column is not found.

In order to illustrate what's happening, suppose we have the following resulting dataframe from the Source:

    +---+---+-------+------+----+-----+
    | id| ts|balcony|fridge|oven| pool|
    +---+---+-------+------+----+-----+
    |  1|  1|   null| false|true|false|
    |  2|  2|  false|  null|null| null|
    |  1|  3|   null|  null|null| null|
    |  1|  4|   null|  null|null| true|
    |  1|  5|   true|  null|null| null|
    +---+---+-------+------+----+-----+

As a result, a possible AggregatedFeatureSet could be:

    aggregated_feature_set = AggregatedFeatureSet(
        name="example_agg_feature_set",
        entity="entity",
        description="Just a single example.",
        keys=[
            KeyFeature(
                name="id",
                description="House id.",
                dtype=DataType.BIGINT,
            )
        ],
        timestamp=TimestampFeature(from_column="ts"),
        # One count feature per pivoted category column.
        features=[
            Feature(
                name="balcony_amenity",
                description="description",
                transformation=AggregatedTransform(
                    functions=[Function(functions.count, DataType.INTEGER)]
                ),
                from_column="balcony",
            ),
            Feature(
                name="fridge_amenity",
                description="description",
                transformation=AggregatedTransform(
                    functions=[Function(functions.count, DataType.INTEGER)]
                ),
                from_column="fridge",
            ),
            Feature(
                name="oven_amenity",
                description="description",
                transformation=AggregatedTransform(
                    functions=[Function(functions.count, DataType.INTEGER)]
                ),
                from_column="oven",
            ),
            Feature(
                name="pool_amenity",
                description="description",
                transformation=AggregatedTransform(
                    functions=[Function(functions.count, DataType.INTEGER)]
                ),
                from_column="pool",
            ),
        ],
    )

Now, if we take a different time window and, for some reason, there is no information regarding the pool amenity, we'd have a resulting Source dataframe like this:

    +---+---+-------+------+----+
    | id| ts|balcony|fridge|oven|
    +---+---+-------+------+----+
    |  1|  6|   null| false|true|
    |  2|  7|  false|  null|null|
    |  1|  8|   null|  null|null|
    |  1|  9|   null|  null|null|
    +---+---+-------+------+----+

Therefore, the pool_amenity feature would break, since there's no pool column anymore.
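The schema drift described above can be reproduced without Spark. The sketch below is a plain-Python stand-in for the pivot, not the library's implementation: the `pivot` helper, the long-format `window` rows, and the `amenity`/`available` column names are all hypothetical. It shows that the output columns are derived from the categories present in the data, so a window without pool events yields no `pool` column.

```python
from typing import Any

Row = dict[str, Any]


def pivot(rows: list[Row], category_col: str, value_col: str) -> tuple[list[str], list[Row]]:
    """Pivot long rows into wide rows.

    The output columns are built only from the categories seen in `rows`,
    which is what makes the schema depend on the data.
    """
    categories = sorted({row[category_col] for row in rows})
    columns = ["id", "ts", *categories]
    wide: dict[tuple[Any, Any], Row] = {}
    for row in rows:
        key = (row["id"], row["ts"])
        out = wide.setdefault(key, {c: None for c in columns})
        out["id"], out["ts"] = key
        out[row[category_col]] = row[value_col]
    return columns, list(wide.values())


# Hypothetical long-format input for the second time window: no "pool" events.
window = [
    {"id": 1, "ts": 6, "amenity": "fridge", "available": False},
    {"id": 1, "ts": 6, "amenity": "oven", "available": True},
    {"id": 2, "ts": 7, "amenity": "balcony", "available": False},
]

columns, wide_rows = pivot(window, "amenity", "available")

# from_column values declared by the feature set in the example above.
expected = ["balcony", "fridge", "oven", "pool"]
missing = [c for c in expected if c not in columns]

print(columns)  # ['id', 'ts', 'balcony', 'fridge', 'oven']
print(missing)  # ['pool']
```

Any feature whose `from_column` lands in `missing` is the one that fails when the feature set is constructed.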

Impact 💣

We won't be able to use the pivot operation for incremental loads, since we can't guarantee that every category will be present in each load.

Solution Hints

We could add a parameter that makes a given feature optional. The expected behavior would then be: if the column this feature depends on exists, we apply the transformations as usual; otherwise, we fill the feature with null (and could raise a warning in these cases).
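A minimal sketch of that behavior, in plain Python rather than against the real classes: `FeatureSpec`, its `optional` flag, and `compute` are hypothetical names introduced only for illustration, and the aggregation runs over all rows instead of per key to keep the example short. Features that are not marked optional still fail loudly, so a mistyped column name is not silently turned into nulls.

```python
import warnings
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]


@dataclass
class FeatureSpec:
    """Hypothetical stand-in for a feature definition; `optional` is the proposed flag."""

    name: str
    from_column: str
    aggregate: Callable[[list[Any]], Any]
    optional: bool = False


def count_non_null(values: list[Any]) -> int:
    """Mimic a count aggregation, which ignores nulls."""
    return sum(v is not None for v in values)


def compute(rows: list[Row], columns: list[str], specs: list[FeatureSpec]) -> dict[str, Any]:
    """Aggregate each feature, filling optional features with None when their column is absent."""
    result: dict[str, Any] = {}
    for spec in specs:
        if spec.from_column in columns:
            result[spec.name] = spec.aggregate([r[spec.from_column] for r in rows])
        elif spec.optional:
            warnings.warn(f"column {spec.from_column!r} is missing; {spec.name!r} set to null")
            result[spec.name] = None
        else:
            # Required features keep failing, so real errors are not hidden.
            raise KeyError(f"required column {spec.from_column!r} is missing for {spec.name!r}")
    return result


# Second time window from the example: there is no "pool" column.
columns = ["id", "ts", "balcony", "fridge", "oven"]
rows = [
    {"id": 1, "ts": 6, "balcony": None, "fridge": False, "oven": True},
    {"id": 2, "ts": 7, "balcony": False, "fridge": None, "oven": None},
    {"id": 1, "ts": 8, "balcony": None, "fridge": None, "oven": None},
    {"id": 1, "ts": 9, "balcony": None, "fridge": None, "oven": None},
]
specs = [
    FeatureSpec("oven_amenity", "oven", count_non_null),
    FeatureSpec("pool_amenity", "pool", count_non_null, optional=True),
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    features = compute(rows, columns, specs)

print(features)     # {'oven_amenity': 1, 'pool_amenity': None}
print(len(caught))  # 1
```

Making the flag opt-in per feature, rather than a global switch, keeps the default behavior strict: only columns explicitly declared as possibly absent are tolerated.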

Observations 🤔

We should take care, when implementing this solution, to avoid hiding errors.


AlvaroMarquesAndrade added the "bug" and "good first issue" labels on Sep 17, 2020
