Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter null rows before aggregation #23052

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jackychen718
Copy link
Contributor

@jackychen718 jackychen718 commented Jun 21, 2024

Description

Filter rows with all fileds null before aggregation.

Motivation and Context

The issue comes from #17486
Add optimiztion to drop groups/rows with all fields null

Impact

Add another opitimization rule: RemoveNullRowInAggregation.java

Test Plan

Test if any error made after applying the new optimization rule.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* ... :pr:`12345`
* ... :pr:`12345`

Hive Connector Changes
* ... :pr:`12345`
* ... :pr:`12345`

== NO RELEASE NOTE ==

{
PlanNode source = node.getSource();
if (source instanceof FilterNode) {
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't the new filter added by this rule be appended to the existing filter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was proposed to prevent the AggregationNode being optimized many times. Sure,it should be removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the check should be something like "have all is not null filters already been added to the input to the aggregation node", instead of "is the input to the aggregation node is a filter node"

@jaystarshot
Copy link
Member

Can you add unit tests? Also there must be a feature flag controlling this since I am not sure if this optimization will be always effective.
What happens in count(*) case?

@tdcmeehan tdcmeehan self-assigned this Jun 23, 2024
Copy link
Contributor

@ClarenceThreepwood ClarenceThreepwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @jackychen718 !

  1. This changes the sql semantics and can cause wrong results. Therefore this rule should be behind a feature flag and should definitely be disabled by default
  2. Please add unit tests.
  3. There are many related test failures. Please fix them before marking it ready for review

{
PlanNode source = node.getSource();
if (source instanceof FilterNode) {
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the check should be something like "have all is not null filters already been added to the input to the aggregation node", instead of "is the input to the aggregation node is a filter node"

}
Collection<Aggregation> aggregations = node.getAggregations().values();
for (Aggregation aggregation : aggregations) {
if (aggregation.getCall().getArguments().size() > 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why limit to aggregate functions of only one argument?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the example of the problem has only one argument. As the first attempt, I only consider one argument. I will consider arbitrary arguments.

public Result apply(AggregationNode aggregationNode, Captures captures, Context context)
{
PlanNode source = aggregationNode.getSource();
if (context.getLookup().resolveGroup(source).findFirst().filter(node -> node instanceof FilterNode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary since you already check for this in the pattern detection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I will modify the pattern detection part later on.


RowExpression andForm = new SpecialFormExpression(AND, BOOLEAN, ImmutableList.copyOf(nullFormList));
RowExpression notCallPredicate = new CallExpression("not",
new FunctionResolution(createTestFunctionAndTypeManager().getFunctionAndTypeResolver()).notFunction(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use createTestFunctionAndTypeManager. Get it from the optimizer metadata. See other optimizer rules for examples

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@jackychen718
Copy link
Contributor Author

jackychen718 commented Jun 26, 2024

Can you add unit tests? Also there must be a feature flag controlling this since I am not sure if this optimization will be always effective. What happens in count(*) case?

I will add unit tests. Currently, I did not consider count(*) as a special way. From your reminding, I do not think I can still add filter before aggregation in this scenario. I will exclude this scenario in the pattern match process. Is it ok? @jaystarshot

nullFormList.add(nullForm);
}

RowExpression andForm = new SpecialFormExpression(AND, BOOLEAN, ImmutableList.copyOf(nullFormList));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would use ImmutableList.builder() for nullFormList, call add in the loop, and then pass nullFormList.build() when generating the andForm expression. It prevents copying the list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants