
Improve the memory usage of our own HLL implementation #8098

Open
wants to merge 1 commit into base: sliding_hyperloglog
Conversation

@MMeent (Contributor) commented Jun 18, 2024

We don't use sliding windows, so we can just drop the historical snapshot requirement from the implementation, thus removing the 50%+ tracking overhead.

A nice benefit is that this allows working set cardinality estimation across the full lifetime of the instance, rather than just a recent window. A downside is that you can't pick an arbitrary window: the end of any estimation window must be now or in the future.
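For illustration only (this is not the patch itself): a minimal sketch of the kind of per-bucket state this change implies, inferred from the diff context and the review loop further down, where each bucket holds an array of HLL_C_BITS + 1 timestamps indexed by the observed leading-zero count. Apart from regs and HLL_C_BITS, every identifier and constant below is invented for the example.

	#include <stdint.h>

	#define HLL_N_BUCKETS 64   /* illustrative bucket count, not taken from the patch */
	#define HLL_C_BITS    21   /* illustrative: yields the 22-slot array mentioned in the review */

	typedef int64_t TimestampTz;   /* stand-in for Postgres's TimestampTz */

	typedef struct HyperLogLogState
	{
		/*
		 * regs[bucket][count] holds the last time that bucket observed a hash
		 * with `count` leading zeros; no per-bucket history of future possible
		 * maxima is kept any more.
		 */
		TimestampTz regs[HLL_N_BUCKETS][HLL_C_BITS + 1];
	} HyperLogLogState;

	/*
	 * Register value for one bucket, restricted to observations newer than
	 * `since`. The estimation window always ends now; arbitrary historical
	 * windows are no longer possible.
	 */
	static int
	bucket_value_since(const TimestampTz *reg, TimestampTz since)
	{
		int max = 0;

		for (int i = 0; i < HLL_C_BITS + 1; i++)
		{
			if (reg[i] >= since)
				max = i;
		}
		return max;
	}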

Problem

Summary of changes

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so it does not include the above checklist.

@MMeent MMeent requested a review from knizhnik June 18, 2024 16:53
@MMeent MMeent requested review from a team as code owners June 18, 2024 16:53
@MMeent MMeent requested review from arssher and removed request for a team June 18, 2024 16:53
github-actions bot commented Jun 18, 2024

3222 tests run: 3079 passed, 1 failed, 142 skipped (full report)


Failures on Postgres 14

  • test_sliding_working_set_approximation: debug
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_sliding_working_set_approximation[debug-pg14]"
Flaky tests (1)

Postgres 14

  • test_metric_collection: debug

Test coverage report is not available

The comment gets automatically updated with the latest test results
2faa0df at 2024-06-18T17:46:27.550Z :recycle:

We don't use sliding windows, so we can just drop the historical snapshot
requirement from the implementation, thus removing some tracking overhead.

	{
		cState->regs[index].fpm[j++] = cState->regs[index].fpm[i];
	}
	cState->regs[index][i] = now;
Contributor
I wonder why we need to assign the current timestamp to all elements with index less than or equal to count.
Why is it not enough to just assign:

cState->regs[index][count] = now;

Contributor Author
This way we don't have to start at the end of the array of timestamps when we try to get the summary.

Your suggestion would work as well, but it would require scanning from high counts rather than low counts. Since higher counts are set less often, especially when the summary would yield a low distinct count, generating a summary could be more expensive on average.

I don't mind it going either way though.

Contributor
Sorry, I do not understand it.
We need to find the maximal count, right?
So your loop:

	for (size_t i = 0; i < HLL_C_BITS + 1; i++)
	{
		if (reg[i] >= since)
		{
			max = i;
		}
	}

is actually locating the largest R with a timestamp within the specified period.

Contributor Author

Correct, but by also updating all registers below count, you can do a binary search, or a linear search with an early break, since all values in the array are sorted (with potential duplicates, but still sorted).

	for (size_t i = 0; i < HLL_C_BITS + 1; i++)
	{
		if (reg[i] >= since)
			max = i;
		else
			break;
	}

Thus saving a few compares for this bucket.
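To make that concrete, here is a sketch of the add path being described, reusing the illustrative HyperLogLogState from the sketch near the top; the function name is invented and this is not the literal patch. Stamping every slot up to the observed count keeps regs[index] sorted by recency, which is what allows the early break above.

	static void
	hll_add_sketch(HyperLogLogState *cState, int index, int count, TimestampTz now)
	{
		/* Stamp all slots up to `count`, so lower slots are always at least as
		 * recent as higher ones and the summary scan can stop at the first
		 * stale entry. */
		for (int i = 0; i <= count; i++)
			cState->regs[index][i] = now;
	}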

Contributor

  1. Binary search in an array of size 22 makes little sense.
  2. The time to compute the estimate is not critical: it is only needed when the autoscaler agent updates its working set estimate, once per second or even once per minute.
  3. The time to update hashes is much more critical, because it is done on each SMGR read operation.

So we should optimise addHLL rather than getMaximum.
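For comparison, the single-slot variant suggested earlier in the thread, again as an illustrative sketch rather than the actual code: one store per SMGR read, with the summary scan then having to visit all HLL_C_BITS + 1 slots, since the array is no longer sorted.

	static void
	hll_add_single_slot(HyperLogLogState *cState, int index, int count, TimestampTz now)
	{
		/* O(1) work in the hot path; getMaximum must then check every slot. */
		cState->regs[index][count] = now;
	}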

Contributor Author
That makes sense.
