Improve the memory usage of our own HLL implementation #8098
base: sliding_hyperloglog
Conversation
3222 tests run: 3079 passed, 1 failed, 142 skipped (full report)

Failures on Postgres 14

Test coverage report is not available. The comment gets automatically updated with the latest test results.

2faa0df at 2024-06-18T17:46:27.550Z
We don't use sliding windows, so we can just drop the historical snapshot requirement from the implementation, thus removing some tracking overhead.
Force-pushed from b89e6fe to 2faa0df
{
    cState->regs[index].fpm[j++] = cState->regs[index].fpm[i];
}
cState->regs[index][i] = now;
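For context, here is a minimal self-contained sketch of the add path this diff belongs to, assuming each register keeps one timestamp per possible rank. cState, regs, and HLL_C_BITS are taken from the fragments quoted in this thread; the constants, the type definitions, and rho are assumptions, not the PR's actual code.

#include <stdint.h>
#include <time.h>

#define HLL_BIT_WIDTH   10                    /* assumed: 2^10 registers */
#define HLL_N_REGISTERS (1 << HLL_BIT_WIDTH)
#define HLL_C_BITS      21                    /* assumed: 22 timestamp slots per register */

typedef time_t TimestampTz;                   /* stand-in for the real timestamp type */

typedef struct HLLState
{
    /* regs[index][rank]: last time register `index` observed rank `rank` */
    TimestampTz regs[HLL_N_REGISTERS][HLL_C_BITS + 1];
} HLLState;

/* rank = position of the lowest set bit of the hash remainder, 0-based */
static int
rho(uint64_t w)
{
    int r = 0;

    while ((w & 1) == 0 && r < HLL_C_BITS)
    {
        r++;
        w >>= 1;
    }
    return r;
}

static void
addHLL(HLLState *cState, uint64_t hash, TimestampTz now)
{
    size_t index = hash & (HLL_N_REGISTERS - 1);   /* low bits pick the register */
    int    count = rho(hash >> HLL_BIT_WIDTH);     /* rank of the remaining bits */

    /*
     * Stamp every rank <= count, not just regs[index][count]: this keeps
     * each register's timestamp array non-increasing, which the summary
     * scan discussed below depends on.
     */
    for (int i = 0; i <= count; i++)
        cState->regs[index][i] = now;
}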
I wonder why we need to assign the current timestamp to all elements with index less than or equal to count. Why is it not enough to just assign:
cState->regs[index][count] = now;
This way we don't have to start at the end of the array of timestamps when we try to get the summary.
Your suggestion would work as well, but it would require a scan from high counts rather than low counts. Because higher counts are set less often, a top-down scan would pass many stale entries whenever the true distinct count is low, so generating a summary could be more expensive on average.
I don't mind it going either way though.
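For contrast, a hedged sketch of what the summary scan would have to look like under the single-assignment scheme (reusing reg, since, and HLL_C_BITS from the fragments in this thread; an illustration, not the PR's code): the array is no longer sorted, so the scan starts at the top and stops at the first fresh timestamp.

int max = -1;                       /* -1: no rank fresh within the window */

for (int i = HLL_C_BITS; i >= 0; i--)
{
    if (reg[i] >= since)
    {
        max = i;                    /* first hit from the top is the maximum */
        break;
    }
}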
Sorry, I do not understand it.
We need to find the maximal count, right?
So your loop:
for (size_t i = 0; i < HLL_C_BITS + 1; i++)
{
if (reg[i] >= since)
{
max = i;
}
}
is actually locating the largest R with a timestamp within the specified period.
Correct, but by also updating all registers < count, you can do a binary search, or a linear search with an early break, because all values in the array are sorted (with potential duplicates, but still sorted).
for (size_t i = 0; i < HLL_C_BITS + 1; i++)
{
if (reg[i] >= since)
max = i;
else
break;
}
Thus saving a few compares for this bucket.
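Since the prefix updates keep each timestamp array non-increasing, the binary-search variant would look roughly like this (a hedged sketch; max_fresh_rank and the TimestampTz stand-in are illustrative, not code from this PR):

/* Binary search over the non-increasing timestamp array for the largest
 * rank whose timestamp is still inside the window; -1 if none is fresh. */
static int
max_fresh_rank(const TimestampTz *reg, TimestampTz since)
{
    int lo = 0, hi = HLL_C_BITS, max = -1;

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (reg[mid] >= since)
        {
            max = mid;      /* mid is fresh; a larger rank might be too */
            lo = mid + 1;
        }
        else
            hi = mid - 1;   /* mid is stale, so everything above it is too */
    }
    return max;
}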
- Binary search in an array of size 22 makes little sense.
- The time to calculate the estimation is not critical: it is needed only rarely, when the autoscaler agent updates its working set estimation, once per second or even once per minute.
- The time to update the hashes is much more critical, because it happens on each SMGR read operation.

So we should optimise addHLL rather than getMaximum.
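One hedged idea in that direction, an assumption on my part rather than anything in the PR: because every add stamps the whole prefix [0..count] with the same timestamp, the entries already carrying `now` form a prefix too, so the refresh loop can walk downward and stop early.

for (int i = count; i >= 0; i--)
{
    if (cState->regs[index][i] >= now)
        break;              /* everything below is already stamped `now` */
    cState->regs[index][i] = now;
}

How much this saves depends on timestamp granularity: with coarse timestamps, repeated adds within the same tick can skip most of the writes.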
That makes sense.
We don't use sliding windows, so we can just drop the historical snapshot requirement from the implementation, thus removing the 50%+ tracking overhead.
A nice benefit is that this allows working set cardinality estimation across the full lifetime of the instance, rather than just a recent window. A drawback is that you can't take an arbitrary window: the end of any estimation window must be now or in the future.
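To make the window semantics concrete, a hedged sketch of how a full estimate could be computed under this scheme, building on the HLLState layout sketched earlier in this thread (estimateSHLL, the alpha constant, and the rank-to-weight mapping are illustrative, not this PR's code):

#include <math.h>

static double
estimateSHLL(HLLState *cState, TimestampTz since)
{
    const int m = HLL_N_REGISTERS;
    double    sum = 0.0;

    for (int r = 0; r < m; r++)
    {
        int max = 0;

        /* early-break scan from the thread above: timestamps are sorted */
        for (int i = 0; i < HLL_C_BITS + 1; i++)
        {
            if (cState->regs[r][i] >= since)
                max = i;
            else
                break;
        }
        sum += pow(2.0, -(double) max);    /* harmonic-mean accumulator */
    }

    /* alpha_m ~= 0.7213 / (1 + 1.079 / m) for large m */
    return (0.7213 / (1.0 + 1.079 / m)) * m * m / sum;
}

Because each register's timestamps only move forward, `since` can be any point up to the present, which is exactly the "window must end now or later" restriction described above.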