
clipmenud: add option to deduplicate entries #227

Open · wants to merge 2 commits into develop from dedup
Conversation

@N-R-K (Contributor) commented Jun 11, 2024

uses a linear search to find the duplicate and a memmove to move it over to the end.

not particularly efficient, since each snip is 256 bytes while we're only interested in the first 8, so iterating over them linearly like this isn't very cache friendly. the memmove worst case can also be around 250KiB with the default config.
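The approach described above can be sketched roughly like this (a minimal illustration with hypothetical names, not the actual src/store.c code; the real snip struct differs beyond its leading 8-byte hash):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical simplified snip layout: 256 bytes total, with only the
 * leading 8-byte hash relevant to duplicate detection. */
struct snip {
    uint64_t hash;
    char line[248];
};

/* Linearly scan for a snip with a matching hash, then memmove the tail
 * down by one slot and place the match at the end (most recent slot).
 * With ~1000 snips of 256 bytes, the worst-case move is ~250 KiB. */
static int dedup_move_to_end(struct snip *snips, size_t nr_snips,
                             uint64_t hash)
{
    for (size_t i = 0; i < nr_snips; i++) {
        if (snips[i].hash == hash) {
            struct snip dup = snips[i];
            memmove(&snips[i], &snips[i + 1],
                    (nr_snips - i - 1) * sizeof(*snips));
            snips[nr_snips - 1] = dup;
            return 1; /* duplicate found and promoted */
        }
    }
    return 0; /* no duplicate */
}
```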

Closes: #224

@N-R-K (Contributor, Author) commented Jun 11, 2024

the memmove worst case [...]

I was thinking it might be worth implementing tombstones now: instead of memmoving, simply mark the older entry as "dead" to be removed later.

Though I'd assume the linear search is probably the bigger issue. Maybe maintain a hashmap of hash -> index? That seems fairly complicated, though, given the existing clip-store design.

A simpler method might be a probabilistic cache of hash -> index instead. It'd give us an index to start the (circular) search from, and we wouldn't have to worry about keeping the cache consistent. It can be wrong, but that just means we fall back to a linear search; when it's correct, we find the entry faster.
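The probabilistic cache idea could look something like this (a sketch with assumed names, not clip-store code; a small direct-mapped table whose entries are allowed to go stale):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define CACHE_SLOTS 256

/* Direct-mapped hash -> index cache. A stale or wrong entry is
 * harmless: the lookup simply falls back to the full linear scan. */
static size_t cache_idx[CACHE_SLOTS];

static size_t cache_slot(uint64_t hash) { return hash % CACHE_SLOTS; }

/* hashes[] stands in for the leading 8 bytes of each snip. */
static long find_snip(const uint64_t *hashes, size_t nr, uint64_t hash)
{
    size_t guess = cache_idx[cache_slot(hash)];
    if (guess < nr && hashes[guess] == hash)
        return (long)guess;           /* cache hit: O(1) */
    for (size_t i = 0; i < nr; i++) { /* cache miss: linear scan */
        if (hashes[i] == hash) {
            cache_idx[cache_slot(hash)] = i; /* remember for next time */
            return (long)i;
        }
    }
    return -1; /* not present */
}
```

Because correctness never depends on the cache, there is no invalidation logic to get wrong: moves and removals can leave entries stale and the worst case is just the existing linear search.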


For now though, I haven't really noticed much of an issue in practice, though I use a fairly low max clip size of 128. And both the linear search and the memmove only ever happen when there's an actual duplicate, not on each insertion.

If performance becomes an issue, I can look into doing the probabilistic cache or tombstone, whichever is the bottleneck.

@cdown (Owner) left a comment


Thanks!

src/store.c: two resolved review comments (outdated)

src/store.c (outdated):
@@ -546,7 +566,7 @@ int cs_add(struct clip_store *cs, const char *content, uint64_t *out_hash) {
         *out_hash = hash;
     }
 
-    return cs_snip_add(cs, hash, line, nr_lines);
+    return (ret == 1) ? 0 : cs_snip_add(cs, hash, line, nr_lines);
@cdown (Owner):

Once you use expect() you can avoid this and just pass through again.

@N-R-K (Contributor, Author):

cs_snip_add resizes the store and increments nr_snips by +1. We don't really want that in the dedup case.

Unless by "pass through" you meant passing through the return value ret as is.

src/store.c: two resolved review comments (outdated)
@cdown (Owner) commented Jun 17, 2024

Also to note: in general I'm a bit worried about performance here, so I should be clear that I'm not certain this will be merged yet. But this shouldn't be the regular case either: we already do a cheap "last dupe" check up front.

src/store.c: resolved review comment (outdated)
@N-R-K (Contributor, Author) commented Jun 18, 2024

I in general am a bit worried about performance here

I did a bit of micro-benchmarking a couple of days ago. It's not super scientific, but the observations are as follows:

  • When the duplicate is already near the "top" (i.e. recent), the operation is very fast: roughly 200~400ns (note that the linear search currently starts at the "bottom").
  • When the duplicate is near the bottom, it takes considerably longer: 2~8us.
  • Since adding a cache to speed up the linear search was a relatively non-invasive change, I tried it out, but it barely made any difference.

These numbers are with the default of 1k max clips.

All of this suggests to me that the linear search is not the bottleneck; the memmove is. So doing tombstones is probably more important for performance. It'd be a more invasive change to the iteration and removal functions, but I think it can be done if needed.
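The tombstone idea can be sketched as follows (hypothetical names, not the clip-store API; a sentinel hash marks dead slots, and a separate compaction pass reclaims them later):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define SNIP_DEAD UINT64_MAX /* sentinel marking a dead (tombstoned) slot */

/* Instead of memmoving up to ~250 KiB of tail entries, mark the stale
 * duplicate dead in O(1); iteration would skip SNIP_DEAD slots. */
static void dedup_tombstone(uint64_t *hashes, size_t nr, uint64_t hash)
{
    for (size_t i = 0; i < nr; i++) {
        if (hashes[i] == hash) {
            hashes[i] = SNIP_DEAD;
            break;
        }
    }
}

/* Compaction squeezes out dead slots, e.g. during a periodic trim,
 * and returns the new live count. */
static size_t compact(uint64_t *hashes, size_t nr)
{
    size_t out = 0;
    for (size_t i = 0; i < nr; i++)
        if (hashes[i] != SNIP_DEAD)
            hashes[out++] = hashes[i];
    return out;
}
```

The trade-off is exactly the one mentioned above: insertion-time work becomes O(1) after the search, but every iterator and removal path has to learn to skip dead entries until compaction runs.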


The rest of the review seems very reasonable. I will amend the patch and address the comments sometime tomorrow. Cheers.

@cdown (Owner) commented Jun 18, 2024

I think that sounds fine: my main concern was the nominal case (dupe at head), but that's already handled by is_possible_partial in a faster way, so that shouldn't present a problem. Thanks!
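The cheap head check mentioned here can be illustrated like this (a hedged sketch of the general idea only: `is_possible_partial` is the real clipmenud function, but its actual logic differs, and this simplified hash comparison plus all names below are assumptions):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Fast path for the nominal case: before any search or memmove,
 * compare the new content's hash against the most recent snip only.
 * Returns 1 if the head is already a duplicate, 0 otherwise. */
static int is_head_dupe(const uint64_t *hashes, size_t nr, uint64_t hash)
{
    return nr > 0 && hashes[nr - 1] == hash;
}
```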

@N-R-K N-R-K force-pushed the dedup branch 2 times, most recently from 82787ed to fe4b794 Compare June 18, 2024 21:44
Successfully merging this pull request may close the issue: Deduplicate entries. (2 participants)