
Python 3 dict behaviors for PMap #178

Open
jtrakk opened this issue Oct 14, 2019 · 9 comments


jtrakk commented Oct 14, 2019

Python 3 introduces several new behaviors for dict which have been mentioned in previous issues, but a unifying issue might be useful to the discussion.

Both of these changes are semantically significant in practice, so until a Python 3-like version of PMap is implemented it might be worth documenting the differences, to make sure nobody gets caught unaware.

Set-like view objects

Python 3 dicts have keys(), values(), and items() view objects; the keys() and items() views are set-like. In Python 3, order does not matter when comparing these views:

{1: 2, 3: 4}.keys() == {3: 4, 1: 2}.keys()
True

and they support set operations:

In [1]: {1: 2, 3: 4}.keys() & {3: 4, 1: 2}.keys()                                                                                               
Out[1]: {1, 3}

Maintaining key order

Since 3.7, dict keys are guaranteed to iterate in the same order in which they were added.
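
For example, in any CPython 3.7+ interpreter:

d = {}
d["b"] = 1
d["a"] = 2
list(d.keys())  # ['b', 'a'] -- insertion order, not sorted or hash order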

tobgu (Owner) commented Oct 14, 2019

Creating an umbrella issue may be a good way to keep things together, although I think it would be entirely reasonable to implement only some of the characteristics described here.

Keeping the pmap behaviour close to that of the common dict is probably a good idea, even though it has never been part of any behavioural contract. As observed, the behaviour differs depending on the Python version, and I've chosen to stick to behaviour close to Python 2 so far. I wouldn't have anything against PRs moving the behaviour closer to that of recent Python 3 versions, provided performance doesn't take too big a hit, especially since Python 2 is soon EOL.

I have recently done some shallow investigation into switching the underlying implementation of the pmap to make it more performant (see #160). I've tried using https://github.com/MagicStack/immutables (and hence bringing in an additional dependency, since I don't have the time myself to write a HAMT or similar in C at the moment). The commit for this is on https://github.com/tobgu/pyrsistent/tree/test-different-map-implementation. Unfortunately the immutables map segfaults sporadically in some of the pmap tests. I have not dug into the root cause of these crashes (help would be much appreciated!). I could see these behavioural changes going into the same general pmap renovation effort.

noahbenson (Contributor) commented:

I should have mentioned this issue in pull request #243.

That pull request causes pm.keys() to return a PSet object (this seemed right—PSet already uses a PMap for its implementation under the hood, so making a PSet from the map's keys is an O(1) operation).

It also returns view objects for values and items that mimic the builtin dict_values and dict_items types used by dict. I believe these get equality right as well (though note that builtin dict values views never compare equal to each other).
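
For reference, the builtin dict behaviour being mirrored here (CPython 3):

{1: 2, 3: 4}.items() == {3: 4, 1: 2}.items()   # True -- items views compare as sets
{1: 2}.values() == {1: 2}.values()             # False -- values views only compare by identity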

Maintaining key order is totally possible with a persistent dict type but would probably take some serious work—I don't imagine such a change would be efficient unless it involved rewriting the guts of the C implementation.

@tobgu I have a lot of experience with C programming and with writing low-level HAMTs and similar data structures if you want to chat about the prospect of rewriting these data structures from scratch.

tobgu (Owner) commented Feb 10, 2022

Thanks @noahbenson!

Yeah, looking into a speedier implementation of PMap would be nice. It's currently bolted onto the PVector to enjoy some of the perf benefits of that low-level implementation while avoiding the maintenance burden of separate low-level implementations for data structures other than the PVector. It would most definitely be much quicker with a proper C implementation of its own, though. See #160 for more on the same topic.

I have not looked at https://github.com/MagicStack/immutables for quite a while. There seems to have been some activity since then, so it might be worth revisiting. If it still isn't up to standard for use as the underlying implementation for PMap, it may be possible to contribute to that project to make it happen.

In general I'm slightly worried that adding too much low-level C code (with the required duplication in a pure Python version for PyPy, etc.) will raise the maintenance burden and the barrier to entry for outside contributors. I personally don't have a ton of time available to spend on pyrsistent.

As for the PVector, I think the performance is decent, but there are probably optimizations and structural changes that would improve it further. One thing that would be interesting to look into (as discussed in #127) would be reducing the places where ref counting is needed, to improve memory reuse between parent and child processes after a fork, for example.

noahbenson (Contributor) commented Feb 11, 2022

I took a look at the MagicStack library earlier—IMHO it's about 80% of the way there for maps. That said, I didn't actually pull it and run tests or anything, so this is all based on just browsing their repo. (Also, full confession: I haven't looked closely at the pyrsistent.PVector implementation.)

One of the really useful things about the pyrsistent-like library I maintain in Julia is that PVector and PDict objects use the exact same underlying (optimized) data structure, which is just a persistent HAMT type that maps unsigned 64-bit integer hashes (the keys) to arbitrary values without any collision handling. It's built such that if you map the hash-values 0, 1, 2... to x1, x2, x3..., as if the x values were an array, the resulting HAMT tree is always of the minimal depth required, just like most persistent vector implementations I've seen. TL;DR—there's no need for more than one data structure implemented in C; the same data structure can be optimal for both pvectors and pmaps if designed correctly.

I spent a few hours last night writing out some of the base code for a C extension of this kind of persistent HAMT, and one of my conclusions is that the whole thing can be written using only the Python.h header file; there's no need to add anything to the code that would be remotely problematic in cross-platform compilation. That said, I have relatively little experience with Python C extensions (I've never tried to publish one), so I'm probably underestimating the complexity.

To be honest I don't have a ton of time to work on this kind of thing right now either, but writing a C implementation of a persistent HAMT is a fairly small lift for me, so I'd be interested in writing, testing, and then pull-requesting something like the data structure described above if you'd be interested in doing the bulk of the integration work. Though, now that I think about it, a more logical approach would probably be for me to just publish the C implementation of the HAMT as a small independent library and let you incorporate it into pyrsistent where/as appropriate.

Re: ref-counting. Eee. I'm not sure how you are going to be able to get away with that. Any given node in any persistent structure that you make might get reused in an arbitrary number of other structures, and higher nodes that point to them might get garbage collected at any time. Fundamentally, every node you allocate in C has to be garbage-collectible somehow. If you aren't going to use Python's garbage collection scheme, then you need something that is compatible with it, and I don't see how to write a sweep-based garbage collector on top of Python's ref-counting. I thought about doing the refcounting internally (i.e., a separate, more efficient refcounter for the HAMT nodes), but if you do this, then you break Python's ability to detect loops in the reference graph (for example: v = pvector([[]]); v[0].append(v); del v might never garbage collect the objects involved).

All that said, I could be mistaken about any number of things in the above paragraph (it would take someone who knows the guts of the Python refcounting/GC system better than I to evaluate this). Also, it's not clear to me from the original question why the refcounts for the internal HAMT nodes are getting updated on reads—so maybe this is the root of the problem? (I'll have to take a closer look at the code).

tobgu (Owner) commented Feb 13, 2022

I'd be very interested in seeing what such a structure would look like. If it turns out to be a feasible replacement for the current PVector (feature-wise and perf-wise) I may consider switching the underlying implementation given that I can find the time to do the integration work. If you have an initial implementation sketched out I'd be happy to have a look at it. :-)

noahbenson (Contributor) commented:

Well, the short answer is that I have a working prototype of it that has already been extensively tested... it's just written in Julia 😅. This kind of low-level Julia is pretty similar to C in many ways (enough that I'm almost translating this library into C line-by-line), so if you want to take a look, all the relevant code is in this file. Once I check in the C version I'm working on, I'll ping you as well.

The ptree.jl file linked above will probably be hard to follow—it's not that the code is in a bad state, just that I haven't spent the time to write a clear high-level explanation of how the data structure works the way the MagicStack group has. Here's a brief stab at that. I'm not sure how useful it will be as-is, but it might help if you want to browse the code.

For simplicity, let's assume that the hash type is a 64-bit integer. A very simple version of a persistent hash array mapped trie (PHAMT) is one in which, in order to look up the leaf-node (value) for a hash integer (key), you start by dicing the hash into a sequence of integers represented by every five bits of the hash. Every 5 bits can be represented by an unsigned integer from 0 to 31 (the first 4 bits become an integer in [0,15]). If your hash integer gets diced into the values [4, 28, 19, ...] then, to find the associated leaf node, you traverse to the 4th child of the root node, the 28th child of that node, the 19th child of that node, etc. Each node in the tree contains an array of 32 children (many of which are just null). When you update the PHAMT, you just reallocate the nodes with appropriate changes as you traverse up the tree from the leaf node associated with the change. (Just to be clear, I'm visualizing the PHAMT's tree as a descending tree, such that the root has a depth of 0, its children have a depth of 1, and the nodes with the maximum depth ("twigs") store as their children ("leaves") the values that have been associated with the 64-bit integer hash-values.)
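
Here is a minimal Python sketch of the naive traversal just described, before the two optimizations below (no node skipping, no bitmap packing). The node layout and the names phamt_get/phamt_set are illustrative only, not the actual API or C layout:

def phamt_set(node, h, value, shift=60):
    # The root consumes the 4 leftover high bits (64 = 12*5 + 4); every deeper
    # level consumes the next 5 bits, so twig nodes (shift == 0) hold the leaves.
    width = 4 if shift == 60 else 5
    index = (h >> shift) & ((1 << width) - 1)
    children = list(node) if node is not None else [None] * (1 << width)
    if shift == 0:
        children[index] = value  # twig level: the child slot is the leaf value
    else:
        children[index] = phamt_set(children[index], h, value, shift - 5)
    return children  # path copying: only nodes along the path are reallocated

def phamt_get(node, h, default=None):
    for shift in [60] + list(range(55, -1, -5)):
        if node is None:
            return default
        width = 4 if shift == 60 else 5
        node = node[(h >> shift) & ((1 << width) - 1)]
    return default if node is None else node

# e.g. using small integers directly as "hashes":
root = None
for i, val in enumerate(["a", "b", "c"]):
    root = phamt_set(root, i, val)
phamt_get(root, 1)   # 'b'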

The version I'm currently translating from Julia to C is like this, but it makes 2 major optimizations:

  1. If, for any node above the max depth, exactly one child is non-empty, then there's no reason to allocate that node—a child node, which must be allocated anyway and which knows its position in the PHAMT tree, is sufficient. This means that we needn't stop at every depth of the tree when traversing it unless it is quite full. (A node's "position" in the PHAMT can be specified by its depth in the tree and the minimum hash value that can be found in the leaves reachable via its children; these are, equivalently, the bits of the hash integer that the node shares with all of its children.) In practice, what this means is that if you insert the keys/hash-values 0, 1, 2, 3 into an empty PHAMT, the resulting PHAMT will consist of only one node of max depth, whose children include the values mapped to those four keys. If you insert the values 0 through 45, there will be three nodes: two of max depth (storing 0–31 and 32–45) and one at max depth minus one.
  2. You don't need to allocate 32 empty children for every node—you can just store only the children you have, in hash-sorted order, and use some fancy bit arithmetic to translate hash-values into array indices (a sketch of this bitmap/popcount trick follows this list). This costs a little speed, but the bit arithmetic is O(1) and very fast, and the time saved during updates (due to faster allocation) is so substantial that I consider it well worth it (to say nothing of the memory savings).
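
A sketch of the bitmap/popcount trick from point 2, assuming each node stores a 32-bit occupancy bitmap plus a dense array holding only its non-empty children (illustrative names; the real C node layout would differ):

def child_slot(bitmap, hash_index):
    # Map a 5-bit hash index (0-31) to a position in the dense child array,
    # or None if that child is absent.
    bit = 1 << hash_index
    if not (bitmap & bit):
        return None
    # Occupied slots below this index = popcount of the lower bits
    # (int.bit_count() on Python 3.10+ does the same thing).
    return bin(bitmap & (bit - 1)).count("1")

# Example: children present at hash indices 3, 7, and 20.
bitmap = (1 << 3) | (1 << 7) | (1 << 20)
child_slot(bitmap, 7)    # 1 -- second entry in the dense array
child_slot(bitmap, 20)   # 2
child_slot(bitmap, 5)    # None -- no child stored for that index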

So because of optimization 1 above, when using a PHAMT as the backend for a pvector type, you can just treat the array index as the hash-value and the resulting PHAMT will be memory-optimal. It sacrifices a bit of speed, but not much (and honestly doing hard work like numerical operations on pvector types is probably never going to be efficient). Simultaneously, pmap types do just fine with this data structure; the only real change is that they tend to traverse fewer nodes during lookup and update operations.

Just to be clear: I'm not claiming to have discovered either of these optimizations, and the MagicStack implementation includes a slightly more sophisticated version of optimization 2 (both variants are so low-level and similar that it's hard to guess which is faster).

noahbenson (Contributor) commented:

@tobgu Here's the repo for this project—so far just the initial commit of the mostly-written-but-untested implementation. I don't recommend trying to use it yet, but most of the code is there if you want to look through it (let me know if you have questions/comments)!

tobgu (Owner) commented Feb 15, 2022

Thanks, keeping an eye on it!

I guess some sort of hash collision handling must be implemented outside of the structure to make it usable for mappings (which is probably less efficient but cleaner from an implementation perspective)?

I also think that some kind of transient ("evolver" in pyrsistent lingo) is required to make the structure efficient for bulk operations. I've not looked at the corresponding Julia code to see if there's something similar in place, though.

noahbenson (Contributor) commented:

Yeah—collision handling can pretty easily be done by having one PHAMT (p1) map key-hashes to PHAMTs (or just lists) of indices, and a separate PHAMT (p2) that just keeps an ordered list of everything inserted. So to look up key k, you get h = hash(k) then find indices = p1[h]. The indices will contain just one index unless there has been a collision for h, and that index points to the entry in p2 that holds the original key-value pair. As a bonus, this lets you easily iterate the dict in insertion order, because p2 stores the elements in insertion order (every time you insert a new key, you append the key-value pair to p2 and put the index of that new pair in p1).

Anyway, long story short—putting collision handling outside the PHAMT makes the PHAMT more performant for persistent vectors/lists, since they don't need that kind of collision handling.
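
An insert-only Python sketch of that two-trie layout, with plain dicts standing in for the two PHAMTs (hypothetical names, not the planned API):

class TwoTrieMap:
    def __init__(self):
        self.p1 = {}   # hash(key) -> list of indices into p2 (collision bucket)
        self.p2 = {}   # insertion index -> (key, value), i.e. insertion order
        self.n = 0

    def insert(self, key, value):
        # Insert-only sketch: replacing an existing key would need to update
        # its existing p2 entry rather than appending a new one.
        self.p2[self.n] = (key, value)
        self.p1.setdefault(hash(key), []).append(self.n)
        self.n += 1

    def lookup(self, key, default=None):
        for i in self.p1.get(hash(key), ()):
            k, v = self.p2[i]
            if k == key:           # disambiguate genuine hash collisions
                return v
        return default

    def items(self):
        return list(self.p2.values())   # insertion order comes for free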

Re: transients—I just haven't gotten around to writing these yet, but since it's all in C, it can be done pretty easily by sticking a bit in the PHAMT data structure that flags whether the object is transient or not. Edits to transient objects can then reallocate persistent nodes into transient ones as changes are made, and all those bits get flipped back to "persistent" when the object gets persisted.
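
A rough Python rendering of that flag idea, showing the copy-on-write policy and the final persist pass (in the real version the flag would be a bit in the C node struct; names are illustrative):

class Node:
    __slots__ = ("children", "transient")
    def __init__(self, children, transient=False):
        self.children = children
        self.transient = transient

def editable(node, in_transient_edit):
    # During a transient edit, nodes already marked transient belong to this
    # edit and may be mutated in place; anything else is copied on write.
    if in_transient_edit and node.transient:
        return node
    return Node(list(node.children), transient=in_transient_edit)

def persist(node):
    # Flip the flags back so every reachable node is shared/immutable again.
    if isinstance(node, Node) and node.transient:
        node.transient = False
        for child in node.children:
            persist(child)
    return node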

When I wrote the transient system in Julia, I was surprised to discover that building a very large dictionary (a million elements) with transients was only ~20% faster than doing it all with persistent objects. This is likely very different for arrays, where appending to a persistent structure will constantly reallocate the same node, but I think that with dictionaries, the sparsity of keys is so high that you don't save much time with the transient approach. Anyway, worth testing with this implementation as it comes along.
