Elmord's Magic Valley

Software, lingüística e rock'n'roll. Sometimes in English.

Persistent hashmaps

2019-05-23 21:36 -0300. Tags: comp, prog, pldesign, fenius, in-english

Fenius got dictionaries! Now you can say:

fenius> let d = dict("foo" => 1, "bar" => 2)
dict("foo" => 1, "bar" => 2)
fenius> d["foo"]
fenius> d.with("foo" => 3, "quux" => 4)
dict("foo" => 3, "bar" => 2, "quux" => 4)
fenius> d.without("foo")
dict("bar" => 2)

Dictionaries are immutable: the with and without methods return new dictionaries, rather than modifying the original. I plan to add some form of mutable dictionary in the future, but that will probably wait until I figure Fenius's mutability story. (Right now, you can mutate the variables themselves, so you can write d := d.with("foo" => 42), for example.)

Dictionaries are persistent hashmaps, much like the ones in Clojure. Fenius's implementation of hashmaps is simpler than Clojure's, because it does less: beside persitent/immutable hashmaps, Clojure also supports transient hashmaps, a mutable data structure onto which entries can be inserted and removed efficiently, which then can be turned into a persistent hashmap. I want to implement something like this in Fenius at some point. On the flip side, the current Fenius implementation is easier to understand than its Clojure counterpart.

The underlying data structure is based on hash array mapped tries (HAMTs). The standard reference for it is Bagwell (2000). The persistent variant used by Clojure, Fenius and other languages is slightly different, but the basic idea is the same. In this post, I intend to explain how it works.

I will begin by explaining an auxiliary data structure used by the hashmap, which I will call a sparse vector here.

Sparse vectors

Suppose you want to create vectors of a fixed size (say, 8 elements), but you know most of the positions of the vector will be empty much of the time. If you create lots of these vectors in the naive way, you will end up wasting a lot of memory with the empty positions.

23 42

A cleverer idea is to split this information into two parts:

00001010 23 42

(Note that the first position is represented by the lowest bit.) If we know beforehand that the number of positions is limited (say, 32), we can fit the bitmap into a single integer, which makes for a quite compact representation.

That takes care of representing the vectors. But how do we fetch a value from this data structure given its position index i? First, we need to know if the position i is filled or empty. For that, we bitwise-and the bitmap with a mask containing only the ith bit set, i.e., bitmap & (1 << i) (or, equivalently, bitmap & 2i). If the result is non-zero, then the ith bit was set in the bitmap, and the element is present. (If it's not, we return a default empty value.)

Once we know the element is there, we need to find its place in the content vector. To do that, we need to know how many elements are there before the one we want, and skip those. For that:

The result of all this is the number of filled positions before the one we want. Because positions are counted from 0, it also happens to be the position of the wanted element in the content vector (e.g., if there is 1 element before the one we want, the one we want is at position 1 (which is the second position of the vector)). In summary:

def sparsevec_get(sparsevec, position):
    if sparsevec.bitmap & (1 << position):
        actual_position = bit_count(sparsevec.bitmap & ((1 << position) - 1))
        return sparsevec.content[actual_position]
        return SOME_EMPTY_VALUE

To insert an element at position i into the vector, we first figure if position i is already filled. If it is, we find its actual position in the content vector (using the same formula above) and replace its contents with the new value. If it is currently empty, we figure out how many positions are filled before i (again, same formula), and insert the new element just after that (and update the bitmap setting the ith bit). Analogously, to remove the ith element, we check if it is filled, and if it is, we find its position in the content vector, remove it, and clear the ith bit in the bitmap.

In our real implementation, sparse vectors will be persistent, so updates are performed by creating a new vector that is just like the original except for the added or removed elements.

Enough with sparse vectors. Let's get back to HAMTs.

Hash Array-Mapped Tries

A trie ("retrieval tree") is a tree where the path from the root to the desired element is determined by the symbols of the element's key. More specifically:

A hash function is a function f such that if two elements x and y are equal, then f(x) = f(y). (The opposite is not true: there are typically infinitely many elements x for which f(x) produces the same result.) The result of f(x) is called a hash of x. A hash is typically a finite sequence of bits, or symbols... from an alphabet. You can see where this is going.

A hash trie is a trie indexed by hashes. To find an element with key x in the trie, you first compute the hash f(x), and then consume successive bits (or symbols) from the hash to decide the path from the root of the trie to the desired element.

In principle, this would mean that the depth of the trie would be the same as the length of the hash. For example, if our hashes have 32 bits, we would have to traverse 32 children until we reach the element, which would be pretty wasteful in both storage and time. However, we don't need to use the entire hash: we can use just enough bits to distinguish the element from all other elements in the trie. For example, if the trie has elements x with hash 0010, y with hash 0100, and z with hash 1000, we build a trie like:

                  0 / \ 1
                   /   \
                  *     z
                0/ \1
                /   \
               x     y

That is: z is the only element whose hash starts with 1, so we can store it right below the root as the child labelled 1. Both x and y have hashes beginning with 0, so we create a subtree under the root's 0 for them; the second bit of the hash is 0 for x and 1 for y, so we store them below the subtree under the appropriate label. We don't need to go any deeper to disambiguate them.

To insert an element in the tree, we compute its hash and start traversing the tree until either:

If our hashes have a finite number of bits, it may happen that two distinct elements end up having the same hash. There are two ways of handling this problem:

The only problem with this scheme is that our trees can still get pretty deep as we add more elements to them. For example, if we add 1024 elements (210) to the tree, we need at least 10 bits of hash to distinguish them, which means we need to go at least 10 levels deep in the tree to find them; the deeper the tree is, the slower it is to find an element in it. We can reduce the depth if, instead of branching on single bits of the hash, we use groups of bits of a fixed size, say, 5 bits. Then instead of each node having 2 children, each node has 25 = 32 children, labelled {00000, 00001, ..., 11110, 11111}, and we consume 5 bits of the hash for each level we traverse. Now a tree of 210 elements will typically be 2 levels deep rather than 10, which makes it much faster to traverse.

The only problem with this scheme is that each node now needs to have space for 32 children, even when the children are empty. For example, if I store two elements, x with hash 00000 00001, and y with hash 00000 00100, the root of the tree will be a node with 32 children, of which only the 00000 child will be non-empty. This child will contain a subtree containing x at position 00001, y at position 00100, and all other 30 positions empty. If only we had a way to only store those positions that are actually filled. A sparse vector, if you will...

Congratulations, we have just invented hash array-mapped tries, or HAMTs. A HAMT is a hash trie in which each non-leaf node is a sparse vector, indexed by a fixed-length chunk of bits from the hash. To find an element in the HAMT, we traverse the sparse vectors, consuming successive chunks from the hash, until we either find the element we want, or we consume the entire hash and reach a collision list (in which case we look for the element inside the list), or we reach an empty child (in which case the element is not found). Because the sparse vector only allocates space for the elements actually present, each node is compact, and because each level is indexed by a large-ish chunk of bits, the tree is shallow. Win win.

The sparse vectors are immutable, and so is our tree. To add an element to the tree, we have to change the nodes from the root to the final place of the element in the tree, which means making copies of them with the desired parts changed. But the nodes that are not changed (i.e., those that are not part of the path from the root to the new element) are not copied: the new tree will just point to the same old nodes (which we can do because we know they won't change). So adding an element to the tree does not require making a full copy of it.

Removing an element from a HAMT requires some care. Basically, we have to replace the node where the element is with an empty child. But if the subtree where the element was had only two elements, and after removal is left with only one, that one element takes the place of the whole subtree (you never have a subtree with a single leaf child in a HAMT, because the purpose of a subtree is to disambiguate multiple elements with the same hash prefix; if there is no other element sharing the same prefix with the element, there is no point in having a subtree: the element could have been stored directly in the level above instead).

Miscellaneous notes

When profiling my implementation in a benchmark inserting a million elements in a HAMT, I discovered that most of the time was spent on an auxiliary function I wrote to copy sequences of elements between vectors (when updating sparse vectors). This would probably be more efficient if R6RS (or Chez) had an equivalent of memcpy for vectors. It does have bytevector-copy!, but not a corresponding vector-copy!. Go figure.

R7RS does have a vector-copy!, but I'm using Chez, which is an R6RS implementation. Moreover, R7RS(-small) does not have bitwise operations. But it totally has gcd (greatest common divisor), lcm (least common multiple) and exact-integer-sqrt. I mean, my idea of minimalism is a bit different. Also, it has a timestamp function which is based on TAI instead of UTC and thus requires taking account of leap seconds, except it's not really guaranteed to return accurate time anyway ("in particular, returning Coordinated Universal Time plus a suitable constant might be the best an implementation can do"). Yay. [/rant]

Implementing transient/mutable HAMTs efficiently is a bit more complicated. For it to be efficient, you need to be able to insert new elements in a sparse vector in-place, but you can only do that if you pre-allocate them with more space than they actually need, so you have room for growing. Finding a proper size and grow factor is left as an exercise to the reader.

Comparing performance with Clojure HAMTs is not a very exact benchmark, because the implementations are not in the same language (Clojure's is in Java and Fenius's is in Chez Scheme). In my tests doing 10M insertions, Clojure sometimes did much faster than Chez, sometimes much slower, with times varying between 16s and 48s; the JVM works in mysterious ways. Chez's fastest was not as fast as Clojure, but performance was consistent across runs (around ~35s). Note that this is the time for using the hashmap implementation from ChezScheme, not Fenius; doing the benchmark directly in Fenius would be much slower because the language is currently interpreted and the interpreter is very naive. Note also that in actual Clojure, you would do all the insertions on a transient hashmap, and then turn it into a persistent hashmap after all the insertions, so the benchmark is not very representative of actual Clojure usage.

End of file

That's all for now, folks. I wanted to discuss some other aspects of Fenius dictionaries (such as syntax), but this post is pretty big already. Enjoy your hashmaps!

Comentários / Comments (0)

Deixe um comentário / Leave a comment

Main menu

Posts recentes

Comentários recentes


comp (129) prog (64) life (45) in-english (38) pldesign (32) unix (32) lang (30) random (28) about (25) mind (23) lisp (22) mundane (21) fenius (17) web (17) ramble (16) img (13) hel (12) rant (12) privacy (10) scheme (9) freedom (8) esperanto (7) bash (7) academia (7) copyright (7) music (7) lash (7) mestrado (6) shell (6) home (6) misc (5) conlang (5) php (4) editor (4) latex (4) book (4) politics (4) etymology (4) worldly (4) security (3) kbd (3) film (3) tour-de-scheme (3) network (3) c (3) android (3) emacs (3) wrong (3) poem (2) cook (2) physics (2) llvm (2) lows (2) comic (2) treta (2) perl (1) wm (1) audio (1) german (1) kindle (1) pointless (1) philosophy (1)


Quod vide

Copyright © 2010-2019 Vítor De Araújo
O conteúdo deste blog, a menos que de outra forma especificado, pode ser utilizado segundo os termos da licença Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.

Powered by Blognir.