Okay, I'm gonna write this down now to distract myself for a while before I get back to Master's stuff.
In a recent post I talked about the problem of cross-process garbage collection, and suggested wrapping objects in a reference-counted container when crossing process boundaries as a possible solution, but I remarked that this would have a large overhead when passing many small objects. The prime example would be passing a linked list, as (at least naively) every node of the list would get wrapped as the elements of the list are accessed.
Now, I particularly cared about this case because the linked list (based on cons cells) is a very prominent data structure in Lisp. And although they have some nice properties (they are conceptually simple, you can insert and remove elements into the middle/end of a list by mutating the cdrs), they also are not exactly the most efficient data structure in the world: half the memory they use is just for storing the "next" pointer (which fills processor cache), whereas in a vector you just need a header of constant size (indicating the vector size and other metadata) and the rest of the memory used is all payload. Also, vectors have better locality. On the other hand, "consing" (i.e., nondestructively inserting) an element into a vector is O(n), because you have to copy the whole vector, and even destructive insertion may require a whole copy every once in a while (when you exceed the current capacity of the vector). I've been wondering for a long time: could you make a Lisp based on a data structure that is halfway between a linked list and a vector?
If we are to allow the common Lisp idioms with this new kind of list, it has to support consing and taking the tail of the list efficiently. (Another possibility is to replace the common idioms with something else. That is much more open-ended and requires more thought.)
What I've been thinking of as of late is roughly a linked list of vectors, with some bells and whistles; each vector would be a chunk of the list. Each vector/chunk would have a header containing: (1) the number of elements in the chunk; (2) a link to the next chunk; (3) an index into the next chunk. Then comes the payload. So, for example, if you have the list (w x y z), and you want to append the list (a b c) on the front of it, you'd get a structure like this (the | separates graphically the header from the payload; it does not represent anything in memory):
[3 * 0 | a b c] | `->[4 * 0 | w x y z] | `-> ø
The reason for the index is that now you can return the tail of a list lst without the first n elements by returning a vector chunk with 0 length and a pointer into lst with index n: [0 lst n | ]. If the n is greater than the size of the first chunk (e.g., if you want to drop 5 elements from the (a b c w x y z) list above), we must follow the "next" pointers until we find the chunk where the desired tail begins. This is likely to be more efficient than the cons cell case, because instead of following n "next" pointers, you follow the number of chunks, subtracting the length of the skipped chunk from n each time. In the worst case, where there is one chunk for each element, the performance is the same as for cons cells, at least in number of pointers traversals. (We must only allow empty chunks, like the [0 lst n | ] example, at the beginning of a list, never in the middle of a chunk sequence. This ensures worst-case cons-like behavior. If we allowed empty chunks anywhere, reaching the nth element of a list could require arbitrarily many chunk traversals.)
One problem with this is that now (cdr lst) allocates memory (it creates a [0 lst 1 | ] chunk and returns it), unlike the cons cell case, where cdr never allocates memory (it just returns the value of the cell's "next" pointer). One possible solution is to try to make the result of cdr go in the stack rather than being heap-allocated, to reduce the overhead (the compiler could special-case cdr somehow to make it return multiple values rather than a new chunk, and build the chunk on the fly in the caller if it turns out to be necessary.) Another way around this would be to return a pointer into the middle of a chunk instead of a new chunk. I see two ways of achieving this:
Change the layout of the vector such that it begins with the payload, followed by a special value that marks the end of the payload, followed by the pointer to the next chunk. Now we don't need an index into the next chunk anymore, because we can just make the "next" pointer point into the middle of the next chunk. The drawback is that now finding the end of the chunk requires traversing the whole chunk, rather than looking at the length at the header. This would still be faster than with cons cells, because you're traversing adjacent elements in memory.
Require all chunks to have a fixed size and fixed memory alignment (say, every chunk has 64 bytes and is allocated at an address that is a multiple of 64). Then you can always find the header by zeroing out the last bits of the address.
Third idea that occurred to me now: if all chunks are, say, 16-byte aligned, you can encode an index 0~15 into the lower-order bits of the pointer.
All these have drawbacks. First, you need to know that the pointer you have is a pointer to a cons cell to be able to safely do the pointer arithmetic. (The fixed-size chunks case is simpler to solve: you zero out the pointer and see if it points to a chunk type tag.) Also, pointers into the middle of objects complicate garbage collection (and even more reference counting, I think). Finally, if you fix the size of chunks some of the advantages of using chunks in first place go away; if I allocate a 1000-element list at once, that should get me a single 1000-element chunk.
Or should it? Another problem here is that now garbage collection / reference counting can only collect whole chunks. If you choose your chunks badly, you may end up holding memory for longer than necessary. For instance, if you have a 1000-element list and at some point your program takes tails until it only remains with a reference to the last three elements, and the list was made out of a single 1000-element chunk, now you're stuck with a huge chunk most of which is unused – and more, all the elements in it are held from being collected too. Maybe we'd need a heuristic: if the tail size you want is less than some threshold size of the chunk, the system would return a copy of the tail rather than the tail. This would mess with mutability (you'd never know if the tail list you got shares storage with the original), but maybe immutable lists are the way to go anyway.
The other problem to solve is how to make cons efficient: the classical Lisp cons adds (non-destructively) one element to the front of an existing list, and we don't want to create a new chunk per cons invocation, otherwise the chunks just degenerate into cons cells. One idea I had is to allocate chunks with a least a certain amount of elements. For example, if you create a list with just a, you'd get a chunk with a few blank spaces (and enough metadata to know what is blank and what isn't; this could be an extra header element, or just a distinguished value meaning "blank"): [4 ø 0 | _ _ _ a]. Now, when you cons a new element x into that list, cons would check if there is a space immediately before the a in the existing chunk, and mutate it in place: [4 ø 0 | _ _ x a]. This won't mess with the program's view of the list because so far it only had references to the already filled part of the list. The problem with this is if you have multiple threads wanting to cons onto the same list at the same time: we must ensure only one of them gets to mutate the chunk. For example, say one thread want to cons x onto the list (a), and another thread wants to cons y onto the same list (a). We must make sure that only one gets to mutate the chunk in place ([4 ø 0 | _ _ x a]), and the other one will fail and fall back to either by copying the chunk and then mutating the copy, or by creating a new chunk that points to the old one ([4 [4 ø _ _ x a] 3 | _ _ _ y]; note that outer chunk points into the inner chunk with an index 3, skipping the first 3 elements, including the x added by the other thread). This could have a synchronization overhead. I'm not sure if it would be significant, though, because all you need is a compare-and-swap: "try to write into this space if it is blank". You don't need a lock because you don't need to wait anyone: if this first try fails (i.e., if the other thread got the space first), the space won't be available anymore, so you must immediately fall back to creating a new chunk rather than waiting for anything.
A possible side-effect of all of this is that now vectors as a separate data structure may not be necessary: you just allocate an n-element list at once, and it will largely have the same performance as an n-element vector. Well, unless we make lists immutable, then we may need (mutable) vectors. And lists still have some arithmetic overhead to find the position of the element (because in general we don't know that the list is a single chunk when performing an access, we have to find that out), so vectors may still be advantageous in many circumstances.
Now, back to (trying to) work.
[Update: Apparently I reinvented a half-hearted version of VLists. Also, I didn't mention that, but the Lisp Machine had a feature similar in spirit (but not in implementation) called CDR coding, which used a special tag in cons cells to mean that the rest of the list itself rather than a pointer to it was stored at the cdr place, thus saving one pointer and gaining locality. In the Lisp Machine, every memory object was tagged, so this special tag came more or less for free, which is generally not the case for modern architectures.]
Copyright © 2010-2022 Vítor De Araújo
O conteúdo deste blog, a menos que de outra forma especificado, pode ser utilizado segundo os termos da licença Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
Powered by Blognir.