An approach to mutation :: Elmord's Magic Valley

An approach to mutation

2018-08-01 22:19 -0300. Tags: comp, prog, lisp, pldesign, in-english

I've been toying lately with an idea for handling mutation in a new Lisp-like programming language. Most of these ideas are probably not new – in fact, while doing a little research I've found out about Clojure's transients, which embody some of the same ideas – and the parts that are possibly new are not necessarily good. But I want to write this down for future reference, so there we go.

DISCLAIMER: This is one of those programming language design posts exploring a bunch of ideas and reaching no conclusion. Read at your own peril.

Some context

The problem with mutation is when the mutable data is shared with other parts of the program – especially when you don't know what parts of the program share the same data. For example, suppose you call a method blog_post.get_tags(), and it returns you a list ["comp", "prog", "lisp"] – can you mutate this list? If I were to sort it, or remove elements from it, could I do it in-place, or would I be mutating a list used internally by the blog_post object and thus inadvertently affecting other parts of the program? Without looking at the method's source code, we don't know. If we wanted to be sure not to break anything, we would have to make a copy of the list and change the copy instead.

What if I am the person writing the get_tags() method? Should I always return a new copy of the list, wasting some memory and cycles but ensuring that whoever calls my function won't be able to affect the internal fields of the blog_post? Or should I always return the same list object, thus avoiding a new allocation but relying on the caller to do the right thing?
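To make the dilemma concrete, here is a minimal Python sketch. The BlogPost class and the two accessor names (get_tags_shared, get_tags_copy) are made up for illustration; they just show the two choices side by side.

```python
class BlogPost:
    """Hypothetical class, only here to illustrate the aliasing problem."""

    def __init__(self, tags):
        self._tags = list(tags)

    def get_tags_shared(self):
        # Cheap: hands out the internal list itself.
        return self._tags

    def get_tags_copy(self):
        # Safe: allocates a fresh list on every call.
        return list(self._tags)


post = BlogPost(["comp", "prog", "lisp"])

tags = post.get_tags_copy()
tags.sort()  # only the copy is sorted
copy_left_post_intact = post.get_tags_shared() == ["comp", "prog", "lisp"]

tags = post.get_tags_shared()
tags.sort()  # sorts the post's internal list as well
shared_changed_post = post.get_tags_shared() == ["comp", "lisp", "prog"]
```

From the caller's side both accessors look the same, which is exactly the problem: nothing in the returned value says whether sorting it is safe.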

Now consider the strings inside that list. If I were to convert them to upper-case, should I do it in-place, or copy them first? In a language like C, this is the same problem as before: you have to know whether get_tags() gives you a copy of the original strings (which you can freely modify), or the internal strings used by blog_post (which you should not modify). But in a language like Java or Python, this problem does not come up: since strings are immutable in those languages, the only way to 'change' them is by making a new string, so modifying them in-place is not an option. On the other hand, the writer of the get_tags() method can now happily return the internal strings of the blog_post object, since they can be sure the strings cannot be modified by external code.

If you make all data structures immutable, you eliminate this problem – and that's the purely-functional approach, taken by languages like Haskell. Clojure is similar in making the core data types (such as lists, vectors and hashmaps) immutable, and having controlled forms of mutability. In traditional Lisps like Scheme and Common Lisp, on the other hand, most composite data types (including lists, vectors and strings) are mutable. The standards of those languages are careful in describing which functions always return freshly allocated data and which return values that may share parts with the function's arguments. This is basically part of the contract of those functions, which you have to know whenever you want to mutate values generated by them. The situation in traditional Lisps is somewhat aggravated by the fact that linked lists may share a tail: two lists (1 3 7 5) and (2 7 5) may actually share the same cons cells for the (7 5) part. In a mostly functional setting, this is okay, but if we want to mutate anything, we have to be extra careful not to inadvertently change something else. In this example, sorting the second list in place may end up messing up the first list.
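The shared-tail hazard can be sketched in Python with a hand-rolled cons cell. The Cons class and the helper functions are hypothetical, and the in-place sort shown here (rewriting the heads of the existing cells) is just one possible strategy, not necessarily what any particular Lisp does.

```python
class Cons:
    """A minimal mutable cons cell (hypothetical, for illustration only)."""

    def __init__(self, head, tail=None):
        self.head = head
        self.tail = tail


def to_list(cell):
    """Collect the elements of a cons list into a Python list."""
    out = []
    while cell is not None:
        out.append(cell.head)
        cell = cell.tail
    return out


def sort_in_place(cell):
    """Sort by overwriting the heads of the existing cells."""
    for value in sorted(to_list(cell)):
        cell.head = value
        cell = cell.tail


shared = Cons(7, Cons(5))         # the (7 5) part
first = Cons(1, Cons(3, shared))  # (1 3 7 5)
second = Cons(2, shared)          # (2 7 5)

sort_in_place(second)
# second is now (2 5 7), and first has silently become (1 3 5 7)
```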

What I'm interested in is finding a middle ground between full immutability and full mutability. I want to be able to return immutable data from functions, so I can know the consumers of that data won't inadvertently change it, and I also want to be able to create mutable data which can be modified in place. It would also be nice to be able to use mutable data for temporary processing and make it immutable after we are ready. And I want to be able to tell at a glance if I'm dealing with mutable or immutable data.

So here is the idea…

The idea

First, we make all basic composite data types (lists, vectors, dictionaries, strings, etc.) immutable. Then we add a single mutable box type. Values of the box type have a single mutable field. This idea is at least as old as ML's ref type, so nothing new so far. I will use the notation &val to mean a box containing val, and the expression (set! box val) replaces the contents of box with val. I will also use @ (read "at") for the indexing function, so (@ vec idx) means the idxth element (starting at 0) of vector vec. (@ box) with no indices means to extract the box's contents.
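As a rough Python analogue of this notation (a sketch only: the Box class and the at function are invented names, and a tuple stands in for an immutable vector):

```python
class Box:
    """A single mutable field holding an (ideally immutable) value."""

    def __init__(self, value):
        self._value = value

    def get(self):
        # Plays the role of (@ box).
        return self._value

    def set(self, value):
        # Plays the role of (set! box val).
        self._value = value


def at(vec, idx):
    # Plays the role of (@ vec idx).
    return vec[idx]


b = Box((1, 2, 3))      # &[1 2 3]
first = at(b.get(), 0)  # element 0 of the boxed vector
b.set((4, 5, 6))        # replace the whole contents of the box
```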

So now we can make a mutable cell with an immutable vector inside, e.g., &[1 2 3]. We cannot mutate the vector directly, but we can replace the whole vector with another immutable vector. That may be elegant and all, but it's not as convenient as a mutable array, nor as efficient. There are some tricks we can play here, though.

The first trick is to make the assignment operator (set!) recognize vector indexing as its first argument, so if v is the vector-containing box &[1 2 3], we can write (set! (@ v 0) 42) to replace the vector [1 2 3] with the vector [42 2 3] inside the box. It looks like we are mutating the vector's first element, but actually we are replacing the whole original immutable vector with a new immutable vector with a different element at position 0.
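In Python terms, the meaning of (set! (@ v 0) 42) under this first trick could be sketched as follows. The Box class and set_at are hypothetical names, and tuples again stand in for immutable vectors.

```python
class Box:
    """Mutable cell; its contents are expected to be immutable."""

    def __init__(self, value):
        self.value = value


def set_at(box, idx, new):
    """Naive (set! (@ box idx) new): build a new tuple and store it in the box."""
    old = box.value
    box.value = old[:idx] + (new,) + old[idx + 1:]


v = Box((1, 2, 3))  # &[1 2 3]
original = v.value
set_at(v, 0, 42)
# v.value is now a brand-new tuple; the old one is untouched
```

Anyone still holding the old vector keeps seeing (1, 2, 3), which is the whole point of the semantics, and also why each such write allocates.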

This gets us convenience, but it's still inefficient: if I write a loop to mutate all elements of the vector, it will generate a fresh new vector on each iteration. But then comes the second trick: how can we tell the difference between a mutable cell with an immutable vector inside and an actual mutable vector? If we make the difference invisible to the programmer, then we can mutate the vector in-place as an optimization. So (set! (@ v 0) 42) syntactically looks like mutating a vector element, semantically means replacing the whole vector with a new one, but implementationally actually works by mutating the vector element anyway. I'm not sure about the wisdom of this double layer of self-cancelling illusions, but let's explore this idea further.

Let's call the naive implementation using a mutable box with an immutable vector inside, well, the naive implementation. And let's call the implementation which underlyingly uses a mutable vector to represent the box+vector combination the smart implementation (with the full understanding that it may actually be too smart for its own good, or maybe not smart enough to make this idea work well).

The most basic operation you can do with a box is extracting the contents. In the naive implementation, that's just returning the value inside. In the smart implementation, we must simulate this by copying the current contents of the mutable vector into a freshly allocated immutable vector and returning that. A user can then observe the difference between the two implementations by taking the contents of the same box twice and checking whether the results are eq? to each other, i.e., whether they are the same object in memory.
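A possible shape for the smart implementation, sketched in Python under the assumption that a list plays the underlying mutable vector and a tuple plays the immutable one (SmartBox is a made-up name):

```python
class SmartBox:
    """Box and vector fused into one mutable list (hypothetical sketch)."""

    def __init__(self, items):
        self._items = list(items)

    def set_at(self, idx, value):
        # In-place write: no new vector is allocated.
        self._items[idx] = value

    def get(self):
        # Each extraction builds a fresh immutable snapshot.
        return tuple(self._items)


b = SmartBox([1, 2, 3])
b.set_at(0, 42)
x = b.get()
y = b.get()
same_contents = x == y  # equal element by element
same_object = x is y    # identity check, the analogue of eq?
```

In CPython the two snapshots compare equal but are distinct objects, so an identity check is exactly the kind of test that tells this implementation apart from the naive one.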

It seems to me that the solution to this problem is to ditch object identity for immutable objects from the language, i.e., get rid of the lower-level eq? operation (Python's is), or at least relegate it to a library of lower-level operations. Immutable objects should only be compared by their contents, not by identity: if I compare [1 2 3] with [1 2 3], it should not matter whether they are separate objects in memory or not: they have the same contents and that's what matters. The only way to tell two distinct objects with the same contents from each other (apart from eq?) is by mutating one of them and seeing if the other changes as well; but if the objects are immutable, this distinction disappears.

A possible optimization to reduce the amount of copying when extracting a box's contents is to return the actual underlying vector, but mark the box as copy-on-write, i.e., we postpone the copy to the next time we need to mutate the vector inside the box. If the box is not mutated afterwards, the vector is never copied. The problem with this may be performance: we need to check the copy-on-write flag before every change to the vector, and the whole point of these optimizations is performance. Sure, we avoid a copy, but we slow down every write to the vector. This is aggravated by the fact that this flag must be synchronized across threads, lest we end up with two threads making new copies and clobbering each other's view of the box.
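A single-threaded Python sketch of the copy-on-write variant follows. CowBox and its copies counter are invented for the example. Python cannot freeze a list in place, so this sketch simply assumes the caller treats the extracted list as immutable.

```python
class CowBox:
    """Copy-on-write box; single-threaded sketch, names are hypothetical."""

    def __init__(self, items):
        self._items = list(items)
        self._shared = False
        self.copies = 0  # instrumentation for the example only

    def get(self):
        # Hand out the underlying storage and remember that it is now shared.
        # Assumption: the caller never mutates the returned list.
        self._shared = True
        return self._items

    def set_at(self, idx, value):
        if self._shared:  # this check runs on every single write
            self._items = list(self._items)
            self._shared = False
            self.copies += 1
        self._items[idx] = value


b = CowBox([1, 2, 3])
snapshot = b.get()  # no copy here
b.set_at(0, 10)     # first write after extraction: the copy happens now
b.set_at(1, 20)     # no further copy
```

The flag check in set_at is the per-write cost mentioned above, and in a multithreaded setting both the flag and the swap of the storage would need synchronization, which this sketch does not attempt.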

Speaking of which, doesn't thread synchronization throw this whole idea out of the window anyway? Extracting the contents of a box must be an atomic operation, but someone might be mutating the underlying vector while we are copying it into a new immutable vector to return it. This is okay as long as we can guarantee that the resulting copy represents one possible atomic state of the box at the time of the extraction, but consider the following scenario:

The box b originally contains &[1 2 3 4 5].
Thread 1 evaluates (def v (@ b)). The implementation starts to copy the elements into a new immutable vector.
Right after thread 1 copies 1, thread 2 evaluates (set! (@ b 0) 10). Now b is &[10 2 3 4 5]. Thread 1 keeps on with its copying.
Thread 2 evaluates (set! (@ b 4) 50). Now b is &[10 2 3 4 50].
Thread 1 finishes copying. Because the copy saw the second mutation but not the first one, the resulting copy has [1 2 3 4 50].
However, [1 2 3 4 50] has never been a state of b: the only states b had were [1 2 3 4 5], [10 2 3 4 5], and [10 2 3 4 50]. Thus, the illusion that the box contains an immutable vector is broken.
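The interleaving above can be replayed deterministically in Python by modelling thread 1's copy as a generator that pauses after each element. This is a simulation rather than real threading, and all the names are made up for the example.

```python
def copy_elementwise(source, dest):
    """Copy source into dest one element at a time, pausing after each
    element so that another 'thread' can be interleaved by hand."""
    for i in range(len(source)):
        dest.append(source[i])
        yield


storage = [1, 2, 3, 4, 5]  # underlying mutable vector of box b
states = [tuple(storage)]  # every state b has actually been in

dest = []
copier = copy_elementwise(storage, dest)
next(copier)               # thread 1 copies element 0 (the value 1)

storage[0] = 10            # thread 2: (set! (@ b 0) 10)
states.append(tuple(storage))
storage[4] = 50            # thread 2: (set! (@ b 4) 50)
states.append(tuple(storage))

for _ in copier:           # thread 1 finishes copying
    pass
snapshot = tuple(dest)     # (1, 2, 3, 4, 50)
torn = snapshot not in states
```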

To avoid this problem, the implementation would have to acquire a lock (or use some other form of thread synchronization) when extracting the contents of a box, thus slowing down what should be a cheap operation. The copy-on-write solution avoids the copy incoherence problem, because the copying happens from the now-immutable vector to the new mutable one, and not the opposite, so we know that the origin will not change mid-copy. But as we have seen, we need to ensure synchronization of the copy-on-write flag, so it's pretty much the same.
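One way to get atomic extraction is a per-box lock, sketched here with Python's threading.Lock. LockedBox is a hypothetical name, and other synchronization schemes would do as well.

```python
import threading


class LockedBox:
    """Box whose writes and extractions are serialized by one lock."""

    def __init__(self, items):
        self._items = list(items)
        self._lock = threading.Lock()

    def set_at(self, idx, value):
        with self._lock:
            self._items[idx] = value

    def get(self):
        with self._lock:  # no writer can run while the copy is in progress
            return tuple(self._items)


b = LockedBox([1, 2, 3, 4, 5])
b.set_at(0, 10)
snapshot = b.get()
```

Every snapshot now corresponds to some real state of the box, at the price of taking the lock on every read and every write.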

Maybe this synchronization requirement is a good thing: maybe we want copies to be atomic anyway, and this way the semantics of the language guarantees that. But maybe this is an unnecessary overhead most of the time.

Even if we go with the copy-on-extract solution, we can avoid copying in the case of extracting the object from the box and then discarding the box (e.g., if we want to create a mutable vector, do a bunch of mutating operations on it, and then make it immutable) by providing an (unbox! b) operation which returns the contents and sets the box contents to nil (or whatever other value to indicate that the box is "empty"). Because we know the vector will not be mutated again, we can just return the underlying vector and call it immutable. This is basically what Clojure's persistent! operation does (though I didn't know that when I had this idea).
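A sketch of what unbox! might look like in Python. The Box class here is hypothetical, None stands in for nil, and the returned list is assumed to be treated as immutable from then on, since Python cannot enforce that without copying.

```python
class Box:
    """Mutable vector builder with a destructive unbox operation."""

    def __init__(self, items):
        self._items = list(items)

    def _check(self):
        if self._items is None:
            raise ValueError("box is empty")

    def push(self, item):
        self._check()
        self._items.append(item)

    def unbox(self):
        """Hand over the storage and leave the box empty: no copy needed."""
        self._check()
        items, self._items = self._items, None
        return items


b = Box([])
b.push(1)
b.push(2)
v = b.unbox()  # v now owns the storage; b can no longer touch it
```

Because the box gives up its only reference to the storage, later writes through the box are impossible, which is what makes skipping the copy safe.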

Let's consider some other problems and optimizations.

Putting it back in the box

What about sequences of transformations? For example, suppose I implement a filter function, which takes a predicate function and a vector and returns a new vector containing only the elements for which the predicate function returns true, like this:

(def (filter pred vec)
  (let ([result &[]])
    ;; Collect satisfying items in the mutable result vector...
    (for [item in vec]
      (when (pred item)
        (push! item result)))
    ;; And then return the contents as an immutable vector.
    (unbox! result)))

What if I want to, say, filter a vector and then reverse it? If filter is written like this, I get an immutable vector back, so I would have to copy it into a mutable vector again just so I can reverse it. If only filter had not called unbox! at the end, I could have reversed it in-place without a new allocation! But if I don't unbox!, I will have to always manually unbox the result when I want to, and most of the time I do want an immutable result.
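For comparison, here is roughly the same filter-then-reverse pipeline in Python, with a list as the mutable vector and a tuple as the immutable one. The name filter_vec is invented, and unlike the proposed unbox!, tuple(result) does copy.

```python
def filter_vec(pred, vec):
    result = []                  # plays the role of &[]
    for item in vec:
        if pred(item):
            result.append(item)  # (push! item result)
    return tuple(result)         # stands in for (unbox! result); Python copies here


evens = filter_vec(lambda n: n % 2 == 0, (1, 2, 3, 4, 5, 6))

scratch = list(evens)            # immutable -> mutable: one more allocation
scratch.reverse()                # in-place reversal
reversed_evens = tuple(scratch)  # and back to immutable again
```

The round trip through scratch is the extra allocation that the filter-then-reverse case runs into.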

There is a possible trick to help us here: if we unbox a value just to immediately box it back again, we can actually keep using the same underlying storage with no copying. The problems with this optimization are: (1) We must be able to know that no other references to the object have been created between the unboxing and the re-boxing, and basically the only way to do this at run time is with some sort of reference counting. Reference counting has its share of problems (cyclic data structures never reach count zero, we need to synchronize count updates across threads), so relying on an optimization which requires us to use reference counting is not good. (2) We need to make sure the reference count does not inadvertently rise above 1, which would preclude the optimization. Since there may be more going on behind the scenes in the compiler/interpreter than meets the eye, this would be an unreliable optimization, one that sometimes does not trigger for non-obvious reasons.

An alternative to using reference counting would be to do this analysis at compile-time, either through some form of escape analysis (which is hard to do across functions), or some crazy type system with uniqueness types (along the lines of Rust's ownership and borrow checking), which does not mesh well with my goal of a dynamically typed language.

Nested data structures and sharing

What if I put nested data inside a box, like &[[1 2] [3 4]]? Should the sub-vectors become underlyingly mutable too? If so, when should we stop recursing through the structure, which could contain other composite data types in it? Should we do it lazily, using copy-on-write as we mutate the inner vectors? The implementation of this can get tricky. If I access (@ b 0), should I get an immutable [1 2], or should I get a mutable &[1 2] which shares the same memory as the original element, so that mutations on the &[1 2] vector are reflected on the &[[1 2] [3 4]] one?

I'm too tired to even analyse the possibilities right now.

Refactoring code to change data mutability

Suppose I have a data structure with a bunch of immutable fields, say, (Person name age) and I decide I want to make the age field mutable. In our conceptual framework, this means wrapping the value of the field in a mutable box, e.g., (Person "Hildur" &22), i.e., the field itself remains immutable, but its value is a mutable cell. That looks nice, and makes all mutability readily visible, but it also means that we have to change all code using the age field to extract the value from the box, even code that does not mutate the value.
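The refactoring cost can be seen in a small Python sketch, using a frozen dataclass for the immutable record and a hypothetical Box class for the mutable cell:

```python
from dataclasses import dataclass


class Box:
    """Mutable cell (hypothetical)."""

    def __init__(self, value):
        self.value = value


@dataclass(frozen=True)
class Person:
    name: str
    age: Box  # was: int


hildur = Person("Hildur", Box(22))  # (Person "Hildur" &22)
hildur.age.value += 1               # mutation goes through the box
age_now = hildur.age.value          # read-only code must unwrap the box too
```

The field itself still cannot be reassigned; only the contents of the box change. But every former hildur.age read has turned into hildur.age.value.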

Maybe this is a good thing: if the code was written under the assumption that the value does not change, maybe it is good that we have to revise everything when we turn it mutable. On the other hand, this makes it harder to try things out in code and run it to see what happens, and for a long time I have defended the ability to run incomplete (and even wrong) programs while prototyping. However, I also want to be able to run optional static checks, and it's easier to do so when code is explicit about its intentions. So there we go.

An alternative: mutability as a flag

An alternative to the mutability-as-boxes approach is just to make all composite data structures carry a 'mutable' flag. We can still use the notation &[1 2 3] to mean a vector with the mutable flag on. We can provide an operation like Clojure's persistent! which turns the mutability flag of an object off. An operation to turn it back on would be more dangerous, but might be useful for debugging purposes (though it's the kind of thing you can be certain will be abused if added to the language (though Lisps have traditionally always preferred to empower the user rather than impose decisions on them)).

In this scenario, the semantics of (set! (@ v 0) 42) is to actually modify the vector in-place, so we don't need the double illusion trick. If we want to return an immutable version of a mutable data structure without losing the mutability, we have to explicitly copy it. Perhaps more descriptive of intention, we may have a non-destructive persistent operation which returns an immutable version of a value, which may or may not actually involve a copy (it may actually use copy-on-write behind the scenes). Thread synchronization has to be done explicitly, otherwise you assume the risks of getting a partially-modified copy. This is somewhat unsatisfying, but inconsistencies across threads could happen with boxes anyway whenever you had to work with more than one box, so a better solution to synchronization is needed anyway.
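A minimal Python sketch of the flag approach. The Vec class and the method names freeze (for the destructive persistent!) and frozen_copy (for the non-destructive persistent) are assumptions made for this example, and this version always copies rather than using copy-on-write.

```python
class Vec:
    """Vector carrying a per-object 'mutable' flag (hypothetical sketch)."""

    def __init__(self, items, mutable=False):
        self._items = list(items)
        self._mutable = mutable

    def __getitem__(self, idx):
        return self._items[idx]

    def __setitem__(self, idx, value):
        if not self._mutable:
            raise TypeError("vector is immutable")
        self._items[idx] = value  # a real in-place write

    def freeze(self):
        """Destructive: turn the flag off on this very object."""
        self._mutable = False
        return self

    def frozen_copy(self):
        """Non-destructive: return an immutable version (always a copy here)."""
        return Vec(self._items, mutable=False)


v = Vec([1, 2, 3], mutable=True)  # &[1 2 3]
v[0] = 42                         # really mutates in place
snapshot = v.frozen_copy()
v[1] = 7                          # v stays mutable; snapshot is unaffected
v.freeze()                        # from now on, writes to v raise TypeError
```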

(In)conclusion

This idea of using mutable boxes + immutable data structures + optimization tricks had been haunting me for a week and seemed really appealing at first, but thinking more deeply about it, it does have its share of problems. Maybe it's a cool idea anyway, maybe not. I have to think more about it. Said research will require more and abundant funding.

Comentários / Comments (1)

Nemly e Nemlerey, 2018-08-15 23:28:53 -0300 #

https://img.elo7.com.br/product/main/113FE59/painel-wolverine-1-5mx1-0m-mod-decoracao-de-festa.jpg
