Random project ideas :: Elmord's Magic Valley

Random project ideas

2016-05-13 00:00 -0300. Tags: comp, prog, pldesign, ramble, in-english

A few days ago someone asked me what kind of projects I would like to work on. I have some ideas that have been wandering my mind for a long time, so I decided to write some of them down. (I'm sure someone will come around, steal some of them, and make millions of dollars, but there we go.) I have been meaning to write a post like this one for quite a while, but never got around it. For the next days I will be working on finishing my Master's monograph, so I decided write them down now. The text is kind of a mess, and especially towards the end it is more of a "thinking out loud" sort than a clear exposition of ideas, but it's probably a good idea (for me, anyway) to register these ideas at once before I forget them. As Zhuangzi would say, "Let me say a few careless words to you and you listen carelessly, all right?"

A programming language

Although there are plenty of nice programming languages around (and plenty of un-nice ones as well), I don't know any language I'm fully satisfied with, and I certainly would like to try creating one. I don't think there can be such a thing as one perfect language to rule them all, but I'd like to create one that is better suited to the way I like to do things and to the kinds of things I like to do. Some features of that hypothetical language would be:

Optional types

After writing some larger programs in Scheme, I realized that some static type analysis would go a long way in catching many programming mistakes. Moreover, while working (for a short time) with the Clang/LLVM codebase, I realized how types can be useful in finding other places in a program which need to be changed after a change in one part. However, I strongly believe in the ability to run incomplete (and incorrect) programs, both for the sake of debugging (it is often convenient to be able to run an incorrect program with some concrete data to see what is the state at the point of the error, rather than relying solely on static type errors), and for the sake of exploratory programming (when you want to try out stuff and figure out what program do you want to write after all).

(Some static-types people, when confronted with this notion, give replies like: "but static types help me with exploratory programming!". Well, dynamic typing helps me with exploratory programming. As I said before, I don't believe in a single programming language to rule them all, and neither do I believe that either static typing or dynamic typing is inherently superior in all circumstances and for all people. Indeed, I would actually like not to have to choose one or another, and that's kind of the point of this section.)

So, I would like to alternate between dynamic and static types according to convenience. First of all, I don't think static type mismatches should preclude compilation/execution; rather, these should emit warnings, and the compiler should emit code that runs as far as possible before hitting the error.

In dynamically-typed languages, it is relatively easy to call functions with arguments of the wrong type and keep running, because all values carry enough metadata (tags of some kind) to enable checking the type at any given moment. Statically-typed languages, on the other hand, usually employ "unboxed" representations for data (e.g., a 32-bit integer in memory is just 32 bits, with no extra tag indicating that it is an integer; a pointer is just another 32 (or 64) bits in memory, with no way of distinguishing it from an integer of the same size), therefore calling a function with a value of the wrong type (e.g., passing an integer to a function expecting a pointer) must be blocked; otherwise, there would be a violation of safety (e.g., the function would try to use the integer as a pointer, probably with disastrous consequences). Conventional statically-typed languages block execution of such unsafe function calls by refusing to compile the program, but that's not necessary: one might instead emit code that evaluates the program until the point where an unsafe function call would be performed, and abort execution just then. For instance, if f is a function taking integers, and g(x) returns a string, and the program contains the expression f(g(x)), you can still emit code that evaluates g(x) (and f), and then interrupts execution instead of calling f with the string. (I've written about this before, but anyway.)

So, you could have a language where (1) type declarations are optional; (2) if function parameter/return types are not declared, dynamic types and boxed representations are used by default; (3) if types are declared, static type checking is performed on function calls, and where the compiler cannot prove that the types match (e.g., because they don't match, or because a function with statically-typed parameters is called with a dynamically-typed value), it emits code to interrupt execution (and provide a decent diagnostic message, and/or trigger the debugger) just before the function is called. I wonder if interrupting execution at the function call wouldn't be too early (as ideally I'd like to be able run an incomplete/incorrect program as far as is reasonable). Perhaps there could be a compiler option/pragma to emit code for the entire module as if everything were dynamically typed, even when type declarations are provided (but still emit warnings upon detected type mismatches).

This mix of static and dynamic types brings some implementation problems. Because statically-typed functions can't just be called with any random argument due to representation mismatches, if a function expects another function as an argument (let's say, a function f expects a function taking an integer and returning an integer), you cannot pass it a dynamically-typed function g that happens to return integers when given integers, because the representation of the values it returns would be different (g returns boxed integers, but f expects a function that returns raw integers). Likewise, if a function f expects a dynamically-typed function, it cannot be given a function g that returns raw integers. One solution would be to generate a dynamically-typed wrapper around statically-typed functions when they are used in a context that expects a dynamic function, and vice-versa.

To research: It'd probably be good to see how Haskell deals with the situation where an Int -> Int function is used where an a -> a one is expected. Although now that I wrote this I realize that that's not the same problem, because in Haskell the concrete a is always statically known at the point where the integer value is used. Probably more promising would be to look at how Java and C# deal with this. (In Java, the difference between a raw integer and an Integer object is directly visible to the programmer, whereas I think that's not the case in C#.)

There has also been lots of work on the subject of gradual types, both recent and old, and I have to check it out.

Language interoperability

I'm probably beginning to sound repetitive, but as I said before, I don't believe in a single language for all circumstances, and not only I want to be able to use other existing programming languages, I'm probably going to create more than one in my lifetime. Still, I don't like the idea of having to reimplement code already available in a library just because the library is not available in the language I want to use. I'd like to be able to easily integrate pieces of code written in different programming languages, without having to manually write wrappers/bindings/etc. Ideally, there would be a "universal" way of exchanging data / invoking procedures across programming languages, and the "bridging" work would only have to be done once for each programming language (to make it support the "universal" way), rather than for each library and each language you want to use the library in.

This already kinda exists: the JVM and CLR do this – for languages that are implemented on the top of the JVM or CLR. The problem is that you are constrained to implement your language on the top of these virtual machines, which might not be a good match for the programming language in question (e.g., the VM may not support tail call optimization, or multiple inheritance, or you want to use the stack in a really strange way, or you want your strings to be UTF-8 instead of UTF-16, or you want direct access to operating system features, or whatever), or you may simply not want to use a bulky VM, or one whose evolution is in the hands of a single company and cramped by API copyright claims and/or patents.

An idea that occurred to me in the past is that of an extensible virtual machine: a machine where you could load modules that implemented new opcodes/features that would make it better suited to various programming languages. It is possible to do this without opcode conflicts between the different extensions by giving opcodes (fully-qualified) names instead of fixed numbers; the mapping from concrete opcode numbers to opcode names would be given in a header in the bytecode file, and would not be fixed. (I'm pretty sure I got this idea from WebAssembly, but I haven't been able to find a source right now.) So, for instance, you could have a Python extension for the VM, with Python-specific features and opcodes, and a Scheme extension, and so on. Of course, making all extensions work together is not just a matter of avoiding opcode conflicts, but having the freedom to extend the VM with features more convienient to each language to be supported sounds more promising than being forced to work with a fixed instruction set which did not have the language of interest in mind.

VMs are not the only possibility (although VMs have a number of other possible advantages, which I may write about in another post). Another way would be to let each language implementation be completely independent, but make it possible somehow for them to share data and call code from each other. This has the advantage of (potentially) making it easier to accomodate existing language implementations (rather than having to reimplement them as VM extensions), but of course brings with it a number of challenges to make interoperability work, such as:

Handling mismatches between data layout in memory. One idea that immediately comes to mind is message-passing/OO: it does not matter if your strings are UTF-8 or UTF-16 or something else if you always access their contents with a chatAt(i) method which knows how to find the ith character in it. You still have to agree on the format of characters, though. And it probably would be slow as hell to access everything through methods / message-passing. (But you could have methods like asUTF8() or asUTF16() which would return the string (or a copy thereof) in the specified format, which you could directly manipulate.) You also have to agree on a convention on how to invoke a method (but we would have to do that anyway, in principle).
Garbage-collecting data that crosses language boundaries. Basically you have multiple garbage collectors working at the same time, trying not to step on each others' toes. If language A calls a method from language B which returns an object x, now A holds a reference to x, but B's garbage collector is not prepared to go looking inside A's objects in memory for references to x. So it might end up collecting x while A still has references to it. One awkward solution would be to make everyone use the same garbage collector (for instance, a conservative one like Boehm–Demers–Weiser), but this precludes interesting implementations like Cheney-on-the-MTA linked before (which is used by Chicken Scheme, for instance). Another possibility would be to wrap every object that crosses language boundaries in a reference-counted container (which might also take care of other interoperability issues, such as containing enough information to allow inspecting the encapsulated object's methods). In the previous example, B would return x to A encapsulated within a wrapper X with a reference count of 2, meaning two runtimes (A and B) have access to the object. When B decides nothing in its memory holds a reference to x, the element would somehow (handwave) rememeber that x has been given to another runtime inside a container X, and decrement the container's refcount, instead of freeing x. When the refcount reaches zero, x can be freed. That could become unwieldy for large numbers of objects (e.g., if B gives A a linked list, every time A asks for the next element in the list, the element would get wrapped).

Another possibility, which got into my mind after reading this, would be a model in which no memory is shared between languages: everything is passed by copy, and things don't even have to be in the same process. (This brings up another problem that has been in my mind since forever: the fact that Unix streams, files, etc., are limited to bytes rather than structured data. Here we would be defininig a mechanism for exchanging higher-level data across processes.) This solves the data exchange problem, but not the code invocation problem. This could be done by message passing, but I wonder how that would work. For instance, suppose language A wants to use a regex library from language B.

A wants to search for all occurrences of a regex r within a 50MB string s. It would have to pass r and s by copy to B, and get the (potentially large) result by copy too, which sounds sub-optimal.
A wants to search for all occurrences of a regex r within a stream (open file, socket, generator, whatever) s. How would it pass s to B? Would it pass a handle h to B, and then listen for messages of the form "[nextItem h]" and answer accordingly? How would it know, in the general case, when to stop listening for messages about h from B? Wouldn't I have now a cross-process garbage collection problem, and now we're back to the wrapping solution?

To research: There is an infinite number of things to research here. One that I realized recently is GObject, GNOME/Gtk's way of doing objects in C which was created with the goal of easily supporting bindings to other languages in mind. Maybe GObject solves all the problems and there's nothing to do (other than making everything support GObject for exchange of arbitrary objects), or maybe GObject can be used as an inspiration. I should also look up other inter-process communication mechanisms (especially D-Bus, and Erlang/OTP). Other people have been working on this problem from other angles.

EOF

Sorry if this post looks like a mess of random ideas thrown around, because that's exactly what it is. As always, writing about stuff helps me organize my ideas about them and realize problems I hadn't thought about before (though how organized things are here is up to question). Feel free to comment.

Elmord's Magic Valley

Computers, languages, and computer languages. Às vezes em Português, sometimes in English.