Elmord's Magic Valley

Software, lingüística e rock'n'roll. Às vezes em Português, sometimes in English.

Random project ideas

2016-05-13 00:00 -0300. Tags: comp, prog, pldesign, ramble, in-english

A few days ago someone asked me what kind of projects I would like to work on. I have some ideas that have been wandering my mind for a long time, so I decided to write some of them down. (I'm sure someone will come around, steal some of them, and make millions of dollars, but there we go.) I have been meaning to write a post like this one for quite a while, but never got around it. For the next days I will be working on finishing my Master's monograph, so I decided write them down now. The text is kind of a mess, and especially towards the end it is more of a "thinking out loud" sort than a clear exposition of ideas, but it's probably a good idea (for me, anyway) to register these ideas at once before I forget them. As Zhuangzi would say, "Let me say a few careless words to you and you listen carelessly, all right?"

A programming language

Although there are plenty of nice programming languages around (and plenty of un-nice ones as well), I don't know any language I'm fully satisfied with, and I certainly would like to try creating one. I don't think there can be such a thing as one perfect language to rule them all, but I'd like to create one that is better suited to the way I like to do things and to the kinds of things I like to do. Some features of that hypothetical language would be:

Optional types

After writing some larger programs in Scheme, I realized that some static type analysis would go a long way in catching many programming mistakes. Moreover, while working (for a short time) with the Clang/LLVM codebase, I realized how types can be useful in finding other places in a program which need to be changed after a change in one part. However, I strongly believe in the ability to run incomplete (and incorrect) programs, both for the sake of debugging (it is often convenient to be able to run an incorrect program with some concrete data to see what is the state at the point of the error, rather than relying solely on static type errors), and for the sake of exploratory programming (when you want to try out stuff and figure out what program do you want to write after all).

(Some static-types people, when confronted with this notion, give replies like: "but static types help me with exploratory programming!". Well, dynamic typing helps me with exploratory programming. As I said before, I don't believe in a single programming language to rule them all, and neither do I believe that either static typing or dynamic typing is inherently superior in all circumstances and for all people. Indeed, I would actually like not to have to choose one or another, and that's kind of the point of this section.)

So, I would like to alternate between dynamic and static types according to convenience. First of all, I don't think static type mismatches should preclude compilation/execution; rather, these should emit warnings, and the compiler should emit code that runs as far as possible before hitting the error.

In dynamically-typed languages, it is relatively easy to call functions with arguments of the wrong type and keep running, because all values carry enough metadata (tags of some kind) to enable checking the type at any given moment. Statically-typed languages, on the other hand, usually employ "unboxed" representations for data (e.g., a 32-bit integer in memory is just 32 bits, with no extra tag indicating that it is an integer; a pointer is just another 32 (or 64) bits in memory, with no way of distinguishing it from an integer of the same size), therefore calling a function with a value of the wrong type (e.g., passing an integer to a function expecting a pointer) must be blocked; otherwise, there would be a violation of safety (e.g., the function would try to use the integer as a pointer, probably with disastrous consequences). Conventional statically-typed languages block execution of such unsafe function calls by refusing to compile the program, but that's not necessary: one might instead emit code that evaluates the program until the point where an unsafe function call would be performed, and abort execution just then. For instance, if f is a function taking integers, and g(x) returns a string, and the program contains the expression f(g(x)), you can still emit code that evaluates g(x) (and f), and then interrupts execution instead of calling f with the string. (I've written about this before, but anyway.)

So, you could have a language where (1) type declarations are optional; (2) if function parameter/return types are not declared, dynamic types and boxed representations are used by default; (3) if types are declared, static type checking is performed on function calls, and where the compiler cannot prove that the types match (e.g., because they don't match, or because a function with statically-typed parameters is called with a dynamically-typed value), it emits code to interrupt execution (and provide a decent diagnostic message, and/or trigger the debugger) just before the function is called. I wonder if interrupting execution at the function call wouldn't be too early (as ideally I'd like to be able run an incomplete/incorrect program as far as is reasonable). Perhaps there could be a compiler option/pragma to emit code for the entire module as if everything were dynamically typed, even when type declarations are provided (but still emit warnings upon detected type mismatches).

This mix of static and dynamic types brings some implementation problems. Because statically-typed functions can't just be called with any random argument due to representation mismatches, if a function expects another function as an argument (let's say, a function f expects a function taking an integer and returning an integer), you cannot pass it a dynamically-typed function g that happens to return integers when given integers, because the representation of the values it returns would be different (g returns boxed integers, but f expects a function that returns raw integers). Likewise, if a function f expects a dynamically-typed function, it cannot be given a function g that returns raw integers. One solution would be to generate a dynamically-typed wrapper around statically-typed functions when they are used in a context that expects a dynamic function, and vice-versa.

To research: It'd probably be good to see how Haskell deals with the situation where an Int -> Int function is used where an a -> a one is expected. Although now that I wrote this I realize that that's not the same problem, because in Haskell the concrete a is always statically known at the point where the integer value is used. Probably more promising would be to look at how Java and C# deal with this. (In Java, the difference between a raw integer and an Integer object is directly visible to the programmer, whereas I think that's not the case in C#.)

There has also been lots of work on the subject of gradual types, both recent and old, and I have to check it out.

Language interoperability

I'm probably beginning to sound repetitive, but as I said before, I don't believe in a single language for all circumstances, and not only I want to be able to use other existing programming languages, I'm probably going to create more than one in my lifetime. Still, I don't like the idea of having to reimplement code already available in a library just because the library is not available in the language I want to use. I'd like to be able to easily integrate pieces of code written in different programming languages, without having to manually write wrappers/bindings/etc. Ideally, there would be a "universal" way of exchanging data / invoking procedures across programming languages, and the "bridging" work would only have to be done once for each programming language (to make it support the "universal" way), rather than for each library and each language you want to use the library in.

This already kinda exists: the JVM and CLR do this – for languages that are implemented on the top of the JVM or CLR. The problem is that you are constrained to implement your language on the top of these virtual machines, which might not be a good match for the programming language in question (e.g., the VM may not support tail call optimization, or multiple inheritance, or you want to use the stack in a really strange way, or you want your strings to be UTF-8 instead of UTF-16, or you want direct access to operating system features, or whatever), or you may simply not want to use a bulky VM, or one whose evolution is in the hands of a single company and cramped by API copyright claims and/or patents.

An idea that occurred to me in the past is that of an extensible virtual machine: a machine where you could load modules that implemented new opcodes/features that would make it better suited to various programming languages. It is possible to do this without opcode conflicts between the different extensions by giving opcodes (fully-qualified) names instead of fixed numbers; the mapping from concrete opcode numbers to opcode names would be given in a header in the bytecode file, and would not be fixed. (I'm pretty sure I got this idea from WebAssembly, but I haven't been able to find a source right now.) So, for instance, you could have a Python extension for the VM, with Python-specific features and opcodes, and a Scheme extension, and so on. Of course, making all extensions work together is not just a matter of avoiding opcode conflicts, but having the freedom to extend the VM with features more convienient to each language to be supported sounds more promising than being forced to work with a fixed instruction set which did not have the language of interest in mind.

VMs are not the only possibility (although VMs have a number of other possible advantages, which I may write about in another post). Another way would be to let each language implementation be completely independent, but make it possible somehow for them to share data and call code from each other. This has the advantage of (potentially) making it easier to accomodate existing language implementations (rather than having to reimplement them as VM extensions), but of course brings with it a number of challenges to make interoperability work, such as:

Another possibility, which got into my mind after reading this, would be a model in which no memory is shared between languages: everything is passed by copy, and things don't even have to be in the same process. (This brings up another problem that has been in my mind since forever: the fact that Unix streams, files, etc., are limited to bytes rather than structured data. Here we would be defininig a mechanism for exchanging higher-level data across processes.) This solves the data exchange problem, but not the code invocation problem. This could be done by message passing, but I wonder how that would work. For instance, suppose language A wants to use a regex library from language B.

To research: There is an infinite number of things to research here. One that I realized recently is GObject, GNOME/Gtk's way of doing objects in C which was created with the goal of easily supporting bindings to other languages in mind. Maybe GObject solves all the problems and there's nothing to do (other than making everything support GObject for exchange of arbitrary objects), or maybe GObject can be used as an inspiration. I should also look up other inter-process communication mechanisms (especially D-Bus, and Erlang/OTP). Other people have been working on this problem from other angles.

EOF

Sorry if this post looks like a mess of random ideas thrown around, because that's exactly what it is. As always, writing about stuff helps me organize my ideas about them and realize problems I hadn't thought about before (though how organized things are here is up to question). Feel free to comment.

Comentários / Comments (0)

Deixe um comentário / Leave a comment

Main menu

Posts recentes

Comentários recentes

Tags

em-portugues (213) comp (133) prog (65) life (46) in-english (44) unix (33) pldesign (32) lang (31) random (28) about (26) mind (24) lisp (23) mundane (22) web (17) fenius (17) ramble (16) img (13) rant (12) hel (12) scheme (10) privacy (10) freedom (8) academia (7) esperanto (7) copyright (7) music (7) bash (7) lash (7) home (6) mestrado (6) shell (6) conlang (5) misc (5) emacs (4) politics (4) book (4) php (4) worldly (4) editor (4) latex (4) etymology (4) android (4) film (3) kbd (3) security (3) network (3) wrong (3) c (3) tour-de-scheme (3) poem (2) philosophy (2) comic (2) llvm (2) cook (2) treta (2) lows (2) physics (2) perl (1) german (1) wm (1) en-esperanto (1) audio (1) translation (1) old-chinese (1) kindle (1) pointless (1)

Elsewhere

Quod vide


Copyright © 2010-2019 Vítor De Araújo
O conteúdo deste blog, a menos que de outra forma especificado, pode ser utilizado segundo os termos da licença Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.

Powered by Blognir.