The Lispless Lisp :: Elmord's Magic Valley

The Lispless Lisp

2019-03-08 01:16 -0300. Tags: comp, prog, pldesign, lisp, hel, fenius, in-english

For a while I have been trying to design a nice Lisp-based syntax for Hel, trying to fit things like type information and default arguments in function definitions, devising a good syntax for object properties, etc., but never being satisfied with what I come up with. So a few days ago I decided to try something else entirely: to devise a non-Lisp syntax while maintaining a similar level of flexibility to define new language constructs. And I think I have come up with something quite palatable, though there are a few open problems to solve.

The idea is not entirely new. I know of at least Elixir, which is homoiconic but has more variety/flexibility in its syntax, though it has a bunch of hardcoded reserved words; and Dylan, which seems to have a pretty complex macro system, though maybe a little more complex than I'd wish. My non-Lisp Hel syntax has a bunch of hardcoded symbols (( ) [ ] { } , ; and newline), and the syntax for numbers and identifiers is hardcoded too (but that is hardcoded even in Lisp, though Common Lisp has reader macros to overcome this problem to an extent), but there are no reserved keywords, and I find it easier to read and analyze than Dylan. I hope you like it too (though feel free to comment in any case).

A taste of syntax

Here is a sample of the new syntax:

if x > 0 {
    print("foo")
    return x*2
} else {
    print("bar")
    return x*3
}

That looks like a pretty regular language, but the magic here is that none of the "keywords" is hardcoded. This is interpreted as a command if with the four arguments x > 0, the first block, else, and the second block. It is up to the operator/macro bound to the variable if to decide what to do with these arguments.

There are some caveats here, the most notable one being that the else must appear in the same line as the closing brace of the first block, otherwise it would be interpreted as an independent command rather than a part of the if command. I think this is an acceptable price to pay for the flexibility of not having a limited, hardcoded set of commands.

How does it work

The central component of the syntax is the command, or perhaps more precisely the phrase, since "command" gives the impression of a separation between statements and expressions which does not really exist in the syntax. A phrase is a sequence of space-separated constituents. A constituent is one of:

a constant, such as 42 or "hello";
an identifier, such as x;
an infix expression (consituent operator constituent), such as 2*3;
a prefix expression (operator constituent), such as -x;
a function call, such as f(x, y);
an indexing operation, such as vec[i];
a tuple, such as (1, 2, 3);
an array, such as [1, 2, 3];
a parenthesized expression, such as (42);
a block, such as { print("foo") }.

The arguments of a function call, indexing operation, parenthesized expression, and the elements of tuples and arrays are themselves phrases (i.e., you can have an if inside a function call).

Parsing of constituents is greedy. When looking at a series of tokens such as if x > 0 { ... } and trying to determine where a consituent ends and the next one starts, the parser will consider each consituent to be the longest sequence of tokens from left to right that can be validly interpreted as a constituent. In this example:

The first constituent is if, since if x is not a valid constituent.
The second constituent is x > 0, since x > 0 { (or any longer sequence) cannot be a valid constituent.
The third constituent is the whole block.

Phrases are separated by newlines or semicolons. To avoid the effect of a newline, a \ can be used at the end of the line (as in Python). Within parentheses or brackets, newlines are ignored (also as in Python).

That's pretty much all there is to it, in general lines. Except...

Operator precedence

An operator is any sequence of one or more of the characters ! @ $ % ^ & * - + = : . < > / ? | ~. So + and * are operators, but so are ++ and |> and @.@.

One open problem with this syntax is how to handle operator precedence in a general way. In my current prototype, I have hardcoded the precedence of the arithmetic operators, but I need to have sane precedence rules for user-defined operators.

One way is to allow the user to specify the precedence for custom operators (like the infixl and infixr declarations in Haskell). The problem is that this means a program cannot be parsed without interpreting the fixity declarations, which I find annoying, especially given that operators can be imported from other modules, and being unable to parse (read in Lisp parlance) a program without compiling the dependencies is deeply annoying from a Lisp perspective. Another problem is that it is not only hard to parse for the parser, it's hard to parse for the human too.

Another way is to have a fixed rule to assign each operator a precedence. I remember seeing a language once which gave each operator a precedence based on its first character (so, for example, *.* would have the same precedence as *). [Update: I don't remember which language it was, but it turns out that Scala does the same.] I like this solution a lot because it's easy for the human to know the precedence of an operator they've never seen before. The problem is reconciling this with the natural precedences expected from some operators (for instance, = and == usually have different precedences). I still have to think about this. Suggestions are welcome.

[Update: Maybe it makes more sense for the last character to determine precedence, since we want things like += to have the same precedence as =. On the other hand, an operator like => makes more sense as having the same precedence as = than >. Don't = and > have the same precedence anyway, though, since == and > do?]

Gotchas

Constituents vs. phrases

The consituent parsing rules may cause some surprises. Consider the following example:

let answer = if x > 0 { 23 } else { 42 }

In principle, this looks like assigning the result of the if to the variable answer. However, the parsing rules as stated above will lead to this being parsed as the sequence of constituents:

let
answer = if
x > 0
{ 23 }
else
{ 42 }

which is not quite what was intended. To use the if expression (which is a whole phrase, not a constituent) as the right-hand side of =, one would have to surround it with parentheses.

Commas delimit phrases

The arguments of a function call are phrases, not constituents, so an if expression can appear as the argument of a function call without having to surround it with an extra pair of parentheses on the top of those already required by the function call. But function arguments are delimited by commas, so, to avoid ambiguity, commas are not allowed to appear outside parentheses. For example, you cannot have a command like:

for x, y in items { ... }

because in a function call like:

f(for x, y in items { ... })

it would be ambiguous whether this is a single argument or two. The solution is to require:

for (x, y) in items { ... }

instead.

This also means that import foo, bar must be import (foo, bar) instead, though this limitation might be lifted outside parentheses.

Function calls vs. constituent+tuple

A space cannot appear between a function and its argument list. The reason is that we don't want for (x, y) in list to be interpreted as containing a function call for(x, y), nor do we want for x in (1, 2, 3) to be interpreted as containing a call in(1, 2, 3). I don't typically write spaces between a function and its arguments anyway, but I feel ambivalent about this space sensitivity. Perhaps the most important thing here (and in the above gotchas as well) is to have good error messages and (optional, on by default) warnings when things go wrong. For example, upon seeing something that might be a function call with a spurious space inside:

foo.hel:13: error: `foo` is not a command
foo.hel:13: hint: don't put spaces between a function and its argument list
13   foo (x, y)

This might be harder for the macros (e.g., for identifying that the call in(1, 2, 3) it received as argument was meant to be two separate constituents), but I think it can be done, especially for rule-based macros (as opposed to procedural ones).

A different solution is to get rid of tuples entirely and use for [x, y] instead, except this does not really solve the problem because this is ambiguous with the indexing operation.

That's all for now

I already have a prototype parser, but it's pretty rough, and I still have to work on the interpreter, so I have not published it yet. If you have comments, suggestions, constructive criticism, or two cents to give, feel free to comment.

Elmord's Magic Valley

Computers, languages, and computer languages. Às vezes em Português, sometimes in English.