Elmord's Magic Valley

Posts com a tag: `prog`

Partial goals, and other rambles

2019-05-25 22:33 -0300. Tags: comp, prog, pldesign, fenius, ramble, in-english

Designing a programming language is a huge undertaking. The last few days I have been thinking about types and classes and interfaces/traits and I got the feeling I was getting stuck in an analysis paralysis stage again. There is also lots of other questions I have to thinking about, such as: How will I handle mutability, and concurrency, and how those things interact? Can I add methods to classes at runtime? Can I implement interfaces to arbitrary existing types? How do I get a trait implementation 'in scope' / available? How static and dynamic typing will interact? How does this get implemented under the hood? And so on, and so forth…

At the same time, it may seem that not solving those problems undermine the whole point of designing a new language in the first place. If you don't solve the hard problems, why not keep using an existing language?

And that's the recipe for paralysis.

But the thing is, even if you solve part of the problems, you can already have something valuable. If Fenius 0.3 gets to be just a 'better' Scheme with nicer syntax and fewer namespacing problems*, that would be already a language I would like to use.

So what would a Minimum Viable Fenius need?

Records, with single inheritance. We already have those, actually, it's just that there is no syntax for inheritance exposed yet.
Throwing and catching exceptions of a given (record) type.
Basic string capabilities. There is the whole Unicode thing, and the problem of text vs. binary data, but we can use the host Scheme strings as a start.
Basic pattern matching, and particularly pattern matching over ASTs. This is necessary to make macros convenient to write. No need for the whole syntax-rules and hygiene machinery right now.
Basic file manipulation (opening, reading, writing, closing, testing for existence).
An R5RS-comparable standard library. That means basic arithmetic, iteration (map, foreach, filter), that kind of thing.
Fixing up parser corner cases (e.g., add the ability to refer to operators as identifiers, allow expressions to continue in the next line by either adding \ at the end or ending a line with an infix operator).
More robust error handling (and more meaningful error messages).
Fix the module import path situation. (Just allowing to specify the import path (say, via an environment variable) would already be a usable solution.)

These would go a long way already. These are all pretty doable and don't involve much hard thinking. The greater problems can be tackled later, after the language is usable for basic tasks.

I would also love to have basic sockets to try to write web stuff in Fenius, but these will have to wait, especially given that Chez does not come with sockets natively. Thunderchez has a socket library; maybe I can use that. But this will come after the items above are solved. I can write CGI apps in it, anyway, or make a wrapper in another language to receive the connections and pass the data via stdin/stdout.

Also, I sometimes wonder if it wouldn't be more profitable to implement Fenius on the top of Guile instead of Chez, since it has more libraries, and now also has a basic JIT compiler. But at the same time, the ultimate goal is to rewrite the implementation in itself and leave the Scheme dependency behind, so maybe it's better not to depend too much on the host ecosystem. Although the Chez compiler is so good, maybe it would make more sense to compile to Scheme and run on the top of Chez rather than compile to native code. If only it did not have the annoying startup time… we'll see in the future!

_____

* Of course, 'better' is subjective; what I want is for Fenius 0.3 to be a better Scheme for me.

Addendum: And by 'better syntax' I don't even mean the parentheses; my main annoyance when programming in Scheme is the syntax for accessors: the constant repeating of type names, like (person-name p) rather than p.name, (vector-ref v i) rather than v[i], (location-line (AST-location ast)) rather than ast.location.line, and so on. This goes hand in hand with the namespacing issue: because classes define a namespace for their methods, you don't have to care about name conflicts so often.

Comentários / Comments

Persistent hashmaps

2019-05-23 21:36 -0300. Tags: comp, prog, pldesign, fenius, in-english

Fenius got dictionaries! Now you can say:

fenius> let d = dict("foo" => 1, "bar" => 2)
dict("foo" => 1, "bar" => 2)
fenius> d["foo"]
1
fenius> d.with("foo" => 3, "quux" => 4)
dict("foo" => 3, "bar" => 2, "quux" => 4)
fenius> d.without("foo")
dict("bar" => 2)

Dictionaries are immutable: the with and without methods return new dictionaries, rather than modifying the original. I plan to add some form of mutable dictionary in the future, but that will probably wait until I figure Fenius's mutability story. (Right now, you can mutate the variables themselves, so you can write d := d.with("foo" => 42), for example.)

Dictionaries are persistent hashmaps, much like the ones in Clojure. Fenius's implementation of hashmaps is simpler than Clojure's, because it does less: beside persitent/immutable hashmaps, Clojure also supports transient hashmaps, a mutable data structure onto which entries can be inserted and removed efficiently, which then can be turned into a persistent hashmap. I want to implement something like this in Fenius at some point. On the flip side, the current Fenius implementation is easier to understand than its Clojure counterpart.

The underlying data structure is based on hash array mapped tries (HAMTs). The standard reference for it is Bagwell (2000). The persistent variant used by Clojure, Fenius and other languages is slightly different, but the basic idea is the same. In this post, I intend to explain how it works.

I will begin by explaining an auxiliary data structure used by the hashmap, which I will call a sparse vector here.

Sparse vectors

Suppose you want to create vectors of a fixed size (say, 8 elements), but you know most of the positions of the vector will be empty much of the time. If you create lots of these vectors in the naive way, you will end up wasting a lot of memory with the empty positions.

23 42

A cleverer idea is to split this information into two parts:

a bitmap telling which positions are filled (an integer where the nth bit is set if the nth position is filled); and
a content vector with the values of those positions only.

00001010 23 42

(Note that the first position is represented by the lowest bit.) If we know beforehand that the number of positions is limited (say, 32), we can fit the bitmap into a single integer, which makes for a quite compact representation.

That takes care of representing the vectors. But how do we fetch a value from this data structure given its position index i? First, we need to know if the position i is filled or empty. For that, we bitwise-and the bitmap with a mask containing only the ith bit set, i.e., bitmap & (1 << i) (or, equivalently, bitmap & 2ⁱ). If the result is non-zero, then the ith bit was set in the bitmap, and the element is present. (If it's not, we return a default empty value.)

Once we know the element is there, we need to find its place in the content vector. To do that, we need to know how many elements are there before the one we want, and skip those. For that:

We bitwise-and the bitmap with a mask having all bits for the elements before the one we want set, e.g., if we want (counting from 0) the element 3 (bitmap 00001000), we bitwise-and the element bitmap with 00000111 (having all bits before bit 3 set). That mask happens to be 00001000 - 1, or (1 << i) - 1, or 2ⁱ - 1. So we compute bitmap & ((1 << i) - 1). The result is a bitmap containing 1s only in the positions that are before the one we want and are filled. In our example, that would be:
```
  00001010
& 00000111
  --------
  00000010
```
We now count the number of 1 bits in the result. This is called the integer's Hamming weight, also known as population count. There are clever algorithms to compute it, and modern processors have a dedicated instruction (often called popcount or something similar) to calculate it. Some programming languages also have a function to compute it: R6RS Scheme, for example, has a bitwise-bit-count function for that.

The result of all this is the number of filled positions before the one we want. Because positions are counted from 0, it also happens to be the position of the wanted element in the content vector (e.g., if there is 1 element before the one we want, the one we want is at position 1 (which is the second position of the vector)). In summary:

def sparsevec_get(sparsevec, position):
    if sparsevec.bitmap & (1 << position):
        actual_position = bit_count(sparsevec.bitmap & ((1 << position) - 1))
        return sparsevec.content[actual_position]
    else:
        return SOME_EMPTY_VALUE

To insert an element at position i into the vector, we first figure if position i is already filled. If it is, we find its actual position in the content vector (using the same formula above) and replace its contents with the new value. If it is currently empty, we figure out how many positions are filled before i (again, same formula), and insert the new element just after that (and update the bitmap setting the ith bit). Analogously, to remove the ith element, we check if it is filled, and if it is, we find its position in the content vector, remove it, and clear the ith bit in the bitmap.

In our real implementation, sparse vectors will be persistent, so updates are performed by creating a new vector that is just like the original except for the added or removed elements.

Enough with sparse vectors. Let's get back to HAMTs.

Hash Array-Mapped Tries

A trie ("retrieval tree") is a tree where the path from the root to the desired element is determined by the symbols of the element's key. More specifically:

The element keys are, conceptually, strings, or sequences of symbols, from an alphabet. For example, keys might be sequences of bits (and the alphabet would be {0, 1}), or sequences of letters from a to z, or whatever.
Each node of the trie has one (possibly empty) child for each possible symbol of the alphabet. If the keys are sequences of bits, each node would have two (possibly empty) children, labelled 0 and 1. If the keys are sequences of letters from a to z, each node would have 26 children, and so on.
To find an element in the trie, we start from the root and traverse the trie by consuming sucessive symbols from the key and taking the path to the correspondingly-labelled child in the tree, until we find the element we want or reach an empty node (in which case the key is absent from the trie).

A hash function is a function f such that if two elements x and y are equal, then f(x) = f(y). (The opposite is not true: there are typically infinitely many elements x for which f(x) produces the same result.) The result of f(x) is called a hash of x. A hash is typically a finite sequence of bits, or symbols... from an alphabet. You can see where this is going.

A hash trie is a trie indexed by hashes. To find an element with key x in the trie, you first compute the hash f(x), and then consume successive bits (or symbols) from the hash to decide the path from the root of the trie to the desired element.

In principle, this would mean that the depth of the trie would be the same as the length of the hash. For example, if our hashes have 32 bits, we would have to traverse 32 children until we reach the element, which would be pretty wasteful in both storage and time. However, we don't need to use the entire hash: we can use just enough bits to distinguish the element from all other elements in the trie. For example, if the trie has elements x with hash 0010, y with hash 0100, and z with hash 1000, we build a trie like:

That is: z is the only element whose hash starts with 1, so we can store it right below the root as the child labelled 1. Both x and y have hashes beginning with 0, so we create a subtree under the root's 0 for them; the second bit of the hash is 0 for x and 1 for y, so we store them below the subtree under the appropriate label. We don't need to go any deeper to disambiguate them.

To insert an element in the tree, we compute its hash and start traversing the tree until either:

We find an empty node: in this case we put the element there; or
We find an existing leaf (element) node: this means that the existing and the new element share a prefix of the hash. In this case we have to compute the hash of the existing element and rebuild the tree from there downwards, creating new subtrees for each subsequent hash bit until we get to a bit where the hashes differ (one of the elements has 0 and the other has 1 in that position): then we can put the old and new elements as separate children of the corresponding subtree.

If our hashes have a finite number of bits, it may happen that two distinct elements end up having the same hash. There are two ways of handling this problem:

Bagwell's solution is to use hashes with an infinite number of bits. Roughly, the idea is that instead of having a function f(x) to compute the hash of x, you would have a function f(x, i) to compute the ith bit of the hash of x (or, more efficiently, the ith group of n bits of the hash, for some fixed n), and you can ask for an arbitrarily large i. If your function f is constructed in such a way that every pair of distict elements x and y will generate a distinct sequence of bits (i.e., there will eventually be an i for which f(x, i) and f(y, i) will be different), you have effectively eliminated hash collisions from your life. The problem is that the typical hash functions provided by your favorite programming language don't work like that.
The other option, used by Clojure et al., is to use a collision list whenever you have to store two elements with the same hash in the tree. (The current Fenius implementation goes at bit overboard with this: the leaves of the trees are always lists of elements, even if there is a single element in the leaf. This wastes a bit of memory but makes things a bit simpler. I may change this in the future.)

The only problem with this scheme is that our trees can still get pretty deep as we add more elements to them. For example, if we add 1024 elements (2¹⁰) to the tree, we need at least 10 bits of hash to distinguish them, which means we need to go at least 10 levels deep in the tree to find them; the deeper the tree is, the slower it is to find an element in it. We can reduce the depth if, instead of branching on single bits of the hash, we use groups of bits of a fixed size, say, 5 bits. Then instead of each node having 2 children, each node has 2⁵ = 32 children, labelled {00000, 00001, ..., 11110, 11111}, and we consume 5 bits of the hash for each level we traverse. Now a tree of 2¹⁰ elements will typically be 2 levels deep rather than 10, which makes it much faster to traverse.

The only problem with this scheme is that each node now needs to have space for 32 children, even when the children are empty. For example, if I store two elements, x with hash 00000 00001, and y with hash 00000 00100, the root of the tree will be a node with 32 children, of which only the 00000 child will be non-empty. This child will contain a subtree containing x at position 00001, y at position 00100, and all other 30 positions empty. If only we had a way to only store those positions that are actually filled. A sparse vector, if you will...

Congratulations, we have just invented hash array-mapped tries, or HAMTs. A HAMT is a hash trie in which each non-leaf node is a sparse vector, indexed by a fixed-length chunk of bits from the hash. To find an element in the HAMT, we traverse the sparse vectors, consuming successive chunks from the hash, until we either find the element we want, or we consume the entire hash and reach a collision list (in which case we look for the element inside the list), or we reach an empty child (in which case the element is not found). Because the sparse vector only allocates space for the elements actually present, each node is compact, and because each level is indexed by a large-ish chunk of bits, the tree is shallow. Win win.

The sparse vectors are immutable, and so is our tree. To add an element to the tree, we have to change the nodes from the root to the final place of the element in the tree, which means making copies of them with the desired parts changed. But the nodes that are not changed (i.e., those that are not part of the path from the root to the new element) are not copied: the new tree will just point to the same old nodes (which we can do because we know they won't change). So adding an element to the tree does not require making a full copy of it.

Removing an element from a HAMT requires some care. Basically, we have to replace the node where the element is with an empty child. But if the subtree where the element was had only two elements, and after removal is left with only one, that one element takes the place of the whole subtree (you never have a subtree with a single leaf child in a HAMT, because the purpose of a subtree is to disambiguate multiple elements with the same hash prefix; if there is no other element sharing the same prefix with the element, there is no point in having a subtree: the element could have been stored directly in the level above instead).

Miscellaneous notes

When profiling my implementation in a benchmark inserting a million elements in a HAMT, I discovered that most of the time was spent on an auxiliary function I wrote to copy sequences of elements between vectors (when updating sparse vectors). This would probably be more efficient if R6RS (or Chez) had an equivalent of memcpy for vectors. It does have bytevector-copy!, but not a corresponding vector-copy!. Go figure.

R7RS does have a vector-copy!, but I'm using Chez, which is an R6RS implementation. Moreover, R7RS(-small) does not have bitwise operations. But it totally has gcd (greatest common divisor), lcm (least common multiple) and exact-integer-sqrt. I mean, my idea of minimalism is a bit different. Also, it has a timestamp function which is based on TAI instead of UTC and thus requires taking account of leap seconds, except it's not really guaranteed to return accurate time anyway ("in particular, returning Coordinated Universal Time plus a suitable constant might be the best an implementation can do"). Yay. [/rant]

Implementing transient/mutable HAMTs efficiently is a bit more complicated. For it to be efficient, you need to be able to insert new elements in a sparse vector in-place, but you can only do that if you pre-allocate them with more space than they actually need, so you have room for growing. Finding a proper size and grow factor is left as an exercise to the reader.

Comparing performance with Clojure HAMTs is not a very exact benchmark, because the implementations are not in the same language (Clojure's is in Java and Fenius's is in Chez Scheme). In my tests doing 10M insertions, Clojure sometimes did much faster than Chez, sometimes much slower, with times varying between 16s and 48s; the JVM works in mysterious ways. Chez's fastest was not as fast as Clojure, but performance was consistent across runs (around ~35s). Note that this is the time for using the hashmap implementation from ChezScheme, not Fenius; doing the benchmark directly in Fenius would be much slower because the language is currently interpreted and the interpreter is very naive. Note also that in actual Clojure, you would do all the insertions on a transient hashmap, and then turn it into a persistent hashmap after all the insertions, so the benchmark is not very representative of actual Clojure usage.

End of file

That's all for now, folks. I wanted to discuss some other aspects of Fenius dictionaries (such as syntax), but this post is pretty big already. Enjoy your hashmaps!

Comentários / Comments

Modules in Hel (and Chez Scheme)

2019-04-18 22:41 -0300. Tags: comp, prog, pldesign, lisp, scheme, hel, fenius, in-english

Today I implemented a simple mechanism for importing modules in Hel. Basically, you can write:

import foo/bar/baz

and it will look for a file named foo/bar/baz.hel, load it as a module containing the bindings defined in the file, and expose the module as baz to the calling code. So if foo/bar/baz.hel defines a function hello, then after you import foo/bar/baz, you can call the function as baz.hello(). Alternatively, you can import the module with a different name using:

import foo/bar/baz as whatever

and access the bindings like whatever.hello().

That's all there is to it, which means there's a lot of things missing from the system. But I decided it was best to do the simple thing¹ and leave the more complex details of the module system for a later phase.

Oh, one more trick: if the module file is not found, Hel tries to import a Scheme module with the given name. So you can actually say import chezscheme and get all the bindings from the host Scheme. (Except some of the names are inacessible, since there is currently no way to use ? or - in Hel identifiers. We'll see how to fix that in the future.

Modularizing the interpreter

Incidentally, I also started an attempt to split the interpreter (a single 1420-line Scheme file) into modules. I still haven't merged those changes into the master repository, and I'm still not sure if it's a good idea. In principle it'd be great for organization. The problem is that the RnRS/Chez module system is annoying to use.

For instance, you have to list all bindings you want to export in the library declaration. This is especially annoying for records. For example, if you declare a record:

(define-record-type Foo (fields x y))

this will generate a record descriptor Foo, a constructor make-Foo, a type predicate Foo?, and two accessors Foo-x and Foo-y. That's all nice and fun, but you have to export each of the generated identifiers individually if you want to use them in other modules.

Another annoyance is that Chez does not seem to provide a mechanism to run the REPL from the environment of the module. You can switch the interaction environment to the exported bindings of a module, but there does not seem to be a way to switch to the environment within the module, to call non-exported functions, etc. The workaround I found was to split all modules into a library definition file (say, utils.sls), containing just:

(library (utils)
  (export binding1 binding2 binding3 ...)
  (import (chezscheme))
  (include "utils.scm"))

and the module code proper, in a separate file utils.scm. In this way, I can load the module code directly in Geiser or in the REPL, outside the module system. Note also that the the library definition file only imports (chezscheme) (so we can use the include form); all other imports are directly in the .scm file, so the .scm file will load properly by itself.

Even so, it is annoying to reload libraries, because you have to reload the users of each library manually too.

A smaller annoyance is that the code takes longer to compile when split into libraries. This is not a problem if you compile before execution, but makes running the code without a separate compilation step (chezscheme --script hel.scm) slower to start.

Yet another annoyance is that in the R6RS library syntax, all definitions must precede all expressions, so you have to move initialization code to the end of the module. [Addendum: this (mis)feature does not seem to be shared by R7RS library syntax. As much as I've learned to appreciate many good aspects of R6RS, it seems to me that library syntax is just better in R7RS.]

In the end, a middle-ground solution may be to avoid R6RS libraries entirely, and just create a single main file which includes the others, all in the same namespace. It's not as elegant, but it makes development easier. [Addendum: one benefit of the .sls/.scm split is that it's easy to switch to the non-library organization by just including all .scm files directly and ignoring the .sls files.]

Todos and remarks

When you write import foo/bar, where to look for the foo/bar.hel file? Currently the interpreter looks in the current directory, but that's far from ideal. A better option would be to search relative to the file where the import occurs. That would be an improvement, but still somewhat annoying: if I have a project with files foo.hel, dir1/bar.hel and dir2/baz.hel, I want to be able to load either of these files individually in the REPL and for each module to be able to import any other module in the project using the same name. What I really want is a notion of a project root to search from. One possibility would be to have a __project__.hel file (or something similar) at the project root. When looking for imports, the implementation tries to find the __project__.hel file up the directory hierarchy. If it's found, the directory where the file is is the project root. If not, the project root is the directory where the importing file is. This is vaguely similar to Python's __main__.py, except there would be only one project file per project (not per directory, which would destroy the idea of a single project root).

There is still no syntax to import individual bindings from a module, i.e., the equivalent of Python's from mod import foo, bar. Maybe we can just use Python's syntax (except (foo, bar) would have to be in parentheses, due to restrictions of Hel's syntax).

There is also no syntax to import all bindings from a module, i.e., Python's from foo import *; we can't use * because that's an infix operator, and the syntax doesn't and won't special-case individual commands (remember that one of the goals of the syntax is not to have hardcoded keywords). Maybe from foo import all(), and also things like from foo import except(foo, bar). I don't know.

Why foo/bar instead of foo.bar? Because I thought that import foo.bar might give the impression that the bindings are to be accessed as foo.bar.hello() (like Python) instead of just bar.hello() (as Hel does). And why I wanted this semantics? Because I was unsure what foo would be when you import foo.bar: a module containing just bar? What if I import foo later? What if foo contains a binding bar itself? Does the module foo have to exist for me to be able to import foo.bar? To avoid all these questions ("do the simple thing"), I decided it would be simpler to import the module without having to deal with the whole hierarchy, and make it available as just bar; and I thought the syntax with / suggested that better.

_____

¹ "When in doubt, do the simple thing" has been a sort of mantra in this project. This has helped me avoid analysis paralysis and keep making progress, even though I know eventually I will have to go back and change/improve things. (Note though that the mantra is "do the simple thing", not "do the simplest thing".)

2 comentários / comments

Loops and blocks in Hel

2019-04-11 17:11 -0300. Tags: comp, prog, pldesign, hel, fenius, in-english

Hel got a preliminary version of its main looping and non-local control flow primitives today: do, redo and return. They have characteristics similar to Common Lisp's block/return-from, Scheme's named let, and Clojure's loop/recur (and, one might say, Java's labeled blocks, continues and breaks), though they are not exactly the same as any of these. In this post, I will describe how they work.

`do`

do creates a labeled block with parameters and initial values. It has the syntax:

do name(var1=value1, var2=value2, ...) {
    body
}

What this does is to evaluate body in an environment containing the specified variables with the given values. The variable declaration section is like a function parameter declaration where all parameters are given default values. For example:

do block(x=1, y=2) {
    x+y
}

will return 3.

Within the block, name is bound to a tag for the block. This tag can be used with the redo and return commands.

`redo`

redo can be used to repeat the named block, with new values for the block variables. For example:

# Print all integers from 1 to `limit`.
let count_until(limit) = {
    do block(i=1) {             # First iteration will run with `i` = 1.
        if i <= limit {         # If we have not reached the limit yet...
            print(i)            # Print the current value of `i`...
            redo block(i=i+1)   # And repeat the block, with a new value of `i`.
        }
    }
}

This is analogous to Clojure's recur, except it does not have to be in tail position, and you can specify the label of the block you want to repeat (so you can have nested blocks and escape to outermost ones).

This is also similar to Scheme's named let, except that the new execution of the block replaces the current one, rather than behaving like a regular function call.

The names of the parameters in the redo call are optional; we could have written redo block(i+1) instead of redo block(i=i+1). This is analogous to the function call syntax.

`return`

return can be used to return a value immediately from a block. For example, suppose we have a foreach function which takes a list and a function, and applies the function to each element of the list in order:

let foreach(list, f) = {
    do block(list=list) {                # Start iterating with the full list
        if list != [] {                  # If the list is not empty yet...
            f(list.first)                # Apply the function to the first element...
            redo block(list=list.rest)   # And repeat the block for the remaining ones.
        }
    }
}

Now we want to write a function to test if a given element is in a list. We want to reuse foreach to do the iteration, but we want to stop the iteration (and get out of foreach immediately) when we find the element in the list. We can do this with return:

let is_member(searched, list) = {
    do out() {
        # Call `foreach` with an anonymous function, which will be called
        # for each element of the list.
        foreach(list, fn (item) {
            if item == searched {        # If the current element is the one we are searching...
                return true from out     # Return `true` immediately from the `out` block.
            }
        })
        # If we got here, it's because the element was not found.
        return false from out
    }
}

These constructs can be compared to Java's labeled blocks, continues and breaks. However, Hel blocks take parameters, which must be specified when redoing them (the equivalent of continue); and Hel blocks return a value, which must be specified when returning from them (the equivalent of break).

To do / open questions

Common Lisp has the notion of a default block (which is the block labelled nil). Some constructs, like return, return from the default block, so you can avoid naming the block if you are only using one. It would be nice to have something similar in Hel.

Currently the parameter/argument binding logic for blocks is the same one used for functions. This means that if one of the block's arguments is omitted from the redo call, it will acquire the initial value specified in the beginning of the block! This is most likely not what you want. Alternative behaviours would be to forbid omitting block arguments, or reusing the value in the current iteration rather than the initial value.

Perhaps instead of using special forms for redo and return, these could be methods of the tag object, so we would write, for example, block.redo(i=i+1) instead of redo block(i=i+1), and block.return(42) rather than return 42 from block. I like the special form better, especially for redo because the block name stays together with the arguments just like in the block declaration. It also allows the possibility of omitting the block name if we get default blocks in the future.

Comentários / Comments

Object model and dot syntax in Hel

2019-04-01 20:43 -0300. Tags: comp, prog, pldesign, hel, fenius, in-english

[Despite the date, this is not an April Fool's joke. This is mostly a mind dump for future reference.]

I have written about noun-centric vs. verb-centric OO before (in Portuguese), but the question surfaces now in the context of Hel's design.

Most mainstream OO programming languages are noun-centric: methods (verbs) belong to objects (nouns). When calling x.foo(y), the method to be called is determined by the (dynamic) class of x; the call can be conceptualized as sending the message foo(y) to the object x.

By contrast, Common Lisp and other languages influenced by the Common Lisp Object System (CLOS) are verb-centric: methods (verbs) are entities in their own right, which can be applied to objects (nouns). Methods of the same name are grouped under a generic function. The method calling syntax is typically the same as the regular function call syntax: (foo x y). The method invoked by a call to a generic function is determined by the (dynamic) classes of all arguments, not just x. New methods can be defined at any time, since they are independent from the class. The class definition, on the other hand, contains just the fields and the superclasses (and metaclass, and those sorts of thing), but no methods.

In Dylan, x.foo(y) is syntactic sugar for foo(x, y). This way you can have both the familiar method call notation and the verb-centric nature of CLOS.

Now, everything in language design is tradeoffs, and here we have some.

Namespacing

One of the main differences between the noun-centric and verb-centric models is in how they define namespaces for methods.

Suppose we define a File class in a module, with a method size() returning the file's size in bytes. In another module, we define a Circle class, with a method size() returning the circle's size in pixels. (Okay, we could have called the circle method radius or diameter(), but let's suppose the module was written by someone else and we don't control the name.)

In noun-centric OO, the class creates a namespace for its methods. someFile.size() and someCircle.size() are entirely different methods, because someFile and someCircle belong to different classes. By contrast, in verb-centric OO, these calls would be syntactic sugar for size(someFile) and size(someCircle); this would only work if there was a single generic function size encompassing both methods, which does not make much sense in this example (since size means something completely different in each class).

Common Lisp solves this problem by having the names belong to packages: variable names are symbols, and each symbol belongs to a package. In this case, each module/package would have its own symbol size, and there would be two distinct generic functions, both named size, but each by a distinct size. Due to the way the package system works in Common Lisp, you would not be able to import both at the same time: you would have to use a fully qualified symbol name to refer to at least one of them.

Guile does something different: if you import two generic functions with the same name into a module, they are merged into a new generic function combining the methods of both. In this case, even though each module defines its own generic function size, a module importing both would see a single generic function size which would accept both files and circles. May seem a bit weird from a conceptual standpoint, but it works nicely. Without Guile's trickery, the Schemely solution would be to rename one (or both) of the functions when importing (the equivalent of Python's from file import size as file_size). I don't know how Dylan handles this situation.

The flip side is that noun-centric OO provides a single namespace for all of a class' methods. This means that you have to be careful about overriding methods in subclasses. Suppose someone defines a class A and I create a subclass B inheriting from A and define a method foo on it. In the future, the author of class A decides to add a method foo to A. Now my class B inadvertently overrides the foo method of the superclass, just because it happens to have the same name as A's new foo method. Some noun-centric OO languages like C# require the explicit use of an override keyword on overriding methods to avoid this kind of accidental override. By contrast, in the CLOS world, my definition of the generic function foo would be unrelated to the new foo created by class A's author, so no conflict would ensue. (A package import conflict might happen, though. And if all of the symbols in the package where A is defined are imported into the package where B is defined, you might end up using the same symbol for both foos without even realizing. Yeah, packages are fun like that. But at least it's possible to have two different, non-conflicting foo methods.)

Another way in which noun-centric OO provides a namespace for methods is by separating method names from regular variables. This means I can write let size = file.size() without losing access to the size method. In Common Lisp this problem does not arise because functions/methods live in a different namespace from regular variables anyway, but I'm not willing to go that route. In Scheme, the local size would shadow the global method. (Again, I don't know how Dylan handles this.)

Yet another consequence of the namespacing thing is that, in the noun-centric model, you don't have to import a class' methods individually: if you have access to the class, you have access to all of its (public) methods. In the verb-centric model, generic functions are independent entities, and would have to be imported individually (or else you import all of them at once by importing the whole module (the equivalent of Python's from foo import *), thus polluting your module's namespace).

A possible counter-argument against the noun-centric model is that importing all of a class' methods is kind of an illusion: there are typically functions taking objects of a given class as arguments which are not methods of the class, and those would have to be imported manually anyway. In practice, though, the most common operations on a given object will be methods of the object, so this argument may not be very strong.

The last point brings an advantage of the verb-centric model: you can 'add' methods to a class without modifying its source, since the methods are independent entities that can be defined anywhere, just like regular functions. Some languages, such as Ruby, have "open classes" to which methods can be added at any time. One problem with this is that no matter where the method definitions are for a given class, they all share the same method namespace, so conflicts may happen more often. The other problem is that the set of methods available in a class depends on which modules have been loaded. This is also the case in the verb-centric model, but at least it's completely explicit: you only have access to a method if you import it. In the Ruby model, you see every method in a class regardless of where it was defined, which may create implicit module dependencies (i.e., I use a method defined elsewhere, but I don't import the defining module explicitly, it just happens to be available by the time my code runs).

If I understand correctly, Haskell's typeclasses offer an alternative model: you can instantiate a typeclass (i.e., implement an interface) anywhere, and even implement the same interface multiple times in different ways, but you only see the implementations if you import the implementing module. Transplanting this model to class definition, you might be able to add methods to a class anywhere, but would only see the new methods if you import the defining module. I'm not sure this would work; it seems plausible in a static world, but not really when you can obtain an object from anywhere and call a method on it without knowing its type (or worse, via reflection).

Conclusion

I intend to implement a rudimentary object model for Hel soon. I'm leaning towards plain old noun-centric OO, if only because it's easier to reach a class' methods (you don't have to import each method individually), and because it limits conflicts between local variables and method names. Let's see how it goes.

Comentários / Comments

Named parameters in Hel

2019-03-28 21:21 -0300. Tags: comp, prog, pldesign, hel, fenius, in-english

Hel acquired Python-like named parameters yesterday. This means that if you declare a function like:

let f(x, y) = x+y

you can call it as f(2, 3), or f(x=2, y=3), or f(2, y=3). It also got (also Python-like) rest parameters, i.e., you can declare a parameter like *args to collect all positional (non-named) arguments not captured by a previous parameter, and **kwargs to capture all named arguments not captured by a previous parameter.

(Unlike Python, the resulting kwargs variable is a list of (name, value) tuples, but that's because Hel does not have dictionaries yet. Also, I still have to implement support for *x and **x syntax at the call site, rather than just at function declaration site.)

But I wonder if this is the best approach to named parameters in Hel:

By allowing any parameter to be passed by name, all parameter names become part of the function's API, whether the programmer wants it or not. You cannot change the name of a function parameter without wondering if some other code depends on it.
This also means that the parameter names become part of the type of the function: fn (x) x+1 and fn (y) y+1 don't have the same type, because one of them can be called like f(x=5) and the other can't.
Currently Hel is interpreted, but I plan to make it compilable at some point. Having multiple ways to call the same function, when the function is not necessarily in the same module / known at compile time and might have been passed around as an argument, may make function calling conventions more complicated and more expensive.

So, are there alternatives for handling named parameters better suited to Hel's goals out there?

What other languages do?

Plenty of languages get by without named parameters at all, but that's not really what I'm after.

Common Lisp, Dylan, Scheme

In Common Lisp, functions have positional and keyword (named) parameters, but any given parameter is either positional or keyword: if you declare a function like (defun f (x y &key z) ...), you can call it like (f 1 2 :z 3), but not like (f :x 1 :y 2 :z 3). This means the function controls whether the name of a parameter is exposed or not (and actually the keyword exposed by the function need not be the same as the variable name used internally to store its value).

This makes the calling convention simpler. Conceptually, a function receives a list of arguments; keywords like :x are just values, and keyword arguments are just extra keyword value sequences in the list. An argument list can be assembled programmatically and passed to a function via apply. The call site (from an implementation point of view) does not need to know the function signature beforehand to call it. (Of course, performance is usually better when it does know the signature beforehand.)

One downside of this is that because keywords are plain values, it is easy to pass one as a positional argument by mistake, especially if the function supports both optional and keyword arguments. For example, if a function is declared (defun f (a &optional b c &key d) ...), calling (f 1 :c 2) will actually pass :c as the value for b, and 2 as the value for c. For this reason, it is considered good practice^{[by whom?]} not to use both optional and keyword arguments in the same function.

The other downside is that sometimes we do want to be able to pass the same arguments either with or without names. I feel this is especially the case with constructors, where I want to be able to call either Person(name="Hildur", age=23) or Person("Hildur", 23). I don't know. Constructors also have the characteristic that the parameter names are usually part of the interface anyway, because they are the same as the names of the object accessors.

Dylan seems to use the same scheme (heh) as Common Lisp.

Standard Scheme only supports positional parameters and a mechanism to collect rest arguments (like Python's *args) in a list. The various Scheme implementations tend to support variations of Common Lisp style argument lists.

Smalltalk, Objective-C, Swift

In Smalltalk and Objective-C, the parameter names are part of the name of a method. Using an example from Wikipedia, when you write:

'hello world' indexOf: $o startingAt: 6

the method is actually called indexOf:startingAt:, with the arguments interspersed with the name. This means all arguments are named, and it also means they cannot be reordered or omitted (though you can define a different method with different arguments, for example a separate indexOf: method, thus simulating optional arguments).

Swift is somewhat similar: by default, all parameters have a label, which must be used when calling the function; however, in Swift you can specify _ as the label to omit it. Arguments also have a fixed order. The parameter labels appear to be considered part of the function name too, so you can have different declarations of the same function name with different parameter labels. Argument names are not part of the type. I'm not sure how you specify which of multiple functions with different argument labels you want to refer to when using a function as a value.

Elixir, Ruby 1.x, Clojure

In Elixir, passing the last arguments of a function call in the form key: value, key: value, ... is syntactic sugar for passing a list [key: value, key: value, ...], which is itself syntactic sugar for a list of tuples [{:key, value}, {:key, value}, ...]. By the magic of pattern matching, if you do the same thing in the function parameter declaration, it will turn into a pattern that will match the list of tuples passed in as argument. But this also means that the list must be in the same order in the declaration and the call, and also means that the keywords are not optional. Alternatively, one can receive the whole list and parse it manually (or semi-manually with the help of a dictionary).

Ruby pre-2.0 seems to work similarly, except you get a dictionary instead of a list of pairs. Ruby 2.0 and after has actual keyword parameters. Unlike Python, a parameter is either positional or keyword; it cannot be called both ways.

Clojure's approach is a mix of Common Lisp and Elixir: to support keyword parameters, you declare a rest parameter which will collect the sequence of :keyword value items, but instead of specifying a variable as the parameter to receive the list, you can specify a dictionary pattern to destructure the list. The syntax is not exactly awesome, especially when declaring default values for the keys, but it works.

Ada

Ada is like Python in allowing any parameter to be passed by name or by position, as the caller desires. The names don't seem to be part of the type, so I don't know how the language handles named arguments when using a function as a value.

Conclusion

There is no real conclusion here. I will keep the Python-style calls for now, but I have to think more about this.

Comentários / Comments

O que é uma macro e por que diabos eu usaria uma?

2019-03-23 22:21 -0300. Tags: comp, prog, lisp, hel, fenius, em-portugues

No último post, eu divaguei um pouco sobre a implementação de macros em Hel, minha linguagem de programação experimental. Neste post, pretendo explicar para um público não-Líspico o que são macros e como elas podem ser úteis. /

Árvores

Quando escrevemos uma expressão do tipo 2+3 em um programa, o compilador/interpretador da nossa linguagem de programação tipicamente converte essa expressão em uma estrutura de dados, chamada árvore de sintaxe abstrata (AST, em inglês), representando as operações a serem realizadas. Em Hel, o operador quote permite visualizar a AST de uma expressão:

hel> quote(2+3)
Call(Identifier(`+`), [Constant(2), Constant(3)])

Neste exemplo, a AST representa uma chamada ao operador +, com as duas constantes como argumento. Podemos manipular a AST para obter seus componentes individuais:

hel> let ast = quote(2+3)
Call(Identifier(`+`), [Constant(2), Constant(3)])
hel> ast.head
Identifier(`+`)
hel> ast.arguments
[Constant(2), Constant(3)]
hel> ast.arguments[0]
Constant(2)
hel> ast.arguments[1]
Constant(3)

Também podemos construir uma AST diretamente, chamando os construtores Identifier, Call, etc. manualmente ao invés de obter uma AST pronta com quote(). Assim, podemos escrever código que manipula ou gera novas ASTs, possivelmente utilizando componentes de uma AST já existente.

Agora, quando chamamos uma função, ela atua sobre o resultado dos seus argumentos, e não sobre a AST dos argumentos. Por exemplo, se eu definir uma função:

let f(x) = x

e a chamar como f(2+3), o valor de x dentro da função será 5, e não uma AST da expressão 2+3. Do ponto de vista da função, não há como saber se ela foi chamada como f(5) ou f(2+3) ou f(7-2): o valor de x será o mesmo. E se fosse possível escrever uma função que trabalhasse diretamente sobre a AST de seus argumentos? E se eu pudesse fazer transformações sobre essa AST, produzindo uma expressão diferente a ser calculada (por exemplo, alterando o significado de certos operadores ou palavras que apareçam na expressão)?

Pois é basicamente isso que é uma macro. Uma macro é uma função especial que, ao ser chamada, recebe como argumento a AST inteira da chamada, produz como resultado uma AST alternativa, que será usada pelo compilador/interpretador no lugar da AST original.

Mas pra que serve?

Em muitas linguagens, existe um comando for ou foreach para iterar sobre os elementos de uma lista. Em Python, por exemplo:

for x in [1, 2, 3]:
    print(x)

Você, programador Hel, olha para esse comando e pensa "puxa, que legal!". Infelizmente, porém, (ainda) não existe um comando análogo em Hel. Porém, nós sabemos que (1) um for nada mais é do que uma repetição mudando o valor da variável a cada iteração; (2) podemos escrever uma função que implementa essa repetição; e (3) podemos escrever uma macro que transforma uma expressão do tipo for var in list { ... } em uma chamada de função correspondente. Vamos ver como isso funciona.

1º passo: a função

Nossa linguagem atualmente não conta com nenhum comando de repetição especializado. Porém, podemos escrever uma função recursiva que recebe uma lista e uma função a aplicar e, se a lista não for vazia, aplica a função ao primeiro elemento da lista e chama a si própria sobre o resto da lista, repetindo assim a operação até que só sobre a lista vazia.

let foreach(items, f) = {
    if items != [] {            # Se a lista não for vazia...
        f(items.first)          # Aplica a função ao primeiro elemento
        foreach(items.rest, f)  # E repete para o resto da lista.
    }
}

Agora podemos chamar essa função com uma lista e uma função pré-existente para aplicar a cada elemento:

hel> foreach([1, 2, 3], print)
1
2
3

Ou podemos chamá-la com uma função anônima:

hel> foreach([1, 2, 3], fn (x) { print("Contemplando elemento ", x) })
Contemplando elemento 1
Contemplando elemento 2
Contemplando elemento 3

2º passo: a transformação

[Update (30/05/2019): O código desta seção está obsoleto. Há uma maneira muito mais simples de realizar a transformação descrita aqui nas versões mais recentes da linguagem.]

Agora o que gostaríamos é de poder escrever:

for x in [1, 2, 3] { print("Contemplando elemento ", x) }

ao invés de:

foreach([1, 2, 3], fn (x) { print("Contemplando elemento ", x) })

Para isso, vamos analisar a AST de cada uma das expressões e ver como podemos transformar uma na outra. Começando pela expressão pré-transformação:

hel> let source = quote(for var in list body)
Phrase([Identifier(`for`), Identifier(`var`), Identifier(`in`), Identifier(`list`), Identifier(`body`)])

A expressão consiste de uma frase com uma lista de constituintes. Os constituintes que nos interessam aqui são a variável de iteração (constituinte 1, contando do zero), a lista sobre a qual iterar (constituinte 3), e o corpo (constituinte 4):

hel> source.constituents[1]
Identifier(`var`)
hel> source.constituents[3]
Identifier(`list`)
hel> source.constituents[4]
Identifier(`body`)

Agora vamos analisar a expressão que queremos como resultado da transformação:

hel> let target = quote(foreach(list, fn (var) body))
Call(Identifier(`foreach`), [Identifier(`list`), Phrase([Identifier(`fn`), Identifier(`var`), Identifier(`body`)])])

Com isso, podemos escrever uma função que recebe uma AST da expressão origem e produz uma similar à expressão destino, porém substituindo Identifier(`list`), Identifier(`var`) e Identifier(`body`) pelos componentes extraídos da AST origem:

let for_transformer(source) = {
    # Extraímos os componentes:
    let var = source.constituents[1]
    let list = source.constituents[3]
    let body = source.constituents[4]

    # E produzimos uma expressão transformada:
    Call(Identifier(`foreach`), [list, Phrase([Identifier(`fn`), var, body])])
}

Será que funciona?

hel> for_transformer(Phrase([Identifier(`for`), Identifier(`var`), Identifier(`in`), Identifier(`list`), Identifier(`body`)]))
Call(Identifier(`foreach`), [Identifier(`list`), Phrase([Identifier(`fn`), Identifier(`var`), Identifier(`body`)])])

Parece um sucesso.

3º passo: A macro

Agora só falta definir for como uma macro, pondo a nossa função for_transformer como o transformador de sintaxe associado à macro:

hel> let for = Macro(for_transformer)

E, finalmente, podemos usar nossa macro:

hel> for x in [1, 2, 3] { print("Eis o ", x) }
Eis o 1
Eis o 2
Eis o 3

E não é que funciona? Ao se deparar com o for, o interpretador identifica que trata-se de uma macro, e chama o transformador associado para converter a AST em uma nova AST. No nosso caso, o transformador monta uma AST correspondente a uma chamada a foreach, com uma função anônima cujo argumento é a variável de iteração e cujo corpo é o corpo do for. A AST resultante é então executada pelo interpretador, que chama a função foreach, que itera sobre cada elemento da lista chamando a função anônima gerada, imprimindo assim galhardamente os elementos da lista.

Conclusão

Macros nos permitem adicionar novas construções à linguagem, através de funções que transformam a AST das novas construções em ASTs de construções já existentes. É basicamente uma maneira de ensinar ao compilador/interpretador como interpretar novas construções em termos das que ele já conhece.

Na versão atual de Hel, é necessário manipular e construir as ASTs manualmente. O ideal seria ter um mecanismo para facilitar a extração de componentes e construção de novas ASTs sem ter que obter e construir cada nodo individual... mas um dia chegamos lá.

Comentários / Comments

Hel has macros

2019-03-20 23:31 -0300. Tags: comp, prog, pldesign, lisp, hel, fenius, in-english

Hel got macros today. That came about through a different (and simpler) route than I had envisioned; today I'm going to ramble a bit about it.

As I described previously, I decided to abandon Lisp syntax and experiment with a more conventional syntax, while trying to preserve the flexibility to define new commands that looked just like the builtin ones (such as if, for, etc).

Because the new syntax was more complicated than Lisp's atoms and lists, I thought Lisp-style procedural macros would not be very convenient to use in the new language. So from the start, I had the idea of providing template-based macros (a la Scheme's syntax-rules) to match the various syntactic forms and describe replacements. I've been struggling with the problem of finding a good notation for matching pieces of code [all the while looking at Rust and Dylan for inspiration], with unsatisfying results. Meanwhile, I have been working on simpler parts of the language, such as adding support for defining functions, if/then/else, and such.

Yesterday I tackled various other low-hanging fruit problems, such as adding preliminary support for lists, tuples, indexing (list[i]), strings, and records. Rather than reinventing a record structure, I implemented Hel records in terms of R6RS records¹. One consequence of this is that Hel programs can manipulate not only their own record types, but also the records of the host Scheme implementation.

Once I had that, I realized I could just drop the record types for the AST into the Hel standard environment, and now I could manipulate syntax trees from Hel! By this point, I could write functions taking a syntax tree as an argument and returning a syntax tree as a result. This is basically what a macro is. All I needed then was a way to mark those functions as macros, so that the interpreter could identify them as such and call them with the unevaluated syntax tree as an argument, rather than the evaluated arguments (i.e., so that m(x+y) is called with the syntax tree for x+y rather than the result of calculating x+y).

* * *

What I did when I dropped the AST constructors in the Hel environment was, in a sense, making Hel homoiconic (although not with a code representation as direct as Lisp's, and some would argue that this does not count as true homoiconicity; it does not really matter). Although this is somewhat obvious (I exposed the syntax tree types, therefore I can manipulate syntax trees), there is a difference between a formal/logical understanding and an intuitive understanding of something; and seeing the immediate power that something as simple as exposing the language's syntax tree to itself yielded was eye-opening, even though I have programmed in a language with exposed syntax trees as its hallmark feature for years – I guess this so normal in Lisp you eventually take it for granted, and don't really think about how magic this is.

The most surprising part of this for me was how easy it was to add this power to the language: just expose the AST constructors, add half a dozen lines to the interpreter to recognize macro calls, and bam!, we have macros and homoiconicity. I started wondering why more modern interpreted languages don't expose their ASTs in the same way. I think there are a number of factors in the answer. One of them perhaps is the fact that most of the popular scritping languages are implemented in C, and in C it would take special effort to expose the AST to the interpreted language, compared to (R6RS) Scheme where I was able to easily implement generic support for exposing any record/struct types from the host language to the interpreted language. Reflection was a big win here. (I'm not clear how much dynamic typing had a fundamental role in making this easy too. Perhaps it would be possible to do in a statically-typed host language too, but I wonder how easy would it be; it certainly seems it would not be as easy, but that's something I have to think harder about.)

Another factor is that the Hel syntax tree, although more complex than Lisp, is still much simpler than the typical programming language, by design. There were only eight AST constructors to expose to the interpreted language: Phrase, Constant, Identifier, Tuple, Array, Block, Call, and Index. (In the current version there is an extra node, Body, which is used for the whole program and as the content of a Block; I expect to remove it from the exposed AST in the future, since it's just a list of phrases.) Infix and prefix operators are internally converted to Call nodes, with the operator as the callee and the operands as arguments. There is still room for simplification: Call and Index (i.e., f(x, y) and v[i, j]) have essentially the same fields and might be unified in some way; and Tuple and Array might be unified in a single Sequence node. I don't know to what extent this is desirable, though.

By contrast, Python, for example, does expose its AST, but it has a huge set of syntax nodes, and its representation can change with each Python release. Of course this is a danger for Hel too: once the AST is exposed, it's harder to change without breaking client code. Some abstraction mechanism might be necessary to allow evolution of the AST representation without breaking everyone's macros. On the other hand, the Hel AST is much less likely to change, since new language constructs don't generally require changing the AST.

Open problems

Although it's already possible to write macros in Hel, a pattern-matching interface would still be more convenient to use than directly manipulating the syntax nodes. However, it might be easier to implement the pattern-matching interface as a macro, in Hel, in terms of the current procedural interface, than as special code in the interpreter.

The other problem to handle is hygiene: how to keep track of the correct binding each identifier points to, and how this information is exposed in the AST. I still have to think about this.

_____

1 And although I have spoken unfavorably about R6RS in the past, I'm glad for its procedural interface for record creation and inspection. I think I have some more good things to say on R6RS in the future.

Comentários / Comments

The Lispless Lisp

2019-03-08 01:16 -0300. Tags: comp, prog, pldesign, lisp, hel, fenius, in-english

For a while I have been trying to design a nice Lisp-based syntax for Hel, trying to fit things like type information and default arguments in function definitions, devising a good syntax for object properties, etc., but never being satisfied with what I come up with. So a few days ago I decided to try something else entirely: to devise a non-Lisp syntax while maintaining a similar level of flexibility to define new language constructs. And I think I have come up with something quite palatable, though there are a few open problems to solve.

The idea is not entirely new. I know of at least Elixir, which is homoiconic but has more variety/flexibility in its syntax, though it has a bunch of hardcoded reserved words; and Dylan, which seems to have a pretty complex macro system, though maybe a little more complex than I'd wish. My non-Lisp Hel syntax has a bunch of hardcoded symbols (( ) [ ] { } , ; and newline), and the syntax for numbers and identifiers is hardcoded too (but that is hardcoded even in Lisp, though Common Lisp has reader macros to overcome this problem to an extent), but there are no reserved keywords, and I find it easier to read and analyze than Dylan. I hope you like it too (though feel free to comment in any case).

A taste of syntax

Here is a sample of the new syntax:

if x > 0 {
    print("foo")
    return x*2
} else {
    print("bar")
    return x*3
}

That looks like a pretty regular language, but the magic here is that none of the "keywords" is hardcoded. This is interpreted as a command if with the four arguments x > 0, the first block, else, and the second block. It is up to the operator/macro bound to the variable if to decide what to do with these arguments.

There are some caveats here, the most notable one being that the else must appear in the same line as the closing brace of the first block, otherwise it would be interpreted as an independent command rather than a part of the if command. I think this is an acceptable price to pay for the flexibility of not having a limited, hardcoded set of commands.

How does it work

The central component of the syntax is the command, or perhaps more precisely the phrase, since "command" gives the impression of a separation between statements and expressions which does not really exist in the syntax. A phrase is a sequence of space-separated constituents. A constituent is one of:

a constant, such as 42 or "hello";
an identifier, such as x;
an infix expression (consituent operator constituent), such as 2*3;
a prefix expression (operator constituent), such as -x;
a function call, such as f(x, y);
an indexing operation, such as vec[i];
a tuple, such as (1, 2, 3);
an array, such as [1, 2, 3];
a parenthesized expression, such as (42);
a block, such as { print("foo") }.

The arguments of a function call, indexing operation, parenthesized expression, and the elements of tuples and arrays are themselves phrases (i.e., you can have an if inside a function call).

Parsing of constituents is greedy. When looking at a series of tokens such as if x > 0 { ... } and trying to determine where a consituent ends and the next one starts, the parser will consider each consituent to be the longest sequence of tokens from left to right that can be validly interpreted as a constituent. In this example:

The first constituent is if, since if x is not a valid constituent.
The second constituent is x > 0, since x > 0 { (or any longer sequence) cannot be a valid constituent.
The third constituent is the whole block.

Phrases are separated by newlines or semicolons. To avoid the effect of a newline, a \ can be used at the end of the line (as in Python). Within parentheses or brackets, newlines are ignored (also as in Python).

That's pretty much all there is to it, in general lines. Except...

Operator precedence

An operator is any sequence of one or more of the characters ! @ $ % ^ & * - + = : . < > / ? | ~. So + and * are operators, but so are ++ and |> and @.@.

One open problem with this syntax is how to handle operator precedence in a general way. In my current prototype, I have hardcoded the precedence of the arithmetic operators, but I need to have sane precedence rules for user-defined operators.

One way is to allow the user to specify the precedence for custom operators (like the infixl and infixr declarations in Haskell). The problem is that this means a program cannot be parsed without interpreting the fixity declarations, which I find annoying, especially given that operators can be imported from other modules, and being unable to parse (read in Lisp parlance) a program without compiling the dependencies is deeply annoying from a Lisp perspective. Another problem is that it is not only hard to parse for the parser, it's hard to parse for the human too.

Another way is to have a fixed rule to assign each operator a precedence. I remember seeing a language once which gave each operator a precedence based on its first character (so, for example, *.* would have the same precedence as *). [Update: I don't remember which language it was, but it turns out that Scala does the same.] I like this solution a lot because it's easy for the human to know the precedence of an operator they've never seen before. The problem is reconciling this with the natural precedences expected from some operators (for instance, = and == usually have different precedences). I still have to think about this. Suggestions are welcome.

[Update: Maybe it makes more sense for the last character to determine precedence, since we want things like += to have the same precedence as =. On the other hand, an operator like => makes more sense as having the same precedence as = than >. Don't = and > have the same precedence anyway, though, since == and > do?]

Gotchas

Constituents vs. phrases

The consituent parsing rules may cause some surprises. Consider the following example:

let answer = if x > 0 { 23 } else { 42 }

In principle, this looks like assigning the result of the if to the variable answer. However, the parsing rules as stated above will lead to this being parsed as the sequence of constituents:

let
answer = if
x > 0
{ 23 }
else
{ 42 }

which is not quite what was intended. To use the if expression (which is a whole phrase, not a constituent) as the right-hand side of =, one would have to surround it with parentheses.

Commas delimit phrases

The arguments of a function call are phrases, not constituents, so an if expression can appear as the argument of a function call without having to surround it with an extra pair of parentheses on the top of those already required by the function call. But function arguments are delimited by commas, so, to avoid ambiguity, commas are not allowed to appear outside parentheses. For example, you cannot have a command like:

for x, y in items { ... }

because in a function call like:

f(for x, y in items { ... })

it would be ambiguous whether this is a single argument or two. The solution is to require:

for (x, y) in items { ... }

instead.

This also means that import foo, bar must be import (foo, bar) instead, though this limitation might be lifted outside parentheses.

Function calls vs. constituent+tuple

A space cannot appear between a function and its argument list. The reason is that we don't want for (x, y) in list to be interpreted as containing a function call for(x, y), nor do we want for x in (1, 2, 3) to be interpreted as containing a call in(1, 2, 3). I don't typically write spaces between a function and its arguments anyway, but I feel ambivalent about this space sensitivity. Perhaps the most important thing here (and in the above gotchas as well) is to have good error messages and (optional, on by default) warnings when things go wrong. For example, upon seeing something that might be a function call with a spurious space inside:

foo.hel:13: error: `foo` is not a command
foo.hel:13: hint: don't put spaces between a function and its argument list
13   foo (x, y)

This might be harder for the macros (e.g., for identifying that the call in(1, 2, 3) it received as argument was meant to be two separate constituents), but I think it can be done, especially for rule-based macros (as opposed to procedural ones).

A different solution is to get rid of tuples entirely and use for [x, y] instead, except this does not really solve the problem because this is ambiguous with the indexing operation.

That's all for now

I already have a prototype parser, but it's pretty rough, and I still have to work on the interpreter, so I have not published it yet. If you have comments, suggestions, constructive criticism, or two cents to give, feel free to comment.

Comentários / Comments

The road to Hel (is paved with good intentions)

2019-01-04 00:54 -0200. Tags: comp, prog, pldesign, lisp, hel, fenius, in-english

2019 might just be the year of Hel on the desktop. I mean, not really, but I would like to talk a bit about my prospects for Hel going forward.

For those who don't know, Hel (huangho's Experimental Language/Lisp) is my playground for experimenting with programming language design and implementation ideas. The Hel 0.1 compiler was a very crude translator from a simple Lisp-like language to C, similar in spirit to lows. Hel 0.2 was a bit of an improvement, in that it at least had the rudiments of an intermediate representation. The goal for Hel 0.2 was to be a superset of a subset of R5RS Scheme (i.e., Hel and R5RS would have a common subset), and the compiler itself would be written mostly in this subset; the idea was to eventually be able to compile the Hel compiler with either a Scheme implementation or with the Hel compiler itself, thus bootstrapping the language (i.e., being able to compile the compiler with itself).

My goals have changed a bit, though. First of all, I'm now interested in exploring more the language side rather than the implementation side of the project. I think an interpreter might serve my goals better than a compiler, since it is easier to change and test ideas. The compiler can come later.

Second, I'm not so keen anymore in having a large common subset with Scheme. Multi-shot continuations (call/cc) have always been out of the subset, but as of late I'm willing to question things as basic as cons cells. I may not get that far away from Scheme, but when exploring design options I definitely don't want to be constrained by compatibility. So bootstrappability can come later too.

Third, because I want to write an interpreter, but I don't want it to be terribly slow, I'll probably be switching implementation platform. So far I had been doing development on GNU Guile, but I'll probably switch to either Chez Scheme (a pretty fast R6RS Scheme implementation), or SBCL (a pretty fast Common Lisp implementation). SBCL has the advantage of having more libraries available (and Common Lisp itself is a larger language than that provided by Chez), while Chez has the advantage of being (in principle) closer to Hel (although that's kind of moot if bootstrapping is not an immediate goal anymore). I thought SBCL would also be faster than Chez, but in my highly scientific benchmark (running (fib 45) on each implementation), Chez is actually faster out of the box, though SBCL is faster if type declarations are provided.

So what are my goals for Hel 0.3? Well:

Implementation-wise, one of my major goals is debuggability. I want to have a Lisp which can give proper stack traces, with proper line numbers, in the presence of macros, and show function arguments in the stack trace.
- If a macro call (m) expands to a function call (f), I would like to be able to see both calls in the stack trace, even though the macro call of course is not really in the stack because it happened at compile time. But the fact that the call to (f) was generated by a macro can be stored in the debug information somehow together with the line number information. (I don't expect this to be particularly easy, or particularly hard; but it's a goal.)
Language-wise, I want:
- Optional types, with a lightweight syntax for type declarations;
- Explicit indication of mutability, with more things immutable by default (strings, lists). Mutability is an always-present part of APIs in Scheme and Common Lisp (both language standards are very explicit in describing data sharing between function arguments and results, which values can or cannot be mutated, etc.); I think this really ought to be more explicit in the data types themselves. I have described a proposal for handling mutability before, but nowadays I'm more inclined to simply have mutable and immutable variants of lists, dictionaries and such; it's less elegant but easier to implement and use efficiently. We'll see how this goes. The important thing is to make mutability explicit in the data types (and their printed representation).
- A concise notation for declaring and using new data types, reading and printing them, doing pattern matching, reflection, etc., to reduce the temptation to Use Lists For Everything™.

There are other things (some of which I intend to write about in the future), but I think these are the most important ones.

Wish everyone a happy new year, and may our living-dead open source projects thrive in 2019.

Computers, languages, and computer languages. Às vezes em Português, sometimes in English.

Posts com a tag: prog

2019-05-25 22:33 -0300. Tags: comp, prog, pldesign, fenius, ramble, in-english

2019-05-23 21:36 -0300. Tags: comp, prog, pldesign, fenius, in-english

Sparse vectors

Hash Array-Mapped Tries

Miscellaneous notes

End of file

2019-04-18 22:41 -0300. Tags: comp, prog, pldesign, lisp, scheme, hel, fenius, in-english

Modularizing the interpreter

Todos and remarks

2019-04-11 17:11 -0300. Tags: comp, prog, pldesign, hel, fenius, in-english

do

redo

return

To do / open questions

2019-04-01 20:43 -0300. Tags: comp, prog, pldesign, hel, fenius, in-english

Namespacing

Conclusion

2019-03-28 21:21 -0300. Tags: comp, prog, pldesign, hel, fenius, in-english

What other languages do?

Common Lisp, Dylan, Scheme

Smalltalk, Objective-C, Swift

Elixir, Ruby 1.x, Clojure

Ada

Conclusion

2019-03-23 22:21 -0300. Tags: comp, prog, lisp, hel, fenius, em-portugues

Árvores

Mas pra que serve?

1º passo: a função

2º passo: a transformação

3º passo: A macro

Conclusão

2019-03-20 23:31 -0300. Tags: comp, prog, pldesign, lisp, hel, fenius, in-english

Open problems

2019-03-08 01:16 -0300. Tags: comp, prog, pldesign, lisp, hel, fenius, in-english

A taste of syntax

How does it work

Operator precedence

Gotchas

Constituents vs. phrases

Commas delimit phrases

Function calls vs. constituent+tuple

That's all for now

2019-01-04 00:54 -0200. Tags: comp, prog, pldesign, lisp, hel, fenius, in-english

Main menu

Recent posts

Recent comments

Tags

Elsewhere

Quod vide

Posts com a tag: `prog`

`do`

`redo`

`return`