Elmord's Magic Valley

Posts com a tag: `pldesign`

Updates on Fenius and life

2024-03-26 15:19 +0000. Tags: comp, prog, pldesign, fenius, lisp, life, in-english

Fenius

Over the last couple of months (but mainly over the last four weeks or so), I’ve been working on the Fenius interpreter, refactoring it and adding features. The latest significant feature was the ability to import Common Lisp packages, and support for keyword arguments in a Common-Lisp-compatible way, i.e., f(x, y=z) ends up invoking (f x :y z), i.e., f with three arguments, x, the keyword :y, and z. Although this can lead to weird results if keyword arguments are passed where positional arguments are expected or vice-versa (a keyword like :y may end up being interpreted as a regular positional value rather than as the key of the next argument), the semantics is exactly the same as in Common Lisp, which means we can call Common Lisp functions from Fenius (and vice-versa) transparently. Coupled with the ability to import Common Lisp packages, this means that we can write some useful pieces of code even though Fenius still doesn’t have much in its standard library. For example, this little script accepts HTTP requests and responds with a message and the parsed data from the request headers (yes, I know that it’s not even close to fully supporting the HTTP standard, but this is just a demonstration of what can be done):

# Import the Common Lisp standard functions, as well as SBCL's socket library.
let lisp = importLispPackage("COMMON-LISP")
let sockets = importLispPackage("SB-BSD-SOCKETS")

# We need a few Common Lisp keywords (think of it as constants)
# to pass to the socket library.
let STREAM = getLispValue("KEYWORD", "STREAM")
let TCP = getLispValue("KEYWORD", "TCP")

# Import an internal function from the Fenius interpreter.
# This should be exposed in the Fenius standard library, but we don't have much
# of a standard library yet.
let makePort = getLispFunction("FENIUS", "MAKE-PORT")

# Add a `split` method to the builtin `Str` class.
# This syntax is provisional (as is most of the language anyway).
# `@key start=0` defines a keyword argument `start` with default value 0.
method (self: Str).split(separator, @key start=0) = {
    if start > self.charCount() {
        []
    } else {
        let position = lisp.search(separator, self, start2=start)
        let end = (if position == [] then self.charCount() else position)
        lisp.cons(
            lisp.subseq(self, start, end),
            self.split(separator, start=end+separator.charCount()),
        )
    }
}

# Listen to TCP port 8000 and wait for requests.
let main() = {
    let socket = sockets.makeInetSocket(STREAM, TCP)
    sockets.socketBind(socket, (0,0,0,0), 8000)
    sockets.socketListen(socket, 10)

    serveRequests(socket)
}

# Process one request and call itself recursively to loop.
let serveRequests(socket) = {
    print("Accepting connections...")

    let client = sockets.socketAccept(socket)
    print("Client: ", client)
    let clientStream = sockets.socketMakeStream(client, input=true, output=true)
    let clientPort = makePort(stream=clientStream, path="<client>")
    let request = parseRequest(clientPort)

    clientPort.print("HTTP/1.0 200 OK")
    clientPort.print("")
    clientPort.print("Hello from Fenius!")
    clientPort.print(request.repr())

    lisp.close(clientStream)
    sockets.socketClose(client)

    serveRequests(socket)
}

# Remove the "\r" from HTTP headers. We don't have "\r" syntax yet, so we call
# Common Lisp's `(code-char 13)` to get us a \r character (ASCII value 13).
let strip(text) = lisp.remove(lisp.codeChar(13), text)

# Define a structure to contain data about an HTTP request.
# `@key` defines the constructor as taking keyword (rather than positional) arguments.
record HttpRequest(@key method, path, headers)

# Read an HTTP request from the client socket and return an HttpRequest value.
let parseRequest(port) = {
    let firstLine = strip(port.readLine()).split(" ")
    let method = firstLine[0]
    let path = firstLine[1]
    let protocolVersion = firstLine[2]

    let headers = parseHeaders(port)

    HttpRequest(method=method, path=path, headers=headers)
}

# Parse the headers of an HTTP request.
let parseHeaders(port) = {
    let line = strip(port.readLine())
    if line == "" {
        []
    } else {
        let items = line.split(": ") # todo: split only once
        let key = items[0]
        let value = items[1]
        lisp.cons((key, value), parseHeaders(port))
    }
}

main()

Having reached this stage, it’s easier for me to just start trying to use the language to write small programs and get an idea of what is missing, what works well and what doesn’t, and so on.

One open question going forward is how much I should lean on Common Lisp compatibility. In one direction, I might go all-in into compatibility and integration into the Common Lisp ecosystem. This would give Fenius easy access to a whole lot of existing libraries, but on the other hand would limit how much we can deviate from Common Lisp semantics, and the language might end up being not much more than a skin over Common Lisp, albeit with a cleaner standard library. That might actually be a useful thing in itself, considering the success of ReasonML (which is basically a skin over OCaml).

In the opposite direction, I might try to not rely on Common Lisp too much, which means having to write more libraries instead of using existing ones, but also opens up the way for a future standalone Fenius implementation.

Life

I quit my job about 6 months ago. My plan was to relax a bit and work on Fenius (among other things), but I’ve only been able to really start working on it regularly over the last month. I’ve been mostly recovering from burnout, and only recently have started to get back my motivation to sit down and code things. I’ve also been reading stuff on Old Chinese (and watching a lot of great videos from Nathan Hill’s channel), and re-reading some Le Guin books, as well as visiting and hosting friends and family.

I would like to go on with this sabbatical of sorts, but unfortunately money is finite, my apartment rental contract ends by the end of July, and the feudal lord wants to raise the rent by over 40%, which means I will have to (1) get a job in the upcoming months, and (2) probably move out of Lisbon. I’m thinking of trying to find some kind of part-time job, or go freelancing, so I have extra time and braincells to work on my personal projects. We will see how this plays out.

EOF

That’s all for now, folks! See you next time with more thoughts on Fenius and other shenanigans.

Comentários / Comments

Adventures with Fenius and Common Lisp

2023-01-22 00:05 +0000. Tags: comp, prog, pldesign, fenius, lisp, in-english

I started playing with Fenius (my hobby, vaporware programming language) again. As usual when I pick up this project again after a year or two of hiatus, I decided to restart the whole thing from scratch. I currently have a working parser and a very very simple interpreter that is capable of running a factorial program. A great success, if you ask me.

This time, though, instead of doing it in Go, I decided to give Common Lisp a try. It was good to play a bit with Go, as I had wanted to become more familiar with that language for a long time, and I came out of the experience with a better idea of what the language feels like and what are its strong and weak points. But Common Lisp is so much more my type of thing. I like writing individual functions and testing and experimenting with them as I go, rather than writing one whole file and then running it. I like running code even before it’s complete, while some functions may still be missing or incomplete, to see if the parts that are finished work as expected, and to modify the code according to these partial results. Common Lisp is made for this style of development, and it’s honestly the only language I have ever used where this kind of thing is not an afterthought, but really a deeply ingrained part of the language. (I think Smalltalk and Clojure are similar in this respect, but I have not used them.) Go is very much the opposite of this; as I discussed in my previous Go post, the language is definitely not conceived with the idea that running an incomplete program is a useful thing to do.

Common Lisp macros, and the ability to run code at compile time, also opens up some interesting ways to structure code. One thing I’m thinking about is to write a macro to pattern-match on AST nodes, which would make writing the interpreter more convenient than writing lots of field access and conditional logic to parse language constructs. But I still have quite a long way to go before I can report on how that works out.

What kind of language I’m trying to build?

This is a question I’ve been asking myself a lot lately. I’ve come to realize that I want many different, sometimes conflicting things from a new language. For example, I would like to be able to use it to write low-level things such as language runtimes/VMs, where having control of memory allocation would be useful, but I would also like to not care about memory management most of the time. I would also like to have some kind of static type system, but to be able to ignore types when I wish to.

In the long term, this means that I might end up developing multiple programming languages along the way focusing on different features, or maybe even two (or more) distinct but interoperating programming languages. Cross-language interoperability is a long-standing interest of mine, in fact. Or I might end up finding a sweet spot in the programming language design space that satisfies all my goals, but I have no idea what that would be like yet.

In the short term, this means I need to choose which aspects to focus on first, and try to build a basic prototype of that. For now, I plan to focus on the higher-level side of things (dynamically-typed, garbage-collected). It is surprisingly easier to design a useful dynamic programming language than a useful static one, especially if you already have a dynamic runtime to piggy-back on (Common Lisp in my case). Designing a good static type system is pretty hard. For now, the focus should be on getting something with about the same complexity as R7RS-small Scheme, without the continuations.

F-expressions

One big difference between Scheme/Lisp and Fenius, however, is the syntax. Fenius currently uses the syntax I described in The Lispless Lisp. This is a more “C-like” syntax, with curly braces, infix operators, the conventional f(x,y) function call syntax, etc., but like Lisp S-expressions, this syntax can be parsed into an abstract syntax tree without knowing anything about the semantics of specific language constructs. I’ve been calling this syntax “F-expressions” (Fenius expressions) lately, but maybe I’ll come up with a different name in the future.

If you are not familiar with Lisp and S-expressions, think of YAML. YAML allows you to represent elements such as strings, lists and dictionaries in an easy-to-read (sorta) way. Different programs use YAML for representing all kinds of data, such as configuration files, API schemas, actions to run, etc., but the same YAML library can be used to parse or generate those files without having to know anything about the specific purpose of the file. In this way, you can easily write scripts that consume or produce YAML for these programs without having to implement parsing logic specific for each situation. F-expressions are the same, except that they are optimized for representing code: instead of focusing on representing lists and dictionaries, you have syntax for representing things like function calls and code blocks. This means you can manipulate Fenius source code with about the same ease you can manipulate YAML.

(Lisp’s S-expressions work much the same way, except they use lists (delimited by parentheses) as the main data structure for representing nested data.)

Fenius syntax is more complex than Lisp-style atoms and lists, but it still has a very small number of elements (8 to be precise: constants, identifiers, phrases, blocks, lists, tuples, calls and indexes). This constrains the syntax of the language a bit: all language constructs have to fit into these elements. But the syntax is flexible enough to accomodate a lot of conventional language constructs (see the linked post). Let’s see how that will work out.

One limitation of this syntax is that in constructions like if/else, the else has to appear in the same line as the closing brace of the then-block, i.e.:

if x > 0 {
    print("foo")
} else {
    print("bar")
}

Something like:

if x > 0 {
    print("foo")
}
else {
    print("bar")
}

doesn’t work, because the else would be interpreted as the beginning of a new command. This is also one reason why so far I have preferred to use braces instead of indentation for defining blocks: with braces it’s easier to tell where one command like if/else or try/except ends through the placement of the keyword in the same line as the closing brace vs. in the following line. One possibility that occurs to me now is to use a half-indentation for continuation commands, i.e.:

if x > 0:
    print("foo")
  else:
    print("bar")

but this seems a bit ~~cursed~~ error-prone. Another advantage of the braces is that they are more REPL-friendly: it’s easier for the REPL to know when a block is finished and can be executed. By contrast, the Python REPL for example uses blank lines to determine when the input is finished, which can cause problems when copy-pasting code from a file. Copy-pasting from the REPL into a file is also easier, as you can just paste the code anywhere and tell your text editor to reindent the whole code. (Unlike the Python REPL, which uses ... as an indicator that it’s waiting for more input, the Fenius REPL just prints four spaces, which makes it much easier to copy multi-line code typed in the REPL into a file.)

Versioning

Fenius (considered as a successor of Hel) is a project that I have started from scratch and abandoned multiple times in the past. Every time I pick it up again, I generally give it a version number above the previous incarnation: the first incarnation was Hel 0.1, the second one (which was a completely different codebase) was Hel 0.2, then Fenius 0.3, then Fenius 0.4.

This numbering scheme is annoying in a variety of ways. For one, it suggests a continuity/progression that does not really exist. For another, it suggests a progression towards a mythical version 1.0. Given that this is a hobby project, and of a very exploratory nature, it’s not even clear what version 1.0 would be. It’s very easy for even widely used, mature projects to be stuck in 0.x land forever; imagine a hobby project that I work on and off, and sometimes rewrite from scratch in a different language just for the hell of it.

To avoid these problems, I decided to adopt a CalVer-inspired versioning scheme for now: the current version is Fenius 2023.a.0. In this scheme, the three components are year, series, micro.

The year is simply the year of the release. It uses the 4-digit year to make it very clear that it is a year and not just a large major version.

The series is a letter, and essentially indicates the current “incarnation” of Fenius. If I decide to redo the whole thing from scratch, I might label the new version 2023.b.0. I might also bump the version to 2023.b.0 simply to indicate that enough changes have accumulated in the 2023.a series that it deserves to be bumped to a new series; but even if I don’t, it will eventually become 2024.a.0 if I keep working on the same series into the next year, so there is no need to think too much about when to bump the series, as it rolls over automatically every year anyway.

The reason to use a letter instead of a number here is to make it even less suggestive of a sequential progression between series; 2023.b might be a continuation of 2023.a, or it might be a completely separate thing. In fact it’s not unconceivable that I might work on both series at the same time.

The micro is a number that is incremented for each new release in the same series. A micro bump in a given series does imply a sequential continuity, but it does not imply anything in terms of compatibility with previous versions. Anything may break at any time.

Do I recommend this versioning scheme for general use? Definitely not. But for a hobby project that nothing depends on, this scheme makes version numbers both more meaningful and less stressful for me. It’s amazing how much meaning we put in those little numbers and how much we agonize over them; I don’t need any of that in my free time.

(But what if Fenius becomes a widely-used project that people depend on? Well, if and when this happens, I can switch to a more conventional versioning scheme. That time is certainly not anywhere near, though.)

Implementation strategies

My initial plan is to make a rudimentary AST interpreter, and then eventually have a go at a bytecode interpreter. Native code compilation is a long-term goal, but it probably makes more sense to flesh out the language first using an interpreter, which is generally easier to change, and only later on to make an attempt at a serious compiler, possibly written in the language itself (and bootstrapped with the interpreter).

Common Lisp opens up some new implementation strategies as well. Instead of writing a native code compiler directly, one possibility is to emit Lisp code and call SBCL’s own compiler to generate native code. SBCL can generate pretty good native code, especially when given type declarations, and one of Fenius’ goals is to eventually have an ergonomic syntax for type declarations, so this might be interesting to try out, even if I end up eventually writing my own native code compiler.

This also opens up the possibility of using SBCL as a runtime platform (in much the same way as languages like Clojure run on top of the JVM), and thus integrating into the Common Lisp ecosystem (allowing Fenius code to call Common Lisp and vice-versa). On the one hand, this gives us access to lots of existing Common Lisp libraries, and saves some implementation work. On the other hand, this puts some pressure on Fenius to stick to doing things the same way as Common Lisp for the sake of compatibility (e.g., using the same string format, the same object system, etc.). I’m not sure this is what I want, but might be an interesting experiment along the way. I would also like to become more familiar with SBCL’s internals as well.

EOF

That’s it for now, folks! I don’t know if this project is going anywhere, but I’m enjoying the ride. Stay tuned!

1 comentário / comment

Type parameters and dynamic types: a crazy idea

2020-07-24 22:19 +0100. Tags: comp, prog, pldesign, fenius, in-english

A while ago I wrote a couple of posts about an idea I had about how to mix static and dynamic typing and the problems with that idea. I've recently thought about a crazy solution to this problem, probably too crazy to implement in practice, but I want to write it down before it flees my mind.

The problem

Just to recapitulate, the original idea was to have reified type variables in the code, so that a generic function like:

let foo[T](x: T) = ...

would actually receive T as a value, though one that would be passed automatically by default if not explicitly specified by the programmer, i.e., when calling foo(5), the compiler would have enough information to actually call foo[Int](5) under the hood without the programmer having to spell it out.

The problem is how to handle heterogeneous data structures, such as lists of arbitrary objects. For example, when deserializing a JSON object like [1, "foo", true] into a List[T], there is no value we can give for T that carries enough information to decode any element of the list.

The solution

The solution I had proposed in the previous post was to have a Dynamic type which encapsulates the type information and the value, so you would use a List[Dynamic] here. The problem is that every value of the list has to be wrapped in a Dynamic container, i.e., the list becomes [Dynamic(1), Dynamic("foo"), Dynamic(true)].

But there is a more unconventional possibility hanging around here. First, the problem here is typing a heterogeneous sequence of elements as a list. But there is another sequence type that lends itself nicely for this purpose: the tuple. So although [1, "foo", true] can't be given a type, (1, "foo", true) can be given the type Tuple[Int, Str, Bool]. The problem is that, even if the Tuple type parameters are variables, the quantity of elements is fixed statically, i.e., it doesn't work for typing an arbitrarily long list deserialized from JSON input, for instance. But what if I give this value the type Tuple[*Ts], where * is the splice operator (turns a list into multiple arguments), and Ts is, well, a list of types? This list can be given an actual type: List[Type]. So now we have these interdependent dynamic types floating around, and to know the type of the_tuple[i], the type stored at Ts[i] has to be consulted.

I'm not sure how this would work in practice, though, especially when constructing this list. Though maybe in a functional setting, it might work. Our deserialization function would look like (in pseudo-code):

let parse_list(input: Str): Tuple[*Ts] = {
    if input == "" {
        ()
        # Returns a Tuple[], and Ts is implicitly [].
    } elif let (value, rest) = parse_integer(input) {
        (value, *parse_list(rest))
        # If parse_list(rest) is of type Tuple[*Ts],
        # (value, *parse_list(rest)) is of type Tuple[Int, *Ts].
    } ...
}

For dictionaries, things might be more complicated; the dictionary type is typically along the lines of Dict[KeyType, ValueType], and we are back to the same problem we had with lists. But just as heterogeneous lists map to tuples, we could perhaps map heterogeneous dictionaries to… anonymous records! So instead of having a dictionary {"a": 1, "b": true} of type Dict[Str, T], we would instead have a record (a=1, b=true) of type Record[a: Int, b: Bool]. And just as a dynamic list maps to Tuple[*Ts], a dynamic dictionary maps to Record[**Ts], where Ts is a dictionary of type Dict[Str, Type], mapping each record key to a type.

Could this work? Probably. Would it be practical or efficient? I'm not so sure. Would it be better than the alternative of just having a dynamic container, or even specialized types for dynamic collections? Probably not. But it sure as hell is an interesting idea.

Comentários / Comments

Type parameters and dynamic types

2020-05-30 17:27 +0100. Tags: comp, prog, pldesign, fenius, in-english

In the previous post, I discussed an idea I had for handling dynamic typing in a primarily statically-typed language. In this post, I intend to first, describe the idea a little better, and second, explain what are the problems with it.

The idea

The basic idea is:

Functions can take type arguments as well as value arguments.
Type arguments are passed as actual concrete values that can be consulted at run-time to determine what methods the type supports.
Any value argument without an explicitly declared type gets an implicit type argument associated with it.

For example, consider a function signature like:

let f[A, B](arg1: Int, arg2: A, arg3: B, arg4): Bool = ...

This declares a function f with two explicit type parameters A and B, and four regular value parameters arg1 to arg4. arg1 is declared with a concrete Int type. arg2 and arg3 are declared as having types passed in as type parameters. arg4 does not have an explicit type, so in effect it behaves as if the function had an extra type parameter C, and arg4 has type C.

When the function is called, the type arguments don't have to be passed explicitly; rather, they will be automatically provided by the types of the expressions used as arguments. So, if I call f(42, "hello", 1.0, True), the compiler will implicitly pass the types Str and Float for A and B, as well as Bool for the implicit type parameter C.

In the body of f, whenever the parameters with generic types are used, the corresponding type parameters can be consulted at run-time to find the approprate methods to call. For example, if arg2.foo() is called, a lookup for the method foo inside A will happen at run-time. This lookup might fail, in which case we would get an exception.

This all looks quite beautiful.

The problem

The problem is when you introduce generic data structures into the picture. Let's consider a generic list type List[T], where T is a type parameter. Now suppose you have a list like [42, "hello", 1.0, True] (which you might have obtained from deserializing a JSON file, for instance). What type can T be? The problem is that, unlike the case for functions, there is one type variable for multiple elements. If all type information must be encoded in the value of the type parameter, there is no way to handle a heterogeneous list like this.

Having a union type here (say, List[Int|Str|Float|Bool]) will not help us, because union types require some way to distinguish which element of the union a given value belongs to, but the premise was for all type information to be carried by the type parameter so you could avoid encoding the type information into the value.

For a different example, consider you want to have a list objects satisfying an interface, e.g., List[JSONSerializable]. Different elements of the list may have different types, and therefore different implementations of the interface, and you would need type information with each individual element to be able to know at run-time where to find the interface implementation for each element.

Could this be worked around? One way would be to have a Dynamic type, whose implementation would be roughly:

record Dynamic(
    T: Type,
    value: T,
)

The Dynamic type contains a value and its type. Note that the type is not declared as a type parameter of Dynamic: it is a member of Dynamic. The implication is that a value like Dynamic(Int, 5) is not of type Dynamic[Int], but simply Dynamic: there is a single Dynamic type container which can hold values of any type and carries all information about the value's type within itself. (I believe this is an existential type, but I honestly don't know enough type theory to be sure.)

Now our heterogeneous list can simply be a List[Dynamic]. The problem is that to use this list, you have to wrap your values into Dynamic records, and unwrap them to use the values. Could it happen implicitly? I'm not really sure. Suppose you have a List[Dynamic] and you want to pass it to a function expecting a List[Int]. We would like this to work, if we want static and dynamic code to run along seamlessly. But this is not really possible, because the elements of a List[Dynamic] and a List[Int] have different representations. You would have to produce a new list of integers from the original one, unwrapping every element of the original list out of its Dynamic container. The same would happen if you wanted to pass a List[Int] to a function expecting a List[Dynamic].

All of this may be workable, but it is a different experience from regular gradual typing where you expect this sort of mixing and matching of static and dynamic code to just work.

[Addendum (2020-05-31): On the other hand, if I had an ahead-of-time statically-typed compiled programming language that allowed me to toss around types like this, including allowing user-defined records like Dynamic, that would be really cool.]

EOF

That's all I have for today, folks. In a future post, I intend to explore how interfaces work in a variety of different languages.

4 comentários / comments

Types and Fenius

2020-05-19 21:35 +0100. Tags: comp, prog, pldesign, fenius, in-english

Hello, fellow readers! In this post, I will try to write down some ideas that have been haunting me about types, methods and namespaces in Fenius.

I should perhaps start with the disclaimer that nothing has really happened in Fenius development since last year. I started rewriting the implementation in Common Lisp recently, but I only got to the parser so far, and the code is still not public. I have no haste in this; life is already complicated enough without one extra thing to feel guilty about finishing, and the world does not have a pressing need for a new programming language either. But I do keep thinking about it, so I expect to keep posting ideas about programming language design here more or less regularly.

So, namespaces

A year ago, I pondered whether to choose noun-centric OO (methods belong to classes, as in most mainstream OO languages) or verb-centric OO (methods are independent entities grouped under generic functions, as in Common Lisp). I ended up choosing noun-centric OO, mostly because classes provide a namespace grouping related methods, so:

You don't have to deal with name conflicts if two classes define independent methods with the same name;
You don't have to import each method separately. In fact you don't even have to import the class: if you have an object, you can call its methods without even knowing which class it belongs to (in a dynamic language context).

This choice has a number of problems, though; it interacts badly with other features I would like to have in Fenius. Consider the following example:

Suppose I have a bunch of classes that I want to be able to serialize to JSON. Some of these classes may be implemented by me, so I can add a to_json() method to them, but others come from third-party code that I cannot change. Even if the language allows me to add new methods to existing classes, I would rather not add a to_json() method to those classes because they might, in the future, decide to implement their own to_json() method, possibly in a different way, and I would be unintentionally overriding the library method which others might depend on.

What I really want is to be able to declare an interface of my own, and implement it in whatever way I want for any class (much like a typeclass in Haskell, or a trait in Rust):

from third_party import Foo

interface JSONSerializable {
    let to_json()
}

implement JSONSerializable for Foo {
    let to_json() = {
         ...
    }
}

In this way, the interface serves as a namespace for to_json(), so that even if Foo implements its own to_json() in the future, it would be distinct from the one I defined in my interface.

The problem is: if I have an object x of type Foo and I call x.to_json(), which to_json() is called?

One way to decide that would be by the declared type of x: if it's declared as Foo, it calls Foo's to_json(), and JSONSerializable's to_json() is not even visible. If it's declared as JSONSerializable, then the interface's method is called. The problem is that Fenius is supposed to be a dynamically-typed language: the declared (static) type of an object should not affect its dynamic behavior. A reference to an object, no matter how it was obtained, should be enough to access all of the object's methods.

Solution 1: Interface wrappers

One way to conciliate things would be to make it so that the interface wraps the implementing object. By this I mean that, if you have an object x of type Foo, you can call JSONSerializable(x) to get another object, of type JSONSerializable, that wraps the original x, and provides the interface's methods.

Moreover, function type declarations can be given the following semantics: if a function f is declared as receiving a parameter x: SomeType, and it's called with an argument v, x will be bound to the result of SomeType.accept(v). For interfaces, the accept method returns an interface wrapper for the given object, if the object belongs to a class implementing the interface. Other classes can define accept in any way they want to implement arbitrary casts. The default implementation for class.accept(v) would be to return v intact if it belongs to class, and raise an exception if it doesn't.

Solution 2: Static typing with dynamic characteristics

Another option is to actually go for static typing, but in a way that still allows dynamic code to co-exist more or less transparently with it.

In this approach, which methods are visible in a given dot expression x.method is determined by the static type of x. One way to see this is that x can have multiple methods, possibly with the same name, and the static type of x acts like a lens filtering a specific subset of those methods.

What happens, then, when you don't declare the type of the variable/parameter? One solution would be implicitly consider those as having the basic Object type, but that would make dynamic code extremely annoying to use. For instance, if x has type Object, you cannot call x+1 because + is not defined for Object.

Another, more interesting solution, is to consider any untyped function parameter as a generic. So, if f(x) is declared without a type for x, this is implicitly equivalent to declaring it as f(x: A), for a type variable A. If this were a purely static solution, this would not solve anything: you still cannot call addition on a generic value. But what if, instead, A is passed as a concrete value, implicitly, to the function? Then our f(x: A) is underlyingly basically f(x: A, A: Type), with A being a type value packaging the known information about A. When I call, for instance, f(5), under the hood the function is called like f(5, Int), where Int packages all there is to know about the Int type, including which methods it supports. Then if f's body calls x+1, this type value can be consulted dynamically to look up for a + method.

Has this been done before? Probably. I still have to do research on this. One potential problem with this is how the underlying interface of generic vs. non-generic functions (in a very different sense of 'generic function' from CLOS!) may differ. This is a problem for functions taking functions as arguments: if your function expects an Int -> Int function as argument and I give it a A -> Int function instead, that should work, but underlyingly an A -> Int takes an extra argument (the A type itself). This is left as an exercise for my future self.

Gradual typing in reverse

One very interesting aspect of this solution is that it's basically the opposite of typical gradual typing implementations: instead of adding static types to a fundamentally dynamic language, this adds dynamic powers to a fundamentally static system. All the gradual typing attempts I have seen so far try to add types to a pre-existing dynamic language, which makes an approach like this one less palatable since one wants to be able to give types to code written in a mostly dynamic style, including standard library functions. But if one is designing a language from scratch, one can design it in a more static-types-friendly way, which would make this approach more feasible.

I wonder if better performance can be achieved in this scenario, since in theory the static parts of the code can happily do their stuff without ever worrying about dynamic code. I also wonder if boxing/unboxing of values when passing them between the dynamic and static parts of the code can be avoided as well, since all the extra typing information can be passed in the type parameter instead. Said research, as always, will require more and abundant funding.

Comentários / Comments

Functional record updates in Fenius, and other stories

2019-06-16 17:33 -0300. Tags: comp, prog, pldesign, fenius, in-english

Fenius now has syntax for functional record updates! Records now have a with(field=value, …) method, which allows creating a new record from an existing one with only a few fields changed. For example, if you have a record:

fenius> record Person(name, age)
<class `Person`>
fenius> let p = Person("Hildur", 22)
Person("Hildur", 22)

You can now write:

fenius> p.with(age=23)
Person("Hildur", 23)

to obtain a record just like p but with a different value for the age field. The update is functional in the sense that the p is not mutated; a new record is created instead. This is similar to the with() method in dictionaries.

Another new trick is that now records can have custom printers. Printing is now performed by calling the repr_to_port(port) method, which can be overridden by any class. Fenius doesn't yet have much of an I/O facility, but we can cheat a bit by importing the functions from the host Scheme implementation:

fenius> record Point(x, y)
<class `Point`>
fenius> import chezscheme

# Define a new printing function for points.
fenius> method Point.repr_to_port(port) = {
            chezscheme.fprintf(port, "<~a, ~a>", this.x, this.y)
        }

# Now points print like this:
fenius> Point(1, 2)
<1, 2>

A native I/O API is coming Real Soon Now™.

Comentários / Comments

Questions, exclamations, and binaries

2019-06-03 21:39 -0300. Tags: comp, prog, pldesign, fenius, in-english

I'm a bit tired today, so the post will be short.

`ready? go!`

In Scheme, it is conventional for procedures returning booleans to have names ending in ? (e.g., string?, even?), and for procedures which mutate their arguments to have names ending in ! (e.g., set-car!, reverse!). This convention has also been adopted by other languages, such as Ruby, Clojure and Elixir.

I like this convention, and I've been thinking of using it in Fenius too. The problem is that ? and ! are currently operator characters. ? does not pose much of a problem: I don't use it for anything right now. !, however, is a bit of a problem: it is part of the != (not-equals) operator. So if you write a!=b, it would be ambiguous whether the ! should be interpreted as part of an identifier a!, or as part of the operator !=. So my options are:

Consider ! exclusively as an identifier character. This would mean changing != to something else, like /= (as in Haskell or Common Lisp) or <> (as in Pascal) or ~= (as in Lua). But != is familiar to a lot of people, and I'm not sure it's a good idea to change it.
Consider ! exclusively as an operator character. Then the ! convention is gone and has to be replaced by something else. Currently, when you import a Scheme module into Fenius, the names are converted such that names ending in ? are prefixed with is (e.g., even? becomes isEven), and names ending in ! are prefixed with do (e.g., reverse! becomes doReverse). [Update: I will be switching to snake_case rather than camelCase soon.]
Consider ! as an identifier character if it follows with other identifier characters, but as an operator character when not preceded by other identifier characters (in much the same way as digits are considered part of an identifier if they immediately follow other identifier characters, but as numbers if not). So you would have to write a != b with spaces instead. I don't like this very much.

What do you think? Which of these you like best? Do you have other ideas? Feel free to comment.

Binaries available

In other news, I started to make available a precompiled Fenius binary (amd64 Linux), which you can try out without having to install Chez Scheme first. You should be aware that the interpreter is very brittle at this stage, and most error messages are in terms of the underlying implementation rather than something meaningful for the end user, so use it at your own peril. But it does print stack traces in terms of the Fenius code, so it's not all hopeless.

Comentários / Comments

Pattern matching and AST manipulation in Fenius

2019-05-30 19:40 -0300. Tags: comp, prog, pldesign, fenius, in-english

Fenius has pattern matching! This means you can now write code like this:

record Rectangle(width, height)
record Triangle(base, height)
record Circle(radius)

let pi = 355/113    # We don't have float syntax yet :(

let area(shape) = {
    match shape {
        Rectangle(width, height) => width * height
        Triangle(base, height) => base * height / 2
        Circle(radius) =>  pi * radius * radius
    }
}

print(area(Rectangle(4, 5)))
print(area(Triangle(3, 4)))
print(area(Circle(10)))

More importantly, you can now pattern match over ASTs (abstract syntax trees). This is perhaps the most significant addition to Fenius so far. It means that the code for the for macro from this post becomes:

# Transform `for x in items { ... }` into `foreach(items, fn (x) { ... })`.
let for = Macro(fn (ast) {
    match ast {
        ast_match(for _(var) in _(items) _(body)) => {
            ast_gen(foreach(_(items), fn (_(var)) _(body)))
        }
    }
})

This is a huge improvement over manually taking apart the AST and putting a new one together, and it basically makes macros usable.

It still does not handle hygiene: it won't prevent inserted variables from shadowing bindings in the expansion site, and will break if you shadow the AST constructors locally. But that will come later. (The AST constructors will move to their own module eventually, too.)

The _(var) syntax is a bit of a hack. I wanted to use some operator, like ~var or $var, but the problem is that all operators in Fenius can be interpreted as either infix or prefix depending on context, so in for $var would be interpreted as an infix expression for $ var, and you would have to parenthesize everything. One solution to this is to consider some operators (like $) as exclusively prefix. I will think about that.

How does it work?

I spent a good while hitting my head against the whole meta-ness of the ast_match/ast_gen macros. In fact I'm still hitting my head against it even though I have already implemented them. I'll try to explain them here (to you and to myself).

ast_match(x) is a macro that generates a pattern that would match the AST for x. So, for example, ast_match(f(x)) generates a pattern that would match the AST for f(x). Which pattern is that? Well, it's:

Call(_, Identifier(_, `f`), [Identifier(_, `x`)])

That's what you would have to write on the left-hand side of the => in a match clause to match the AST for f(x). (The _ patterns are to discard the location information, which is the first field of every AST node. ast_gen is just like ast_match but does not discard location information.) So far, so good.

But here's the thing: that's not what the macro has to output. That's what you would have to write in the source code. The macro has to output the AST for the pattern. This means that where the pattern has, say, Identifier, the macro actually has to output the AST for that, i.e., Identifier(nil, `Identifier`). And for something like:

Identifier(_, `f`)

i.e., a call to the Identifier constructor, the macro has to output:

Call(nil, Identifier(nil, `Identifier`),
          [Identifier(nil, `_`), Constant(nil, `f`)])

and for the whole AST of f(x), i.e.:

Call(_, Identifier(_, `f`), [Identifier(_, `x`)])

the macro has to output this monstrosity:

Call(nil, Identifier(nil, `Call`),
     [Identifier(nil, `_`),
      Call(nil, Identifier(nil, `Identifier`),
                [Identifier(nil, `_`), Constant(nil, `f`)]),
      Array(nil, [Call(nil, Identifier(nil, `Identifier`),
                            [Identifier(nil, `_`), Constant(nil, `x`)])])])

All of this is to match f(x). It works, is all encapsulated inside the ast_* macros (so the user doesn't have to care about it), and the implementation is not even that much code, but it's shocking how much complexity is behind it.

Could it have been avoided? Perhaps. I could have added a quasiquote pattern of sorts, which would be treated especially by match; when matching quasiquote(ast), the matching would happen against the constructors of ast itself, rather than the code it represents. Then I would have to implement separate logic for quasiquote outside of a pattern (e.g., on the right-hand side). In the end, I think it would require much more code. ast_match/ast_gen actually share all the code (they call the same internal meta-expand function, with a different value for a "keep location information" boolean argument), and requires no special-casing in the match form: from match's perspective, it's just a macro that expands to a pattern. You can write macros that expand to patterns and use them in the left-hand side of match too.

(I think I'll have some observations on how all of this relates/contrasts to Lisp in the future, but I still have not finished digesting them, and I'm tracking down some papers/posts I read some time ago which were relevant to this.)

Missing things

The current pattern syntax has no way of matching against a constant. That is:

match false {
    true => "yea"
    false => "nay"
}

binds true (as a variable) to false and returns "yea". I still haven't found a satisfactory way of distinguishing variables from constants (which are just named by identifiers anyway). Other languages do various things:

In Haskell, constant constructors have an uppercase initial, so you can distinguish them this way.
Scala also uses case to distinguish constants from pattern variables, except Scala constants are not actually required to have an uppercase initial; if they don't, you can't match against them, apparently. [Update: No, you can put the constant in `backticks` in this case. What you apparently can't do is use an uppercase name as a pattern variable.]
Rust considers enum elements as constants in patterns. This means that the decision is non-local: depending on the environment where the pattern appears, the same identifier may be a constant or a pattern variable.
- A previous incarnation of Rust used a period to distinguish nullary enum constructors in patterns (i.e., you had to write None. instead of None).
Swift requires let to appear before variables; without let, they are considered constants.

One thing that occurred to me is to turn all constructors into calls (i.e., you'd write true() and false(), not only in patterns but everywhere), which would make all patterns unambiguous, but that seems a bit annoying.

Rust's solution seems the least intrusive, but Fenius does not really have a syntactically separate class of "constructors" (as opposed to just variables bound to a constant value), and considering all bound variables as constants in patterns makes patterns too fragile (if you happen to add a global variable – or worse, a new function in the base library – with the same name as a variable currently in use in a pattern, you break the pattern). I'll have to think more about it. Suggestions and comments, as always, are welcome.

Another missing thing is a way to debug patterns: I would like to be able to activate some kind of 'debug mode' for match which showed why a pattern did not match. I think this is feasible, but we'll see in the future.

Comentários / Comments

Partial goals, and other rambles

2019-05-25 22:33 -0300. Tags: comp, prog, pldesign, fenius, ramble, in-english

Designing a programming language is a huge undertaking. The last few days I have been thinking about types and classes and interfaces/traits and I got the feeling I was getting stuck in an analysis paralysis stage again. There is also lots of other questions I have to thinking about, such as: How will I handle mutability, and concurrency, and how those things interact? Can I add methods to classes at runtime? Can I implement interfaces to arbitrary existing types? How do I get a trait implementation 'in scope' / available? How static and dynamic typing will interact? How does this get implemented under the hood? And so on, and so forth…

At the same time, it may seem that not solving those problems undermine the whole point of designing a new language in the first place. If you don't solve the hard problems, why not keep using an existing language?

And that's the recipe for paralysis.

But the thing is, even if you solve part of the problems, you can already have something valuable. If Fenius 0.3 gets to be just a 'better' Scheme with nicer syntax and fewer namespacing problems*, that would be already a language I would like to use.

So what would a Minimum Viable Fenius need?

Records, with single inheritance. We already have those, actually, it's just that there is no syntax for inheritance exposed yet.
Throwing and catching exceptions of a given (record) type.
Basic string capabilities. There is the whole Unicode thing, and the problem of text vs. binary data, but we can use the host Scheme strings as a start.
Basic pattern matching, and particularly pattern matching over ASTs. This is necessary to make macros convenient to write. No need for the whole syntax-rules and hygiene machinery right now.
Basic file manipulation (opening, reading, writing, closing, testing for existence).
An R5RS-comparable standard library. That means basic arithmetic, iteration (map, foreach, filter), that kind of thing.
Fixing up parser corner cases (e.g., add the ability to refer to operators as identifiers, allow expressions to continue in the next line by either adding \ at the end or ending a line with an infix operator).
More robust error handling (and more meaningful error messages).
Fix the module import path situation. (Just allowing to specify the import path (say, via an environment variable) would already be a usable solution.)

These would go a long way already. These are all pretty doable and don't involve much hard thinking. The greater problems can be tackled later, after the language is usable for basic tasks.

I would also love to have basic sockets to try to write web stuff in Fenius, but these will have to wait, especially given that Chez does not come with sockets natively. Thunderchez has a socket library; maybe I can use that. But this will come after the items above are solved. I can write CGI apps in it, anyway, or make a wrapper in another language to receive the connections and pass the data via stdin/stdout.

Also, I sometimes wonder if it wouldn't be more profitable to implement Fenius on the top of Guile instead of Chez, since it has more libraries, and now also has a basic JIT compiler. But at the same time, the ultimate goal is to rewrite the implementation in itself and leave the Scheme dependency behind, so maybe it's better not to depend too much on the host ecosystem. Although the Chez compiler is so good, maybe it would make more sense to compile to Scheme and run on the top of Chez rather than compile to native code. If only it did not have the annoying startup time… we'll see in the future!

_____

* Of course, 'better' is subjective; what I want is for Fenius 0.3 to be a better Scheme for me.

Addendum: And by 'better syntax' I don't even mean the parentheses; my main annoyance when programming in Scheme is the syntax for accessors: the constant repeating of type names, like (person-name p) rather than p.name, (vector-ref v i) rather than v[i], (location-line (AST-location ast)) rather than ast.location.line, and so on. This goes hand in hand with the namespacing issue: because classes define a namespace for their methods, you don't have to care about name conflicts so often.

Comentários / Comments

Persistent hashmaps

2019-05-23 21:36 -0300. Tags: comp, prog, pldesign, fenius, in-english

Fenius got dictionaries! Now you can say:

fenius> let d = dict("foo" => 1, "bar" => 2)
dict("foo" => 1, "bar" => 2)
fenius> d["foo"]
1
fenius> d.with("foo" => 3, "quux" => 4)
dict("foo" => 3, "bar" => 2, "quux" => 4)
fenius> d.without("foo")
dict("bar" => 2)

Dictionaries are immutable: the with and without methods return new dictionaries, rather than modifying the original. I plan to add some form of mutable dictionary in the future, but that will probably wait until I figure Fenius's mutability story. (Right now, you can mutate the variables themselves, so you can write d := d.with("foo" => 42), for example.)

Dictionaries are persistent hashmaps, much like the ones in Clojure. Fenius's implementation of hashmaps is simpler than Clojure's, because it does less: beside persitent/immutable hashmaps, Clojure also supports transient hashmaps, a mutable data structure onto which entries can be inserted and removed efficiently, which then can be turned into a persistent hashmap. I want to implement something like this in Fenius at some point. On the flip side, the current Fenius implementation is easier to understand than its Clojure counterpart.

The underlying data structure is based on hash array mapped tries (HAMTs). The standard reference for it is Bagwell (2000). The persistent variant used by Clojure, Fenius and other languages is slightly different, but the basic idea is the same. In this post, I intend to explain how it works.

I will begin by explaining an auxiliary data structure used by the hashmap, which I will call a sparse vector here.

Sparse vectors

Suppose you want to create vectors of a fixed size (say, 8 elements), but you know most of the positions of the vector will be empty much of the time. If you create lots of these vectors in the naive way, you will end up wasting a lot of memory with the empty positions.

23 42

A cleverer idea is to split this information into two parts:

a bitmap telling which positions are filled (an integer where the nth bit is set if the nth position is filled); and
a content vector with the values of those positions only.

00001010 23 42

(Note that the first position is represented by the lowest bit.) If we know beforehand that the number of positions is limited (say, 32), we can fit the bitmap into a single integer, which makes for a quite compact representation.

That takes care of representing the vectors. But how do we fetch a value from this data structure given its position index i? First, we need to know if the position i is filled or empty. For that, we bitwise-and the bitmap with a mask containing only the ith bit set, i.e., bitmap & (1 << i) (or, equivalently, bitmap & 2ⁱ). If the result is non-zero, then the ith bit was set in the bitmap, and the element is present. (If it's not, we return a default empty value.)

Once we know the element is there, we need to find its place in the content vector. To do that, we need to know how many elements are there before the one we want, and skip those. For that:

We bitwise-and the bitmap with a mask having all bits for the elements before the one we want set, e.g., if we want (counting from 0) the element 3 (bitmap 00001000), we bitwise-and the element bitmap with 00000111 (having all bits before bit 3 set). That mask happens to be 00001000 - 1, or (1 << i) - 1, or 2ⁱ - 1. So we compute bitmap & ((1 << i) - 1). The result is a bitmap containing 1s only in the positions that are before the one we want and are filled. In our example, that would be:
```
  00001010
& 00000111
  --------
  00000010
```
We now count the number of 1 bits in the result. This is called the integer's Hamming weight, also known as population count. There are clever algorithms to compute it, and modern processors have a dedicated instruction (often called popcount or something similar) to calculate it. Some programming languages also have a function to compute it: R6RS Scheme, for example, has a bitwise-bit-count function for that.

The result of all this is the number of filled positions before the one we want. Because positions are counted from 0, it also happens to be the position of the wanted element in the content vector (e.g., if there is 1 element before the one we want, the one we want is at position 1 (which is the second position of the vector)). In summary:

def sparsevec_get(sparsevec, position):
    if sparsevec.bitmap & (1 << position):
        actual_position = bit_count(sparsevec.bitmap & ((1 << position) - 1))
        return sparsevec.content[actual_position]
    else:
        return SOME_EMPTY_VALUE

To insert an element at position i into the vector, we first figure if position i is already filled. If it is, we find its actual position in the content vector (using the same formula above) and replace its contents with the new value. If it is currently empty, we figure out how many positions are filled before i (again, same formula), and insert the new element just after that (and update the bitmap setting the ith bit). Analogously, to remove the ith element, we check if it is filled, and if it is, we find its position in the content vector, remove it, and clear the ith bit in the bitmap.

In our real implementation, sparse vectors will be persistent, so updates are performed by creating a new vector that is just like the original except for the added or removed elements.

Enough with sparse vectors. Let's get back to HAMTs.

Hash Array-Mapped Tries

A trie ("retrieval tree") is a tree where the path from the root to the desired element is determined by the symbols of the element's key. More specifically:

The element keys are, conceptually, strings, or sequences of symbols, from an alphabet. For example, keys might be sequences of bits (and the alphabet would be {0, 1}), or sequences of letters from a to z, or whatever.
Each node of the trie has one (possibly empty) child for each possible symbol of the alphabet. If the keys are sequences of bits, each node would have two (possibly empty) children, labelled 0 and 1. If the keys are sequences of letters from a to z, each node would have 26 children, and so on.
To find an element in the trie, we start from the root and traverse the trie by consuming sucessive symbols from the key and taking the path to the correspondingly-labelled child in the tree, until we find the element we want or reach an empty node (in which case the key is absent from the trie).

A hash function is a function f such that if two elements x and y are equal, then f(x) = f(y). (The opposite is not true: there are typically infinitely many elements x for which f(x) produces the same result.) The result of f(x) is called a hash of x. A hash is typically a finite sequence of bits, or symbols... from an alphabet. You can see where this is going.

A hash trie is a trie indexed by hashes. To find an element with key x in the trie, you first compute the hash f(x), and then consume successive bits (or symbols) from the hash to decide the path from the root of the trie to the desired element.

In principle, this would mean that the depth of the trie would be the same as the length of the hash. For example, if our hashes have 32 bits, we would have to traverse 32 children until we reach the element, which would be pretty wasteful in both storage and time. However, we don't need to use the entire hash: we can use just enough bits to distinguish the element from all other elements in the trie. For example, if the trie has elements x with hash 0010, y with hash 0100, and z with hash 1000, we build a trie like:

That is: z is the only element whose hash starts with 1, so we can store it right below the root as the child labelled 1. Both x and y have hashes beginning with 0, so we create a subtree under the root's 0 for them; the second bit of the hash is 0 for x and 1 for y, so we store them below the subtree under the appropriate label. We don't need to go any deeper to disambiguate them.

To insert an element in the tree, we compute its hash and start traversing the tree until either:

We find an empty node: in this case we put the element there; or
We find an existing leaf (element) node: this means that the existing and the new element share a prefix of the hash. In this case we have to compute the hash of the existing element and rebuild the tree from there downwards, creating new subtrees for each subsequent hash bit until we get to a bit where the hashes differ (one of the elements has 0 and the other has 1 in that position): then we can put the old and new elements as separate children of the corresponding subtree.

If our hashes have a finite number of bits, it may happen that two distinct elements end up having the same hash. There are two ways of handling this problem:

Bagwell's solution is to use hashes with an infinite number of bits. Roughly, the idea is that instead of having a function f(x) to compute the hash of x, you would have a function f(x, i) to compute the ith bit of the hash of x (or, more efficiently, the ith group of n bits of the hash, for some fixed n), and you can ask for an arbitrarily large i. If your function f is constructed in such a way that every pair of distict elements x and y will generate a distinct sequence of bits (i.e., there will eventually be an i for which f(x, i) and f(y, i) will be different), you have effectively eliminated hash collisions from your life. The problem is that the typical hash functions provided by your favorite programming language don't work like that.
The other option, used by Clojure et al., is to use a collision list whenever you have to store two elements with the same hash in the tree. (The current Fenius implementation goes at bit overboard with this: the leaves of the trees are always lists of elements, even if there is a single element in the leaf. This wastes a bit of memory but makes things a bit simpler. I may change this in the future.)

The only problem with this scheme is that our trees can still get pretty deep as we add more elements to them. For example, if we add 1024 elements (2¹⁰) to the tree, we need at least 10 bits of hash to distinguish them, which means we need to go at least 10 levels deep in the tree to find them; the deeper the tree is, the slower it is to find an element in it. We can reduce the depth if, instead of branching on single bits of the hash, we use groups of bits of a fixed size, say, 5 bits. Then instead of each node having 2 children, each node has 2⁵ = 32 children, labelled {00000, 00001, ..., 11110, 11111}, and we consume 5 bits of the hash for each level we traverse. Now a tree of 2¹⁰ elements will typically be 2 levels deep rather than 10, which makes it much faster to traverse.

The only problem with this scheme is that each node now needs to have space for 32 children, even when the children are empty. For example, if I store two elements, x with hash 00000 00001, and y with hash 00000 00100, the root of the tree will be a node with 32 children, of which only the 00000 child will be non-empty. This child will contain a subtree containing x at position 00001, y at position 00100, and all other 30 positions empty. If only we had a way to only store those positions that are actually filled. A sparse vector, if you will...

Congratulations, we have just invented hash array-mapped tries, or HAMTs. A HAMT is a hash trie in which each non-leaf node is a sparse vector, indexed by a fixed-length chunk of bits from the hash. To find an element in the HAMT, we traverse the sparse vectors, consuming successive chunks from the hash, until we either find the element we want, or we consume the entire hash and reach a collision list (in which case we look for the element inside the list), or we reach an empty child (in which case the element is not found). Because the sparse vector only allocates space for the elements actually present, each node is compact, and because each level is indexed by a large-ish chunk of bits, the tree is shallow. Win win.

The sparse vectors are immutable, and so is our tree. To add an element to the tree, we have to change the nodes from the root to the final place of the element in the tree, which means making copies of them with the desired parts changed. But the nodes that are not changed (i.e., those that are not part of the path from the root to the new element) are not copied: the new tree will just point to the same old nodes (which we can do because we know they won't change). So adding an element to the tree does not require making a full copy of it.

Removing an element from a HAMT requires some care. Basically, we have to replace the node where the element is with an empty child. But if the subtree where the element was had only two elements, and after removal is left with only one, that one element takes the place of the whole subtree (you never have a subtree with a single leaf child in a HAMT, because the purpose of a subtree is to disambiguate multiple elements with the same hash prefix; if there is no other element sharing the same prefix with the element, there is no point in having a subtree: the element could have been stored directly in the level above instead).

Miscellaneous notes

When profiling my implementation in a benchmark inserting a million elements in a HAMT, I discovered that most of the time was spent on an auxiliary function I wrote to copy sequences of elements between vectors (when updating sparse vectors). This would probably be more efficient if R6RS (or Chez) had an equivalent of memcpy for vectors. It does have bytevector-copy!, but not a corresponding vector-copy!. Go figure.

R7RS does have a vector-copy!, but I'm using Chez, which is an R6RS implementation. Moreover, R7RS(-small) does not have bitwise operations. But it totally has gcd (greatest common divisor), lcm (least common multiple) and exact-integer-sqrt. I mean, my idea of minimalism is a bit different. Also, it has a timestamp function which is based on TAI instead of UTC and thus requires taking account of leap seconds, except it's not really guaranteed to return accurate time anyway ("in particular, returning Coordinated Universal Time plus a suitable constant might be the best an implementation can do"). Yay. [/rant]

Implementing transient/mutable HAMTs efficiently is a bit more complicated. For it to be efficient, you need to be able to insert new elements in a sparse vector in-place, but you can only do that if you pre-allocate them with more space than they actually need, so you have room for growing. Finding a proper size and grow factor is left as an exercise to the reader.

Comparing performance with Clojure HAMTs is not a very exact benchmark, because the implementations are not in the same language (Clojure's is in Java and Fenius's is in Chez Scheme). In my tests doing 10M insertions, Clojure sometimes did much faster than Chez, sometimes much slower, with times varying between 16s and 48s; the JVM works in mysterious ways. Chez's fastest was not as fast as Clojure, but performance was consistent across runs (around ~35s). Note that this is the time for using the hashmap implementation from ChezScheme, not Fenius; doing the benchmark directly in Fenius would be much slower because the language is currently interpreted and the interpreter is very naive. Note also that in actual Clojure, you would do all the insertions on a transient hashmap, and then turn it into a persistent hashmap after all the insertions, so the benchmark is not very representative of actual Clojure usage.

End of file

That's all for now, folks. I wanted to discuss some other aspects of Fenius dictionaries (such as syntax), but this post is pretty big already. Enjoy your hashmaps!

Computers, languages, and computer languages. Às vezes em Português, sometimes in English.

Posts com a tag: pldesign

2024-03-26 15:19 +0000. Tags: comp, prog, pldesign, fenius, lisp, life, in-english

Fenius

Life

EOF

2023-01-22 00:05 +0000. Tags: comp, prog, pldesign, fenius, lisp, in-english

What kind of language I’m trying to build?

F-expressions

Versioning

Implementation strategies

EOF

2020-07-24 22:19 +0100. Tags: comp, prog, pldesign, fenius, in-english

The problem

The solution

2020-05-30 17:27 +0100. Tags: comp, prog, pldesign, fenius, in-english

The idea

The problem

EOF

2020-05-19 21:35 +0100. Tags: comp, prog, pldesign, fenius, in-english

So, namespaces

Solution 1: Interface wrappers

Solution 2: Static typing with dynamic characteristics

Gradual typing in reverse

2019-06-16 17:33 -0300. Tags: comp, prog, pldesign, fenius, in-english

2019-06-03 21:39 -0300. Tags: comp, prog, pldesign, fenius, in-english

ready? go!

Binaries available

2019-05-30 19:40 -0300. Tags: comp, prog, pldesign, fenius, in-english

How does it work?

Missing things

2019-05-25 22:33 -0300. Tags: comp, prog, pldesign, fenius, ramble, in-english

2019-05-23 21:36 -0300. Tags: comp, prog, pldesign, fenius, in-english

Sparse vectors

Hash Array-Mapped Tries

Miscellaneous notes

End of file

Main menu

Recent posts

Recent comments

Tags

Elsewhere

Quod vide

Posts com a tag: `pldesign`

`ready? go!`