Elmord's Magic Valley

Posts tagged: `pldesign`

Lisp without cons cells

2016-05-28 13:14 -0300. Tags: comp, prog, pldesign, lisp, ramble, in-english

Okay, I'm gonna write this down now to distract myself for a while before I get back to Master's stuff.

In a recent post I talked about the problem of cross-process garbage collection, and suggested wrapping objects in a reference-counted container when crossing process boundaries as a possible solution, but I remarked that this would have a large overhead when passing many small objects. The prime example would be passing a linked list, as (at least naively) every node of the list would get wrapped as the elements of the list are accessed.

Now, I particularly cared about this case because the linked list (based on cons cells) is a very prominent data structure in Lisp. And although linked lists have some nice properties (they are conceptually simple, and you can insert elements into, or remove them from, the middle or end of a list by mutating the cdrs), they also are not exactly the most efficient data structure in the world: half the memory they use is just for storing the "next" pointer (which also takes up processor cache), whereas in a vector you just need a header of constant size (indicating the vector size and other metadata) and the rest of the memory used is all payload. Also, vectors have better locality. On the other hand, "consing" (i.e., nondestructively inserting) an element into a vector is O(n), because you have to copy the whole vector, and even destructive insertion may require a whole copy every once in a while (when you exceed the current capacity of the vector). I've been wondering for a long time: could you make a Lisp based on a data structure that is halfway between a linked list and a vector?

If we are to allow the common Lisp idioms with this new kind of list, it has to support consing and taking the tail of the list efficiently. (Another possibility is to replace the common idioms with something else. That is much more open-ended and requires more thought.)

What I've been thinking of as of late is roughly a linked list of vectors, with some bells and whistles; each vector would be a chunk of the list. Each vector/chunk would have a header containing: (1) the number of elements in the chunk; (2) a link to the next chunk; (3) an index into the next chunk. Then comes the payload. So, for example, if you have the list (w x y z), and you want to prepend the list (a b c) to it, you'd get a structure like this (the | separates graphically the header from the payload; it does not represent anything in memory):

[3 * 0 | a b c]
   |
   `->[4 * 0 | w x y z]
         |
         `-> ø

The reason for the index is that now you can return the tail of a list lst without the first n elements by returning a vector chunk with 0 length and a pointer into lst with index n: [0 lst n | ]. If n is greater than the size of the first chunk (e.g., if you want to drop 5 elements from the (a b c w x y z) list above), we must follow the "next" pointers until we find the chunk where the desired tail begins. This is likely to be more efficient than the cons cell case, because instead of following n "next" pointers, you follow one pointer per chunk skipped, subtracting the length of the skipped chunk from n each time. In the worst case, where there is one chunk for each element, the performance is the same as for cons cells, at least in the number of pointer traversals. (We must only allow empty chunks, like the [0 lst n | ] example, at the beginning of a list, never in the middle of a chunk sequence. This guarantees that the worst case is no worse than with cons cells. If we allowed empty chunks anywhere, reaching the nth element of a list could require arbitrarily many chunk traversals.)
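To make the chunk layout and the tail operation concrete, here is a minimal sketch in C. It assumes a payload of plain ints and uses hypothetical names (struct chunk, make_chunk, drop, first) that are mine, not part of any existing Lisp implementation; memory is never freed, to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One chunk: the header (count, next, next_index) followed by the payload. */
struct chunk {
    size_t count;        /* number of elements stored in this chunk */
    struct chunk *next;  /* next chunk, or NULL at the end of the list */
    size_t next_index;   /* how many elements of `next` to skip */
    int items[];         /* payload (ints only, to keep the sketch short) */
};

/* Allocate a chunk holding `count` ints copied from `src`. */
struct chunk *make_chunk(size_t count, const int *src,
                         struct chunk *next, size_t next_index) {
    struct chunk *c = malloc(sizeof *c + count * sizeof c->items[0]);
    if (c == NULL) abort();
    c->count = count;
    c->next = next;
    c->next_index = next_index;
    for (size_t i = 0; i < count; i++) c->items[i] = src[i];
    return c;
}

/* Return the list `lst` without its first n elements, as a new empty chunk
   [0 c k | ] pointing into the chunk c where the tail begins.
   Returns NULL (the empty list) if the list has n elements or fewer. */
struct chunk *drop(struct chunk *lst, size_t n) {
    struct chunk *c = lst;
    size_t skip = 0;  /* leading elements of c that are not part of the list */
    while (c != NULL && n >= c->count - skip) {
        n -= c->count - skip;  /* skip the rest of this chunk in one step */
        skip = c->next_index;
        c = c->next;
    }
    if (c == NULL) return NULL;
    return make_chunk(0, NULL, c, skip + n);
}

/* First element of a non-empty list. */
int first(const struct chunk *lst) {
    size_t skip = 0;
    while (lst != NULL && lst->count == skip) {  /* step over an empty head */
        skip = lst->next_index;
        lst = lst->next;
    }
    assert(lst != NULL);
    return lst->items[skip];
}
```

With the list built as a 3-element chunk linked to a 4-element chunk, drop(lst, 5) follows a single "next" pointer and returns an empty chunk with index 2 into the second chunk, which is the [0 lst n | ] shape described above.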

One problem with this is that now (cdr lst) allocates memory (it creates a [0 lst 1 | ] chunk and returns it), unlike the cons cell case, where cdr never allocates memory (it just returns the value of the cell's "next" pointer). One possible solution is to try to make the result of cdr go in the stack rather than being heap-allocated, to reduce the overhead (the compiler could special-case cdr somehow to make it return multiple values rather than a new chunk, and build the chunk on the fly in the caller if it turns out to be necessary.) Another way around this would be to return a pointer into the middle of a chunk instead of a new chunk. I see two ways of achieving this:

Change the layout of the vector such that it begins with the payload, followed by a special value that marks the end of the payload, followed by the pointer to the next chunk. Now we don't need an index into the next chunk anymore, because we can just make the "next" pointer point into the middle of the next chunk. The drawback is that now finding the end of the chunk requires traversing the whole chunk, rather than looking at the length at the header. This would still be faster than with cons cells, because you're traversing adjacent elements in memory.
Require all chunks to have a fixed size and fixed memory alignment (say, every chunk has 64 bytes and is allocated at an address that is a multiple of 64). Then you can always find the header by zeroing out the last bits of the address.
Third idea that occurred to me now: if all chunks are, say, 16-byte aligned, you can encode an index 0~15 into the lower-order bits of the pointer.
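The third idea can be sketched in a few lines of C. This assumes every chunk is 16-byte aligned and that converting a pointer to uintptr_t and back preserves it (true on mainstream platforms, but implementation-defined in the C standard); tag_pointer, tagged_chunk and tagged_index are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_ALIGN 16u  /* assumed alignment of every chunk */

/* Pack a chunk address and an element index (0..15) into a single word. */
uintptr_t tag_pointer(void *chunk, unsigned index) {
    assert(((uintptr_t)chunk & (CHUNK_ALIGN - 1)) == 0);  /* must be aligned */
    assert(index < CHUNK_ALIGN);
    return (uintptr_t)chunk | index;
}

/* Recover the chunk address by zeroing out the low-order bits. */
void *tagged_chunk(uintptr_t p) {
    return (void *)(p & ~(uintptr_t)(CHUNK_ALIGN - 1));
}

/* Recover the element index from the low-order bits. */
unsigned tagged_index(uintptr_t p) {
    return (unsigned)(p & (CHUNK_ALIGN - 1));
}
```

Under this encoding, taking the cdr within a chunk is just adding 1 to the tagged word (as long as the index stays below 16 and below the chunk's element count), so no allocation is needed.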

All these have drawbacks. First, you need to know that the pointer you have points into a list chunk (rather than to some other kind of object) to be able to safely do the pointer arithmetic. (The fixed-size chunks case is simpler to solve: you zero out the low bits of the pointer and see if the result points to a chunk type tag.) Also, pointers into the middle of objects complicate garbage collection (and even more so reference counting, I think). Finally, if you fix the size of chunks, some of the advantages of using chunks in the first place go away; if I allocate a 1000-element list at once, that should get me a single 1000-element chunk.

Or should it? Another problem here is that now garbage collection / reference counting can only collect whole chunks. If you choose your chunks badly, you may end up holding memory for longer than necessary. For instance, if you have a 1000-element list made out of a single 1000-element chunk, and at some point your program takes tails until the only reference left is to the last three elements, now you're stuck with a huge chunk most of which is unused; worse, all the elements in it are kept from being collected too. Maybe we'd need a heuristic: if the tail you want is smaller than some threshold fraction of the chunk, the system would return a copy of the tail rather than a pointer into the chunk. This would mess with mutability (you'd never know if the tail list you got shares storage with the original), but maybe immutable lists are the way to go anyway.

The other problem to solve is how to make cons efficient: the classical Lisp cons adds (non-destructively) one element to the front of an existing list, and we don't want to create a new chunk per cons invocation, otherwise the chunks just degenerate into cons cells. One idea I had is to allocate chunks with at least a certain number of elements. For example, if you create a list with just a, you'd get a chunk with a few blank spaces (and enough metadata to know what is blank and what isn't; this could be an extra header element, or just a distinguished value meaning "blank"): [4 ø 0 | _ _ _ a]. Now, when you cons a new element x onto that list, cons would check if there is a space immediately before the a in the existing chunk, and mutate it in place: [4 ø 0 | _ _ x a]. This won't mess with the program's view of the list because so far it only had references to the already filled part of the list.

The problem with this is if you have multiple threads wanting to cons onto the same list at the same time: we must ensure only one of them gets to mutate the chunk. For example, say one thread wants to cons x onto the list (a), and another thread wants to cons y onto the same list (a). We must make sure that only one gets to mutate the chunk in place ([4 ø 0 | _ _ x a]), and the other one will fail and fall back to either copying the chunk and then mutating the copy, or creating a new chunk that points to the old one ([4 [4 ø 0 | _ _ x a] 3 | _ _ _ y]; note that the outer chunk points into the inner chunk with an index 3, skipping the first 3 elements, including the x added by the other thread). This could have a synchronization overhead. I'm not sure if it would be significant, though, because all you need is a compare-and-swap: "try to write into this space if it is blank". You don't need a lock because you don't need to wait for anyone: if this first try fails (i.e., if the other thread got the space first), the space won't be available anymore, so you must immediately fall back to creating a new chunk rather than waiting for anything.
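The "try to write into this space if it is blank" step maps directly onto a C11 atomic compare-and-swap. The sketch below assumes a fixed 4-slot chunk, a hypothetical BLANK marker value, and a list reference represented as a chunk plus the index of its first filled slot; the header fields and the fallback path (allocating a new chunk) are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLANK INTPTR_MIN  /* hypothetical marker for an unused slot */
#define CHUNK_CAPACITY 4

struct chunk {
    _Atomic intptr_t slots[CHUNK_CAPACITY];  /* filled from the right */
};

/* Initialise a chunk holding a one-element list: [ _ _ _ last ]. */
void chunk_init(struct chunk *c, intptr_t last) {
    for (size_t i = 0; i + 1 < CHUNK_CAPACITY; i++)
        atomic_init(&c->slots[i], BLANK);
    atomic_init(&c->slots[CHUNK_CAPACITY - 1], last);
}

/* Try to cons `x` in front of the list that starts at slot `start` of `c`.
   Returns true if slot start-1 was still blank and now holds x.
   Returns false if the chunk is full or another thread claimed the slot
   first; the caller must then fall back to creating a new chunk. */
bool try_cons_in_place(struct chunk *c, size_t start, intptr_t x) {
    if (start == 0) return false;
    intptr_t expected = BLANK;
    return atomic_compare_exchange_strong(&c->slots[start - 1], &expected, x);
}
```

If two threads both call try_cons_in_place(&c, 3, ...) on the list (a), exactly one call returns true; the loser never waits, it just takes the new-chunk path.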

A possible side-effect of all of this is that now vectors as a separate data structure may not be necessary: you just allocate an n-element list at once, and it will largely have the same performance as an n-element vector. Well, unless we make lists immutable, then we may need (mutable) vectors. And lists still have some arithmetic overhead to find the position of the element (because in general we don't know that the list is a single chunk when performing an access, we have to find that out), so vectors may still be advantageous in many circumstances.

Now, back to (trying to) work.

[Update: Apparently I reinvented a half-hearted version of VLists. Also, I didn't mention that, but the Lisp Machine had a feature similar in spirit (but not in implementation) called CDR coding, which used a special tag in cons cells to mean that the rest of the list itself rather than a pointer to it was stored at the cdr place, thus saving one pointer and gaining locality. In the Lisp Machine, every memory object was tagged, so this special tag came more or less for free, which is generally not the case for modern architectures.]


C is in an identity crisis, and some thoughts on undefined behavior

2016-05-19 23:11 -0300. Tags: comp, prog, c, pldesign, ramble, in-english

So, stories about undefined behavior have been making the rounds again in my Twitter and RSS feeds (two things I was supposed not to be using, but anyway), which brought me some new thoughts, plus some other thoughts I meant to blog about ages ago but forgot about.

The most recent one was this comment on Hacker News (via @pcwalton, via @jamesiry), which presents the following code, which is supposed to take a circular linked list, take note of the head of the list, and walk around the list freeing each node until it finds the node that points back to the head (and thus the end of the list):

void free_circularly_linked_list(struct node *head) {
  struct node *tmp = head;
  do {
    struct node *next = tmp->next;
    free(tmp);
    tmp = next;
  } while (tmp != head);
}

This looks (to me) like C code as good as it gets. However, this code triggers undefined behavior: after the first iteration of the loop frees the node pointed to by head, it is undefined behavior to perform the tmp != head comparison, even though head is not dereferenced.

I don't know what the rationale behind this is. Maybe that would make it possible to run C in a garbage-collected environment where as soon as an object is freed, all references to it are zeroed out. (The fact that no one has ever done this (as far as I know) is a mere detail. The fact that in a garbage-collected environment free would likely be a no-op is a mere detail too.)
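For what it's worth, the loop can be restructured so that head is only compared while it is still a live pointer. This is one possible rewrite, not the one from the Hacker News thread: it frees every other node first and head last. The return value (number of nodes freed) and the make_ring helper are my additions, there only to make the sketch easy to check.

```c
#include <stddef.h>
#include <stdlib.h>

struct node {
    struct node *next;
};

/* Free every node of a circular list; returns how many nodes were freed.
   head is freed last, so every tmp != head comparison involves two
   pointers to objects that are still alive. */
size_t free_circularly_linked_list(struct node *head) {
    if (head == NULL) return 0;
    size_t freed = 0;
    struct node *tmp = head->next;
    while (tmp != head) {
        struct node *next = tmp->next;  /* read the link before freeing */
        free(tmp);
        freed++;
        tmp = next;
    }
    free(head);
    return freed + 1;
}

/* Hypothetical helper: build a circular list of n nodes (NULL if n is 0). */
struct node *make_ring(size_t n) {
    struct node *head = NULL, *tail = NULL;
    for (size_t i = 0; i < n; i++) {
        struct node *nd = malloc(sizeof *nd);
        if (nd == NULL) abort();
        if (head == NULL) head = nd; else tail->next = nd;
        tail = nd;
    }
    if (tail != NULL) tail->next = head;
    return head;
}
```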

The feeling I had after I read this is that C is in a kind of identity crisis: C allows you to do all sorts of unsafe operations because (I'd assume) it's supposed to let you do the kind of bit-bashing you often want to do in low-level code; at the same time, modern standards forbid that very bit-bashing. What is the point of programming in C anymore?

[Addendum: To be more clear, what is the purported goal of the C language? The feeling I have is that it has moved from its original function as a "higher-level assembly" that is good for systems programming, and is trying to serve a wider audience more preoccupied with performance, but in doing so it is not serving either audience very well.]

And the standards forbid these operations in the worst possible way: by claiming that the behavior is undefined, i.e., claiming that compilers are free to do whatever the hell they please with code performing such operations. Compilers keep becoming better and better at exploiting this sort of undefinedness to better "optimize" code (for speed, anyway). Meanwhile, they keep breaking existing code, and opening new and shiny security vulnerabilities in programs. The NSA probably loves this.

At this point, I'm beginning to think that C does not serve its purpose well anymore. The problem is that there seems to be no real alternative available. Maybe Rust can be it, although I don't really know how well Rust does in the bit-twiddling camp (e.g., can you easily perform bitwise and/or with a pointer in Rust? Well, come to think of it, even C does not allow that; you have to cast to an integer first.)

* * *

The other undefined behavior I've been reading about lately is signed overflow. In C, signed overflow is undefined, which means that code like:

if (value + increment < value) {
    printf("Overflow occurred! Aborting!\n");
    exit(1);
}
else {
    printf("No overflow; proceeding normally\n");
    value += increment;
}

is broken, because the compiler is likely to optimize the overflow check and the then branch away and just leave the else branch. I have seen two rationales given for that:

Pointer arithmetic. In the good old times, an int and a pointer used to have the same size. People happily used ints as array indices. Array indexing is just pointer arithmetic, and in some architectures (like x86), you can often perform the pointer arithmetic plus load in a single instruction.

Then came 64-bit architectures. For reasons I don't really get (compatibility?), on x86-64 and other 64-bit architectures ints remained 32-bit even though pointers became 64-bit. The problem now is that transformations that assumed integers and pointers to be the same size don't work anymore, because now their point of overflow is different. For example, suppose you had code like:

void walk_string(char *s) {
    for (int i=0; s[i]; i++) {
        do_something(s[i]);
    }
}

Usually, the compiler would be able to replace this with:

void walk_string(char *s) {
    for (; *s; s++) {
        do_something(*s);
    }
}

which is potentially more efficient. If ints and pointers have the same size, then this transformation is okay regardless of overflow, because the int would only overflow at the same point the pointer would anyway. Now, if ints are supposed to wrap at 32 bits but pointers wrap at 64 bits, then this transformation is not valid anymore, because the pointer version does not preserve the overflow behavior of the original. By making signed overflow undefined, the problem is sidestepped entirely, because now at the point of overflow the compiler is free to do whatever the hell it pleases, so the fact that the overflow behavior of the original is not preserved does not matter.

Now, there are a number of things wrong in this scenario:

Now ints cannot be used to index into an arbitrarily-sized array, only into arrays of at most 2³¹ elements (a signed 32-bit int cannot go past 2³¹ − 1). In any sane language (I'd argue), the standard integer type would be one encompassing the full range of valid array indices. (Common Lisp explicitly requires fixnums to have at least that range, for instance.)
Unsigned overflow is defined. That's good, but at the same time it seems kinda strange that we have been using signed integers for array indices all this time, to the point that the undefinedness was chosen precisely to enable optimization of this usage, even though, come to think of it, it seems kinda wrong to use a signed integer in this circumstance (if your array takes more than half the memory space, this would be wrong, for instance).
[Update: Speaking of the devil...]

Optimizations based on "real math". The other reason I am aware of for making signed overflow undefined is to enable optimizations based on the mathematical properties of actual mathematical integers. An example is assuming that x+1 > x, for instance (which is what breaks the overflow test mentioned before). Another example is assuming that in a loop like:

for (i=0; i<=limit; i++) { ... }

the loop condition i<=limit will eventually become false, and therefore the loop will finish; if i were defined to wrap around on overflow, then this loop would be infinite when limit == INT_MAX. Knowing that a loop terminates enables some optimizations. The linked article mentions enabling use of loop-specific instructions which assume termination in some architectures. Another advantage of knowing that a loop terminates is enabling moving code around, because non-termination is an externally-visible effect and you may not be able to move code across the boundaries of an externally-visible event [please clarify].

Now, something that occurred to me back when I read that post is that it assumes a dichotomy between either treating overflow as undefined, or defining it to wrap around. But there are other possibilities not explored here. I don't necessarily claim that they are feasible or better, but it's interesting to think about what optimizations they would enable or preclude. For instance:

One possibility is to actually make integers behave like their mathematical counterparts. This is what Common Lisp and Python do, for instance, by automatically promoting fixnums to bignums when they exceed the representation capacity of fixnums. In this case, all optimizations based on actual arithmetic are not only possible but safe, because the language integers actually respect those properties. The drawback is that now you have to deal with bignums, and check whether a number is a fixnum or a bignum before performing arithmetic.
Another possibility is to make integer overflow trap, i.e., raise an exception, abort execution, or something along those lines. In this case, an assumption like x+1 > x is true as long as x+1 does not trap. This might not enable the same amount of optimization as claiming undefined behavior for overflow, but at least it enables knowing that a loop terminates (either by hitting the termination condition or by aborting on overflow). Although aborting on overflow is an externally-visible event too, so I don't know if you would be able to perform the aforementioned code moving. Actually, I'd need to see an example of code moving that would benefit from knowing about termination. I need to research this. [I remember seeing some code involving multiple threads and the observable behavior of one thread from another, but (1) I'd have to find it; (2) I'd have to be convinced that the mentioned optimization would be a good idea anyway.]

Now, the problem with the trapping semantics is that you have to check for overflow on every operation. This could be costly, but there are people working on making it more efficient. Seriously, this is the kind of thing that would be trivial if only architectures would help a little bit. Having the processor trap on overflow (either by having special trapping arithmetic instructions, or by having a special mode/flag which would enable trapping) would make this essentially costless, I think. Another nice-to-have would be a set of arithmetic instructions which treated the lower bits of words specially as flags, and trapped when the flags were not, say, all zeros. This could drastically reduce the cost of having fixnums and bignums in the language; the instructions would trap on non-fixnums and invoke a handler to perform the bignum arithmetic (or even find out that the operands are not numbers at all, and signal a dynamic type error), and perform the fast integer arithmetic when the operands had the fixnum flag. Alas, we cannot just invent our own instructions, as we typically want to be able to use our programming languages on existing platforms. We could try to lobby Intel/AMD, though (as if).
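Without hardware help, the tag-and-overflow check has to be spelled out in software. The following C sketch shows roughly what that fast path looks like, assuming a made-up tagging scheme where a low bit of 0 marks a fixnum; the names value, make_fixnum and fast_add are hypothetical. A false return stands in for "trap to the slow handler" (bignum arithmetic or a type error).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef intptr_t value;      /* a tagged machine word */
#define TAG_MASK ((value)1)  /* assumed scheme: low bit 0 = fixnum */

bool is_fixnum(value v) { return (v & TAG_MASK) == 0; }

/* Encode n as a fixnum; n must fit in one bit less than a word. */
value make_fixnum(intptr_t n) {
    assert(n >= INTPTR_MIN / 2 && n <= INTPTR_MAX / 2);
    return n * 2;
}

/* Decode a fixnum (v is even, so the division is exact). */
intptr_t fixnum_value(value v) { return v / 2; }

/* Fast path for addition. Returns false when the slow handler would be
   needed: an operand is not a fixnum, or the sum does not fit. */
bool fast_add(value a, value b, value *out) {
    if (((a | b) & TAG_MASK) != 0) return false;  /* not both fixnums */
    if (b > 0 ? a > INTPTR_MAX - b : a < INTPTR_MIN - b)
        return false;                             /* sum would overflow */
    *out = a + b;  /* both tags are 0, so the sum is already a valid fixnum */
    return true;
}
```

The two tests at the top of fast_add are the per-operation cost that the imagined trapping instructions would make go away.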

(At this point I'm not really thinking about semantics for C anymore, just about the possible semantics for integers in a new programming language. Even if, say, Clang incorporated an efficient mechanism for trapping integer overflow, the standard would still say that signed overflow is undefined, so I'm not sure there is much hope for C in this situation.)
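Within today's C, the overflow test from earlier in this post can still be written without ever performing the overflowing addition, by comparing against INT_MAX and INT_MIN first. A minimal sketch (the function names are mine):

```c
#include <limits.h>
#include <stdbool.h>

/* True if value + increment would not fit in an int.
   No signed overflow happens while computing the answer. */
bool add_would_overflow(int value, int increment) {
    if (increment > 0) return value > INT_MAX - increment;
    if (increment < 0) return value < INT_MIN - increment;
    return false;
}

/* Add increment to *value only if the result is representable.
   Returns false (and leaves *value untouched) otherwise. */
bool checked_increment(int *value, int increment) {
    if (add_would_overflow(*value, increment)) return false;
    *value += increment;
    return true;
}
```

Since every expression here has defined behavior, the compiler has no license to optimize the check away.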


Random project ideas

2016-05-13 00:00 -0300. Tags: comp, prog, pldesign, ramble, in-english

A few days ago someone asked me what kind of projects I would like to work on. I have some ideas that have been wandering my mind for a long time, so I decided to write some of them down. (I'm sure someone will come around, steal some of them, and make millions of dollars, but there we go.) I have been meaning to write a post like this one for quite a while, but never got around to it. For the next days I will be working on finishing my Master's monograph, so I decided to write them down now. The text is kind of a mess, and especially towards the end it is more of a "thinking out loud" sort than a clear exposition of ideas, but it's probably a good idea (for me, anyway) to register these ideas at once before I forget them. As Zhuangzi would say, "Let me say a few careless words to you and you listen carelessly, all right?"

A programming language

Although there are plenty of nice programming languages around (and plenty of un-nice ones as well), I don't know any language I'm fully satisfied with, and I certainly would like to try creating one. I don't think there can be such a thing as one perfect language to rule them all, but I'd like to create one that is better suited to the way I like to do things and to the kinds of things I like to do. Some features of that hypothetical language would be:

Optional types

After writing some larger programs in Scheme, I realized that some static type analysis would go a long way in catching many programming mistakes. Moreover, while working (for a short time) with the Clang/LLVM codebase, I realized how types can be useful in finding other places in a program which need to be changed after a change in one part. However, I strongly believe in the ability to run incomplete (and incorrect) programs, both for the sake of debugging (it is often convenient to be able to run an incorrect program with some concrete data to see what the state is at the point of the error, rather than relying solely on static type errors), and for the sake of exploratory programming (when you want to try out stuff and figure out what program you want to write after all).

(Some static-types people, when confronted with this notion, give replies like: "but static types help me with exploratory programming!". Well, dynamic typing helps me with exploratory programming. As I said before, I don't believe in a single programming language to rule them all, and neither do I believe that either static typing or dynamic typing is inherently superior in all circumstances and for all people. Indeed, I would actually like not to have to choose one or another, and that's kind of the point of this section.)

So, I would like to alternate between dynamic and static types according to convenience. First of all, I don't think static type mismatches should preclude compilation/execution; rather, these should emit warnings, and the compiler should emit code that runs as far as possible before hitting the error.

In dynamically-typed languages, it is relatively easy to call functions with arguments of the wrong type and keep running, because all values carry enough metadata (tags of some kind) to enable checking the type at any given moment. Statically-typed languages, on the other hand, usually employ "unboxed" representations for data (e.g., a 32-bit integer in memory is just 32 bits, with no extra tag indicating that it is an integer; a pointer is just another 32 (or 64) bits in memory, with no way of distinguishing it from an integer of the same size), therefore calling a function with a value of the wrong type (e.g., passing an integer to a function expecting a pointer) must be blocked; otherwise, there would be a violation of safety (e.g., the function would try to use the integer as a pointer, probably with disastrous consequences). Conventional statically-typed languages block execution of such unsafe function calls by refusing to compile the program, but that's not necessary: one might instead emit code that evaluates the program until the point where an unsafe function call would be performed, and abort execution just then. For instance, if f is a function taking integers, and g(x) returns a string, and the program contains the expression f(g(x)), you can still emit code that evaluates g(x) (and f), and then interrupts execution instead of calling f with the string. (I've written about this before, but anyway.)
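As a rough illustration of what the emitted code for f(g(x)) could look like, here is a C sketch. It assumes a hypothetical tagged representation for dynamic values (struct value) and reports the mismatch by returning false; a real runtime would abort or enter the debugger at that point instead.

```c
#include <stdbool.h>
#include <stdio.h>

enum tag { TAG_INT, TAG_STRING };

/* Hypothetical boxed representation: every dynamic value carries its tag. */
struct value {
    enum tag tag;
    union { int i; const char *s; } as;
};

/* f is statically typed: it takes and returns a raw (unboxed) int. */
int f(int n) { return n + 1; }

/* g returns a string, so f(g(x)) is ill-typed. */
struct value g(int x) {
    (void)x;
    struct value v = { TAG_STRING, { .s = "hello" } };
    return v;
}

/* What the compiler could emit for the call to f when it cannot prove the
   argument is an int: check the tag right before the call, and stop there
   on a mismatch instead of refusing to compile the whole program. */
bool call_f_checked(struct value arg, int *result) {
    if (arg.tag != TAG_INT) {
        fprintf(stderr, "type error: f expects an integer argument\n");
        return false;  /* a real runtime would abort or start the debugger */
    }
    *result = f(arg.as.i);
    return true;
}
```

Note that g(x) is still fully evaluated before the check fires, which is the "run as far as possible" behavior described above.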

So, you could have a language where (1) type declarations are optional; (2) if function parameter/return types are not declared, dynamic types and boxed representations are used by default; (3) if types are declared, static type checking is performed on function calls, and where the compiler cannot prove that the types match (e.g., because they don't match, or because a function with statically-typed parameters is called with a dynamically-typed value), it emits code to interrupt execution (and provide a decent diagnostic message, and/or trigger the debugger) just before the function is called. I wonder if interrupting execution at the function call wouldn't be too early (as ideally I'd like to be able to run an incomplete/incorrect program as far as is reasonable). Perhaps there could be a compiler option/pragma to emit code for the entire module as if everything were dynamically typed, even when type declarations are provided (but still emit warnings upon detected type mismatches).

This mix of static and dynamic types brings some implementation problems. Because statically-typed functions can't just be called with any random argument due to representation mismatches, if a function expects another function as an argument (let's say, a function f expects a function taking an integer and returning an integer), you cannot pass it a dynamically-typed function g that happens to return integers when given integers, because the representation of the values it returns would be different (g returns boxed integers, but f expects a function that returns raw integers). Likewise, if a function f expects a dynamically-typed function, it cannot be given a function g that returns raw integers. One solution would be to generate a dynamically-typed wrapper around statically-typed functions when they are used in a context that expects a dynamic function, and vice-versa.
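The wrapper idea can be sketched in C as a pair of adapters that box and unbox at the boundary. C has no closures, so here the wrapped function is passed explicitly; a real compiler would generate one wrapper per function instead. All the names (struct value, box_int, call_raw_as_boxed, and so on) are hypothetical.

```c
#include <assert.h>

enum tag { TAG_INT, TAG_STRING };

/* Hypothetical boxed representation used by dynamically-typed code. */
struct value {
    enum tag tag;
    union { int i; const char *s; } as;
};

typedef int (*raw_fn)(int);                      /* statically typed */
typedef struct value (*boxed_fn)(struct value);  /* dynamically typed */

struct value box_int(int n) {
    struct value v = { TAG_INT, { .i = n } };
    return v;
}

int unbox_int(struct value v) {
    assert(v.tag == TAG_INT);  /* the dynamic type check lives here */
    return v.as.i;
}

/* Use a raw int -> int function where a dynamic function is expected. */
struct value call_raw_as_boxed(raw_fn fn, struct value arg) {
    return box_int(fn(unbox_int(arg)));
}

/* Use a dynamic function where a raw int -> int function is expected. */
int call_boxed_as_raw(boxed_fn fn, int arg) {
    return unbox_int(fn(box_int(arg)));
}

/* Two example functions, one of each kind. */
int raw_double(int n) { return n * 2; }
struct value boxed_succ(struct value v) { return box_int(unbox_int(v) + 1); }
```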

To research: It'd probably be good to see how Haskell deals with the situation where an Int -> Int function is used where an a -> a one is expected. Although now that I wrote this I realize that that's not the same problem, because in Haskell the concrete a is always statically known at the point where the integer value is used. Probably more promising would be to look at how Java and C# deal with this. (In Java, the difference between a raw integer and an Integer object is directly visible to the programmer, whereas I think that's not the case in C#.)

There has also been lots of work on the subject of gradual types, both recent and old, and I have to check it out.

Language interoperability

I'm probably beginning to sound repetitive, but as I said before, I don't believe in a single language for all circumstances, and not only do I want to be able to use other existing programming languages, I'm probably going to create more than one in my lifetime. Still, I don't like the idea of having to reimplement code already available in a library just because the library is not available in the language I want to use. I'd like to be able to easily integrate pieces of code written in different programming languages, without having to manually write wrappers/bindings/etc. Ideally, there would be a "universal" way of exchanging data / invoking procedures across programming languages, and the "bridging" work would only have to be done once for each programming language (to make it support the "universal" way), rather than for each library and each language you want to use the library in.

This already kinda exists: the JVM and CLR do this, for languages that are implemented on top of the JVM or CLR. The problem is that you are constrained to implement your language on top of these virtual machines, which might not be a good match for the programming language in question (e.g., the VM may not support tail call optimization, or multiple inheritance, or you want to use the stack in a really strange way, or you want your strings to be UTF-8 instead of UTF-16, or you want direct access to operating system features, or whatever), or you may simply not want to use a bulky VM, or one whose evolution is in the hands of a single company and cramped by API copyright claims and/or patents.

An idea that occurred to me in the past is that of an extensible virtual machine: a machine where you could load modules that implemented new opcodes/features that would make it better suited to various programming languages. It is possible to do this without opcode conflicts between the different extensions by giving opcodes (fully-qualified) names instead of fixed numbers; the mapping from concrete opcode numbers to opcode names would be given in a header in the bytecode file, and would not be fixed. (I'm pretty sure I got this idea from WebAssembly, but I haven't been able to find a source right now.) So, for instance, you could have a Python extension for the VM, with Python-specific features and opcodes, and a Scheme extension, and so on. Of course, making all extensions work together is not just a matter of avoiding opcode conflicts, but having the freedom to extend the VM with features more convenient to each language to be supported sounds more promising than being forced to work with a fixed instruction set which did not have the language of interest in mind.
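The name-based opcode mapping is simple enough to sketch. In the C code below, the VM keeps a registry of handlers keyed by fully-qualified name, and each bytecode file's header says which name each opcode number stands for in that file. The opcode names ("core.inc", "core.double"), the one-register machine, and the run function are all made up for illustration.

```c
#include <stddef.h>
#include <string.h>

typedef void (*op_handler)(int *acc);

static void op_inc(int *acc) { *acc += 1; }
static void op_double(int *acc) { *acc *= 2; }

/* Handlers registered with the VM, keyed by fully-qualified opcode name.
   Loading an extension would add more entries to this registry. */
struct op_entry { const char *name; op_handler handler; };
static const struct op_entry known_ops[] = {
    { "core.inc", op_inc },
    { "core.double", op_double },
};

static op_handler lookup_op(const char *name) {
    for (size_t i = 0; i < sizeof known_ops / sizeof known_ops[0]; i++)
        if (strcmp(known_ops[i].name, name) == 0) return known_ops[i].handler;
    return NULL;
}

/* Run bytecode whose header maps opcode number i to the name header[i].
   Returns 0 on success, -1 on an unknown name or out-of-range opcode. */
int run(const char *const *header, size_t header_len,
        const unsigned char *code, size_t code_len, int *acc) {
    op_handler table[256];
    if (header_len > 256) return -1;
    for (size_t i = 0; i < header_len; i++) {
        table[i] = lookup_op(header[i]);  /* resolve names once, at load time */
        if (table[i] == NULL) return -1;
    }
    for (size_t i = 0; i < code_len; i++) {
        if (code[i] >= header_len) return -1;
        table[code[i]](acc);
    }
    return 0;
}
```

The same byte sequence means different things under different headers, so two extensions can never clash over a number; they can only clash over a name.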

VMs are not the only possibility (although VMs have a number of other possible advantages, which I may write about in another post). Another way would be to let each language implementation be completely independent, but make it possible somehow for them to share data and call code from each other. This has the advantage of (potentially) making it easier to accommodate existing language implementations (rather than having to reimplement them as VM extensions), but of course brings with it a number of challenges to make interoperability work, such as:

Handling mismatches between data layout in memory. One idea that immediately comes to mind is message-passing/OO: it does not matter if your strings are UTF-8 or UTF-16 or something else if you always access their contents with a charAt(i) method which knows how to find the ith character in it. You still have to agree on the format of characters, though. And it probably would be slow as hell to access everything through methods / message-passing. (But you could have methods like asUTF8() or asUTF16() which would return the string (or a copy thereof) in the specified format, which you could directly manipulate.) You also have to agree on a convention on how to invoke a method (but we would have to do that anyway, in principle).
Garbage-collecting data that crosses language boundaries. Basically you have multiple garbage collectors working at the same time, trying not to step on each other's toes. If language A calls a method from language B which returns an object x, now A holds a reference to x, but B's garbage collector is not prepared to go looking inside A's objects in memory for references to x. So it might end up collecting x while A still has references to it. One awkward solution would be to make everyone use the same garbage collector (for instance, a conservative one like Boehm–Demers–Weiser), but this precludes interesting implementations like Cheney-on-the-MTA linked before (which is used by Chicken Scheme, for instance). Another possibility would be to wrap every object that crosses language boundaries in a reference-counted container (which might also take care of other interoperability issues, such as containing enough information to allow inspecting the encapsulated object's methods). In the previous example, B would return x to A encapsulated within a wrapper X with a reference count of 2, meaning two runtimes (A and B) have access to the object. When B decides nothing in its memory holds a reference to x, B's runtime would somehow (handwave) remember that x has been given to another runtime inside a container X, and decrement the container's refcount, instead of freeing x. When the refcount reaches zero, x can be freed. That could become unwieldy for large numbers of objects (e.g., if B gives A a linked list, every time A asks for the next element in the list, the element would get wrapped).
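The reference-counted container from the last item could look roughly like this in C. It is a single-threaded sketch with hypothetical names (struct shared_box, box_share, box_release); the handwaved part, namely how B's collector knows to call box_release instead of freeing x, is not addressed. The count_and_free destructor exists only to make the behavior observable.

```c
#include <stdlib.h>

/* Container X wrapping an object x that has crossed a runtime boundary. */
struct shared_box {
    void *object;             /* the wrapped object x, owned by runtime B */
    int refcount;             /* how many runtimes can still reach x */
    void (*destroy)(void *);  /* how B's runtime frees x */
};

/* Wrap x when B hands it to A: both runtimes now hold it, so count = 2. */
struct shared_box *box_share(void *object, void (*destroy)(void *)) {
    struct shared_box *box = malloc(sizeof *box);
    if (box == NULL) abort();
    box->object = object;
    box->refcount = 2;
    box->destroy = destroy;
    return box;
}

/* Called by a runtime once nothing in its own heap refers to x anymore.
   Returns 1 if this was the last holder and x was destroyed, else 0. */
int box_release(struct shared_box *box) {
    if (--box->refcount > 0) return 0;
    box->destroy(box->object);
    free(box);
    return 1;
}

/* Demonstration destructor: counts calls so the effect can be observed. */
static int destroyed_count = 0;
void count_and_free(void *p) {
    destroyed_count++;
    free(p);
}
```

The per-object malloc for the box is exactly the overhead that makes this unattractive for a linked list handed over one node at a time.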

Another possibility, which came to mind after reading this, would be a model in which no memory is shared between languages: everything is passed by copy, and things don't even have to be in the same process. (This brings up another problem that has been in my mind since forever: the fact that Unix streams, files, etc., are limited to bytes rather than structured data. Here we would be defining a mechanism for exchanging higher-level data across processes.) This solves the data exchange problem, but not the code invocation problem. This could be done by message passing, but I wonder how that would work. For instance, suppose language A wants to use a regex library from language B.

A wants to search for all occurrences of a regex r within a 50MB string s. It would have to pass r and s by copy to B, and get the (potentially large) result by copy too, which sounds sub-optimal.
A wants to search for all occurrences of a regex r within a stream (open file, socket, generator, whatever) s. How would it pass s to B? Would it pass a handle h to B, and then listen for messages of the form "[nextItem h]" and answer accordingly? How would it know, in the general case, when to stop listening for messages about h from B? Wouldn't we now have a cross-process garbage collection problem, bringing us back to the wrapping solution?

To research: There is an infinite number of things to research here. One that I came across recently is GObject, GNOME/Gtk's way of doing objects in C, which was designed to make bindings to other languages easy. Maybe GObject solves all the problems and there's nothing to do (other than making everything support GObject for exchange of arbitrary objects), or maybe GObject can be used as an inspiration. I should also look up other inter-process communication mechanisms (especially D-Bus, and Erlang/OTP). Other people have been working on this problem from other angles.

EOF

Sorry if this post looks like a mess of random ideas thrown around, because that's exactly what it is. As always, writing about stuff helps me organize my ideas about it and notice problems I hadn't thought about before (though how organized things are here is open to question). Feel free to comment.


Lisp stares at PHP

2015-09-02 22:11 -0300. Tags: comp, prog, pldesign, lisp, php, lows, in-english, em-portugues

[This post is also available in English.]

No último post, eu falei sobre o lows, um Lisp que compila para PHP 5.2 que eu comecei a desenvolver, e o que eu fiz até agora nele. Neste post, eu pretendo discutir algumas questões de design da linguagem, mais ou menos na mesma idéia da série Blueprints for a shell (só que bem menor (or so I hope)). Como já mencionei, não sei quando vou mexer no lows de novo (afinal, eu tenho certos afazeres mundanos, tais como um mestrado para terminar), mas deixo aqui documentadas algumas idéias. Comentários são sempre bem-vindos.

Tipos de dados

Os tipos de dados do PHP não casam muito bem (leia-se: não casam praticamente nada) com os tipos convencionais do mundo Lisp. O plano é tentar projetar a linguagem para permitir um estilo de programação suficientemente Lisp-like com os tipos existentes, ao invés de tentar recriar os tipos líspicos tradicionais, até porque isso não seria muito viável em termos de performance em PHP.

Arrays

Basicamente o único tipo de coleção que o PHP possui é o array, que faz tanto o papel de lista quanto de dicionário (e mesmo como dicionário, ele preserva a ordem das inserções). Eu não vejo nenhuma maneira eficiente de implementar um cons em PHP (adiciona um elemento a uma lista, criando uma lista nova, em espaço/tempo O(1)). Daria para criar uma classe Cons, mas o overhead seria muito grande, e além do mais o objetivo da linguagem é interagir facilmente com código PHP. So, arrays.

As operações map, filter, reduce e afins são independentes da representação da coleção e funcionam igualmente bem com arrays. O idioma de criar listas usando cons e recursão, por outro lado, cai por terra, mas ao mesmo tempo ele já não seria uma boa escolha em PHP porque chamadas de função são mais custosas, e não há tail call optimization. Na verdade tail call optimization não nos ajudaria nesse caso anyway, porque a recursão não fica in tail position; nessa situação, quando a performance é relevante, a galera costuma acumular os elementos da lista como um argumento da chamada recursiva e chamar reverse! no final do loop, mas se é para fazer isso, podemos passar um array como argumento e adicionar elementos com push. Anyway.

No futuro, map e afins podem ser inlined, o que torna essas construções basicamente equivalentes a loops. Até mesmo a closure passada como argumento para o map não precisa ser construída, e ao invés disso o corpo do lambda pode ser inserido diretamente no loop.

Há que se pensar na sintaxe equivalente ao Array(...) do PHP. Para listas simples, (array 1 2 3) é suficiente. O problema é o equivalente da forma Array("foo"=>1, "bar"=>2). Quando a chave é uma string literal simples, uma possibilidade seria usar keywords do Chicken: (array foo: 1 bar: 2). Quando a chave não é literal, algo como (array (=> key1 val1) (=> key2 val2)) poderia ser usado, mas isso é meio verboso. I don't know.

Uma bizarrice de arrays do PHP é que elas são passadas por cópia. Ao que parece, simplesmente não existe uma maneira de passar um array "by sharing", como os valores costumam ser passados em Lisp ou em Python. É possível criar uma referência (que na verdade é um alias) para uma variável que contém um array, mas a referência é à variável, não ao array; se outro valor é atribuído à variável, a referência/alias reflete a modificação. Acho que teremos que conviver com isso.

Símbolos

Não existe nada equivalente a símbolos em PHP. A principal vantagem de símbolos sobre strings é o fato de que eles podem ser comparados em tempo constante, mas o PHP não possui nada realmente equivalente ao operador eq dos Lisps, então acho que não teria muito propósito implementar símbolos em lows. 'foo poderia ser usado como abreviação para "foo", talvez.

Pharen tem um "operador" #foo, que gera a string correspondente ao nome da função foo. A razão desse operador é que Pharen converte hífens em nomes de função para underlines (o que é uma boa idéia em um Lisp→PHP), e o operador # produz uma string com o nome convertido. Talvez ' pudesse ser usado com o mesmo propósito em lows.

By the way, lows não tem um operador quote de verdade, pois o código de um programa lows não é representado por uma estrutura de dados de lows (i.e., do PHP). Yeah, lows não é uma linguagem homoicônica, e talvez nem merecesse o título de Lisp por conta disso. Macros em lows não seriam escritas em lows, mas sim em (Chicken) Scheme, mas veremos isso mais adiante.

(Qual seria então o significado de '(+ 1 2) em lows? Erro de sintaxe? Uma abreviação de (array '+ '1 '2)?)

Miscelânea

Não há muito o que dizer sobre os demais tipos (números, strings, booleanos, e NULL). PHP converte loucamente entre tipos, o que é meio desagradável, mas não há muito o que fazer, a menos que queiramos inserir checks antes do uso de qualquer operação. (Isso poderia ser uma opção de compilação, útil para debugging, talvez.)

Operadores

Operações aritméticas e afins em Lisp costumam ser funções, e não operadores especiais. Porém, não queremos ter que chamar uma função em lows para realizar operações aritméticas, pois o custo de uma chamada de função em PHP é alto. (Mesmo Lisps costumam otimizar essas chamadas quando é possível determinar os tipos dos argumentos em tempo de compilação.) Por outro lado, seria bom podermos passar + e afins como argumento para outras funções. Uma possível solução seria expandir '+ para (lambda (x y) (+ x y)), mas isso fixa o número de argumentos do operador, enquanto o ideal é que (+ 1 2 3) seja possível e seja traduzido para 1+2+3. Outra possibilidade seria ter tanto uma função, definida na biblioteca padrão, quanto um operador especial, e o compilador decidiria qual usar dependendo do caso. Mas não tem por que incluir uma função para cada operador na biblioteca padrão, se o próprio compilador pode gerar o lambda com o código apropriado automaticamente quando necessário.

A maioria dos operadores do lows seriam equivalentes diretos, e teriam o mesmo nome, dos operadores do PHP. As exceções seriam:

Atribuição, que passaria a se chamar set!, porque = parece uma comparação de igualdade em Lisp.
Já que liberamos o =, ele pode ser equivalente ao === do PHP, que é o operador de igualdade mais útil do PHP. (Mas o != também teria que ser mapeado para o !== do PHP, porque senão isso seria uma armadilha dos infernus. Qual seria então o equivalente do != do PHP? !==? Oops!)
Incremento e decremento, talvez, para inc! e dec!, ou add1! e sub1!. Quanto ao pós-incremento, boa pergunta. Talvez post-inc! e post-dec!. Com que freqüência se usa o valor do pós-incremento anyway? (Se o incremento permanecesse como ++, qual seria o nome do pós-incremento?)
Concatenação de string, que eu me sinto tentado a renomear para ++, já que eu tenho o plano de usar . para acessar propriedades e métodos de objetos (more later).
and, or, not, ao invés de &&, ||, !, para maior similaridade com outros Lisps. and e or em particular me parecem melhores prefixados do que && e ||.

Os operadores binários têm um detalhe extra: x || y em PHP sempre retorna um booleano, enquanto em Lisp (or x y) costuma retornar x se ele é verdadeiro, e y caso contrário. Daria para traduzir (or x y) para x? x : y (com o detalhe de que x não pode ser avaliado duas vezes), mas isso meio que estraga a idéia de tradução direta (os condicionais de ifs teriam uma cara bizarra no código resultante). Talvez o compilador pudesse detectar quando o and/or está sendo usado como condicional de um if e traduzi-lo para &&/|| nesses casos.

De qualquer forma, o que o PHP acha que é verdadeiro é diferente do que um Lisp normalmente acha que é verdadeiro: em Lisp, normalmente o valor booleano falso (ou a lista vazia) é considerada como falso, e tudo o mais é considerado verdadeiro (inclusive 0, a string vazia, etc.). Esse comportamento inclusive faria mais sentido em PHP, que possui funções como strpos que devolvem um índice numa string (que pode ser 0) ou false. Mas não adiantaria mudar só o and, teria que mudar o comportamento do if também. Tem que ver isso aí...

Funções

Funções em PHP não verificam se o número de argumentos passados corresponde ao número de parâmetros declarados. Poderia haver uma opção de compilação para inserir um check no começo de cada função.

While we are at it, poderíamos permitir declarar os tipos dos argumentos e adicionar checks no começo da função, ou verificar e emitir warnings em tempo de compilação quando possível. Mas isso ficaria para o futuro distante.

Muito me alegraria ter keyword arguments em lows. Eles poderiam ser passados como um array, i.e., (f x y a: 1 b: 2) seria equivalente a (f x y (array a: 1 b: 2)), e no corpo da função referências a a seriam traduzidas para referências à chave no array.

Variáveis globais e constantes

O plano original era que um acesso a uma variável não declarada localmente seria interpretado como um acesso à variável global de nome correspondente. Porém, estive pensando se não seria melhor usar um prefixo ou outra convenção para nomear variáveis globais (e.g., *foo*, a la Common Lisp, ou $foo, a la Ruby). A vantagem seria poder detectar erros de digitação em tempo de compilação, ao invés de assumir que o nome não declarado se trata de uma variável global.

Outro problema são as constantes globais. O Pharen aparentemente decide se um nome é uma constante ou uma variável exigindo que constantes sejam escritas em maiúsculas. ($_SERVER e afins também são escritos em maiúsculas, mas o Pharen usa um operador especial $ para acessar essas variáveis.) Nada no PHP exige que o nome de uma constante seja totalmente em maiúsculas, entretanto. Uma possibilidade seria usar +nome+, uma convenção usada por alguns programadores Common Lisp. I don't know. (error-reporting +E-ALL+)? Talvez (error-reporting #E_ALL)? (error-reporting <E_ALL>)?

Namespaces

Um dialeto de Lisp é classificado como Lisp-1 ou Lisp-2 dependendo de se nomes de funções e de variáveis vivem em um mesmo namespace (Lisp-1), ou se dois namespaces separados são usados (Lisp-2). PHP é mais próximo de um Lisp-2 do que de um Lisp-1, mas a distinção não é bem a mesma, pois em PHP variáveis possuem um prefixo que as identifica ($), enquanto em Lisp a posição do nome em uma chamada determina o namespace a ser utilizado (se o nome aparece como primeiro elemento em uma lista entre parênteses, ele é um nome de função; caso contrário, é uma variável).

Para o lows ser um Lisp-1, seria necessário determinar se (f x y) deveria ser traduzido para f($x, $y) ou $f($x, $y). No geral, isso não é possível (o compilador lows não tem conhecimento dos nomes definidos fora do arquivo que está compilando). Assim, um Lisp-2 é uma escolha mais natural: (f x y) sempre traduz para f($x, $y). Para obter o equivalente de $f($x, $y), i.e., para usar uma variável como a função a ser chamada, deve-se usar um operador especial, normalmente chamado funcall nos Lisps: (funcall f x y).

Em PHP 5.2, $f() só funciona se $f é uma string contendo o nome da função a ser chamada; não é possível chamar closures criadas com (lambda ...) dessa maneira. Por outro lado, call_user_func($f, ...) funciona tanto quando $f é uma string quanto com outros valores chamáveis. Como queremos que funcall funcione em ambos os casos, funcall deve ser traduzido para call_user_func.

A situação contrária ao funcall, i.e., quando se quer usar uma função nomeada como argumento para outra, é coberta pelo nosso operador 'foo, que produz uma string com o nome (convertido) da função foo. Como PHP não possui funções de primeira classe nem declarações locais de função, a string com o nome da função é suficiente para identificar a que função está se referindo.

Classes, métodos e bagulheiras

Definições de classes e métodos são relativamente straightforward. (defclass Foo atributos/métodos...) não tem por que ser muito diferente da construção equivalente em PHP. Podemos adicionar algumas abreviações, e.g., permitir definir um construtor padrão ao invés do costumeiro $this->x = $x; $this->y = $y; … do PHP.

A bagunça começa na hora de escolher uma sintaxe para chamada de métodos e acesso a atributos. Pharen usa algo como (-> objeto (método args...)), mas isso é meio verboso. Uma solução seria permitir (objeto->método args...) como uma abreviação, no caso mais comum em que objeto é uma variável, mas eu prefiro . para isso ao invés de ->. By the way, e se eu quiser um método cujo nome está em uma variável, i.e., $objeto->$método(args...)? Suponho que em Pharen seja possível escrever (-> objeto ($método args...)), que é como eles resolvem o problema do funcall, mas meh, não é assim que lows funciona – e se eu quiser uma expressão ao invés de uma variável para computar o nome do método? Que tal (-> objeto (funcall método args...))? Mas agora eu vejo que toda essa sintaxe conflita com a sintaxe para acesso de atributos. Em Pharen, (-> objeto nome) é equivalente a $objeto->nome. Novamente, para usar um nome vindo de uma variável, pode-se usar $nome, mas, novamente, isso não permite usar uma expressão para calcular o nome do atributo; (-> objeto (computa-nome)) seria interpretado como $objeto->computa_nome().

So. Estou com a sensação de que o adequado em lows seria que $obj->nome fosse algo como (-> obj 'nome), com o nome quoted. Sem o quote, nome seria avaliado como uma expressão que produz um nome. Quanto a chamada de métodos, talvez o jeito seja usar um operador distinto do usado para atributos. Meh. Independentemente dos operadores escolhidos, obj.nome seria considerado uma abreviação para (-> obj 'nome), já que acredito que acesso com um nome fixo a um objeto em uma variável pré-determinada seja o caso mais comum. E (obj.método args...) seria uma abreviação para $obj->método(args...). Isso tudo requer mais pensação.

Analogamente, o argumento do operador new também deveria ser quoted: (new 'Foo), não (new Foo). Nesse sentido, as operações do lows seriam mais parecidas com o make-instance e slot-value do CLOS do que com o Pharen. Now, o CLOS usa (método objeto args...) para chamadas de métodos (que na verdade são multi-métodos; o objeto é um argumento como os demais, que tambem podem ser especializados para diferentes classes). O problema de adotar essa sintaxe em lows é que é necessário saber que método é um método e não uma função para decidir que código PHP gerar. Uma possiblidade seria usar (.método objeto args...). Tem que ver isso aí. Outra situação a ser considerada é quando queremos passar um método como argumento para uma função (i.e., o equivalente do Array($obj, "método") do PHP). Hmmrgh...

Falando de abreviações, acho que seria uma boa adotar @nome como uma abreviação de $this->nome, a la Ruby. (@método args...) poderia ser abreviação de $this->método(args...).

[Um problema de usar (. objeto 'nome) é que . é sintaxe especial em Scheme, e atualmente eu uso a função read do Scheme para ler o código-fonte lows de um arquivo. Mais adiante, é de se pensar escrever um reader próprio para o lows. Isso também teria a vantagem de permitir manter informação de linha e coluna nas formas lidas.]

Macros

Como o compilador é escrito em Chicken Scheme, e não PHP (e nem por sonho eu pensaria em escrevê-lo em PHP (embora talvez reescrevê-lo em lows no futuro não seja algo de se descartar)), e não possui um interpretador próprio, ele não é capaz de rodar código lows em tempo de compilação. Isso exclui a possibilidade de macros procedurais nativas em lows. O meu plano, por ora, é permitir que macros em Scheme possam ser escritas, pois essas podem ser executadas pelo compilador. Por um lado é bizarro (em um Lisp) escrever macros em uma linguagem diferente da linguagem de programação, mas por outro isso tem a vantagem de permitir usar todas as funções do Chicken na geração de código. Uma outra possibilidade (não-exclusiva) é ter um sistema de macros baseado em casamento de padrões, como o syntax-rules do Scheme, que não requer a execução de código lows em tempo de compilação.

Além das macros que recebem código lows e produzem código lows, também seria interessante ter uma construção para emitir código PHP diretamente, similar ao asm do GCC e afins. No caso mais simples, a construção seria usada com uma string constante, mas nada impediria que expressões arbitrárias em Scheme pudessem ser usadas para gerar o código PHP.

Enough talk

Acho que era isso por enquanto. Sugestões, opiniões, etc., são bem-vindos.

[Version in English follows.]

In the last post, I talked about lows, a Lisp that compiles to PHP 5.2, which I have started developing, and what I have got done so far. In this post, I intend to discuss some design questions of the language, more or less in the same spirit as the Blueprints for a shell series (in Portuguese) (except much shorter (or so I hope)). As I mentioned, I don't know when I'm going to work on lows again (after all, I've got some mundane tasks to do, such as a Master's to finish), but I'll leave some ideas documented here. Comments are always welcome.

Data types

The PHP data types don't match very well (read: mostly don't match) the conventional types from the Lisp world. The plan is to try to design the language to enable a sufficiently Lisp-like programming style with the existing types, rather than trying to recreate the traditional Lispy types in PHP, not least because that would not be very feasible in terms of performance in PHP.

Arrays

Basically the only type of collection PHP has is the array, which fills the roles of both lists and dictionaries (and even as a dictionary, it preserves insertion order). I don't see any efficient way to implement cons in PHP (that is, an operation that adds an element to a list, creating a new list, in O(1) time and space). It would be possible to create a Cons class, but the overhead would be too large, and moreover the goal of the language is to interact easily with PHP code. So, arrays.

The map, filter, reduce and similar operations are independent of the representation of the collection and work equally well with arrays. The idiom of building lists using cons and recursion, on the other hand, is right out, but at the same time it wouldn't be a good choice in PHP anyway because function calls are more costly, and there is no tail call optimization. Actually, tail call optimization wouldn't help in this case anyway, because the recursion is not in tail position; in this situation, when performance matters, people usually accumulate the elements of the list in an argument to the recursive call, and call reverse! at the end of the loop, but if we're going to do that, we can equally well pass an array as the argument and accumulate the elements with push. Anyway.

In the future, map and company can be inlined, which makes these constructions basically equivalent to loops. Even the closure passed as an argument to map does not need to be created, and instead the body of the lambda can be inserted directly in the loop.

We have to think about the syntax equivalent to PHP's Array(...). For simple lists, (array 1 2 3) works. The problem is the equivalent of the form Array("foo"=>1, "bar"=>2). When the key is a simple literal string, one possibility would be to use Chicken keywords: (array foo: 1 bar: 2). When the key is not a literal, something like (array (=> key1 val1) (=> key2 val2)) could be used, but that is somewhat verbose. I don't know.
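
To make the two candidate notations concrete, here is a rough Python sketch of how a compiler could turn them into PHP source text. Python merely stands in for the Scheme compiler; the list-based encoding of forms, the convention that a string ending in `:` is a keyword, and the helper names `emit_atom` and `emit_array` are all assumptions made up for this example.

```python
def emit_atom(a):
    """Emit a PHP literal or variable for a tiny subset of atoms."""
    if isinstance(a, bool):
        return "true" if a else "false"
    if isinstance(a, (int, float)):
        return str(a)
    if isinstance(a, str) and a.startswith('"'):
        return a                      # already a quoted string literal
    return "$" + a.replace("-", "_")  # anything else is a variable name


def emit_array(form):
    """Translate ["array", ...] into PHP array(...) source text."""
    assert form[0] == "array"
    items, args, i = [], form[1:], 0
    while i < len(args):
        arg = args[i]
        if isinstance(arg, str) and arg.endswith(":"):
            # keyword key:  foo: 1  ->  "foo" => 1
            items.append('"%s" => %s' % (arg[:-1], emit_atom(args[i + 1])))
            i += 2
        elif isinstance(arg, list) and arg and arg[0] == "=>":
            # computed key:  (=> key val)  ->  $key => $val
            items.append("%s => %s" % (emit_atom(arg[1]), emit_atom(arg[2])))
            i += 1
        else:
            items.append(emit_atom(arg))
            i += 1
    return "array(" + ", ".join(items) + ")"


plain = emit_array(["array", 1, 2, 3])
keyed = emit_array(["array", "foo:", 1, "bar:", 2])
computed = emit_array(["array", ["=>", "key1", "val1"], ["=>", "key2", "val2"]])
```

Both notations can coexist in one form, since the keyword and the (=> ...) pair are distinguishable at compile time.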

One oddity of PHP arrays is that they are passed by copy. Apparently, there simply is no way to pass an array "by sharing", as values are usually passed in Lisp or Python. It is possible to create a reference (actually an alias) to a variable containing an array, but the reference is to the variable, not to the array; if another value is assigned to the variable, the reference/alias will reflect the modification. I guess we'll have to live with that.

Symbols

There is nothing equivalent to symbols in PHP. The main advantage of symbols over strings is that they can be compared in constant time, but PHP doesn't really have anything equivalent to the Lisp eq operator, so I think there wouldn't be much to gain in implementing symbols in lows. 'foo could be used as an abbreviation for "foo", perhaps.

Pharen has an "operator" #foo, which yields the string corresponding to the name of the function foo. The reason for that operator is that Pharen converts hyphens in function names to underscores (which is a good idea in a Lisp→PHP), and the # operator yields a string with the converted name. Perhaps ' could be used for the same purpose in lows.

By the way, lows does not have a true quote operator, as the code of a lows program is not represented by a lows (i.e., PHP) data structure. Yeah, lows is not a homoiconic language, and perhaps doesn't even deserve the name "Lisp" because of that. Macros in lows wouldn't be written in lows, but rather in (Chicken) Scheme, as we'll see later.

(What would be the meaning of '(+ 1 2) in lows? Syntax error? Shorthand for (array '+ '1 '2)?)

Miscellaneous

There isn't much to say about the remaining types (numbers, strings, booleans, and NULL). PHP converts willy-nilly between types, which is somewhat annoying, but there is not much we can do, unless we want to insert checks before every operation. (This could be a compilation option, useful for debugging, perhaps.)

Operators

Arithmetic operations and the like in Lisp are usually functions, not special operators. However, we don't want to have to call a function in lows to perform arithmetic operations, because the cost of a function call in PHP is high. (Even Lisps usually optimize these calls away when it is possible to determine the types of the operands at compile time.) On the other hand, it would be nice to be able to pass + and the like as arguments to other functions. One possibility would be to expand '+ to (lambda (x y) (+ x y)), but that would fix the number of arguments of the operator, whereas ideally (+ 1 2 3) should work and translate to 1+2+3. Another possibility would be to have both a function, defined in the standard library, and the special operator, and the compiler would decide which to use depending on the situation. But there is no reason to include a function for each operator in the standard library, when the compiler can automatically generate the lambda with the appropriate code as needed.
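
A small Python sketch of the two cases discussed above: an operator in call position is emitted inline, and an operator in value position is replaced by a generated lambda form. Python stands in for the Scheme compiler; the list encoding of forms and the names `emit` and `operator_as_value` are hypothetical, and the output is parenthesized (so (+ 1 2 3) becomes (1 + 2 + 3) rather than the bare 1+2+3) to stay safe when expressions nest.

```python
INFIX = ("+", "-", "*", "/")


def emit(expr):
    """Emit PHP source for numbers, variables and n-ary infix operators."""
    if isinstance(expr, list):
        head, args = expr[0], expr[1:]
        if head not in INFIX or len(args) < 2:
            # unary minus, function calls, etc. are out of scope here
            raise ValueError("only n-ary infix operators are sketched here")
        return "(" + (" %s " % head).join(emit(a) for a in args) + ")"
    if isinstance(expr, (int, float)):
        return str(expr)
    return "$" + expr.replace("-", "_")


def operator_as_value(op):
    """'+ in value position: synthesize the form (lambda (x y) (+ x y))."""
    return ["lambda", ["x", "y"], [op, "x", "y"]]


inline_sum = emit(["+", 1, 2, 3])
nested = emit(["*", "x", ["+", "y", 1]])
plus_lambda = operator_as_value("+")
```

As noted above, the generated lambda is binary, so it does not cover the variadic use; that limitation is kept on purpose in the sketch.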

Most lows operators would be directly equivalent to the PHP operators and would have the same names. The exceptions would be:

Assignment, which would be named set!, because = looks like an equality comparison in Lisp.
Since we freed up =, it could become equivalent to PHP's ===, which is PHP's most useful equality operator. (But then != would have to be mapped to !== too, otherwise that would be a terrible trap. What would then be the equivalent of PHP's !=? !==? Oops!)
Increment and decrement, perhaps, renamed to inc! and dec!, or add1! and sub1!. As for post-increment, good question. Maybe post-inc! and post-dec!. How often is the result of post-increment used anyway? (If increment remained as ++, what would post-increment be named?)
String concatenation, which I feel tempted to rename to ++, as I have plans to use . to access object properties and methods (more later).
and, or, not, instead of &&, ||, !, for greater similarity with other Lisps. and and or in particular look better to me in prefix position than && and ||.

The binary operators have an extra detail: x || y in PHP always returns a boolean, whereas in Lisp (or x y) usually returns x if it is true-ish, and y otherwise. It would be possible to translate (or x y) to x? x : y (taking extra care not to evaluate x twice), but that kinda messes with the idea of straightforward translation (the conditions of if blocks would look weird in the compiled code). Maybe the compiler could detect when and/or are being used as the condition of an if and translate them to &&/|| in those cases.
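
The context-dependent translation could look roughly like this. It is a Python sketch with made-up names (`emit_or`, the temporary `$__or`); the operands are assumed to be already-compiled PHP expressions, and PHP's own notion of truth is used unchanged.

```python
def emit_or(x_php, y_php, in_condition, tmp="$__or"):
    """Emit PHP for (or x y), given both operands as PHP source text.

    In the condition of an if, only truthiness matters, so || is enough.
    In value position, x is stored in a temporary so that it is evaluated
    once and its value (not a boolean) is returned when it is true-ish.
    A real compiler would have to pick a temporary that cannot clash with
    user variables.
    """
    if in_condition:
        return "(%s || %s)" % (x_php, y_php)
    return "((%s = %s) ? %s : %s)" % (tmp, x_php, tmp, y_php)


as_condition = emit_or("$a", "$b", in_condition=True)
as_value = emit_or("f()", "$b", in_condition=False)
```

The value-position output is the "weird-looking" form mentioned above, which is why restricting it to the places where the value is actually used seems worthwhile.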

In any case, what PHP thinks is true is at odds with what Lisps usually think is true: in Lisp, usually the false boolean value (or the empty list) is considered false, and everything else is considered true (including 0, the empty string, etc.). This behavior would actually make more sense in PHP, which has functions such as strpos which return an index into a string (which may be 0) or false. But then it wouldn't be enough to change the behavior of and, we'd have to change the behavior of if too. Gotta think about it.

Functions

PHP functions don't check whether the number of arguments in a call matches the number of declared parameters. There could be a compilation option to insert a check at the beginning of each function.
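
As a sketch of what such a compilation option might emit, here is a Python function that generates a PHP function definition with an optional arity check. The generator name `emit_function`, the error message and the choice of trigger_error with E_USER_ERROR are assumptions for illustration; func_num_args() is the PHP built-in that reports how many arguments were actually passed.

```python
def emit_function(name, params, body_php, check_arity=False):
    """Emit a PHP function definition, optionally guarded by an arity check."""
    php_name = name.replace("-", "_")
    php_params = ["$" + p.replace("-", "_") for p in params]
    lines = ["function %s(%s) {" % (php_name, ", ".join(php_params))]
    if check_arity:
        n = len(params)
        lines.append("    if (func_num_args() != %d) {" % n)
        lines.append('        trigger_error("%s: expected %d arguments", E_USER_ERROR);'
                     % (php_name, n))
        lines.append("    }")
    lines.append("    " + body_php)
    lines.append("}")
    return "\n".join(lines)


checked = emit_function("add-two", ["x", "y"], "return $x + $y;", check_arity=True)
unchecked = emit_function("f", [], "return 1;")
```

With the option off, the output is the plain function, so the check costs nothing in production builds.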

While we are at it, we could allow declaring the types of the parameters and add checks at the beginning of the function, or verify and emit warnings at compile time if possible. But that would be left for the distant future.

It would much gladden me to have keyword arguments in lows. They could be passed in an array, i.e., (f x y a: 1 b: 2) would be equivalent to (f x y (array a: 1 b: 2)), and in the function body references to a would be translated to array dereferences.
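
The call-site half of this idea can be sketched as follows, again in Python as a stand-in for the compiler. The encoding (keywords as strings ending in `:`) and the names `emit_value` and `emit_call` are assumptions for the example; the callee side, where references to a become array dereferences, is not shown.

```python
def emit_value(v):
    """Numbers are emitted as-is; anything else is treated as a variable."""
    if isinstance(v, (int, float)):
        return str(v)
    return "$" + v.replace("-", "_")


def emit_call(form):
    """(f x y a: 1 b: 2)  ->  f($x, $y, array("a" => 1, "b" => 2))."""
    name, args = form[0], form[1:]
    positional, keywords, i = [], [], 0
    while i < len(args):
        arg = args[i]
        if isinstance(arg, str) and arg.endswith(":"):
            keywords.append('"%s" => %s' % (arg[:-1], emit_value(args[i + 1])))
            i += 2
        else:
            positional.append(emit_value(arg))
            i += 1
    if keywords:
        # all keyword arguments travel together as one trailing array
        positional.append("array(" + ", ".join(keywords) + ")")
    return "%s(%s)" % (name.replace("-", "_"), ", ".join(positional))


with_keywords = emit_call(["f", "x", "y", "a:", 1, "b:", 2])
without_keywords = emit_call(["f", "x"])
```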

Global variables and constants

The original plan was that an access to a variable not declared locally would be interpreted as an access to the global variable with that name. However, I have been wondering whether it would be better to use a prefix or some other convention to name global variables (e.g., *foo* as in Common Lisp, or $foo as in Ruby). The advantage would be the ability to detect mistyped variable names at compile time, rather than assuming that the undeclared name refers to a global.

Another problem is global constants. Pharen seems to decide if a name is a constant or a variable by requiring all constants to have all-uppercase names. ($_SERVER et al. are all-uppercase too, but Pharen uses a special operator $ to access those.) Nothing in PHP requires that the name of a constant be all-uppercase, though. A possibility would be to use +name+, a convention used by some Common Lisp programmers. I don't know. (error-reporting +E-ALL+)? Maybe (error-reporting #E_ALL)? (error-reporting <E_ALL>)?

Namespaces

Lisp dialects are usually classified as Lisp-1 or Lisp-2 depending on whether function and variable names live in a single namespace (Lisp-1), or if there are two separate namespaces for them (Lisp-2). PHP is closer to a Lisp-2 than a Lisp-1, but the distinction is not quite the same, because in PHP variable names have a prefix identifying them ($), whereas in Lisp the position of the name in a call determines which namespace is to be used (if the name appears as the first element of a parenthesized list, it is a function name; otherwise, it's a variable name).

For lows to be a Lisp-1, it would be necessary to determine if (f x y) should be translated to f($x, $y) or $f($x, $y). In general, this is not possible (the lows compiler has no knowledge of the global names defined outside the file being compiled). Therefore, a Lisp-2 is a more natural choice: (f x y) always translates to f($x, $y). To get the equivalent of $f($x, $y), i.e., to use a variable as a function to be called, a special operator must be used, usually named funcall in Lisps: (funcall f x y).

In PHP 5.2, $f() only works if $f is a string containing the name of the function to be called; it is not possible to call closures created with (lambda ...) in this way. On the other hand, call_user_func($f, ...) works equally well when $f is a string and with other callable values. Since we want funcall to work in both cases, funcall must be translated to call_user_func.

The opposite situation to funcall, i.e., when we want to pass a named function as an argument to another, is covered by our 'foo operator, which yields a string with the (converted) name of the function foo. Since PHP does not have first-class functions or local function declarations, a string with the function name is enough to identify which function is being referenced.
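
Put together, the Lisp-2 rules from this section fit in a few lines. The following Python sketch (hypothetical `emit_expr`, forms encoded as nested lists, quoted names as strings starting with `'`) covers plain calls, funcall, and the quote-as-function-name case; the only PHP function it relies on is call_user_func.

```python
def emit_expr(e):
    """Emit PHP source under the Lisp-2 rules sketched in the text."""
    if isinstance(e, list):
        head, args = e[0], [emit_expr(a) for a in e[1:]]
        if head == "funcall":
            # (funcall f x y): the callee is a value, not a function name
            return "call_user_func(%s)" % ", ".join(args)
        # (f x y): head position always names a PHP function
        return "%s(%s)" % (head.replace("-", "_"), ", ".join(args))
    if isinstance(e, (int, float)):
        return str(e)
    if e.startswith("'"):
        # 'foo-bar: string holding the converted function name
        return '"%s"' % e[1:].replace("-", "_")
    return "$" + e.replace("-", "_")


direct = emit_expr(["f", "x", "y"])
indirect = emit_expr(["funcall", "f", "x", "y"])
by_name = emit_expr(["array-map", "'strtoupper", "xs"])
```

Note that the decision is purely syntactic: the compiler never needs to know what names exist outside the file being compiled.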

Classes, methods, and stuff

Class and method definitions are relatively straightforward. (defclass Foo attributes/methods...) doesn't have to be very different from the equivalent construction in PHP. We can add some shorthands, e.g., allow a default constructor instead of the usual $this->x = $x; $this->y = $y; … in PHP.

The mess begins when choosing a syntax for method calls and attribute access. Pharen uses something like (-> object (method args...)), but that is somewhat verbose. A solution would be to allow (object->method args...) as a shorthand for the most common case when object is a variable, but I prefer . for this instead of ->. By the way, what if I want to use a method whose name is in a variable, i.e., $object->$method(args...)? I suppose Pharen allows (-> object ($method args...)), which is how they solve the funcall problem, but meh, that's not how lows works – what if I want an expression instead of a variable to compute the method name? What about (-> object (funcall method args...))? But now I see that all this syntax conflicts with the syntax for attribute access. In Pharen, (-> object name) is equivalent to $object->name. Again, to use a name from a variable, it allows $name, but, again, this does not allow using an expression to compute the name of the attribute; (-> object (compute-name)) would be interpreted as $object->compute_name().

So. I have the feeling that the appropriate thing in lows would be that $obj->name should be something like (-> obj 'name), with the name quoted. Without the quote, name would be evaluated as an expression yielding a name. As for method calls, perhaps the solution is to use a distinct operator from that used for attribute access. Meh. Regardless of the operators chosen, obj.name would be considered shorthand for (-> obj 'name), as I think access with a fixed name to an object in a known variable is the most common case. And (obj.method args...) would be shorthand for $obj->method(args...). All of this requires more thinking.
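
A tentative Python sketch of the attribute half of this proposal: the dotted shorthand is expanded into the (-> obj 'name) form, and that form is emitted with either a fixed or a computed property name. The names `expand_dotted` and `emit_attr` are made up, only variables are handled as the computed name, and method calls are left out because the operator for them is still undecided above.

```python
def expand_dotted(symbol):
    """obj.name  ->  ["->", "obj", "'name"] (fixed name, object in a variable)."""
    obj, name = symbol.split(".", 1)
    return ["->", obj, "'" + name]


def emit_attr(form):
    """Emit PHP for (-> obj 'name) or (-> obj var)."""
    _, obj, name = form
    if name.startswith("'"):
        return "$%s->%s" % (obj, name[1:])   # quoted: literal property name
    return "$%s->{$%s}" % (obj, name)        # unquoted: name taken from a variable


expanded = expand_dotted("obj.name")
fixed = emit_attr(expanded)
computed = emit_attr(["->", "obj", "field"])
```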

Analogously, the argument for the new operator should also be quoted: (new 'Foo), not (new Foo). In this sense, lows's operations would be more similar to make-instance and slot-value from CLOS than to Pharen. Now, CLOS uses (method object args...) for method calls (which are actually multi-methods; object is an argument just like the others, which can also be specialized for different classes). The problem of adopting this syntax in lows is that it requires knowing that method is a method and not a function to decide which PHP code to emit. One possibility would be using (.method object args...). Gotta think about that. Another situation to be considered is when we want to pass a method as an argument to a function (i.e., the equivalent of PHP's Array($obj, "method")). Hmmrgh...

Speaking of shorthand, I think it would be a good idea to adopt @name as shorthand for $this->name, a la Ruby. (@method args...) could be shorthand for $this->method(args...).
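As a rough illustration of the two shorthands above, here is a minimal Python sketch (not lows or Scheme code) that expands obj.name and @name into the explicit (-> obj 'name) form. Everything in it is an assumption made for the example: the function names, the representation of S-expressions as nested Python lists of strings, and the spelling this for the object behind @. Method-call shorthands are left out, since their syntax is still undecided.

```python
def is_number(text):
    """True for numeric literals such as 42 or 1.5, which must not be split."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def expand_shorthand(symbol):
    """Expand obj.name and @name into (-> obj 'name); leave other symbols alone."""
    if is_number(symbol):
        return symbol
    if symbol.startswith("@") and len(symbol) > 1:
        # @name  =>  (-> this 'name)   ("this" is an assumed spelling)
        return ["->", "this", ["quote", symbol[1:]]]
    obj, dot, name = symbol.rpartition(".")
    if dot and obj and name:
        # a.b.c  =>  (-> (-> a 'b) 'c), so chained accesses nest to the left
        return ["->", expand_shorthand(obj), ["quote", name]]
    # Plain symbols, and forms like .method, are returned unchanged.
    return symbol
```

With this sketch, expand_shorthand("obj.name") gives ["->", "obj", ["quote", "name"]], while a symbol such as .method is passed through untouched so that it stays available for some other notation.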

[One problem of using (. object 'name) is that . is special syntax in Scheme, and currently I use Scheme's read function to read lows source code from a file. Later on, one might think about writing a proper reader for lows. That would also have the advantage of allowing the compiler to keep track of line and column information.]

Macros

Because the compiler is written in Chicken Scheme, and not PHP (and I wouldn't even dream of writing it in PHP (though perhaps rewriting it in lows someday is not entirely unthinkable)), and has no lows interpreter, it is not capable of running lows code at compile time. This excludes the possibility of native procedural macros in lows. My plan, for now, is to allow writing macros in Scheme, as those could be run by the compiler. On the one hand it is somewhat weird (for a Lisp) to write macros in a different language, but on the other hand this has the advantage of allowing one to use all Chicken functions in code generation. Another (non-mutually-exclusive) possibility is to have a macro system based on pattern matching, like Scheme's syntax-rules, which does not require running lows code at compile time.
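To make the pattern-matching option a bit more concrete, here is a very reduced Python sketch of the idea behind syntax-rules: a macro is a list of (pattern, template) pairs, and expansion only matches and substitutes, so no lows code has to run at compile time. This is an assumption-laden toy, not a proposal for the actual lows macro system: the names match, substitute and expand are made up, forms are nested Python lists of strings, there is no support for ellipsis patterns, and no hygiene. The unless rule in the usage note is hypothetical too.

```python
def match(pattern, form, literals=()):
    """Return a dict of bindings if form matches pattern, else None."""
    if isinstance(pattern, str):
        if pattern in literals:
            return {} if form == pattern else None
        return {pattern: form}  # a pattern variable binds any form
    if not isinstance(form, list) or len(form) != len(pattern):
        return None
    bindings = {}
    for sub_pattern, sub_form in zip(pattern, form):
        sub_bindings = match(sub_pattern, sub_form, literals)
        if sub_bindings is None:
            return None
        bindings.update(sub_bindings)
    return bindings


def substitute(template, bindings):
    """Copy the template, replacing pattern variables by what they matched."""
    if isinstance(template, str):
        return bindings.get(template, template)
    return [substitute(part, bindings) for part in template]


def expand(form, rules, literals=()):
    """Expand a macro call using the first rule whose pattern matches."""
    for pattern, template in rules:
        # The head of the pattern is the macro keyword itself, so skip it.
        bindings = match(pattern[1:], form[1:], literals)
        if bindings is not None:
            return substitute(template, bindings)
    raise SyntaxError("no rule matches %r" % (form,))
```

For instance, with rules = [(["unless", "test", "a", "b"], ["if", "test", "b", "a"])], expanding ["unless", [">", "x", "y"], "1", "2"] yields ["if", [">", "x", "y"], "2", "1"].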

Besides macros which take lows code and yield lows code, it would also be interesting to have a construction to emit PHP code directly, similar to asm in GCC and the like. In the simplest case, the construction would be used with a constant string, but nothing would exclude using arbitrary Scheme expressions to generate PHP code.

Enough talk

I think that's it for now. Suggestions, opinions, etc., are welcome.


Lisp meets PHP

2015-09-02 04:25 -0300. Tags: comp, prog, pldesign, php, lisp, lows, in-english, em-portugues

[This post is also available in English.]

Como eu andei comentando por aí, eu comecei a implementar uma linguagem Lisp-like que compila para PHP 5.2, chamada lows. Não sei quando vou mexer nesse projeto de novo, mas deixo aqui algumas notas para o meu eu futuro e para quem tiver interesse.

Prelúdio

Tudo começou no domingo retrasado, quando eu resolvi dar uma mexida no blog system, for a change. Mexer no blog sempre é uma experiência ambivalente, pois por um lado tem uma porção de idéias que eu gostaria de implementar nele, mas por outro lado eu tenho que fazer isso em PHP, porque é a única coisa que roda no inf.ufrgs.br (e eu não pretendo pagar por hospedagem any time soon). Eu comentei com uma pessoa que se eu tivesse a opção, eu já teria reescrito o blog em Scheme há muito tempo. Ela me perguntou por que eu não fazia algo para rodar Scheme em PHP, e eu comentei que já tinha pensado em fazer um compilador de Lisp para PHP, mas que achava que era muita mão só para poder escrever o blog em Lisp. Assim, eu segui meu domingo fuçando no blog em PHP.

Depois de uma porção de gambiarras e mais uma porção de concessões às bizarrices do PHP (e.g., existem os métodos mágicos __call, __callStatic e __get, mas não existe um __getStatic, sabe-se lá por quê), eu consegui reescrever a parte do blog responsável por mensagens multilíngües de uma maneira que me agradasse, e até já não estava mais achando tão horrível escrever o código em PHP.

No final do dia, depois de ter testado o código no meu servidor local, eu resolvi fazer upload da nova versão para o inf.ufrgs.br. Para minha surpresa, o PHP começou a reportar uma porção de erros no código. Turns out que a versão do PHP que roda no inf.ufrgs.br é a "5.2.6-1+lenny16". Para quem não sabe, lenny é o codinome do Debian 5. O Debian 5 foi lançado em 2009 e não recebe mais atualizações desde 2012. Três releases estáveis do Debian saíram desde então (as releases estáveis do Debian saem a cada mais ou menos dois anos). Meanwhile, eu estava rodando PHP 5.6.12 em um Debian testing em casa, e praticamente todo o código que eu tinha escrito usava features introduzidas no PHP 5.3.

Depois de tentar sem muito sucesso mudar um pouco o código para ver se conseguia fazê-lo rodar no PHP 5.2, eu resolvi largar de mão e deixar para mexer no código outro dia. Porém: (1) eu não estava a fim de enfeiar o código só para fazê-lo rodar em um PHP velho; (2) more generally, eu não estava a fim de mexer em PHP de novo; e (3) eu não estava conseguindo dormir aquele dia. Conclusão: comecei a escrever um tradutor Lisp→PHP, primariamente for the lol. Mais uma noite mal-dormida, e eis que eu tinha um tradutor (ou compilador, como preferir) Lisp→PHP que fazia pouca coisa, mas o suficiente para me convencer de que a idéia era pelo menos viável. Nasceu assim o lows, ou Lisp for Old Web Servers.

Idéia

A idéia do projeto é criar uma linguagem Lisp-like que satisfaça os seguintes objetivos:

Compilar para PHP 5.2. A idéia é eu poder rodar o código resultante no inf.ufrgs.br (e idealmente escrever a próxima versão do blog em lows), então eu preciso "targetar" especificamente PHP 5.2. Eu também podia tentar convencer a galera da admrede a atualizar o servidor, mas (1) acho pouco provável que isso aconteça any time soon, e eu não estava a fim de esperar; (2) a essa altura eu já tinha tomado a limitação a PHP 5.2 como um desafio (lembre-se de que eu comecei o projeto para matar tempo enquanto o sono não vinha); (3) já existe um projeto similar, chamado Pharen, que targeta PHP 5.5, e eu queria um diferencial (a.k.a. desculpa) para justificar o meu projeto.

Gerar código PHP relativamente straightforward. Tanto quanto possível, o código PHP resultante da compilação deve ser uma tradução mais ou menos direta do código lows original. A idéia é facilitar a depuração (e em particular a tarefa de encontrar o código lows correspondente a um erro reportado no código PHP), e também a esperança de que quanto mais direto for o código resultante, menor o impacto na performance de escrever o código em lows ao invés de diretamente em PHP.

Integrar facilmente com PHP. Deve ser possível usar funções, classes, etc. do PHP a partir de código lows e vice-versa, sem necessidade de conversões, anotações e afins.

Manter uma essência Lisp-like. A idéia não é simplesmente criar um redressing de PHP em S-expressions, mas sim uma linguagem que permita programar em um estilo semi-funcional e "Lispy" e evite as bizarrices do PHP na medida do possível ~~(ao mesmo tempo em que introduz outras bizarrices)~~.

Esse conjunto de objetivos influencia tanto a implementação (que deve gerar um PHP relativamente limpo/direto) quanto o design da linguagem (que não deve fugir muito do PHP para permitir a tradução relativamente direta e a compatibilidade com código PHP).

Transformando Lisp em PHP

Expressões e statements

Um desafio que eu encontrei logo no começo é o fato de que o PHP faz uma distinção entre expressões e statements, que (mostly) não existe em Lisp. Em particular, coisas como if, let (declaração de variáveis locais) e progn (executa uma seqüência de expressões e retorna o valor da última, mais ou menos análogo a um bloco entre chaves em PHP, mas que produz um valor) são expressões em lows. O if em princípio até poderia ser traduzido para o operador ternário (test? then : else), e o let poderia ser mais-ou-menos contornado já que atribuição é uma expressão em PHP. O problema é que PHP não tem um operador vírgula como em C. Coisas como:

(+ 1 (let ((x 23)
           (y 42))
       (* x y)))

não possuem uma tradução direta para PHP, pois não é possível escrever 1 + ($x=23, $y=42, $x*$y). Uma solução gambiarrenta seria gerar:

1 + ((($x=23) || TRUE) ? ((($y=42) || TRUE) ? ($x*$y)
                                            : whatever)
                       : whatever)

o que simula o operador vírgula usando só um branch do operador ternário, mas: (1) isso não funciona no caso geral (em particular, se uma das expressões é um progn contendo um echo ou alguma outra coisa statementosa); (2) isso vai totalmente contra a idéia de gerar código straightforward. A solução é mover as atribuições para antes da soma, mas, no caso geral, só mover qualquer coisa que não seja uma expressão em PHP para antes da expressão não é suficiente: se os branches de um condicional contêm expressões com efeitos colaterais, não é possível movê-las para fora do condicional. Por exemplo, em algo como:

(defun print-and-return (val)
  (echo "O valor é " val "\n")
  val)

(+ 1 (if (> x y)
         (let ((a (print-and-return (- x y))))
           (* a a))
       0))

não é possível traduzir a soma para:

$a = print_and_return($x-$y);
1 + (($x>$y)? ($a*$a) : 0)

pois print_and_return não pode ser chamada antes que o teste $x>$y seja realizado.

[A essa altura talvez lhe ocorra (como me ocorreu) o pensamento: "Ok, e por que a gente simplesmente não proíbe expressões complexas desse tipo aninhadas em outras expressões? Quando é que eu vou usar isso anyway?" Mas esse é justamente o tipo de limitação tosca de que nós estamos tentando fugir criando uma nova linguagem ao invés de programar em PHP! "Do not tell me “that’s what you get for doing weird things”. If two features exist, someday, someone will find a reason to use them together."]

A solução que eu encontrei foi traduzir o (if ...) para um bloco if em PHP, armazenar o valor do if em uma variável temporária, e usar a variável temporária na soma. O exemplo anterior fica algo como:

if ($x>$y) {
    $a = print_and_return($x-$y);
    $tmp = $a*$a;
} else {
    $tmp = 0;
}

1 + $tmp

Isso significa que para traduzir uma expressão como (+ ...), pode ser necessário emitir blocos de código antes da tradução da soma propriamente dita. Conseqüentemente, a função de tradução não pode ser simplesmente algo como:

translate[(+ lhs rhs)] = translate[lhs] + translate[rhs]

pois tanto a tradução de lhs quanto de rhs podem requerer a inserção de blocos de código antes da soma (e a soma, por sua vez, pode estar aninhada em outra expressão).

A solução que eu encontrei para esse problema foi fazer as funções de tradução retornarem dois valores: a expressão equivalente em PHP, e uma lista de "efeitos", que são basicamente (mas não necessariamente) instruções para emitir código nas redondezas da tradução. Por exemplo, a função de tradução aplicada ao if do exemplo gera a expressão $tmp (que pode ser inserida no meio de outra expressão que usa o valor do if), e o efeito (EmitBefore bloco-if-em-PHP), que indica que o bloco-if-em-PHP deve ser inserido antes da expressão que contém o if na geração do código PHP. Como a inserção só pode ser realizada fora de uma expressão, o efeito é propagado pelas funções de tradução de expressões, até que ele chega em uma função que emite statements (e.g., o corpo de um bloco if do PHP, ou o corpo de uma função) e pode então ser emitido. Pseudocodiciosamente (oops, hmm):

translate[(+ lhs rhs)] =
   let
       lhs-trans; lhs-effects = translate[lhs]
       rhs-trans; rhs-effects = translate[rhs]
   in
       lhs-trans + rhs-trans; lhs-effects ++ rhs-effects


translate-statement[item] =
   let
       item-trans; item-effects = translate[item]
   in
       (código correspondente aos EmitBefore em item-effects) ++ item-trans ;
       (efeitos em item-effects excluindo os EmitBefore já processados)

O mesmo mecanismo pode ser usado para emitir código em outras situações (e.g., no caso do lambda, como veremos adiante), ou para coletar e propagar informações durante a tradução. Por exemplo, quando uma variável x que não possui declaração visível é usada, é emitido um efeito (Global x). A função que traduz o corpo de uma função coleta esses efeitos para gerar declarações do tipo global $x; no começo da função.

lambda

O próximo desafio foi traduzir o lambda para PHP. PHP >=5.3 possui closures (meio toscas – é necessário declarar explicitamente que variáveis são capturadas pela closure – mas elas existem), mas PHP 5.2 não. A próxima coisa que eu pensei foi usar uma classe "callable" com um método mágico __invoke, mas turns out que classes chamáveis só foram introduzidas em PHP 5.3 também. Porém, as funções que aceitam coisas chamáveis em PHP, como call_user_func e usort, aceitam arrays da forma Array(objeto, nome-de-método) como chamáveis. Pois, aí está algo que o lambda pode retornar.

Capturar as variáveis em uma closure mostrou-se bem mais fácil do que eu antecipava, graças às referências do PHP. Uma closure em lows é representada por uma classe com um membro/slot/propriedade/atributo/whatever para cada variável capturada. Quando a classe é instanciada, as variáveis são passadas por referência para o construtor. Dentro do corpo do lambda, referências a variáveis capturadas x são traduzidas para $this->x; como $this->x foi inicializado com uma referência ao $x capturado, o corpo do lambda vê a mesma variável $x através do atributo, inclusive refletindo modificações à mesma.

Como exemplo, algo como:

(defun adder (x)
  (lambda (n)
    (+ x n)))

(defun main ()
  (let ((f (adder 10)))
    (call_user_func f 5)))

vira algo como:

class Closure1 {
    function __construct(&$x) {
        // Captura de variáveis.
        $this->x = &$x;
    }

    function __invoke($n) {
        // Corpo do lambda.
        return $this->x + $n;
    }
}

function adder($x) {
    // Cria a closure, passando a variável a ser capturada para o seu construtor,
    // e retorna um valor que, quando chamado, chama o método "__invoke" da closure.
    return Array(new Closure1($x), "__invoke");
}

function main() {
    $f = adder(10);
    return call_user_func($f, 5);
}

E assim, o PHP e suas referências nos surpreendem positivamente (o que é uma surpresa in itself).

By the way, o mecanismo de efeitos aqui é usado para duas coisas: (1) a geração da classe antes da função que contém o lambda é feita propagando um efeito (EmitBeforeToplevel definição-da-classe); (2) cada referência a uma variável externa ao lambda gera um efeito (CapturedVar x); esses efeitos são coletados pela função que traduz o lambda para saber que atributos devem ser inicializados na classe e que argumentos devem ser passados ao construtor. Quando eu criei a treta dos efeitos eu não tinha pensado em todas essas aplicações, então mui me alegrou descobrir que eu podia reusar o mecanismo para essas coisas.

Name clashes

Em PHP, variáveis locais têm como escopo a função inteira onde se encontram, não apenas o bloco onde foram declaradas. Conseqüentemente, em código como:

(let ((x 23))
  (echo "x aqui vale 23: " x "\n")
  (let ((x 42))
    (echo "x aqui dentro vale 42: " x "\n"))
  (echo "x aqui fora ainda vale 23: " x "\n"))

não se pode usar o mesmo nome para as duas variáveis x na tradução, pois a definição mais interna de x sobrescreveria a mais externa. A solução é renomear uma das (ou ambas as) variáveis. O ideal seria fazer o mínimo de renomeações possível, para facilitar a leitura e depuração do código resultante. Porém, a implementação atual simplesmente renomeia todas as variáveis (adicionando um prefixo _número_), já que testar quando uma variável deve ser renomeada não é muito simples. Essa decisão não é local: mesmo não havendo nenhuma variável x visivelmente declarada no ponto onde ocorre um (let ((x 23)) ...), ainda assim é necessário renomear o x se em um ponto posterior da função uma variável global x for referenciada.

O algoritmo de renomeação / geração de nomes temporários assume que nomes iniciados por _número_ são reservados para o compilador. Acredito que isso não seja um problema na prática. (Para o caso de variáveis locais, uma variável _42_ vai ser renomeada para algo como _1__42_ de qualquer forma.) Um problema mais sério dessa abordagem é no escopo global, em particular nos nomes gerados para as classes que implementam closures (e.g., _1_Closure), pois esses nomes podem conflitar com closures criadas em outros arquivos (e.g., quando os resultados da tradução de múltiplos arquivos são incluídos com include em um programa PHP). Talvez uma solução seja incluir o nome do arquivo no nome da classe, ou gerar um hash a partir do código da closure (mas isso ainda gera conflito se um lambda idêntico aparece em outro arquivo), or something. I don't know. Também seria bom se o nome da classe fosse informativo o suficiente para indicar de onde saiu a definição no código original (e.g., _1_Closure_arquivo_função_linha). [Side remark: namespaces não existem em PHP 5.2.]

Um conflito de variável mais sutil é quando um let é executado múltiplas vezes e um lambda captura uma variável definida pelo let. Por exemplo, supondo a existência de uma construção while:

(let ((array-of-lambdas (array))
      (i 0))
  (while (< i 5)
    (let ((n 0))
      (array_push array-of-lambdas
                  (lambda ()
                    (set! n (+ n 1))
                    (echo n))))
    (set! i (+ i 1))))

Isso seria traduzido para algo como:

Class _1_Closure {
    function __construct(&$n) {
        $this->n = &$n;
    }

    function __invoke() {
        $this->n = $this->n + 1;
        echo $this->n;
    }
}

$array_of_lambdas = Array();
$i = 0;
while ($i < 5) {
    $n = 0;
    array_push($array_of_lambdas,
               Array(new _1_Closure($n), "__invoke"));
    $i = $i + 1;
}

O problema é que todas as iterações do loop usam a mesma variável $n, que é passada por referência ao construtor da closure; o correto seria cada iteração capturar um $n diferente. A solução é emitir uma chamada a unset($n) no final do while, de maneira que cada iteração crie uma variável nova, mas eu ainda não implementei isso.

PHP formatado

Um dos objetivos do projeto é gerar PHP legível, e isso envolve gerar código com indentação adequada. Depois de alguns false starts (na versão inicial, as funções de tradução geravam strings de código PHP diretamente, e a minha idéia original era usar caracteres especiais do ASCII como indicadores de "increase indentation" e "decrease indentation" quando o código fosse impresso, mas eu me dei conta de que não dava para escolher caracteres para isso porque qualquer caractere pode aparecer em uma string; além disso, misturar geração de código e questões de formatação estava ficando um bocado desagradável), eu resolvi fazer as funções de tradução gerarem estruturas representando árvores de sintaxe abstrata (ASTs) de PHP. Depois da tradução, as árvores são passadas a uma função print-php que trata dos detalhes sórdidos de imprimir o código com quebras de linha, indentações, espaços e parênteses nos lugares apropriados. Separation of concerns FTW.

O futuro

Como o post ficou grande, e eu deveria ir dormir, ficaremos por aqui. Em um post futuro, pretendo falar de algumas features que falta implementar, tais como classes, chamadas de métodos e demais firulas orientadas a objetos, bem como as decisões de design mais tricky (que eram o objetivo inicial do post, mas enfim). Quem tiver interesse, pode dar uma olhada no código no GitHub.

[English version follows.]

As I have been talking about, I started implementing a Lisp-like language which compiles to PHP 5.2, called lows. I don't know when I'm going to work on this project again, but I'll leave here some notes for my future self and whoever might be interested.

Prelude

It all began the Sunday before last, when I decided to play with my blog system, for a change. Working on the blog system is always an ambivalent experience, because on the one hand there is a bunch of ideas I would like to implement in it, but on the other hand I have to do it in PHP, as that is the only thing that runs at inf.ufrgs.br (and I don't plan to pay for hosting any time soon). I commented to a person that if I had the choice, I would have rewritten the blog in Scheme long ago. She asked me why I didn't make something to run Scheme in PHP, and I said I had already thought of writing a compiler from Lisp to PHP, but that I thought it was too much work just to be able to write the blog in Lisp. So, I went on with my Sunday messing with the blog in PHP.

After a number of kludges and another number of concessions to the oddities of PHP (e.g., there are the __call, __callStatic and __get magic methods, but no __getStatic, who knows why), I succeeded in rewriting the part of the blog responsible for multilingual messages in a way that pleased me, and I was no longer even finding it so horrible to write the code in PHP.

At the end of the day, after having tested the code on my local server, I decided to upload the new version to inf.ufrgs.br. To my surprise, PHP started reporting lots of errors in the code. It turns out that the version of PHP running at inf.ufrgs.br is "5.2.6-1+lenny16". For those who don't know, lenny is the codename of Debian 5. Debian 5 was released in 2009 and has not received updates since 2012. Three stable Debian releases have come out since then (stable releases of Debian come out more or less every two years). Meanwhile, I was running PHP 5.6.12 on Debian testing at home, and practically all the code I had written used features introduced in PHP 5.3.

After trying without much success to change the code a bit to see if I could get it to run on PHP 5.2, I decided to leave it alone and work on the code another day. However: (1) I was not willing to uglify my code just to make it run on an old PHP; (2) more generally, I wasn't willing to work with PHP again; and (3) I was having trouble sleeping that day. Conclusion: I started writing a Lisp→PHP translator, primarily for the lol. One more badly-slept night later, I had a Lisp→PHP translator (or compiler, if you prefer) that did little, but enough to convince me that the idea was at least feasible. Thus lows, or Lisp for Old Web Servers, was born.

Idea

The idea of the project is to create a Lisp-like language which satisfies the following criteria:

Compile to PHP 5.2. The idea is for me to be able to run the resulting code at inf.ufrgs.br (and ideally write the next version of the blog in lows), so I need to target specifically PHP 5.2. I could also try to convince the admins at INF to upgrade the server, but (1) I don't think that's going to happen any time soon, and I was not willing to wait; (2) at this point I had already taken the limitation to PHP 5.2 as a challenge (remember that I started the project to kill time while I couldn't sleep); (3) there is already a similar project, called Pharen, which targets PHP 5.5, and I wanted a distinctive feature (a.k.a. excuse) to justify my project.

Emit relatively straightforward PHP code. As much as possible, the PHP code resulting from compilation should be a more or less direct translation of the lows source. The idea is to ease debugging (and in particular the task of finding the lows code corresponding to a PHP error message), and also the hope that the more direct the resulting code, the smaller the impact on performance of writing the code in lows rather than directly in PHP.

Integrate easily with PHP. It must be possible to use PHP functions, classes, etc. from lows code and vice-versa, without requiring conversions, annotations and the like.

Keep a Lisp-like essence. The idea is not simply to make a redressing of PHP in S-expressions, but rather a language which enables programming in a semi-functional and "Lispy" style and avoids the oddities of PHP as much as possible ~~(while introducing new oddities of its own)~~.

This set of goals influences both the implementation (which must emit relatively clean/direct PHP code) and the design of the language (which must not stray too far from PHP, so as to allow a relatively direct translation and compatibility with PHP code).

Transforming Lisp into PHP

Expressions and statements

A challenge I found right at the beginning is the fact that PHP makes a distinction between expressions and statements, which (mostly) does not exist in Lisp. In particular, things like if, let (local variable declaration) and progn (runs a sequence of expressions and returns the value of the last one, more or less like a block in braces in PHP, but yielding a value) are expressions in lows. if in principle could be translated to the ternary operator (test? then : else), and let could be more-or-less worked around because assignment is an expression in PHP. The problem is that PHP does not have a comma operator like that of C. Things like:

(+ 1 (let ((x 23)
           (y 42))
       (* x y)))

don't have a direct translation to PHP, because it is not possible to write 1 + ($x=23, $y=42, $x*$y). A kludgy solution would be to emit:

1 + ((($x=23) || TRUE) ? ((($y=42) || TRUE) ? ($x*$y)
                                            : whatever)
                       : whatever)

which emulates the comma operator by using only one branch of the ternary operator, but (1) that doesn't work in the general case (in particular, if one of the expressions is a progn containing an echo or some other statement-y thing); (2) that goes totally against the idea of emitting straightforward code. The solution is to move the assignments to before the addition, but, in general, just moving anything that is not an expression to before the expression is not enough: if the branches of a conditional contain expressions with side effects, they cannot be moved out of the conditional. For instance, in something like:

(defun print-and-return (val)
  (echo "The value is " val "\n")
  val)

(+ 1 (if (> x y)
         (let ((a (print-and-return (- x y))))
           (* a a))
       0))

it is not possible to translate the addition to:

$a = print_and_return($x-$y);
1 + (($x>$y)? ($a*$a) : 0)

because print_and_return cannot be called before the test $x>$y is performed.

[At this point, perhaps the thought occurred to you (as it occurred to me): "Okay, why don't we just forbid complex expressions like these nested in other expressions? When will I use that anyway?" But that is exactly the kind of weird limitation which we are trying to escape from by creating a new language instead of programming in PHP! "Do not tell me “that’s what you get for doing weird things”. If two features exist, someday, someone will find a reason to use them together."]

The solution I found was to translate (if ...) to a PHP if block, store the value of the if expression into a temporary variable, and use the temporary in the addition. The previous example becomes something like:

if ($x>$y) {
    $a = print_and_return($x-$y);
    $tmp = $a*$a;
} else {
    $tmp = 0;
}

1 + $tmp

This means that to translate an expression like (+ ...), it may be necessary to emit blocks of code before the translation of the addition itself. As a consequence, the translation function cannot be just something like:

translate[(+ lhs rhs)] = translate[lhs] + translate[rhs]

because both lhs and rhs may require inserting blocks of code before the addition (and the addition itself may be nested in another expression).

The solution I found for this problem was to make the translation functions return two values: the equivalent expression in PHP, and a list of "effects", which are basically (but not necessarily) instructions to emit code in the surroundings of the translation. For example, the translation function, when applied to the example if, yields the expression $tmp (which can be inserted in the middle of another expression which uses the value of the if), and the effect (EmitBefore PHP-if-block), which indicates that PHP-if-block must be inserted before the expression containing the if when emitting the PHP code. Since the insertion can only be performed outside of an expression, the effect is propagated by the functions responsible for translating expressions, until it arrives at a function which emits statements (e.g., the body of a PHP if block, or a function body), where it can then be emitted. Pseudocodefully:

translate[(+ lhs rhs)] =
   let
       lhs-trans; lhs-effects = translate[lhs]
       rhs-trans; rhs-effects = translate[rhs]
   in
       lhs-trans + rhs-trans; lhs-effects ++ rhs-effects


translate-statement[item] =
   let
       item-trans; item-effects = translate[item]
   in
       (code corresponding to the EmitBefores in item-effects) ++ item-trans ;
       (effects in item-effects excluding those EmitBefores already processed)
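
The pseudocode above can be turned into a small runnable sketch. The following Python version is only an illustration of the technique, not the actual compiler (which is written in Chicken Scheme): the function names, the representation of forms as nested Python lists, and the encoding of effects as (kind, payload) tuples are all assumptions. It handles only binary operators, let, if and plain function calls, and it does no variable renaming.

```python
import itertools

_tmp_ids = itertools.count(1)


def split_effects(effects):
    """Separate EmitBefore payloads (lines of PHP) from all other effects."""
    before, rest = [], []
    for kind, payload in effects:
        if kind == "EmitBefore":
            before += payload
        else:
            rest.append((kind, payload))
    return before, rest


def paren(expr):
    """Parenthesize compound operands so nesting stays unambiguous."""
    return "(%s)" % expr if " " in expr else expr


def branch_lines(tmp, form):
    """Translate one branch of an if into indented statements assigning to tmp."""
    value, effects = translate(form)
    before, rest = split_effects(effects)
    lines = before + ["%s = %s;" % (tmp, value)]
    return ["    " + line for line in lines], rest


def translate(form):
    """Translate a form into a pair (php_expression, effects)."""
    if isinstance(form, int):
        return str(form), []
    if isinstance(form, str):
        return "$" + form.replace("-", "_"), []
    op = form[0]
    if op in ("+", "-", "*", "<", ">"):
        lhs, lhs_effects = translate(form[1])
        rhs, rhs_effects = translate(form[2])
        return "%s %s %s" % (paren(lhs), op, paren(rhs)), lhs_effects + rhs_effects
    if op == "let":
        effects = []
        for name, init in form[1]:
            init_expr, init_effects = translate(init)
            assignment = "$%s = %s;" % (name.replace("-", "_"), init_expr)
            effects += init_effects + [("EmitBefore", [assignment])]
        body, body_effects = translate(form[2])
        return body, effects + body_effects
    if op == "if":
        # The if becomes a statement block emitted before the enclosing
        # expression; the expression itself only refers to the temporary.
        test, test_effects = translate(form[1])
        tmp = "$tmp%d" % next(_tmp_ids)
        then_lines, then_rest = branch_lines(tmp, form[2])
        else_lines, else_rest = branch_lines(tmp, form[3])
        block = ["if (%s) {" % test] + then_lines + ["} else {"] + else_lines + ["}"]
        return tmp, test_effects + then_rest + else_rest + [("EmitBefore", block)]
    # Anything else is treated as a call to a PHP function.
    args, effects = [], []
    for arg in form[1:]:
        arg_expr, arg_effects = translate(arg)
        args.append(arg_expr)
        effects += arg_effects
    return "%s(%s)" % (op.replace("-", "_"), ", ".join(args)), effects


def translate_statement(form):
    """Emit pending EmitBefore code, then the expression itself as a statement."""
    expr, effects = translate(form)
    before, rest = split_effects(effects)
    return before + [expr + ";"], rest
```

Calling translate_statement on the earlier example, written as ["+", 1, ["if", [">", "x", "y"], ["let", [["a", ["print-and-return", ["-", "x", "y"]]]], ["*", "a", "a"]], 0]], produces the if block with the call to print_and_return inside its first branch, followed by a final line of the form 1 + $tmpN;.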

The same mechanism can be used to emit code in other situations (e.g., in the case of lambda, as we'll see later), or to collect and propagate information during translation. For example, when a variable x which has no visible declaration is used, a (Global x) effect is generated. The function responsible for translating functions collects those effects to generate global $x; declarations at the beginning of the function.
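
The (Global x) case can be sketched in the same spirit. The Python fragment below is a toy under stated assumptions: the function names are invented, effects are (kind, payload) tuples, and a function body is restricted to a single addition of two variables, which is just enough to show an effect being raised at a variable reference and consumed where the function is emitted.

```python
def translate_var(name, declared):
    """Translate a variable reference; undeclared names raise a Global effect."""
    effects = [] if name in declared else [("Global", name)]
    return "$" + name, effects


def translate_add(lhs, rhs, declared):
    """Translate (+ lhs rhs) for two variables, concatenating their effects."""
    lhs_expr, lhs_effects = translate_var(lhs, declared)
    rhs_expr, rhs_effects = translate_var(rhs, declared)
    return "%s + %s" % (lhs_expr, rhs_expr), lhs_effects + rhs_effects


def translate_function(name, params, lhs, rhs):
    """Translate (defun name params (+ lhs rhs)) into lines of PHP."""
    expr, effects = translate_add(lhs, rhs, declared=set(params))
    # Collect the Global effects, keeping the first occurrence of each name.
    global_names = []
    for kind, var in effects:
        if kind == "Global" and var not in global_names:
            global_names.append(var)
    lines = ["function %s(%s) {" % (name, ", ".join("$" + p for p in params))]
    lines += ["    global $%s;" % var for var in global_names]
    lines += ["    return %s;" % expr, "}"]
    return lines
```

For example, translate_function("f", ["a"], "a", "counter") emits a function whose body starts with global $counter; because counter is not among the parameters.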

lambda

The next challenge was to translate lambda to PHP. PHP >=5.3 has closures (somewhat crappy ones – one must declare explicitly which variables are to be captured by the closure – but they exist), but PHP 5.2 doesn't. The next thing I thought of was to use a "callable" class with an __invoke magic method, but it turns out that callable classes were introduced only in PHP 5.3 too. However, the functions which accept callable things in PHP, such as call_user_func and usort, accept arrays of the form Array(object, method-name) as callables. So, this is something that lambda can return.

Capturing variables in a closure proved much easier than I anticipated, thanks to PHP references. A closure in lows is represented as a class with a member/slot/property/attribute/whatever for each captured variable. When the class is instantiated, the variables are passed by reference to the constructor. Inside the body of lambda, references to captured variables x are translated to $this->x; because $this->x was initialized with a reference to the captured $x, the lambda body sees the same variable $x through the attribute, even reflecting modifications to it.

As an example, something like:

(defun adder (x)
  (lambda (n)
    (+ x n)))

(defun main ()
  (let ((f (adder 10)))
    (call_user_func f 5)))

turns into something like:

class Closure1 {
    function __construct(&$x) {
        // Variable capture.
        $this->x = &$x;
    }

    function __invoke($n) {
        // lambda body.
        return $this->x + $n;
    }
}

function adder($x) {
    // Create the closure, passing the variables to be captured to the constructor,
    // and return a value that, when called, calls the closure's "__invoke" method.
    return Array(new Closure1($x), "__invoke");
}

function main() {
    $f = adder(10);
    return call_user_func($f, 5);
}

And so, PHP and its references surprise us positively (which is a surprise in itself).

By the way, the effects mechanism is used here for two things: (1) emitting the class before the function containing the lambda is done by propagating an (EmitBeforeToplevel class-definition) effect; (2) each reference to a variable external to the lambda generates a (CapturedVar x) effect; these effects are collected by the function responsible for translating lambda to find out which attributes must be initialized in the class and which arguments must be passed to the constructor. When I came up with the effects idea I hadn't thought about all those applications, so it much gladdened me to find out I could use the mechanism for those things too.
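
Here is how those two uses of effects might fit together, as a hedged Python sketch rather than the real implementation. The names translate_lambda_var and translate_lambda are hypothetical, the lambda body is restricted to a single addition of two variables, and the class name is passed in by the caller instead of being generated.

```python
def translate_lambda_var(name, params):
    """Inside a lambda body, parameters stay plain; anything else is captured."""
    if name in params:
        return "$" + name, []
    return "$this->" + name, [("CapturedVar", name)]


def translate_lambda(class_name, params, lhs, rhs):
    """Translate (lambda params (+ lhs rhs)) into (php_expression, effects)."""
    lhs_expr, lhs_effects = translate_lambda_var(lhs, params)
    rhs_expr, rhs_effects = translate_lambda_var(rhs, params)
    # The CapturedVar effects tell us which attributes and constructor
    # arguments the generated class needs.
    captured = []
    for kind, var in lhs_effects + rhs_effects:
        if kind == "CapturedVar" and var not in captured:
            captured.append(var)
    class_lines = [
        "class %s {" % class_name,
        "    function __construct(%s) {" % ", ".join("&$" + v for v in captured),
    ]
    class_lines += ["        $this->%s = &$%s;" % (v, v) for v in captured]
    class_lines += [
        "    }",
        "",
        "    function __invoke(%s) {" % ", ".join("$" + p for p in params),
        "        return %s + %s;" % (lhs_expr, rhs_expr),
        "    }",
        "}",
    ]
    expr = 'Array(new %s(%s), "__invoke")' % (
        class_name, ", ".join("$" + v for v in captured))
    # The class itself travels outward as an effect until it reaches toplevel.
    return expr, [("EmitBeforeToplevel", class_lines)]
```

Applied to the lambda in adder, translate_lambda("Closure1", ["n"], "x", "n") returns the expression Array(new Closure1($x), "__invoke") together with one EmitBeforeToplevel effect carrying the class definition shown earlier (minus the comments).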

Name clashes

In PHP, local variables have the scope of the entire function where they are created, not just the block where they were declared. As a consequence, in code like:

(let ((x 23))
  (echo "x here is 23: " x "\n")
  (let ((x 42))
    (echo "x here inside is 42: " x "\n"))
  (echo "x out here still is 23: " x "\n"))

we cannot use the same name for both x variables in the translation, because the innermost definition of x would overwrite the outermost one. The solution is to rename one (or both) of the variables. Ideally we should perform as few renames as possible, to make the resulting code easier to read and debug. However, the current implementation simply renames all variables (adding a _number_ prefix), since testing whether a variable must be renamed is not very simple. The decision is non-local: even if there is no visible declaration of a variable x at the point where a (let ((x 23)) ...) occurs, it is still necessary to rename x if at some later point in the function a global variable x is referenced.

The renaming / temporary name generation algorithm assumes that names beginning with _number_ are reserved to the compiler. I think this is not a problem in practice. (In the case of local variables, a variable _42_ would be renamed to something like _1__42_ anyway.) A more serious problem with this approach is at the global scope, in particular in the names of generated classes which implement closures (e.g., _1_Closure), because those names may conflict with closures created in other files (e.g., when the translation results of multiple files are included into a single PHP program). Perhaps a solution is to include the file name in the name of the class, or to compute a hash from the closure code (but this would still cause conflicts if an identical lambda appears in another file), or something. I don't know. It would also be nice if the class name were descriptive enough to indicate where the definition came from in the source code (e.g., _1_Closure_file_function_line). [Side remark: namespaces don't exist in PHP 5.2.]

A more subtle variable conflict occurs when a let is executed multiple times and a lambda captures a variable defined by the let. For example, supposing the existence of a while construction:

(let ((array-of-lambdas (array))
      (i 0))
  (while (< i 5)
    (let ((n 0))
      (array_push array-of-lambdas
                  (lambda ()
                    (set! n (+ n 1))
                    (echo n))))
    (set! i (+ i 1))))

This would be translated to:

class _1_Closure {
    function __construct(&$n) {
        $this->n = &$n;
    }

    function __invoke() {
        $this->n = $this->n + 1;
        echo $this->n;
    }
}

$array_of_lambdas = Array();
$i = 0;
while ($i < 5) {
    $n = 0;
    array_push($array_of_lambdas,
               Array(new _1_Closure($n), "__invoke"));
    $i = $i + 1;
}

The problem is that all iterations of the loop use the same $n variable, which is passed by reference to the closure constructor; the correct behavior would be for each iteration to capture a different $n. The solution is to emit an unset($n) at the end of the while body, so that each iteration creates a new variable, but I haven't implemented this yet.

Pretty-printed PHP

One of the goals of the project is to emit readable PHP code, and this involves emitting properly indented code. After some false starts (in the initial version, the translation functions emitted PHP code strings directly, and my original idea was to use some special ASCII characters to indicate "increase indentation" and "decrease indentation" when printing, but I realized that I could not choose any characters for that because any character can appear in a string; moreover, mixing code generation and formatting questions was becoming rather ugly), I decided to make the translation function emit structures representing PHP abstract syntax trees (ASTs). After translation, the trees are passed to a print-php function, which takes care of the gory details of printing the code with line breaks, indentation, spaces and parentheses at the proper places. Separation of concerns FTW.

The future

As this post turned quite long, and I should get some sleep, we'll finish here. In a future post, I intend to talk about some features that are still missing, such as classes, method calls and other stuff, as well as trickier design decisions (which were the initial goal of this post, but anyway). If you are interested, you can look at the code on GitHub.


lash status update

2015-05-14 23:42 -0300. Tags: comp, prog, pldesign, lash, life, em-portugues

It has been almost two months since lash's first commit. The project's status is as follows:

Over the past week, the original parser, based on the Comparse parser combinator library, was replaced by a hand-written recursive descent parser. The main reasons for the rewrite were that Comparse does not keep line and column information for the parsed elements, apparently has no support at all for error reporting (the parser simply backtracks when it runs into an error, until the initial parser backtracks and returns #f), and the parser was behaving strangely on some inputs (which is not Comparse's fault, but there was no point in wasting time debugging it if I would have to rewrite the parser sooner or later anyway for the other reasons above). The handling of spaces and newlines in the old parser was also a mess, whereas in the current one everything apparently works as it should in that regard.

The new parser recognizes almost the entire language planned for the "initial release", throws exceptions at the right points in the code when it finds syntax errors (the exceptions are not very descriptive yet, but it's a start), and stores the line and column of each construct in the syntax tree nodes (with small errors, but nothing hard to fix). The code is still kind of crude, and there is a lot that can still be refactored (e.g., repetitions that are coded as explicit loops instead of a construct that abstracts the repetition), but that will sort itself out over time. Maybe the more abstract parts of the parser could even become a parser combinator library in the future (with the difference that I am using a mutable struct and exceptions instead of a monad to keep the parser state and signal errors, which would be kind of unusual for a parser combinator library, but whatever; nobody said they would be functional parser combinators).

The new parser recognizes more constructs than the rest of the shell is able to execute (for example, pipelines, redirects, &, &&, ||, $()... in other words, practically all of the shell's functions), which in a way is good, because it compels me to implement the missing things. Over the last few days development has been going at a more or less okay rate (for me), and I think it is more or less realistic to expect a more or less functional 0.1* release by July. At least that is (more or less) what I hope. That is good, because in a moment of optimism in March I submitted a talk proposal about the shell to FISL and, in the vague possibility that it gets accepted, it would be good for the shell to be in a usable state by then. (I submitted the proposal under the premise that if everything went wrong with the shell and I got accepted, I could just ask them to drop the talk, but at this point I think that won't be necessary anymore. All of this assuming I get accepted, which would be pretty crazy, honestly.)

I started using the project's wiki on GitHub to take notes. The plan is for it to eventually contain:

A semi-formal specification of the syntax and semantics of the shell's constructs, discussing possible corner cases and the motivations behind the decisions;
Information about the shell's internals, to make life easier for anyone who wants to hack on the code;
Discussion of ideas and proposals for the shell;
A description of the shell's builtins (i.e., the "standard library"), something that has not been discussed much around here so far, but which is an important aspect of the shell;
Eventually, tutorials and documentation aimed at the end user.

Apparently you need to create a GitHub account to edit the wiki, but I think we can live with that. Contributions are always welcome.

As for the shell language, a few things have changed:

For now, I have changed the syntax for capturing a function's return value from $[...] to ${...}. The reason is that I don't know what to do with [, since there is at least one command (specifically, /usr/bin/[) that expects the characters [ and ] on their own not to be treated as special characters. On the other hand, I can't think of any other situation in which [ and ] are used literally in the shell, and the [ command becomes kind of obsolete given that there will be builtins providing the same functionality, so I don't know. ${...} has the advantage of looking more like "capture the value of this block", but on the other hand it takes the place of the conventional ${var} from (ba)sh, which serves to separate a variable name from other nearby characters, and for which I don't know what syntax to assign in lash. I thought of $<var>, but I had other possible plans for $< (redirection of the legendary object channel). Besides, maybe ${var} is one of those things that are too traditional to be changed. Opinions?
The distinction between array and dictionary will probably be thrown out the window; a dictionary whose keys are all contiguous positive integers can be represented internally as an array for performance, but I think that distinction at the language level is an unnecessary distraction in a shell. That leaves the question of indexing with a string vs. a numeric expression (i.e., does $var[index] evaluate index as an arithmetic expression, or is it necessary to write something like $var[$((index))] for that? Would there be separate operators for indexing with a numeric expression vs. a string? Are arithmetic operations on the index frequent enough in scripts to justify having them happen by default after all?). That still has to be looked into.
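For comparison, here is a small sketch of how bash itself answers the indexing question: it keeps the array/dictionary distinction and lets the kind of container decide how the subscript is read. The variable names are made up for the example, and associative arrays assume bash 4 or later.

```shell
#!/usr/bin/env bash
i=1

# Indexed array: the subscript is evaluated as an arithmetic expression.
list=(zero one two three)
by_expr=${list[i+1]}          # i+1 evaluates to 2
by_arith=${list[$((i+1))]}    # the same index, spelled out explicitly

# Associative array (bash 4+): the subscript is just a string.
declare -A dict=([i+1]="string key" [2]="numeric-looking key")
by_string=${dict[i+1]}

echo "$by_expr / $by_arith / $by_string"
```

A single container type, as planned for lash, has no such hint to go by, which is why the question of a default (or of separate operators) comes up at all.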

While developing the new parser I discovered a "bug" in Chicken that causes variables created with define inside a cond to be declared as globals. The folks on the mailing list seem to be of the opinion that this is not a bug but a feature, though. Meanwhile, I worked around the problem in lash by redefining cond to wrap the clauses in an implicit (let () ...) (which creates a "scope boundary" that makes the definitions local), and as a bonus to throw an exception if no clause is true. Scheme, yay.

While lash is going swimmingly, the Master's is going down the drain, but that is possibly a subject for another post.

_____

* At the moment I am not numbering the versions, but under the planned numbering scheme (<major>.<minor>.<number of commits since the last minor update>), we would be at version 0.0.31. It seems like a lot, but that is because I have the habit of committing like crazy while I am working on the code.


Mind dump

2015-04-10 01:34 -0300. Tags: comp, prog, pldesign, lash, life, mind, ramble, music, em-portugues

I put lash on GitHub, for what it's worth. I wonder whether it was sensible to publish it now, but I had been announcing for a while that I would publish it "soon", so I just put it up there. (Besides, the other day I wanted to work on it away from home and didn't have the code.) The code is at a very early stage, embarrassingly so, given that it has been some three weeks since I started working on it, and what I have done so far I could probably have done in about three days if I had had the discipline to work on it semi-daily. On the other hand, the project is moving forward, even if slowly, which is already better than all the previous years in which I said "gee, I'd like to write a shell" and did not write a single line of code. So, that's progress. Besides, under the current circumstances I should probably try to relax a bit and worry less about this; after all, this is a personal project and I don't owe anything to anyone. In the end, megalomaniac dreams of world domination aside, the main person affected by the project's success is myself.

I have come to the brilliant and unheard-of conclusion that I will have to cut down a lot on my Twitter and general Internet activity if I want to start doing something productive with my life (where writing a shell and studying obscure languages count as productive things). On the other hand, the Internet currently accounts for about 95% of my social interactions, especially now that I no longer have classes, which I kind of miss, despite my reputation as anti-social. The solution is probably to (shudder) leave the house and talk to people.

I have also come to the (equally brilliant) conclusion that a lot in this life is a matter of building habits. For example, until a few weeks ago I used to use all the dishes in the house until there were no clean ones left, at which point I would run the garbage collector and wash everything (or, depending on laziness, just what I needed at the moment). I realized this was not very convenient and decided to start washing things right after using them, or before going to bed. At first it was kind of annoying to have to "force" myself to do it, but now I have gotten used to it and it doesn't bother me that much anymore (besides, since the dishes don't pile up, the washing effort is usually small). Maybe it is a matter of building the habit of sitting down for an hour a day to program/study/whatever. The flip side is that throughout life we also get used to a bunch of things that we should question and/or throw out the window, not only habits of action but also (and mainly) habits of mind. These are my (brilliant and unheard-of) words of wisdom for today. (Technically every habit is mental, but you get the point. I think.)

Writing lash in Chicken Scheme has been quite a pleasant experience. I am learning (the non-R5RS part of) the language as I go, but so far the language has not let me down, the implementation is stable and generates small and reasonably fast executables, and the package system works. (Running chicken-install package and consistently seeing the package get downloaded and compiled without errors was almost shocking at first. The fact that libraries are real shared objects (a.k.a. DLLs) and load instantly also gladdened my spirit, especially given my previous experience with libraries in Common Lisp.) The only thing that leaves a bit to be desired is error reporting, but nothing deal-breaking.

I realized that one of the things I like most about "dynamic" languages is the ability to run an incomplete program. I have kind of written about this before, but I had forgotten how deeply satisfying it is to be able to run a half-written program and see the part written so far working. On the other hand, it is quite annoying to get a function name or its arguments wrong and only find out at run time. For a long time I have thought the ideal would be a language with static type analysis, but in which type errors generate warnings instead of preventing compilation, and which allows optional declaration of the types of variables and functions. A difficulty I saw in this until now is that, while in a dynamic language data usually has a uniform in-memory representation that carries a tag indicating the type of the datum, and it is therefore possible to call a function with arguments of the wrong type and detect that at run time, in a conventional statically typed language data usually has an untagged/unboxed representation of variable size, which would make it impossible to compile a program with a type error without violating the safety of the language (e.g., if a function f takes a vector and I call it with an int, either I reject the program at compile time, which hampers my ability to run incomplete/incorrect programs, or I generate a program that interprets my int as a vector without this being detected at run time, which will probably cause a segfault or something worse). However, the other day I realized that instead of compiling an (unsafe) call to f(some_int), one could simply (besides generating the warning) compile a call to error(f, some_int), where error is a function that evaluates its arguments and throws an exception describing the type error. The practical result is that the generated executable runs up to the point where it is safe to run (including evaluating the function and the arguments) and halts execution at the point where it would be necessary to call the function with an argument of incompatible type/representation. Best of both worlds, no? It goes into my little notebook of ideas for the Perfect Language™.

I was going to write a few more notes about life, but lately I have been rather wary of publishing things about my personal life, which is probably a good idea. It is easy to forget that anyone, now or in the future, can read what we write on this Internet thing. I have also talked about this a thousand times before, which makes it all the more surprising that I still have to remind myself of it occasionally. Anyway.

Unrelated to anything, I recently discovered a little band called Clannad, to which I am currently totally addicted. I also discovered something totally excellent called Galandum Galundaina, a Mirandese band for sure.


Blueprints for a shell, part 4: Ramblings on syntax

2015-03-17 01:10 -0300. Tags: comp, prog, pldesign, shell, lash, em-portugues

This post is part of a series about lash, a shell I am devising.

Today we will discuss some syntactic issues of the shell. After that I will probably take a break from the series and try to implement a prototype of lash, even with some questions still open. In particular, control structures still need to be covered (but the basics (if, while, each) don't leave much to discuss), as do modules (which will be left for the future).

My goal in choosing the shell's syntax is to find a balance between total syntactic minimalism (e.g., S-expressions¹) and having special syntax for everything (e.g., bash). In general, the guiding principle is to expose most of the shell's functionality through functions, and to use special syntax only when writing a function call would be inconvenient, especially for features frequently used in interactive mode (e.g., redirects and pipelines). This post is a survey of the syntactic elements of (ba)sh and how they will be represented in lash.

Simple commands

The basic syntax of a command in (ba)sh is, in sloppy BNF:

command ::= {var=value}* {word | redirect}*

The semantics are: if there are words in the command, the first word is the name of the command to be executed, and the rest are the arguments. The command is executed in an environment extended with the specified environment variables, and with the redirects in effect. If there are no words, the specified variables (shell variables, not environment variables) receive the assigned values, and the redirects... well, apparently they do nothing beyond opening (and possibly creating) the files, but that depends on the sh variant, because the behavior is apparently not fully specified in the POSIX standard. The order in which things are evaluated is also a bit peculiar:

bash# a=$(date >&2) uname $(pwd >&2) 2>/dev/null
/tmp
Mon Mar 16 21:27:09 BRT 2015
Linux

dash# a=$(date >&2) uname $(pwd >&2) 2>/dev/null
/tmp
Linux

It is worth noting that redirects and words can appear interleaved on the command line (in fact my BNF is wrong, because redirects can appear interleaved with the assignments too); the order in which they appear relative to the other syntactic elements seems to be irrelevant.
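To make the interleaving concrete, here is a minimal sketch in plain sh/bash. The file names and the LASH_DEMO variable are made up for the example.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)

# Three spellings of the same simple command: the redirect may come
# before, between, or after the words.
>"$tmp/a.txt" echo hello world
echo >"$tmp/b.txt" hello world
echo hello world >"$tmp/c.txt"

a=$(cat "$tmp/a.txt")
b=$(cat "$tmp/b.txt")
c=$(cat "$tmp/c.txt")
echo "$a / $b / $c"

# A prefixed assignment only affects the environment of that one command.
LASH_DEMO=bar sh -c 'echo "child sees LASH_DEMO=$LASH_DEMO"'
echo "parent sees LASH_DEMO=${LASH_DEMO-unset}"

rm -rf "$tmp"
```

All three files end up with the same contents, and the parent shell never sees the prefixed variable.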

In lash, after much hesitation, I decided to throw prefixed assignments out the window; the Unix env command already serves to run commands in a modified environment (env FOO=bar command). I thought about requiring redirects to appear at the end, but I realized it can be useful to write an interleaved redirect in commands that take blocks, e.g.:

each_line </etc/passwd {|line|
    echo "bla bla $line"
}

I don't know yet how useful that can be, but it stays for now. That leaves the question of evaluation order. Removing prefixed variables is one less thing in the equation. As for the moment at which redirects take effect, there are a few possibilities:

Before everything else, affecting even calls to commands with $(...), $[...] and company. There is the detail that the redirect itself may also involve evaluation (ls >$[generate-a-file-name]). In that case the redirect evidently can only take effect after that command has run.
After everything has been evaluated and immediately before executing the command proper. Apparently this is what bash does.
The redirect affects the evaluation of everything that appears after it on the command line, i.e., 2>/dev/null foo $(bar) affects the execution of bar, but foo $(bar) 2>/dev/null does not.

For now the plan is to do as bash does, primarily just because.

There is still the question of assignment, already mentioned earlier: use a command for assignment (set x = 42), or treat = specially in the parser? I don't much like special cases, but maybe assignment deserves special treatment. I don't even know whether assignment (as opposed to defining a new variable) is frequent enough in a script to justify a special case.

Quoting

bash has a number of quote-like things:

Backslash (\): removes the special meaning of the following character. It does not perform C-style escapes (e.g., \n).
Single quotes ('...'): remove the special meaning of all enclosed characters, except ' itself. This means it is not possible to have a single quote inside a single-quoted stretch, since not even \ has special meaning inside them.
Double quotes ("..."): $, \ and ` (old syntax for command substitution) keep their special meanings, as do the double quotes themselves. It is possible to write a double quote inside double quotes by preceding it with \. A backslash inside double quotes removes the special meaning of the following character, if it has one, and is kept as a literal backslash otherwise. "abc\\def" and "abc\def" result in the same string, because in the first case the first backslash escapes the second, and in the second the backslash is kept literally, since d has no special meaning. Again, it does not serve for C-like escapes.
Dollared single quotes ($'...'): work like single quotes, but the backslash has its usual meaning from C and other languages, i.e., $'foo\nbar' includes a line break. (bash extension.)
Dollared double quotes ($"..."): look up a localized version of the string using gettext and heavy voodoo. I have never seen a script use them that was not an example of how to use them. (bash extension.)
Backquotes (`...`): old syntax equivalent to $(...) (with the difference that $(...) can be nested directly, while ` can only be nested by escaping the inner backquotes).
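The rules above can be checked with a few assignments. This is just a sketch to run in bash; the variable names are made up for the example.

```shell
#!/usr/bin/env bash
name='world'

backslash=\$name                # backslash removes the special meaning of $
single='$name \n'               # nothing is special inside single quotes
double="hi $name \$name \d"     # $ expands, \$ is a literal $, \d keeps its backslash
same1="abc\\def"                # the first backslash escapes the second
same2="abc\def"                 # d is not special, so the backslash stays
ansi=$'foo\nbar'                # bash extension: C-style escapes
old=`echo nested`               # old command substitution syntax
new=$(echo "$(echo nested)")    # $(...) nests without extra escaping

printf '%s\n' "$backslash" "$single" "$double" "$same1" "$same2" "$ansi" "$old" "$new"
```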

The plan for lash is:

Single quotes keep working the same way, removing the special meaning of everything, including backslashes.
The backslash (outside single quotes) gains its usual C powers before letters and digits. However, this may get in the way of commands that interpret backslashes on their own (e.g., sed). In many cases, these commands interpret backslashes only because sh does not, so the change would make no difference in practice, but some commands give different meanings to the backslash (e.g., sed "s/foo$.*$bar/\1/"). This problem can be worked around by using single quotes, but that does not work so well when shell variables are involved. Maybe this idea of interpreting backslashes by default is not so brilliant after all. Maybe digits could be left out of the special characters recognized after \, and backslashes followed by characters without special meaning would be kept literally in the output. I don't know.
Double quotes work as in bash (except that the backslash has the extra powers mentioned above).
Both kinds of dollared quotes are eliminated. (Unless the backslash plan is aborted, in which case dollared single quotes stay.)
Backquotes lose the meaning of command substitution. In principle they would be left vacant, but I was thinking of repurposing them for a mechanism of "extensible quotes". The idea is that foo`whatever` would be equivalent to $[quote-foo whatever], where whatever is a string with the literal content between the backquotes. With this it would be possible to implement arbitrary rules for interpreting the string (the semantics of "localized" quotes could be implemented this way, for example, possibly in a library, far away from the shell's core). For great justice, it would also be possible to accept a suffix on the string, which would be passed as an argument to the command, e.g., foo`whatever`bar would be equivalent to $[quote-foo whatever bar].
That leaves the question of how to write a backquote between backquotes. One thing I thought of is that the quotation could start with an arbitrary number of backquotes and end only when the same number of backquotes is found, e.g.:
```
foo```I can write ` and `` here```
```
but this fails if I want the string to start or end with a `.
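A rough sketch of that delimiter rule, written as a bash function. unquote_backticks is a hypothetical helper invented for this example (it is not part of lash or bash), and it handles only the quoted part, without the foo prefix or the suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the body of a quotation that opens with a run
# of N backquotes and closes at the next run of N backquotes.
unquote_backticks() {
    local text=$1 fence=''
    # Collect the opening run of backquotes.
    while [[ $text == '`'* ]]; do
        fence+='`'
        text=${text#'`'}
    done
    [[ -n $fence ]] || return 1
    # The body is everything before the first occurrence of the same run.
    [[ $text == *"$fence"* ]] || return 1
    printf '%s\n' "${text%%"$fence"*}"
}

t='`'    # a single backquote, to keep the sample strings readable
sample="${t}${t}${t}I can write ${t} and ${t}${t} here${t}${t}${t}"
unquote_backticks "$sample"

# The limitation mentioned above: a body meant to end with a backquote
# cannot be told apart from a longer closing run.
unquote_backticks "${t}${t}abc${t}${t}${t}"
```

The second call yields abc and leaves a stray backquote behind, and a body meant to start with a backquote would simply be absorbed into the opening run.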

Another use for strings with a (semi-)arbitrary delimiter is that they would cover the functionality of bash's "here-documents", which we will see next.

Here-documents

Here-documents make it possible to embed a piece of text, delimited by a string of your choice, to be sent to the standard input (or another file descriptor) of the command to be executed:

cat <<FIM >foo.txt
The quick brown fox
jumps over the lazy dog.
FIM

By default, the shell performs substitutions on the contents of the here-document. If the delimiter is quoted/escaped, the contents are interpreted literally. In addition, if the delimiter is preceded by - (i.e., <<-), leading tabs (but not spaces) on each line are discarded.
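A small sketch of the quoted vs. unquoted delimiter in bash; the variable names are made up for the example.

```shell
#!/usr/bin/env bash
animal=fox

# Unquoted delimiter: the shell performs substitutions in the body.
expanded=$(cat <<END
The quick brown $animal
END
)

# Quoted delimiter: the body is taken literally.
literal=$(cat <<'END'
The quick brown $animal
END
)

printf '%s\n' "$expanded" "$literal"
```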

At some version bash also introduced "here-strings", which make it possible to use a simple string instead of a multi-line document as input:

sed 's/foo/bar/' <<<"$content"

If lash adopted a mechanism for strings with (semi-)arbitrary delimiters, such as the backquote described earlier, it would be possible to unify these two cases. Strings with an arbitrary delimiter can also be used to initialize variables, for example, something that here-documents in bash cannot do directly.
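For reference, a sketch of both points in bash: a here-string feeding a command, and the kind of detour needed to get a here-document into a variable (here via read; wrapping cat in $(...) is another common option). The variable names are made up for the example.

```shell
#!/usr/bin/env bash
content='foo one
foo two'

# Here-string: the expanded word (plus a trailing newline) becomes stdin.
replaced=$(sed 's/foo/bar/' <<<"$content")

# A here-document feeds a command, so capturing one in a variable needs a
# helper. read -d '' reads up to end of input and returns non-zero there,
# hence the "|| true".
IFS= read -r -d '' text <<'END' || true
line 1
line 2
END

printf '%s\n' "$replaced"
printf '%s' "$text"
```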

Parameter substitution

bash has a dozen things of the form ${varsomething}, which make it possible to apply some transformation to the value of a variable. Besides the syntax being abstruse, the string to be manipulated has to be stored in a variable (it cannot be the result of another substitution, for example; to apply multiple substitutions it is necessary to store the partial results in a variable). The plan in lash is to substitute all substitutions (heh) with functions.

There is a small problem involved: bash distinguishes between ${var//$match/$replacement} and ${var//"$match"/$replacement}. In the first case, *, ? and the like inside $match have their globbing meanings, while in the second they are interpreted literally. This problem affects other things that work with patterns. In the linked comment (which deals with the glob function, which returns a list of the files matching a pattern), the solution I found was to use a format string to separate the parts that must be interpreted as a pattern from the parts that must be interpreted literally (just as printf in C separates the control string from strings included with %s and used literally), but in the case of substitutions I don't know if that would be very convenient; maybe by grouping the pattern and its arguments in an array:

# Equivalent to ${string//"$match"*/"$replacement"} in bash.

subst $string ("%s*" $match) $replacement

Kinda weird, but I can live with it. Actually, I think the best would be to treat the pattern as literal by default, otherwise someone is bound to write $[subst $var $match $replacement] without even thinking about whether $match contains asterisks or not, and then it will be another of those situations in which a script works 99% of the time, until one day someone decides to use a string with * and the script behaves unexpectedly. The syntax of subst could be:

subst STR MATCH REPLACEMENT replaces all occurrences of MATCH (literal) with REPLACEMENT in STR.
subst STR -pat MATCH REPLACEMENT treats MATCH as a pattern, where * and the like have special meaning.
subst STR -pat (MATCH-FMT ARGS...) REPLACEMENT makes it possible to specify a control string and arguments. The previous case is equivalent to using a control string without arguments.
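To get a feel for the first two forms, here is a rough approximation as a bash function. It is only a sketch built on bash's own ${var//...} (so it inherits bash's glob semantics), not lash code, and the format-string form is left out because bash cannot pass an array as a single argument.

```shell
#!/usr/bin/env bash
# Hypothetical bash approximation of the proposed lash command:
#   subst STR MATCH REPLACEMENT        MATCH is literal
#   subst STR -pat MATCH REPLACEMENT   MATCH is a glob pattern
subst() {
    local str="$1" result
    if [[ $2 == -pat ]]; then
        local pat="$3" rep="$4"
        result=${str//$pat/"$rep"}      # unquoted: glob characters are active
    else
        local match="$2" rep="$3"
        result=${str//"$match"/"$rep"}  # quoted: every character is literal
    fi
    printf '%s\n' "$result"
}

subst 'a*b*c' 'b*' '+'        # literal match: prints a*+c
subst 'a*b*c' -pat 'b*' '+'   # glob match:    prints a*+
```

With the literal default, a $match that happens to contain an asterisk changes nothing about how the script behaves; the glob reading has to be asked for explicitly.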

What is your opinion?

Another situation that uses patterns and suffers from the same problem is case, which in principle is to be an ordinary command without special syntax (case STRING (PATTERN-1 BLOCK-1 ... PATTERN-N BLOCK-N)²). Ideally the syntax adopted for substitutions should be used for case as well.

And, or, not

In (ba)sh, command1 && command2 executes command1 and, if it returns 0 (i.e., true), executes command2. The exit status of the command as a whole is the exit status of the last command executed. Analogously, command1 || command2 executes command1 and, if it returns non-zero (i.e., false), executes command2. In both cases, each command is a "complete command", which may involve pipelines. There are two main use cases for these operators:

Combining tests in an if or similar. In bash this use is not that common because the test operator has its own logical connectives ([[ -f $file && -d $dir ]]), but in lash [[ ... ]] gets the axe, [ and test will stop being builtins (they exist as external commands; bash reimplements them as builtins only for performance, I believe), and their functionality will be provided by individual functions (e.g., exists $file, isfile $file, isdir $dir), which will probably make this use of the logical connectives more frequent.
Executing commands in sequence conditionally. In principle these cases can be rewritten as if {command1} {command2} and if {! command1} {command2}, but certain uses are relatively frequent on the command line, at least for me (gcc -o hello hello.c && ./hello).

Therefore, they stay.
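A quick sketch of the exit-status rule in bash, with made-up variable names:

```shell
#!/usr/bin/env bash
true && echo "runs because true returned 0"
false || echo "runs because false returned non-zero"

# The status of the whole list is that of the last command executed.
false && echo "never printed"
and_status=$?    # false was the last command to run

true || echo "never printed"
or_status=$?     # true was the last command to run

# Each side is a complete command and may itself be a pipeline.
printf 'a\nb\n' | grep -q b && found=yes || found=no

echo "and=$and_status or=$or_status found=$found"
```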

! negates the exit status of the command (turning non-zero into 0 and 0 into 1). It also applies to a "complete command", negating an entire pipeline (the exit status of a pipeline is the exit status of its last command), and that would be the only reason I see to treat it as special syntax rather than just a command called !. I don't know if that justifies it; besides being a quite rare situation, nothing prevents one from simply writing the ! before the last command of the pipeline. Moreover, maybe one should write ! {command1 | command2} anyway, for clarity. While we are at it, we could rename the command to not, to make it clearer that it is an ordinary command and not special syntax, but there I'm not so sure.
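Here is a sketch of both readings in bash: the built-in ! applied to a whole pipeline, and a hypothetical not written as an ordinary function (the name comes from the paragraph above; the implementation is just one way it might look).

```shell
#!/usr/bin/env bash
! true;  not_true=$?     # 0 becomes 1
! false; not_false=$?    # non-zero becomes 0

# ! applies to the whole pipeline, whose status is that of its last command.
! false | true; negated_pipeline=$?

# Hypothetical ordinary command playing the role of "not".
not() { if "$@"; then return 1; else return 0; fi; }

not grep -q root /dev/null; not_grep=$?    # grep finds nothing, so not succeeds

echo "$not_true $not_false $negated_pipeline $not_grep"
```

As an ordinary command, not can only negate a single command, which is exactly the difference discussed above.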

Process substitution

In bash, <(command) creates a pipe (a pair of file descriptors in which everything that goes in one end comes out the other), executes command with its standard output redirected to the write end of the pipe, and the expression is replaced by a file name that corresponds to the read end of the pipe. For example, it is possible to write:

diff <(sort file1) <(sort file2)

which executes sort file1 and sort file2 and calls something like diff /dev/fd/63 /dev/fd/62. Analogously, >(command) executes command with its standard input coming from the read end of the pipe, and the expression is replaced by a file name corresponding to the write end.
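A small sketch to try in bash. The file names are made up, and the exact name printed by echo depends on the system.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
printf 'b\na\n' >"$tmp/file1"
printf 'a\nb\n' >"$tmp/file2"

# The expression expands to a file name; echo just prints it
# (typically something like /dev/fd/63).
echo <(true)

# Both files hold the same lines once sorted, so diff reports no difference.
if diff <(sort "$tmp/file1") <(sort "$tmp/file2") >/dev/null; then
    same=yes
else
    same=no
fi
echo "same=$same"

rm -rf "$tmp"
```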

Although this syntax is quite convenient on the command line (and actually I think the diff example is the only one I have ever used on the command line in my life), I'm not sure I want to keep it in lash. Not only because of the principle of avoiding gratuitous extra syntax, but also because it looks like a redirection, but is a word. If I wanted to redirect a file descriptor to the result of the process substitution (which is useful primarily for making a pipeline with a file descriptor other than stdout, e.g., redirecting stderr to a command), I would have to write something like (the space is required):

ls 2> >(comando)

which is not exactly obvious. Maybe a function would do the job, something like:

diff $[popen -r {sort file1}] $[popen -r {sort file2}]

Ok, this looks terrible³. Maybe if popen gets another name, and the command accepts a command name and arguments directly instead of necessarily a block:

diff $[readfrom {sort file1}] $[readfrom {sort file2}]
diff $[pipefrom {sort file1}] $[pipefrom {sort file2}]
diff $[pipefrom sort file1] $[pipefrom sort file2]

I don't know.

Another problem with the bash syntax is that the command looks like an array, and maybe an array would make sense as a redirect target (it would redirect to all the file names in the array). On the other hand, the array case could be represented by the "spliced" array, whatever syntax is chosen for it (e.g., >$@(file1 file2)), or simply by allowing multiple redirects of the same file descriptor (>file1 >file2; zsh allows this, I think). I don't know.

Humanitas needs to sleep

That's it for today. As always, everything I say "is" a certain way is only the current plan, everything is open to discussion, comments and suggestions are always welcome, live free or die, do what you want 'cause a pirate is free, etc. Since this is, in principle, the last post of the series for a while, feel free to comment here on topics not covered so far in the series.

_____

1 In days of yore I thought about using S-expressions for all the syntax (including redirections and pipelines), while allowing the parentheses around commands that appear alone on a line to be omitted. The result was not exactly satisfactory to me. Besides, it turns out that a shell entirely based on S-expressions has already been made (which, by the way, is probably a good source of inspiration).

2 The patterns and blocks go in an array primarily so that they can span multiple lines without having to put a \ at the end of each line:

case $file (
    "*.mp3" { ... }
    "*.ogg" { ... }
    "*" { ... }
)

3 Reviewing the post, I looked at this and it seemed like the most natural syntax in the world, but at this point my perception is already somewhat altered by sleepiness.


Blueprints for a shell, part 3: Data types

2015-03-13 22:47 -0300. Tags: comp, prog, pldesign, shell, lash, em-portugues

This post is part of a series about lash, a shell I am designing.

A world made of strings

In (ba)sh there is only one data type: the string. In bash, a variable can be declared as an array (and in more recent versions, as a dictionary), but although the variable is an array, the array itself is not a first-class value: it is not possible to pass an array as an argument to a function, or to store an array inside another, for example. This limits quite a bit what can be done in bash without resorting to hacks from hell. (Of course you "can" live without these things. You "can" also program with Turing machines...)
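A short bash sketch of what "not first-class" means in practice (count_args and the variable names are made up for the example):

```shell
count_args() { echo $#; }

files=("a b" c d)
count_args "${files[@]}"    # 3: the function gets three separate strings, not an array
count_args "$files"         # 1: a bare $files is just element 0 ("a b")

# Trying to put one array inside another only flattens it.
nested=(x "${files[@]}")
echo "${#nested[@]}"        # 4
```

In both cases the callee or the outer array only ever sees strings; the array as a unit never travels anywhere.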

lash breaks with tradition, rebels against the system and introduces first-class arrays, dictionaries and blocks (as well as possibly other objects, such as communication channels, but that is still open). Thus, it is possible to do futuristic things like keeping a collection of structured data and writing functions that manipulate arrays and produce other arrays. Fantastic, isn't it? Welcome to 2015.

Regardless of the shell, environment variables and process arguments in Unix are also strings (and strings that cannot contain \0, on top of that), which means we have no way of passing our structured values directly to other processes. An alternative approach would be to do as Tcl does: represent everything as strings, define certain string formats for storing structured data (e.g., keyed lists, or XML if you're feeling crazy), and provide functions to interpret and manipulate such strings. This would allow passing "structured" data to subprocesses, since they would be just strings. But, seriously, store everything as a string and parse/search inside the string to get an element of a list/dictionary? Generate a new string every time an element is changed? Granted, it would be possible to mitigate these problems somewhat by using some magic internal representation for strings, but I don't know. For now I prefer to have normal structured data.⁵ Besides, blocks have to be special data anyway, to carry scope information.

So, data types.

Strings and numbers

A string in lash is a sequence of bytes; internally, the shell is not concerned with the interpretation of these bytes (as characters encoded in UTF-8, for example). In general, the Unix environment as a whole is not concerned with the concept of encoding; nothing requires file names to be valid UTF-8 strings, for example, and the result of globbing should be representable by shell strings without any mystery. Files/streams have no inherent encoding either, and things like echo $str should not have to do anything magic to decide how to send the contents of the string to the file. Interpreting the bytes of a string as UTF-8 (or another encoding) is the responsibility of the functions the shell provides for manipulating strings.

I think that in a shell it does not make much sense to have a distinct numeric type. In a shell, when you write something like my x = 01, you expect the 0 to stay there; when you call xargs -0, you expect the - not to get lost, etc. Besides, the arguments the script receives from the command line are all strings, and it does not seem appealing to have to convert them manually to numbers before doing arithmetic with them. Instead, interpreting a string as a number is up to the arithmetic operators. For efficiency, the result of an arithmetic operation may be stored internally as a number (the idea is to avoid converting the result to a string and back to a number if it is used again in an arithmetic operation), but this is not observable by the script.

Unlike (ba)sh, lash should support floating-point arithmetic. This raises the question of how to distinguish integer division from floating-point division. I am in favor of adopting / for floating-point division and // for integer division, a la Python 3. The other arithmetic operators produce a floating-point result if one of the arguments is a float, and an integer otherwise. The string representation of a floating-point number always includes a point¹ (the idea is that if something is producing float results unduly, it will not go unnoticed during execution (or so one hopes)). Arithmetic operations on strings that are not valid numbers produce a runtime error, i.e., no NaN propagation a la JavaScript or implicit interpretation as 0 a la PHP. Actually not even bash lets this kind of thing pass silently... with some exceptions: an empty string is treated as 0, and spaces around a number are ignored. Here I am torn between "strictness" and convenience; maybe in a script it is a good idea to accept these two cases.
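For reference, the bash behaviours mentioned here (integer-only division and the two lenient cases) look like this. This is plain bash, not lash; int_div and plus_one are names invented for the sketch:

```shell
# int_div A B: bash arithmetic is integer-only, so the quotient is truncated.
int_div() { echo $(( $1 / $2 )); }

# plus_one STRING: interpret STRING as a number and add 1.
plus_one() {
    local n=$1
    echo $(( n + 1 ))
}

int_div 7 2        # 3
plus_one ""        # 1: the empty string counts as 0
plus_one " 41 "    # 42: spaces around the number are ignored
```

Under the proposal above, 7 / 2 in lash would presumably give 3.5 and 7 // 2 would give 3, but that part is still only a plan.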

Strings are not arrays, and (as in bash) are not indexable with the normal array syntax. There will be functions for getting substrings, but I have not yet thought through the names and the syntax, nor how to specify the desired range of bytes/characters (start and length? start and end? inclusive or exclusive? All of the options, depending on the parameters?). One possibility would be:

byte STR M returns a string containing the M-th byte (counting from 0);
byte STR M -toin N returns a string with bytes M through N (inclusive);
byte STR M -toex N returns a string with bytes M up to N (exclusive);
byte STR M -len LEN returns a string with LEN bytes starting at M.
The end position can be end, indicating the end of the string (in that case do -toin and -toex do the same thing? Maybe we use -toend instead?);
char does the same thing for characters instead of bytes. (Assuming the locale's encoding?)

It may be a bit verbose, but substring extraction seems to be relatively rare in bash, based on a grep over my extremely significant sample of half a dozen scripts that were at hand, so I think the clarity and flexibility make up for the verbosity.

The length of the string can be obtained with the functions bytelen and charlen, depending on the kind of length desired. (There is also the situation where you want the printed width of the string (combining characters do not count towards the length, and Chinese-et-al characters take up two positions), as well as substrings based on the printed position of the characters, but that is left for the distant future, possibly in a library.)
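To give an idea of the byte-oriented versions, here is a rough approximation in today's bash. byte_len and byte_slice are hypothetical stand-ins for the proposed bytelen and byte STR M -len LEN; they rely on the common idiom of switching to the C locale so that bash counts bytes rather than characters:

```shell
# byte_len STR: length of STR in bytes.
byte_len() {
    local LC_ALL=C              # in the C locale, one character = one byte
    echo "${#1}"
}

# byte_slice STR M LEN: LEN bytes of STR starting at byte M (counting from 0).
byte_slice() {
    local LC_ALL=C
    printf '%s\n' "${1:$2:$3}"
}

s=$'caf\xc3\xa9'                # "café" in UTF-8: 4 characters, 5 bytes
byte_len "$s"                   # 5
byte_slice "$s" 0 3             # caf
```

A character-based charlen would use the same expansions under a UTF-8 locale, which is exactly the "assuming the locale's encoding?" question raised above.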

Functions that work with delimiters (e.g., split STRING DELIM) have to accept delimiters of arbitrary length, for the simple fact that they have to work with UTF-8 delimiters while remaining agnostic about the encoding. (On the other hand, this assumes that the encoding has the same property as UTF-8, that it is possible to unambiguously identify the start of a character from an arbitrary point in the stream, which is basically only true for UTF-8 and for encodings in which 1 byte = 1 character. Meh.)

Arrays

Arrays are sequences of arbitrary values. The literal syntax for arrays is (valor1 ... valorN). (The parentheses are inherited from bash's array variable initialization syntax. Besides, brackets and braces already have other uses. This in principle conflicts with the (ba)sh syntax for running a command in a subprocess⁴ (( comandos )), but I did not intend to have that syntax in lash to begin with. A function could provide that functionality (e.g., subproc { comandos }).)

Arrays are indexed with the syntax $var[expr]. As in bash, expr is evaluated as an arithmetic expression, with no need to write $var[$((expr))]. Unlike bash, braces are not required, i.e., it is not necessary to write ${var[expr]}. On the one hand this is cleaner, but on the other it can conflict with the use of [] as a wildcard, e.g., my prefix = /dev/tty; echo $prefix[1-8]. I think this is not much of a problem, since it produces a runtime error ($prefix is not an array), and is therefore easy to detect and fix (to ${prefix}[1-8]; the error message could even include this information).

As in bash, the array has to be in a variable to be indexed ($[função][expr] would not be interpreted as indexing the result of função, in principle (or would it?)), but nothing prevents having a function index ARRAY N, with which one could write $[index $[função] N].

The assignment syntax works with arrays too (var[i] = 42). This implies that assignment has to have special syntactic treatment, so that things like var[i*i] = 42 do not cause globbing.

What about multidimensional arrays (i.e., arrays that contain other arrays)? Is $var[i][j] valid syntax? If so, there is no reason not to accept $[função][expr] as well, I think.

Is it possible to assign to a position that does not exist yet (a la Perl), or is that an error (a la Python)? If the "label" of the index matters (and not just the order), shouldn't you be using a dictionary anyway? I can think of two situations in which one might want to explicitly specify a non-existent index:

Adding an element at the end of the array. But for this case there could be a function push (or append, because then we can also have a prepend to add at the beginning; or there could be a more general function insert, to insert an element between any two, or at the beginning/end), or a PHP-like syntax (var[] = 42).
Initializing a vector/matrix with some mathematical formula, e.g.:
```
my array = ()
range 0 -toin 10 {|i|
    array[i] = $(( i * i ))
}
```
It seems like a reasonable use case, but in any case it fails with multidimensional arrays ($array[i][j] = 42 is an error because $array[i] is not an array, unless it is initialized first). This case can be covered with a function make_matrix that takes the sizes of the dimensions and returns an initialized vector.

Or we can allow out-of-bounds assignment (and fill any element between the last filled position and the assigned position with the empty string) and that's that. I don't know (the initial plan is not to allow it).

Another basic array manipulation function is each, which takes an array and a block and calls the block with each element of the array. There can also be a map, which produces a new array with each result returned by the block, and a destructive version of map (called map!, maybe²).

The function len returns the number of elements in the array. I don't know whether special syntax is needed for this (e.g., $#var).

$@var "splices" the array, producing one argument ("word" in (ba)sh terminology) for each element of the array, i.e.:

my array = (1 2 3)
foo $array         # calls foo with one argument (the array)
foo $@array        # calls foo with three arguments (1, 2 and 3)

Dictionaries

A dictionary is a mapping from strings to values. (Why only strings? Maybe it makes sense to allow arbitrary values as keys.) The literal syntax for dictionaries is %(chave1=valor1 chave2=valor2 ...) (the % is meant to suggest a vague relation to hash tables in Perl), with optional spaces around the =, which looks a bit odd without delimiters between the key = value pairs, but line breaks can be used if desired:

my person = %(
    name = Hildur
    age = 18
    country = Iceland
)

[Note to self: In things like %(foo=(1 2 3)), just as in my foo=(1 2 3), foo=(1 2 3) is not a normal shell "word", because it is part string, part array, i.e., both dictionary literals and variable declarations require special treatment by the parser (unless there is an "association" data type to which things of the form A=B can be mapped).]

Elements of a dictionary are accessed with the syntax $var{chave}. Brackets are not used as with arrays because the expression between brackets undergoes arithmetic evaluation, which is not what we want in a dictionary. (Was automatic arithmetic evaluation a good idea after all?) This is another syntax element (besides blocks) that conflicts with bash's brace expansion syntax (foo{1,2,3}). I don't know whether this is a point in favor of changing the dictionary access syntax or the brace expansion syntax. Another possibility would be to use brackets, as with arrays (and then they lose the arithmetic evaluation property, which could make array access somewhat inconvenient), or maybe $var<chave>, but that conflicts with redirection syntax. (Bear in mind that this could be a redirection if $var contained a file descriptor. In that case the later > would be a syntax error, so only the interpretation as dictionary access would be valid, but I only find that out when I reach the >; moreover, the key could not contain an unescaped space. Not to mention that it is a totally unusual syntax for dictionary access (braces at least have a precedent in Perl).)

If my dict = %(a=1 b=2 c=3), what is the result of $@dict?

It could be a runtime error (i.e., $@var would only work with arrays);
It could produce a 1 b 2 c 3;
If we have the "association" type mentioned earlier, it could produce a=1 b=2 c=3 (i.e., three values of type "association"). The advantage of this would be being able to write:
```
my newdict = %($@dict d=4)
```
Cool, huh? But the association type is peculiar, to say the least. (And what if the same key appears twice? The last one wins, in principle, but it could also be an error, which would make this construct less useful.)

There would be a handful of functions for iterating over dictionaries: each-key; each-value; each-entry, which would take a dictionary and a two-argument block and call it with the key and value of each entry in the dictionary; or, if there is an association type, it would call the block with each association. Alternatively, if there is an object pipeline, there could be a function keys that produces all the keys, and then we would write keys $dict |> each {|key| ... } (or whatever the object pipe syntax turns out to be), and likewise for the values (and associations, if they exist).

Is it a good idea to have a dictionary type distinct from arrays, or is it better to unify the two a la PHP, JavaScript, etc.? I think I prefer having two separate types, but this needs more thought.

Interactions between structured values and strings

In (ba)sh, unlike programming languages in general, a variable can appear as part of a larger "word", e.g., foo$bar; the contents of the string are concatenated into the word and that's that. But what if $bar is not a string? One could produce a serialized version of the value (which is probably more useful), or raise an error.

Things like foo$@bar (where my bar = (1 2 3)) could expand to foo1 foo2 foo3, like bash's brace expansion. The problem is that $@ assumes the array is in a variable. Literal arrays could be expanded too³, e.g., foo(1 2 3) would produce foo1 foo2 foo3, and then it would be possible to eliminate the use of braces for brace expansion. The problem is that by far my most frequent use of brace expansion on the command line is with the empty string, e.g., mv file{,~} instead of mv file file~, and in the new syntax that would be mv file("" ~) (actually the ~ would have to be escaped so as not to undergo tilde expansion...). Maybe we can survive that.
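For comparison, this is how the two things look in bash today. The second line is only the nearest existing equivalent of the proposed foo$@bar, using a pattern substitution that prepends a prefix to every element; it is not how lash would do it:

```shell
echo file{,~}              # file file~  (brace expansion with an empty first alternative)

bar=(1 2 3)
echo "${bar[@]/#/foo}"     # foo1 foo2 foo3  (prefix each array element with foo)
```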

^D

That's it for today. As always, everything presented here is just the current plans and ideas, everything can be changed, and comments and suggestions are very welcome (but I will probably only see/reply to comments after the weekend).

_____

1 Or maybe something like e+42 (perhaps only as a valid input format, even if the shell's operations always produce results in decimal notation).

2 (update) Or add an -overwrite option to the map function (which seems more shell-like); or even add -collect and -overwrite options to each and not have a separate map at all.

3 (update) Note to self: This is also a string mixed with an array, so my x=(1 2 3) is no longer a special case for the parser (or at least for the "reader", because it would still have a different interpretation from the foo(1 2 3) case).

4 (update) Actually it does not conflict, because an array does not make sense as the first thing on the command line (or does it?).

5 (update) To paraphrase a great sage, "If you want Tcl, you know where to find it." (That said, I see merit in the "everything is a string" approach.)


Blueprints for a shell, part 2: Variables, definitions and scope

2015-03-13 00:11 -0300. Tags: comp, prog, pldesign, shell, lash, em-portugues

This post is part of a series about lash, a shell I am designing.

A bit of context

In (ba)sh all variables are global (including the "local" ones, which are globals with dynamic scope). Independently of shell variables, every Unix process has a set of environment variables. Shells tend to unify shell variables and environment variables in some way. The way this is done in (ba)sh is to treat all variables uniformly as "shell" variables and mark certain variables as "exported": these variables are passed as environment variables to the processes called by the shell. In addition, bash has a local command, which makes the values assigned from that point on to the variables passed to local last only until the function where local was called returns, i.e., local lets you "shadow" a variable during the execution of a function. Functions called by the function that declared the "local" variable also see the new value, and nothing prevents "localizing" an environment variable (which remains an environment variable).
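A minimal bash sketch of that dynamic scoping (the function and variable names are made up for the example):

```shell
x=global

show_x() { echo "$x"; }

shadow_and_call() {
    local x=shadowed    # shadows x until this function returns...
    show_x              # ...and the callee sees the shadowed value
}

shadow_and_call         # shadowed
show_x                  # global
```

show_x never mentions a local x, yet what it prints depends on who called it. That is the behaviour lexical scoping is meant to get rid of.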

In this situation, determining which variable the code refers to when it says $x is quite simple: there is only one variable x in the entire program. Avoiding name conflicts is basically the programmer's problem.

If this is already a problem in bash, in a shell with lambdas it would be a disaster, because a block of code can be called inside a function different from the one that defined it, and whoever writes the block does not necessarily have a way of knowing (nor should have to know) the names of the variables used at that other point in the program. Thus, lash adopts lexical scope, like any sane language, which means there can be multiple variables with the same name in a program. This also implies that we will have to reconcile lexical scope with environment variables.

So, variables in lash

The my command introduces lexical variables, whose scope is the block where the my appears. The basic syntax is:

my nome = valor

I am a bit unsure about the use of spaces around the =. In bash, variable assignment does not allow spaces. Without spaces, it would be possible to define multiple variables in the same command:

my x=1 y=2 z=3

With spaces, for things to remain readable, I think it would be necessary to introduce a delimiter between the assignments, but that is not so simple in a shell, because in:

 my x=1, y=2, z=3

the comma could be part of the string being assigned. An alternative is to allow declaring a single variable with spaces, or multiple variables without spaces. The syntax is not ambiguous, in any case.

Question: does a definition with my x=1 affect references to x in the same block that appear before the my? For example, in:

my x = 1
while {true} {
    echo $x
    my x = 2
    echo $x
}

which x is seen by the first echo when the while runs for the second time? Or, in a more convoluted way:

my x = 1
my block = {
    my f = { echo $x; }
    my x = 2
    $f
}

which x gets printed? If the desired one is 1, then the implementation of variable lookup has to take care not to simply pick the first x going up the environment hierarchy (in principle the inner block would look for the variable x first in the environment of the block itself, then in the block that encloses it, then outside the blocks). On the other hand, this semantics in which a reference to a variable never changes, regardless of later declarations, would allow everything to be resolved statically, which may give lookup better performance. Another question is: does this kind of thing happen in practice? I am seriously tempted to say that in these cases it is undefined which of the two variables is accessed. Someone will probably want to eat my liver for introducing undefined behavior into a shell, but I am not proposing anything like undefined behavior in C, where the program can do anything, including stealing your money and fleeing the country; certainly one of the two variables is accessed, with no unexpected effect. The idea is just to leave open the possibility of different implementations of variable lookup. If you think this is a bad idea, please speak up.

Assignment

I am unsure whether assignment will use a keyword like set, or whether the equals sign alone will suffice. It seems conceivable that someone might invent a command that takes = as an argument, so:

foo = 42

could be a call to foo. This problem could be avoided by requiring set foo = 42, or by forbidding the spaces around the = (which is what (ba)sh does), but the space seems quite desirable to me when the assigned value is a larger expression with function calls and whatnot, or when the left-hand side is an array[index]. On the other hand, I cannot recall any command that takes = as its first argument, so maybe treating an unescaped/unquoted = in the second position as something special and dropping the set is not a problem. Is it?

There is also the possibility of introducing other assignment operators, such as +=, and in that case, the question of whether there will be separate operators for strings, numbers and arrays or whether one is enough. (In bash, += appends to strings and arrays; looking at the right-hand side of the assignment you can tell which case it is. To increment numeric variables, you need to be in "arithmetic expression mode", i.e., inside ((...)), $((...)), an array index, etc.)
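The bash behaviour in that parenthesis, spelled out as a quick sketch (the variable names are arbitrary):

```shell
s=foo;   s+=bar;       echo "$s"          # foobar
a=(1 2); a+=(3);       echo "${#a[@]}"    # 3
n=1;     n+=1;         echo "$n"          # 11: string append, not addition
n=1;     (( n += 1 )); echo "$n"          # 2: arithmetic only inside (( ))
```

The third line is the trap a single overloaded += would have to avoid in lash, where numbers are just strings.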

What happens when a value is assigned to an undeclared variable? I think this would be at least a warning, maybe an error. Accessing an undefined variable too, but it would be good to have something equivalent to ${var:-default}, i.e., "use the value of $var, or the string default if var is not defined (or is empty, if the : is present)". I had thought of having a function or valor1 valor2, which returns valor1 if it is a value other than the empty string (or a special null value? will we have one?), or valor2 otherwise. The problem is that $[or $var default] will emit a warning if $var is not defined. Maybe there could be a special syntax $?var that returns the value of the variable or empty if it does not exist, without emitting a warning, and then the equivalent of ${var:-default} would be $[or $?var default]. A bit verbose, but it does not seem bad (I think).
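For reference, the bash expansion being imitated behaves as follows. or_default is a hypothetical bash stand-in for the proposed or function, written only for this sketch:

```shell
# or_default VALUE FALLBACK: print VALUE unless it is empty, else FALLBACK.
or_default() { echo "${1:-$2}"; }

unset var
echo "${var:-default}"      # default  (var is unset)
var=""
echo "${var:-default}"      # default  (var is empty, and the : is present)
echo "[${var-default}]"     # []       (var is set, so without the : its empty value is used)
var=value
echo "${var:-default}"      # value

or_default "" fallback      # fallback
or_default given fallback   # given
```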

Global variables

We will have a module system (whose details I have not yet thought through and which will be the subject of a future post), and conceivably a module may want to make some variables visible to other modules. Possibilities:

Make visible all variables declared at the top level of the module's file, i.e., all declarations of the form my x = 1 that do not appear inside any block are accessible by other modules.
Introduce another keyword (e.g., public) to declare variables that are visible outside the module. Variables declared with my are only visible inside the module (or via reflection).

Separating public variables from the rest seems like a good idea, but I am not sure it is not "just one more thing".

Functions

Functions and variables live in separate namespaces in (ba)sh, and in principle this should be kept in lash. In (ba)sh, all function definitions have global scope (actually everything has global scope in (ba)sh). As already mentioned earlier, although it may seem "obvious" to change this in lash and make function definitions lexical, like variables, code like:

if {some-condition} {
    def foo {
        ...
    }
}

in which the resulting definition of foo is expected to be global, is common in configuration files and the like. Possibilities:

(1) def defines global functions, i.e., in the scope of the module where the definition was made. (In the lexical scope, or in the dynamic scope? If a block that contains a def is passed as an argument and called in a function defined in another module, in which module does the def take effect? Well, judging by the if, in the module where the def appears, i.e., in the lexical scope.) There are no local function definitions and that's that.
(2) def defines global functions, but it is possible to write something like my def foo { ... } to define a local function. It might be a good idea, I'm just not sure it is worth the effort. It would also have some effect on function/command lookup that needs to be considered more carefully.
(3) def defines functions in the local lexical scope. This messes up the case of the def inside an if, but that could be worked around by allowing something like public def foo { ... } inside the if. (But who said I wanted to export from the module? A different keyword could also be used (e.g., global), which makes the function global but does not export it from the module.)

At the moment I am leaning towards alternative (1), but I accept counter-arguments.

Are functions defined in a module visible from other modules by default, or is it necessary to say public def foo { ... } to export a function? (Bearing in mind that we have not even decided yet whether the language will have a public keyword or not...)

Environment variables

The scope of an environment variable is in principle the whole process. (It is conceivable that each module could have its own idea of the environment, but I think never before in the history of this country has a language treated environment variables that way.) In a shell, one expects to access environment variables with the same syntax as ordinary variables (I think inventing a new syntax for saying $HOME will not be a popular proposal). Another peculiarity of environment variables is that their values can only be strings. It would be possible to serialize other values so that they can be passed as environment variables to subprocesses, but only lash would recognize these variables as special values, and it would be necessary to indicate in some reliable way that the variable contains a special value, and not a string that looks a lot like a special value. After last year's Shellshock affair, I am a bit wary of allowing things other than strings in environment variables.

In bash a not very pleasant consequence of the shell mixing environment variables with ordinary ones is that a script can start happily using a variable without knowing that there was an environment variable with the same name. This is aggravated by the fact that in bash a nonexistent variable can be used without warning or error (unless set -u is active), so a script may be written assuming that a given variable is empty and inadvertently inherit a variable with content from the environment. Even if that is not the case and the script initializes its variables before use, it may still be inadvertently changing an environment variable, which will be inherited by subprocesses.
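A small sketch of both failure modes, assuming bash is on the PATH. The one-line "script" and the variable names count and LEAK are invented for the example:

```shell
unset count
# A script that silently assumes $count starts out empty.
script='total=${count}1; echo "$total"'

bash -c "$script"                 # 1
count=4 bash -c "$script"         # 41: a stray environment variable leaked in

# With -u (set -u), the unset variable is an error instead of an empty string.
bash -uc "$script" 2>/dev/null || echo "unbound variable"

# Assigning to an inherited variable changes what subprocesses see.
export LEAK=original
LEAK=changed
bash -c 'echo "$LEAK"'            # changed
```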

In lash the situation is in principle less problematic because every variable has to be declared before use, and a my shadows an environment variable of the same name. In general, if I forget to declare the variable, the shell will emit an error, so a script that runs without errors for me is at least immune to unexpected environment variables present on other people's systems, but I can still end up forgetting the my without getting an error if I happen to use a variable name that is an environment variable present on my system. Solutions:

(1) Require every environment variable used to be explicitly imported before use. I think this would not be a very popular option. Maybe it would not be so bad if some of the more traditional variables were imported by default (e.g., HOME, USER), but that seems super arbitrary to me.
(2) Allow access to environment variables like any other variable, but allow assignment only with a special command (e.g., setenv HOME = /). I think this catches as errors the vast majority of undue captures of environment variables. There remains the case where the programmer misspells the name of the environment variable (a new variable would be created, instead of an error being raised). Avoiding this problem would, I think, bring more inconvenience than benefit.
(3) Do nothing. Really this is hardly an option, since setenv has to exist anyway to create new environment variables, and once it exists there is no reason not to apply solution (2).

So (2) it is, apparently.

Dynamic scope

And what about when I do want dynamic scope, after all? One can argue that nobody in their right mind wants dynamic scope, but, for example, if we are going to implement the object pipeline, we need a way of redirecting the output channel of one command to the input channel of another, and one way of doing that is to have the input and output channels as dynamic variables and shadow them to do the redirection; that is how one normally redirects *standard-output* and company in Common Lisp, and (current-output-port) et al. in the Schemes that support "fluid variables" (which are dynamic variables under another name).

If we are going to have dynamic variables, to avoid utter chaos it seems like a good idea to require that they be declared as such beforehand (i.e., it is not possible to "localize", a la bash, a variable previously declared with my). There is also the problem of how to implement dynamic scope. When there is only one thread, shadowing a variable can be implemented by simply saving the old value, assigning the new value, and later restoring the old value. When there are multiple threads, however, we want a shadow inside one thread not to affect the others. And guess what? Our object pipeline requires every stage of the pipeline to run simultaneously (or at least cooperatively), within the same process, and what each one sees as its input and output channel is different, so this "naive" implementation of shadowing won't do.
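One way around the naive save/assign/restore approach is to keep a separate stack of bindings per thread. The Python sketch below illustrates that idea under my own assumptions: DynamicVar, shadow, and the channel names are hypothetical, and nothing here is meant as how lash will actually do it.

```python
import threading
from contextlib import contextmanager


class DynamicVar:
    """A dynamically scoped variable whose shadowing is private to each thread."""

    def __init__(self, default):
        self._default = default
        self._tls = threading.local()  # each thread gets its own binding stack

    def get(self):
        stack = getattr(self._tls, "stack", None)
        return stack[-1] if stack else self._default

    @contextmanager
    def shadow(self, value):
        # Push a binding visible only to the current thread, and pop it on
        # exit even if the body raises.
        stack = getattr(self._tls, "stack", None)
        if stack is None:
            stack = self._tls.stack = []
        stack.append(value)
        try:
            yield
        finally:
            stack.pop()


output_channel = DynamicVar("stdout")
seen = {}


def stage(name, channel, barrier):
    with output_channel.shadow(channel):
        barrier.wait()  # both stages are inside their shadow at the same time
        seen[name] = output_channel.get()


barrier = threading.Barrier(2)
threads = [
    threading.Thread(target=stage, args=("producer", "pipe-1", barrier)),
    threading.Thread(target=stage, args=("consumer", "pipe-2", barrier)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(seen["producer"], seen["consumer"], output_channel.get())
```

Each stage sees only its own channel even though both shadows are active at once, and the value outside the threads is untouched. In this sketch a new thread starts from the default value rather than inheriting its parent's bindings; whether pipeline stages should inherit is a separate design question.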

I have some fear that, unless dynamic variables are statically identifiable, their presence will mess up / hurt the performance of lookup for all variables. When the definition of the dynamic variable is lexically visible, it is easy to tell them apart, but when they come from another module, this can get complicated. One solution is to simply use a different syntax to access dynamic variables, e.g., earmuffs: $*output_channel*. This syntax has the advantage of being immediately familiar to the large contingent of Common Lisp programmers (right?), and the disadvantage of potential confusion with the * that does globbing (e.g.:

dynamic *prefix* = foo
touch foo1 foo2 foo3
echo $*prefix**

), but some other syntax that distinguishes dynamic variables from ordinary ones could be chosen.
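To spell out why that last line is confusing, here is a small Python sketch of one possible reading, in which the first two asterisks are the earmuffs and the third is a glob. The expand function and its regular expression are assumptions made up for this illustration; lash does not define any of this.

```python
import fnmatch
import re

# Hypothetical syntax: $*name* reads the dynamic variable called `name`.
EARMUFF_REF = re.compile(r"\$\*([A-Za-z_][A-Za-z0-9_]*)\*")


def expand(word, dynamic_vars, filenames):
    """Substitute $*name* references, then glob the result against filenames."""
    pattern = EARMUFF_REF.sub(lambda m: dynamic_vars[m.group(1)], word)
    matches = sorted(fnmatch.filter(filenames, pattern))
    # Like POSIX shells, leave the word alone when the glob matches nothing.
    return matches or [pattern]


files = ["foo1", "foo2", "foo3", "bar"]
print(expand("$*prefix**", {"prefix": "foo"}, files))
```

Under this reading the word becomes foo* and then matches foo1, foo2, and foo3, but a reader has to count asterisks to see that.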

I think that's enough for today

To reiterate, whenever I say that something in lash "is" such and such, I only mean that this is the current plan, but I'm open to suggestions. Feedback is always welcome.
