Elmord's Magic Valley

Posts com a tag: `ramble`

Some thoughts on Gemini and the modern Web

2022-02-20 19:22 +0000. Tags: comp, web, ramble, in-english

The web! The web is too much.

Gemini is a lightweight protocol for hypertext navigation. According to its homepage:

Gemini is a new internet protocol which:

Is heavier than gopher

Is lighter than the web

Will not replace either

Strives for maximum power to weight ratio

Takes user privacy very seriously

If you’re not familiar with Gopher, a very rough approximation would be navigating the Web using a text-mode browser such as lynx or w3m. A closer approximation would be GNU Info pages (either via the info utility or from within Emacs), or the Vim documentation: plain-text files with very light formatting, interspersed with lists of links to other files. Gemini is essentially that, except the files are remote.

Gemini is not much more than that. According to the project FAQ, “Gemini is a ‘less is more’ reaction against web browsers and servers becoming too complicated and too powerful.” It uses a dead-simple markup language called Gemtext that is much simpler than Markdown. It has no styling or fancy formatting, no client-side scripting, no cookies or anything that could be used for tracking clients. It is designed to be deliberately hard to extend in the future, to avoid such features from ever being introduced. For an enthusiastic overview of what it is about, you can check this summary by Drew DeVault.

I’m not that enthusiastic about it, but I can definitely see the appeal. I sometimes use w3m-mode in Emacs, and it can be a soothing experience to navigate web pages without all the clutter and distraction that usually comes with it. This is big enough of an issue that the major browsers implement a “reader mode” which attempt to eliminate all the clutter that accompanies your typical webpage. But reader mode does not work for all webpages, and won’t protect you from the ubiquitous surveillance that is the modern web.

I do most of my browsing on Firefox with uBlock Origin and NoScript on. Whenever I end up using a browser without those addons (e.g., because it’s a fresh installation, or because I’m using someone else’s computer), I’m horrified by the experience. uBlock is pretty much mandatory to be able to browse the web comfortably. Every once in a while I disable NoScript because I get tired of websites breaking and having to enable each JavaScript domain manually; I usually regret it within five minutes. I have a GreaseMonkey script to remove fixed navbars that websites insist on adding. Overall, the web as it exists today is quite user-inimical; addons like uBlock and NoScript are tools to make it behave a little more in your favor rather than against you. Even text-mode browsers like w3m send cookies and the Referer header by default, although you can configure it not to. Rather than finding and blocking each attack vector, a different approach would be to design a platform where such behaviors are not even possible. Gemini is such a platform.

The modern web is also a nightmare from an implementor side. The sheer size of modern web standards (which keep growing every year) mean that it’s pretty much out of question for a single person or a small team to write and maintain a new browser engine supporting modern standards from scratch. It would take years to do so, and by that time the standards would have already grown. This has practical consequences for users: it means the existing players face less competition. Consider that even Microsoft has given up on maintaining its own browser engine, basing modern versions of Edge on Chromium instead. Currently, three browser engines cover almost all of the browser market share: Blink (used by Chrome, Edge and others); WebKit (used by Safari, and keeping a reasonable portion of the market by virtue of being the only browser engine allowed by Apple in the iOS app store); and Gecko (used by Firefox, and the only one here that can claim to be a community-oriented project). All of these are open-source, but in practice forking any of them and keeping the fork up-to-date is a huge task, especially if you are forking because you don’t like the direction the mainstream project is going, so divergencies will accumulate. This has consequences, in that the biggest players can push the web in whatever direction they see fit. The case of Chrome is particularly problematic because it is maintained by Google, a company whose main source of revenue comes from targeted ads, and therefore has a vested interest in making surveillance possible; for instance, whereas Firefox and Safari have moved to blocking third-party cookies by default, Chrome doesn’t, and Google is researching alternatives to third-party cookies that still allow getting information about users (first with FLoC, now with Topics), whereas what users want is not to be tracked at all. The more the browser market share is concentrated in the hands of a few players, the more leeway those players have in pushing whatever their interests are at the expense of users. By contrast, Gemini is so simple one could write a simplistic but feature-complete browser for it in a couple of days.

Gemini is also appealing from the perspective of someone authoring a personal website. The format is so simple that there is not much ceremony in creating a new page; you can just open up a text editor and start writing text. There is no real need for a content management system or a static page generator if you don’t want to use one. Of course you can also keep static HTML pages manually, but there is still some ceremony in writing some HTML boilerplate, prefixing your paragraphs with <p> and so on. If you want your HTML pages to be readable in mobile clients, you also need to add at least the viewport meta tag so your page does not render in microscopic size. There is also an implicit expectation that a webpage should look ‘fancy’ and that you should add at least some styling to pages. In Gemini, there isn’t much style that can be controlled by authors, so you can focus on writing the content instead. This may sound limiting, but consider that most people nowadays write up their stuff in social media platforms that also don’t give users the possibility of fancy formatting, but rather handle the styling and presentation for them.

* * *

As much as I like this idea of a bare-bones, back-to-the-basics web, there is one piece of the (not so) modern Web that I miss in Gemini: the form. Now that may seem unexpected considering I have just extolled Gemini’s simplicity, and forms deviate quite a bit from the “bunch of simple hyperlinked text pages” paradigm. But forms do something quite interesting: they enable the Web to be read-write. Project such as collaborative wikis, forums, or blogs with comment sections require some way of allowing users to send data (other than just URLs) to the server, and forms are a quite convenient and flexible way to do that. Gemini does have a limited form of interactivity: the server may respond to a request with an “INPUT” response code which tells the user browser to prompt for a line of input, and then repeat the request with the user input appended as the query string in the URL (sort of like a HTTP GET request). This is meant to allow implementing pages such as search engines which prompt for a query to search, but you can only ask for a single line of input at a time this way, which feels like a quite arbitrary limitation. Forms allow an arbitrary number of fields to be inputted, and even arbitrary text via the <textarea> element, making them much more general-purpose.

Of course, this goes against the goal stated in the Gemini FAQ that “A basic but usable (not ultra-spartan) client should fit comfortably within 50 or so lines of code in a modern high-level language. Certainly not more than 100.” It also may open a can of worms, in that once you want to have forums, wikis or other pages that require some form of login, you will probably want some way to keep session state across pages, and then we need some form of cookies. Gemini actually already has the idea of client-side certificates which can be used for maintaining a session, so maybe that’s not really a problem.

As a side note, cookies (as in pieces of session state maintained by the client) don’t have to be bad, the web just happens to have a pretty problematic implementation of that idea, amplified by the fact that webpages can embed resources from third-party domains (such as images, iframes, scripts, etc.) that get loaded automatically when the page is loaded, and the requests to obtain those resources can carry cookies (third-party cookies). Blocking third-party cookies goes a long way to avoid this. Blocking all third-party resources from being loaded by default would be even better. Webpages would have to be designed differently to make this work, but honestly, it would probably be for the best. Alas, this ship has already sailed for the Web.

* * *

There are other interesting things I would like to comment on regarding Gemini and the Web, but this blog post has been lying around unfinished for weeks already, and I’m too tired to finish it at the moment, so that’s all for now, folks.

4 comentários / comments

#255

2020-03-19 22:53 +0000. Tags: about, life, mind, ramble, in-english

It's been a while since I last posted here. Many things have happened in the meantime, including a global pandemic. But also a move to Lisbon, a new job, and meeting lots of great people. Not a bad beginning of year, on balance.

Life has not been quite the piece of cake either. Every person has their problems, ghosts and flaws that we have to learn to overcome and to live with – those things are not mutually exclusive, really. Turns out you can practice both self-compassion and self-improvement.

This blog has been around for 8 years now. I'm glad it has lasted so long, and I hope it lasts yet longer. My presence in various social networks has waxed and waned over the years, but this blog (and the website as a whole) has been my permanent corner on the Internet. A consistent bit of life (and a good one at that) across changes in residence (six, if I count correctly), jobs, moving from academia to industry, and now moving from one country to another. Readership has also been relatively consistent – small but consistent – and I'm glad to have you fellow readers to share stuff with.

I hope to start posting more frequently here again. I'll be happy if I succeed in posting at least once or twice per month, but I won't promise any numbers.

As the first post of 2020, I'll finish with a late new year, but timely equinox resolution:

To not live ruled by fear.

Wish everyone a happy equinox, and may the sun shine for us in complicated times.

1 comentário / comment

Partial goals, and other rambles

2019-05-25 22:33 -0300. Tags: comp, prog, pldesign, fenius, ramble, in-english

Designing a programming language is a huge undertaking. The last few days I have been thinking about types and classes and interfaces/traits and I got the feeling I was getting stuck in an analysis paralysis stage again. There is also lots of other questions I have to thinking about, such as: How will I handle mutability, and concurrency, and how those things interact? Can I add methods to classes at runtime? Can I implement interfaces to arbitrary existing types? How do I get a trait implementation 'in scope' / available? How static and dynamic typing will interact? How does this get implemented under the hood? And so on, and so forth…

At the same time, it may seem that not solving those problems undermine the whole point of designing a new language in the first place. If you don't solve the hard problems, why not keep using an existing language?

And that's the recipe for paralysis.

But the thing is, even if you solve part of the problems, you can already have something valuable. If Fenius 0.3 gets to be just a 'better' Scheme with nicer syntax and fewer namespacing problems*, that would be already a language I would like to use.

So what would a Minimum Viable Fenius need?

Records, with single inheritance. We already have those, actually, it's just that there is no syntax for inheritance exposed yet.
Throwing and catching exceptions of a given (record) type.
Basic string capabilities. There is the whole Unicode thing, and the problem of text vs. binary data, but we can use the host Scheme strings as a start.
Basic pattern matching, and particularly pattern matching over ASTs. This is necessary to make macros convenient to write. No need for the whole syntax-rules and hygiene machinery right now.
Basic file manipulation (opening, reading, writing, closing, testing for existence).
An R5RS-comparable standard library. That means basic arithmetic, iteration (map, foreach, filter), that kind of thing.
Fixing up parser corner cases (e.g., add the ability to refer to operators as identifiers, allow expressions to continue in the next line by either adding \ at the end or ending a line with an infix operator).
More robust error handling (and more meaningful error messages).
Fix the module import path situation. (Just allowing to specify the import path (say, via an environment variable) would already be a usable solution.)

These would go a long way already. These are all pretty doable and don't involve much hard thinking. The greater problems can be tackled later, after the language is usable for basic tasks.

I would also love to have basic sockets to try to write web stuff in Fenius, but these will have to wait, especially given that Chez does not come with sockets natively. Thunderchez has a socket library; maybe I can use that. But this will come after the items above are solved. I can write CGI apps in it, anyway, or make a wrapper in another language to receive the connections and pass the data via stdin/stdout.

Also, I sometimes wonder if it wouldn't be more profitable to implement Fenius on the top of Guile instead of Chez, since it has more libraries, and now also has a basic JIT compiler. But at the same time, the ultimate goal is to rewrite the implementation in itself and leave the Scheme dependency behind, so maybe it's better not to depend too much on the host ecosystem. Although the Chez compiler is so good, maybe it would make more sense to compile to Scheme and run on the top of Chez rather than compile to native code. If only it did not have the annoying startup time… we'll see in the future!

_____

* Of course, 'better' is subjective; what I want is for Fenius 0.3 to be a better Scheme for me.

Addendum: And by 'better syntax' I don't even mean the parentheses; my main annoyance when programming in Scheme is the syntax for accessors: the constant repeating of type names, like (person-name p) rather than p.name, (vector-ref v i) rather than v[i], (location-line (AST-location ast)) rather than ast.location.line, and so on. This goes hand in hand with the namespacing issue: because classes define a namespace for their methods, you don't have to care about name conflicts so often.

Comentários / Comments

On Scheme's minimalism

2017-09-14 19:34 -0300. Tags: comp, prog, lisp, scheme, pldesign, ramble, in-english

[This post started as a toot, but grew slightly larger than 500 characters.]

I just realized something about Scheme.

There are dozens, maybe hundreds, of Scheme implementations out there. It's probably one of the languages with the largest number of implementations. People write Schemes for fun, and/or to learn more about language implementations, or whatever. The thing is, if Scheme did not exist, those people would probably still be writing small Lisps, they would just not be Scheme. The fact that Scheme is so minimal means that the jump from implementing an ad-hoc small Lisp to implementing Scheme is not that much (continuations notwithstanding). So even though Scheme is so minimal that almost everything beyond the basics is different in each implementation, if there were not Scheme, those Lisps would probably still exist and not have even that core in common. From this perspective, Scheme's minimalism is its strength, and possibly one of the reasons it's still relevant today and not some forgotten Lisp dialect from the 1970s. It's also maybe one of the reasons R6RS, which departed from the minimalist philosophy, was so contentious.

Plus, that core is pretty powerful and well-designed. It spares each Lisp implementor from part of the work of designing a new language, by providing a solid basis (lexical scoping, proper closures, hygienic macros, etc.) from which to grow a Lisp. I'm not one hundred percent sold on the idea of first class continuations and multiple values as part of this core*, and I'm not arguing that every new Lisp created should be based on Scheme, but even if you are going to depart from that core, the core itself is a good starting point to depart from.

[* Though much of the async/coroutine stuff that is appearing in modern languages can be implemented on the top of continuations, so maybe their placement in that core is warranted.]

2 comentários / comments

Lisp without cons cells

2016-05-28 13:14 -0300. Tags: comp, prog, pldesign, lisp, ramble, in-english

Okay, I'm gonna write this down now to distract myself for a while before I get back to Master's stuff.

In a recent post I talked about the problem of cross-process garbage collection, and suggested wrapping objects in a reference-counted container when crossing process boundaries as a possible solution, but I remarked that this would have a large overhead when passing many small objects. The prime example would be passing a linked list, as (at least naively) every node of the list would get wrapped as the elements of the list are accessed.

Now, I particularly cared about this case because the linked list (based on cons cells) is a very prominent data structure in Lisp. And although they have some nice properties (they are conceptually simple, you can insert and remove elements into the middle/end of a list by mutating the cdrs), they also are not exactly the most efficient data structure in the world: half the memory they use is just for storing the "next" pointer (which fills processor cache), whereas in a vector you just need a header of constant size (indicating the vector size and other metadata) and the rest of the memory used is all payload. Also, vectors have better locality. On the other hand, "consing" (i.e., nondestructively inserting) an element into a vector is O(n), because you have to copy the whole vector, and even destructive insertion may require a whole copy every once in a while (when you exceed the current capacity of the vector). I've been wondering for a long time: could you make a Lisp based on a data structure that is halfway between a linked list and a vector?

If we are to allow the common Lisp idioms with this new kind of list, it has to support consing and taking the tail of the list efficiently. (Another possibility is to replace the common idioms with something else. That is much more open-ended and requires more thought.)

What I've been thinking of as of late is roughly a linked list of vectors, with some bells and whistles; each vector would be a chunk of the list. Each vector/chunk would have a header containing: (1) the number of elements in the chunk; (2) a link to the next chunk; (3) an index into the next chunk. Then comes the payload. So, for example, if you have the list (w x y z), and you want to append the list (a b c) on the front of it, you'd get a structure like this (the | separates graphically the header from the payload; it does not represent anything in memory):

[3 * 0 | a b c]
   |
   `->[4 * 0 | w x y z]
         |
         `-> ø

The reason for the index is that now you can return the tail of a list lst without the first n elements by returning a vector chunk with 0 length and a pointer into lst with index n: [0 lst n | ]. If the n is greater than the size of the first chunk (e.g., if you want to drop 5 elements from the (a b c w x y z) list above), we must follow the "next" pointers until we find the chunk where the desired tail begins. This is likely to be more efficient than the cons cell case, because instead of following n "next" pointers, you follow the number of chunks, subtracting the length of the skipped chunk from n each time. In the worst case, where there is one chunk for each element, the performance is the same as for cons cells, at least in number of pointers traversals. (We must only allow empty chunks, like the [0 lst n | ] example, at the beginning of a list, never in the middle of a chunk sequence. This ensures worst-case cons-like behavior. If we allowed empty chunks anywhere, reaching the nth element of a list could require arbitrarily many chunk traversals.)

One problem with this is that now (cdr lst) allocates memory (it creates a [0 lst 1 | ] chunk and returns it), unlike the cons cell case, where cdr never allocates memory (it just returns the value of the cell's "next" pointer). One possible solution is to try to make the result of cdr go in the stack rather than being heap-allocated, to reduce the overhead (the compiler could special-case cdr somehow to make it return multiple values rather than a new chunk, and build the chunk on the fly in the caller if it turns out to be necessary.) Another way around this would be to return a pointer into the middle of a chunk instead of a new chunk. I see two ways of achieving this:

Change the layout of the vector such that it begins with the payload, followed by a special value that marks the end of the payload, followed by the pointer to the next chunk. Now we don't need an index into the next chunk anymore, because we can just make the "next" pointer point into the middle of the next chunk. The drawback is that now finding the end of the chunk requires traversing the whole chunk, rather than looking at the length at the header. This would still be faster than with cons cells, because you're traversing adjacent elements in memory.
Require all chunks to have a fixed size and fixed memory alignment (say, every chunk has 64 bytes and is allocated at an address that is a multiple of 64). Then you can always find the header by zeroing out the last bits of the address.
Third idea that occurred to me now: if all chunks are, say, 16-byte aligned, you can encode an index 0~15 into the lower-order bits of the pointer.

All these have drawbacks. First, you need to know that the pointer you have is a pointer to a cons cell to be able to safely do the pointer arithmetic. (The fixed-size chunks case is simpler to solve: you zero out the pointer and see if it points to a chunk type tag.) Also, pointers into the middle of objects complicate garbage collection (and even more reference counting, I think). Finally, if you fix the size of chunks some of the advantages of using chunks in first place go away; if I allocate a 1000-element list at once, that should get me a single 1000-element chunk.

Or should it? Another problem here is that now garbage collection / reference counting can only collect whole chunks. If you choose your chunks badly, you may end up holding memory for longer than necessary. For instance, if you have a 1000-element list and at some point your program takes tails until it only remains with a reference to the last three elements, and the list was made out of a single 1000-element chunk, now you're stuck with a huge chunk most of which is unused – and more, all the elements in it are held from being collected too. Maybe we'd need a heuristic: if the tail size you want is less than some threshold size of the chunk, the system would return a copy of the tail rather than the tail. This would mess with mutability (you'd never know if the tail list you got shares storage with the original), but maybe immutable lists are the way to go anyway.

The other problem to solve is how to make cons efficient: the classical Lisp cons adds (non-destructively) one element to the front of an existing list, and we don't want to create a new chunk per cons invocation, otherwise the chunks just degenerate into cons cells. One idea I had is to allocate chunks with a least a certain amount of elements. For example, if you create a list with just a, you'd get a chunk with a few blank spaces (and enough metadata to know what is blank and what isn't; this could be an extra header element, or just a distinguished value meaning "blank"): [4 ø 0 | _ _ _ a]. Now, when you cons a new element x into that list, cons would check if there is a space immediately before the a in the existing chunk, and mutate it in place: [4 ø 0 | _ _ x a]. This won't mess with the program's view of the list because so far it only had references to the already filled part of the list. The problem with this is if you have multiple threads wanting to cons onto the same list at the same time: we must ensure only one of them gets to mutate the chunk. For example, say one thread want to cons x onto the list (a), and another thread wants to cons y onto the same list (a). We must make sure that only one gets to mutate the chunk in place ([4 ø 0 | _ _ x a]), and the other one will fail and fall back to either by copying the chunk and then mutating the copy, or by creating a new chunk that points to the old one ([4 [4 ø _ _ x a] 3 | _ _ _ y]; note that outer chunk points into the inner chunk with an index 3, skipping the first 3 elements, including the x added by the other thread). This could have a synchronization overhead. I'm not sure if it would be significant, though, because all you need is a compare-and-swap: "try to write into this space if it is blank". You don't need a lock because you don't need to wait anyone: if this first try fails (i.e., if the other thread got the space first), the space won't be available anymore, so you must immediately fall back to creating a new chunk rather than waiting for anything.

A possible side-effect of all of this is that now vectors as a separate data structure may not be necessary: you just allocate an n-element list at once, and it will largely have the same performance as an n-element vector. Well, unless we make lists immutable, then we may need (mutable) vectors. And lists still have some arithmetic overhead to find the position of the element (because in general we don't know that the list is a single chunk when performing an access, we have to find that out), so vectors may still be advantageous in many circumstances.

Now, back to (trying to) work.

[Update: Apparently I reinvented a half-hearted version of VLists. Also, I didn't mention that, but the Lisp Machine had a feature similar in spirit (but not in implementation) called CDR coding, which used a special tag in cons cells to mean that the rest of the list itself rather than a pointer to it was stored at the cdr place, thus saving one pointer and gaining locality. In the Lisp Machine, every memory object was tagged, so this special tag came more or less for free, which is generally not the case for modern architectures.]

Comentários / Comments

C is in an identity crisis, and some thoughts on undefined behavior

2016-05-19 23:11 -0300. Tags: comp, prog, c, pldesign, ramble, in-english

So, stories about undefined behavior have been making rounds again in my Twitter and RSS feeds (two things I was supposed not to be using, but anyway), which brought me some new thoughts and some other thoughts I meant to blog about ages ago but forgot about them.

The most recent one was this comment on Hacker News (via @pcwalton, via @jamesiry), which presents the following code, which is supposed to take a circular linked list, take note of the head of the list, and walk around the list freeing each node until it finds the node that points back to the head (and thus the end of the list):

void free_circularly_linked_list(struct node *head) {
  struct node *tmp = head;
  do {
    struct node *next = tmp->next;
    free(tmp);
    tmp = next;
  } while (tmp != head);
}

This looks (to me) as good C code as it gets. However, this code triggers undefined behavior: after the first iteration of the loop frees the node pointed to by head, it is undefined behavior to perform the tmp != head comparison, even though head is not dereferenced.

I don't know what is the rationale behind this. Maybe that would make it possible to run C in a garbage-collected environment where as soon as an object is freed, all references to it are zeroed out. (The fact that no one has ever done this (as far as I know) is mere detail. The fact that in a garbage-collected environment free would likely be a no-op is a mere detail too.)

The feeling I had after I read this is that C is in a kind of identity crisis: C allows you to do all sorts of unsafe operations because (I'd assume) it's supposed to let you do the kind of bit-bashing you often want to do in low-level code; at the same time, modern standards forbid that very bit-bashing. What is the point of programming in C anymore?

[Addendum: To be more clear, what is the purported goal of the C language? The feeling I have is that it has moved from its original function as a "higher-level assembly" that is good for systems programming, and is trying to serve a wider audience more preoccupied with performance, but in doing so it is not serving either audience very well.]

And the standards forbid these operations in the worst possible way: by claiming that the behavior is undefined, i.e., claiming that compilers are free to do whatever the hell they please with code perfoming such operations. Compilers keep becoming better and better at exploiting this sort of undefinedness to better "optimize" code (for speed, anyway). Meanwhile, they keep breaking existing code, and opening new and shiny security vulnerabilities in programs. The NSA probably loves this.

At this point, I'm beginning to think that C does not serve its purpose well anymore. The problem is that there seems to be no real alternative available. Maybe Rust can be it, although I don't really know how well Rust does in the bit-twiddling camp (e.g., can you easily perform bitwise and/or with a pointer in Rust? Well, come to think of it, even C does not allow that; you have to cast to an integer first.)

* * *

The other undefined behavior I've been reading about lately is signed overflow. In C, signed overflow is undefined, which means that code like:

if (value + increment < value) {
    printf("Overflow occurred! Aborting!\n");
    exit(1);
}
else {
    printf("No overflow; proceeding normally\n");
    value += increment;
}

is broken, because the compiler is likely to optimize the overflow check and the then branch away and just leave the else branch. I have seen two rationales given for that:

Pointer arithmetic. In the good old times, an int and a pointer used to have the same size. People happily used ints as array indices. Array indexing is just pointer arithmetic, and in some architectures (like x86), you can often perform the pointer arithmetic plus load in a single instruction.

Then came 64-bit architectures. For reasons I don't really get (compatibility?), on x86-64 and other 64-bit architectures ints remained 32-bit even though pointers became 64-bit. The problem now is that transformations that assumed integers and pointers to be the same size don't work anymore, because now their point of overflow is different. For example, suppose you had code like:

void walk_string(char *s) {
    for (int i=0; s[i]; i++) {
        do_something(s[i]);
    }
}

Usually, the compiler would be able to replace this with:

void walk_string(char *s) {
    for (; *s; s++) {
        do_something(*s);
    }
}

which is potentially more efficient. If ints and pointers have the same size, then this transformation is okay regardless of overflow, because the int would only overflow at the same point the pointer would anyway. Now, if ints are supposed to wrap at 32-bits but pointers wrap at 64-bits, then this transformation is not valid anymore, because the pointer version does not preserve the overflow behavior of the original. By making signed overflow undefined, the problem is sidestepped entirely, because now at the point of overflow the compiler is free to do whatever the hell it pleases, so the fact that the overflow behavior of the original is not preserved does not matter.

Now, there is a number of things wrong in this scenario:

Now ints cannot be used to index into an arbitrarily-sized array, only 2³²-element ones at most. In any sane language (I'd argue), the standard integer type would be one encompassing the full range of valid array indices. (Common Lisp explicitly requires fixnums to have at least that range, for instance.)
Unsigned overflow is defined. That's good, but at the same time it seems kinda strange that we have been using signed integers for array indices all this time, to the point that the undefinedness was chosen precisely to enable optimization of this usage, even though, come to think of it, it seems kinda wrong to use a signed integer in this circumstance (if your array takes more than half the memory space, this would be wrong, for instance).
[Update: Speaking of the devil...]

Optimizations based on "real math". The other reason I am aware of for making signed overflow undefined is to enable optimizations based on the mathematical properties of actual mathematical integers. An example is assuming that x+1 > x, for instance (which is what breaks the overflow test mentioned before). Another example is assuming that in a loop like:

for (i=0; i<=limit; i++) { ... }

the halting condition i<=limit will eventually be true, and therefore the loop will finish; if i were defined to overflow, then this loop would be infinite when limit == INT_MAX. Knowing that a loop terminates enables some optimizations. The linked article mentions enabling use of loop-specific instructions which assume termination in some architectures. Another advantage of knowing that a loop terminates is enabling moving code around, because non-termination is an externally-visible effect and you may not be able to move code across the boundaries of an externally-visible event ^{[please clarify]}. Now, something that occurred to me back when I read that post is that it assumes a dichotomy between either treating overflow as undefined, or defining it to wrap around. But there are other possibilities not explored here. I don't necessarily claim that they are feasible or better, but it's interesting to think on what optimizations they would enable or preclude. For instance:

One possibility is to actually make integers behave like their mathematical counterparts. This is what Common Lisp and Python do, for instance, by automatically promoting fixnums to bignums when they exceed the representation capacity of fixnums. In this case, all optimizations based on actual arithmetic are not only possible but safe, because the language integers actually respect those properties. The drawback is that now you have to deal with bignums, and check whether a number is a fixnum or a bignum before performing arithmetic.
Another possibility is to make integer overflow trap, i.e., raise an exception, abort execution, or something along those lines. In this case, an assumption like x+1 > x is true as long as x+1 does not trap. This might not enable the same amount of optimization as claiming undefined behavior for overflow, but at least it enables knowing that a loop terminates (either by hitting the termination condition or by aborting on overflow). Although aborting on overflow is an externally-visible event too, so I don't know if you would be able to perform the aforementioned code moving. Actually, I'd need to see an example of code moving that would be benefited by knowing about termination. I need to research this. [I rememeber seeing some code involving multiple threads and the observable behavior of one thread from another, but (1) I'd have to find it; (2) I'd have to be convinced that the mentioned optimization would be a good idea anyway.]

Now, the problem with the trapping semantics is that you have to check for overflow on every operation. This could be costly, but there are people working on making it more efficient. Seriously, this is the kind of thing that would be trivial if only architectures would help a little bit. Having the processor trap on overflow (either by having special trapping arithmetic instructions, or by having a special mode/flag which would enable trapping) would make this essentially costless, I think. Another nice-to-have would be a set of arithmetic instructions which treated the lower bits of words specially as flags, and trapped when the flags were not, say, all zeros. This could drastically reduce the cost of having fixnums and bignums in the language; the instructions would trap on non-fixnums and invoke a handler to perform the bignum arithmetic (or even find out that the operands are not numbers at all, and signal a dynamic type error), and perform the fast integer arithmetic when the operands had the fixnum flag. Alas, unfortunately we cannot just invent our own instructions, as we typically want to be able to use our programming languages on existing platforms. We could try to lobby Intel/AMD, though (as if).

(At this point I'm not really thinking about semantics for C anymore, just about the possible semantics for integers in a new programming language. Even if, say, Clang incorporated an efficient mechanism for trapping integer overflow, the standard would still say that signed overflow is undefined, so I'm not sure there is much hope for C in this situation.)

2 comentários / comments

Random project ideas

2016-05-13 00:00 -0300. Tags: comp, prog, pldesign, ramble, in-english

A few days ago someone asked me what kind of projects I would like to work on. I have some ideas that have been wandering my mind for a long time, so I decided to write some of them down. (I'm sure someone will come around, steal some of them, and make millions of dollars, but there we go.) I have been meaning to write a post like this one for quite a while, but never got around it. For the next days I will be working on finishing my Master's monograph, so I decided write them down now. The text is kind of a mess, and especially towards the end it is more of a "thinking out loud" sort than a clear exposition of ideas, but it's probably a good idea (for me, anyway) to register these ideas at once before I forget them. As Zhuangzi would say, "Let me say a few careless words to you and you listen carelessly, all right?"

A programming language

Although there are plenty of nice programming languages around (and plenty of un-nice ones as well), I don't know any language I'm fully satisfied with, and I certainly would like to try creating one. I don't think there can be such a thing as one perfect language to rule them all, but I'd like to create one that is better suited to the way I like to do things and to the kinds of things I like to do. Some features of that hypothetical language would be:

Optional types

After writing some larger programs in Scheme, I realized that some static type analysis would go a long way in catching many programming mistakes. Moreover, while working (for a short time) with the Clang/LLVM codebase, I realized how types can be useful in finding other places in a program which need to be changed after a change in one part. However, I strongly believe in the ability to run incomplete (and incorrect) programs, both for the sake of debugging (it is often convenient to be able to run an incorrect program with some concrete data to see what is the state at the point of the error, rather than relying solely on static type errors), and for the sake of exploratory programming (when you want to try out stuff and figure out what program do you want to write after all).

(Some static-types people, when confronted with this notion, give replies like: "but static types help me with exploratory programming!". Well, dynamic typing helps me with exploratory programming. As I said before, I don't believe in a single programming language to rule them all, and neither do I believe that either static typing or dynamic typing is inherently superior in all circumstances and for all people. Indeed, I would actually like not to have to choose one or another, and that's kind of the point of this section.)

So, I would like to alternate between dynamic and static types according to convenience. First of all, I don't think static type mismatches should preclude compilation/execution; rather, these should emit warnings, and the compiler should emit code that runs as far as possible before hitting the error.

In dynamically-typed languages, it is relatively easy to call functions with arguments of the wrong type and keep running, because all values carry enough metadata (tags of some kind) to enable checking the type at any given moment. Statically-typed languages, on the other hand, usually employ "unboxed" representations for data (e.g., a 32-bit integer in memory is just 32 bits, with no extra tag indicating that it is an integer; a pointer is just another 32 (or 64) bits in memory, with no way of distinguishing it from an integer of the same size), therefore calling a function with a value of the wrong type (e.g., passing an integer to a function expecting a pointer) must be blocked; otherwise, there would be a violation of safety (e.g., the function would try to use the integer as a pointer, probably with disastrous consequences). Conventional statically-typed languages block execution of such unsafe function calls by refusing to compile the program, but that's not necessary: one might instead emit code that evaluates the program until the point where an unsafe function call would be performed, and abort execution just then. For instance, if f is a function taking integers, and g(x) returns a string, and the program contains the expression f(g(x)), you can still emit code that evaluates g(x) (and f), and then interrupts execution instead of calling f with the string. (I've written about this before, but anyway.)

So, you could have a language where (1) type declarations are optional; (2) if function parameter/return types are not declared, dynamic types and boxed representations are used by default; (3) if types are declared, static type checking is performed on function calls, and where the compiler cannot prove that the types match (e.g., because they don't match, or because a function with statically-typed parameters is called with a dynamically-typed value), it emits code to interrupt execution (and provide a decent diagnostic message, and/or trigger the debugger) just before the function is called. I wonder if interrupting execution at the function call wouldn't be too early (as ideally I'd like to be able run an incomplete/incorrect program as far as is reasonable). Perhaps there could be a compiler option/pragma to emit code for the entire module as if everything were dynamically typed, even when type declarations are provided (but still emit warnings upon detected type mismatches).

This mix of static and dynamic types brings some implementation problems. Because statically-typed functions can't just be called with any random argument due to representation mismatches, if a function expects another function as an argument (let's say, a function f expects a function taking an integer and returning an integer), you cannot pass it a dynamically-typed function g that happens to return integers when given integers, because the representation of the values it returns would be different (g returns boxed integers, but f expects a function that returns raw integers). Likewise, if a function f expects a dynamically-typed function, it cannot be given a function g that returns raw integers. One solution would be to generate a dynamically-typed wrapper around statically-typed functions when they are used in a context that expects a dynamic function, and vice-versa.

To research: It'd probably be good to see how Haskell deals with the situation where an Int -> Int function is used where an a -> a one is expected. Although now that I wrote this I realize that that's not the same problem, because in Haskell the concrete a is always statically known at the point where the integer value is used. Probably more promising would be to look at how Java and C# deal with this. (In Java, the difference between a raw integer and an Integer object is directly visible to the programmer, whereas I think that's not the case in C#.)

There has also been lots of work on the subject of gradual types, both recent and old, and I have to check it out.

Language interoperability

I'm probably beginning to sound repetitive, but as I said before, I don't believe in a single language for all circumstances, and not only I want to be able to use other existing programming languages, I'm probably going to create more than one in my lifetime. Still, I don't like the idea of having to reimplement code already available in a library just because the library is not available in the language I want to use. I'd like to be able to easily integrate pieces of code written in different programming languages, without having to manually write wrappers/bindings/etc. Ideally, there would be a "universal" way of exchanging data / invoking procedures across programming languages, and the "bridging" work would only have to be done once for each programming language (to make it support the "universal" way), rather than for each library and each language you want to use the library in.

This already kinda exists: the JVM and CLR do this – for languages that are implemented on the top of the JVM or CLR. The problem is that you are constrained to implement your language on the top of these virtual machines, which might not be a good match for the programming language in question (e.g., the VM may not support tail call optimization, or multiple inheritance, or you want to use the stack in a really strange way, or you want your strings to be UTF-8 instead of UTF-16, or you want direct access to operating system features, or whatever), or you may simply not want to use a bulky VM, or one whose evolution is in the hands of a single company and cramped by API copyright claims and/or patents.

An idea that occurred to me in the past is that of an extensible virtual machine: a machine where you could load modules that implemented new opcodes/features that would make it better suited to various programming languages. It is possible to do this without opcode conflicts between the different extensions by giving opcodes (fully-qualified) names instead of fixed numbers; the mapping from concrete opcode numbers to opcode names would be given in a header in the bytecode file, and would not be fixed. (I'm pretty sure I got this idea from WebAssembly, but I haven't been able to find a source right now.) So, for instance, you could have a Python extension for the VM, with Python-specific features and opcodes, and a Scheme extension, and so on. Of course, making all extensions work together is not just a matter of avoiding opcode conflicts, but having the freedom to extend the VM with features more convienient to each language to be supported sounds more promising than being forced to work with a fixed instruction set which did not have the language of interest in mind.

VMs are not the only possibility (although VMs have a number of other possible advantages, which I may write about in another post). Another way would be to let each language implementation be completely independent, but make it possible somehow for them to share data and call code from each other. This has the advantage of (potentially) making it easier to accomodate existing language implementations (rather than having to reimplement them as VM extensions), but of course brings with it a number of challenges to make interoperability work, such as:

Handling mismatches between data layout in memory. One idea that immediately comes to mind is message-passing/OO: it does not matter if your strings are UTF-8 or UTF-16 or something else if you always access their contents with a chatAt(i) method which knows how to find the ith character in it. You still have to agree on the format of characters, though. And it probably would be slow as hell to access everything through methods / message-passing. (But you could have methods like asUTF8() or asUTF16() which would return the string (or a copy thereof) in the specified format, which you could directly manipulate.) You also have to agree on a convention on how to invoke a method (but we would have to do that anyway, in principle).
Garbage-collecting data that crosses language boundaries. Basically you have multiple garbage collectors working at the same time, trying not to step on each others' toes. If language A calls a method from language B which returns an object x, now A holds a reference to x, but B's garbage collector is not prepared to go looking inside A's objects in memory for references to x. So it might end up collecting x while A still has references to it. One awkward solution would be to make everyone use the same garbage collector (for instance, a conservative one like Boehm–Demers–Weiser), but this precludes interesting implementations like Cheney-on-the-MTA linked before (which is used by Chicken Scheme, for instance). Another possibility would be to wrap every object that crosses language boundaries in a reference-counted container (which might also take care of other interoperability issues, such as containing enough information to allow inspecting the encapsulated object's methods). In the previous example, B would return x to A encapsulated within a wrapper X with a reference count of 2, meaning two runtimes (A and B) have access to the object. When B decides nothing in its memory holds a reference to x, the element would somehow (handwave) rememeber that x has been given to another runtime inside a container X, and decrement the container's refcount, instead of freeing x. When the refcount reaches zero, x can be freed. That could become unwieldy for large numbers of objects (e.g., if B gives A a linked list, every time A asks for the next element in the list, the element would get wrapped).

Another possibility, which got into my mind after reading this, would be a model in which no memory is shared between languages: everything is passed by copy, and things don't even have to be in the same process. (This brings up another problem that has been in my mind since forever: the fact that Unix streams, files, etc., are limited to bytes rather than structured data. Here we would be defininig a mechanism for exchanging higher-level data across processes.) This solves the data exchange problem, but not the code invocation problem. This could be done by message passing, but I wonder how that would work. For instance, suppose language A wants to use a regex library from language B.

A wants to search for all occurrences of a regex r within a 50MB string s. It would have to pass r and s by copy to B, and get the (potentially large) result by copy too, which sounds sub-optimal.
A wants to search for all occurrences of a regex r within a stream (open file, socket, generator, whatever) s. How would it pass s to B? Would it pass a handle h to B, and then listen for messages of the form "[nextItem h]" and answer accordingly? How would it know, in the general case, when to stop listening for messages about h from B? Wouldn't I have now a cross-process garbage collection problem, and now we're back to the wrapping solution?

To research: There is an infinite number of things to research here. One that I realized recently is GObject, GNOME/Gtk's way of doing objects in C which was created with the goal of easily supporting bindings to other languages in mind. Maybe GObject solves all the problems and there's nothing to do (other than making everything support GObject for exchange of arbitrary objects), or maybe GObject can be used as an inspiration. I should also look up other inter-process communication mechanisms (especially D-Bus, and Erlang/OTP). Other people have been working on this problem from other angles.

EOF

Sorry if this post looks like a mess of random ideas thrown around, because that's exactly what it is. As always, writing about stuff helps me organize my ideas about them and realize problems I hadn't thought about before (though how organized things are here is up to question). Feel free to comment.

Comentários / Comments

You're doing it completely wrong

2015-04-28 22:03 -0300. Tags: comp, prog, wrong, ramble, em-portugues

Encontrei um e-mail de três anos atrás que explica uma das coisas que eu acho que estão "completely wrong" com os ambientes computacionais modernos. O texto era para ter servido de base para um post que nunca foi escrito. Reproduzo-o semi-intacto abaixo. (Fico feliz que a minha visão ainda seja a mesma, mas por outro lado o "plano para os próximos dez mil anos" parece longe de se realizar...)

* * *

From: Vítor De Araújo
To: Carolina Nogueira
Subject: You're doing it completely wrong
Date: Mon, 12 Mar 2012 00:37:18 -0300

[Core dump follows.]

Acho que eu não vou escrever o post tão cedo, então explico agora. O que há de completely wrong é que os programas em execução são "caixas pretas"; em geral um programa é um brinquedo estático, que tu não pode "olhar dentro" e modificar facilmente. Em um mundo ideal tu deveria poder alterar partes de qualquer programa sem ter que recompilar tudo, e de preferência sem sequer ter que fechar e abrir o programa para as mudanças entrarem em efeito.

Exemplo simples: o Pidgin a cada 60 segundos manda um "XMPP ping" pro servidor de XMPP (Jabber), para ver se a conexão ainda está de pé. Se eu quiser mudar de quanto em quanto tempo ele manda o request, ou quanto tempo ele espera por uma resposta, eu tenho que alterar o fonte. Idealmente:

Isso poderia ser (e talvez já seja) uma variavelzinha feliz do programa, que eu deveria poder alterar durante a execução;
Alterar o fonte não deveria ser nenhum grande mistério; o fonte deveria ser facilmente acessível e modificável a partir do programa, e eu deveria poder recompilar uma função ou módulo individual sem ter que recompilar o universo. (Mais de uma vez eu quis mudar coisinhas simples do Pidgin, e.g., eliminar o atalho Ctrl-Up, que bizonhamente exigem alterar o fonte, e falhei miseravelmente em conseguir fazer o troço compilar.)

Além disso, deveria ser fácil desfazer qualquer modificação. Versioning é uma tecnologia muito supimpa dos anos 60 que, como muitas tecnologias supimpas do passado, se perdeu com a popularização de maquininhas pequenas com recursos insuficientes pra suportar a coisa. Hoje em dia temos recursos sobrando, mas a tecnologia caiu no esquecimento.

Exemplo menos simples: o Claws Mail por padrão ordena as mensagens em ordem ascendente de data (da mais antiga pra mais recente), e ele lembra o critério de ordenação por pasta, o que quer dizer que o usuário tem que alterar pra ordem decrescente individualmente pra cada uma das mais de oito mil pastas que há (Inbox, Sent, Queue, Trash, Drafts pra cada mailbox, mais os feeds de RSS). Provavelmente 99.5% das pessoas esperam ordem decrescente por padrão, e provavelmente teria que alterar pouca coisa do código pra mudar o padrão, mas a barreira pra alterar o software é muito grande (baixar fontes, encontrar o trecho de código relevante, se entender com ./configures e bibliotecas da vida, recompilar o mundo, fechar e abrir o programa). Idealmente, eu deveria poder apontar para um menu relacionado com o que eu quero, teclar alguma combinação mágica, e ir parar na função que aquele menu ativa. Daí em diante eu deveria conseguir navegar pelo código até achar o trecho relevante; os nomes de funções e variáveis deveriam funcionar como links (coisa que acho que é possível com a maior parte das IDEs atuais). Tendo achado o dito cujo, eu deveria poder alterar o código, mandar recompilar aquele trecho, e as mudanças se tornarem imediatamente efetivas.

E aparentemente essa utopia toda já existiu, sob o nome de Lisp Machine. Este cara diz:

Within DW, you can move your mouse over any object or subobject that has been displayed, whether or not it was part of any GUI. You get this virtually for free and have to work hard to suppress it if you don't want it. But it doesn't require any special programming.
One of the most common ways to get a foothold in Genera for debugging something is to (1) find a place where the thing you want to use appears visually, (2) click Super-Left to get the object into your hands for read-eval-print, inspection, etc, (3) use (ed (type-of *)) to find its source reliably and with no foreknowledge of what the type is, who wrote the program, what their file conventions were, who loaded it, or any of myriad other things that other systems make me do.

Quer dizer, tu pode pegar um objeto qualquer presente na tela (e praticamente tudo é um objeto, pelo visto), descobrir o tipo/classe dele e mandar editar o código correspondente [(ed (type-of *))]. Eu queria rodar o simulador de Genera aqui pra ver isso funcionando na prática, mas não consegui fazer ele funcionar total... :-(

Enfim, muita tecnologia gloriosa do passado se perdeu. Um dos meus objetivos para os próximos dez mil anos é avançar o estado da arte a o que ele era nos anos 80... :P

Aí o camarada se pergunta: será que faz sentido ter essa tecnologia em um desktop? Será que o "usuário final" tem algum interesse nisso tudo? A resposta é a mesma que se pode dar para justificar a presença de uma linha de comando em um sistema visando o "usuário final" (e.g., Ubuntu, Mac OS):

Mesmo que o usuário final não tenha interesse, se é de ajuda ao desenvolvedor e não prejudica o usuário de maneira alguma, não há por que não ter a feature.
Alguns usuários têm interesse em explorar o funcionamento do sistema e usar as features mais avançadas. Em particular, se a barreira para editar os programas for pequena e for fácil desfazer quaisquer modificações, é bem possível que muito mais gente tenha interesse em bagunçar com os internals das coisas.
Uma coisa muito legal que se vê com Ubuntus da vida é que mesmo um usuário que não saiba usar a linha de comando pode pedir ajuda num fórum e alguém que sabe pode ajudar o cara a resolver um problema que envolva brincar de linha de comando. Da mesma forma, um usuário poderia resolver problemas que envolvessem alterar um programa com a ajuda de pessoinhas em fóruns e afins.

Além de dar todo o poder para os cidadãos infra-vermelhos, o que lembra um pouco as idéias do camarada Alan Kay [1][2], a tendência é que a facilidade de alterar e testar os programas leve a programas mais bem testados, menos bugados e que tendam à perfeição mais rápido. Todos comemora.

Existem mil dificuldades envolvidas na criação de um sistema assim, especialmente se quisermos abandonar o modelo da Lisp Machine de uma máquina mono-usuário em que todos os processos podem tudo, mas isso fica pra outra discussão.

Escrevi bem mais do que eu pretendia. Parece que eu até tenho material para um post... :P

Comentários / Comments

Mind dump

2015-04-10 01:34 -0300. Tags: comp, prog, pldesign, lash, life, mind, ramble, music, em-portugues

Coloquei o lash no GitHub, for what it's worth. Eu me pergunto se foi uma coisa sensata publicar ele agora, mas já faz um tempo que eu vinha anunciando que ia publicar "em breve", então coloquei lá de uma vez. (Além disso, esses dias eu quis mexer nele fora de casa e não tinha o código.) O código está em um estágio bem inicial – vergonhosamente inicial, dado que já faz umas três semanas que comecei a trabalhar nele, e o que eu fiz até agora eu provavelmente poderia ter feito em uns três dias se tivesse tido a disciplina de trabalhar nele semi-diariamente. Por outro lado, o projeto está andando para frente, mesmo que devagar, o que já é melhor do que todos os anos anteriores em que eu disse "puxa, eu queria fazer um shell" e não escrevi uma linha de código. So, that's progress. Além disso, na atual conjuntura eu provavelmente deveria tentar relaxar um pouco a cuca e me preocupar menos com isso; afinal, isso é um projeto pessoal e eu não devo nada para ninguém. No final das contas, megalomanias de dominação mundial à parte, o principal afetado pelo bom sucesso do projeto sou eu mesmo.

Eu cheguei à brilhante e inaudita conclusão de que eu vou ter que reduzir bastante minha atividade twittereira e internética em geral se eu quiser começar a fazer alguma coisa produtiva com a minha vida (onde escrever um shell e estudar línguas obscuras contam como coisas produtivas). Por outro lado, a Internet atualmente é responsável por uns 95% das minhas interações sociais, especialmente agora que eu não tenho mais aulas, coisa que faz uma certa falta, a despeito da minha fama de anti-social. A solução provavelmente é (shudder) sair de casa e falar com pessoas.

Eu também cheguei à conclusão (igualmente brilhante) de que muita coisa nessa vida é questão de criar hábitos. Por exemplo, até algumas semanas atrás eu costumava usar toda a louça da casa até não ter mais louça limpa, momento em que eu aplicava o garbage collector e lavava tudo (ou, dependendo da preguiça, só o que eu precisasse na hora). Eu me dei conta de que isso não estava sendo muito conveniente e resolvi começar a lavar as coisas logo depois que uso, ou antes de ir dormir. No começo era meio ruim ter que me "obrigar" a fazer isso, mas agora eu já me habituei e isso não me incomoda mais tanto (além do que, como a louça não acumula, normalmente o esforço de lavar é pequeno). Talvez seja uma questão de criar o hábito de sentar uma hora do dia para programar/estudar/whatever. O flip side disso é que a gente também se habitua ao longo da vida a uma porção de coisas que a gente deveria questionar e/ou atirar pela janela, não só hábitos acionais como também (e principalmente) hábitos mentais. Estas são minhas (brilhantes e inauditas) palavras de sabedoria do dia. (Tecnicamente todo hábito é mental, mas deu pra entender. Acho.)

Escrever o lash em Chicken Scheme tem sido uma experiência bastante agradável. Eu estou aprendendo (a parte não-R5RS d)a linguagem "as I go", mas até agora a linguagem não me deixou na mão, a implementação é estável e gera executáveis pequenos e razoavelmente rápidos, e o sistema de pacotes funciona. (Rodar chicken-install pacote e consistentemente ver o pacote ser baixado e compilado sem erros era quase chocante no começo. O fato de que as bibliotecas são shared objects (a.k.a. DLLs) de verdade e carregam instantaneamente também muito alegrou o espírito, especialmente dada minha experiência anterior com bibliotecas em Common Lisp.) A única coisa que deixa um pouco a desejar é o error reporting, mas nada "deal-breaking".

Eu me dei conta de que uma das coisas que eu mais gosto em linguagens "dinâmicas" é a habilidade de rodar um programa incompleto. Eu já meio que escrevi sobre isso antes, mas eu já não lembrava mais quão deeply satisfying é poder rodar um programa pela metade e ver a parte que foi escrita até o momento funcionando. Por outro lado, é bastante incômodo errar um nome de função ou os argumentos e só descobrir o erro em tempo de execução. Faz muito tempo que eu acho que o ideal seria uma linguagem com análise estática de tipos, mas em que erros de tipo gerassem warnings ao invés de impedir a compilação, e que permitisse a declaração opcional dos tipos de variáveis e funções. Uma dificuldade que eu via nisso até agora é que enquanto em uma linguagem dinâmica os dados costumam ter uma representação uniforme em memória que carrega consigo alguma tag indicando o tipo do dado, e portanto é possível chamar uma função com argumentos do tipo errado e detectar isso em tempo de execução, em uma linguagem estaticamente tipada convencional os dados costumam ter uma representação untagged/unboxed e de tamanho variável, o que tornaria impossível compilar um programa com um erro de tipo sem violar a segurança da linguagem (e.g., se uma função f recebe um vetor e eu a chamo com um int, ou eu rejeito o programa em tempo de compilação, o que atrapalha minha habilidade de rodar programas incompletos/incorretos, ou eu gero um programa que interpreta o meu int como um vetor sem que isso seja detectado em tempo de execução, o que provavelmente vai causar um segfault ou algo pior). Porém, esses dias eu me dei conta de que ao invés de compilar uma chamada (insegura) a f(some_int), poder-se-ia simplesmente (além de gerar o warning) compilar uma chamada a error(f, some_int), onde error é uma função que avalia os argumentos e lança uma exceção descrevendo o erro de tipo. O resultado prático é que o executável gerado roda até o ponto em que é seguro rodar (inclusive avaliando a função e os argumentos) e interrompe a execução no ponto em que seria necessário chamar a função com um argumento de tipo/representação incompatível. Melhor dos dois mundos, não? Vai para o meu caderninho de idéias para a Linguagem Perfeita™.

Eu ia escrever mais umas notas sobre a vida, mas ultimamente eu ando mui receoso quanto a publicar coisas da minha vida pessoal – o que provavelmente é uma boa idéia. É fácil esquecer que qualquer um, no presente ou no futuro, pode ler o que a gente escreve nessa tal de Internet. Eu também já falei sobre isso mil vezes antes, which makes it all the more surprising que eu ainda tenha que me relembrar disso ocasionalmente. Anyway.

Unrelated com qualquer coisa, conheci recentemente uma bandinha chamada Clannad, na qual eu me encontro totalmente viciado no momento. Também conheci uma coisa totalmente excelente chamada Galandum Galundaina, uma banda mirandesa com certeza.

12 comentários / comments

Yeah, I code-switch heavily

2014-11-02 02:31 -0200. Tags: lang, life, ramble, em-portugues

Semana passada eu resolvi ir no Verda Kafo (evento esperantista que ocorre no último sábado de todo mês em Porto Alegre), depois de alguns meses de sumiço. Um verdkafano perguntou como ia o meu mestrado, e o Marcus mencionou que o meu blog tinha a resposta. A pedidos, eu passei a URL para ele e mais um dos participantes. Foi só algumas horas depois disso que eu pensei "bá, vão me encher o saco no próximo Verda Kafo por causa das frases em inglês strewn in no meio dos textos", mas aí já era tarde.

Esse fenômeno de alternar entre línguas em um mesmo diálogo ou em uma mesma frase é denominado code-switching.* O artigo da Wikipédia menciona uma porção de explicações sociológicas de por que as pessoas code-switcham. No meu caso, entretanto, for most part, acho que nenhuma das explicações apresentadas se encaixa. Eu simplesmente acho mais fácil dizer algumas coisas em inglês, às vezes porque a sintaxe do termo em inglês é diferente, às vezes sem nenhum motivo aparente. A grande maioria das coisas que eu leio são em inglês, e eu passo boa parte do meu tempo lendo, então acho que não é de admirar. Em tempos de outrora, quando eu estava aprendendo esperanto e o usava com mais freqüência, era bastante comum eu achar mais fácil dizer algumas coisas em esperanto do que em português, primariamente graças ao sistema de composição e derivação supimpa (eu ainda uso "X-ilo" ocasionalmente, onde X é uma palavra em português ou em esperanto, para me referir ao "utensílio de fazer X"), mas às vezes também porque a minha cuca queria dizer alguma coisa com uma estrutura sintática e o português exige outra. Um exemplo "clássico" disso com o inglês são frases como she was named after a tree, que eu nem tenho certeza de como dizer em português ("ela foi nomeada segundo uma árvore" doesn't quite cut it (como se diz "doesn't quite cut it" em português?)).

Poder-se-ia alegar que isso representa a decadência do português e o efeito do imperialismo estadunidense. Eu não sei. Em primeiro lugar, a identidade da língua portuguesa está bem saudável, já que são poucos os falantes de português que fazem code-switching. Em segundo lugar, assim como eu acho mais fácil dizer certas coisas em inglês, há uma porção de outras coisas que eu acho mais fácil dizer em português. Acontece simplesmente que, nas situações em que meus interlocutores falam tanto português quanto inglês, a conversa se dá primariamente em português (evidentemente), então é raro eu ter a oportunidade de falar inglês com frases em português strewn in. (Cabe notar que eu só atravesso termos em inglês quando eu sei que o interlocutor os há de entender, já que, imaginem vocês, comunicação exige entendimento entre as partes. Porém, em alguns casos eu tenho que fazer um esforço extra para dizer certas coisas em português ao invés de falar da maneira que me é mais confortável.)

Por fim, enquanto eu estava ideando este post, eu pensei comigo mesmo: "Seriously, tu tá te justificando pela maneira como tu escreve no teu próprio blog? Que diabos é isso, um blog de gente se explicando?" Não era nem para eu ter que escrever isso (de fato, eu não tenho que escrever isso), mas enfim.

_____

* Segundo o artigo, o uso de múltiplas línguas na escrita é chamado de linguagem macarrônica, mas aparentemente o termo é mais usado para descrever certas formas literárias em que a mistura tem algum propósito especial, freqüentemente humorístico. No caso aqui do blog, entretanto, geralmente o meu uso de inglês atravessado no meio do texto simplesmente reflete a maneira como eu falo quando sei que o interlocutor entende ambas as línguas. No geral, eu tendo a fazer isso mais nos posts mais "pessoais" e menos nos posts mais informativos. Acho.

Elmord's Magic Valley

Computers, languages, and computer languages. Às vezes em Português, sometimes in English.

Posts com a tag: `ramble`

Some thoughts on Gemini and the modern Web

2022-02-20 19:22 +0000. Tags: comp, web, ramble, in-english

#255

2020-03-19 22:53 +0000. Tags: about, life, mind, ramble, in-english

Partial goals, and other rambles

2019-05-25 22:33 -0300. Tags: comp, prog, pldesign, fenius, ramble, in-english

On Scheme's minimalism

2017-09-14 19:34 -0300. Tags: comp, prog, lisp, scheme, pldesign, ramble, in-english

Lisp without cons cells

2016-05-28 13:14 -0300. Tags: comp, prog, pldesign, lisp, ramble, in-english

C is in an identity crisis, and some thoughts on undefined behavior

2016-05-19 23:11 -0300. Tags: comp, prog, c, pldesign, ramble, in-english

Random project ideas

2016-05-13 00:00 -0300. Tags: comp, prog, pldesign, ramble, in-english

A programming language

Optional types

Language interoperability

EOF

You're doing it completely wrong

2015-04-28 22:03 -0300. Tags: comp, prog, wrong, ramble, em-portugues

Mind dump

2015-04-10 01:34 -0300. Tags: comp, prog, pldesign, lash, life, mind, ramble, music, em-portugues

Yeah, I code-switch heavily

2014-11-02 02:31 -0200. Tags: lang, life, ramble, em-portugues

Main menu

Recent posts

Recent comments

Tags

Elsewhere

Quod vide

Computers, languages, and computer languages. Às vezes em Português, sometimes in English.

Posts com a tag: ramble

2022-02-20 19:22 +0000. Tags: comp, web, ramble, in-english

2020-03-19 22:53 +0000. Tags: about, life, mind, ramble, in-english

2019-05-25 22:33 -0300. Tags: comp, prog, pldesign, fenius, ramble, in-english

2017-09-14 19:34 -0300. Tags: comp, prog, lisp, scheme, pldesign, ramble, in-english

2016-05-28 13:14 -0300. Tags: comp, prog, pldesign, lisp, ramble, in-english

2016-05-19 23:11 -0300. Tags: comp, prog, c, pldesign, ramble, in-english

2016-05-13 00:00 -0300. Tags: comp, prog, pldesign, ramble, in-english

A programming language

Optional types

Language interoperability

EOF

2015-04-28 22:03 -0300. Tags: comp, prog, wrong, ramble, em-portugues

2015-04-10 01:34 -0300. Tags: comp, prog, pldesign, lash, life, mind, ramble, music, em-portugues

2014-11-02 02:31 -0200. Tags: lang, life, ramble, em-portugues

Main menu

Recent posts

Recent comments

Tags

Elsewhere

Quod vide

Posts com a tag: `ramble`