I have spent the entire weekend hacking on Gazelle and I have to say I’m somewhat disappointed in my results.
I was looking quite forward to Gazelle-hacking this weekend. There is so much I want to do — so many things completely designed in my head that only need to be typed out. Specifically:
- fixing all the open issues on Google code (thanks to everyone who has reported issues!)
- making Gazelle self-hosting.
- operator-precedence parsing.
- integration with protobufs — this is how ASTs are going to be supported (it’s a beautiful fit!)
- work on my in-progress protobuf implementation
I’ve been itching to work on all these things, and had the free time this weekend to do it.
So I sat down Friday night and looked at the first Google Code issue I intended to fix: Regex declaration are not taken into account if declared after they are used. When I thought about the right way to solve this problem, I realized that I needed to bind symbols lazily. If I’m parsing a grammar file and I see:
a -> b;
If “b” hasn’t been defined yet, I don’t know if it’s going to be a rule or a named regex. So I need to put a placeholder in “a” definition that I can bind later, once I know what “b” is.
And when I thought about ways to do that, an intuitive feeling came over me that I’ve reached a point with the Gazelle compiler that I need to start imposing a bit more structure and discipline. Don’t get me wrong, I’m proud of the compiler’s code in many ways. It’s got lots of useful comments and a pretty good design. But when you’re using a dynamic language like Lua that gives you functions but not classes, you find yourself writing things like this:
if fa.is_nonterm(self._transitions[1][1])
— …
end
What is this table nested two deep? What does indexing into [1][1] mean? You just tend to get these tables where the knowledge about what the structure means is all implicit. Here’s the worst example I found (this one I’m *not* proud of). This expression occurs in the middle of a string concatenation:
gla_state.rtn_paths:to_array()[1].prediction[2].rtn.name
What are all of those intermediate data structures about? Well, if you search in the source file for “prediction” you find the answer:
obj.prediction = {predicted_edge, predicted_dest_state}
Ok, so this “prediction” thing is just a sort of ad-hoc tuple of (edge, dest_state), so the prediction[2] index above was pulling out the predicted dest state. But that wasn’t at all obvious from the call site.
Another thing that tends to happen with a language where adding properties is easy (like it is in Lua) is that you start attaching properties to objects in random places. Especially with the Gazelle compiler where there are all these graph data structures floating around that are related to each other in various ways, you’ll be writing some code and find that you just need a reference to this one other related object. Like you’ll have a handle to a state in a graph, and you want a reference to the graph itself. So you’ll find someplace else in the program where the state had a reference to that graph, and just slip in a quick little:
state.graph = graph
…and presto, your other code can now just read state.graph and you’ve got what you need. Everybody’s happy!
Except that having too much of this stuff accumulate can make your program hard to reason about. What are all the properties that these state objects ever have? Where are these properties added? When can you count on them being set? Is this the right list of properties for this object? Are the properties all orthogonal, or are some of them redundant? These questions are very difficult to answer when you have random code setting and reading properties in a dynamically-typed language like Lua.
For this reason I’ve got to respectfully disagree with Steve Yegge about the attractiveness of the properties pattern. The biggest bummer about it is that there’s no place that you write down the list of properties for each type and what they mean. Without that, it’s hard for other people to get the high-level picture about your program’s design, and it’s hard for you as the program’s author to think about the design of the program and how it should be refactored.
So I had an intuitive feeling that now was the time to introduce a little bit more structure. And my idea was to implement a very small subset of Ruby’s object model in Lua. Lua is a lovely language in that it gives you primitives that give you the tools you need to do a pretty convincing job of implementing whatever object-oriented model you like. And I’m a big fan of Ruby’s object model.
So I sat down on Friday night, thinking this would take me a few hours of hacking, after which I could move on to the work I really wanted to do (the list I mentioned above).
I’m a chronic underestimator, but this instance was pretty extreme. All in all, I spent about 20 hours on this problem this weekend. And while I finished what I set out to do, the performance is so disappointing that I’m not sure I’ll stick with it.
My primary goals were:
- encapsulation. I want to know that an object’s instance variables are only modified in methods of that object. as a program grows, you’ve got to have lines you can draw around chunks of code and say “I can reason about this code in isolation without worrying that its state can be mutated in unexpected ways.”
- inheritance. I was using the OO scheme that’s explained in the Lua book, but it’s really bizarre in some ways. for example, to derive from a class you create an instance of it, which becomes the subclass (!?), and then you add methods to it and use it to construct other instances.
- a well-defined set of properties for each object. I want this list somewhere so I can docment it and reason about it, and I want to prevent typos
With Ruby as my inspiration, I took a first go at it on Friday night until about 3AM. My first iteration supported properties even: if you said foo.bar = 5, that would turn into a foo.set_bar(5) call. I got it working, but it was super complex. I really wanted to keep the complexity down.
I went through about two more iterations before arriving at this:
object.lua (read the comments at the top of the file for more explanation and usage).
Rather happy with the interface at this point, I started porting a lot of the compiler’s classes to this new object system. This was an extremely long and painful process, and the final result was this diff:
It would have all been worth it except for the horrible moment when I ran my unit test suite and saw this:
$ time make test
[...]
real 0m2.361s
user 0m1.889s
sys 0m0.070s
Compare this with almost the same test suite directly before the port:
$ time make test
[...]
real 0m0.633s
user 0m0.500s
sys 0m0.043s
A 4x slowdown is just more than I can swallow for this.
So I’m very sad to say that after an intense weekend of hacking, I’m pretty much back where I started, except for the experience of having written a small subset of Ruby’s object model in Lua. It’s very nice to use, but it’s just too slow. And a time profile doesn’t have too much low-hanging fruit.
I still need a way to manage complexity as the compiler grows. I think my next iteration may be to just create a list of properties for each object and prevent assignment to any property that’s not on the list. I won’t get encapsulation with this scheme, but at least I’ll get a list of properties for each object, which will give me more a way to reason about the design of my objects. For example, just looking at the latest version of compiler/fa.lua is illuminating, because all of the objects have attr_accessor/attr_reader/attr_writer for all the properties that they are ever assigned. Before this change, I would have had no easy way of telling you this information.
Still, the weekend has been a disappointment. I don’t so much enjoy any of this object system hacking for its own sake, especially when I have so much truly useful and ground-breaking work just on the tip of my tongue! I want to find a solution that I’m satisfied with, so I can get it over with and get back to real work.
I have to say, sometimes I really think that what I want is a lightweight, embeddable strongly typed language. Sure, it’s nice when you’re writing little programs not to have to specify the structure of your program with lots of classes, type, and prototype declarations. But when your program grows to a certain size, the ability to impose structure can be extremely freeing, because explicit structure implies predictability. Predictability that when you get called, you’ll get called with the right number of arguments of the right types. Predictability that the contents of your objects are what is specified in your class definition, no more and no less. Predictability that when you’re implementing an interface, you implemented all the methods that were pure virtual in the base class (because if you hadn’t the compiler would have yelled at you).
I’m using Lua because it’s a language that has a very small and efficient implementation, and also has implementations for other runtimes like the JVM. If there was a statically-typed language with these same characteristics, I would be tempted to use that instead. It could probably be more efficient too, because statically-typed languages can do offset-based member lookup instead of having everything be hash table lookups.
But like I said, I’d much rather stop talking about all of this and get back to parsing. But I want to find a strategy for managing the growing complexity of Gazelle’s compiler.