Writing a Simple Programming Language from Scratch - Part 2
Source code found here, for much easier running and compiling.
Up to this point, we made part of a compiler front end, namely the lexer and parser. Now, we’ll adapt the parser and add some steps to make.. more code! Also known as an intermediate representation of the source code. More on that later.
To do this, we’ll use a visitor pattern.
The visitor software pattern is very useful, especially in compiler design. Visitors can essentially implement new methods on each node while traversing a tree. Or a not tree, but we’ll use a tree.
Let’s make a basic visitor!
Then here’s the output when this program is called:
What’s happening here is pretty simple. The node “accepts” the visitor, which prompts the visitor to visit that particular node.
accept will call
IntegerNode’s accept because of the
virtual method. Then, the visitor can do whatever it wants with that object, knowing its type exactly. If we had a
PlusNode that accepted a visitor, then we would call that accept method, thus we would visit the
I won’t bore by going through all of the nodes, it’s mostly just busy work to set it all up. There are ways to write those classes without all of the busy work (tools, metaprogramming). Useful, but writing a simple set of nodes isn’t too much work. Sometimes people go through a lot of effort to automate a process that would take less time to just do manually.
So now we can add any number of nodes that we want. I’ll show how we could do some recursion in this using the plus node, then hopefully extrapolating out will be a bit more simple.
This will recursively use your new
PlusNode and output the added numbers! Here’s the output:
The only nodes we’ll add that aren’t obvious are
StatementNode. A “program” will be a collection of statements for our simple language, and a “statement” is a bunch of math followed by a semicolon. Really a statement can be seen as a command, like
So how will we use this? The program can actually be represented easily as a tree. This is often represented as an abstract syntax tree (AST). We will then visit each node in the AST, and the visitor will generate some code. This generated code is an intermediate representation of the code - we don’t write this code by hand, but the computer also doesn’t use it. It’s mainly used by the language to facilitate optimizations and compiling for different computers with different architectures.
We can also have other kinds of visitors! One class
Visitor could be inherited by your type checker, or what makes your symbol table, then each of these visitors can be run. This example will only use one visitor - the one that generates code.
Lexer and Parser Updates
Now let’s update our parser from part 1 to return nodes instead of doing the calculations itself. We’ll also rename the file the
parser.ypp to generate C++. Here is what we’ll use for our C declarations:
There are some basic changes here, but two main ones for understanding how this works: first, we import all of the nodes we’ll use inside the parser. Second, we declare a variable
*program to point to the initial node. We’ll build everything off of this, then use it in a different file.
Here are the Bison declarations that we’ll use. We changed the union to include nodes that we want to be exchanged between grammar rules, and put the return values in their type definitions.
This is similar to what we did in part 1 - just including the nodes!
Finally, here are the grammar rules:
Each rule returns a new node that we can access in another program as an object - after we finish parsing, we’ll only touch these nodes, we don’t care about the source code. So, when we parse a division, we’ll create a
DivNode to represent that division, then don’t touch that source code again.
Now we also add a
yylineno variable to keep track of line numbers. Most programming languages have some way of managing where in the source file certain nodes came from in order to emit diagnostics, telling the programmer what they did wrong. We’re not going to do much with that, but if you ever want good diagnostics, consider including some source information.
program has a bit of interesting logic. If we get to the first, empty case for the
program nonterminal, we make a new
ProgramNode to use. If we get to the point where we have a statement, we’ll also have a program, so we add the statement (where a
program is just a
std::vector of statements in implementation). We also assert if we don’t have a
program - some kind of
null check is SO important in C++ code, preferably more than an
assert since it doesn’t apply in production! Never underestimate the power of humans to segfault a C/C++ program.
Our lexer didn’t change much, just adding a couple options, but I’ll put it here anyway:
Now that we have that, let’s make a file that runs this! This was moved from the bottom of the parser from part 1.
All this does is it calls some method
yyparse() which we get as an extern function from the Bison files we generated. That populates our
program variable. From this, we can visit the program we parsed!
Let’s run it on a program with one line -
3 + 3.2;
This is good! We visited the program, then the statement, then the
3+3.2 in our code. This proves we have a parser which creates the AST, then a visitor which goes through and visits the nodes.
Now, we have an accurate AST for our program. We want to use this AST in order to write code that our computer can run. Normally, we’ll do this with a true intermediate representation (like LLVM). We won’t be doing that quite yet. Instead, we’ll go with something easier to understand, that way the concept of code generation and visitors aren’t blurred by the complexity of another foreign language. So, instead, we’ll just turn our code into C++. It pains me to do this, but it’s a good tool to use since we know what proper C++ code looks like.
Another consequence of writing C++ code is we’re going to make it human readable. This won’t be the case for most languages - once it gets to this point, the machine tends to be one of the only things actually reading the code, so it won’t care much for readability. We’re humans and C++ is a human programming language, so it’s relatively easy to understand! LLVM is somewhat readable, but it’s not super easy.
How will we do this? We’ll make a new visitor that generates
.cpp files! Here’s the implementation file for that (let’s ignore the header files):
Basically, if you understood what was going on before with us printing out each statement, you’ll understand this. Instead of using
std::cout directly, we make a C++ program that calls them. Each statement (which, in our language, is just some math followed by a semicolon) will go into an
std::cout call in another file - very, very useful! Okay maybe not but you can see the potential.
What’s amazing about this isn’t that we wrote a program that (kind of) writes C++ programs, it’s that you can write your visitor to generate many different languages - it’s not too much harder to just make it generate assembly! It’s really easy to compile/run C++ code to see the fruits of your labor, so that’s what we use here.
Let’s run on this file:
One last run!
From that, we get the following file created:
And if we run that…
Exactly as we’d expect. Good job, folks!
So we created C++ code. Who cares? Why didn’t we just write the C++ program? Well, the intermediate representation won’t really be so high level - it’d be more like assembly in the end. We don’t want to constantly write assembly for obvious reasons, so we write programs that transform our code into that - the whole purpose of a compiler. While here we made the intermediate representation a C++ file, that will never happen in a real language, but it’s not too far off from it. That’s why there’s a bit more to the puzzle we have yet to cover.
With that being said, that’s everything! You now should at least have a basis for doing some operations on your source code, eventually creating an intermediate representation that can be used. Congrats!