Ever wanted to make your own programming language? Ever wondered how? Well, you've come to the right place!
In this tutorial, we'll be implementing our very own language, KITTEH. (For those unacquainted, "kitteh" is lolspeak for "kitten". I'm calling it this because the language will be small and portable.) Here's a taste of it:
VAR x // starts off at zero
x = (x + 1) + -(-9)
WHILE x DO
x = x - 1
PRINT x
END
Well ok, I lied about a taste of it. That's the entire language right there. I know, it's tiny, but you can see the makings of most modern programming languages in it. (And I guarantee it will be plenty enough to implement in a tutorial.)
How the things do the stuff
So just how does a language work? Well, it's sorta like an assembly line. Each step takes something before it and converts it into a more abstracted form. This starts with the raw input (the code) and ends with the output (the execution). But there's a lot of steps in between.
Important note: A language is actually just a computer program! That assembly line I talked about is just a computer program that you write. Yes, so that means when you write your coding language you need to write it in another coding language. "But how was the first one invented?" you ask. Well, it's a "turtles all the way down" situation ending with machine code, which gets directly run by the computer hardware.
Aside: compiled vs interpreted
This is probably something you've heard of before — "x is a compiled language" or "y is an interpreted language". But what does it mean?
Well, an interpreted language gets run on the spot. When you run that "assembly line" computer program, it takes in your source code and spits out the output (the print statements and the like). But a compiled language is actually just a translator. A compiled language translates your language into another lower-level language. For instance, the C compiler translates C into assembly code.
In this tutorial, we'll be making an interpreted language. That means our assembly line will look something like this:
Input the source code, presumably by reading a file
Turn the source code into a list of tokens, which are like all the meaningful pieces of a program. This is called lexing. For instance, in the program x = (x + 1) // a comment, our tokens would be x, =, (, x, +, 1, ). No, I didn't forget the comment — remember I said meaningful pieces of the program, and the meaning of a program isn't changed by a comment.
Turn the list of tokens into an abstract syntax tree, or AST. This is called parsing. For our example program (x = (x + 1)), the AST would be something like:
=
/ \
ID:x +
/ \
ID:x NUM:1
Notice how this reflects the meaning of the program — we're assigning x to (the result of adding x and 1). Also notice how the parentheses were lost — parentheses don't have any meaning by themselves, they just group things together differently.
Turn the AST into bytecode, which is like a really compact form for expressing a program. This can be called generation, but the terminology here is a bit more vague. The bytecode for the above AST would be something like:
LOAD x
PUSH 1
ADD
SET x
This bytecode is stack-based — there is an implicit stack which gets pushed and popped to. If you've heard of this that's great, if not don't worry since we'll talk more about what that means later.
Run the bytecode. This is the execution. Let's assume that x has the value 2 preloaded for example sake. For the above bytecode our "execution trace" would be:
This should give you a better idea of how the stack works. It's like a list of items, and some commands add to the list ("push") and others remove from the list ("pop").
So that's the "language pipeline". Without further ado, let's get started!
Further ado
Ok I lied, there's further ado. We need to talk about the language we'll be implementing KITTEH in — C. Don't worry if you're not good with C, as long as you have a basic understanding of it you'll be good to go. I'll try to walk through step-by-step exactly what each bit of code does and how it works. However, this tutorial does assume you have rudimentary C knowledge (what is a function, what is #include, how do types work, etc). It also assumes you have intermediate to advanced knowledge of programming in general.
So let's start setting things up. To begin, make yourself a new C repl. Let's remove the default program add in our starter framework:
Pardon my poor pointer etiquette — it's a habit I'm probably unable to change. The implementation of the read_file function is irrelevant to this tutorial; if you're unfamiliar with C all you need to know is that when given a filename it will read it and return a string (char* — pointer to the first character in the string) of its contents.
I'm just making the filename a constant for brevity of this tutorial, obviously in a real language you'd want to take command-line input to get the filename. Speaking of which, go ahead and make a file "source.kitteh". Leave it empty for now, we'll make it our "running test file" to ensure everything we add works.
Getting started
Okay, now we can get started. Let's tackle each step one at a time. To begin, we need to turn the source code into a list of tokens. But actually, there's a better way. What if we just read the tokens "on-the-fly", i.e. read each token only when we need it? Basically, instead of turning the tokens into a big list, we're doing it generator-style by only reading each token when requested. This has a few big benefits:
We don't have to deal with allocating a list of unknown size
We don't have to use tons of memory by allocating this big list
So all we need to write is some function, "scan()", which returns the next token each time it's called. Can't be too hard, right?
Let's make a file named lexer.h. In the file, let's set up our include guards (an include guard prevents a file from getting included multiple times, which causes name conflicts since it defines the same thing twice):
Here we declare that our function scan will take no arguments and produce a struct Token. Hey, what's Token? Well, Token will be our type to represent a token. Speaking of which, let's define that now (put this above the declaration of scan():
Here we're defining Token to be a struct with a type field to record what kind of token it is, a value field to represent the actual text content of the token, and a line field to show which line the token is on. (That's used for error messages, so it's important to have.)
But hey, our type enum is empty! Let's fix that — change the enum to be this:
You can probably guess which tokens each type corresponds to — in case not, here's the rundown:
T_VAR is for the keyword VAR
T_WHILE is for the keyword WHILE
T_DO is for the keyword DO
T_END is for the keyword END
T_PRINT is for the keyword PRINT
T_ID is for a name like x or twenty_three
T_NUM is for numbers like 2 or 123
T_LPAR is for the left parenthesis character (
T_RPAR is for the right parenthesis character )
T_PLUS is for the plus character +
T_MINUS is for the minus character -
T_EQ is for the equals character =
T_SEP is for the newline character. We'll force a newline after each statement (so VAR x VAR y is invalid), so this is important.
T_EOF is for the end of file. When there are no more tokens left to read we'll just repeatedly emit this token. Hopefully the parser will get the message :D
The list is pretty limited since the language is limited — but a more complex language will obviously have a lot more token variety.
Ok great, we now have our token struct! Let's start implementing the scan function. Create a new file, lexer.c, and initialize it with the following:
#include "lexer.h"
struct Token scan(){
// todo
}
We're including our setup (lexer.h header) file and then making the framework of our scan function. And boom, problem. How do we know where to be scanning from? How do we record our place in the file? We need to create a scanner struct to record this important data. Let's do that right now, beneath the #include statement:
struct {
char* position;
int line;
} scanner;
Here we define our current position in the file as a char*. For those unfamiliar with C, a * after a type means a "pointer". A pointer does just what you expect — it points to a position in memory. When we start lexing, we'll make that pointer point to the start of the file. As we scan tokens, we'll move the pointer farther and farther through the file until we get to the end of the file. Then we'll just emit T_EOF's.
Ok, let's make a helper function to initialize our file. In lexer.h, add the following after our declaration of scan:
void initializeScanner(char*);
In our main function, we'll call this function on the file we read to set up our scanner. But let's define it first, at the bottom of our lexer.c file (after our framework for scan()):
We start by handling the one character tokens. These are easy, since we just match the current character in our file (*scanner.position). For those unfamiliar, the * before a pointer "de-references" it, ie gets the value that it's pointing to. The (struct Token){ /* .fields = values */ } syntax is simply a shorthand which allows us to create a Token with the listed fields at the listed values.
I didn't forget a scanner.current ++ after the EOF token. I purposely did not include that in order to make it repeatedly output EOF tokens once it gets to the end of the file.
We merge duplicate newlines in the \n handler by just repeatedly consuming consecutive newlines.
We now handle the multi-character tokens. We first check if the current character is an alphabetic character or _. If it is, we're dealing with a name token, so we just keep advancing the current character pointer until we get to a char that isn't alphabetic or _. At that point, we allocate enough memory to create a string out of the token and copy the data from the file into the token's value. Here, we also check if the token is one of the special keywords, in which case we give it a different type.
If it wasn't, we check if it's a digit. The process here is much the same, just simpler. (We won't deal with floats for the purpose of brevity.)
If it wasn't a digit either, we throw an error since we know the user inputted something incorrectly.
Take a minute to digest that. It's not super complicated once you realize how it works, so it's important that you get a feel for how it's working.
Now, there's one problem with this system. It doesn't handle whitespace (outside of newlines)! We need to skip whitespace and comments... so let's do that before we try to scan. At the beginning of the scan function (before switch(*scanner.position){...), we can fix that:
struct Token scan(){ // this is just here for reference, don't copy it
for(;;) {
switch(*scanner.position) {
case ' ': case '\t': case '\r':
break;
case '/':
if(*(scanner.position + 1) == '/') {
while(*scanner.position != '\n') {
scanner.position ++;
}
break;
}
default:
goto done_checking_ws;
}
}
done_checking_ws:
switch(*scanner.position) { // once again just for reference
Now yes, I know I've just incurred death by raptor, but I think this is one of the few good uses of goto. The idea here is simple; loop forever, checking the current character:
If it's , \t, or \r, skip it and keep going.
If it's /and the next one is /, skip those until you reach a newline (skipping a single line comment).
If it's none of those, break out of the loop.
You'll notice that I don't have a break after the / case, only inside the if — that's because I want it to fall through to the default (break out of the loop) if the second character isn't /.
Testing the lexer
Now the moment of truth — testing. We need to make another function to print out a token, so let's do that first. In lexer.h, we add its declaration after initializeScanner:
void printToken(struct Token);
And now, at the bottom of lexer.c we implement it:
We're using the fact that enums are just sequential numbers given nice names, so we can use the token's type (an enum value) as an index into the list of the string names of each enum value. (Read that sentence again, might make more sense the second time around.) At that point, it's just good old printf.
Now in main.c, right above our // todo — everything comment let's add some tests:
when you run it! Mess around a bit, make sure it's working, and then we can proceed!
Parsing — Giving Meaning to Tokens
Remove the tests in main.c (the three lines of printToken(scan())), it's time to implement parsing. If you remember our assembly line, this is when we take tokens and turn them into an AST.
So now is probably a good time to implement an AST. Let's get started!
Time to set up a new file — ast.h. Make it, and we'll put in some include guards:
#ifndef ast_h
#define ast_h
#endif
Okay, time to make an AST structure. Remember that an AST is basically just a tree that represents the program's semantics. That means we need two kinds of tree nodes: statement nodes, for things like VAR x and WHILE x DO ... END, and expression nodes for things like x, 1, or x = 1.
Aside: Wait, why is x = 1 an expression?
It's often nice to have multiple assignments at once — i.e. x = y = 1. This should be parsed as x = (y = 1), which means that y = 1 needs to be an expression (which returns 1).
Additionally, sometimes you want to put assignments inside a conditional — ie, WHILE x = getchar() DO ... END. Obviously KITTEH doesn't have functions, but your language probably will.
Inside the include guards let's write our statement node definitions:
Let's go through this step-by-step. We're defining a StmtNode (statement node) to have two fields — an enum which keeps track of the kind of node it is, and a union which keeps track of the node's data. The reason we have a union is that each type of node needs to contain different data, so we have to have multiple kinds of structures for our multiple kinds of nodes.
The structures themselves are as follows:
The structure for a var declaration (VAR x) simply keeps track of the variable's name.
The structure for a while statement (WHILE expr DO stmts END) keeps track of the expression (expr) and the list of statements (stmts), and how many statements there are.
The expression simply has one star — it's a single pointer to an expression. But the statements has two stars; why? Well, we need a pointer to a pointer to a statement node. Basically, we're going to have a list of pointers to statement nodes. We'll keep a pointer to the first pointer in the list and how many pointers there are.
The structure for a print statement (PRINT 1, PRINT 1 + x - 25) simply keeps track of the expression on the right hand side.
The structure for an expression statement (1, x, or x = 1) is just an expression. An expression statement might seem like an oxymoron, but it's important to have since we made x = 1 an expression (as most languages do) and function calls would be expressions too (if KITTEH had them).
Now that we've made our statement node structure, we need to make our expression node structure. Above the statement node definitions (right after the #define ast_h), we'll write our expression definitions:
This is much the same as before — we're just defining our expression types this time. We have a number literal type (1234), a variable name type (foo), a binary operator type (foo - 1234), and a unary operator type (-foo). Note that we'll only implement +, - and = (assignment) operators for the sake of brevity.
Finally, we need to create a special helper struct, Program, which will keep track of our program. A program in KITTEH is just a list of statements, so Program just stores a list of statements and its length (refer to the while_stmt struct a while ago if you're unsure of what's going on with the double pointer). Let's add that now, under our ExprNode definition:
struct Program {
struct StmtNode** statements;
int stmtnum;
};
Let's get parsing already
Okay, it's time to actually do the aforementioned parsing. Time for — you guessed it — another file! (I know, it's monotonous, but separating your project into modular file "units" is really helpful.) Let's set up parser.h:
#ifndef parser_h
#define parser_h
#include "ast.h"
struct Program parse(void);
#endif
We're including the lexer since we'll need its scan function.
Now, there are a plethora of ways to go about parsing. One way is to use something called a Pratt parser, which is a really interesting method but it's beyond the scope of this tutorial. Parser combinators are another way, but they too are beyond this tutorial's reach. (If you're interested, feel free to research those a bit.) So, we'll just be implementing a simple recursive descent parser.
The basic idea of a recursive descent parser is that you have a function for each type of thing you want to parse. So we'll have a function for parsing a program (list of statements), a function for parsing a single statement, a function for parsing a binary operator, and a function for parsing a "simple expression" (number literal or variable name). Then, each of those functions recursively calls other functions — so our statement parser might call the binary operator parser, which will call the simple expression, etc.
You can implement precedence by having different function "levels" — for instance, you could have a function which only parses the binary operators + and -, and then that function calls a function which only parses * and /.
Now that was all pretty theoretical, so it might be confusing at first. Let's start implementing it, I think it will make better sense that way.
I said, let's get parsing already!
Okay, okay! Remember how we implemented our lexer as "lex-as-you-go"? Well, there is one downside to that — how do we "peek" ahead to check what the next token is? We can't call scan() since that actually consumes the token... so how do we get around this?
What we can do is store two tokens in a "parser" struct — the next token and the current one. That way, we can peek ahead by just looking at the next token rather than consuming it.
So after our #include's in parser.c, let's create this parsing struct:
The advance function will move the parser one step forward in our token stream. We can then get the current token with parser.current or peek ahead to the next one with next.
The consume function is a little more complex. Essentially, it checks if the next token is of the given type. If not, it throws an error in the form of "[line <the line number>] Error at <the token>: <the error message>". The switching logic is to make newlines or EOF tokens look nicer.
Armed with our new functions, we can finally get parsing! Isn't this thrilling? ;)
Let's start by implementing our parse function (replace the dummy definition):
The first thing we do is advance(). This is to "prime" our parser struct — it starts out empty, so by advancing we put something in the next token slot.
Then we set up our program struct. We're allocating a statements array with enough space to hold 1 StmtNode*, and we record this in initcap.
Recall that a program is just a list of statements. So, we parse a statement (we have yet to define parse_statement but we'll do that next) and put it in our statements array. We increment our count of how many statements we have; if this is greater than or equal to our array capacity, we grow the array. We keep going until we see an EOF, at which point we shrink the array down to the bare minimum size and return the program struct.
Okay, now let's implement the aforementioned parse_statement() function (put this right before the parse() function definition):
We're really getting into it now, so let's take this slowly.
First, we check if the next token is VAR. If it is, we know we're parsing a variable declaration. We skip past the VAR token and then expect an identifier (a name) and then a newline. Finally, we allocate a StmtNode and fill in its values (we're using *varstmt = { ... } because varstmt is a pointer, so we need to dereference it). We return our newly allocated node.
If it wasn't, we check if the next token is PRINT. The procedure here is much the same as before, except instead of accepting only an identifier we accept any kind of expression (since you should be able to print anything).
If it wasn't that as well, we check if the next token was WHILE. If so, we skip past the WHILE token and parse an expression as the while loop's condition. Then we expect a DO token to begin the statement block. The newline after the DO is optional — that way WHILE 1 DO PRINT 42 END is valid. Now we set up an array for our statements. You'll notice the code here is much the same as it was in the parse() function; that's because it's doing the same thing just with different variable names. So, we parse an array of statements up until the END token, and afterwards we expect a newline. Finally we fill in all the info into the AST node.
Finally, if it was none of those then we know it must be an expression statement. So, we simply parse an expression and plop it into an expression statement AST node.
That was a pretty big leap for parsing. Now all we need to do is parse expressions.
We can think of our expression parser as an onion with a bunch of layers. The innermost "core" is atomic expressions, like foo or 123 or (1 + 2 + 3). If you're wondering about the last one, that's considered an atomic expression because the parentheses make it one thing. All of them are one unit. The next layer is unary operators — we only have one, -, so that's things like -1 or -(2 + 3). Notice how the unary operator expression contains an atomic expression in it (the thing on the right), thus my onion analogy. Outside of that layer we have binary operators, like 1 + 2 or -foo + (1 + -2). The second example is important — notice how there's only one binary operator at the "top" level; -foo and (1 + -2) are both "lower-level" expressions. The final and outermost layer is the binary operator =, for x = 1 or x = 1 + -2. We put this operator by itself outside of the others because we want x = 1 + 2 to parse as x = (1 + 2) not (x = 1) + 2. (Think of it like this: 1 + 2 should have a higher level of "grouping" than x = 1 should.)
Aside: Other operators
If we had * and /, they would be in between the layer for +/- and unary operators. Why? Well, 1 + 2 * 3 should be parsed as 1 + (2 * 3), and that will only happen if * is a lower-level expression than +. In general, operators with higher precedence should be lower-level (in our onion, closer to the core). Operators with lower precedence should be higher-level (in our onion, closer to the outer layer). This fits with where we put =.
The nice part about this Matryoshka doll approach is that it's modeled very nicely by functions. We can have a function for our binary operators (our outer layer) which can then call our function for unary operators (one layer lower) which can call our function for atomic expressions (one layer even lower). So let's get implementing!
Our outermost layer is =, so we'll implement that first. Above our definition for parse_statement, we'll define parse_expression:
Here we parse a lower-level binop expression (a + or - one) and then check if the next token is an equals sign. If it is, we know we're parsing an assignment, so we set up an AST node and parse another expression as the right-hand-side. We're calling the function recursively since x = y = z is valid, and it should parse to x = (y = z).
Let's move on to + and -. Above parse_expression, we'll define parse_binop_expression:
Here we parse a unary op expression (which is one level lower down) as our left hand side and call it expr. If there is in fact a binary operator (+ or -), then we store which kind it is and skip past it. Then we parse the right-hand-side, and update expr to be the binary operator node we just parsed. We keep doing this for as long as there's +'s or -'s, so that way 1 + 1 + 1 + 1 will parse as ((1 + 1) + 1) + 1 (which is correct).
Notice how we're using a while loop here rather than recursion as with =. This is because = is right-associative (x = y = z is x = (y = z)), while +/- is left-associative (1 - 1 - 1 is (1 - 1) - 1). The while loop mechanism represents left association best while the recursion mechanism represents right association best.
Time to implement parsing unary operators. Above our parse_binop_expression definition let's define parse_unary_op_expression:
Here, we first set up a dummy node since the unary operator comes before the actual expression. We then build our node by repeatedly checking if the next token is - and working recursively outwards. (Since there is only one type of unary operator, -, we just automatically set the type to that. If you add more, you'll want to check which type it is just like we did for binary operators.)
After there are no more -'s left, we do some logic to replace our dummy NULL node with the actual expression that comes next. Then we return our built node.
There's just one parsing task left — parsing atomic expressions, the "core" we talked about before. Let's do that right above parse_unary_op_expression:
We check which kind of token we're encountering. If it's a variable name, we construct an AST node accordingly. The process for a number literal is much the same, but note that atol converts a string to a long int (which is what we defined our number type to be). Finally, we have the logic for parentheses. Notice they don't actually make an AST node — they exist solely for parsing (all they do is grouping).
If it was none of those options, we raise an error. I'm adding the exit(1) so that clang doesn't complain about control reaching the end of a non-void function — however, as our consume function will throw an error regardless since the type is -1 (outside the bounds of our enum) we know this isn't necessary.
Speaking of parentheses, notice we used parse_expression? That's a recursive dependency, so right above our definition of parse_atomic_expression we need to add a forward declaration:
struct ExprNode* parse_expression(void);
Hooray! We've just finished our parser. That was quite a bit, so take a second to relish your achievement.
Let's test it out now. Back in main.c, let's update our main function (replace the old one):
We're ignoring the return value (which is the generated AST) for now, we'll use it in a bit. First we need to include our parser file — at the top let's add that to our include list (other includes shown for reference):
Okay, let's fire it up! Go ahead and run the program — if all goes well, the program should compile fine and it should produce nothing. This is because source.kitteh is empty! Let's fix that, and add the following:
VAR x // comment
WHILE x DO
PRINT 1
x - 2
END
x = (1 + 2) + -3
This should test the entire grammar. Copy and paste it into source.kitteh and hopefully the parser will be silent (not throw any parsing errors)!
IMPORTANT: Our simple parser requires a blank line after the last statement (since every statement needs a newline to end it). This is something you'd want to fix in your own language, but for brevity and simplicity I've chosen not to implement that. Therefore, make sure to add (at least one) blank newline at the end of your source.kitteh file from now on! (If you forget to, our handy parser error system will remind you.)
Now let's try deliberately introducing an error. Change the program to the following:
VAR x // comment
WHILE x DO
PRINT 1 1
x - 2
END
x = (1 + 2) + -3
Notice the duplicate 1 after the print statement. Let's see what happens:
[line 4] Error at "1": Expect newline after statement!
Great, our error system works! It knows that the 1 shouldn't belong there, and it knows which line it's on. Change the program back to the valid version (remove the 1) and let's see what AST our parser is building.
To do this, we'll need a way of pretty-printing an AST structure. Where better to implement this than a new file, ast.c:
#include <stdio.h>
#include "ast.h"
void print_expr_node(struct ExprNode node) {
switch(node.type) {
case NUM_LITERAL:
printf("%ld", node.as.num_literal.num);
break;
case VAR_NAME:
printf("ID ( %s )", node.as.var_name.name);
break;
case BINARY_OP:
print_expr_node(*node.as.binary_op.left);
printf(" %c ", node.as.binary_op.binop_type);
print_expr_node(*node.as.binary_op.right);
break;
case UNARY_OP:
printf("%c", node.as.unary_op.unop_type);
print_expr_node(*node.as.unary_op.left);
break;
}
}
void print_stmt_node(struct StmtNode node) {
switch(node.type) {
case VAR_DECL:
printf("VAR DECLARATION ( name = %s )\n", node.as.var_decl.name);
break;
case WHILE_STMT:
printf("WHILE LOOP ( condition = ");
print_expr_node(*node.as.while_stmt.expression);
printf(" ) DO\n");
for(int i = 0; i < node.as.while_stmt.num; i ++) {
print_stmt_node(*node.as.while_stmt.statements[i]);
}
printf("END\n");
break;
case PRINT_STMT:
printf("PRINT STATEMENT ( expr = ");
print_expr_node(*node.as.print_stmt.expression);
printf(" )\n");
break;
case EXPR_STMT:
print_expr_node(*node.as.expr_stmt.expression);
putchar('\n');
break;
}
}
void print_program(struct Program program) {
for(int i = 0; i < program.stmtnum; i ++) {
print_stmt_node(*program.statements[i]);
}
}
There's nothing really magical going on here, we're just handling each type of node and each subtype. In ast.h, after our Program struct, let's make these functions public:
Okay, the time has come — run the program! If everything goes well, you should get this:
VAR DECLARATION ( name = x )
WHILE LOOP ( condition = ID ( x ) ) DO
PRINT STATEMENT ( expr = 1 )
ID ( x ) - 2
END
ID ( x ) = 1 + 2 + -3
Looking through that, it looks like it parsed the program correctly! The first statement is declaring a variable, x. The second statement is a while loop, with the condition being x, and the inner statements are a print statement (with 1 being the thing printed) and an expression statement (x - 2). Finally, after the loop we have an assignment, x being assigned to 1 + 2 + -3.
So pat yourself on the back, since you've just created a fully working parser!
That concludes part one of this tutorial. Stay tuned for part 2, where we'll be finishing our language by implementing bytecode generation and a way to run the bytecode!
All code can be found in the attached repl. If you have any questions, feel free to ask!
Ever wanted to make your own programming language? Ever wondered how? Well, you've come to the right place!
In this tutorial, we'll be implementing our very own language, KITTEH. (For those unacquainted, "kitteh" is lolspeak for "kitten". I'm calling it this because the language will be small and portable.) Here's a taste of it:
Well ok, I lied about a taste of it. That's the entire language right there. I know, it's tiny, but you can see the makings of most modern programming languages in it. (And I guarantee it will be plenty enough to implement in a tutorial.)
How the things do the stuff
So just how does a language work? Well, it's sorta like an assembly line. Each step takes something before it and converts it into a more abstracted form. This starts with the raw input (the code) and ends with the output (the execution). But there's a lot of steps in between.
Important note: A language is actually just a computer program! That assembly line I talked about is just a computer program that you write. Yes, so that means when you write your coding language you need to write it in another coding language. "But how was the first one invented?" you ask. Well, it's a "turtles all the way down" situation ending with machine code, which gets directly run by the computer hardware.
In this tutorial, we'll be making an interpreted language. That means our assembly line will look something like this:
x = (x + 1) // a comment
, our tokens would bex
,=
,(
,x
,+
,1
,)
. No, I didn't forget the comment — remember I said meaningful pieces of the program, and the meaning of a program isn't changed by a comment.x = (x + 1)
), the AST would be something like: Notice how this reflects the meaning of the program — we're assigning x to (the result of adding x and 1). Also notice how the parentheses were lost — parentheses don't have any meaning by themselves, they just group things together differently.Run the bytecode. This is the execution. Let's assume that
x
has the value 2 preloaded for example sake. For the above bytecode our "execution trace" would be:This should give you a better idea of how the stack works. It's like a list of items, and some commands add to the list ("push") and others remove from the list ("pop").
So that's the "language pipeline". Without further ado, let's get started!
Further ado
Ok I lied, there's further ado. We need to talk about the language we'll be implementing KITTEH in — C. Don't worry if you're not good with C, as long as you have a basic understanding of it you'll be good to go. I'll try to walk through step-by-step exactly what each bit of code does and how it works. However, this tutorial does assume you have rudimentary C knowledge (what is a function, what is
#include
, how do types work, etc). It also assumes you have intermediate to advanced knowledge of programming in general.So let's start setting things up. To begin, make yourself a new C repl. Let's remove the default program add in our starter framework:
Pardon my poor pointer etiquette — it's a habit I'm probably unable to change. The implementation of the
read_file
function is irrelevant to this tutorial; if you're unfamiliar with C all you need to know is that when given a filename it will read it and return a string (char*
— pointer to the first character in the string) of its contents.I'm just making the filename a constant for brevity of this tutorial, obviously in a real language you'd want to take command-line input to get the filename. Speaking of which, go ahead and make a file "source.kitteh". Leave it empty for now, we'll make it our "running test file" to ensure everything we add works.
Getting started
Okay, now we can get started. Let's tackle each step one at a time. To begin, we need to turn the source code into a list of tokens. But actually, there's a better way. What if we just read the tokens "on-the-fly", i.e. read each token only when we need it? Basically, instead of turning the tokens into a big list, we're doing it generator-style by only reading each token when requested. This has a few big benefits:
So all we need to write is some function, "scan()", which returns the next token each time it's called. Can't be too hard, right?
Let's make a file named
lexer.h
. In the file, let's set up our include guards (an include guard prevents a file from getting included multiple times, which causes name conflicts since it defines the same thing twice):Here we declare that our function
scan
will take no arguments and produce astruct Token
. Hey, what'sToken
? Well,Token
will be our type to represent a token. Speaking of which, let's define that now (put this above the declaration ofscan()
:Here we're defining Token to be a struct with a type field to record what kind of token it is, a value field to represent the actual text content of the token, and a line field to show which line the token is on. (That's used for error messages, so it's important to have.)
But hey, our
type
enum is empty! Let's fix that — change the enum to be this:You can probably guess which tokens each type corresponds to — in case not, here's the rundown:
T_VAR
is for the keywordVAR
T_WHILE
is for the keywordWHILE
T_DO
is for the keywordDO
T_END
is for the keywordEND
T_PRINT
is for the keywordPRINT
T_ID
is for a name likex
ortwenty_three
T_NUM
is for numbers like2
or123
T_LPAR
is for the left parenthesis character(
T_RPAR
is for the right parenthesis character)
T_PLUS
is for the plus character+
T_MINUS
is for the minus character-
T_EQ
is for the equals character=
T_SEP
is for the newline character. We'll force a newline after each statement (soVAR x VAR y
is invalid), so this is important.T_EOF
is for the end of file. When there are no more tokens left to read we'll just repeatedly emit this token. Hopefully the parser will get the message :DThe list is pretty limited since the language is limited — but a more complex language will obviously have a lot more token variety.
Ok great, we now have our token struct! Let's start implementing the scan function. Create a new file,
lexer.c
, and initialize it with the following:We're including our setup (
lexer.h
header) file and then making the framework of our scan function. And boom, problem. How do we know where to be scanning from? How do we record our place in the file? We need to create ascanner
struct to record this important data. Let's do that right now, beneath the#include
statement:Here we define our current position in the file as a
char*
. For those unfamiliar with C, a*
after a type means a "pointer". A pointer does just what you expect — it points to a position in memory. When we start lexing, we'll make that pointer point to the start of the file. As we scan tokens, we'll move the pointer farther and farther through the file until we get to the end of the file. Then we'll just emitT_EOF
's.Ok, let's make a helper function to initialize our file. In
lexer.h
, add the following after our declaration ofscan
:In our main function, we'll call this function on the file we read to set up our scanner. But let's define it first, at the bottom of our
lexer.c
file (after our framework forscan()
):Well that was easy! Now, back in
main.c
, we can update our main function (replace the old one):Oh, but we need to include our lexer now! On the line after
#include <stdio.h>
, let's do that:Getting to the fun
Let's get scanning. First we need to set up our includes. At the start of
lexer.c
(before the#include "lexer.h"
line) let's add our imports:Now we can remove our
// todo
comment inlexer.c
and really get started (replace the oldscan
function with the following):Whoa, that's a lot! Let's break it down a bit.
*scanner.position
). For those unfamiliar, the*
before a pointer "de-references" it, ie gets the value that it's pointing to. The(struct Token){ /* .fields = values */ }
syntax is simply a shorthand which allows us to create aToken
with the listed fields at the listed values.scanner.current ++
after the EOF token. I purposely did not include that in order to make it repeatedly output EOF tokens once it gets to the end of the file.\n
handler by just repeatedly consuming consecutive newlines._
. If it is, we're dealing with a name token, so we just keep advancing the current character pointer until we get to a char that isn't alphabetic or_
. At that point, we allocate enough memory to create a string out of the token and copy the data from the file into the token'svalue
. Here, we also check if the token is one of the special keywords, in which case we give it a different type.Take a minute to digest that. It's not super complicated once you realize how it works, so it's important that you get a feel for how it's working.
Now, there's one problem with this system. It doesn't handle whitespace (outside of newlines)! We need to skip whitespace and comments... so let's do that before we try to scan. At the beginning of the scan function (before
switch(*scanner.position){
...), we can fix that:Now yes, I know I've just incurred death by raptor, but I think this is one of the few good uses of
goto
.The idea here is simple; loop forever, checking the current character:
,
\t
, or\r
, skip it and keep going./
and the next one is/
, skip those until you reach a newline (skipping a single line comment).You'll notice that I don't have a break after the
/
case, only inside the if — that's because I want it to fall through to the default (break out of the loop) if the second character isn't/
.Testing the lexer
Now the moment of truth — testing. We need to make another function to print out a token, so let's do that first. In
lexer.h
, we add its declaration afterinitializeScanner
:And now, at the bottom of
lexer.c
we implement it:We're using the fact that enums are just sequential numbers given nice names, so we can use the token's type (an enum value) as an index into the list of the string names of each enum value. (Read that sentence again, might make more sense the second time around.) At that point, it's just good old printf.
Now in
main.c
, right above our// todo — everything
comment let's add some tests:Okay, almost time — in
source.kitteh
, let's add three tokens!Make sure to include a newline afterwards! (Markdown doesn't render it here.)
If you got what I got, it should print out
when you run it! Mess around a bit, make sure it's working, and then we can proceed!
Parsing — Giving Meaning to Tokens
Remove the tests in
main.c
(the three lines ofprintToken(scan())
), it's time to implement parsing. If you remember our assembly line, this is when we take tokens and turn them into an AST.So now is probably a good time to implement an AST. Let's get started!
Time to set up a new file —
ast.h
. Make it, and we'll put in some include guards:Okay, time to make an AST structure. Remember that an AST is basically just a tree that represents the program's semantics. That means we need two kinds of tree nodes: statement nodes, for things like
VAR x
andWHILE x DO ... END
, and expression nodes for things likex
,1
, orx = 1
.Inside the include guards let's write our statement node definitions:
Let's go through this step-by-step. We're defining a
StmtNode
(statement node) to have two fields — an enum which keeps track of the kind of node it is, and a union which keeps track of the node's data. The reason we have a union is that each type of node needs to contain different data, so we have to have multiple kinds of structures for our multiple kinds of nodes.The structures themselves are as follows:
VAR x
) simply keeps track of the variable's name.WHILE expr DO stmts END
) keeps track of the expression (expr
) and the list of statements (stmts
), and how many statements there are.PRINT 1
,PRINT 1 + x - 25
) simply keeps track of the expression on the right hand side.1
,x
, orx = 1
) is just an expression. An expression statement might seem like an oxymoron, but it's important to have since we madex = 1
an expression (as most languages do) and function calls would be expressions too (if KITTEH had them).Now that we've made our statement node structure, we need to make our expression node structure. Above the statement node definitions (right after the
#define ast_h
), we'll write our expression definitions:This is much the same as before — we're just defining our expression types this time. We have a number literal type (
1234
), a variable name type (foo
), a binary operator type (foo - 1234
), and a unary operator type (-foo
). Note that we'll only implement+
,-
and=
(assignment) operators for the sake of brevity.Finally, we need to create a special helper struct,
Program
, which will keep track of our program. A program in KITTEH is just a list of statements, soProgram
just stores a list of statements and its length (refer to thewhile_stmt
struct a while ago if you're unsure of what's going on with the double pointer). Let's add that now, under ourExprNode
definition:Let's get parsing already
Okay, it's time to actually do the aforementioned parsing. Time for — you guessed it — another file! (I know, it's monotonous, but separating your project into modular file "units" is really helpful.) Let's set up
parser.h
:Time to make
parser.c
— let's set it up as well:We're including the lexer since we'll need its
scan
function.Now, there are a plethora of ways to go about parsing. One way is to use something called a Pratt parser, which is a really interesting method but it's beyond the scope of this tutorial. Parser combinators are another way, but they too are beyond this tutorial's reach. (If you're interested, feel free to research those a bit.) So, we'll just be implementing a simple recursive descent parser.
The basic idea of a recursive descent parser is that you have a function for each type of thing you want to parse. So we'll have a function for parsing a program (list of statements), a function for parsing a single statement, a function for parsing a binary operator, and a function for parsing a "simple expression" (number literal or variable name). Then, each of those functions recursively calls other functions — so our statement parser might call the binary operator parser, which will call the simple expression, etc.
You can implement precedence by having different function "levels" — for instance, you could have a function which only parses the binary operators
+
and-
, and then that function calls a function which only parses*
and/
.Now that was all pretty theoretical, so it might be confusing at first. Let's start implementing it, I think it will make better sense that way.
I said, let's get parsing already!
Okay, okay! Remember how we implemented our lexer as "lex-as-you-go"? Well, there is one downside to that — how do we "peek" ahead to check what the next token is? We can't call
scan()
since that actually consumes the token... so how do we get around this?What we can do is store two tokens in a "parser" struct — the next token and the current one. That way, we can peek ahead by just looking at the next token rather than consuming it.
So after our
#include
's in parser.c, let's create this parsing struct:After our struct, let's define some helper functions:
The
advance
function will move the parser one step forward in our token stream. We can then get the current token withparser.current
or peek ahead to the next one withnext
.The
consume
function is a little more complex. Essentially, it checks if the next token is of the given type. If not, it throws an error in the form of "[line <the line number>] Error at <the token>: <the error message>". The switching logic is to make newlines or EOF tokens look nicer.Armed with our new functions, we can finally get parsing! Isn't this thrilling? ;)
Let's start by implementing our
parse
function (replace the dummy definition):The first thing we do is
advance()
. This is to "prime" ourparser
struct — it starts out empty, so by advancing we put something in thenext
token slot.Then we set up our program struct. We're allocating a
statements
array with enough space to hold 1StmtNode*
, and we record this ininitcap
.Recall that a program is just a list of statements. So, we parse a statement (we have yet to define
parse_statement
but we'll do that next) and put it in ourstatements
array. We increment our count of how many statements we have; if this is greater than or equal to our array capacity, we grow the array. We keep going until we see an EOF, at which point we shrink the array down to the bare minimum size and return the program struct.Okay, now let's implement the aforementioned
parse_statement()
function (put this right before theparse()
function definition):We're really getting into it now, so let's take this slowly.
StmtNode
and fill in its values (we're using*varstmt = { ... }
becausevarstmt
is a pointer, so we need to dereference it). We return our newly allocated node.WHILE 1 DO PRINT 42 END
is valid. Now we set up an array for our statements. You'll notice the code here is much the same as it was in theparse()
function; that's because it's doing the same thing just with different variable names. So, we parse an array of statements up until the END token, and afterwards we expect a newline. Finally we fill in all the info into the AST node.That was a pretty big leap for parsing. Now all we need to do is parse expressions.
We can think of our expression parser as an onion with a bunch of layers. The innermost "core" is atomic expressions, like
foo
or123
or(1 + 2 + 3)
. If you're wondering about the last one, that's considered an atomic expression because the parentheses make it one thing. All of them are one unit. The next layer is unary operators — we only have one,-
, so that's things like-1
or-(2 + 3)
. Notice how the unary operator expression contains an atomic expression in it (the thing on the right), thus my onion analogy. Outside of that layer we have binary operators, like1 + 2
or-foo + (1 + -2)
. The second example is important — notice how there's only one binary operator at the "top" level;-foo
and(1 + -2)
are both "lower-level" expressions. The final and outermost layer is the binary operator=
, forx = 1
orx = 1 + -2
. We put this operator by itself outside of the others because we wantx = 1 + 2
to parse asx = (1 + 2)
not(x = 1) + 2
. (Think of it like this:1 + 2
should have a higher level of "grouping" thanx = 1
should.)The nice part about this Matryoshka doll approach is that it's modeled very nicely by functions. We can have a function for our binary operators (our outer layer) which can then call our function for unary operators (one layer lower) which can call our function for atomic expressions (one layer even lower). So let's get implementing!
Our outermost layer is
=
, so we'll implement that first. Above our definition forparse_statement
, we'll defineparse_expression
:Here we parse a lower-level binop expression (a + or - one) and then check if the next token is an equals sign. If it is, we know we're parsing an assignment, so we set up an AST node and parse another expression as the right-hand-side. We're calling the function recursively since
x = y = z
is valid, and it should parse tox = (y = z)
.Let's move on to + and -. Above
parse_expression
, we'll defineparse_binop_expression
:Here we parse a unary op expression (which is one level lower down) as our left hand side and call it
expr
. If there is in fact a binary operator (+ or -), then we store which kind it is and skip past it. Then we parse the right-hand-side, and updateexpr
to be the binary operator node we just parsed. We keep doing this for as long as there's +'s or -'s, so that way1 + 1 + 1 + 1
will parse as((1 + 1) + 1) + 1
(which is correct).Notice how we're using a while loop here rather than recursion as with
=
. This is because=
is right-associative (x = y = z
isx = (y = z)
), while+
/-
is left-associative (1 - 1 - 1
is(1 - 1) - 1
). The while loop mechanism represents left association best while the recursion mechanism represents right association best.Time to implement parsing unary operators. Above our
parse_binop_expression
definition let's defineparse_unary_op_expression
:Here, we first set up a dummy node since the unary operator comes before the actual expression. We then build our node by repeatedly checking if the next token is
-
and working recursively outwards. (Since there is only one type of unary operator,-
, we just automatically set the type to that. If you add more, you'll want to check which type it is just like we did for binary operators.)After there are no more
-
's left, we do some logic to replace our dummy NULL node with the actual expression that comes next. Then we return our built node.There's just one parsing task left — parsing atomic expressions, the "core" we talked about before. Let's do that right above
parse_unary_op_expression
:We check which kind of token we're encountering. If it's a variable name, we construct an AST node accordingly. The process for a number literal is much the same, but note that
atol
converts a string to along int
(which is what we defined our number type to be). Finally, we have the logic for parentheses. Notice they don't actually make an AST node — they exist solely for parsing (all they do is grouping).If it was none of those options, we raise an error. I'm adding the
exit(1)
so that clang doesn't complain about control reaching the end of a non-void function — however, as ourconsume
function will throw an error regardless since the type is -1 (outside the bounds of our enum) we know this isn't necessary.Speaking of parentheses, notice we used
parse_expression
? That's a recursive dependency, so right above our definition ofparse_atomic_expression
we need to add a forward declaration:Hooray! We've just finished our parser. That was quite a bit, so take a second to relish your achievement.
Let's test it out now. Back in
main.c
, let's update our main function (replace the old one):We're ignoring the return value (which is the generated AST) for now, we'll use it in a bit. First we need to include our parser file — at the top let's add that to our include list (other includes shown for reference):
Okay, let's fire it up! Go ahead and run the program — if all goes well, the program should compile fine and it should produce nothing. This is because
source.kitteh
is empty! Let's fix that, and add the following:This should test the entire grammar. Copy and paste it into
source.kitteh
and hopefully the parser will be silent (not throw any parsing errors)!IMPORTANT: Our simple parser requires a blank line after the last statement (since every statement needs a newline to end it). This is something you'd want to fix in your own language, but for brevity and simplicity I've chosen not to implement that. Therefore, make sure to add (at least one) blank newline at the end of your
source.kitteh
file from now on! (If you forget to, our handy parser error system will remind you.)Now let's try deliberately introducing an error. Change the program to the following:
Notice the duplicate
1
after the print statement. Let's see what happens:Great, our error system works! It knows that the 1 shouldn't belong there, and it knows which line it's on. Change the program back to the valid version (remove the 1) and let's see what AST our parser is building.
To do this, we'll need a way of pretty-printing an AST structure. Where better to implement this than a new file,
ast.c
:There's nothing really magical going on here, we're just handling each type of node and each subtype. In
ast.h
, after ourProgram
struct, let's make these functions public:Now, let's change our
main()
function inmain.c
to print out the program. Replace theparse();
line with the following:And at the top of the file, let's update our includes (old ones shown for reference):
Okay, the time has come — run the program! If everything goes well, you should get this:
Looking through that, it looks like it parsed the program correctly! The first statement is declaring a variable,
x
. The second statement is a while loop, with the condition beingx
, and the inner statements are a print statement (with1
being the thing printed) and an expression statement (x - 2
). Finally, after the loop we have an assignment,x
being assigned to1 + 2 + -3
.So pat yourself on the back, since you've just created a fully working parser!
That concludes part one of this tutorial. Stay tuned for part 2, where we'll be finishing our language by implementing bytecode generation and a way to run the bytecode!
All code can be found in the attached repl. If you have any questions, feel free to ask!
@ellipie this is my entry for the jam
aww thx :))
@programmeruser