Learn to Code via Tutorials on Repl.it!

← Back to all posts
#TutorialJam — Build your own programming language in ⓒ! PART 1 of 2: Lexing and Parsing
fuzzyastrocat (1297)

Ever wanted to make your own programming language? Ever wondered how? Well, you've come to the right place!

In this tutorial, we'll be implementing our very own language, KITTEH. (For those unacquainted, "kitteh" is lolspeak for "kitten". I'm calling it this because the language will be small and portable.) Here's a taste of it:

VAR x // starts off at zero
x = (x + 1) + -(-9)

WHILE x DO
  x = x - 1
  PRINT x
END

Well ok, I lied about a taste of it. That's the entire language right there. I know, it's tiny, but you can see the makings of most modern programming languages in it. (And I guarantee it will be plenty enough to implement in a tutorial.)

How the things do the stuff

So just how does a language work? Well, it's sorta like an assembly line. Each step takes something before it and converts it into a more abstracted form. This starts with the raw input (the code) and ends with the output (the execution). But there's a lot of steps in between.

Important note: A language is actually just a computer program! That assembly line I talked about is just a computer program that you write. Yes, so that means when you write your coding language you need to write it in another coding language. "But how was the first one invented?" you ask. Well, it's a "turtles all the way down" situation ending with machine code, which gets directly run by the computer hardware.

Aside: compiled vs interpreted

This is probably something you've heard of before — "x is a compiled language" or "y is an interpreted language". But what does it mean?

Well, an interpreted language gets run on the spot. When you run that "assembly line" computer program, it takes in your source code and spits out the output (the print statements and the like). But a compiled language is actually just a translator. A compiled language translates your language into another lower-level language. For instance, the C compiler translates C into assembly code.

In this tutorial, we'll be making an interpreted language. That means our assembly line will look something like this:

  • Input the source code, presumably by reading a file
  • Turn the source code into a list of tokens, which are like all the meaningful pieces of a program. This is called lexing. For instance, in the program x = (x + 1) // a comment, our tokens would be x, =, (, x, +, 1, ). No, I didn't forget the comment — remember I said meaningful pieces of the program, and the meaning of a program isn't changed by a comment.
  • Turn the list of tokens into an abstract syntax tree, or AST. This is called parsing. For our example program (x = (x + 1)), the AST would be something like:
         =
        / \
    ID:x   +
          / \
      ID:x   NUM:1
    Notice how this reflects the meaning of the program — we're assigning x to (the result of adding x and 1). Also notice how the parentheses were lost — parentheses don't have any meaning by themselves, they just group things together differently.
  • Turn the AST into bytecode, which is like a really compact form for expressing a program. This can be called generation, but the terminology here is a bit more vague. The bytecode for the above AST would be something like:
    LOAD x
    PUSH 1
    ADD
    SET x
    This bytecode is stack-based — there is an implicit stack which gets pushed and popped to. If you've heard of this that's great, if not don't worry since we'll talk more about what that means later.
  • Run the bytecode. This is the execution. Let's assume that x has the value 2 preloaded for example sake. For the above bytecode our "execution trace" would be:

    STACK: []
    VARS: {x: 2} 
    
    AFTER RUNNING: "LOAD x"
    STACK: [2]
    VARS: {x: 2}
    
    AFTER RUNNING: "PUSH 1"
    STACK: [2, 1]
    VARS: {x: 2}
    
    AFTER RUNNING: "ADD"
    STACK: [3]
    VARS: {x: 2}
    
    AFTER RUNNING: "SET x"
    STACK: []
    VARS: {x: 3}

    This should give you a better idea of how the stack works. It's like a list of items, and some commands add to the list ("push") and others remove from the list ("pop").

So that's the "language pipeline". Without further ado, let's get started!

Further ado

Ok I lied, there's further ado. We need to talk about the language we'll be implementing KITTEH in — C. Don't worry if you're not good with C, as long as you have a basic understanding of it you'll be good to go. I'll try to walk through step-by-step exactly what each bit of code does and how it works. However, this tutorial does assume you have rudimentary C knowledge (what is a function, what is #include, how do types work, etc). It also assumes you have intermediate to advanced knowledge of programming in general.

So let's start setting things up. To begin, make yourself a new C repl. Let's remove the default program add in our starter framework:

#include <stdlib.h>
#include <stdio.h>

char* read_file(char* fname) {
  FILE* f = fopen(fname, "r");
  char* buffer = NULL;
  int string_size, read_size;

  if(f) {
    fseek(f, 0, SEEK_END);
    string_size = ftell(f);
    fseek(f, 0, SEEK_SET);

    buffer = malloc(sizeof(char) * (string_size + 1));
    buffer[string_size] = '\0';

    read_size = fread(buffer, sizeof(char), string_size, f);

    if(string_size != read_size) {
      free(buffer);
      printf("Unable to allocate file memory!\n");
      exit(1);
    }
    
    fclose(f);

    return buffer;
  } else {
    printf("Unable to read file!\n");
    exit(1);
  }
}

int main() {
 char* filename = "source.kitteh";
  char* contents = read_file(filename);

  // todo — everything

  return 0;
} 

Pardon my poor pointer etiquette — it's a habit I'm probably unable to change. The implementation of the read_file function is irrelevant to this tutorial; if you're unfamiliar with C all you need to know is that when given a filename it will read it and return a string (char* — pointer to the first character in the string) of its contents.

I'm just making the filename a constant for brevity of this tutorial, obviously in a real language you'd want to take command-line input to get the filename. Speaking of which, go ahead and make a file "source.kitteh". Leave it empty for now, we'll make it our "running test file" to ensure everything we add works.

Getting started

Okay, now we can get started. Let's tackle each step one at a time. To begin, we need to turn the source code into a list of tokens. But actually, there's a better way. What if we just read the tokens "on-the-fly", i.e. read each token only when we need it? Basically, instead of turning the tokens into a big list, we're doing it generator-style by only reading each token when requested. This has a few big benefits:

  • We don't have to deal with allocating a list of unknown size
  • We don't have to use tons of memory by allocating this big list

So all we need to write is some function, "scan()", which returns the next token each time it's called. Can't be too hard, right?

Let's make a file named lexer.h. In the file, let's set up our include guards (an include guard prevents a file from getting included multiple times, which causes name conflicts since it defines the same thing twice):

#ifndef lexer_h
#define lexer_h

struct Token scan(void);

#endif

Here we declare that our function scan will take no arguments and produce a struct Token. Hey, what's Token? Well, Token will be our type to represent a token. Speaking of which, let's define that now (put this above the declaration of scan():

struct Token {
  enum TokenType {

  } type;

  char* value;

  int line;
};

Here we're defining Token to be a struct with a type field to record what kind of token it is, a value field to represent the actual text content of the token, and a line field to show which line the token is on. (That's used for error messages, so it's important to have.)

But hey, our type enum is empty! Let's fix that — change the enum to be this:

enum TokenType {
  T_VAR,
  T_WHILE,
  T_DO,
  T_END,
  T_PRINT,
  T_ID,
  T_NUM,
  T_LPAR,
  T_RPAR,
  T_PLUS,
  T_MINUS,
  T_EQ,
  T_SEP,
  T_EOF
} type;

You can probably guess which tokens each type corresponds to — in case not, here's the rundown:

  • T_VAR is for the keyword VAR
  • T_WHILE is for the keyword WHILE
  • T_DO is for the keyword DO
  • T_END is for the keyword END
  • T_PRINT is for the keyword PRINT
  • T_ID is for a name like x or twenty_three
  • T_NUM is for numbers like 2 or 123
  • T_LPAR is for the left parenthesis character (
  • T_RPAR is for the right parenthesis character )
  • T_PLUS is for the plus character +
  • T_MINUS is for the minus character -
  • T_EQ is for the equals character =
  • T_SEP is for the newline character. We'll force a newline after each statement (so VAR x VAR y is invalid), so this is important.
  • T_EOF is for the end of file. When there are no more tokens left to read we'll just repeatedly emit this token. Hopefully the parser will get the message :D

The list is pretty limited since the language is limited — but a more complex language will obviously have a lot more token variety.

Ok great, we now have our token struct! Let's start implementing the scan function. Create a new file, lexer.c, and initialize it with the following:

#include "lexer.h"

struct Token scan(){ 
  // todo
}

We're including our setup (lexer.h header) file and then making the framework of our scan function. And boom, problem. How do we know where to be scanning from? How do we record our place in the file? We need to create a scanner struct to record this important data. Let's do that right now, beneath the #include statement:

struct {
  char* position;
  int line;
} scanner;

Here we define our current position in the file as a char*. For those unfamiliar with C, a * after a type means a "pointer". A pointer does just what you expect — it points to a position in memory. When we start lexing, we'll make that pointer point to the start of the file. As we scan tokens, we'll move the pointer farther and farther through the file until we get to the end of the file. Then we'll just emit T_EOF's.

Ok, let's make a helper function to initialize our file. In lexer.h, add the following after our declaration of scan:

void initializeScanner(char*);

In our main function, we'll call this function on the file we read to set up our scanner. But let's define it first, at the bottom of our lexer.c file (after our framework for scan()):

void initializeScanner(char* file) {
  scanner.position = file;
  scanner.line = 1;
}

Well that was easy! Now, back in main.c, we can update our main function (replace the old one):

int main() {
  char* filename = "source.kitteh";
  char* contents = read_file(filename);
  initializeScanner(contents);

  // todo — everything

  return 0;
} 

Oh, but we need to include our lexer now! On the line after #include <stdio.h>, let's do that:

#include "lexer.h"

Getting to the fun

Let's get scanning. First we need to set up our includes. At the start of lexer.c (before the #include "lexer.h" line) let's add our imports:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

Now we can remove our // todo comment in lexer.c and really get started (replace the old scan function with the following):

struct Token scan(){ 
  switch(*scanner.position) {
    case '\0':
      return (struct Token){ .type = T_EOF, .value = "\0", .line = scanner.line };
    case '(':
      scanner.position++;
      return (struct Token){ .type = T_LPAR, .value = "(", .line = scanner.line };
    case ')':
      scanner.position++;
      return (struct Token){ .type = T_RPAR, .value = ")", .line = scanner.line };
    case '+':
      scanner.position++;
      return (struct Token){ .type = T_PLUS, .value = "+", .line = scanner.line };
    case '-':
      scanner.position++;
      return (struct Token){ .type = T_MINUS, .value = "-", .line = scanner.line };
    case '=':
      scanner.position ++;
      return (struct Token){ .type = T_EQ, .value = "=", .line = scanner.line};
    case '\n':
      scanner.position++;
      scanner.line ++;
      int line = scanner.line - 1;
      while(*scanner.position == '\n') {
        scanner.position ++;
        scanner.line ++;
      }
      return (struct Token){ .type = T_SEP, .value = "\n", .line = line };

    default:
      if(isalpha(*scanner.position) || *scanner.position == '_') {
        char* start = scanner.position;
        while(isalpha(*scanner.position) || *scanner.position == '_') {
          scanner.position++;
        }
        size_t len = sizeof(char) * (scanner.position - start);

        char* value = (char*)malloc(len + 1);
        memcpy(value, start, len);
        value[len] = '\0';

        if(strcmp(value, "VAR") == 0) {
          return (struct Token){ .type = T_VAR, .value = value, .line = scanner.line };
        }
        if(strcmp(value, "WHILE") == 0) {
          return (struct Token){ .type = T_WHILE, .value = value, .line = scanner.line };
        }
        if(strcmp(value, "DO") == 0) {
          return (struct Token){ .type = T_DO, .value = value, .line = scanner.line };
        }
        if(strcmp(value, "END") == 0) {
          return (struct Token){ .type = T_END, .value = value, .line = scanner.line };
        }
        if(strcmp(value, "PRINT") == 0) {
          return (struct Token){ .type = T_PRINT, .value = value, .line = scanner.line };
        }

        return (struct Token){ .type = T_ID, .value = value, .line = scanner.line };
      }
      if(isdigit(*scanner.position)) {
        char* start = scanner.position;
        while(isdigit(*scanner.position)) {
          scanner.position++;
        }
        size_t len = sizeof(char) * (scanner.position - start);
        char* value = (char*)malloc(len + 1);
        memcpy(value, start, len);
        value[len] = '\0';
        return (struct Token){ .type = T_NUM, .value = value, .line = scanner.line };
      }

      printf("Unknown token %c at line %d\n", *scanner.position, scanner.line);
      exit(1);
  }
}

Whoa, that's a lot! Let's break it down a bit.

  • We start by handling the one character tokens. These are easy, since we just match the current character in our file (*scanner.position). For those unfamiliar, the * before a pointer "de-references" it, ie gets the value that it's pointing to. The (struct Token){ /* .fields = values */ } syntax is simply a shorthand which allows us to create a Token with the listed fields at the listed values.
    • I didn't forget a scanner.current ++ after the EOF token. I purposely did not include that in order to make it repeatedly output EOF tokens once it gets to the end of the file.
    • We merge duplicate newlines in the \n handler by just repeatedly consuming consecutive newlines.
  • We now handle the multi-character tokens. We first check if the current character is an alphabetic character or _. If it is, we're dealing with a name token, so we just keep advancing the current character pointer until we get to a char that isn't alphabetic or _. At that point, we allocate enough memory to create a string out of the token and copy the data from the file into the token's value. Here, we also check if the token is one of the special keywords, in which case we give it a different type.
  • If it wasn't, we check if it's a digit. The process here is much the same, just simpler. (We won't deal with floats for the purpose of brevity.)
  • If it wasn't a digit either, we throw an error since we know the user inputted something incorrectly.

Take a minute to digest that. It's not super complicated once you realize how it works, so it's important that you get a feel for how it's working.

Now, there's one problem with this system. It doesn't handle whitespace (outside of newlines)! We need to skip whitespace and comments... so let's do that before we try to scan. At the beginning of the scan function (before switch(*scanner.position){...), we can fix that:

struct Token scan(){ // this is just here for reference, don't copy it
  for(;;) {
    switch(*scanner.position) {
      case ' ': case '\t': case '\r':
        break;

      case '/':
        if(*(scanner.position + 1) == '/') {
          while(*scanner.position != '\n') {
            scanner.position ++;
          }
          break;
        }
      
      default:
        goto done_checking_ws;
    }
  }
  done_checking_ws:

  switch(*scanner.position) { // once again just for reference

Now yes, I know I've just incurred death by raptor, but I think this is one of the few good uses of goto.
The idea here is simple; loop forever, checking the current character:

  • If it's , \t, or \r, skip it and keep going.
  • If it's / and the next one is /, skip those until you reach a newline (skipping a single line comment).
  • If it's none of those, break out of the loop.

You'll notice that I don't have a break after the / case, only inside the if — that's because I want it to fall through to the default (break out of the loop) if the second character isn't /.

Testing the lexer

Now the moment of truth — testing. We need to make another function to print out a token, so let's do that first. In lexer.h, we add its declaration after initializeScanner:

void printToken(struct Token);

And now, at the bottom of lexer.c we implement it:

void printToken(struct Token t) {
  static char* names[] = {"VAR", "WHILE", "DO", "END", "PRINT", "ID", "NUM", "LPAR", "RPAR", "PLUS", "MINUS", "EQUALS", "NEWLINE", "EOF"}; 
  printf("Token{ type=%s, value=%s, line=%d }\n", names[t.type], t.value, t.line);
}

We're using the fact that enums are just sequential numbers given nice names, so we can use the token's type (an enum value) as an index into the list of the string names of each enum value. (Read that sentence again, might make more sense the second time around.) At that point, it's just good old printf.

Now in main.c, right above our // todo — everything comment let's add some tests:

  printToken(scan());
  printToken(scan());
  printToken(scan());

Okay, almost time — in source.kitteh, let's add three tokens!

wow_a_var 123 // comment

Make sure to include a newline afterwards! (Markdown doesn't render it here.)

If you got what I got, it should print out

Token{ type=ID, value=wow_a_var, line=1 }
Token{ type=NUM, value=123, line=1 }
Token{ type=NEWLINE, value=
, line=1 }

when you run it! Mess around a bit, make sure it's working, and then we can proceed!

Parsing — Giving Meaning to Tokens

Remove the tests in main.c (the three lines of printToken(scan())), it's time to implement parsing. If you remember our assembly line, this is when we take tokens and turn them into an AST.

So now is probably a good time to implement an AST. Let's get started!

Time to set up a new file — ast.h. Make it, and we'll put in some include guards:

#ifndef ast_h
#define ast_h



#endif

Okay, time to make an AST structure. Remember that an AST is basically just a tree that represents the program's semantics. That means we need two kinds of tree nodes: statement nodes, for things like VAR x and WHILE x DO ... END, and expression nodes for things like x, 1, or x = 1.

Aside: Wait, why is x = 1 an expression?

It's often nice to have multiple assignments at once — i.e. x = y = 1. This should be parsed as x = (y = 1), which means that y = 1 needs to be an expression (which returns 1).

Additionally, sometimes you want to put assignments inside a conditional — ie, WHILE x = getchar() DO ... END. Obviously KITTEH doesn't have functions, but your language probably will.

Inside the include guards let's write our statement node definitions:

struct StmtNode {
  enum {
    VAR_DECL,
    WHILE_STMT,
    PRINT_STMT,
    EXPR_STMT
  } type;

  union {
    struct {
      char* name;
    } var_decl;

    struct {
      struct ExprNode* expression;
      struct StmtNode** statements;
      int num;
    } while_stmt;

    struct {
      struct ExprNode* expression;
    } print_stmt;

    struct {
      struct ExprNode* expression;
    } expr_stmt;
  } as;
};

Let's go through this step-by-step. We're defining a StmtNode (statement node) to have two fields — an enum which keeps track of the kind of node it is, and a union which keeps track of the node's data. The reason we have a union is that each type of node needs to contain different data, so we have to have multiple kinds of structures for our multiple kinds of nodes.

The structures themselves are as follows:

  • The structure for a var declaration (VAR x) simply keeps track of the variable's name.
  • The structure for a while statement (WHILE expr DO stmts END) keeps track of the expression (expr) and the list of statements (stmts), and how many statements there are.
    • The expression simply has one star — it's a single pointer to an expression. But the statements has two stars; why? Well, we need a pointer to a pointer to a statement node. Basically, we're going to have a list of pointers to statement nodes. We'll keep a pointer to the first pointer in the list and how many pointers there are.
  • The structure for a print statement (PRINT 1, PRINT 1 + x - 25) simply keeps track of the expression on the right hand side.
  • The structure for an expression statement (1, x, or x = 1) is just an expression. An expression statement might seem like an oxymoron, but it's important to have since we made x = 1 an expression (as most languages do) and function calls would be expressions too (if KITTEH had them).

Now that we've made our statement node structure, we need to make our expression node structure. Above the statement node definitions (right after the #define ast_h), we'll write our expression definitions:

struct ExprNode {
  enum {
    NUM_LITERAL,
    VAR_NAME,
    BINARY_OP,
    UNARY_OP
  } type;

  union {
    struct {
      long int num;
    } num_literal;

    struct {
      char* name;
    } var_name;

    struct {
      struct ExprNode* left;
      struct ExprNode* right;
      char binop_type;
    } binary_op;

    struct {
      struct ExprNode* left;
      char unop_type;
    } unary_op;
  } as;
};

This is much the same as before — we're just defining our expression types this time. We have a number literal type (1234), a variable name type (foo), a binary operator type (foo - 1234), and a unary operator type (-foo). Note that we'll only implement +, - and = (assignment) operators for the sake of brevity.

Finally, we need to create a special helper struct, Program, which will keep track of our program. A program in KITTEH is just a list of statements, so Program just stores a list of statements and its length (refer to the while_stmt struct a while ago if you're unsure of what's going on with the double pointer). Let's add that now, under our ExprNode definition:

struct Program {
  struct StmtNode** statements;
  int stmtnum;
};

Let's get parsing already

Okay, it's time to actually do the aforementioned parsing. Time for — you guessed it — another file! (I know, it's monotonous, but separating your project into modular file "units" is really helpful.) Let's set up parser.h:

#ifndef parser_h
#define parser_h

#include "ast.h"

struct Program parse(void);

#endif

Time to make parser.c — let's set it up as well:

#include <stdio.h>
#include <stdlib.h>

#include "parser.h"
#include "lexer.h"
#include "ast.h"

struct Program parse() {
  
}

We're including the lexer since we'll need its scan function.

Now, there are a plethora of ways to go about parsing. One way is to use something called a Pratt parser, which is a really interesting method but it's beyond the scope of this tutorial. Parser combinators are another way, but they too are beyond this tutorial's reach. (If you're interested, feel free to research those a bit.) So, we'll just be implementing a simple recursive descent parser.

The basic idea of a recursive descent parser is that you have a function for each type of thing you want to parse. So we'll have a function for parsing a program (list of statements), a function for parsing a single statement, a function for parsing a binary operator, and a function for parsing a "simple expression" (number literal or variable name). Then, each of those functions recursively calls other functions — so our statement parser might call the binary operator parser, which will call the simple expression, etc.

You can implement precedence by having different function "levels" — for instance, you could have a function which only parses the binary operators + and -, and then that function calls a function which only parses * and /.

Now that was all pretty theoretical, so it might be confusing at first. Let's start implementing it, I think it will make better sense that way.

I said, let's get parsing already!

Okay, okay! Remember how we implemented our lexer as "lex-as-you-go"? Well, there is one downside to that — how do we "peek" ahead to check what the next token is? We can't call scan() since that actually consumes the token... so how do we get around this?

What we can do is store two tokens in a "parser" struct — the next token and the current one. That way, we can peek ahead by just looking at the next token rather than consuming it.

So after our #include's in parser.c, let's create this parsing struct:

struct {
  struct Token next;
  struct Token current;
} parser;

After our struct, let's define some helper functions:

void advance() {
  parser.current = parser.next;
  parser.next = scan();
}

void consume(enum TokenType type, char* errmessage) {
  if(parser.next.type == type)
    advance();
  else {
    fprintf(stderr, "[line %d] Error at ", parser.next.line);
    if(parser.next.type == T_EOF)
      fprintf(stderr, "end");
    else if(parser.next.type == T_SEP)
      fprintf(stderr, "<newline>");
    else
      fprintf(stderr, "\"%s\"", parser.next.value);
    fprintf(stderr, ": %s\n", errmessage);
    exit(1);
  }
}

The advance function will move the parser one step forward in our token stream. We can then get the current token with parser.current or peek ahead to the next one with next.

The consume function is a little more complex. Essentially, it checks if the next token is of the given type. If not, it throws an error in the form of "[line <the line number>] Error at <the token>: <the error message>". The switching logic is to make newlines or EOF tokens look nicer.

Armed with our new functions, we can finally get parsing! Isn't this thrilling? ;)

Let's start by implementing our parse function (replace the dummy definition):

struct Program parse() {
  advance();

  struct Program prog = { .stmtnum = 0, .statements = (struct StmtNode**)malloc(1 * sizeof(struct StmtNode*)) };
  int initcap = 1;

  while(parser.next.type != T_EOF) {
    prog.statements[prog.stmtnum] = parse_statement();
    prog.stmtnum ++;

    if(prog.stmtnum >= initcap) {
      initcap *= 2;
      prog.statements = (struct StmtNode**)realloc(prog.statements, initcap * sizeof(struct StmtNode*));
    }
  }
  
  prog.statements = (struct StmtNode**)realloc(prog.statements, prog.stmtnum * sizeof(struct StmtNode*));

  return prog;
}

The first thing we do is advance(). This is to "prime" our parser struct — it starts out empty, so by advancing we put something in the next token slot.

Then we set up our program struct. We're allocating a statements array with enough space to hold 1 StmtNode*, and we record this in initcap.

Recall that a program is just a list of statements. So, we parse a statement (we have yet to define parse_statement but we'll do that next) and put it in our statements array. We increment our count of how many statements we have; if this is greater than or equal to our array capacity, we grow the array. We keep going until we see an EOF, at which point we shrink the array down to the bare minimum size and return the program struct.

Okay, now let's implement the aforementioned parse_statement() function (put this right before the parse() function definition):

struct StmtNode* parse_statement() {
  if(parser.next.type == T_VAR) {
    advance(); // VAR token

    consume(T_ID, "Expect identifier as variable name!");
    char* id = parser.current.value;
    consume(T_SEP, "Expect newline after statement!");

    struct StmtNode* varstmt = (struct StmtNode*)malloc(sizeof(struct StmtNode));
    *varstmt = (struct StmtNode){ .type = VAR_DECL, .as.var_decl = { .name = id } };
    return varstmt;
  }
  else if (parser.next.type == T_PRINT) {
    advance(); // PRINT token

    struct ExprNode* expression = parse_expression();
    consume(T_SEP, "Expect newline after statement!");
    
    struct StmtNode* printstmt = (struct StmtNode*)malloc(sizeof(struct StmtNode));
    *printstmt = (struct StmtNode){ .type = PRINT_STMT, .as.print_stmt = { .expression = expression } };
    return printstmt;
  }
  else if (parser.next.type == T_WHILE) {
    advance(); // WHILE token

    struct ExprNode* condition = parse_expression();
    consume(T_DO, "Expect DO after while statement condition!");
    if(parser.next.type == T_SEP) advance(); // optional newline

    struct StmtNode** statements = (struct StmtNode**)malloc(1 * sizeof(struct StmtNode*));
    int capacity = 1;
    int len = 0;

    while(parser.next.type != T_END) {
      statements[len] = parse_statement();
      len ++;

      if(len >= capacity) {
        capacity *= 2;
        statements = (struct StmtNode**)realloc(statements, capacity * sizeof(struct StmtNode*));
      }
    }

    statements = (struct StmtNode**)realloc(statements, len * sizeof(struct StmtNode*));

    advance(); // END
    consume(T_SEP, "Expect newline after while statement END!");

    struct StmtNode* whilestmt = (struct StmtNode*)malloc(sizeof(struct StmtNode));
    *whilestmt = (struct StmtNode){ .type = WHILE_STMT, .as.while_stmt = { .expression = condition, .statements = statements, .num = len } };
    return whilestmt;
  }
  else {
    struct ExprNode* expression = parse_expression();
    consume(T_SEP, "Expect newline after statement!");
    
    struct StmtNode* exprstmt = (struct StmtNode*)malloc(sizeof(struct StmtNode));
    *exprstmt = (struct StmtNode){ .type = EXPR_STMT, .as.expr_stmt = { .expression = expression } };
    return exprstmt;
  }
}

We're really getting into it now, so let's take this slowly.

  • First, we check if the next token is VAR. If it is, we know we're parsing a variable declaration. We skip past the VAR token and then expect an identifier (a name) and then a newline. Finally, we allocate a StmtNode and fill in its values (we're using *varstmt = { ... } because varstmt is a pointer, so we need to dereference it). We return our newly allocated node.
  • If it wasn't, we check if the next token is PRINT. The procedure here is much the same as before, except instead of accepting only an identifier we accept any kind of expression (since you should be able to print anything).
  • If it wasn't that as well, we check if the next token was WHILE. If so, we skip past the WHILE token and parse an expression as the while loop's condition. Then we expect a DO token to begin the statement block. The newline after the DO is optional — that way WHILE 1 DO PRINT 42 END is valid. Now we set up an array for our statements. You'll notice the code here is much the same as it was in the parse() function; that's because it's doing the same thing just with different variable names. So, we parse an array of statements up until the END token, and afterwards we expect a newline. Finally we fill in all the info into the AST node.
  • Finally, if it was none of those then we know it must be an expression statement. So, we simply parse an expression and plop it into an expression statement AST node.

That was a pretty big leap for parsing. Now all we need to do is parse expressions.

We can think of our expression parser as an onion with a bunch of layers. The innermost "core" is atomic expressions, like foo or 123 or (1 + 2 + 3). If you're wondering about the last one, that's considered an atomic expression because the parentheses make it one thing. All of them are one unit. The next layer is unary operators — we only have one, -, so that's things like -1 or -(2 + 3). Notice how the unary operator expression contains an atomic expression in it (the thing on the right), thus my onion analogy. Outside of that layer we have binary operators, like 1 + 2 or -foo + (1 + -2). The second example is important — notice how there's only one binary operator at the "top" level; -foo and (1 + -2) are both "lower-level" expressions. The final and outermost layer is the binary operator =, for x = 1 or x = 1 + -2. We put this operator by itself outside of the others because we want x = 1 + 2 to parse as x = (1 + 2) not (x = 1) + 2. (Think of it like this: 1 + 2 should have a higher level of "grouping" than x = 1 should.)

Aside: Other operators

If we had * and /, they would be in between the layer for +/- and unary operators. Why? Well, 1 + 2 * 3 should be parsed as 1 + (2 * 3), and that will only happen if * is a lower-level expression than +.
In general, operators with higher precedence should be lower-level (in our onion, closer to the core). Operators with lower precedence should be higher-level (in our onion, closer to the outer layer). This fits with where we put =.

The nice part about this Matryoshka doll approach is that it's modeled very nicely by functions. We can have a function for our binary operators (our outer layer) which can then call our function for unary operators (one layer lower) which can call our function for atomic expressions (one layer even lower). So let's get implementing!

Our outermost layer is =, so we'll implement that first. Above our definition for parse_statement, we'll define parse_expression:

struct ExprNode* parse_expression() {
  struct ExprNode* expr = parse_binop_expression();
  if(parser.next.type == T_EQ){
    char type = '=';
    advance();
    struct ExprNode* left = expr;
    struct ExprNode* right = parse_expression();
    
    expr = (struct ExprNode*)malloc(sizeof(struct ExprNode));
    *expr = (struct ExprNode){ .type = BINARY_OP, .as.binary_op = { .left = left, .right = right, .binop_type = type } };
  }
  return expr;
}

Here we parse a lower-level binop expression (a + or - one) and then check if the next token is an equals sign. If it is, we know we're parsing an assignment, so we set up an AST node and parse another expression as the right-hand-side. We're calling the function recursively since x = y = z is valid, and it should parse to x = (y = z).

Let's move on to + and -. Above parse_expression, we'll define parse_binop_expression:

struct ExprNode* parse_binop_expression() {
  struct ExprNode* expr = parse_unary_op_expression();
  while(parser.next.type == T_PLUS || parser.next.type == T_MINUS){
    char type = parser.next.type == T_PLUS ? '+' : '-';
    advance();
    struct ExprNode* left = expr;
    struct ExprNode* right = parse_unary_op_expression();
    
    expr = (struct ExprNode*)malloc(sizeof(struct ExprNode));
    *expr = (struct ExprNode){ .type = BINARY_OP, .as.binary_op = { .left = left, .right = right, .binop_type = type } };
  }
  return expr;
}

Here we parse a unary op expression (which is one level lower down) as our left hand side and call it expr. If there is in fact a binary operator (+ or -), then we store which kind it is and skip past it. Then we parse the right-hand-side, and update expr to be the binary operator node we just parsed. We keep doing this for as long as there's +'s or -'s, so that way 1 + 1 + 1 + 1 will parse as ((1 + 1) + 1) + 1 (which is correct).

Notice how we're using a while loop here rather than recursion as with =. This is because = is right-associative (x = y = z is x = (y = z)), while +/- is left-associative (1 - 1 - 1 is (1 - 1) - 1). The while loop mechanism represents left association best while the recursion mechanism represents right association best.

Time to implement parsing unary operators. Above our parse_binop_expression definition let's define parse_unary_op_expression:

struct ExprNode* parse_unary_op_expression() {
  struct ExprNode* expr = NULL; // dummy node
  while(parser.next.type == T_MINUS) {
    advance();
    struct ExprNode* oldexpr = expr;
    expr = (struct ExprNode*)malloc(sizeof(struct ExprNode));
    *expr = (struct ExprNode){ .type = UNARY_OP, .as.unary_op = { .left = oldexpr, .unop_type = '-' } };
  }

  if(expr == NULL) expr = parse_atomic_expression();
  else {
    struct ExprNode* traversal = expr;
    while(traversal->as.unary_op.left != NULL) {
      traversal = traversal->as.unary_op.left;
    }
    traversal->as.unary_op.left = parse_atomic_expression();
  }

  return expr;
}

Here, we first set up a dummy node since the unary operator comes before the actual expression. We then build our node by repeatedly checking if the next token is - and working recursively outwards. (Since there is only one type of unary operator, -, we just automatically set the type to that. If you add more, you'll want to check which type it is just like we did for binary operators.)

After there are no more -'s left, we do some logic to replace our dummy NULL node with the actual expression that comes next. Then we return our built node.

There's just one parsing task left — parsing atomic expressions, the "core" we talked about before. Let's do that right above parse_unary_op_expression:

struct ExprNode* parse_atomic_expression() {
  advance();
  switch(parser.current.type) {
    case T_ID: {
      struct ExprNode* varexpr = (struct ExprNode*)malloc(sizeof(struct ExprNode));
      *varexpr = (struct ExprNode){ .type = VAR_NAME, .as.var_name = { .name = parser.current.value } };
      return varexpr;
    }

    case T_NUM: {
      struct ExprNode* numexpr = (struct ExprNode*)malloc(sizeof(struct ExprNode));
      *numexpr = (struct ExprNode){ .type = NUM_LITERAL, .as.num_literal = { .num = atol(parser.current.value) } };
      return numexpr;
    }

    case T_LPAR: {
      struct ExprNode* expr = parse_expression();
      consume(T_RPAR, "Expect right parenthesis to end parenthesized expression!");
      return expr;
    }

    default:
      consume(-1, "Expect expression!");
      exit(1);
  }
}

We check which kind of token we're encountering. If it's a variable name, we construct an AST node accordingly. The process for a number literal is much the same, but note that atol converts a string to a long int (which is what we defined our number type to be). Finally, we have the logic for parentheses. Notice they don't actually make an AST node — they exist solely for parsing (all they do is grouping).

If it was none of those options, we raise an error. I'm adding the exit(1) so that clang doesn't complain about control reaching the end of a non-void function — however, as our consume function will throw an error regardless since the type is -1 (outside the bounds of our enum) we know this isn't necessary.

Speaking of parentheses, notice we used parse_expression? That's a recursive dependency, so right above our definition of parse_atomic_expression we need to add a forward declaration:

struct ExprNode* parse_expression(void);

Hooray! We've just finished our parser. That was quite a bit, so take a second to relish your achievement.

Let's test it out now. Back in main.c, let's update our main function (replace the old one):

int main() {
  char* filename = "source.kitteh";
  char* contents = read_file(filename);
  initializeScanner(contents);

  parse();

  return 0;
} 

We're ignoring the return value (which is the generated AST) for now, we'll use it in a bit. First we need to include our parser file — at the top let's add that to our include list (other includes shown for reference):

#include <stdlib.h>
#include <stdio.h>

#include "lexer.h"
#include "parser.h"

Okay, let's fire it up! Go ahead and run the program — if all goes well, the program should compile fine and it should produce nothing. This is because source.kitteh is empty! Let's fix that, and add the following:

VAR x // comment

WHILE x DO
  PRINT 1
  x - 2
END

x = (1 + 2) + -3

This should test the entire grammar. Copy and paste it into source.kitteh and hopefully the parser will be silent (not throw any parsing errors)!

IMPORTANT: Our simple parser requires a blank line after the last statement (since every statement needs a newline to end it). This is something you'd want to fix in your own language, but for brevity and simplicity I've chosen not to implement that. Therefore, make sure to add (at least one) blank newline at the end of your source.kitteh file from now on! (If you forget to, our handy parser error system will remind you.)

Now let's try deliberately introducing an error. Change the program to the following:

VAR x // comment

WHILE x DO
  PRINT 1 1
  x - 2
END

x = (1 + 2) + -3

Notice the duplicate 1 after the print statement. Let's see what happens:

[line 4] Error at "1": Expect newline after statement!

Great, our error system works! It knows that the 1 shouldn't belong there, and it knows which line it's on. Change the program back to the valid version (remove the 1) and let's see what AST our parser is building.

To do this, we'll need a way of pretty-printing an AST structure. Where better to implement this than a new file, ast.c:

#include <stdio.h>

#include "ast.h"

void print_expr_node(struct ExprNode node) {
  switch(node.type) {
    case NUM_LITERAL:
      printf("%ld", node.as.num_literal.num);
      break;
    case VAR_NAME:
      printf("ID ( %s )", node.as.var_name.name);
      break;
    case BINARY_OP:
      print_expr_node(*node.as.binary_op.left);
      printf(" %c ", node.as.binary_op.binop_type);
      print_expr_node(*node.as.binary_op.right);
      break;
    case UNARY_OP:
      printf("%c", node.as.unary_op.unop_type);
      print_expr_node(*node.as.unary_op.left);
      break;
  }
}

void print_stmt_node(struct StmtNode node) {
  switch(node.type) {
    case VAR_DECL:
      printf("VAR DECLARATION ( name = %s )\n", node.as.var_decl.name);
      break;
    case WHILE_STMT:
      printf("WHILE LOOP ( condition = ");
      print_expr_node(*node.as.while_stmt.expression);
      printf(" ) DO\n");
      for(int i = 0; i < node.as.while_stmt.num; i ++) {
        print_stmt_node(*node.as.while_stmt.statements[i]);
      }
      printf("END\n");
      break;
    case PRINT_STMT:
      printf("PRINT STATEMENT ( expr = ");
      print_expr_node(*node.as.print_stmt.expression);
      printf(" )\n");
      break;
    case EXPR_STMT:
      print_expr_node(*node.as.expr_stmt.expression);
      putchar('\n');
      break;
  }
}

void print_program(struct Program program) {
  for(int i = 0; i < program.stmtnum; i ++) {
    print_stmt_node(*program.statements[i]);
  }
}

There's nothing really magical going on here, we're just handling each type of node and each subtype. In ast.h, after our Program struct, let's make these functions public:

void print_program(struct Program);
void print_stmt_node(struct StmtNode);
void print_expr_node(struct ExprNode);

Now, let's change our main() function in main.c to print out the program. Replace the parse(); line with the following:

print_program(parse());

And at the top of the file, let's update our includes (old ones shown for reference):

#include <stdlib.h>
#include <stdio.h>

#include "lexer.h"
#include "parser.h"
#include "ast.h"

Okay, the time has come — run the program! If everything goes well, you should get this:

VAR DECLARATION ( name = x )
WHILE LOOP ( condition = ID ( x ) ) DO
PRINT STATEMENT ( expr = 1 )
ID ( x ) - 2
END
ID ( x ) = 1 + 2 + -3

Looking through that, it looks like it parsed the program correctly! The first statement is declaring a variable, x. The second statement is a while loop, with the condition being x, and the inner statements are a print statement (with 1 being the thing printed) and an expression statement (x - 2). Finally, after the loop we have an assignment, x being assigned to 1 + 2 + -3.

So pat yourself on the back, since you've just created a fully working parser!

That concludes part one of this tutorial. Stay tuned for part 2, where we'll be finishing our language by implementing bytecode generation and a way to run the bytecode!

All code can be found in the attached repl. If you have any questions, feel free to ask!

@ellipie this is my entry for the jam
Commentshotnewtop
realTronsi (782)

hmm always interesting to see how other people do it. I feel like your way is more of an "OOP" way if you know what I mean

DynamicSquid (4398)

@realTronsi ive always used functions, never thought of oop actually lol

fuzzyastrocat (1297)

@DynamicSquid @realTronsi This way is (in my opinion) pretty idiomatic C. You have your data structures (the AST, later will be the Value container, etc) as structs in the headers and your algorithms (the AST printer, the parsing mechanism) as functions in the implementation files.

I can see where you draw the parallel to OOP, because many of the algorithms are directly tied into the data structures (i.e. the AST pretty-printer). However I wouldn't consider this a full OOP approach since some algorithms (i.e. the parser mechanism) have no analogous data structure.

realTronsi (782)

@fuzzyastrocat not full OOP but kind of gave me that feel

jort57 (14)

Cool project! I was just wondering how you could connect the code to a text file inside the repl, where you would code, like you version of Python in C. How would that work? (And is there a way to do that in python? Because I suck at C Because I'm more fluent in python.) Thanks!

fuzzyastrocat (1297)

@jort57 Thank you! I'm very confused by what you mean by

I was just wondering how you could connect the code to a text file inside the repl, where you would code, like you version of Python in C.

Could you perhaps elaborate?

And everything I've done with this tutorial can be done in Python — Python will (in general) be a lot easier than C since it's much more abstracted.

jort57 (14)

@fuzzyastrocat what i meant was the text file in your C project where you would type in your python code. how do you do that? does it just magically connect?

fuzzyastrocat (1297)

@jort57 There's two things I think you could be asking here, so I'll answer both.

  • "How do you have a text file in a C repl?" Repls are basically like little mini computers, so just because I have a C repl doesn't mean I can't have a text file in it. And just like a normal computer, the C code will be able to access the text file as if it's in the same directory. There's no limits on what kind of files you can put in a repl.
  • "How does the C code read the text file?" This is accomplished with the read_file function in the "Further Ado" section near the beginning. I don't explain the function's workings in this tutorial, since it's a short function and it's irrelevant to the subject of the tutorial. If you're familiar with C you should be able to figure out how it works; if not, check out this similar function on a stackoverflow answer.
Bookie0 (4607)

(For those unacquainted, "kitteh" is lolspeak for "kitten".

OH WOWZ KITTEHS!

pretty complicated (for me lol), but nevertheless interesting! :)

fuzzyastrocat (1297)

@Bookie0 HAO CUUD I REZIST?!?

Thank you! Yeah, it's a bit complicated, but that's because making a language is a complex task :)

Bookie0 (4607)

@fuzzyastrocat

CHEZBURGERZZZ IN SIGHT ATTAK!!

yea lol, I better stick with my simpler math stuff/tutorials :)

DynamicSquid (4398)

Quick question, what is the visitor pattern for an AST? I can't seem to find about it on the internet, but I've seen a lot of languages that have it

fuzzyastrocat (1297)

@DynamicSquid I'm not sure what you mean by the "visitor pattern". Could you elaborate?

DynamicSquid (4398)

@fuzzyastrocat It's for travelling an AST. Because right now I just visit the children node recursively, but is there a better way?

fuzzyastrocat (1297)

@DynamicSquid That's the way to travel a tree.

DynamicSquid (4398)

@fuzzyastrocat Oh, it's probably what I do with each node. Like it should look like this right:

struct Node // interface
{
  Data data;

  Value Eval(); // evaluates children
};

struct Binary : public Node
{
  Binary left, right;

  Value Eval()
  {
     switch (data)
     {
       case PLUS:
         return left.Eval() + right.Eval();
     }
  }
};
fuzzyastrocat (1297)

@DynamicSquid Yep. The way the tree is traversed should depend on the type of node, just like you're doing there.

DynamicSquid (4398)

@fuzzyastrocat Oh okay. Oh you know what surprises me? Night v3.2 had 2500 lines, but Night v4 only has 2200 right now, yet it has way more features, a lot more error checking, and I've split the code up into header files as well... I guess good code is also short code

fuzzyastrocat (1297)

@DynamicSquid
Good code has power
Clean code will be elegant
Good code is clean code
- trashy haiku by me, 2020

DynamicSquid (4398)

@fuzzyastrocat Hmm... how am I supposed to type check function parameters?

fuzzyastrocat (1297)

@DynamicSquid With your dynamically typed language? This is where the thonking complex part comes in. I can't tell you exactly how to implement it since I don't know how you've structured the project and I'm not exactly sure how your language will work, but I can tell you what to do.

Let's do this by example:

var x = 1;
if(<condition>)
  x = "hi";
function(x);

The task here is for your program to deduce that at the point of the function call, x has either type string or int. You could do this by adding types to x's "list of types" as you parse through... there's other ways too. So then you typecheck — does the first parameter in function accept both string and int?

Let's look at a more complex one.

def function(x) {
  if x < 2 {
    return 1;
  } else {
    return "hi";
  }
}

z = function("hi")
q = function(1)

another_function(q)

We have a few things here. First, when we parse through the function definition we need to figure out two things:

  • What kind of parameters can it accept?
  • What can it return?

The return value is clearly either int or string, so that's fairly easy. But what is its parameter type? What we can do is a form of duck typing, where an object is judged not by its exact type but what you can do with it. In our function, we see that we use x < 2 — this means that x has to be some type for which <(x, int) is defined. Since that's all we have on x, that's what we should record as function's parameter type.

So now moving on to the calls. We first assign z to function("hi"). Typechecking, we see: is <(string, int) (since the parameter is of type string) defined? I'm guessing it's not, so we fail there.

Assuming we didn't fail, let's continue on. Typechecking q's assignment — is <(int, int) defined? Yes, it is, so we're fine there. Now we know the return value is int or string, so q now has type union { int, string }.

Now, we can typecheck the call of another_function. The important part is that we need to typecheck the parameter with both int and string. So first we check if the parameter works being an int, then we check if it works being a string.

So you can see how complex this gets even for just a simple example. But that's the difficulty in typechecking a dynamically typed language.

DynamicSquid (4398)

@fuzzyastrocat Yikes :/ I'll see how it works out, thanks!

fuzzyastrocat (1297)

@DynamicSquid Now you see why I was talking about no one ever doing this before :)

Hope it works for you!

EpicGamer007 (801)

yesssssss. frickin high quality. i want a C++ one tho, it should be fairly similar, correct?

fuzzyastrocat (1297)

@DynamicSquid @EpicGamer007 All of this code is valid C++ code (I'm not using any C-specific features), so theoretically this tutorial works for C++ too. However the idiomatic C++ approach would be a lot more object-oriented.

xxpertHacker (555)

Wait, there's a "TutorialJam"!?

elipie (306)

alr, I kinda feel bad by giving you second place, should I give you one month of discord nitro too?

fuzzyastrocat (1297)

@elipie I don't use discord, so don't worry :D

Coder100 (11135)

Nice! Just a question: why is not using

typedef struct E { ... } E;

is there a problem with typedefing?

fuzzyastrocat (1297)

@Coder100 Thanks! I generally use struct X because you can't reference a typedef inside its body and that can just get confusing in the context of a tutorial. (Why is it just X some places but struct X in others?)