
Making your own programming language with Python
Scoder12 (646)


Why make your own language?

When you write your own programming language, you control the entire programmer experience.
You get to shape exactly how each aspect of your language works and how a developer interacts with it, so you can keep the features you like from other languages and leave out the ones you don't.
In addition, learning about programming language internals can help you better understand the internals of the languages you use every day, which can make you a better programmer.

How programming languages work

Every programming language runs differently, but many share a couple of fundamental steps: lexing and parsing.

Introduction to Lexing

Lexing is short for LEXical analysis.
The lex step is where the language takes the raw code you've written and converts it into an easily parsable structure.
This step interprets the syntax of your language and turns the text into special symbols called tokens.
For example, let's say you have some code you want to parse. To keep it simple I'll use Python-like syntax, but it could be anything. It doesn't even have to be text.

# this is a comment
a = (1 + 1)

A lexer to parse this code might do the following:

  • Discard all comments
  • Produce a token that represents a variable name
  • Produce left and right parenthesis tokens
  • Convert literals like numbers or strings to tokens
  • Produce tokens for math operations like + - * / (and maybe bitwise/logical operators as well)

The lexer will take the raw code and interpret it into a list of tokens.
The lexer also ensures that two pieces of code that look different, like 1 + 1 and 1+1, are still parsed the same way.

For the code above, it might generate tokens like this:

NAME(a) EQUALS LPAREN NUMBER(1) PLUS NUMBER(1) RPAREN

Tokens can take many forms, but the main idea is that they are a standard, easy-to-parse way of representing the code.
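For example, one simple way to picture the token stream for the code above (just an illustration; it isn't exactly what PLY produces) is a list of (type, value) pairs in Python:

# Hypothetical representation of the lexer output for `a = (1 + 1)`
token_stream = [
    ("NAME", "a"),
    ("EQUALS", "="),
    ("LPAREN", "("),
    ("NUMBER", 1),
    ("PLUS", "+"),
    ("NUMBER", 1),
    ("RPAREN", ")"),
]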

Introduction to Parsing

The parser is the next step in the running of your language.
Now that the lexer has turned the text into consistent tokens, the parser simplifies and executes them.
Parser rules recognize a sequence of tokens and do something about them.
Let's look at a simple example for a parser with the same tokens as above.

A simple parser could just say:

  • If I see a GREET token and then a NAME token, print Hello, followed by the name.

A more complicated parser aiming to parse the code above might have these rules, which we will explore later:

  • Try to classify as much code as possible as an expression. By "as much code as possible" I mean the parser will first try to treat a full mathematical operation as an expression, and if that fails, treat a single variable or number as an expression. This ensures that as much code as possible is matched as an expression. The "expression" concept allows us to catch many patterns of tokens with one piece of code. We will use the expression in the next step.
  • Now that we have a concept of an expression, we can tell the parser that if it sees the tokens NAME EQUALS and then an expression, that means a variable is being assigned (see the preview sketch after this list).
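To make this concrete, here is a purely hypothetical preview of what those two rules will eventually look like as PLY grammar rules (we'll build up to this notation later; the names here are illustrative, not code to run yet):

def p_expression_binop(t):
    'expression : expression PLUS expression'
    # combine t[1] and t[3] using the operator

def p_assignment(t):
    'statement : NAME EQUALS expression'
    # store the expression's value under the variable's name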

Using PLY to write your language

What is PLY?

Now that we know the basics of lexing and parsing, let's start writing some Python code to do it.
PLY stands for Python Lex-Yacc.
It is a library you can use to make your own programming language with Python.
Lex is a well-known tool for writing lexers.
Yacc stands for "Yet Another Compiler Compiler": a compiler-compiler is a program that takes a description of a language (a grammar) and generates a parser for it.

This tutorial is a short example, but the PLY documentation is an amazing resource with tons of examples. I would highly recommend that you check it out if you are using PLY.

For this example, we are going to be building a simple calculator with variables. If you want to see the fully completed example, you can fork this repl: [TODO!!]

Lexing with PLY lex

Lexer tokens

Let's start our example! Fire up a new Python repl and follow along with the code samples.
To start off, we need to import PLY:

from ply import lex, yacc

Now let's define our first token. PLY requires you to have a tokens list which contains every token the lexer can produce. Let's define our first token, PLUS for the plus sign:

tokens = [
    'PLUS',
]

t_PLUS = r'\+'

A string that looks like r'' is special in Python. The r prefix means "raw", which tells Python to treat backslashes literally instead of as escape characters. For example, to define the string \+ in Python, you could write either '\\+' or r'\+'. We are going to be using a lot of backslashes, so raw strings make things a lot easier.
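For example, both of these literals produce the same two-character string:

# A raw string and a manually escaped string are equivalent here
print('\\+' == r'\+')  # True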

But what does \+ mean?
Well in the lexer, tokens are mainly parsed using regexes.
A regex is like a special programming language specifically for matching patterns in text.
A great resource for regexes is regex101.com where you can test your regexes with syntax highlighting and see explanations of each part.

I'm going to explain the regexes included in this tutorial, but if you want to learn more you can play around with regex101 or read one of the many good regex tutorials on the internet.

The regex \+ means "match a single + character".
We have to put a backslash before it because + normally has a special meaning in regex, so we have to "escape" it to show we want to match a + literally.
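You can check this with Python's built-in re module:

import re

# \+ matches a literal plus sign
print(re.findall(r'\+', '1 + 1'))  # ['+']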

We are also required to define a function that runs when the lexer encounters an error:

def t_error(t):
    print(f"Illegal character {t.value[0]!r}")
    t.lexer.skip(1)

This function just prints a warning when the lexer hits a character it doesn't recognize and then skips it (the !r means repr, so the character is printed with quotes around it).
You can change this to do whatever you want in your language, though.

Optionally, you can define a newline rule, which doesn't produce a token in the lexer's output but keeps track of the current line number.

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

Since this token is defined as a function, we can put the regex in the function's docstring instead.
The function takes a parameter t, which is a special object representing the match the lexer found. We can access the lexer through the t.lexer attribute.

This function matches one or more newline characters and then increases the line number by the number of newlines it sees. This lets the lexer know what line number it's on at all times via the lexer.lineno variable.

Now we can use the line number in our error function:

def t_error(t):
    print(f"Illegal character {t.value[0]!r} on line {t.lexer.lineno}")
    t.lexer.skip(1)

Let's test out the lexer!
This is just some temporary code; you don't have to know exactly what it does, because once we implement a parser, the parser will run the lexer for you.

lexer = lex.lex()
lexer.input('+')

for token in lexer:
    print(token)

Play around with the value passed to lexer.input.
You should notice that any character other than a plus sign makes the error message print, but doesn't crash the program.
In your language, you can make it gracefully ignore lex errors like this, or make it stop running, by editing the t_error function.
If you add more lines to the input string, the line number in the error message should change.
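For example, a stricter variant (just a sketch, not part of our running example) could raise an exception instead of skipping the character, stopping the lexer on the first bad input:

def t_error(t):
    # Stop immediately instead of skipping the unrecognized character
    raise SyntaxError(f"Illegal character {t.value[0]!r} on line {t.lexer.lineno}")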

More complicated tokens

Let's delete the test token and add some more complicated tokens.
Replace your tokens list and the t_PLUS line with the following code:

reserved_tokens = {
    'greet': 'GREET'
}

tokens = list(reserved_tokens.values()) + [
    'SPACE',
    'NAME',
]

t_SPACE = r'[ ]'

def t_NAME(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    if t.value in reserved_tokens:
        t.type = reserved_tokens[t.value]
    else:
        t.type = 'NAME'
    return t

Let's explore the regex we have in the t_NAME function.
This regex is more complicated than the simple ones we've used before.

First, we have [a-zA-Z_]. This is a character class in regex. It means: match any lowercase letter, uppercase letter, or underscore.
Next we have [a-zA-Z0-9_]. This is the same as above, except numbers are also included.
Finally, we have *. This means "repeat the previous group or class zero or more times".

Why do we structure the regex like this?
Having two separate classes ensures that the first character must match the stricter class for the whole thing to be a valid name.
Because numbers are excluded from the first class, the regex won't match a plain number, and it ensures a variable name can't start with a number.
You can still have numbers later in the name, because they are matched by the second class of the regex.

In the code, we first have a dictionary of reserved names.
This is a mapping from each reserved word to the token type it should produce.
The only entry we have says that greet should be mapped to the GREET token.

The code that sets up the tokens list takes all of the possible reserved token values, which in this example is just ['GREET'], and adds on ['SPACE', 'NAME'], giving us ['GREET', 'SPACE', 'NAME'] automatically!

But why do we have to do this? Couldn't we just use something like the following code?

# Don't use this code! It doesn't work!
t_GREET = r'greet'
t_SPACE = r'[ ]'
t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

Actually, if we used that code, greet would never be matched as GREET! The lexer would match it with the NAME token instead, because PLY sorts string rules by decreasing regex length, so the longer NAME pattern gets tried first. To avoid this, we define a new kind of token rule, which is a function. The function has the regex as its docstring and is passed a t parameter. This parameter has a value attribute, which is the text that was matched.
The code inside this function simply checks whether the value is one of the special reserved names we defined before. If it is, we set the type attribute of the t parameter, which controls the type of token produced from the match. When it sees the name greet, it finds greet in the reserved names dictionary and produces a token of type GREET, because that is the corresponding value in the dictionary. Otherwise, it produces a NAME token, because the text is a regular variable name.

This makes it easy to add more reserved words later; it's as simple as adding another entry to the dictionary.

If needed, you could also make the keys of the reserved names dictionary regexes and then match each regex against t.value in the function.
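Here is a rough sketch of that idea (the reserved_patterns dictionary and its greet(ing)? pattern are made up for illustration):

import re

reserved_patterns = {
    r'greet(ing)?': 'GREET',  # hypothetical: matches "greet" or "greeting"
}

def t_NAME(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    t.type = 'NAME'
    # Check each reserved pattern against the matched text
    for pattern, toktype in reserved_patterns.items():
        if re.fullmatch(pattern, t.value):
            t.type = toktype
            break
    return t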

If you want to change these rules for your language, feel free!

Parsing with PLY yacc

Fair warning: yacc can sometimes be hard to use and debug, even if you know Python well.
Keep in mind that you don't have to use both lex and yacc; if you want, you can just use lex and then write your own code to parse the tokens.
With that said, let's get started.

Yacc basics

Before we get started, delete the lexer testing code (everything from lexer.input onward).
When we run the parser, the lexer is run automatically.

Let's add our first parser rule!

def p_hello(t):
    'statement : GREET SPACE NAME'
    print(list(t))
    print(f"Hello, {t[3]}")

Let's break this down.
Again, we have the information about the rule in the docstring.
This notation is called a BNF grammar. A grammar rule consists of a non-terminal on the left-hand side (here, statement) and a sequence of terminals, i.e. tokens, on the right-hand side (here, GREET SPACE NAME).
The left-hand side describes what is produced by the rule, and the right-hand side describes what matches the rule.
The right-hand side can also contain non-terminals; just be careful to avoid infinite loops.
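As a purely hypothetical illustration (not part of our running example), a rule whose right-hand side contains a non-terminal might look like this: a program is either a single statement, or a program followed by another statement. The recursion is safe because each repetition consumes at least one statement's worth of tokens.

def p_program(t):
    '''program : statement
               | program statement'''
    pass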

Basically, the yacc parser works by pushing tokens onto a stack, then looking at the current stack and the next token to see whether they match any rule it can use to simplify them. The PLY documentation has a more in-depth explanation and example of this process.

Before the above example can run, we still have to add some more code.
Just like for the lexer, the error handler is required:

def p_error(t):
    if t is None: # lexer error, already handled
        return
    print(f"Syntax Error: {t.value!r}")

Now let's create and run the parser:

parser = yacc.yacc()

parser.parse('greet replit')

If you run this code you should see:

[None, 'greet', ' ', 'replit']
Hello, replit

The first line is the list version of the object passed to the parser function.
The first value is the slot for whatever the rule produces; we haven't set it, so it is None.
Next, we have the values of the tokens we specified in the rule.
This is where the t[3] part comes from: it is item 3 of that list (counting from the None at index 0), which is the NAME token's value, so our parser prints out Hello, replit!
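If you want a rule to produce a value that other rules (or the final parser.parse call) can use, assign it to t[0]. Here's a small optional variation on the rule above (not required for the example):

def p_hello(t):
    'statement : GREET SPACE NAME'
    # The assigned value becomes the result of this rule
    t[0] = f"Hello, {t[3]}"

With this version, parser.parse('greet replit') returns the string instead of printing it.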

Note: Creating the parser tables is a relatively expensive operation, so the parser writes a file called parsetab.py that it can load the tables from if the grammar hasn't changed.
You can change this filename by passing a kwarg to the yacc initialization, like parser = yacc.yacc(tabmodule='fooparsetab').

More complicated parsing: Calculator

This example is different from our running example, so I will just show a full code example and explain it.

from ply import lex, yacc

tokens = (
    'NUMBER',
    'PLUS', 'MINUS', 'TIMES', 'DIVIDE',
    'LPAREN', 'RPAREN',
)

t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'


def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print(f"Integer value too large: {t.value}")
        t.value = 0
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    print(f"Illegal character {t.value[0]!r} on line {t.lexer.lineno}")
    t.lexer.skip(1)

t_ignore = ' \t'

lexer = lex.lex()

# Parsing

def p_expression_binop(t):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    if t[2] == '+'  : t[0] = t[1] + t[3]
    elif t[2] == '-': t[0] = t[1] - t[3]
    elif t[2] == '*': t[0] = t[1] * t[3]
    elif t[2] == '/': t[0] = t[1] / t[3]

def p_expression_group(t):
    'expression : LPAREN expression RPAREN'
    t[0] = t[2]

def p_expression_number(t):
    'expression : NUMBER'
    t[0] = t[1]

def p_error(t):
    if t is None: # lexer error
        return
    print(f"Syntax Error: {t.value!r}")

parser = yacc.yacc()

if __name__ == "__main__":
    while True:
        inp = input("> ")
        print(parser.parse(inp))

First we start off with the tokens: numbers, mathematical operations, and parentheses.
You might notice that I didn't use the reserved_tokens trick, but you can implement it if you want.

Next we have a simple number token, which matches one or more digits with \d+ and then converts its value from a string to an integer.

The next piece of code we haven't used before is t_ignore.
This variable is a string of characters the lexer should ignore between tokens; here it is ' \t', meaning spaces and tabs.
When the lexer sees these, it just skips them. This allows users to add spaces without affecting how the code is lexed.

Now we have three parser rules.

The first is a large one, producing an expression from four possible inputs, one for each math operation.
Each input has an expression on either side of the math operator.
Inside this rule, we have some (pretty ugly) code that performs the correct operation based on the operator token given.
If you want to make this prettier, consider a dictionary built with Python's stdlib operator module, as in the sketch below.
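Here's a minimal sketch of that idea (the BINOPS name is mine, not from the original code):

import operator

# Map each operator's text to the corresponding function from the operator module
BINOPS = {
    '+': operator.add,
    '-': operator.sub,
    '*': operator.mul,
    '/': operator.truediv,
}

def p_expression_binop(t):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    t[0] = BINOPS[t[2]](t[1], t[3])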

Next, we define an expression with parentheses around it as being the same as the expression inside.
The inner expression's value gets substituted in for the parenthesized group, so the inside is evaluated first.
With very little code we created a rule that handles nested parentheses correctly.

Finally, we define a plain number as an expression, which allows a number to be used as one of the expressions in the first rule.
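One more thing worth knowing: because the binop rule puts an expression on both sides of every operator, the grammar is ambiguous, and yacc will report shift/reduce conflicts rather than applying the usual operator precedence on its own. The PLY docs cover this in detail; the standard fix is a precedence table declared before calling yacc.yacc(), something like:

# * and / bind tighter than + and -; all four are left-associative
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)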

For a challenge, try adding variables into this calculator!
You should be able to set variables by using syntax like varname = any_expression and you should be able to use variables in expressions.
If you're stuck, see one solution from the PLY docs.

That's it!

Thanks for reading! If you have questions, feel free to ask in the Replit Discord's #help-and-reviews channel, or in the comments below.
Have fun!

Comments
generationXcode (162)

I'm not sure I'll end up doing this, but this is amazing. Great job! epic

sugarfi (486)

If anyone is interested, I would really recommend checking out sly. It's a great lexing+parsing library for Python, and it lets you skip a lot of the harder parts of lang dev.

HaydenAmundrud (0)

@sugarfi I took a look at sly and Its really useful im still trying to decide between it or ply

AbhayBhat (246)

Amazing job! Wish there was something like this in java.

Scoder12 (646)

@AbhayBhat I also made one for javascript if you like that better

AbhayBhat (246)

@Scoder12, Lol. I know more python than js. So this would probably be better lol

Jakman (375)

@AbhayBhat java is real slow. If you made another language in it just imagine how slow it would be.

AbhayBhat (246)

@Jakman, Uhhh.... Python is actually slower. Search it up

Jakman (375)

@AbhayBhat I knew that. I meant that interpreting another lang in java would make this lang slow

Leroy01010 (55)

i am not able to do it because of multiple errors

HaydenAmundrud (0)

When I try the one with the reserved tokens the token name is not defined.

Scoder12 (646)

@HaydenAmundrud you have to define it first

HaydenAmundrud (0)

@Scoder12 so you have to add name to reserved tokens

Scoder12 (646)

@HaydenAmundrud you have to add a token called name

HaydenAmundrud (0)

@Scoder12 thank you for all the help im really new to this subject but ive challenged myself to make a programming language. Can you explain what I could do for varibles and functions?

generationXcode (162)

is there something like this in JULIA?????

generationXcode (162)

The ID part of the lexer may not work... You have to add 'ID' to the tokens

generationXcode (162)

the ply docs are in python 2.7. Good old python 2 haha

HaydenAmundrud (0)

Maybe im missing something but on the lexer.input line it throws the error undefined name 'lexer'.

Scoder12 (646)

@HaydenAmundrud try adding lexer = lex.lex() before that

HarperframeInc (276)

The documentation is very unclear, maybe you could help by providing examples yourself?

On top of this, it seems like there are better lexer/parser packages like ply with better documentation.

ApoorvAgrawal (39)

I have a feeling that any programming language made in python will automatically be horrendously slow

Warhawk947 (523)

Why make your own language?

Because you get a chance to win 10000 fricking dollars

HahaYes (1036)

yay a good post. but python kinda slow so I guess time to use cpp

Jakman (375)

I have long yearned for a good python post.

Jakman (375)

Don't forget this language will be very slow. Incredibly slow.