Hello guys!
As the title says, today I will show you how to make a simple parser in Python using Parsimonious, a Python parsing library.
Disclaimer
If you are reading this tutorial, you may want to build a programming language, but this tutorial won't do the amazing stuff you see in other tutorials. It will only show you how to make a real parser, which means it will parse your code and do nothing else.
Ok, if you are ready to start, let's go!
The design of our language
In this tutorial, we will make a parser for our own simple language, called Cotton. Its design will look like this:
[ x = 120 ]
[ y = "Hello world!" ]
[ print x ]
[ print y ]
Now you know what our language will look like. Let's get started!
Installation
First of all, you need to install Parsimonious. Type the following in your terminal to install it:
pip install parsimonious
Now, in your directory, create a Python file called parser.py. It will contain all of our code. Then open it using your favourite editor/IDE. Mine is Neovim.
First, in parser.py, import the Grammar class from parsimonious.grammar. This helps us make the grammar of our language.
from parsimonious.grammar import Grammar
Now, we will declare a variable called grammar that will contain, well, our grammar.
grammar = Grammar(r"""
# The grammar here
""")
Replace the # The grammar here part with our grammar:
expr = (statement / emptyline)* # Main part
emptyline = ws+ # Matches empty lines
ws = ~"\s*" # Matches whitespace
# Square brackets
lpar = "[" # Matches the left one
rpar = "]" # Matches the right one
statement = lpar ws? things ws? rpar ws? # Statement
things = (print / declare)* # Commands
print = "print" ws types # The print command
declare = varname ws? equal ws? types ws? # The declare command
varname = ~"[A-Za-z_][A-Za-z0-9_]*" # Matches a variable name
equal = ws? "=" ws? # Matches the equal sign
# float comes before int: alternatives are tried in order, so int would match only the 1 of 1.5
types = (float / int / string / varname)* # Data types
# Int, float and string
int = ~"\d+"
float = ~"\d+\.\d+"
string = ~'"[^\"]+"'
Now, let's try it!
# Our test code
code = '''
[ x = "Hello" ]
[ y = 120 ]
[ print x ]
[ print y ]
'''
# Print the parse result
print(grammar.parse(code))
Now, when you run the program, if nothing is wrong, you should see a node tree like this:
<Node called "expr" matching "
[ x = "Hello" ]
[ y = 120 ]
[ print x ]
[ print y ]
">
    <Node matching "
    ">
        <Node called "emptyline" matching "
        ">
            <RegexNode called "ws" matching "
            ">
            <RegexNode called "ws" matching "">
    <Node matching "[ x = "Hello" ]
    ">
        <Node called "statement" matching "[ x = "Hello" ]
        ">
            <Node called "lpar" matching "[">
            <Node matching " ">
                <RegexNode called "ws" matching " ">
            <Node called "things" matching "x = "Hello" ">
                <Node matching "x = "Hello" ">
                    <Node called "declare" matching "x = "Hello" ">
                        <RegexNode called "varname" matching "x">
                        <Node matching " ">
                            <RegexNode called "ws" matching " ">
                        <Node called "equal" matching "= ">
                            <Node matching "">
                                <RegexNode called "ws" matching "">
                            <Node matching "=">
                            <Node matching " ">
                                <RegexNode called "ws" matching " ">
                        <Node matching "">
                            <RegexNode called "ws" matching "">
                        <Node called "types" matching ""Hello"">
                            <Node matching ""Hello"">
                                <RegexNode called "string" matching ""Hello"">
                        <Node matching " ">
                            <RegexNode called "ws" matching " ">
            <Node matching "">
                <RegexNode called "ws" matching "">
            <Node called "rpar" matching "]">
            <Node matching "
            ">
                <RegexNode called "ws" matching "
                ">
    <Node matching "[ y = 120 ]
    ">
        <Node called "statement" matching "[ y = 120 ]
        ">
            <Node called "lpar" matching "[">
            <Node matching " ">
                <RegexNode called "ws" matching " ">
            <Node called "things" matching "y = 120 ">
                <Node matching "y = 120 ">
                    <Node called "declare" matching "y = 120 ">
                        <RegexNode called "varname" matching "y">
                        <Node matching " ">
                            <RegexNode called "ws" matching " ">
                        <Node called "equal" matching "= ">
                            <Node matching "">
                                <RegexNode called "ws" matching "">
                            <Node matching "=">
                            <Node matching " ">
                                <RegexNode called "ws" matching " ">
                        <Node matching "">
                            <RegexNode called "ws" matching "">
                        <Node called "types" matching "120">
                            <Node matching "120">
                                <RegexNode called "int" matching "120">
                        <Node matching " ">
                            <RegexNode called "ws" matching " ">
            <Node matching "">
                <RegexNode called "ws" matching "">
            <Node called "rpar" matching "]">
            <Node matching "
            ">
                <RegexNode called "ws" matching "
                ">
    <Node matching "[ print x ]
    ">
        <Node called "statement" matching "[ print x ]
        ">
            <Node called "lpar" matching "[">
            <Node matching " ">
                <RegexNode called "ws" matching " ">
            <Node called "things" matching "print x">
                <Node matching "print x">
                    <Node called "print" matching "print x">
                        <Node matching "print">
                        <RegexNode called "ws" matching " ">
                        <Node called "types" matching "x">
                            <Node matching "x">
                                <RegexNode called "varname" matching "x">
            <Node matching " ">
                <RegexNode called "ws" matching " ">
            <Node called "rpar" matching "]">
            <Node matching "
            ">
                <RegexNode called "ws" matching "
                ">
    <Node matching "[ print y ]
    ">
        <Node called "statement" matching "[ print y ]
        ">
            <Node called "lpar" matching "[">
            <Node matching " ">
                <RegexNode called "ws" matching " ">
            <Node called "things" matching "print y">
                <Node matching "print y">
                    <Node called "print" matching "print y">
                        <Node matching "print">
                        <RegexNode called "ws" matching " ">
                        <Node called "types" matching "y">
                            <Node matching "y">
                                <RegexNode called "varname" matching "y">
            <Node matching " ">
                <RegexNode called "ws" matching " ">
            <Node called "rpar" matching "]">
            <Node matching "
            ">
                <RegexNode called "ws" matching "
                ">
That's a pretty large node tree, right?
Conclusion
From this tutorial, you have learned how to make your own parser in Python. That's the end of my tutorial. Have a nice day, coders! :D
Pretty cool! I'm gonna have to try this!
Nice! Could you explain the grammar part a little more?
I thought squids didn't use grammar @DynamicSquid
@DynamicSquid Sure.
The "/" that you see in some rules has the same meaning as "|" in regex. It means "or": the parser tries the next alternative if the first one doesn't match.
The "~" means "this part of the grammar is a regex". It is followed by a string that contains the regex.
The quoted parts in some grammar rules are called literals (according to the documentation).
The "?" after ws in some grammar rules means "the ws part is optional".
Other syntax (like "()") works pretty much the same as in regex. You can check out the documentation for more information.
@Wumi4 ah okay, cool!
Mine just stops, see here:
https://repl.it/@Elderosa/Cotton#parser.py