Learn to Code via Tutorials on Repl.it!

← Back to all posts
Making your own programming language with NodeJS
h
Scoder12 (646)

Making your own programming language with NodeJS

What is Ohm?

Ohm is a javascript library for creating parsers. It uses its own grammar in an ohm file combined with javascript logic for parsing and evaluation.

Introduction to the Ohm Grammar

The Ohm Grammar is a variant of "Parsing Expression Grammars" (PEGs). The Ohm grammar is how you specify how code in your language should be parsed.

Here is a handy reference for the syntax of such files.
Also, the Ohm interactive editor is super helpful for visualizing and debugging your grammars.

Syntax of Ohm Grammars

Here is a simple example of an Ohm grammar:

Arithmetic {
  Expr = "1 + 1"
}

The syntax does have many features, which I am about to explain.
If you just want to see a practical example, skip to the next section.

This example represents a grammar called "Arithmetic", which has a rule called "Expr".

The left hand side of the rule is the "rule name".
In the rule name, capitalization matters!
If the rule name starts with an uppercase letter, then it is called a "Syntactic rule" and whitespace such as spaces and tabs are ignored inside the rule.
Otherwise if it starts with a lowercase letter, it is called a "lexical rule".
Keep this in mind if you want to parse whitespace yourself.

If you want your grammar to inherit rules from a supergrammar, you use the syntax grammarName <: superGrammar { ... }.
Inheritance is useful because it allows you to reuse and extend rules.

You can also define a description of the rule using (), like this:

ident (an indentifier)
    = letter

The description will be used to show users error messages.
You can also define the description inline using -- description after the rule.

You can also define paramaters for the rule, like ruleName<arg1, ..., argN>, for example: Repeat<x> = x x

Now lets break down the definiton of a rule:

The middle is an = sign, which defines a new rule.
If its okay for it to already exist, you must use := instead, and there is += which is same as expr | oldValue (expr OR oldvalue).
These symbols are useful when inheriting.

The right hand side is the "rule body", consisting of "parsing expressions" which are used to match the rule to text.

There are a few types of parsing expressions that you can use in your rules:

First there are terminals, which are matched literally. They have to be in quotes like "hello there".

There are also "terminal ranges", which are ranges from one terminal character to another. For example, "a".."c" matches a, b, or c.

You can also have a rule be composed of other rules by invoking the using ruleName.
Rules can also sometimes have parameters of parsing expressions, which allows you to reuse the same rule in multiple cases. You can pass a rule paramaters the same way they are defined, for example: ListOf<field, ";">.

Rules can also have "repetition operators", much like those in regex.
Just like is regex, * is zero or more, + is one or more, | means alternation, and ?, is optional or succeeds whether or not the expression matches or not.

Continuing the similarity to regex, ohm has lookaheads.
A lookahead is defined with & expr, and is the same as expr except it doesn't consume characters.
A negative lookahead only succeeds if expr cannot be matched, and also doesn't consume any characters.
You can define a negative lookahead using ~ expr.
A good example is the builtin end rule, which is the same as ~any.
This can only match at the end because it checks that any character will NOT match, which can only happen if no charcters are left to be consumed.

You can chain these rules together in sequence by separating them with spaces.

Finally, you can match a rule in a "lexical context", meaning, don't skip spaces, using the syntax # rule.

Ohm has many built in rules, such as any character, letter and more specific upper and lower case versions, space, end of the string, digit for 0-9 and hexDigit for 0-9 A-F.
Also, there is alnum or ALpha NUMeric, which is letter | digit (letter or digit).

You can read more about the grammar rule syntax and other builtin rules in the syntax manual.

A basic Ohm Grammar

Now that we know the syntax of Ohm grammars, lets implement our own and the associated nodejs code to run it.
For this example, I will be doing the same thing as the "Basic example" in the Ohm Readme: parsing CSV.

If you're not aware, CSV is a simple textual format for defining data that is in spreadsheet-like columns.
Sometimes the first row is used to describe what each column's value is, but this isn't always used.
Here is an example of a basic CSV data using the first row as description:

user,repls,cycles
Bob,15,7
Alice,20,35
Jane,50,22

As you can see, each line in the file is a row and columns are separated by commas.
The CSV format's simplicity makes it perfect for our first Ohm parser.

First we will define the grammar, then the nodejs code to run it.

Open up the Ohm interactive editor.
The editor has a window on the left for editing the grammar, a window on the right for tests for the grammar, and a window on the bottom for a visualization.

Paste this definition into the editor:

CSV {
  csv = row (eol ~end row)* eol?
  row = col ("," col)*
  col = colChar*
  colChar = ~(eol | ",") any
  eol = "\r"? "\n"
}

The order we write the rules in matters, because that is the order in which Ohm will try to match them: it starts with the top one and moves downward till it finds a match.
The order that they are arranged in doesn't make as much sense to explain because the rules at the top use the ones at the bottom, so I will show the full example then explain the rules in reverse.

First, we have the eol or "end of line" rule, which matches the end of a line.
It optionally consumes an \\r or carriage return character before the \\n because windows uses CRLF (\\r\\n) while unix uses just \\n.

Next we have the rule colChar which matches characters that are valid within a column.
This should be any character, but it can't match the column separators (","s) or ends of lines, otherwise it would overflow and match more than it should.

We also have the col rule, which is pretty straightforward.
It matches zero to unlimited colorChars.

Next, we can make a row out of at least one col and then zero to unlimited more instances of , then col.
This matches both a single column and unlimited more columns after it.

Finally, we can define the whole csv rule.
This one is a little more complicated.
It must have at least one row first.
After that, we repeat the following pattern zero to unlimited times:
An eol after the row, then if its not the end of the input, ~end, match another row.
Finally, the file can optionally end with a single eol.

This rule uses ~end inside the repeated part to ensure that the parser doesn't try to match another row after the final eol in the file.

An important thing to keep is mind is that all of the rule names in this example are in lower case.
Like I said before, this is because in CSV, spaces matter because they are included inside the column that they are a part of, and rules that are capitalized skip all spaces.

Try adding some test cases on the right side of the editor to make sure the parser works right.

Extra challenge:
CSVs also use quotes to encode values that contain commas. Heres an example:

value one,"this value has commas: apple, banana, orange",value 3

See if you can add this to your parser.

Now lets write up the code to run this parser.
Open up a new Nodejs repl and make a new file called csv.ohm.

Using the Ohm grammar with nodejs

Now lets write some code to use our brand new Ohm Grammar.

Open up a new Nodejs repl, and make a new file called csv.ohm, and copy and paste the grammar you just used into it.
If you need a reminder, here it is:

CSV {
  csv = row (eol ~end row)* eol?
  row = col ("," col)*
  col = colChar*
  colChar = ~(eol | ",") any
  eol = "\r"? "\n"
}

Now in index.js, lets import the node filesystem library and ohm, and load our grammar:

const fs = require('fs')
const ohm = require('ohm-js')

// load grammar
const grammarData = fs.readFileSync('csv.ohm')
const grammar = ohm.grammar(grammarData)

Now that we have the grammar loaded, lets try parsing something:

const input = "1,2"
const result = grammar.match(input)

if (result.succeeded()) {
    console.log("Matching Succeeded")
} else {
    console.log("Matching Failed")
}

If you run the repl, it should print out "succeeded".
Because this grammar is so simple, it probably won't ever fail, but it is important to handle errors in your own language when they occur.

What is this match result?
The match result returned by grammar.match() is an object representing how the input was matched according to the rules given.
The match result divides the input into a tree that looks like the one you see in the visualizer:

The way that we parse this tree of nodes is by adding "semantic actions".
Semantic actions define what to do with valid input parsed with grammar.match().

Lets go through an example of a semantic action. Say you have a CSV file and want to turn it into a 2d array of rows and columns.
Lets make a semantic action called toArray that will do this.

const semantics = grammar.createSemantics()

semantics.addOperation(
    'toArray',
    {
        csv: () => {
            console.log('test')
        }
    }
)

console.log(semantics(result).toArray())

What does this do?

It adds a new action, called toArray, that you can use on matches.
We pass in the name of the operation and a mapping of rule names to handler functions to semantics.addOperation.

Then, we can register use the semantics with our match using semantics(result) and then calling our action name as a function.

Lets give the code some more interesting input and run it.

Change input to be one,two\nthree,four\n and run the code.

We get this confusing error:

Error: Found errors in the action dictionary of the 'toArray' operation:
- Semantic action 'csv' has the wrong arity: expected 4, got 0

What this really means is that the handler function for the csv rule should have exactly 4 arguments. Lets see what those arguments are:

...
csv: (a, b, c, d) => {
    console.log(a, b, c, d)
}
...

If we run this, we get a lot of text!
We can see that the first argument is a row rule from the ctorName of its node, so lets name it as such.
Also, lets filter the output to just be the underlying "_node" of the remaning mystery arguments:

...
csv: (firstRow, b, c, d) => {
    console.log(b._node, c._node, d._node)
}
...

Now we can see that the second and last arguments have only one child which is an eol.
Also, the third argument has a child which is a row. So now lets name all the arguments:

...
csv: (firstRow, firstEol, rowNodes, lastEol) => {

}
...

The arguments do make sense, considering the definition for this rule: row (eol ~end row)* eol?.

Actions are also recursive. This means that we can just call .toArray() on each row instead of doing all of the logic in the CSV handler:

...
csv: (firstRow, firstEol, rowNodes, lastEol) => {
    const rows = [firstRow].concat(rowNodes.children)
    return rows
},
row: () => {

},
...

Now we repeat the debugging process of finding each argument with row. I'll spare you the trouble and give you the code. Its very simple to the last rule, because we are again passing down to the col rule:

...
row: (firstCol, comma, otherColNodes) => {
    const cols = [firstCol].concat(otherColNodes.children)
    return cols.map(col => col.toArray())
},
...

For the column code, all we want to do is get the raw text that the parser matched.
We can only do this for the "terminal" nodes, which are the lowest level. In order to get all of the text from a non-terminal node, we can use a recursive function that descends down the node's children and gets their primitiveValue attribute, which is the raw text, if they are of the terminal type.

Define this above the call to addOperation:

// recursive function to get the source of a non-terminal node
const node_to_source = node => {
    if (node.ctorName == "_terminal") {
        return node.primitiveValue
    } else {
        return node.children.map(n => node_to_source(n)).join('')
    }
}

Now, in the column handler, we can call this function:

...
col: (charNode) => {
    return node_to_source(charNode._node)
}
...

This completes the example. Here is the full code:

const semantics = grammar.createSemantics()

// recursive function to get the source of a non-terminal node
const node_to_source = node => {
    if (node.ctorName == "_terminal") {
        return node.primitiveValue
    } else {
        return node.children.map(n => node_to_source(n)).join('')
    }
}

semantics.addOperation(
    'toArray',
    {
        csv: (firstRow, firstRowEol, otherRowNodes, eol) => {
            const rows = [firstRow].concat(otherRowNodes.children)
            return rows.map(row => row.toArray())
        },
        row: (firstCol, comma, otherColNodes) => {
            const cols = [firstCol].concat(otherColNodes.children)
            return cols.map(col => col.toArray())
        },
        col: (charNode) => {
            return node_to_source(charNode._node)
        }
    }
)

console.log(semantics(result).toArray())

It should print this in the console:

[ [ 'one', 'two' ], [ 'three', 'four' ] ]

Congratulations! You made your first parser with Ohm!

To add on to the parser, try making it read from a csv file on disk or ask the user for input using the prompt function.

More advanced example

If you want to see a calculator example, refer to the repl attached at the bottom of this post.

It is taken from the awesome Ohm example which has even more examples such as tree and lisp generation!

Wrapping it up

Thanks for reading! If you have questions, feel free to ask on the Replit discord's #help-and-reviews channel, or just the comments.
Have fun!

Commentshotnewtop
DynamicSquid (2629)

Me: write that down write that down