Learn to Code via Tutorials on Repl.it!

Make your own language, in whatever language you choose: Pt 2. The Lexer
doineednumbers (16)

Let's lex! A lexer (or scanner, or whatever you want to call it) transforms this: print("hello world") into this:
[Type: identifier, contents: "print"], [Type: left parenthesis, contents: "("], [Type: string, contents: "hello world"], [Type: right parenthesis, contents: ")"]. A lexer isn't strictly necessary, but it makes the code a lot easier to parse. Let's write some base code.
Side note: Our pseudocode language has classes, enums, and builtin strings, but you can try to translate this to another language however you think works.

enum TokenType
    ID,
    LPAREN,
    STRING,
    RPAREN,
    // etc
class Token
    public
        TokenType type
        string contents
        function Token(TokenType _type, string _contents)
            type = _type
            contents = _contents
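If you're following along in Python, here's one possible translation of the base code above (an illustrative sketch, not the tutorial's official code): `enum.Enum` stands in for the pseudocode enum, and a small class pairs a token's kind with its text.

```python
from enum import Enum, auto

class TokenType(Enum):
    ID = auto()
    LPAREN = auto()
    STRING = auto()
    RPAREN = auto()
    # etc.

class Token:
    def __init__(self, type_, contents):
        self.type = type_         # which kind of token this is
        self.contents = contents  # the exact text it was built from

tok = Token(TokenType.ID, "print")
```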

Ok, have you implemented the class in your language of choice with whatever tokens are necessary? Here's an example for Shimmer, a programming language I'm working on with @beaver700nh:

enum TokenType
    ID,
    STRING,
    INT,
    LPAREN,
    LBRACE,
    RPAREN,
    RBRACE,
    COMMA

Shimmer is pretty minimalistic, but you get the idea. You'll want regex support in your language of choice, because our lexer will use it. Our lexer is a simple state machine. This code will cover strings, ints, identifiers, and single characters (like '('). mk_token uses a case/switch statement, which is easy to replace, and it throws an error on a bad state, but that part is optional.
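In Python, the `matches(pattern, character)` helper the pseudocode below relies on can be a thin wrapper over the standard `re` module (the helper name just mirrors the pseudocode; it isn't a built-in):

```python
import re

def matches(pattern, ch):
    # fullmatch ensures the single character matches the whole pattern
    return re.fullmatch(pattern, ch) is not None

print(matches("[a-zA-Z]", "p"))  # True
print(matches("[0-9]", "p"))     # False
```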

enum State
    NONE,
    ID,
    STRING,
    INT
function mk_token(State state, string contents)
    switch state 
        case ID
            return Token(ID, contents)
        case STRING
            return Token(STRING, contents)
        case INT
            return Token(INT, contents)
        case NONE
            throw("Error: Cannot create a token without a proper state")
function lex(string to_lex) -> Token[]
    State state = NONE
    string contents = ""
    Token[] tokens
    for char i in to_lex
        if state == STRING
            if i == '"'
                tokens.push(Token(STRING, contents))
                contents = ""
                state = NONE
            else
                contents.push(i)
        else if state == ID && matches("[a-zA-Z0-9_]", i)
            contents.push(i)
        else if state == INT && matches("[0-9]", i)
            contents.push(i)
        else
            // The current run (if any) just ended, so flush it first
            if state != NONE
                tokens.push(mk_token(state, contents))
                contents = ""
                state = NONE
            if i == '"'
                state = STRING
            else if matches("[a-zA-Z]", i)
                state = ID
                contents.push(i)
            else if matches("[0-9]", i)
                state = INT
                contents.push(i)
            else if i == "("
                tokens.push(Token(LPAREN, i))
            else if i == ")"
                tokens.push(Token(RPAREN, i))
            // Whitespace and anything unhandled falls through here.
            // You get the gist of it. We can easily add on more characters,
            // and comments won't be hard either
    if state != NONE
        tokens.push(mk_token(state, contents)) // flush a trailing token
    return tokens
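Putting it all together in Python, the state machine above might look like this. This is a self-contained sketch under my own assumptions (the token set, the `re`-based character checks, and the `__repr__` formatting are mine, not the tutorial's official code):

```python
import re
from enum import Enum, auto

class TokenType(Enum):
    ID = auto()
    STRING = auto()
    INT = auto()
    LPAREN = auto()
    RPAREN = auto()

class State(Enum):
    NONE = auto()
    ID = auto()
    STRING = auto()
    INT = auto()

class Token:
    def __init__(self, type_, contents):
        self.type = type_
        self.contents = contents
    def __repr__(self):
        return f"[Type: {self.type.name}, contents: {self.contents!r}]"

def mk_token(state, contents):
    # Map a lexer state to the token type it produces
    mapping = {State.ID: TokenType.ID,
               State.STRING: TokenType.STRING,
               State.INT: TokenType.INT}
    if state not in mapping:
        raise ValueError("Cannot create a token without a proper state")
    return Token(mapping[state], contents)

def lex(to_lex):
    state = State.NONE
    contents = ""
    tokens = []
    for c in to_lex:
        if state == State.STRING:
            if c == '"':
                tokens.append(Token(TokenType.STRING, contents))
                contents, state = "", State.NONE
            else:
                contents += c
        elif state == State.ID and re.fullmatch(r"[a-zA-Z0-9_]", c):
            contents += c
        elif state == State.INT and re.fullmatch(r"[0-9]", c):
            contents += c
        else:
            # The current run (if any) just ended, so flush it first
            if state != State.NONE:
                tokens.append(mk_token(state, contents))
                contents, state = "", State.NONE
            if c == '"':
                state = State.STRING
            elif re.fullmatch(r"[a-zA-Z]", c):
                state = State.ID
                contents += c
            elif re.fullmatch(r"[0-9]", c):
                state = State.INT
                contents += c
            elif c == "(":
                tokens.append(Token(TokenType.LPAREN, c))
            elif c == ")":
                tokens.append(Token(TokenType.RPAREN, c))
            # whitespace and anything unhandled is skipped in this sketch
    if state != State.NONE:
        tokens.append(mk_token(state, contents))  # flush a trailing token
    return tokens

print(lex('print("hello world")'))
```

Running it on `print("hello world")` yields the four tokens from the start of the tutorial: an identifier, a left parenthesis, a string, and a right parenthesis.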

To recap: We made a class for tokens and created a function called lex that takes a string and picks out tokens from it. To accomplish this, we wrote a state machine and a helper function that crafts a token from a state and its contents.

Comments
JBYT27 (1322)

Pretty cool tutorial!

doineednumbers (16)

@JBYT27 Are you following along? If so, what language are you using to implement it?

JBYT27 (1322)

kinda, not fully XD. Well, i'm planning on using python @doineednumbers

doineednumbers (16)

Add a comment if you encounter any bugs or have a question.