Make your own language, in whatever language you choose: Pt 2. The Lexer
Let's lex! A lexer (or scanner, or whatever you want to call it) transforms this:
print('hello world')
into this:
[Type: identifier, contents: "print"], [Type: left parenthesis, contents: "("], [Type: string, contents: "hello world"], [Type: right parenthesis, contents: ")"]. A lexer is not strictly necessary, but it makes the code a lot easier to parse. Let's write some base code.
Side note: Our pseudocode language has classes, enums, and builtin strings, but you can translate this to another language however works best for you.
enum TokenType
    ID, LPAREN, STRING, RPAREN, // etc

class Token
    public TokenType type
    string contents

    function Token(TokenType _type, string _contents)
        type = _type
        contents = _contents
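As a concrete reference, here is one way that base code might translate to Python (just a sketch; the exact set of token types is up to you):

```python
from dataclasses import dataclass
from enum import Enum, auto

class TokenType(Enum):
    ID = auto()
    LPAREN = auto()
    STRING = auto()
    RPAREN = auto()
    # etc.

# dataclass gives us the constructor and equality for free
@dataclass
class Token:
    type: TokenType
    contents: str

print(Token(TokenType.ID, "print"))
```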
Ok, have you implemented the class in your language of choice, with whatever tokens are necessary? Here's an example for Shimmer, a programming language I'm working on with @beaver700nh:
enum TokenType
    ID, STRING, INT, LPAREN, LBRACE, RPAREN, RBRACE, COMMA
Shimmer is pretty minimalistic, but you get the idea. You'll want regex support in the language of your choice, because our lexer will use it. Our lexer is a simple state machine. This code covers strings, ints, identifiers, and single characters (like '('). mk_token uses a case/switch, which is easily replaceable, and it throws an error on a bad state, though that part is optional.
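In Python, for example, the matches(pattern, char) helper the pseudocode below relies on can be a thin wrapper over the re module (the helper name is ours, mirroring the pseudocode):

```python
import re

def matches(pattern, ch):
    # fullmatch so a pattern like "[0-9]" tests the single character exactly
    return re.fullmatch(pattern, ch) is not None

print(matches("[a-zA-Z]", "x"))  # True
print(matches("[0-9]", "x"))     # False
```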
enum State
    NONE, ID, STRING, INT

function mk_token(State state, string contents)
    switch state
        case ID
            return Token(ID, contents)
        case STRING
            return Token(STRING, contents)
        case INT
            return Token(INT, contents)
        case NONE
            throw("Error: Cannot create a token without a proper state")

function lex(string to_lex) -> Token[]
    State state = NONE
    string contents = ""
    Token[] tokens
    for string i in to_lex
        if state == STRING
            if i == '"'
                tokens.push(Token(STRING, contents))
                contents = ""
                state = NONE
            else
                contents.push(i)
        else if state == ID && matches("[a-zA-Z0-9_]", i)
            contents.push(i)
        else if state == INT && matches("[0-9]", i)
            contents.push(i)
        else if i == '"'
            if state != NONE
                tokens.push(mk_token(state, contents))
                contents = ""
            state = STRING
        else if matches("[a-zA-Z]", i)
            state = ID
            contents.push(i)
        else if matches("[0-9]", i)
            state = INT
            contents.push(i)
        else if i == "("
            if state != NONE
                tokens.push(mk_token(state, contents))
                contents = ""
                state = NONE
            tokens.push(Token(LPAREN, "("))
        else if i == ")"
            if state != NONE
                tokens.push(mk_token(state, contents))
                contents = ""
                state = NONE
            tokens.push(Token(RPAREN, ")"))
    if state != NONE
        tokens.push(mk_token(state, contents))
    return tokens
// You get the gist of it. We can easily add more characters (and whitespace handling), and comments won't be hard either.
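To make the state machine concrete, here's a runnable Python translation of the whole thing. It's a sketch, not the one true implementation: it uses single-quoted strings so it can lex the print('hello world') example, and names like mk_token and State mirror the pseudocode.

```python
import re
from dataclasses import dataclass
from enum import Enum, auto

class TokenType(Enum):
    ID = auto()
    STRING = auto()
    INT = auto()
    LPAREN = auto()
    RPAREN = auto()

@dataclass
class Token:
    type: TokenType
    contents: str

class State(Enum):
    NONE = auto()
    ID = auto()
    STRING = auto()
    INT = auto()

STATE_TO_TOKEN = {State.ID: TokenType.ID,
                  State.STRING: TokenType.STRING,
                  State.INT: TokenType.INT}

def mk_token(state, contents):
    if state == State.NONE:
        raise ValueError("Cannot create a token without a proper state")
    return Token(STATE_TO_TOKEN[state], contents)

def lex(to_lex):
    state = State.NONE
    contents = ""
    tokens = []
    for ch in to_lex:
        if state == State.STRING:
            if ch == "'":  # closing quote ends the string token
                tokens.append(Token(TokenType.STRING, contents))
                contents, state = "", State.NONE
            else:
                contents += ch
        elif state == State.ID and re.fullmatch(r"[A-Za-z0-9_]", ch):
            contents += ch
        elif state == State.INT and re.fullmatch(r"[0-9]", ch):
            contents += ch
        else:
            # The current char ends whatever token we were building
            if state != State.NONE:
                tokens.append(mk_token(state, contents))
                contents, state = "", State.NONE
            if ch == "'":
                state = State.STRING
            elif re.fullmatch(r"[A-Za-z_]", ch):
                state, contents = State.ID, ch
            elif re.fullmatch(r"[0-9]", ch):
                state, contents = State.INT, ch
            elif ch == "(":
                tokens.append(Token(TokenType.LPAREN, "("))
            elif ch == ")":
                tokens.append(Token(TokenType.RPAREN, ")"))
            # anything else (whitespace, etc.) is simply skipped
    if state != State.NONE:  # flush a token left over at end of input
        tokens.append(mk_token(state, contents))
    return tokens

print([(t.type.name, t.contents) for t in lex("print('hello world')")])
# [('ID', 'print'), ('LPAREN', '('), ('STRING', 'hello world'), ('RPAREN', ')')]
```

Running it on our example from the top of the post produces exactly the token stream we wanted: an identifier, a left parenthesis, a string, and a right parenthesis.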
To recap: We made a class for tokens, and created a function called lex that takes a string and picks out tokens from it. To accomplish this, we wrote a state machine and a helper function that crafts a token from a state and its contents.