Let's lex! A lexer (or scanner, or whatever you want to call it) transforms this:

print('hello world')

into this:

[Type: identifier, contents: "print"], [Type: left parenthesis, contents: "("], [Type: string, contents: "hello world"], [Type: right parenthesis, contents: ")"]

A lexer isn't strictly necessary, but it makes code a lot easier to parse. Let's write some base code.

Side note: Our pseudocode language has classes, enums, and built-in strings, but you can translate it to another language however you see fit.
enum TokenType
    ID,
    LPAREN,
    STRING,
    RPAREN,
    // etc.

class Token
    public
        TokenType type
        string contents

    function Token(TokenType _type, string _contents)
        type = _type
        contents = _contents
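If you want to follow along in a real language, here is one possible Python translation of the enum and class above. This is just a sketch; the `__repr__` format mimicking the "[Type: ..., contents: ...]" notation is my own addition, not part of the original pseudocode.

```python
from enum import Enum, auto

class TokenType(Enum):
    ID = auto()
    LPAREN = auto()
    STRING = auto()
    RPAREN = auto()
    INT = auto()
    # etc.

class Token:
    def __init__(self, type_, contents):
        self.type = type_
        self.contents = contents

    def __repr__(self):
        # Mirrors the "[Type: ..., contents: ...]" notation used above
        return f"[Type: {self.type.name}, contents: {self.contents!r}]"
```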
Ok, have you implemented the class in your language of choice with whatever tokens are necessary? Here's an example for Shimmer, a programming language I'm working on with @beaver700nh:
Shimmer is pretty minimalistic, but you get the idea. You'll want regexes in your language of choice, because our lexer will use them. Our lexer is a simple state machine. This code will cover strings, ints, identifiers, and single characters (like '('). mk_token uses a case/switch, which is easily replaceable; it also throws an error on an invalid state, but that part is optional.
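The matches helper used in the lexer below can be a thin wrapper over your language's regex engine. Here's a possible Python sketch (the name matches is taken from the pseudocode; everything else is an assumption about how you'd wire it up):

```python
import re

def matches(pattern, ch):
    # True if the single character ch matches the regex pattern.
    # fullmatch ensures the whole character matches, not just a prefix.
    return re.fullmatch(pattern, ch) is not None
```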
enum State
NONE,
ID,
STRING,
INT
function mk_token(State state, string contents)
    switch state
        case ID
            return Token(TokenType.ID, contents)
        case STRING
            return Token(TokenType.STRING, contents)
        case INT
            return Token(TokenType.INT, contents)
        case NONE
            throw("Error: Cannot create a token without a proper state")
function lex(string to_lex) -> Token[]
    State state = NONE
    string contents
    Token[] tokens

    for char i in to_lex
        if state == STRING
            if i == '"'
                tokens.push(Token(TokenType.STRING, contents))
                contents = ""
                state = NONE
            else
                contents.push(i)
        else if state == ID && matches("[a-zA-Z0-9_]", i)
            contents.push(i)
        else if state == INT && matches("[0-9]", i)
            contents.push(i)
        else if matches("[a-zA-Z]", i)
            state = ID
            contents.push(i)
        else if matches("[0-9]", i)
            state = INT
            contents.push(i)
        else if i == '"'
            if state != NONE
                tokens.push(mk_token(state, contents))
                contents = ""
            state = STRING
        else if i == "("
            if state != NONE
                tokens.push(mk_token(state, contents))
                contents = ""
                state = NONE
            tokens.push(Token(TokenType.LPAREN, "("))
        else if i == ")"
            if state != NONE
                tokens.push(mk_token(state, contents))
                contents = ""
                state = NONE
            tokens.push(Token(TokenType.RPAREN, ")"))

    // Flush a trailing identifier or int
    if state != NONE
        tokens.push(mk_token(state, contents))

    return tokens
// You get the gist of it. We can easily add on more characters, and comments won't be hard either
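The state machine above can be written out as a runnable Python sketch. To keep it self-contained and short, (type, contents) tuples stand in for the Token class; the opening-quote branch and the final flush are the same fixes described in the pseudocode, and everything else follows it directly:

```python
import re
from enum import Enum, auto

class State(Enum):
    NONE = auto()
    ID = auto()
    STRING = auto()
    INT = auto()

def mk_token(state, contents):
    if state is State.NONE:
        raise ValueError("Cannot create a token without a proper state")
    return (state.name, contents)  # tuples stand in for the Token class

def lex(to_lex):
    state = State.NONE
    contents = ""
    tokens = []
    for c in to_lex:
        if state is State.STRING:
            if c == '"':  # closing quote ends the string token
                tokens.append(("STRING", contents))
                contents = ""
                state = State.NONE
            else:
                contents += c
        elif state is State.ID and re.fullmatch("[a-zA-Z0-9_]", c):
            contents += c
        elif state is State.INT and re.fullmatch("[0-9]", c):
            contents += c
        elif re.fullmatch("[a-zA-Z]", c):  # a letter starts an identifier
            state = State.ID
            contents += c
        elif re.fullmatch("[0-9]", c):     # a digit starts an int
            state = State.INT
            contents += c
        elif c == '"':                      # an opening quote starts a string
            if state is not State.NONE:
                tokens.append(mk_token(state, contents))
                contents = ""
            state = State.STRING
        elif c == "(":
            if state is not State.NONE:     # flush any in-progress token first
                tokens.append(mk_token(state, contents))
                contents = ""
                state = State.NONE
            tokens.append(("LPAREN", "("))
        elif c == ")":
            if state is not State.NONE:
                tokens.append(mk_token(state, contents))
                contents = ""
                state = State.NONE
            tokens.append(("RPAREN", ")"))
    if state is not State.NONE:             # flush a trailing identifier/int
        tokens.append(mk_token(state, contents))
    return tokens
```

For example, `lex('print("hi")')` yields `[("ID", "print"), ("LPAREN", "("), ("STRING", "hi"), ("RPAREN", ")")]`.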
To recap: we made a class for tokens and wrote a function called lex that takes a string and picks out tokens from it. To accomplish this, we built a state machine and a helper function that crafts a token from a state and its contents.
The previous post is here: https://repl.it/talk/learn/Write-Your-Own-Language-In-Whatever-Language-You-Choose-Pt-1/78425