CCC - C++ Compiler Compiler
spyinthehole (2)

Have you ever wanted to make a programming language but did not know where to start?

I introduce CCC.

About

CCC is a transpiler built in C++ for C++. CCC is built using EasyParser and EasyLexer, both of which I made for the jam. These allow for a quick way to set up a parser and lexer. CCC came about because, when starting the jam, we were using the libraries but found them still pretty confusing to use. CCC was inspired by JavaCC, hence the name, which stands for C++ Compiler Compiler. Unfortunately, CCC is limited by EasyParser and is therefore only LL(1); when I update EasyParser to LL(k), I will update CCC too. Fun fact: after first writing CCC I was not happy with it, so I rewrote it using the CCC that was already made.

Usage

CCC is very simple and has very little syntax. However, once you understand the syntax, you can make powerful languages.

Each statement, excluding sections, needs to end with ;.

Sections

A CCC file is made of three sections: Include, Tokens and Productions. Each of these sections has to exist, in the order listed above.

They are defined like this:

Include {

}

Tokens {

}

Productions {

}

Include Section

The Include section lists all the header files you want CCC to use in its output. The header files can be given in either <> or "" form. You must include the EasyParser and EasyLexer headers. In the example, and in CCC itself, EasyLexer is included by EasyParser; however, the GitHub version of EasyParser does not include it.

Example:

Include {
    "EasyParser/EasyParser.h";
    <map>;
    <stack>;
}

Tokens Section

The Tokens section contains a list of token names and the regular expressions that extract them. It also states whether each token should be ignored. EasyLexer uses this section to extract tokens.

Tokens higher up the list have higher priority (this is explained in EasyLexer).

Tokens are declared with their name, followed by the regular expression that will be used to extract that token. The regular expression must be escaped as it would be inside a C++ string literal (for example, \\s+ rather than \s+), because it is copied directly into the generated C++.

To ignore a token (again, this is explained in EasyLexer), add the ignore keyword before declaring the token.

Example:

Tokens {
    <keyword_if> := "if";
    ignore <whitespace> := "\\s+";
}
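
As a rough illustration of how priority, ignore and escaping could fit together, here is a minimal C++ sketch. It is not EasyLexer's actual API: the Rule struct, the lex function and the extra name token are made up for this example, and trying the rules in declaration order is only an assumption about how priority is resolved.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical rule record; EasyLexer's real types may look different.
struct Rule {
    std::string name;
    std::regex pattern;
    bool ignore;
};

// Returns (token name, matched text) pairs for the given input.
std::vector<std::pair<std::string, std::string>> lex(const std::string& input) {
    // Rules are listed in priority order, like the Tokens section.
    const std::vector<Rule> rules = {
        {"keyword_if", std::regex("if"), false},
        {"whitespace", std::regex("\\s+"), true},  // the C++ literal "\\s+" is the regex \s+
        {"name", std::regex("[a-z]+"), false},     // hypothetical extra token
    };

    std::vector<std::pair<std::string, std::string>> tokens;
    auto pos = input.cbegin();
    while (pos != input.cend()) {
        bool matched = false;
        for (const Rule& rule : rules) {
            std::smatch m;
            // match_continuous forces the match to start exactly at pos.
            if (std::regex_search(pos, input.cend(), m, rule.pattern,
                                  std::regex_constants::match_continuous) &&
                m.length(0) > 0) {
                // Ignored tokens are consumed but never reported.
                if (!rule.ignore) tokens.emplace_back(rule.name, m.str(0));
                pos += m.length(0);
                matched = true;
                break;  // first matching rule wins
            }
        }
        if (!matched) break;  // unknown character: a real lexer would report an error
    }
    return tokens;
}
```

With these rules, lex("if x") gives keyword_if followed by name, and the whitespace is dropped. Under this first-match scheme, lex("iffy") gives keyword_if followed by a name token for "fy", which shows why the order of the list matters.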

Productions

Productions are the main part of CCC. They define how tokens should be combined to form the language.

A production uses a mix of tokens and other productions to define the structure of the syntax.

Note: You cannot leave a production empty; it needs to contain at least one sequence, token or production.

Sequence

A sequence is a list of tokens or productions that must appear in the order given.

Example:

<sequence> :=
(
    <A> <B> <C>
);

Or

An or is a set of sequences, tokens or productions of which only one path will be traversed. When using or and sequences together, make sure that they are separated using brackets.

Example:

<or> :=
(
    <A> | <B> | <C>
);

Epsilon

The special character ε defines that the expected token should be nothing. It is therefore mainly used to mark the end of a production and move on to the next one listed. It has some special behaviours which allow some neat traversal of the productions; these can be seen in the test.ccc examples. Note: if using this in an or, make sure it is the last option.
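
To show what "expect nothing and move on" means in practice, here is a small hand-written recursive-descent sketch in C++. It is not generated by CCC and does not use EasyParser; the grammar and the parse_list function are hypothetical. One plausible reason for keeping ε last, assuming alternatives are tried in order, is that ε always succeeds, so anything listed after it would never be reached.

```cpp
#include <cstddef>
#include <string>

// Grammar sketched here (written informally, not in CCC syntax):
//   list := 'a' list | epsilon
// Returns how many 'a' characters were consumed starting at pos.
std::size_t parse_list(const std::string& input, std::size_t pos = 0) {
    // First alternative: chosen only when the single lookahead character is 'a'.
    if (pos < input.size() && input[pos] == 'a') {
        return 1 + parse_list(input, pos + 1);
    }
    // Epsilon alternative: matches nothing, consumes nothing, always succeeds.
    return 0;
}
```

For example, parse_list("aaab") consumes the three leading 'a' characters and then takes the epsilon branch at 'b', leaving the rest of the input for whatever comes next.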

Interaction

Using the tools listed above, you should now be able to create the syntax for the language. However, you need to add some functionality.

This is where the interaction comes in. Each production can be given a function that runs once the production has been successfully parsed. This function can take variables defined in the production as parameters, and can return a value which can be used in other productions.

Adding a variable

To add a variable, you just need to add a name followed by = before any token or production.

Example:

<variable> :=
(
    variable_name = <name>
);

Adding a function

After you have finished the production, add -> followed by a name and a list of variables that were defined in the production to attach a function. You can also add a second -> afterwards to give a return type. The return type follows similar rules to C++, with namespaces (::) and generics (<>), but does not support references (&) or pointers (*). If no type is declared, it defaults to void. This return type is why the Include section is important: if the header that declares the type is missing, the output file will fail to compile.

Examples:

<basic_function> :=
(
    <A>
) -> on_basic();

<variable_function> :=
(
    a = <A> b = <B>
) -> on_variable(a,b);

<return_function> :=
(
    <A>
) -> on_return() -> std::stack<std::string>;

Token Variable

If you add a variable to a token rather than a production, the type will be Token. This type is from EasyLexer and will provide information about the token.

Variables:

  • token - Provides what type the token is, used to match against the enum Tokens in the header file.
  • value - Provides the raw string value that was extracted.
  • hasNext - Provides a way to tell whether this token is valid, true if valid.
  • line_number - Provides the line number the token started on (used for error reporting).
  • start_character - Provides the character number on the line that the token started on (used for error reporting).
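
As a sketch of how these fields might be used for error reporting, here is a small self-contained C++ example. The Token struct below is a stand-in, not EasyLexer's real definition: the field names come from the list above, but their types are guesses, and the Tokens enum entries and the describe function are hypothetical.

```cpp
#include <string>

// Stand-in for the generated enum; the real entries come from the Tokens section.
enum Tokens { keyword_if, whitespace };

// Stand-in for EasyLexer's Token; field names from the list above, types assumed.
struct Token {
    Tokens token;
    std::string value;
    bool hasNext;
    int line_number;
    int start_character;
};

// Builds a short, human-readable location string for a token.
std::string describe(const Token& t) {
    if (!t.hasNext) return "invalid token";
    return "'" + t.value + "' at line " + std::to_string(t.line_number) +
           ", character " + std::to_string(t.start_character);
}
```

A function attached to a production could call something like describe on one of its Token variables when it needs to report a problem to the user.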

Transpiling

When you run CCC on a .ccc file, two other files will be generated in the same folder as the .ccc file. These are a header file and a C++ file.

You do not have to worry about the C++ file.
The header file will contain a list of functions which you will now have to implement. These are the same functions you listed in the Productions section. The header file also contains an enum Tokens, which should be used to check which token you got. Finally, the header file contains a function parse, which should be called with the string you want to parse.

After transpiling, the output files are not bound to CCC, so you can use them anywhere without needing CCC. However, you will still need to include EasyParser and EasyLexer.

Limitations / Future Work

  • When defining a variable, you have to make sure all variables with the same name in the current production are of the same type. CCC does not report an error if they are not.
  • When using a production that has a return type, you must assign a variable to it. Again, CCC does not report an error if you forget.
  • The main limitation is error messages: there are very few of them, and the ones that exist contain very little information.
  • There is a very strange bug which means that comments cannot come after the Productions section. This seems to be an issue with EasyLexer, so I will have to investigate further.
  • As I am using EasyParser and EasyLexer, CCC also inherits their limitations.

Examples

In the repl there is a file called test.ccc, which contains some simple code to introduce the language.
There is also a sample language which evaluates a maths expression.
Also in the repl is a file called grammar.ccc; this is the file that was used to create CCC.

Links

Repl - CCC
Github - Thespyinthehole, ZiadAmr

Update

It has come to my attention that a bug in the lexer has resulted in very long compile times. This is currently being fixed; the fix will be in a new version of the lexer and then in CCC soon - 11/09/2020

EasyLexer has been updated with much faster lexing. I have yet to use it in CCC; however, I will once the submissions have finished being looked at - 17/09/2020

TheDrone7 (1434)

Hello there! The jam required you to work as a team of at least 2 members, could you please edit the post and mention your teammates in the description?

Thank you.

spyinthehole (2)

@TheDrone7 I have added their github link. Should I add them anywhere else?