Compiler

GitHub Source Code

Start Date: December 2022

Status: Completed as of September 2023, continuing to add updates

I designed my own Programming Language, JLang, which is compiled using the JLang Compiler.

I have always been interested in how compilers generate computer readable instructions from human readable code and I wanted to learn and experience it myself so I decided to build a compiler for my own Programming Language.

JLang Overview

JLang is very similar to C in terms of syntax and semantics. You can also view more documentation on the project's github.

Variables & Expressions

int x = 10; //declaration of x, x is 10
x = 5; //assignment of x, x is 5
printd(x); //prints 5
int x = 20; //ERROR: redeclaration of x within same scope

int z = 2 * 3 + 5 - 2 / 8; //read BACKWARDS in order (no PEMDAS), evaluates to 4
int y = 3 && z < 10; // 10 is less than z (z is 4) FALSE (0), 3 (TRUE) AND 0 (FALSE) is FALSE (0), y is 0

Functions & Control Flow

int foo(int a, int b) { //function foo, has params a and b both of type int, returns an int
    if (5 == a) { //if a is 5, run these lines
        int a = 2; //new scope, a is 2
        return a - 10; //returns the value of 10 - 2 which is 8
    }
    elseif (5 < a) { //if above condition is false, if a is less than 5, run these lines
        return a; //returns whatever a is
    }
    else { //if both conditions fail, run these lines
        a = a * b; //sets parameter a to b * a
    }
    for (int i = 0; b < i; i = i + 1) { //initialize i to 0, run this code while condition i is less than b holds, increment i by 1 after each iteration
        printd(i); //prints current value of i
    }
    //this is an example of an equivalent while loop
    //int i = 0;
    //while (i < b) {
    //   printd(i);
    //   i = i + 1; 
    //}
    return i; //ERROR variable i in for loop is no longer in scope, however the equaivalent while loop would make it in scope
    return b; //returns b, last line of function MUST be a return statement
}

Arrays

int array[5] = 2; //initializes an array of length 5, every element set to 2, array = [2, 2, 2, 2, 2]
array[0] = 10; //changes element 0 to value 10, array = [10, 2, 2, 2, 2]
int x = array[0]; //initializes a variable x to the value at array[0] which is 10

Development Process

This process was broken down into 5 major parts: Scanner, Parser, Semantics, Intermediate Representation, Code Generation. I used this textbook as a guide throughout.

int square(int num) {
    return num * num;
}

Scanner

This is the first stage of compiling where the source code is read as a stream of characters and grouped into meaningful tokens. Some examples include int (for integers) if (for conditionals) and identifiers (for variable and function names) as well as symbols used for syntax.

INPUT:

i n t   s q u a r e     (       i n t     n u m      )        {
[ 1 ]      [ 2 ]      [ 3 ]     [ 4 ]     [ 5 ]    [ 6 ]    [ 7 ]
r e t u r n     n u m       *        n u m         ;
   [ 8 ]        [ 9 ]     [ 10 ]     [ 11 ]      [ 12 ]
  }
[ 13 ]

------------------------------------------------------------------------------------------------------------------------------------------------

OUTPUT:

 [ 1 ]           [ 2 ]                    [ 3 ]              [ 4 ]           [ 5 ]                 [ 6 ]                      [ 7 ]
TOKEN_INT -> TOKEN_IDENTIFIER -> TOKEN_LEFT_PARENTHESES -> TOKEN_INT -> TOKEN_IDENTIFIER -> TOKEN_RIGHT_PARENTHESE -> TOKEN_LEFT_CURLY_BRACE ->
   [ 8 ]             [ 9 ]            [ 10 ]           [ 11 ]              [ 12 ]
TOKEN_RETURN -> TOKEN_IDENTIFIER -> TOKEN_MULT -> TOKEN_IDENTIFIER -> TOKEN_SEMICOLON ->
      [ 13 ]
TOKEN_RIGHT_CURLY_BRACE

Parser

This next stage uses the tokens generated from the scanner. It groups the tokens into rules for the language. If tokens are like words, rules are like sentences. If we have the following tokens: int x = 5 ; This forms a variable declaration rule. All rules that are parsed are added a syntax tree which holds the full program.

INPUT:

 [ 1 ]           [ 2 ]                    [ 3 ]              [ 4 ]           [ 5 ]                 [ 6 ]                      [ 7 ]
TOKEN_INT -> TOKEN_IDENTIFIER -> TOKEN_LEFT_PARENTHESES -> TOKEN_INT -> TOKEN_IDENTIFIER -> TOKEN_RIGHT_PARENTHESE -> TOKEN_LEFT_CURLY_BRACE ->
   [ 8 ]             [ 9 ]            [ 10 ]           [ 11 ]              [ 12 ]               [ 13 ]
TOKEN_RETURN -> TOKEN_IDENTIFIER -> TOKEN_MULT -> TOKEN_IDENTIFIER -> TOKEN_SEMICOLON -> TOKEN_RIGHT_CURLY_BRACE

------------------------------------------------------------------------------------------------------------------------------------------------

OUTPUT:

TYPE -> int
PARAMETERS -> TYPE identifer
EXPRESSION -> identifier * identifier
STATEMENTS -> return EXPRESSION ;
FUNCTION -> TYPE identifier ( PARAMETERS ) { STATEMENTS }

                           ________________
                          |                | 
                          |    FUNCTION    |
                          |________________|
                         /      /    |    \        
                        /      /     |     \
                       /      /      |      \
                      /      /       |       \
                   TYPE   square  PARAMETERS  STATEMENTS
                    |              |       |            \ 
                    |              |       |             \
                    |              |       |              \
                   int           TYPE     num         return EXPRESSION
                                   |                            |
                                   |                            |
                                   |                            |
                                  int                       num * num

Semantics

This stage includes 3 major parts: symbol creation, name resolution, and type checking.

Symbol creation occurs when an identifier is added to a symbol table which contains all declared variables and functions.

Name resolution happens after when identifiers are not used in the context of declaration such as assignment (when the variable is already declared) and in expressions.

Type checking occurs after all symbols and names are resolved and is responsible for determining if all symbols and constants are used without breaking any type rules.

INPUT:

                ________________
               |                | 
               |    FUNCTION    |
               |________________|
              /      /    |    \        
             /      /     |     \
            /      /      |      \
           /      /       |       \
        TYPE   square  PARAMETERS  STATEMENTS
         |              |       |            \ 
         |              |       |             \
         |              |       |              \
        int           TYPE     num         return EXPRESSION
                        |                            |
                        |                            |
                        |                            |
                       int                       num * num

------------------------------------------------------------

OUTPUT:

Symbol table:
square -- function type: int
num -- variable (param) type: int

Name resolution:
num (from Expression) -- num (variable)
num (from Expression) -- num (variable)

Type checking:
return statement:
square (type: int)  ==  [num (type: int)] * [num (type: int)]

Intermediate Representation

Once the language is known to be correct in terms of syntax and semantics translating the program into a language between the source code and assembly can begin. At this stage the syntax tree is translated into a "linear" language that consists of instructions similar to assembly. This intermediate language can still hold the types of variables and the symbols they refer to.

INPUT:
                ________________
               |                | 
               |    FUNCTION    |
               |________________|
              /      /    |      \        
             /      /     |       \
            /      /      |        \
           /      /       |         \
        TYPE   square  PARAMETERS  STATEMENTS
         |              |       |            \ 
         |              |       |             \
         |              |       |              \
        int           TYPE     num         return EXPRESSION
                        |                            |
                        |                            |
                        |                            |
                       int                       num * num


Symbol table:
square -- function type: int
num -- variable (param) type: int

Name resolution:
num (from Expression) -- num (variable)
num (from Expression) -- num (variable)

Type checking:
return statement:
square (type: int)  ==  [num (type: int)] * [num (type: int)]

-------------------------------------------------------------

OUTPUT:

Operation    Argument 1       Argument 2        Result
FUNC_DECL      square
PARAM            num
MULT             num             num            temp0
RETURN          temp0

Code Generation

In the final stage, the intermediate language is converted into assembly. Symbols are converted into addresses that a computer can recognize. The stack is being properly manipulated when functions are being called and local variables are saved.

INPUT:

Operation    Argument 1       Argument 2        Result
FUNC_DECL      square
PARAM            num
MULT             num             num            temp0
RETURN          temp0

-------------------------------------------------------------

OUTPUT:

square:
    PUSHQ %rbp
    MOVQ %rsp, %rbp
    MOVQ %rdi, -8(%rbp)
    MOVQ -8(%rbp), %rbx
    MOVQ -8(%rbp), %rax
    IMUL %rbx, %rax
    POPQ %rbp
    RET

Conclusion

You have now seen the development process for my compiler and compilers in general. If you would like to try out the language for yourself, please check out JLang and start writing some code! There are sample programs to run as well.

INPUT:

int square(int num) {
    return num * num;
}

-----------------------------------------------------

OUTPUT:

square:
    PUSHQ %rbp
    MOVQ %rsp, %rbp
    SUBQ $8, %rsp
    MOVQ %rdi, -8(%rbp)
    MOVQ -8(%rbp), %rbx
    MOVQ -8(%rbp), %rax
    IMUL %rbx, %rax
    POPQ %rbp
    RET