Start Date: December 2022
Status: Completed as of September 2023, continuing to add updates
I designed my own Programming Language, JLang, which is compiled using the JLang Compiler.
I have always been interested in how compilers generate computer readable instructions from human readable code and I wanted to learn and experience it myself so I decided to build a compiler for my own Programming Language.
JLang is very similar to C in terms of syntax and semantics. You can also view more documentation on the project's github.
int x = 10; //declaration of x, x is 10
x = 5; //assignment of x, x is 5
printd(x); //prints 5
int x = 20; //ERROR: redeclaration of x within same scope
int z = 2 * 3 + 5 - 2 / 8; //read BACKWARDS in order (no PEMDAS), evaluates to 4
int y = 3 && z < 10; // 10 is less than z (z is 4) FALSE (0), 3 (TRUE) AND 0 (FALSE) is FALSE (0), y is 0
int foo(int a, int b) { //function foo, has params a and b both of type int, returns an int
if (5 == a) { //if a is 5, run these lines
int a = 2; //new scope, a is 2
return a - 10; //returns the value of 10 - 2 which is 8
}
elseif (5 < a) { //if above condition is false, if a is less than 5, run these lines
return a; //returns whatever a is
}
else { //if both conditions fail, run these lines
a = a * b; //sets parameter a to b * a
}
for (int i = 0; b < i; i = i + 1) { //initialize i to 0, run this code while condition i is less than b holds, increment i by 1 after each iteration
printd(i); //prints current value of i
}
//this is an example of an equivalent while loop
//int i = 0;
//while (i < b) {
// printd(i);
// i = i + 1;
//}
return i; //ERROR variable i in for loop is no longer in scope, however the equaivalent while loop would make it in scope
return b; //returns b, last line of function MUST be a return statement
}
int array[5] = 2; //initializes an array of length 5, every element set to 2, array = [2, 2, 2, 2, 2]
array[0] = 10; //changes element 0 to value 10, array = [10, 2, 2, 2, 2]
int x = array[0]; //initializes a variable x to the value at array[0] which is 10
This process was broken down into 5 major parts: Scanner, Parser, Semantics, Intermediate Representation, Code Generation. I used this textbook as a guide throughout.
int square(int num) {
return num * num;
}
This is the first stage of compiling where the source code is read as a stream of characters and grouped into meaningful tokens. Some examples include int (for integers) if (for conditionals) and identifiers (for variable and function names) as well as symbols used for syntax.
INPUT:
i n t s q u a r e ( i n t n u m ) {
[ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ] [ 7 ]
r e t u r n n u m * n u m ;
[ 8 ] [ 9 ] [ 10 ] [ 11 ] [ 12 ]
}
[ 13 ]
------------------------------------------------------------------------------------------------------------------------------------------------
OUTPUT:
[ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ] [ 7 ]
TOKEN_INT -> TOKEN_IDENTIFIER -> TOKEN_LEFT_PARENTHESES -> TOKEN_INT -> TOKEN_IDENTIFIER -> TOKEN_RIGHT_PARENTHESE -> TOKEN_LEFT_CURLY_BRACE ->
[ 8 ] [ 9 ] [ 10 ] [ 11 ] [ 12 ]
TOKEN_RETURN -> TOKEN_IDENTIFIER -> TOKEN_MULT -> TOKEN_IDENTIFIER -> TOKEN_SEMICOLON ->
[ 13 ]
TOKEN_RIGHT_CURLY_BRACE
This next stage uses the tokens generated from the scanner. It groups the tokens into rules for the language. If tokens are like words, rules are like sentences. If we have the following tokens: int x = 5 ; This forms a variable declaration rule. All rules that are parsed are added a syntax tree which holds the full program.
INPUT:
[ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ] [ 7 ]
TOKEN_INT -> TOKEN_IDENTIFIER -> TOKEN_LEFT_PARENTHESES -> TOKEN_INT -> TOKEN_IDENTIFIER -> TOKEN_RIGHT_PARENTHESE -> TOKEN_LEFT_CURLY_BRACE ->
[ 8 ] [ 9 ] [ 10 ] [ 11 ] [ 12 ] [ 13 ]
TOKEN_RETURN -> TOKEN_IDENTIFIER -> TOKEN_MULT -> TOKEN_IDENTIFIER -> TOKEN_SEMICOLON -> TOKEN_RIGHT_CURLY_BRACE
------------------------------------------------------------------------------------------------------------------------------------------------
OUTPUT:
TYPE -> int
PARAMETERS -> TYPE identifer
EXPRESSION -> identifier * identifier
STATEMENTS -> return EXPRESSION ;
FUNCTION -> TYPE identifier ( PARAMETERS ) { STATEMENTS }
________________
| |
| FUNCTION |
|________________|
/ / | \
/ / | \
/ / | \
/ / | \
TYPE square PARAMETERS STATEMENTS
| | | \
| | | \
| | | \
int TYPE num return EXPRESSION
| |
| |
| |
int num * num
This stage includes 3 major parts: symbol creation, name resolution, and type checking.
Symbol creation occurs when an identifier is added to a symbol table which contains all declared variables and functions.
Name resolution happens after when identifiers are not used in the context of declaration such as assignment (when the variable is already declared) and in expressions.
Type checking occurs after all symbols and names are resolved and is responsible for determining if all symbols and constants are used without breaking any type rules.
INPUT:
________________
| |
| FUNCTION |
|________________|
/ / | \
/ / | \
/ / | \
/ / | \
TYPE square PARAMETERS STATEMENTS
| | | \
| | | \
| | | \
int TYPE num return EXPRESSION
| |
| |
| |
int num * num
------------------------------------------------------------
OUTPUT:
Symbol table:
square -- function type: int
num -- variable (param) type: int
Name resolution:
num (from Expression) -- num (variable)
num (from Expression) -- num (variable)
Type checking:
return statement:
square (type: int) == [num (type: int)] * [num (type: int)]
Once the language is known to be correct in terms of syntax and semantics translating the program into a language between the source code and assembly can begin. At this stage the syntax tree is translated into a "linear" language that consists of instructions similar to assembly. This intermediate language can still hold the types of variables and the symbols they refer to.
INPUT:
________________
| |
| FUNCTION |
|________________|
/ / | \
/ / | \
/ / | \
/ / | \
TYPE square PARAMETERS STATEMENTS
| | | \
| | | \
| | | \
int TYPE num return EXPRESSION
| |
| |
| |
int num * num
Symbol table:
square -- function type: int
num -- variable (param) type: int
Name resolution:
num (from Expression) -- num (variable)
num (from Expression) -- num (variable)
Type checking:
return statement:
square (type: int) == [num (type: int)] * [num (type: int)]
-------------------------------------------------------------
OUTPUT:
Operation Argument 1 Argument 2 Result
FUNC_DECL square
PARAM num
MULT num num temp0
RETURN temp0
In the final stage, the intermediate language is converted into assembly. Symbols are converted into addresses that a computer can recognize. The stack is being properly manipulated when functions are being called and local variables are saved.
INPUT:
Operation Argument 1 Argument 2 Result
FUNC_DECL square
PARAM num
MULT num num temp0
RETURN temp0
-------------------------------------------------------------
OUTPUT:
square:
PUSHQ %rbp
MOVQ %rsp, %rbp
MOVQ %rdi, -8(%rbp)
MOVQ -8(%rbp), %rbx
MOVQ -8(%rbp), %rax
IMUL %rbx, %rax
POPQ %rbp
RET
You have now seen the development process for my compiler and compilers in general. If you would like to try out the language for yourself, please check out JLang and start writing some code! There are sample programs to run as well.
INPUT:
int square(int num) {
return num * num;
}
-----------------------------------------------------
OUTPUT:
square:
PUSHQ %rbp
MOVQ %rsp, %rbp
SUBQ $8, %rsp
MOVQ %rdi, -8(%rbp)
MOVQ -8(%rbp), %rbx
MOVQ -8(%rbp), %rax
IMUL %rbx, %rax
POPQ %rbp
RET