A complete, teaching-oriented compiler implementation in Java that compiles C-style source code into assembly-like intermediate code. This project demonstrates all major stages of compilation: lexical analysis, parsing, semantic analysis, symbol table management, and code generation.
Disclaimer: This project is designed for educational purposes and learning compiler construction concepts.
- ✅ Complete Compilation Pipeline: From source code to generated assembly instructions
- ✅ Lexical Analysis: Tokenizes source code with support for keywords, operators, numbers, and identifiers
- ✅ Recursive Descent Parser: Builds Abstract Syntax Tree (AST) from token stream
- ✅ Symbol Table Management: Hierarchical scope-based symbol tables with offset tracking
- ✅ Type System: Support for
int,float,boolean, and array types - ✅ Code Generation: Generates stack-based assembly-like intermediate code
- ✅ File Output: Saves compiled code to
.asmfiles in acompiled/directory - ✅ Comprehensive Test Suite: 10 test files covering various language features
- Quick Start
- Compiler Architecture
- Language Features
- Project Structure
- Usage Examples
- Output Format
- Test Suite
- Development
- Java Development Kit (JDK) 11 or newer
From the project root directory:
# Create output directory
mkdir -p out
# Compile all Java source files
javac -d out src/**/*.javaNavigate to the out directory and run the compiler:
cd out
# Compile a test file
java javaCompiler.JavaCompiler ../src/test/09_mixed_operations.tstThis will:
- Parse the source code and build the AST
- Display the symbol table tree
- Generate and display intermediate code
- Save the compiled code to
compiled/09_mixed_operations.asm
This compiler implements a complete compilation pipeline with the following stages:
Location: src/lexer/
The lexer tokenizes source code into a stream of tokens. It recognizes:
- Keywords:
if,else,while,do,break,true,false - Data Types:
int,float,boolean - Operators:
+,-,*,/,<,>,<=,>=,==,!=,&&,||,! - Literals: Integer numbers, floating-point numbers, boolean values
- Identifiers: Variable names
- Delimiters:
{,},(,),[,],;,=
Key Classes:
Lexer.java- Main tokenizerToken.java- Base token classKeyword.java- Keyword tokensNum.java- Integer literalsReal.java- Float literalsTag.java- Token type constants
Location: src/parser/
Uses recursive descent parsing to build an Abstract Syntax Tree (AST). The parser implements the following grammar:
PROG → BLOCK
BLOCK → { DECLS STMTS }
DECLS → DECLS DECL | ε
DECL → TYPE id ;
TYPE → TYPE [num] | int | float | boolean
STMTS → STMTS STMT | ε
STMT → id = BOOL ; | if (BOOL) STMT | if (BOOL) STMT else STMT
| while (BOOL) STMT | do STMT while (BOOL) ;
| break ; | BLOCK
BOOL → BOOL || JOIN | JOIN
JOIN → JOIN && EQUALITY | EQUALITY
EQUALITY → EQUALITY == REL | EQUALITY != REL | REL
REL → EXPR < EXPR | EXPR <= EXPR | EXPR >= EXPR | EXPR > EXPR | EXPR
EXPR → EXPR + TERM | EXPR - TERM | TERM
TERM → TERM * UNARY | TERM / UNARY | UNARY
UNARY → !UNARY | -UNARY | FACTOR
FACTOR → (BOOL) | id | id[BOOL] | num | real | true | false
Key Features:
- Handles operator precedence correctly
- Supports nested scopes with proper symbol table management
- Error reporting with line numbers
Location: src/inter/
The AST is composed of two main categories:
Constant- Literal values (numbers, booleans)Id- Variable identifiersArith- Arithmetic operations (+, -, *, /)Rel- Relational operations (<, >, <=, >=, ==, !=, &&, ||)Not- Logical negationUnary- Unary minusAccess- Array element access
Set- Variable assignmentSetArrayElem- Array element assignmentIf- Conditional statementIfElse- Conditional with else branchWhile- While loopDoWhile- Do-while loopBreak- Break statementStmtSeq- Statement sequence
Location: src/symbol/
Manages variable declarations and type information with:
- Hierarchical Scopes: Nested block scopes with parent-child relationships
- Type Checking: Support for int, float, boolean, and array types
- Offset Tracking: Calculates memory offsets for variables
- Scope Resolution: Proper variable lookup through scope chain
Key Classes:
Env.java- Symbol table environment (one per scope)Type.java- Type representationsArray.java- Array type with dimensionsSymbolTablePrinter.java- Pretty-prints the symbol table tree
Location: src/codegen/
Generates stack-based assembly-like intermediate code with instructions:
Instructions:
PUSH <value>- Push value onto stackLOAD <var>- Load variable value onto stackSTORE <var>- Store top of stack to variableADD,SUB,MUL,DIV- Arithmetic operationsCMP <op>- Compare operationNOT- Logical negationJZ <label>- Jump if zeroJNZ <label>- Jump if not zeroJMP <label>- Unconditional jumpLABEL <name>- Label definitionLOADARR <array>- Load array elementSTOREARR <array>- Store to array element
Key Classes:
CodeGen.java- Code generator (visitor pattern)Instruction.java- Instruction representation
int i;
float x;
boolean done;
int arr[10];x = a + b * c;
y = (a - b) / 2.0;
z = -x;if (x > 5 && y < 10) { ... }
while (i < 100 || !done) { ... }
if (a == b) { ... }// If-else
if (condition) {
statement;
} else {
statement;
}
// While loop
while (condition) {
statement;
}
// Do-while loop
do {
statement;
} while (condition);
// Break
while (true) {
if (condition) break;
}{
int x;
x = 5;
{
float y;
y = 2.5;
{
boolean flag;
flag = true;
}
}
}C-Compiler/
├── README.md
├── src/
│ ├── codegen/ # Code generation
│ │ ├── CodeGen.java # Main code generator
│ │ └── Instruction.java # Instruction representation
│ ├── inter/ # Intermediate representation (AST)
│ │ ├── Node.java # Base AST node
│ │ ├── expr/ # Expression nodes
│ │ │ ├── Constant.java
│ │ │ ├── Id.java
│ │ │ ├── Access.java
│ │ │ ├── Expr.java
│ │ │ ├── Rel.java
│ │ │ ├── arith/
│ │ │ │ ├── Arith.java
│ │ │ │ └── Unary.java
│ │ │ └── logic/
│ │ │ └── Not.java
│ │ └── stmt/ # Statement nodes
│ │ ├── Stmt.java
│ │ ├── Set.java
│ │ ├── SetArrayElem.java
│ │ ├── If.java
│ │ ├── IfElse.java
│ │ ├── While.java
│ │ ├── DoWhile.java
│ │ ├── Break.java
│ │ └── StmtSeq.java
│ ├── parser/ # Parser
│ │ └── Parser.java # Recursive descent parser
│ ├── lexer/ # Lexical analyzer
│ │ ├── Lexer.java # Tokenizer
│ │ ├── Token.java # Base token
│ │ ├── Keyword.java # Keywords
│ │ ├── Num.java # Integer literals
│ │ ├── Real.java # Float literals
│ │ └── Tag.java # Token tags
│ ├── symbol/ # Symbol table & types
│ │ ├── Env.java # Environment/scope
│ │ ├── Type.java # Type system
│ │ ├── Array.java # Array types
│ │ └── SymbolTablePrinter.java
│ ├── javaCompiler/ # Main driver
│ │ └── JavaCompiler.java # Entry point
│ └── test/ # Test files
│ ├── 01_empty_block.tst
│ ├── 02_variable_declarations.tst
│ ├── 03_simple_assignment.tst
│ ├── 04_arithmetic_ops.tst
│ ├── 05_boolean_expressions.tst
│ ├── 06_if_else.tst
│ ├── 07_while_loop.tst
│ ├── 08_do_while_break.tst
│ ├── 09_mixed_operations.tst
│ └── 10_complex_boolean.tst
└── out/
├── [compiled class files]
└── compiled/ # Generated assembly files
└── *.asm
Input (test.tst):
{
int x;
int y;
int result;
x = 10;
y = 20;
result = x + y * 2;
}Run:
java javaCompiler.JavaCompiler test.tstOutput:
=== Symbol Table Tree ===
Scope Level 0
Scope Level 1
+------------+----------+--------+
| Identifier | Type | Offset |
+------------+----------+--------+
| result | int | 8 |
| y | int | 4 |
| x | int | 0 |
+------------+----------+--------+
=== Generated Code ===
PUSH 10
STORE x
PUSH 20
STORE y
LOAD x
LOAD y
PUSH 2
MUL
ADD
STORE result
=== Output saved to: compiled/test.asm ===
Input:
{
int i;
i = 0;
while (i < 5) {
i = i + 1;
}
}Generated Code:
PUSH 0
STORE i
LABEL L0
LOAD i
PUSH 5
CMP <
JZ L1
LOAD i
PUSH 1
ADD
STORE i
JMP L0
LABEL L1
The project includes 10 comprehensive test files:
| Test File | Description |
|---|---|
01_empty_block.tst |
Empty block |
02_variable_declarations.tst |
Variable declarations with different types |
03_simple_assignment.tst |
Basic assignment statements |
04_arithmetic_ops.tst |
Arithmetic operations and precedence |
05_boolean_expressions.tst |
Boolean logic and comparisons |
06_if_else.tst |
Conditional statements |
07_while_loop.tst |
While loop structures |
08_do_while_break.tst |
Do-while loops with break |
09_mixed_operations.tst |
Complex nested structures |
10_complex_boolean.tst |
Advanced boolean expressions |
Linux/macOS:
cd out
for f in ../src/test/*.tst; do
echo "--- Running: $f ---"
java javaCompiler.JavaCompiler "$f"
echo ""
doneWindows (PowerShell):
cd out
Get-ChildItem ..\src\test\*.tst | ForEach-Object {
Write-Host "--- Running: $($_.Name) ---"
java javaCompiler.JavaCompiler $_.FullName
Write-Host ""
}# Clean previous build
rm -rf out
# Create output directory
mkdir -p out
# Compile
javac -d out src/**/*.java
# Navigate to output
cd outThe codebase is designed to be extensible:
- New Operators: Add to
Lexer.javaand create corresponding AST nodes ininter/expr/ - New Statements: Create new statement classes in
inter/stmt/and updateParser.java - New Data Types: Extend
Type.javaand update the parser'stype()method - Code Generation: Add new instruction types in
Instruction.javaand updateCodeGen.java
- Single Responsibility: Each class has a clear, focused purpose
- Visitor Pattern: Code generation uses visitor pattern for AST traversal
- Recursive Descent: Parser mirrors the grammar structure
- Hierarchical Scopes: Symbol table uses parent pointers for scope chain
This compiler is organized to demonstrate key compiler concepts:
- Lexical Analysis: See
Lexer.javafor tokenization techniques - Parsing: Study
Parser.javafor recursive descent implementation - AST Construction: Review
inter/for tree node design - Symbol Tables: Examine
Env.javafor scope management - Code Generation: Analyze
CodeGen.javafor instruction generation
Potential improvements and extensions:
- Optimization: Constant folding, dead code elimination
- Arrays: Full multi-dimensional array support
- Functions: Function declarations and calls
- Structs: User-defined types
- Error Recovery: Better error handling and recovery
- Target Code: Generate actual machine code or bytecode
- Virtual Machine: Implement an interpreter for the generated code
- Debugger: Add debugging support with breakpoints
This is an educational project. Contributions that enhance learning value are welcome!
- Keep changes focused and well-documented
- Add test cases for new features
- Maintain the teaching-oriented style
- Comment complex algorithms
- Follow existing code conventions
For questions, issues, or suggestions:
- Open an issue on the repository
- Create a pull request with clear descriptions
- Reach out for collaboration opportunities
This project is unlicensed and provided as-is for educational purposes.
Happy Compiling! 🚀