compiler-written-in-java

Overview

A compiler for a domain-specific language (DSL), built as the final project for the COMP 442 (Compiler Design) course at Concordia University with Dr. Joey Paquet. I chose Java as the implementation language because (1) I am familiar with it and (2) it was the only fully supported language during the labs.

flowchart TD
    Start([Source Code]) --> LA[Lexical Analysis]
    LA -->|Token Stream| SA[Syntactic Analysis]
    SA -->|Syntax Tree| SemA[Semantic Analysis]
    SemA -->|Annotated Tree| HLO[High-Level Optimization]
    
    HLO -->|Intermediate Representation| TCG[Target Code Generation]
    TCG -->|Target Code| LLO[Low-Level Optimization]
    LLO --> End([Optimized Target Code])
    
    subgraph Frontend ["Frontend"]
        LA
        SA
        SemA
        HLO
    end
    
    subgraph Backend ["Backend"]
        TCG
        LLO
    end
    
    style Frontend fill:#e3f2fd
    style Backend fill:#fff3e0
    style Start fill:#c8e6c9
    style End fill:#c8e6c9
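The phases in the diagram compose naturally: each one maps its input representation to the next. A minimal sketch of that chaining (phase names and string placeholders are hypothetical, standing in for the real components such as LexicalScanner):

```java
import java.util.function.Function;

public class Pipeline {
    public static void main(String[] args) {
        // Placeholder phases: in the real compiler each would be a full component
        // producing tokens, trees, etc., rather than strings.
        Function<String, String> lexical   = src    -> "[tokens of " + src + "]";
        Function<String, String> syntactic = tokens -> "[tree of " + tokens + "]";

        // The frontend is simply the composition of its phases.
        Function<String, String> frontend = lexical.andThen(syntactic);
        System.out.println(frontend.apply("x = 1;"));
    }
}
```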

Lexical Analysis

The first assignment consisted of designing and implementing a scanner for a programming language whose lexical specification was provided to us. The lexical analyzer that I designed blends (1) table-driven scanning (methods that follow the algorithm suggested in class, including the “table” function and backtracking logic) with (2) procedural methods for validation rules (methods prefixed with “process”). The transition table is defined inside the lexical scanner simply as individual “final int” state constants (1–36), categorized as either intermediate states (1–10) or final states (11–36).
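The table-driven loop described above can be sketched in miniature. This is a hypothetical, condensed illustration (a two-state identifier DFA), not the actual LexicalScanner, whose table() covers the full set of states; the method names mirror those in the class diagram below:

```java
public class TableDrivenSketch {
    static final int START = 1, IN_ID = 2, FINAL_ID = 3; // illustrative states only

    // Tiny transition table: an identifier is a letter followed by letters/digits.
    static int table(int state, char c) {
        switch (state) {
            case START: return Character.isLetter(c) ? IN_ID : FINAL_ID;
            case IN_ID: return Character.isLetterOrDigit(c) ? IN_ID : FINAL_ID;
            default:    return FINAL_ID;
        }
    }

    static boolean isFinalState(int state)              { return state == FINAL_ID; }
    static boolean mustBacktrackInStateTable(int state) { return state == FINAL_ID; }

    // Scan one identifier lexeme from input starting at pos[0].
    static String nextToken(String input, int[] pos) {
        StringBuilder lexeme = new StringBuilder();
        int state = START;
        while (!isFinalState(state) && pos[0] < input.length()) {
            char c = input.charAt(pos[0]++);
            state = table(state, c);
            if (!isFinalState(state)) {
                lexeme.append(c);
            } else if (mustBacktrackInStateTable(state)) {
                pos[0]--; // backtrack: the last char belongs to the next token
            }
        }
        return lexeme.toString();
    }

    public static void main(String[] args) {
        int[] pos = {0};
        System.out.println(nextToken("count=", pos)); // '=' is pushed back for the next token
    }
}
```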

classDiagram
direction BT
class LexDriver {
  + LexDriver() 
  + main(String[]) void
}
class LexicalScanner {
  + LexicalScanner(String) 
  - int STATE_FINAL_INLINECMT
  - int STATE_FINAL_EOF
  - int STATE_FINAL_MULT
  - int STATE_FINAL_ASSIGN
  - boolean endOfFile
  - int STATE_FINAL_CLOSEPAR
  - int STATE_FINAL_NOTEQ
  - int STATE_IN_BLOCK_COMMENT
  - Map~String, TokenType~ keywords
  - int STATE_IN_INLINE_COMMENT
  - List~String~ tokenErrors
  - int STATE_FINAL_CLOSECUBR
  - int STATE_FINAL_CLOSESQBR
  - int STATE_FINAL_COMMA
  - int STATE_AFTER_STAR_IN_BLOCK_COMMENT
  - int STATE_FINAL_OPENPAR
  - int STATE_FINAL_COLONCOLON
  - int lineNumber
  - int STATE_FINAL_GT
  - int STATE_AFTER_LT
  - int STATE_FINAL_OPENCUBR
  - int START_STATE
  - int STATE_AFTER_GT
  - int STATE_ID_CONTINUATION
  - int STATE_FINAL_DIV
  - int STATE_FINAL_BLOCKCMT
  - int FIRST_FINAL_STATE
  - int STATE_FINAL_DOT
  - boolean hasChar
  - int STATE_AFTER_COLON
  - int STATE_FINAL_COLON
  - int STATE_FINAL_ID
  - int STATE_FINAL_SEMI
  - int STATE_FINAL_EQ
  - char currentChar
  - BufferedReader characterReader
  - int STATE_AFTER_EQUALS
  - int STATE_FINAL_LT
  - int STATE_FINAL_GEQ
  - int STATE_FINAL_PLUS
  - int STATE_AFTER_SLASH
  - int STATE_FINAL_MINUS
  - int STATE_FINAL_LEQ
  - int STATE_FINAL_OPENSQBR
  + nextToken() Token
  - backtrackChar() void
  - continueProcessingFloatAfterDecimal(StringBuilder, int) Token
  + close() void
  - mustBacktrackInStateTable(int) boolean
  - consumeOperatorTokenOrInvalidToken(int) Token
  - processNumber(int) Token
  - processNumberStartingWithZero(int) Token
  - getLineNumber() int
  - isFinalState(int) boolean
  - nextChar() char
  - table(int, char) int
  + getTokenErrors() List~String~
  - processInvalidId(int) Token
  - processKeywordOrID(int) Token
  - createToken(int, String, int) Token
  - decrementLineNumber() void
  - continueProcessingExponent(StringBuilder, int) Token
  - initializeKeywords() void
  - incrementLineNumber() void
  - isEndOfFile() boolean
}
class Token {
  + Token(TokenType, String, int) 
  - String lexeme
  - int location
  - TokenType type
  + type() TokenType
  + lexeme() String
  - handleSpecialJavaCharacters(String) String
  + toString() String
  + location() int
}
class TokenType {
<<enumeration>>
  + TokenType() 
  +  LEQ
  +  COMMA
  +  INTEGER
  +  SEMI
  +  MAIN
  +  EQ
  +  CLOSECUBR
  +  DO
  +  COLONCOLON
  +  PRIVATE
  +  LOCAL
  +  WRITE
  +  VOID
  +  OPENCUBR
  +  READ
  +  MINUS
  +  OPENPAR
  +  INVALIDNUM
  +  ASSIGN
  +  GT
  +  ELSE
  +  DIV
  +  INHERITS
  +  DOT
  +  AND
  +  LT
  +  COLON
  +  FLOAT
  +  NOT
  +  INVALIDID
  +  THEN
  +  INTNUM
  +  BLOCKCMT
  +  INVALIDCHAR
  +  OPENSQBR
  +  CLOSESQBR
  +  CLOSEPAR
  +  CLASS
  +  ID
  +  IF
  +  PUBLIC
  +  OR
  +  EOF
  +  WHILE
  +  NOTEQ
  +  RETURN
  +  END
  +  FLOATNUM
  +  INLINECMT
  +  PLUS
  +  GEQ
  +  MULT
  + valueOf(String) TokenType
  + values() TokenType[]
}
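A typical driver pulls tokens with nextToken() until the EOF token appears, as LexDriver does. The sketch below is hypothetical: it replays a fixed token stream through a stub scanner (the lexemes shown are illustrative) so that the loop shape is visible without the full LexicalScanner:

```java
import java.util.ArrayList;
import java.util.List;

public class DriverSketch {
    // Minimal stand-ins for the Token/TokenType API in the diagram.
    enum TokenType { ID, ASSIGN, INTNUM, EOF }
    record Token(TokenType type, String lexeme, int location) {}

    // Stub scanner that replays a fixed stream; the real LexicalScanner
    // would produce these from a source file.
    static class StubScanner {
        private final List<Token> tokens = List.of(
            new Token(TokenType.ID, "x", 1),
            new Token(TokenType.ASSIGN, ":=", 1),
            new Token(TokenType.INTNUM, "1", 1),
            new Token(TokenType.EOF, "", 1));
        private int i = 0;
        Token nextToken() { return tokens.get(i++); }
    }

    public static void main(String[] args) {
        StubScanner scanner = new StubScanner();
        List<String> lexemes = new ArrayList<>();
        Token t;
        while ((t = scanner.nextToken()).type() != TokenType.EOF) {
            lexemes.add(t.lexeme());
        }
        System.out.println(lexemes);
    }
}
```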

Use of Tools

Flaci was used at first to verify the correctness of my models and transition tables on simple DFAs representing integers, identifiers, floating-point numbers, and fractions. Later, while implementing the transition table in Java for my lexical analyzer, I used the provided .src test files and my own test cases to verify that the implementation was correct.

Robert Nystrom’s ‘Crafting Interpreters’ book was consulted to help with recognizing lexemes and the ‘end-of-file’ token type. The book also presented an alternative to backtracking: “peeking” (looking ahead) in the character sequence. Testing began with the files provided in the assignment, verifying that they produced the expected outputs, and then expanded with my own test cases. Adding these tests and applying the resulting fixes to LexicalScanner made up the last phase of the assignment. For example, I made sure that comments include distinct whitespace characters such as \t, \r, and \n (as represented in Java), and that single-line comments do not produce invalid tokens when the file does not end with an empty line.
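That last edge case can be illustrated with a hypothetical stand-in for the scanner’s inline-comment rule (not the actual LexicalScanner code): the comment should terminate cleanly at end of input even when no newline follows, and whitespace like \t inside it is kept:

```java
public class CommentEdgeCase {
    // Stand-in rule: an inline comment runs to the end of the line or to EOF.
    static String scanInlineComment(String input, int start) {
        int end = start;
        while (end < input.length() && input.charAt(end) != '\n') end++;
        return input.substring(start, end);
    }

    public static void main(String[] args) {
        // A tab inside the comment is preserved in the lexeme.
        System.out.println(scanInlineComment("// tab\there", 0));
        // No trailing newline at EOF: the comment still terminates cleanly.
        System.out.println(scanInlineComment("// no newline at EOF", 0));
    }
}
```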

Finally, for documentation purposes I want to mention a tool that I tried but did not end up using, and why. Java’s built-in Tokenizer seemed helpful at first, but it would have required verifying exactly which lexemes it allows or forbids (and then addressing every mismatch). For instance, Java’s Tokenizer allows trailing zeroes, which contradicts the original lexical specification. Given these discrepancies, implementing my own table-driven tokenizer was more straightforward (and likely less time-consuming) than adapting Java’s existing implementation to match the lexical specification.
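Assuming the built-in tokenizer referred to is java.io.StreamTokenizer, the discrepancy is easy to demonstrate: it parses numbers into a double (nval), so a lexeme with a trailing zero such as "1.50" is silently accepted and normalized rather than rejected as the specification would require:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class TrailingZeroDemo {
    public static void main(String[] args) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader("1.50"));
        st.nextToken();              // returns TT_NUMBER; parseNumbers is on by default
        System.out.println(st.nval); // the trailing zero is lost, not flagged
    }
}
```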

Author

Giuliano Verdone
