-
Notifications
You must be signed in to change notification settings - Fork 0
Semantic Analysis
NOTE: Semantic Analysis is way out of data and doesn't represent the current semantic analysis scheme whatsoever.
Semantic Analysis for MiniPascal doesn't make use of a ABS (abstract syntax tree). That makes implementing all the checks a bit more difficult than usual I think. I've gone ahead and made the following bunch of supporting constructs for the semantic analysis.
-
mptypes.h: A header file to contain the types that thempascal.yfile will use, as well as it's own token-types. -
debug.c/debug.h: (Not technically new) A set of files that buffer the line being read, and provide formatted printing methods for errors and warnings. -
numtab.c/numtab.h: A table of constants. Seems a bit silly, but I wanted one of the fields used in the type returned within thempascal.yfile to have a "pointer" to a number (if it was a constant). I didn't literally want to store the number there because then there's no way to signal "no-number". It's now an index into the symbol table, and when the index is-1orNILit indicates there is no constant provided. -
strtab.c/strtab.h: A table that holds strings. Mostly used bysymtab.h. -
symtab.c/symtab.h: The symbol table. Essentially stores a token-type (tt) along with anidentifier. Mostly used bympascal.yto check if an identifier exists or not (Yes, there is no constant field stored within the symbol table. This means that if you assignxthe value1and then referencexlater in an expression, it will knowxis of typeTT_INTEGER, but not of the constant value). This could be changed later, maybe.
It's necessary to define exactly what each nonterminal rule can return in Bison. An example of this was given in the calculator tutorial exercise.
%union {
float fval;
char *strtabptr;
}
This union type makes sense for a calculator, but not so much for a compiler. For one, a union only allows you to return one type at a time. We need more. The factor in MiniPascal may be a number, or an identifier. This means it must be able to return both at any time. Hence, defining the token factor as %token <fval> factor, or %token <strtabptr> factor is not complete.
Anyways, I only plan on implementing the expression rule for the meantime (and all rules that expression leads to). It then suffices to build a struct that has all the fields you might need for an expression and its sub-rules. I made the following struct:
typedef struct {
unsigned tt; // Token-Type.
int vi; // Value-Index: Index of constant value in numtab.
} exprType;
A constant like 3 gets the token-type TT_INTEGER, and a index into the number table which holds the value 3. Alternatively, an identifier x declared as TT_INTEGER gets the exact same struct, but with a value-index to NIL. Like I mentioned before, only constants keep their literal values. The moment you introduce a variable or index into a array, then it has to be saved to the symbol table (which doesn't store any literals) and the value is lost.
In any case, the union type for our parser is:
%union {
exprType expr;
}
When we expand beyond expressions, we can include additional fields like: exprListType exprList, and assignType assignment for the other rules (made those up).
As an example, I'll just explain how the factor rule works. Before this, I'll list what the token-types are.
Yes, we already have tokens returned by the mpascal.lex lexer (although defined in mpascal.y). But they aren't suitable for the semantic analysis stuff. Here's the list of new tokens so far (mptypes.h).
// Token-Types: Primitives.
#define TT_UNDEFINED 0
#define TT_INTEGER 1
#define TT_REAL 2
// Token-Class: Array Mask.
#define TC_ARRAY 4
// Token-Types: Arrays.
#define TT_ARRAY_INTEGER 4
#define TT_ARRAY_REAL 5
// Token-Class: Function Mask.
#define TC_FUNCTION 8
// Token-Types: Functions.
#define TT_FUNCTION_INTEGER 8
#define TT_FUNCTION_REAL 9
// Token-Types: Boolean.
#define TT_BOOLEAN 16
You might wonder what this "token classes" do. Well, that will be explained next. After that I'll get back to the factor rule implementation. For now though, I'll just show the rule so I can reference it.
factor : MP_ID { $$ = (exprType){.tt = getTypeElseInstall(yytext, TT_UNDEFINED), .vi = NIL}; }
| MP_ID MP_POPEN expressionList MP_PCLOSE { $$ = (exprType){.tt = resolveTypeClass(yytext, TC_FUNCTION), .vi = NIL}; }
| MP_ID MP_BOPEN expression MP_BCLOSE { $$ = (exprType){.tt = resolveTypeClass(yytext, TC_ARRAY), .vi = NIL}; }
| MP_INTEGER { $$ = (exprType){.tt = TT_INTEGER, .vi = installNumber(atof(yytext))}; }
| MP_REAL { $$ = (exprType){.tt = TT_REAL, .vi = installNumber(atof(yytext))}; }
| MP_POPEN expression MP_PCLOSE { $$ = $2; }
;
A factor may be a identifier like x, or a function call like: x(<expressionList>), or an array index like x[2+2]. When an array like x get declared, it obviously has a type. I.E: An array of integers. Therefore you'll give it a token like TT_ARRAY_INTEGER. However, you'll often just want to know if it's an array, and if you need, what primitive type it contains (integer/real). The classes you see next to the tokens serve to do broad categorization.
-
TT_ARRAY_INTEGERandTT_ARRAY_REALhave values 4 and 5 respectively. The mask forTC_ARRAYis 4. If you useTC_ARRAYas a mask and apply a bitwise-AND to two values, then you'll get a nonzero return value since both use have the third bit set. Notice no other types do. -
TT_FUNCTION_INTEGERandTT_FUNCTION_REALhave values 8 and 9 respectively. The mask forTC_FUNCTIONis 8. If you use it as a mask like described before, then you'll get a nonzero return value since both have the fourth bit set.
This is how you can easily determine the category of a TT_<type> using a TC_<class>.
As for "reducing" more abstract types like arrays and functions to primitives, notice that you can use the operation: x -> 1 + (x mod 2) to all of these values (except the primitives) to get one of the two primitive values.
-
Applying
x -> 1 + (x mod 2)to TT_ARRAY_REAL results in1 + (5 mod 2)which is 2 (soTT_REAL). -
Applying
x -> 1 + (x mod 2)to TT_FUNCTION_INTEGER results in1 + (8 mod 2)which is 1 (soTT_INTEGER).
This concludes my explanation of the token-type and token-classes. It's basically just an explanation for why the numbers are the way they are. The only utility they serve is to make it easier to implement the semantic checking functions.
Lets now explain each of the lines in the factor rule:
- When you encounter
MP_ID, you basically want to look up theidin the symbol table, and get the token-typett. If you can't get that, then it's clearly undefined, and you return anexprTypewith a token-typeTT_UNDEFINED. That most likely will trigger an error later up in the expression if its used. - When you encounter a function call, you need to do more than just lookup the token-type. You need to ensure that the token-type is of class
TC_FUNCTIONand then you need to deduce the return type it has (TT_INTEGERorTT_REAL). Therefore, I wrote functions to "resolve" the token-type you expect to its primitive. - When you encounter an array index, you do the same sort of steps as you do when encountering a function call. You must expect that an identifier exists, and that it has a token-type of class
TC_ARRAY. Finally, you must extract the primitive type from that and return it as the new token-type to the rule that uses it. - Encountering an
MP_INTEGERallows you to directly return aexprTypewith a token-type ofTT_INTEGERand a constant value that you interpreted from theyytextvalue. - Encountering an
MP_REALallows you to directly return aexprTypewith token-typeTT_REALand a constant value that you interpreted from theyytextvalue.
And that basically concludes how the factor rule checking is implemented. As you go higher up the grammar, you run into terms and simpleExpressions. Both of these involve arithmetic operations like addition, subtraction, division, etc. Here is where I do reductions on literal expressions and check for divisions by zero.
- We don't have any block levels in use yet.
- We probably should save constant values to the symbol table. Nobody really explicitly divides by zero...
- We need to implemented the remaining rules besides expressions.
- We need to fix a lot of the bugs with the error output. Often times the lines shown are wrong.