Compilers-PA2

Introduction

For this assignment, you are to write a lexical analyzer, also called a scanner, using a lexical analyzer generator. (The C++ tool is called flex) You will describe the set of tokens for Cool in an appropriate input format, and the analyzer generator will generate the actual code (C++) for recognizing tokens in Cool programs.

Analysis and Solution

We need to write Flex rules that match on the appropriate regular expressions defining valid tokens in Cool as described in Section 10 and Figure 1 of the Cool manual and perform the appropriate actions and handle errors.

It is sufficient for keywords,multiple-character operators and single-character to return the token type,we can solve them first. It is only worth noting that, “Except for the constants true and false, keywords are keywords are case insensitive.”[1]

We can define keywords like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
CLASS           [Cc][Ll][Aa][Ss][Ss]
ELSE [Ee][Ll][Ss][Ee]
IF [Ii][Ff]
FI [Ff][Ii]
IN [Ii][Nn]
INHERITS [Ii][Nn][Hh][Ee][Rr][Ii][Tt][Ss]
LET [Ll][Ee][Tt]
LOOP [Ll][Oo][Oo][Pp]
POOL [Pp][Oo][Oo][Ll]
THEN [Tt][Hh][Ee][Nn]
WHILE [Ww][Hh][Ii][Ll][Ee]
CASE [Cc][Aa][Ss][Ee]
ESAC [Ee][Ss][Aa][Cc]
OF [Oo][Ff]
NEW [Nn][Ee][Ww]
ISVOID [Ii][Ss][Vv][Oo][Ii][Dd]
NOT [Nn][Oo][Tt]

and define those rules:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
{CLASS}     { return (CLASS); }
{ELSE} { return (ELSE); }
{IF} { return (IF); }
{FI} { return (FI); }
{IN} { return (IN); }
{INHERITS} { return (INHERITS); }
{LET} { return (LET); }
{LOOP} { return (LOOP); }
{POOL} { return (POOL); }
{THEN} { return (THEN); }
{WHILE} { return (WHILE); }
{CASE} { return (CASE); }
{ESAC} { return (ESAC); }
{OF} { return (OF); }
{NEW} { return (NEW); }
{ISVOID} { return (ISVOID); }
{NOT} { return (NOT); }

In fact, for higher versions of Flex, keyword definations can be omitted and rules can also be written in the following form:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
(?i:class) { return CLASS; }
(?i:else) { return ELSE; }
(?i:fi) { return FI; }
(?i:if) { return IF; }
(?i:in) { return IN; }
(?i:inherits) { return INHERITS; }
(?i:let) { return LET; }
(?i:loop) { return LOOP; }
(?i:pool) { return POOL; }
(?i:then) { return THEN; }
(?i:while) { return WHILE; }
(?i:case) { return CASE; }
(?i:esac) { return ESAC; }
(?i:of) { return OF; }
(?i:new) { return NEW; }
(?i:isvoid) { return ISVOID; }
(?i:not) { return NOT; }

We can define punctuations’[2] rules like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
"."         { return ('.'); }
"@" { return ('@'); }
"~" { return ('~'); }
"*" { return ('*'); }
"/" { return ('/'); }
"+" { return ('+'); }
"-" { return ('-'); }
"<=" { return (LE); }
"<" { return ('<'); }
"=" { return ('='); }

";" { return (';'); }
"(" { return ('('); }
")" { return (')'); }
"{" { return ('{'); }
"}" { return ('}'); }
"," { return (','); }
":" { return (':'); }

We can define multiple-character operators like this:

1
2
DARROW          =>
ASSIGN <-

and define those rules

1
2
{DARROW}		{ return (DARROW); }
{ASSIGN} { return (ASSIGN); }

The next challenges to address are boolean values, integers values, type identifiers, and object identifiers, since they all have highly fixed formats.

For boolean constants, the semantic value is stored in the field cool yylval.boolean.[3]

1
2
{TRUE}      { cool_yylval.boolean = 1; return (BOOL_CONST); }
{FALSE} { cool_yylval.boolean = 0; return (BOOL_CONST); }

For class identifiers, object identifiers, integers, and strings, the semantic value should be a Symbol stored in the field cool_yylval.symbol.[3:1]

First of all, we define the following definitions.

1
2
DIGIT [0-9]+
STR [0-9a-z_A-Z]*

What’s more, “Type identifiers begin with a capital letter; object identifiers begin with a lower case letter.”[4]

1
2
3
4
5
6
[A-Z]{STR} {  cool_yylval.symbol=idtable.add_string(yytext);
return (TYPEID);
}
[a-z]{STR} { cool_yylval.symbol=idtable.add_string(yytext);
return (OBJECTID);
}
1
2
3
4
{DIGIT} {
cool_yylval.symbol = inttable.add_string(yytext);
return (INT_CONST) ;
}s

The issue of nested comments[5] can be resolved by creating a stack.

When the scanner encounters “(*”, it enters the “COMMENT” state and sets the stack to 1, indicating the start of a comment. If the scanner encounters “(*” inside a comment, it increments the stack, indicating a nested comment. When the scanner encounters “*)”, it decrements the stack, indicating an unnesting of the comment. If the stack reaches 0, it means the comment has ended, and the scanner returns to the “INITIAL” state.

1
2
3
4
5
"(*" {BEGIN (COMMENT);stack=1;} /*switch comment mode*/
<COMMENT>"(*" {stack++;} /*nested*/
<COMMENT>"*)" { stack--; /*unnested*/
if(stack==0) BEGIN (INITIAL) ;
}

If a comment remains open when EOF is encountered, report this error with the message ‘‘EOF in comment’’. Do not tokenize the comment’s contents simply because the terminator is missing.[6]

1
2
3
4
<COMMENT><<EOF>> {  cool_yylval.error_msg = "EOF in comment"; 
BEGIN(INITIAL);
return (ERROR);
}

Finally, eating up comments.

1
2
3
<COMMENT>.   /*eat up comments*/
<COMMENT>\n { curr_lineno++; }
"--".* /*eat up a line commet*/

If you see “*)” outside a comment, report this error as ‘‘Unmatched *)’’, rather than tokenizing it as * and ).[6:1]

1
2
"*)" {  cool_yylval.error_msg = "‘Unmatched *)" ; 
return (ERROR) ; }

The most difficult part is the scanning of strings by the scanner.

When encountering start ", the code switches to the STRING state. It initializes variables, such as string_buf_ptr and strl, and clears the string_buf buffer.

1
2
3
4
5
6
\" {  BEGIN (STRING);
string_buf_ptr = string_buf;
memset(string_buf,'\0',sizeof(string_buf));
//readLine=0;
strl=0;
} /*switch comment mode*/

In the STRING state, when encountering enclosed ", the code returns to the INITIAL state. It checks if the string buffer contains any null characters and returns an error if found. If the length of the string is within MAX_STR_CONST, it adds the string to the symbol table and returns a token of type STR_CONST.

1
2
3
4
5
6
7
8
9
10
11
12
<STRING>\" { BEGIN (INITIAL);
if(strlen(string_buf)!=strl){
cool_yylval.error_msg = "String contains null character";
return (ERROR);
}
if(strlen(string_buf)<MAX_STR_CONST){
cool_yylval.symbol=stringtable.add_string(string_buf);
return (STR_CONST);
}
cool_yylval.error_msg = "String constant too long";
return (ERROR);
}

In the STRING state, if the EOF is encountered, it indicates an unterminated string constant, and the code returns an error.

1
2
3
4
<STRING><<EOF>> {  cool_yylval.error_msg = "EOF in string constant"; 
BEGIN(INITIAL);
return (ERROR);
}

In the STRING state, when encountering a backslash followed by a newline character, the code adds the newline character to the string buffer.

Note: Since I implemented this assignment on Windows, it is necessary to add a \r character.

1
2
3
4
<STRING>\\(\n|\r) { *string_buf_ptr++ = yytext[1];
strl++;
curr_lineno++;
}

Escape sequence \c is accepted for all characters c. Except for \n \t \b \f, the result is c.

1
2
3
4
<STRING>\\n  {*string_buf_ptr++ = '\n';strl++;}
<STRING>\\t {*string_buf_ptr++ = '\t';strl++;}
<STRING>\\b {*string_buf_ptr++ = '\b';strl++;}
<STRING>\\f {*string_buf_ptr++ = '\f';strl++;}

The remaining parts are simple.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
<STRING>\\.  {*string_buf_ptr++ = yytext[1];strl++;}

<STRING>\n { cool_yylval.error_msg = "Unterminated string constant";
curr_lineno++;
BEGIN (INITIAL);
return (ERROR);
}

<STRING>[^\\\n\"]+ {
char *yptr = yytext;
int flag=0;
for(int i=0;i<yyleng;i++){
*string_buf_ptr++ = *yptr++;
strl++;
}
}

Result

SCORE 63/63

Here are all source code

Suggestion

As far as I’m concerned, it’s quite significant to read the PA2 ,cool-manual.pdf(Section10-Section11) and flex manual(Section3-Section11). What’s more, there is a trick. In this page, flex manual offers an example of how to match C-style quoted strings using exclusive start conditions, including expanded escape sequences. You can use this as a skeleton and make modifications to it, which is much easier than starting from scratch.


  1. cool-manual p16 10.4 Keywords ↩︎

  2. cool-manual p17 Figure 1: Cool syntax. ↩︎

  3. PA2 p6 Section5 Notes for the C++ Version of the Assignment ↩︎ ↩︎

  4. cool-manual.pdf p15 10.1 Integers, Identifiers, and Special Notation ↩︎

  5. cool-manual.pdf p15 Section10.3 Comments ↩︎

  6. PA2 p5 Section4.1 Error Handling ↩︎ ↩︎