[2.1] Lexical Analysis
We've reached our first stage, known as Lexical Analysis, Scanning, or simply Lexing. The Lexical Analyzer's job is to take the stream of characters that makes up the source program (it could come in any shape or form, as long as the Lexer has a constant flow of characters) and group them into meaningful sequences known as "lexemes". For each lexeme, the Lexical Analyzer produces a token as the result of the analysis. A token will often look like this: `<token-name, attribute-value>`
This token will later make its way to the syntax-analysis phase. In the example token above, it has a name and a value: the token-name is an abstract symbol used during the syntax-analysis phase, and the attribute-value is connected to an entry in what's known as a symbol table. This information is needed for semantic analysis and code generation.
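As a rough sketch of the idea (the `Token` class here is illustrative, not taken from any particular compiler), a token pairing an abstract name with an attribute value might be modeled as:

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    """A lexer token: an abstract token-name plus an optional attribute value."""
    name: str                       # e.g. "id", "number", "="
    value: Optional[object] = None  # e.g. a symbol-table index, or None

# An identifier token whose attribute points at symbol-table entry 1:
tok = Token("id", 1)
print(tok)  # Token(name='id', value=1)
```

Tokens for operators, which carry no extra information, would simply leave the value empty.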
Let's pick apart a statement together and see if we can't figure out how source code becomes tokens. We'll start with something simple: `amount = paid + tip * 0.6`
Before explaining in-depth, I'd like to mention that lexing is entirely position-based, meaning we have no information about the language itself, just a constant stream of characters to convert into tokens. Now let's analyze that statement:
amount is a lexeme that gets mapped into a new token that looks something along the lines of <id, 1>. id is what you'd call an abstract symbol; it doesn't directly represent "amount" but merely stands in for it, marking it as an identifier. The value 1 points to an entry somewhere in the symbol table, from which we can retrieve actual information about the identifier, such as its name, type, value, and any other information the table holds.
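That interning step can be sketched in a few lines (the `SymbolTable` class and its `intern` method are hypothetical names, just to illustrate the idea):

```python
class SymbolTable:
    """Maps identifier names to numbered entries holding their attributes."""

    def __init__(self):
        self.entries = {}  # name -> 1-based entry index, as in the text

    def intern(self, name):
        """Return the entry index for name, creating a new entry if needed."""
        if name not in self.entries:
            self.entries[name] = len(self.entries) + 1
        return self.entries[name]


table = SymbolTable()
print(table.intern("amount"))  # 1
print(table.intern("paid"))    # 2
print(table.intern("amount"))  # 1 again: same identifier, same entry
```

The important property is that a repeated identifier always maps back to the same entry, so every <id, n> token for the same name shares one set of attributes.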
So we've now encountered the = lexeme, which in most languages identifies assignment. What token would this map to, though? This is somewhat of a point of debate: some argue that such tokens should get an abstract symbol, something along the lines of assign or assignment, while others say there's no need and the lexeme can simply represent itself. For the sake of simplicity, we'll use the lexeme itself as the name of the abstract symbol and write it as <=>.
paid is another lexeme that gets mapped into a new token, looking like <id, 2>, where 2 simply points to the symbol-table entry for paid. You know the drill; it's exactly the same as the first lexeme.
Since we've decided to forgo representing the operators as abstract symbols, our + lexeme will simply be <+>.
tip follows the same pattern as paid and amount, being mapped into <id, 3>, with 3 pointing to the symbol-table entry for tip.
As always, we'll be representing the lexeme * as <*>.
0.6 is what's known as a literal, but it's also a lexeme that gets mapped into a token, <0.6>. Just because it has no name does not mean it can't be represented as a valid token. We can also be more precise and represent it as a number token, <number, 0.6>, which gives the compiler information about its type and how to treat it later on.
If we put all the tokens together, we'll have something that looks along the lines of this: `<id, 1> <=> <id, 2> <+> <id, 3> <*> <number, 0.6>`
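To make the walkthrough concrete, here's a minimal toy lexer that handles just this statement (a hand-rolled sketch assuming only identifiers, numeric literals, and single-character operators; the real lexing techniques come later in the book):

```python
import re

# Toy token patterns: identifiers, numbers, and single-char operators.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<number>\d+(?:\.\d+)?)|(?P<op>[=+*]))"
)


def lex(source):
    """Scan source left to right, emitting (token-name, attribute) pairs."""
    symbol_table = {}  # identifier name -> 1-based entry index
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        pos = m.end()
        if m.lastgroup == "id":
            # Identifiers become <id, n>, with n interned in the symbol table.
            name = m.group("id")
            entry = symbol_table.setdefault(name, len(symbol_table) + 1)
            tokens.append(("id", entry))
        elif m.lastgroup == "number":
            tokens.append(("number", float(m.group("number"))))
        else:
            # Operators represent themselves, as decided in the text.
            tokens.append((m.group("op"), None))
    return tokens


print(lex("amount = paid + tip * 0.6"))
# [('id', 1), ('=', None), ('id', 2), ('+', None), ('id', 3), ('*', None), ('number', 0.6)]
```

Note that the lexer never asks what the statement means; it only recognizes character patterns and hands the resulting token stream off to the parser.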
See? It's not that complicated. We'll be covering actual lexing techniques later on in the book.