[2.0] The Structure of a Compiler

05/07/2021 • 30 minutes to read

In the last section, we treated the compiler as a mystical black box that simply "does this and outputs that", but in these upcoming pages, we'll be going over the nitty-gritty processes that the compiler takes when going from source code all the way to executable file. When we open up this magical box, we see that there is actually not a lot going on at first glance, but this will fool you because digging deeper we find so much more! Let's talk about compiler phases briefly, there are two phases that a compiler goes through -- the Analysis Phase, which is known as the Front End -- and the -- Synthesis Phase, which is known as the Back End -- both of these phases have several steps to go through. Let's start with the front-end.

Analysis Phase

During the Analysis Phase, the compiler reads the source program, lexes the characters (tokenization), parses the tokens (AST construction), performs semantic analysis which is just a fancy way of saying the syntax tree is getting "annotated" and checked for grammatical errors. A programming-languages syntax tree is a lot like the structure of a spoken language, there are certain rules that can't be broken. The big difference is that when a programming-languages syntax is wrong you'll get a detailed error message -- in spoken languages, you'll just confuse people and get funny looks. The last step of this phase is the generation of the "Intermediate" code. If you've never read the "Assembly" code your compiler spits out then this will be a very strange concept. Intermediate code is a near one-to-one representation (not exact by any means) of the assembly that will be generated at the last stage, the biggest difference is that you have more control over how this code is structured.

whew, that was a mouthful! Don't be ashamed to read it a few more times to fully grasp the concept, we will be covering lexing, syntax analysis, semantic analysis, and intermediate code generation in the upcoming pages though!

Before moving onto the next phase, I'd like to give you a better understanding of what exactly a "Syntax Tree" even is. Let's take a basic expression and convert it into two examples of how trees are commonly depicted. a + b * c

Notice how we have a multiplication operator on the right-hand side:

BinaryExpression
├──NumberExpression
│   └──NumberToken a
├──PlusToken
└──BinaryExpression
    ├──NumberExpression
    │   └──NumberToken b
    ├──StarToken
    └──NumberExpression
        └──NumberToken c

As you can see, syntax trees work on a level of "nesting", this is what defines their precedence, most trees are built using "recursive-descent" (this is something we'll cover later). Another version of a syntax tree that's often used is:

A naive parser would evaluate this expression without honoring operator precedence, so the following expression would be turned into something like this:

BinaryExpression
├──BinaryExpression
│   ├──NumberExpression
│   │   └──NumberToken a
│   ├──PlusToken
│   └──NumberExpression
│       └──NumberToken b
├──StarToken
└──NumberExpression
    └──NumberToken c

Synthesis Phase

When we reach the Synthesis Phase all of the formalities have been done, and we've been handed a really nice intermediate representation of our source code, ready to be optimized and turned into actual assembly or machine code. The optimization stage is actually optional. Really? Yep! If the intermediate representation that we've been given from the front-end is good enough, the compiler may not even need to optimize. Consider the following: Is it worth optimizing your source code if the compiler is just going to do it for me? (got your answer? hold onto it till the end). By looking at the diagram below (see fig. 2) you'll see that we have optimization, code generation, then more optimization, that can't be right, can it? It can and is! If the compiler needs to optimize the intermediate representation from the front-end it will do so to make code generation easier, after an actual assembly language is generated the compiler can double back to make sure nothing was missed, and finish the optimizations that may be required. After all of this has been completed, and the compiler has not encountered any errors, it will finally write the executable to disk for you to run!

The last tiny bit of information I'd like to touch upon is an example of what may happen during the optimization stages.

# Question: How to optimize the given code? 
a = intToFloat(5) 
b = c * a
d = e + b
final = d

# Answer: This code can be optimized down to the following;
b = c * 5
final = e + b
# Notice how we've trimmed the identifier 'a' entirely! that's because it was a 
# constant value, the compiler can slot in the value for constants! Cool right?

Phew, is it over? Yes it is! you now know all the basic steps a compiler performs just to print your "hello world" program! In figure 3 you can see the full structure of the compiler when both front-end and back-end have been merged together. Take time to study and memorize this, because we'll be using it as a guideline throughout this book.

Summary

This was a really heavy page that covered quite a lot of content, let's go over the key points that you need to remember.

What is the output of a Lexical Analyzer? A stream of tokens, this stream consists of identifiers, keywords, separators, operators, and literals.

What is the output of a Syntax Analyzer? The parser analyzes the tokens streams against a "context-free" grammar to check for errors and generate an Abstract Syntax Tree.

What is the output of a Semantic Analyzer? An annotated syntax tree, holding information such as code comments and identifier names in a more strict structure.

What is the output of Intermediate Code Generation? A very linear representation of the syntax tree.

Why is code optimization call an optional phase? The intermediate representation may already be good enough to where optimizations aren't even needed, so this stage would be entirely pointless to try and optimize already optimized code.

What is the output of code generation? The target platform, in most cases an "executable".

Is it worth optimizing your source code if the compiler will just do it for you? Nope, it's not. Developers will often fall into the trap of preemptively optimizing their code when in reality they're performing micro-optimizations that the compiler will deal with anyways!

(Extra) What is a Pass in a compiler? A pass refers to the traversal of the program that a compiler may do, such as walking syntax trees to find specific nodes.

[2.0] The Structure of a Compiler

PreviousQuiz [1.0]Next[2.1] Lexical Analysis

Last updated 4 years ago

Was this helpful?