Compiler Notes: July 2011

Tuesday, July 5, 2011

Parser

A parser reads the token stream (mostly generated by the lexer) and matches (phrases) it against parser rules. The output is usually an Abstract Syntax Tree (AST). Each matched phrase could also result in some custom action being taken.

The AST does not contain all the information from the source, such as parentheses, etc. This information may be present in the concrete syntax tree. A parser may produce either an AST or a concrete syntax tree, based on ruled defined (?)?

ANTLR's AST processing is pretty powerful. We can write tree grammar, which will generate a tree parser which can contain custom actions, or String template output statements.

In some implementations the parsing stage may not produce an AST at all. We may build custom actions, which could be performed in the parsing stage.

Lexer

A lexer takes as input a character or binary stream, and converts it into tokens using lexer rules.

Questions:

Can lexer rules contain non terminals on the LHS (I do not think so ... )?
Specific to ANTLR, are tokens considered part of Lexer rules?

Monday, July 4, 2011

Grammar

Grammar is the formal definition of the rules of a language. The grammar which is used for defining programming languages, is known as context free grammar. The notation to denote context free grammar is BNF or Backus-Naur form. Most grammars now use the E-BNF or extended Backus Naur form.

Production rules in context free grammar are specified in the following way.

V → w

Where V is a single non terminal symbol, and w is a string of terminals and or non terminal symbols. w can also be empty. My understanding is that when some text is parsed, these rules are applied, such that 'w' can be replaced by V in the AST.

Languages that are described by context free grammars exhibit block structure. This means that sentences are recursively composed of smaller phrases, and eventually words or word elements.

A context free grammar provides a simple and mathematically precise way for describing the methods in which phrases can be built from smaller blocks, and for capturing those phrases.

Block structure was introduced in computer programming languages in the Algol project.

context free grammars allow the construction of parsing algorithms which, for a given String, can determine if and whether it can be constructed using the rules of the grammar. Notice that we talk of the String being constructed from the grammar rules, and not the other way around.

Formal definition of context free grammar:

A context-free grammar G is defined by the 4-tuple:

$G = (V\,, \Sigma\,, R\,, S\,)$ where

Where

V - A set of non terminal symbols. Each such symbols represents a different type of clause or phrase in the sentence.

$\Sigma\,$ - Is a finite set of terminals, disjoint from V (also called the alphabet of the language).

$R\,$ - Is the set of rewrite rules with a non terminal on the LHS and a string of non terminals and terminals on the RHS.

$S\,$ - Is the start variable used to represent the whole sentence. It must be part of V.

Questions:

Need to understand the difference between context free and context sensitive grammar?

Insights:

The * category of symbols (in EBNF) can also be represented as recursion in BNF.

Starting to learn compilers

I am starting to learn about compilers. There are a bunch of resources, which is a bit confusing. This blog will contain my notes as I try to make sense of concepts related to compilers, grammer, tools (like ANTLR, JavaCC, etc)

Since I have learned this subject when was in undergrad (a long time back), I at least understand some of the concepts, and remember some stuff. However, I learned it in a very theoretical way, so I have absolutely no grounding in the practice of CC, and again since most of my knowledge is theoretical, I have also forgotten most of it.

Below is a list of things I think I will need to understand:

What is a context free grammar?
What is the significance of context free, in context free grammar?
What is a lexer?
What is a parser?
How do we walk a parser tree?
How do we introduce semantic rules in the AST?
How do we interpret langauges?