Grammars, Syntax, Semantics, and AST

Abstract: A quick intro to grammars, syntax, derivation trees, and ASTs. I use these notes in multiple courses.

1 Grammars

  1. A language is a set of sentences, almost always an infinite set. Syntax rules specify which sentences are in the language.
  2. Syntax: the form of expressions, statements, and program units. Usually in textual form. Often in abstract syntax trees (AST).
  3. Syntax rules use terminals (aka tokens, lexemes) and non-terminals.
  4. Lexemes: sequences of characters without "delimiters" and obeying certain rules. Examples: operators (+, *, …), numbers, identifiers, literals, reserved words.
  5. Semantics: the meaning of expressions, statements, and program units.
  6. Grammar: A collection of production rules that generate the sentences of a language.
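
As a minimal sketch of item 4, the following Python snippet splits a string of characters into lexemes using the standard `re` module. The token classes (`NUMBER`, `IDENT`, `OP`) and the function name `tokenize` are assumptions made for this illustration; a real language defines many more classes.

```python
import re

# Hypothetical token classes for a tiny expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r":=|[+*()]"),
    ("SKIP", r"\s+"),  # white space acts as a delimiter and is discarded
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, lexeme) pairs; raise on an illegal character."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"illegal character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

For example, `tokenize("A:=B")` yields the three lexemes `A`, `:=`, and `B`, even though the input contains no white space.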

1.1 Chomsky Hierarchy

chomsky-hierarchy.png

Figure 1: Chomsky Hierarchy

  1. Levels of Languages: Regular (L3), Context-Free (L2), Context-Sensitive (L1), Recursively Enumerable (L0).

1.2 Context Free Grammars (CFG)

  1. Defined using Terminals (Tokens aka Lexemes), and Non-Terminals
  2. Start Symbol (a non-terminal)
  3. Non-terminals are also known as Syntactic Categories. They are called non-terminals because a derivation that still contains one has not ended yet.
  4. There is no "context"
    1. A grammar production rule: LHS ::= RHS
    2. LHS is a single non-terminal (i.e., without context)
    3. RHS is a seq of terminals and non-terminals
  5. Limitations of CFG
    1. Defines syntax only up to a "level"
    2. Cannot capture "context"
      1. E.g., "variable should be declared before use"
  6. CFGs do not define semantics
    1. minor exceptions exist
  7. Every modern PL has a CFG, often several.
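
The "declared before use" limitation can be illustrated with a small sketch. Because a CFG cannot express this rule, compilers typically check it in a separate pass that carries context, here a set of declared names. The statement representation (pairs of `"decl"` or `"use"` and a name) and the function name `undeclared_uses` are assumptions made for this example only.

```python
def undeclared_uses(statements):
    """statements: list of ("decl", name) or ("use", name) pairs in program order.

    Returns the names that are used before any declaration of that name.
    """
    declared = set()  # the "context" that a CFG production cannot carry
    errors = []
    for kind, name in statements:
        if kind == "decl":
            declared.add(name)
        elif kind == "use" and name not in declared:
            errors.append(name)
    return errors
```

Every statement sequence accepted here could be syntactically valid; only the context-dependent check distinguishes the erroneous ones.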

2 BNF Notation

  1. BNF (Backus-Naur Form) is a specific notation for writing down a context-free grammar. Named after its originators: Backus (Turing Award winner) and Naur (Turing Award winner).
  2. Examples of BNF rules:
    1. <identList> → identifier | identifier, <identList>
    2. <ifStmt> → if <logicExpr> then <stmt>
  3. LHS → RHS
    1. The LHS is a non-terminal
    2. The RHS consists of a seq of terminals and non-terminals
    3. BNF shows non-terminals within < >
    4. The terminals are aka lexemes.
  4. There are many BNF variations/extensions.
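
To make the idea of "generating" sentences concrete, here is a small sketch that stores the `<identList>` rule as data and performs a leftmost derivation. The dictionary layout and the function name `leftmost_derivation` are assumptions of this example, not a standard API.

```python
# Each non-terminal (LHS) maps to its list of alternative RHS sequences.
GRAMMAR = {
    "identList": [
        ["identifier"],                    # alternative 0
        ["identifier", ",", "identList"],  # alternative 1
    ],
}


def leftmost_derivation(grammar, start, choices):
    """Expand the leftmost non-terminal once per entry in choices.

    Returns every sentential form, beginning with the start symbol.
    """
    form = [start]
    steps = [list(form)]
    for choice in choices:
        idx = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if idx is None:
            raise ValueError("no non-terminal left to expand")
        form = form[:idx] + grammar[form[idx]][choice] + form[idx + 1:]
        steps.append(list(form))
    return steps
```

With `choices = [1, 1, 0]` the derivation ends in the sentence `identifier , identifier , identifier`, which contains terminals only.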

2.1 Conventions

  1. The production rules generate a sequence of tokens. If the derivation starts from the start symbol, the sequence generated is a sentence in the language.
  2. A rendering of a sequence of tokens as a string of characters will (usually) separate the tokens by non-empty white space.
  3. Sometimes the lexical structure is specified using CFGs. Here, the terminals are characters, and there is no white space separation.

2.2 Modern Notations

  1. Simplify the BNF notation.
  2. Drop <> from non-terminals.
  3. Show terminals in a different font, or quote them.
  4. Use {} or * for repetitions.
  5. See the Java Grammar (below).
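
As an illustration of item 4, the recursive BNF rule for `<identList>` can be written with repetition as `identList ::= identifier { "," identifier }`. In a hand-written recognizer the braces become a loop, as the following sketch shows. The function name `is_ident_list` and the representation of tokens as plain strings are assumptions.

```python
def is_ident_list(tokens):
    """Recognize  identList ::= identifier { "," identifier }  over a token list."""
    if not tokens or tokens[0] != "identifier":
        return False
    pos = 1
    # The { ... } repetition of the rule becomes a loop.
    while pos < len(tokens):
        if tokens[pos] != "," or pos + 1 >= len(tokens) or tokens[pos + 1] != "identifier":
            return False
        pos += 2
    return True
```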

3 Grammars of Real PLs

3.1 Best Practice

  1. Syntax of most languages is "specified" using context-free grammars.
  2. Almost always incompletely (because we use context-free grammars).
  3. Almost never ambiguously (unique derivation trees, described below). The word ambiguous is not for semantics, but for parsing.

3.2 Java

  1. Chapter 2. Grammars [from Oracle Java] "This chapter describes the context-free grammars used in this specification to define the lexical and syntactic structure of a program."
  2. Chapter 18. Java [from Oracle Java] "This chapter presents a grammar for the Java programming language."
  3. https://kotlinlang.org/docs/reference/grammar.html Kotlin grammar

3.3 C++/ C

  1. http://www.open-std.org/JTC1/SC22/WG14/ C
  2. http://www.open-std.org/JTC1/SC22/WG21/ C++
  3. Standard for Programming Language C++, Working Draft, 1600+ pp. 2018.

4 Semantics of Languages

  1. Semantics requires context, and more.
  2. Grammars do not "do" semantics
    1. Attribute Grammars do this partially.
  3. Semantics of most PLs is "specified" using carefully worded English prose.
    1. This is (almost always) incomplete.
    2. This is (almost always) unclear / ambiguous.
    3. This is (sometimes) contradictory.
  4. Techniques
    1. Operational Semantics
    2. Axiomatic Semantics
    3. Declarative Semantics
    4. Denotational Semantics
    5. Unfortunately, almost no widely used PL has its complete semantics defined as above; Standard ML is a rare exception.

5 Derivation Trees

  1. The syntax analyzer component of a compiler determines whether the sequence of characters (in a file) is syntactically a sentence. It does this by first tokenizing (aka lexical analysis), and then parsing.
  2. A parser constructs a derivation tree.
  3. The leaves are terminals and internal nodes are non-terminals.
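
The following is a minimal sketch of a recursive-descent parser that builds a derivation tree as nested tuples. The grammar is an assumption chosen for this example (`expr ::= term { "+" term }`, `term ::= factor { "*" factor }`, `factor ::= "(" expr ")" | operand`), and all function names are hypothetical. Internal nodes are tuples whose first element is the non-terminal; leaves are token strings.

```python
def parse_expr(tokens, pos=0):
    """expr ::= term { '+' term }. Returns (tree, next_pos)."""
    tree, pos = parse_term(tokens, pos)
    children = [tree]
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        children += ["+", right]
    return ("expr", *children), pos


def parse_term(tokens, pos):
    """term ::= factor { '*' factor }. Returns (tree, next_pos)."""
    tree, pos = parse_factor(tokens, pos)
    children = [tree]
    while pos < len(tokens) and tokens[pos] == "*":
        right, pos = parse_factor(tokens, pos + 1)
        children += ["*", right]
    return ("term", *children), pos


def parse_factor(tokens, pos):
    """factor ::= '(' expr ')' | operand, where an operand is any alphanumeric token."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        inner, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return ("factor", "(", inner, ")"), pos + 1
    if tok.isalnum():
        return ("factor", tok), pos + 1
    raise SyntaxError(f"unexpected token {tok!r}")


def parse(tokens):
    """Parse a whole token list; reject trailing tokens."""
    tree, pos = parse_expr(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


def leaves(tree):
    """Collect the terminals at the leaves, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[1:] for leaf in leaves(child)]
```

Reading the leaves of `parse("x + 3 * y".split())` from left to right gives back the original token sequence, which is the defining property of a derivation tree.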

5.1 Derivation Tree Example #1

  1. A Grammar and an Example Derivation Tree. The grammar has just one production rule. der-s.png

5.2 Derivation Tree Example #2

<assign> -> <id> := <expr>
<id> -> A | B | C
<expr> -> <expr> + <expr>
  | <expr> * <expr>
  | ( <expr> )
  | <id>

ast-assign-stmt.png

Figure 2: One Derivation Tree of A := B + C * A (Others are possible)
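
The grammar of this example is ambiguous, which is why other derivation trees are possible. The sketch below is a brute-force enumerator, not a practical parser: it tries every operator position as the root of an `<expr>` and collects all resulting trees. The names `expr_trees` and `assign_trees` and the tuple representation of trees are assumptions made for this illustration.

```python
from functools import lru_cache

IDS = {"A", "B", "C"}  # <id> -> A | B | C


def expr_trees(tokens):
    """Return every derivation tree of <expr> for the given token sequence."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def trees(i, j):
        found = []
        span = tokens[i:j]
        # <expr> -> <id>
        if len(span) == 1 and span[0] in IDS:
            found.append(("expr", ("id", span[0])))
        # <expr> -> ( <expr> )
        if len(span) >= 3 and span[0] == "(" and span[-1] == ")":
            for inner in trees(i + 1, j - 1):
                found.append(("expr", "(", inner, ")"))
        # <expr> -> <expr> + <expr> | <expr> * <expr>
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                for left in trees(i, k):
                    for right in trees(k + 1, j):
                        found.append(("expr", left, tokens[k], right))
        return tuple(found)

    return list(trees(0, len(tokens)))


def assign_trees(tokens):
    """Return every derivation tree of <assign> -> <id> := <expr>."""
    tokens = list(tokens)
    if len(tokens) < 3 or tokens[0] not in IDS or tokens[1] != ":=":
        return []
    return [("assign", ("id", tokens[0]), ":=", t) for t in expr_trees(tokens[2:])]
```

For the token sequence `A := B + C * A` this enumeration finds two trees: one with `+` at the root of the expression and one with `*` at the root.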

5.3 Derivation Tree Example #3

  1. Derivation tree of x + 3 * y
  2. From https://www.student.cs.uwaterloo.ca/~cs241/cfg/cfg.html ;; derivation-tree.png
  3. In the above CFG,
    1. ::= and <> are omitted
    2. id, +, *, # are lexemes

6 Abstract Syntax Tree (AST)

  1. A parser constructs a derivation tree. A later component of the compiler, which has no standard name, then transforms the derivation tree into an AST. Semantic analyses, code generation, etc. traverse the AST.
  2. None of the AST nodes are non-terminals. Each node is either a terminal of the grammar or a specially introduced node that is not a non-terminal.
  3. Unfortunately, ASTs have not been "standardized".
  4. Eclipse Java development tools (JDT) uses ASTs. Popular.
  5. JetBrains has https://www.jetbrains.com/idea/ Java/Kotlin IDE. Uses ASTs. Popular.
  6. JetBrains has https://www.jetbrains.com/mps/ Meta Programming System. For the development of DSLs. Uses ASTs. Popular.
  7. Given an AST, its standardized textual version can be produced algorithmically by traversing the AST. IntelliJ IDEA, Eclipse, … do this.
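
Item 7 can be sketched in a few lines. The AST representation used here (a tuple `(op, left, right)` for an operator node, anything else for a leaf), the precedence table, and the function name `unparse` are assumptions for this example; real IDEs use far richer node types.

```python
PRECEDENCE = {"+": 1, "*": 2}


def unparse(node, parent_prec=0):
    """Render a tuple-based AST as text, adding parentheses only where needed."""
    if not isinstance(node, tuple):
        return str(node)  # leaf: identifier or number
    op, left, right = node
    prec = PRECEDENCE[op]
    # The right operand is rendered at prec + 1 so that a right-nested
    # operator of equal precedence keeps its parentheses.
    text = f"{unparse(left, prec)} {op} {unparse(right, prec + 1)}"
    return f"({text})" if prec < parent_prec else text
```

For instance, `unparse(("*", ("+", "x", 3), "y"))` produces `(x + 3) * y`. The parentheses are not stored in the AST; the traversal regenerates them from the tree shape.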

6.1 Example AST of an Arith Exp

  1. An AST example of x + 3 * y

      +
     / \
    x   *
       / \
      3   y
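
The same shape can be observed with Python's standard `ast` module, since `x + 3 * y` is also a valid Python expression. This is Python's own AST (Python 3.8 or later is assumed for the `ast.Constant` node type), shown only as one concrete instance; other languages and tools use different node types.

```python
import ast

tree = ast.parse("x + 3 * y", mode="eval")
root = tree.body  # the node for '+', whose right child is the node for '*'

# Print the nested node structure; the exact formatting varies by Python version.
print(ast.dump(root))
```

The root is a `BinOp` node with operator `Add`; its right child is another `BinOp` with operator `Mult`, matching the drawing above.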
    

6.2 Example AST of an Arith Exp #2

ast-exp.png

6.3 Example AST From JTransformer FOSS

ast-jtransformer.jpg

Figure 4: From JTransformer (Spot any error?)

6.4 Example AST: Euclid's Algorithm for GCD

ast-euclid.png

Figure 5: Euclid's Algorithm for GCD

  1. Drawings of ASTs often do not show the symbol tables, but the tables are part of the AST.
  2. In the above AST: a, b are positive integers initialized by the caller.
  3. Exercise: Deduce the source code from the above AST

7 References

  1. Oracle, https://docs.oracle.com/javase/specs/jls/se8/html/jls-2.html, Chapter 2. Grammars. Chapter 18. Java. Reference.
  2. https://kotlinlang.org/docs/reference/grammar.html Kotlin grammar. Reference.
  3. Alessio Marchetti, http://www.nongnu.org/hcb/ Hyperlinked C++ BNF Grammar. 2018. Reference.
  4. http://www.open-std.org/JTC1/SC22/WG14/ C; http://www.open-std.org/JTC1/SC22/WG21/ C++. Reference.

8 End


Copyright © 2018 pmateti@wright.edu www.wright.edu/~pmateti 2018-06-21 2020-09-09