
INFORMATION SYSTEMS CONSULTANTS

5 Brewster Lane, Bellport, New York 11713-2803
Phone: +1-631-286-1339
E-mail: yaya@bernstein-plus-sons.com

November 26, 2000

Some Comments on Parsing for Computer Programming Languages

© Copyright 2000 All Rights Reserved
by
Herbert J. Bernstein
Originally published in part within the lecture notes
for a course in Computer Programming Languages in Spring 2000
Mathematics and Computer Science Department, St. Joseph's College,
Patchogue, NY, February 2000

The following are some supplemental notes on defining and parsing of Computer Programming Languages.

The defining and parsing of languages has a long and complex history, but current "industrial" practice by most programmers is based on use of commonly available tools for lexical scanning of tokens and parsing of grammars based on:

S. C. Johnson, YACC: Yet another Compiler-Compiler in UNIX Programmer's Manual, Supplementary Documents, 4.2 Berkeley Software Distribution, Virtual VAX-11 Version, March 1984, CSD, Dept. of EE and CS, U. of California, Berkeley, CA, originally published as Computer Science Technical Report No. 32, 1975, Bell Laboratories, Murray Hill, NJ.
M. E. Lesk and E. Schmidt, Lex - A Lexical Analyzer Generator in UNIX Programmer's Manual, Supplementary Documents, 4.2 Berkeley Software Distribution, Virtual VAX-11 Version, March 1984, CSD, Dept. of EE and CS, U. of California, Berkeley, CA, originally published as Computer Science Technical Report No. 39, 1975, Bell Laboratories, Murray Hill, NJ.

Two highly accessible descendants of these programs are

R. Corbett et al., bison, and

V. Paxson et al., flex

Syntax

Productions, types of grammars

Productions are rules for constructing sentences in a language

"Terminal symbols" are what actually appear in the language
For example, 'poodle' might be given as a string of terminal symbols in some language discussing dogs.
Non-terminal symbols are the higher level constructs of the language, e.g. sentences, clauses, etc.
For example, <dog> might be given as a non-terminal symbol in some language discussing dogs.
Productions may be used to infer rules for parsing the language
For example, <dog> ::= { 'poodle' | 'terrier' | 'bulldog' | 'greyhound' } might be given as a rule telling us what names of types of dogs we are allowed to write in this language.

Chomsky hierarchy

Type 0 -- (unrestricted) arbitrary strings on both sides of productions
allowing arbitrary transformations to produce strings in the language
Type 1 -- (context sensitive) |lhs| <= |rhs|
transforming a non-terminal symbol on the lhs to alternative strings of terminal and non-terminal symbols within a specified context of symbols on the left and right. In other words, the productions are only meaningful in the given context.
Type 2 -- (context free) single non-terminal symbol on the lhs
giving the transformations from the non-terminal symbol to alternative strings of terminal and non-terminal symbols within all contexts.
Type 3 -- (regular) a single non-terminal symbol on the lhs, and an rhs consisting of a terminal symbol with a single optional non-terminal on its left (left linear) or right (right linear)

BNF, parse trees, diagrams, EBNF
BNF and EBNF are commonly accepted ways to express productions of a context-free grammar. BNF originally stood for "Backus Normal Form". However, it was pointed out that this is not literally a "normal" form, and the acronym has come to stand for "Backus Naur Form". EBNF stands for "Extended Backus Naur Form".

context free grammar, lhs ::= rhs
Meaning that, to produce a sentence in the grammar, the single non-terminal symbol on the left-hand side (lhs) should be transformed into one of the alternative strings of terminal and/or non-terminal symbols on the right-hand-side (rhs). The first (or last) non-terminal symbol is the "root" of the grammar.
<> delimit non-terminals, bold or quotes for terminals, or "as is"
vertical bar for alternatives
For example, "<dots> ::= '.' | <dots> '.'" generates all strings of dots: ., .., ..., ...., and so on
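
As a rough illustration of how a production can be read both as a recipe for generating sentences and as a test for recognizing them, the following Python sketch follows the <dots> production directly. The function names is_dots and generate_dots are hypothetical, chosen only for this example.

```python
def is_dots(s):
    """Return True if s can be derived from <dots> ::= '.' | <dots> '.'."""
    if s == ".":
        # first alternative: a single dot
        return True
    # second alternative: a shorter <dots> followed by one more dot
    return len(s) > 1 and s.endswith(".") and is_dots(s[:-1])


def generate_dots(limit):
    """List the first `limit` sentences of the language, shortest first."""
    sentences = []
    current = "."                 # start from the non-recursive alternative
    for _ in range(limit):
        sentences.append(current)
        current = current + "."   # apply <dots> ::= <dots> '.' once more
    return sentences


if __name__ == "__main__":
    print(generate_dots(4))
    print(is_dots("..."), is_dots(".x."))
```

The recognizer peels one dot off the right-hand end at each step, mirroring the left-recursive second alternative of the production.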

Parse trees trace the flow of analysis of a sentence
Ambiguity = multiple parse trees for the same sentence
Syntax diagrams convey the same information about productions (think RR tracks)
EBNF

Square brackets, curly braces or parentheses for optional components, with various suffixes (*, +, ?) for repetition counts

Many conflicting variants. See http://www.cs.man.ac.uk/~pjj/bnf/bnf.html#EBNF
A reasonable convention is:

{} or () to group portions of a production for sub-alternatives and repeat counts
[] or // to indicate special terminal symbol productions called "regular expressions" (see below)
? * + 0-1, 0-infinity, 1-infinity repeat counts, respectively
| to separate alternative productions

For example "<dots> ::= {'.'}+"

Regular Expressions

Regular expressions are special productions of terminal symbols. As with BNF there is considerable variation in notation. In BNF, whitespace is not meaningful. In regular expressions, all characters, including blanks, are meaningful. Some symbols, such as the vertical bar "|", the backslash "\" and the caret "^", and, in some systems, parentheses, slashes, square brackets, periods, asterisks, plus signs, minus signs and dollar signs have special meaning. The special meanings of these characters are suppressed by use of the backslash as a quoting symbol. A reasonable convention is:

// outer level start and end markers for a regular expression. The string contained within them defines one or more sequences of terminal symbols
() to group portions of the string for repeat counts and alternatives
? * + 0-1, 0-infinity, 1-infinity repeat counts, respectively
. matches any printable character
[] enclosing a non-empty string of characters which are to be taken as alternatives without the use of vertical bars. Only ^ and \ have special meaning within the brackets, and ] may be included among the alternatives by making it the first character of the string.
^ as the first character within square brackets changes the meaning to be all characters which do not match any of the characters within the string. If followed immediately by ], the ] is one of the characters not to match.
- indicates a range of characters within the ASCII collating sequence.

For example, /[^]]/ would mean any single character not matching a close bracket, while /([^]])+/ would mean any string of one or more characters each of which does not match a close bracket.
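
These two patterns can be tried with Python's standard re module, which follows the same convention for a close bracket placed first inside square brackets. The surrounding slashes are only delimiters in the notation above and are not part of the Python pattern. The helper name text_before_bracket is made up for this sketch.

```python
import re

# One character that is not a close bracket.
SINGLE = re.compile(r"[^]]")
# One or more characters, none of which is a close bracket.
RUN = re.compile(r"([^]])+")


def text_before_bracket(s):
    """Return the longest prefix of s that contains no close bracket."""
    match = RUN.match(s)
    return match.group(0) if match else ""


if __name__ == "__main__":
    print(bool(SINGLE.fullmatch("a")), bool(SINGLE.fullmatch("]")))
    print(text_before_bracket("abc]def"))
```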

Parsing issues and precedence

Ambiguity -- alternative ways to parse the same sentence
Resolve with left-right, right-left and operator precedence rules
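
To make the ambiguity concrete, here is a small Python sketch that enumerates every parse tree for a token list under the deliberately ambiguous grammar <e> ::= <e> '-' <e> | number, and then keeps only the tree that a left-to-right (left-associative) rule would select. The grammar and all function names are assumptions made for this illustration.

```python
def parse_trees(tokens):
    """Return every parse tree for tokens under <e> ::= <e> '-' <e> | number."""
    if len(tokens) == 1:
        # a lone number is a leaf; anything else cannot be an <e>
        return [tokens[0]] if isinstance(tokens[0], int) else []
    trees = []
    # try each '-' in turn as the operator at the root of the tree
    for i in range(1, len(tokens) - 1):
        if tokens[i] == "-":
            for left in parse_trees(tokens[:i]):
                for right in parse_trees(tokens[i + 1:]):
                    trees.append((left, "-", right))
    return trees


def evaluate(tree):
    """Compute the value denoted by a parse tree."""
    if isinstance(tree, tuple):
        left, _, right = tree
        return evaluate(left) - evaluate(right)
    return tree


def is_left_associative(tree):
    """True if every right operand in the tree is a plain number."""
    if not isinstance(tree, tuple):
        return True
    left, _, right = tree
    return not isinstance(right, tuple) and is_left_associative(left)


if __name__ == "__main__":
    for tree in parse_trees([1, "-", 2, "-", 3]):
        print(tree, evaluate(tree), is_left_associative(tree))
```

The two trees for 1 - 2 - 3 evaluate to different values, which is why a language definition has to say which one is meant.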

Semantics

In addition to specifying the valid sentences in a language, we need to specify the meanings of those sentences. We will use the language Pascal to explore issues in the representation of semantics. In order to understand the language, see the Sun Workshop documentation at:

http://www.math.colostate.edu/manuals/sunpro/pascal/index.html

Some of the material we present is from:

I. R. Wilson and A. M. Addyman, A Practical Introduction to Pascal, 2nd ed., Springer-Verlag New York Inc., 1982, 239 pp.

Pascal

Created 1970 by Niklaus Wirth
Contrast these features to modern variants with symbolic labels, formal parameters clearly labelled with in and out, etc.

Program Structure

  program program_name ( file_identifier_list ) ;
  declarations ;
  begin
    statement ;
    statement ;
      ...
    statement 
  end .

Declarations

Start of blocks at any level
Label declarations, constant definitions, type definitions, variable declarations, procedure and function declarations

Label declarations (early Pascal)

  label
  digit_sequence  , 
  digit_sequence  , 
      ...
  digit_sequence  ;

Constant definitions

  const
  identifier  =  constant;
  identifier  =  constant;
      ...
  identifier  =  constant;

Type definitions

  type
  identifier  =  type-denoter;
  identifier  =  type-denoter;
      ...
  identifier  =  type-denoter;

Variable declarations

  var
  identifier_list  :  type-denoter;
  identifier_list  :  type-denoter;
      ...
  identifier_list  :  type-denoter;

Procedure and function declarations (given by simplified EBNF)

  <procedure_and_function_declarations> = { <procedure_declaration> | <function_declaration> }*
  <procedure_declaration> = <procedure_heading> ; <identifier> |
                            <procedure_heading> ; <block>
  <procedure_heading> = procedure <identifier> { ( <formal_parameter_list> ) }?
  <block> = <declarations> begin <statements> end

  <function_declaration> = <function_heading> ; <identifier> |
                           <function_heading> ; <block>
  <function_heading> = function <identifier> { ( <formal_parameter_list> ) }? ":" <type-denoter>

  <formal_parameter_list> = <formal_parameter_section> { ; <formal_parameter_section> }*
  <formal_parameter_section> = <value_parameters> |
                               <variable_parameters> |
                               <procedural_parameters> |
                               <function_parameters> |
                               <conformant_array_parameters>

  <value_parameters> = <identifier_list> : <type-denoter>
  <variable_parameters> = var <identifier_list> : <type-denoter>
  ...

Assignment Statement

  variable_identifier  :=  expression

Compound Statement

  begin
    statement ;
    statement ;
      ...
    statement 
  end

Repeat Statement (executed at least once until expression is true)

  repeat
    statement ;
    statement ;
      ...
    statement 
  until  expression

While Statement (executed only while expression is true)

  while expression do
    statement ;

For Statement (counting up from lower to and including higher expression)

  for variable_identifier  :=  expression to expression do
    statement ;

For Statement (counting down from higher to and including lower expression)

  for variable_identifier  :=  expression downto expression do
    statement ;

For loops calculate their limits once on entry and execute only if the conditions are satisfied.
See the GNU Pascal (gpc) compiler release for an example of a detailed grammar with semantics (in parse.y).
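
The rule that the limits of a for loop are calculated once on entry can be sketched in Python as follows. This is only a model of the behavior described above, not code from any Pascal compiler; the function pascal_for and its parameters are hypothetical, and the two limits are passed as zero-argument functions so that the number of evaluations can be observed.

```python
def pascal_for(lower, upper, body, downto=False):
    """Model `for v := lower to upper do body` (or downto).

    lower and upper are zero-argument functions standing for the two
    limit expressions; body is called with each value of v.
    """
    first = lower()                  # each limit is evaluated exactly once,
    last = upper()                   # on entry to the loop
    step = -1 if downto else 1
    v = first
    # the body runs only while the limit condition is satisfied,
    # so it may run zero times
    while (v >= last) if downto else (v <= last):
        body(v)
        v += step


if __name__ == "__main__":
    pascal_for(lambda: 1, lambda: 3, print)
    pascal_for(lambda: 3, lambda: 1, print, downto=True)
```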

Semantics

Two major issues in semantics

The semantics of the language per se
The semantics of programs written in the language

Static semantics

Primarily deals with issues of semantics of the language
Extend syntax rules to deal with relationships, meanings and values
- e.g. declaration of a symbol before use
- e.g. match of use of same symbol in two places
- Create symbol table, initialized with reserved words
  e.g. in old pascal: and, array, begin, case, const, div, ..
- Push state of the symbol table on entry to nested blocks
- Pop state of the symbol table on exit
- Search the symbol table for each symbol as it is encountered
  - If in declarations, should not be found, create an entry
  - If in statements, should be found, extract characteristics
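
The symbol table steps listed above can be sketched in Python as a stack of dictionaries, one per block. This is a minimal sketch under assumptions of my own: the class name SymbolTable, its method names, and the choice to allow an inner block to redeclare a name from an outer block are all illustrative, not taken from any particular compiler.

```python
class SymbolTable:
    """A stack of scopes; scope 0 holds the reserved words."""

    def __init__(self, reserved_words):
        reserved = dict.fromkeys(reserved_words, "reserved word")
        self.scopes = [reserved, {}]     # reserved words, then program level

    def push(self):
        """Enter a nested block."""
        self.scopes.append({})

    def pop(self):
        """Leave a nested block, discarding its declarations."""
        self.scopes.pop()

    def declare(self, name, characteristics):
        """In declarations: the symbol should not be found; create an entry."""
        if name in self.scopes[0] or name in self.scopes[-1]:
            raise NameError("cannot declare " + name)
        self.scopes[-1][name] = characteristics

    def lookup(self, name):
        """In statements: the symbol should be found; return its entry."""
        for scope in reversed(self.scopes):   # innermost block first
            if name in scope:
                return scope[name]
        raise NameError("undeclared symbol " + name)


if __name__ == "__main__":
    table = SymbolTable(["and", "array", "begin", "case", "const", "div"])
    table.declare("x", "integer")
    table.push()
    table.declare("x", "real")
    print(table.lookup("x"))
    table.pop()
    print(table.lookup("x"))
```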
Attribute grammars
- Attributes associated with grammar symbols
  - S(x) -- synthesized attributes (computed attributes for LHS of rules)
  - I(x) -- inherited attributes (computed attributes for RHS symbols)
- Semantic functions
  - x₀ ::= x₁...x_n
  - S(x₀) = f(a(x₁),...,a(x_n))
  - I(x_j) = f(a(x₀),a(x₁),...,a(x_(j-1)))
- Predicate functions -- required assertions about attributes
- Intrinsic attributes -- attributes assigned from outside the parse
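
The four kinds of attribute can be seen together in a small Python sketch for a single assignment of the form target := left + right. The production, the attribute names actual and expected, and the function check_assignment are assumptions made for this illustration only.

```python
def check_assignment(target, left, right, types):
    """Decorate `target := left + right` with attributes.

    types maps variable names to 'int' or 'real'; it stands for
    information supplied from outside the parse.
    """
    # intrinsic attributes: assigned from outside the parse
    left_type = types[left]
    right_type = types[right]
    # synthesized attribute of the expression, computed from its children
    actual = "int" if left_type == right_type == "int" else "real"
    # inherited attribute of the expression, passed down from the assignment
    expected = types[target]
    # predicate function: an assertion that must hold for a valid sentence
    return {"actual": actual, "expected": expected, "ok": actual == expected}


if __name__ == "__main__":
    print(check_assignment("a", "b", "c", {"a": "real", "b": "int", "c": "real"}))
```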
Dynamic semantics
- Useful both for semantics of the language and for programs
- Focus on the execution time actions (transformations) of a program
- Operational semantics
  - Describe statements in an actual or virtual lower level language
    - e.g. represent
```
          while expression do
          statement ;
          
```
      with
```
          label:  if  expression then begin
          statement ;
          goto label end
          
```
  - Vienna Definition Language for PL/I
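
The translation of a while statement into a test and a goto, as described above, can be imitated in Python with a tiny virtual machine. The instruction names jump_if_false, exec and jump, and the functions lower_while and run, are invented for this sketch; they are not part of any real lower level language.

```python
def lower_while(cond, body):
    """Lower `while cond do body` to three virtual instructions."""
    return [
        ("jump_if_false", cond, 3),   # label: if not expression, leave the loop
        ("exec", body),               #        statement
        ("jump", 0),                  #        goto label
    ]


def run(program, state):
    """Execute a virtual program, updating and returning the state."""
    pc = 0                            # program counter
    while pc < len(program):
        instr = program[pc]
        if instr[0] == "jump_if_false":
            pc = pc + 1 if instr[1](state) else instr[2]
        elif instr[0] == "exec":
            instr[1](state)
            pc += 1
        else:                         # "jump"
            pc = instr[1]
    return state


def add_and_count_down(state):
    """Example loop body: accumulate n, then decrement it."""
    state["total"] += state["n"]
    state["n"] -= 1


if __name__ == "__main__":
    program = lower_while(lambda s: s["n"] > 0, add_and_count_down)
    print(run(program, {"n": 3, "total": 0}))
```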
Axiomatic semantics
- Primarily to understand the semantics of programs
- Attempt to improve program quality and reliability by proving assertions about relationships between program inputs and outputs
- Precondition, statement, postcondition
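
A precondition, statement and postcondition can at least be tested against sample states, as in the following Python sketch. Note the assumption: this checks the assertion on a finite set of samples and is therefore a test, not the proof that axiomatic semantics aims for. The names triple_holds and increment are hypothetical.

```python
def triple_holds(pre, statement, post, samples):
    """Test {pre} statement {post} on sample states (a check, not a proof)."""
    for state in samples:
        if pre(state):
            after = statement(dict(state))   # run on a copy of the state
            if not post(after):
                return False                 # found a counterexample
    return True


def increment(state):
    """The statement y := x + 1."""
    state["y"] = state["x"] + 1
    return state


if __name__ == "__main__":
    samples = [{"x": x, "y": 0} for x in range(-5, 6)]
    # {x > 0} y := x + 1 {y > 1}
    print(triple_holds(lambda s: s["x"] > 0, increment, lambda s: s["y"] > 1, samples))
```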
Denotational semantics
- More formal dynamic semantics for semantics of programs
- Uses mappings from grammar symbols to well-understood mathematical functions (see below)

Denotational Semantics

We use the term "state" to refer to the aggregate of information that is stored in the memory of the computer. As instructions are executed this information changes. Therefore we can say that the meaning of an instruction i is a function M(i,s) which depends on the instruction and on the state s before execution, and which returns the state after execution. If we call I the set of possible instructions and S the set of possible states, then we specify the denotational semantics of the instructions by

M: I x S -> S

If two instructions i1 and i2 are executed in sequence, denoted by i1;i2, then the composition of the mappings is performed by

M(i1;i2,s) = M(i2,M(i1,s))

Note that the two instructions appear in the reverse order on the right hand side of the definition.
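
The composition rule can be tried out in Python with a toy instruction set. In this sketch a state is a dictionary, a primitive instruction is any function from states to states, and a sequence i1;i2 is written as the tuple ("seq", i1, i2); all of these representations are assumptions made for the example.

```python
def M(instruction, state):
    """Meaning function M: I x S -> S for a toy instruction set."""
    if isinstance(instruction, tuple) and instruction[0] == "seq":
        _, i1, i2 = instruction
        return M(i2, M(i1, state))    # M(i1;i2,s) = M(i2,M(i1,s))
    return instruction(state)         # a primitive maps a state to a state


def increment(state):
    """x := x + 1, returning a new state."""
    return {**state, "x": state["x"] + 1}


def double(state):
    """x := x * 2, returning a new state."""
    return {**state, "x": state["x"] * 2}


if __name__ == "__main__":
    print(M(("seq", increment, double), {"x": 3}))
    print(M(("seq", double, increment), {"x": 3}))
```

Running the two sequences in opposite orders gives different final states, which shows why the order of composition matters.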

In order to write definitions of mappings, we need access to the information stored. For simplicity of expression, we wave our hands a bit and define a function V from states to values which returns the "last" value stored. Thus, if <expr> is a sequence of instructions forming an expression, then V(M(<expr>,s)) is the value of that expression.

Let us consider the semantics of a C "while" statement:

M( while (<expr>) <statement>, s ) =
if ( V(M(<expr>,s)) = true) then
M( while (<expr>) <statement>, M( <statement>, M(<expr>,s)))
else
M(<expr>,s)
endif

Note that M(<expr>,s) appears three times. This does not imply three evaluations of the expression. Rather M(<expr>,s) is the state of the computer after evaluation of the expression starting in state s. We could just as well have written:

M( while (<expr>) <statement>, s ) =
t = M(<expr>,s)
if ( V(t) = true) then
M( while (<expr>) <statement>, M( <statement>, t))
else
t
endif

Thus the meaning of this while statement is operationally the same as the following steps:

1. Evaluate the expression.
2. If the result is true, then execute the <statement> and return to the loop test.
3. If the expression is false, continue execution in the context resulting from evaluation of the expression.

Similarly a C "for" statement is approximately:

M( for (<expr1>; <expr2>; <expr3>) <statement>, s ) =
t = M(<expr2>,M(<expr1>, s))
if ( V(t) = true ) then
M( for (; <expr2>; <expr3>) <statement>, M (<expr3>, M(<statement>, t)))
else
t
endif
