5 Brewster Lane, Bellport, New York 11713-2803
Phone: +1-631-286-1339
E-mail: yaya@bernstein-plus-sons.com
November 26, 2000
The following are some supplemental notes on defining and parsing of Computer Programming Languages.
The defining and parsing of languages has a long and complex history, but current "industrial" practice by most programmers is based on use of commonly available tools for lexical scanning of tokens and parsing of grammars based on:
Two highly accessible descendants of these programs are
Syntax
BNF and EBNF are commonly accepted ways to express productions of a context-free grammar. BNF originally stood for "Backus Normal Form". However, it was pointed out that this is not literaly a "normal" form, and the acronym has come to stand for "Backus Naur Form". EBNF stands for "Extended Backus Naur Form".
<dots> | \ <dots> . | \ \ . . .
{} or () | to group portions of a production for sub-alternatives and repeat counts |
[] or // | to indicate special terminal symbol productions call "regular expressions" (see below) |
? * + | 0-1, 0-infinity, 1-infinity repeat counts, respectively |
| | to separate alternative productions |
For example "<dots> ::= {'.'}+"
Regular expressions are special productions of terminal symbols. As with BNF there is considerable variation in notation. In BNF, whitespace is not meaningful. In regular expressions, all characters, including blanks are meaningful. Some symbols, such as vertical bar "|", backslash "\" and the caret "^", and, in some systems, parentheses, slashes, square brackets, periods, asterisks, plus signs, minus signs and dollar signs have special meaning. The special meanings of these characters is suppressed by use of the backslash as a quoting symbol. A reasonable convention is:
// | outer level start and end markers for a regular expression. The string contained with it defines one or more sequences of terminal symbols |
() | to groups portions of the string for repeat counts and alternatives |
? * + | 0-1, 0-infinity, 1-infinity repeat counts, respectively |
. | matches any printable character |
[] | enclosing a non-empty string of characters which are to be taken as alternatives without the use of vertical bars. Only ^ and \ have special meaning within the brackets, and ] may be included among the alternatives by making it the first character of the string. |
^ | as the first character within square brackets changes the meaning to be all characters which do not match any of the characters within the string. If followed immediately by ], the ] is one of the characters not to match. |
- | indicates a range of characters within the ASCII collating sequence. |
For example, /[^]]/ would mean any single character not matching a close bracket, while /([^]])+/ would mean any string of one or more characters each of which does not match a close bracket.
In addition to specifying the valid sentences in a language, we need to specify the meanings of those sentences. We will use the language Pascal to explore issues in the representation of semantics. In order to understand the language, see the Sun Workshop documentation at:
http://www.math.colostate.edu/manuals/sunpro/pascal/index.html
Some of the material we present is from:
program program_name ( file_identifier_list ) ; declarations ; begin statement ; statement ; ... statement end .
label digit_sequence , digit_sequence , ... digit_sequence ;
const identifier = constant; identifier = constant; ... identifier = constant;
type identifier = type-denoter; identifier = type-denoter; ... identifier = type-denoter;
var identifier_list : type-denoter; identifier_list : type-denoter; ... identifier_list : type-denoter;
<procedure_and_function_declarations> = { <procedure_declaration> | <function_declaration> }* <procedure_declaration> = <procedure_heading> ; <identifier> | <procedure_heading> ; <block> <procedure_heading> = procedure <identifier> { ( <formal_parameter_list> ) } <block> = <declaractions> begin <statements> end <function_declaration> = <function_heading> ; <identifier> | <function_heading> ; <block> <function_heading> = function <identifier> { ( <formal_parameter_list> ) } ":" <type_denoter> <block> = <declaractions> begin <statements> end <formal_parameter_list> = <formal_parameter_section> { ; <formal_parameter_section> } <formal_parameter_section> = <value_parameters> | <variable_parameters> | <procedural_parameters> | <function_parameters> | <conformant_array_parameters> | <value_parameters> = <identifier_list> : <type-denoter> <variable_parameters> = var <identifier_list> : <type-denoter> ...
variable_identifier := expression
begin statement ; statement ; ... statement end
repeat statement ; statement ; ... statement until expression
while expression do statement ;
for variable_identifier := expression to expression do statement ;
for variable_identifier := expression downto expression do statement ;
while expression do statement ;
label: if expression then begin statement ; goto label end
We use the term "state" to refer to the aggregate of information that is stored in the memory of the computer. As instructions are executed this information changes. Therefore we can say that the meaning of instruction, i, is a function M(i,s) which depends on the instruction and the state, s before execution and which returns the state after execution. If we call I the set of possible instructions and S the set of possible states, then we specify the denotational semantics of the instructions by
M: I x S -> S
If two instructions i1 and i2 are executed in sequence, denoted by i1;12, then the composition of the mappings is performed by
M(i1;i2,s) = M(i2,M(i1,s))
Note that the two instructions appear in the reverse order on the right hand side of the definition.
In order to write definitions of mappings, we need access to the information stored. For simplicity of expression, we wave our hands a bit and define a function V from states to values which returns the "last" value stored. Thus, if <expr> is a sequence of instructions forming an expression, then V(M(<expr>,s)) is the value of that expression.
Let us consider the semantics of a C "while" statement:
Note that M(<expr>,s) appears three times. This does not imply three evaluations of the expression. Rather M(<expr>,s) is the state of the computer after evaluation of the expression starting in state s. We could just as well have written:
Thus the meaning of this while statement is operationally the same as the following steps: