The parglare grammar language¶
The parglare grammar specification language is based on BNF with syntactic sugar extensions which are optional and builds on top of a pure BNF. parglare is based on Context-Free Grammars (CFGs) and a grammar is written declaratively. You don't have to think about the parsing process like in e.g. PEGs. Ambiguities are dealt with explicitly (see the section on conflicts).
Each grammar file consists of three parts: - zero or more imports of other grammar files. See grammar modularization - one or more derivation/production rules - zero or more terminal definitions
Each derivation/production rule is of the form:
<symbol>: <expression> ;
where <symbol>
is grammar non-terminal and <expression>
is a sequence of
terminals and non-terminals separated by choice operator |
.
For example:
Fields: Field | Fields "," Field;
Here Fields
is a non-terminal grammar symbol and it is defined as either a
single Field
or, recursively, as Fields
followed by a string terminal ,
and than by another Field
. It is not given here but Field
could also be
defined as a non-terminal. For example:
Field: QuotedField | FieldContent;
Or it could be defined as a terminal in terminals section:
terminals
Field: /[A-Z]*/;
This terminal definition uses regular expression recognizer.
Terminals¶
Terminal symbols of the grammar define the fundamental or atomic elements of your language -- tokens or lexemes (e.g. keywords, numbers). In parglare a terminal is connected to the recognizer which is an object used to recognize token of a particular type in the input. Most of the time you will do parsing of textual content and you will need textual recognizers. These recognizers are built-in and there are two type of textual recognizers:
- string recognizer
- regular expression recognizer
Terminals are given at the end of the grammar file, after production rules,
following the keyword terminals
.
String recognizer¶
String recognizer is defined as a plain string inside of double quotes:
my_rule: "start" other_rule "end";
In this example "start"
and "end"
will be terminals with string recognizers
that match exactly the words start
and end
.
You can write string recognizing terminal directly in the rule expression or you can define terminal separately and reference it by name, like:
my_rule: start other_rule end;
terminals
start: "start";
end: "end";
Either way it will be the same terminal. You can't mix those two approaches for
a single terminal. If you defined a terminal in the terminals
section than you
can't use inline string matches for that terminal.
You will usually write it as a separate terminal if the terminal is used at
multiple places in the grammar or to provide disambiguation information for a
terminal (priority, prefer
etc.).
Regular expression recognizer¶
Or regex recognizer for short is a regex pattern written inside slashes
(/.../
).
For example:
number: /\d+/;
This rule defines terminal symbol number
which has a regex recognizer and will
recognize one or more digits as a number.
Note
You cannot write regex recognizers inline like you can do with string
recognizers. This constraint is introduced because there is no sane way to
deduce terminal name given its regex. Thus, you must write all regex
recognizers/terminals in the terminals
section at the end of the grammar
file.
Custom recognizers¶
If you are parsing arbitrary input (non-textual) you'll have to provide your own recognizers. In the grammar, you just have to provide terminal symbol without body, i.e. without string or regex recognizer. You will provide missing recognizers during grammar instantiation from Python. Although you don't supply body of the terminal you can define disambiguation rules as usual.
Lets say that we have a list of integers (real list of Python ints, not a text with numbers) and we have some weird requirement to break those numbers according to the following grammar:
Numbers: all_less_than_five ascending all_less_than_five EOF;
all_less_than_five: all_less_than_five int_less_than_five
| int_less_than_five;
terminals
// These terminals have no recognizers defined in the grammar
ascending: ;
int_less_than_five: ;
So, we should first match all numbers less than five and collect those, than we
should match a list of ascending numbers and than list of less than five again.
int_less_than_five
and ascending
are terminals/recognizers that will be
defined in Python and passed to grammar construction. int_less_than_five
will
recognize Python integer that is, well, less than five. ascending
will
recognize a sublist of integers in ascending order.
For more details on the usage see this test.
More on this topic can be found in a separate section.
Usual patterns¶
This section explains how some common grammar patterns can be written using just a plain BNF notation.
One or more¶
// sections rule bellow will match one or more section.
sections: sections section | section;
In this example sections
will match one or more section
. Notice the
recursive definition of the rule. You can read this as sections
is either a
single section or sections
and a section
.
Note
Please note that you could do the same with this rule:
sections: section sections | section;
which will give you similar result but the resulting tree will be different. Notice the recursive reference is now at the and of the first production. Previous example will reduce sections early and than add another section to it, thus the tree will be expanding to the left. The example in this note will collect all the sections and than start reducing from the end, thus building a tree expanding to the right. These are subtle differences that are important when you start writing your semantic actions. Most of the time you don't care about this so use the first version as it is more efficient and parglare provides built-in actions for these common cases.
Zero or more¶
// sections rule bellow will match zero or more section.
sections: sections section | section | EMPTY;
In this example sections
will match zero or more section
. Notice the
addition of the EMPTY
choice at the end. This means that matching nothing is a
valid sections
non-terminal.
Same note from above applies here to.
Optional¶
document: optheader body;
optheader: header | EMPTY;
In this example optheader
is either a header or nothing.
Syntactic sugar - BNF extensions¶
Previous section gives the overview of the basic BNF syntax. If you got to use various BNF extensions (like Kleene star) you might find writing patterns in the previous section awkward. Since some of the patterns are used frequently in the grammars (zero-or-more, one-or-more etc.) parglare provides syntactic sugar for this common idioms using a well known regular expression syntax.
Optional¶
Optional
can be specified using ?
. For example:
S: "2" b? "3"?;
terminals
b: "1";
Here, after 2
we might have terminal b
but it is optional, as well as 3
that follows.
Lets see what the parser will return for various inputs (the grammar
variable
is a string holding grammar from above):
g = Grammar.from_string(grammar)
p = Parser(g)
input_str = '2 1 3'
result = p.parse(input_str)
assert result == ["2", "1", "3"]
input_str = '2 3'
result = p.parse(input_str)
assert result == ["2", None, "3"]
Note
Syntax equivalence for optional
operator:
S: b?;
terminals
b: "1";
is equivalent to:
S: b_opt;
b_opt: b | EMPTY;
terminals
b: "1";
Behind the scenes parglare will create b_opt
rule.
All syntactic sugar additions operate by creating additional rules in the
grammar during table construction.
One or more¶
One or more
match is specified using +
operator. For example:
S: "2" c+;
terminals
c: "c";
After 2
we expect to see one or more c
terminals.
Lets see what the parser will return for various inputs (the grammar
variable
is a string holding grammar from above):
g = Grammar.from_string(grammar)
p = Parser(g)
input_str = '2 c c c'
result = p.parse(input_str)
assert result == ["2", ["c", "c", "c"]]
input_str = '2 c'
result = p.parse(input_str)
assert result == ["2", ["c"]]
So the sub-expression on the second position (c+
sub-rule) will by default
produce a list of matched c
terminals. If c
is missing
a parse error will be raised.
Note
Syntax equivalence for one or more
:
S: a+;
terminals
a: "a";
is equivalent to:
S: a_1;
@collect
a_1: a_1 a | a;
terminals
a: "a";
+
operator allows repetition modifier for separators. For example:
S: "2" c+[comma];
terminals
c: "c";
comma: ",";
c+[comma]
will match one or more c
terminals separated by whatever is
matched by the comma
rule.
Lets see what the parser will return for various inputs (the grammar
variable
is a string holding grammar from above):
g = Grammar.from_string(grammar)
p = Parser(g)
input_str = '2 c, c, c'
result = p.parse(input_str)
assert result == ["2", ["c", "c", "c"]]
input_str = '2 c'
result = p.parse(input_str)
assert result == ["2", ["c"]]
As you can see giving a separator modifier allows us to parse a list of items
separated by the whatever is matched by the rule given inside []
.
Note
Syntax equivalence one or more with separator
:
S: a+[comma];
terminals
a: "a";
comma: ",";
is equivalent to:
S: a_1_comma;
@collect_sep
a_1_comma: a_1_comma comma a | a;
terminals
a: "a";
comma: ",";
Making the name of the separator rule a suffix of the additional rule
name makes sure that only one additional rule will be added to the
grammar for all instances of a+[comma]
, i.e. same base rule with the
same separator.
Zero or more¶
Zero or more
match is specified using *
operator. For example:
S: "2" c*;
terminals
c: "c";
This syntactic addition is similar to +
except that it doesn't require rule to
match at least once. If there is no match, resulting sub-expression will be an
empty list. For example:
g = Grammar.from_string(grammar)
p = Parser(g)
input_str = '2 c c c'
result = p.parse(input_str)
assert result == ["2", ["c", "c", "c"]]
input_str = '2'
result = p.parse(input_str)
assert result == ["2", []]
Note
Syntax equivalence zero of more
:
S: a*;
terminals
a: "a";
is equivalent to:
S: a_0;
a_0: a_1 {nops} | EMPTY;
@collect
a_1: a_1 a | a;
terminals
a: "a";
So using of *
creates both a_0
and a_1
rules. Action attached to a_0
returns a list of matched a
and empty list if no match is found. Please note
the usage of nops
. In case if
prefer_shift
strategy is used using nops
will perform both REDUCE and
SHIFT during GLR parsing in case what follows zero or more might be another
element in the sequence. This is most of the time what you need.
Same as one or more
this operator may use separator modifiers.
Note
Syntax equivalence zero or more with separator
:
S: a*[comma];
terminals
a: "a";
comma: ",";
is equivalent to:
S: a_0_comma;
a_0_comma: a_1_comma {nops} | EMPTY;
@collect_sep
a_1_comma: a_1_comma comma a | a;
terminals
a: "a";
where action is attached to a_0_comma
to provide returning a list of
matched a
and empty list if no match is found.
Special built-in rules¶
There are two special rules your can reference in your grammars: EMPTY
and
EOF
. EMPTY
rule will reduce without consuming any input and will always
succeed, i.e. it is empty recognition. EOF
rule will match only at the end of
the input.
Note
parglare doesn't assume that parse should be complete, i.e. that the whole
input should be consumed. Thus, if you want to ensure that the whole input
is consumed you must match EOF
at the end of your input.
Example:
MyFile: SomeRootRule EOF;
SomeRootRule: ...
Named matches (assignments)¶
In section on actions you can see that semantic action (Python callable) connected to a rule will be called with two parameters: a context and a list of sub-expressions evaluation results. This require you to use positional access in the list of sub-expressions.
Named matches
(a.k.a assignments
) enable giving a name to the sub-expression
directly in the grammar.
For example:
S: first=a second=digit+[comma];
terminals
a: "a";
digit: /\d+/;
In this example root rule matches one a
and then one or more digit separated
by a comma. You can see that the first sub-expression (a
match) is assigned to
first
while the second sub-expression digit+[comma]
is assigned to second
.
first
and second
will now be an additional keyword parameters passed to the
semantic action. The values passed in using these parameters will be the results
of evaluation of the rules referenced by the assignments.
There are two kind of assignments:
- plain assignment (
=
) -- will collect RHS and pass it to the action under the names given by LHS, - bool assignment (
?=
) -- will passTrue
if the match return non-empty result. If the result of RHS is empty the assignment will result inFalse
being passed to the action.
Each rule using named matches result in a dynamically created Python class named
after the rule. These classes are kept in a dictionary grammar.classes
and
used to instantiate Python objects during parsing by an implicitly
set built-in obj
action.
Thus, for rules using named matches, default action is to create object with
attributes whose names are those of LHS of the assignments and values are from
RHS of the assignments (or boolean values for bool
assignments). Each object
is an instance of corresponding dynamically created Python class.
Effectively, using named matches enables automatic creation of a nice AST.
Note
You can, of course, override default action either in the grammar
using @
syntax or using actions
dict given to the parser.
See the next section.
Referencing semantic actions from a grammar¶
By default action with the name same as the rule name will be
searched in the acommpanying <grammar>_actions.py
file or actions
dict. You can override this by specifying action name for
the rule directly in the grammar using @
syntax. In that case a name given
after @
will be used instead of a rule name.
For example:
@myaction
some_rule: first second;
For rule some_rule
action with the name myaction
will be searched in the
<grammar>_actions.py
module, actions
dict or built-in
actions provided by the parglare.actions
module. This is
helpful if you have some common action that can be used for multiple rules in
your grammar. Also this can be used to specify built-in action to be used for a
rule directly in the grammar.
Grammar comments¶
In parglare grammar, comments are available as both line comments and block comments:
// This is a line comment. Everything from the '//' to the end of line is a comment.
/*
This is a block comment.
Everything in between `/*` and '*/' is a comment.
*/
Handling whitespaces and comments in your language¶
By default parser will skip whitespaces. Whitespace skipping is controlled
by ws
parameter to the parser which is by default set to
'\n\t '
.
If you need more control of the layout, i.e. handling of not only whitespaces
but comments also, you can use a special rule LAYOUT
:
LAYOUT: LayoutItem | LAYOUT LayoutItem;
LayoutItem: WS | Comment | EMPTY;
terminals
WS: /\s+/;
Comment: /\/\/.*/;
This will form a separate layout parser that will parse in-between each matched tokens. In this example whitespaces and line-comments will be consumed by the layout parser.
If this special rule is found in the grammar ws
parser parameter is ignored.
Here is another example that gives support for both line comments and block comments like the one used in the grammar language itself:
LAYOUT: LayoutItem | LAYOUT LayoutItem;
LayoutItem: WS | Comment | EMPTY;
Comment: '/*' CorNCs '*/' | LineComment;
CorNCs: CorNC | CorNCs CorNC | EMPTY;
CorNC: Comment | NotComment | WS;
terminals
WS: /\s+/;
LineComment: /\/\/.*/;
NotComment: /((\*[^\/])|[^\s*\/]|\/[^\*])+/;
Note
If LAYOUT
is provided it must match before the first token, between any
two tokens in the input, and after the last token. If layout cannot be
empty, the input cannot start or end with a token. If this is not desired,
make sure to include EMPTY
in the layout as one of its alternatives like
in the previous examples.
Handling keywords in your language¶
By default parser will match given string recognizer even if it is part of some larger word, i.e. it will not require matching on the word boundary. This is not the desired behaviour for language keywords.
For example, lets examine this little grammar:
S: "for" name=ID "=" from=INT "to" to=INT EOF;
terminals
ID: /\w+/;
INT: /\d+/;
This grammar is intended to match statement like this one:
for a=10 to 20
But it will also match:
fora=10 to20
which is not what we wanted.
parglare allows the definition of a special terminal rule KEYWORD
. This rule
must define a regular expression recognizer.
Any string recognizer in the grammar that can be also recognized by the KEYWORD
recognizer is treated as a keyword and is changed during grammar construction to
match only on word boundary.
For example:
S: "for" name=ID "=" from=INT "to" to=INT EOF;
terminals
ID: /\w+/;
INT: /\d+/;
KEYWORD: /\w+/;
Now,
fora=10 to20
will not be recognized as the words for
and to
are recognized to be keywords
(they can be matched by the KEYWORD
rule).
This will be parsed correctly:
for a=10 to 20
As =
is not matched by the KEYWORD
rule and thus doesn't require to be
separated from the surrounding tokens.
Note
parglare uses integrated scanner so this example:
for for=10 to 20
will be correctly parsed. for
in for=10
will be recognized as ID
and
not as a keyword for
, i.e. there is no lexical ambiguity due to tokenizer
separation.