Disambiguation¶
At each step LR parser has to decide which operation to execute: SHIFT (to consume the next token) or REDUCE (to reduce what it saw previously to some higher level concept). See section on LR parsing and conflicts.
Defining language by CFG alone often leads to ambiguous languages. Sometimes this is what we want, i.e. our inputs indeed have multiple interpretation and we want all of them. This is usually true for natural languages. But when we deal with computer languages we want to avoid ambiguity as there should be only one interpretation for each valid input. We want to define unambiguous language.
To constrain our grammar and make it define unambiguous language we use so called disambiguation filters. These filters are in charge of choosing the right interpretation/tree when there is ambiguity in the our grammar.
Even in the simple expression grammar there is ambiguity. For example,
2 + 3 * 4
expression — in case we know nothing about priorities of arithmetic
operations — can be interpreted in two different ways: (2 + 3) * 4
and
2 + (3 * 4)
. Without priorities we would be obliged to use parentheses
everywhere to specify the right interpretation.
Ambiguity is a one source of conflicts in the LR grammars. The other is limited
lookahead. Whatever is the source of conflicts GLRParser
can cope with it. In
case of ambiguity the parser will return all interpretations possible. In case
of a limited lookahead the parser will investigate all possible paths and
resolve to the correct interpretation further down the input stream.
If our grammar is ambiguous and our language is not that means that we need to constrain our grammar using disambiguation filters to better describe our language. Ideally, we strive for a grammar that describe all valid sentences in our language with a single interpretation for each of them and nothing more.
Static disambiguation filters¶
Static disambiguation filters are given in the grammar at the end of the
production using {}
syntax. There is also
a dynamic disambiguation filter that is most
powerful and is specified as a Python function.
priority¶
Priority is probably the simplest form of disambiguation. It is also the strongest in parglare as it's first checked. It is given as a numeric value where the default is 10. When the parser can't decide what operation to use it will favor the one associated with the production with a higher priority.
For example:
E: E "*" E {2};
E: E "+" E {1};
This gives priority of 2
to the production E "*" E
and 1
to the production
E "+" E
. When parglare needs to decide, e.g. between shifting +
or reducing
*
it saw, it will choose reduce as the multiplication production has higher
priority.
Priority can also be given to terminal productions.
associativity¶
Associativity is used for disambiguation between productions of the same priority. In the grammar fragment above we still have ambiguity for expression:
2 + 3 + 5
There are two interpretations (2 + 3) + 5
and 2 + (3 + 5)
. Of course, with
arithmetic +
operation the result will be the same but that's not true for
each operation. Anyway, parse trees will be different so some choice has to be
made.
In this situation associativity is used. Both +
and *
in arithmetic are
left associative (i.e. the operation is evaluated from left to right).
E: E "*" E {2, left};
E: E "+" E {1, left};
Now, the expression above is not ambiguous anymore. It is interpreted as (2 +
3) + 5
.
The associativity given in the grammar is either left
or right
. Default is
no associativity, i.e. associativity is not used for disambiguation decision.
Note
Alternatively, you can use keyword shift
instead of right
and reduce
instead of left
.
nops
and nopse
¶
These two are not actual filters but markers used to
disable prefer_shifts
(nops
)
and prefer_shifts_over_empty
(nopse
)
set globally during parser construction on a production level. Productions using
these markers are not influenced by global parser setting meaning that table
construction will not eliminate possible reductions on these productions. Using
these markers have sense only for GLR parsing as the LR deterministic parser
can't be constructed anyway in case of conflicts.
For example:
Statements: Statements1 {nops}
| EMPTY;
prefer¶
This disambiguation filter is applicable to terminal productions only. It will be used to choose the right recognizer/terminal in case of lexical ambiguity.
For example:
INT = /[-+]?[0-9]+\b/ {prefer};
FLOAT = /[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?\b/;
If in the grammar we have a possibility that both recognizers are tried, both
will succeed for input 23
, but we want INT
to be choosen in this case.
Dynamic disambiguation filter¶
All previously described filters are of static nature, i.e. they are compiled during LR table calculation (by removing erroneous automata transitions) and they don't depend on the parsed input.
There are sometimes situations when parsing decision depends on the input.
For example, lets say that we need to parse arithmetic expression but our operation priority increase for operations that are introduced later in the input.
1 + 2 * 3 - 4 + 5
Should be parsed as:
((1 + (2 * (3 - 4))) + 5)
While
1 - 2 + 3 * 4 + 5
should be parsed as:
(1 - ((2 + (3 * 4)) + 5))
As you can see, operations that appears later in the input are of higher priority.
In parglare you can implement this dynamic behavior in two steps:
First, mark productions in your grammar by dynamic
rule:
E: E op_sum E {dynamic}
| E op_mul E {dynamic}
| /\d+/;
op_sum: '+' {dynamic};
op_mul: '*' {dynamic};
This tells parglare that those production are candidates for dynamic ambiguity resolution.
Second step is to register a function, during parser construction, that will be
used for resolution. This function operates as a filter for actions in a given
state and lookahead token. It receives the current action, a token ahead,
production (for REDUCE action), sub-results (for REDUCE action), and current LR
automata state and returns either True
if the action is acceptable or False
otherwise. This function sometimes need to maintain some kind of state. To
initialize its state at the beginning it is called with None
as parameters.
parser = Parser(grammar, dynamic_filter=custom_disambiguation_filter)
Where resolution function is of the following form:
def custom_disambiguation_filter(action, token, production, subresults, state,
context):
"""Make first operation that appears in the input as lower priority.
This demonstrates how priority rule can change dynamically depending
on the input.
"""
global operations
# At the start of parsing this function is called with actions set to
# None to give a chance for the strategy to initialize.
if action is None:
operations = []
return
actions = state.actions[token.symbol]
# Lookahead operation
shift_op = token.symbol
if action is SHIFT:
if shift_op not in operations:
operations.append(shift_op)
if len(actions) == 1:
return True
red_op = [a for a in actions if a.action is REDUCE][0].prod.rhs[1]
return operations.index(shift_op) > operations.index(red_op)
elif action is REDUCE:
# Current reduction operation
red_op = production.rhs[1]
if red_op not in operations:
operations.append(red_op)
if len(actions) == 1:
return True
# If lookahead operation is not processed yet is is of higer priority
# so do not reduce.
# If lookahead is in operation and its index is higher do not reduce.
return (shift_op in operations
and (operations.index(shift_op)
<= operations.index(red_op)))
This function is a predicate that will be called for each action for productions
marked with dynamic
(SHIFT action for dynamic terminal production and REDUCE
action for dynamic non-terminal productions). You are provided with enough
information to make a custom decision whether to perform or reject the
operation.
Parameters are:
-
action - either SHIFT or REDUCE constant from
parglare
module, -
token (Token) - a lookahead token,
-
production (Production) - a production to be reduced. Valid only for REDUCE.
-
subresults (list) - a sub-results for the reduction. Valid only for REDUCE. The length of this list is equal to
len(production.rhs)
. -
state (LRState) - current LR parser state.
For details see test_dynamic_disambiguation_filters.py.
Lexical ambiguities¶
There is another source of ambiguities.
Parglare uses integrated scanner, thus tokens are determined on the fly. This gives greater lexical disambiguation power but lexical ambiguities might arise nevertheless. Lexical ambiguity is a situation when at some place in the input more than one recognizer match successfully.
For example, if in the input we have 3.4
and we expect at this place either an
integer or a float. Both of these recognizer can match the input. The integer
recognizer would match 3
while the float recognizer would match 3.4
. What
should we use?
parglare has implicit lexical disambiguation strategy that will:
- Use priorities first.
- String recognizers are preferred over regexes (i.e. the most specific match).
- If priorities are the same and we have no string recognizers use longest-match strategy.
- If more recognizers still match use
prefer
rule if given. - If all else fails raise an exception. In case of GLR, ambiguity will be handled by parser forking, i.e. you will end up with all solutions/trees.
Thus, in terminal definition rules we can use priorities to favor some of the
recognizers, or we can use prefer
to favor recognizer if there are multiple
matches of the same length.
For example:
number = /\d+/ {15};
or:
number = /\d+/ {prefer};
In addition, you can also specify terminal to take a part in dynamic disambiguation:
number = /\d+/ {dynamic};
Custom lexical disambiguation¶
In the previous section is explained built-in parglare lexical disambiguation strategy. There are use-cases when this strategy is not sufficient. For example, if we want to do fuzzy match of tokens and choose the most similar token at the position.
parglare solves this problem by enabling you to register a callable during
parser instantiation that will, during parsing, get all the symbols expected at
the current location and return a list of tokens (instances
of Token
class) or None
or empty list if no symbol is
found at the location.
This callable is registered during parser instantiation as the parameter
custom_lexical_disambiguation
.
parser = Parser(
grammar, custom_lexical_disambiguation=custom_lexical_disambiguation)
The callable accepts:
- symbols - a list of terminals expected at the current position,
- input_str - input string,
- position - current position in the input string.
- get_tokens - a callable used to get the tokens recognized using the default strategy. Called without parameters. Custom disambiguation might decide to return this list if no change is necessary, reduce the list, or extend it with new tokens. See the example bellow how to return list with a token only if the default recognition doesn't succeed.
Returns: a list of Token
class instances or None
/empty list if no token
is found.
To instantiate Token
pass in the symbol and the value of the token. Value of
the token is usually a sub-string of the input string.
In the following test Bar
and Baz
non-terminals are fuzzy matched. The
non-terminal with the higher score wins but only if the score is above 0.7.
grammar = """
S: Element+ EOF;
Element: Bar | Baz | Number;
Bar: /Bar. \d+/;
Baz: /Baz. \d+/;
Number: /\d+/;
"""
g = Grammar.from_string(grammar)
grammar = [g]
def custom_lexical_disambiguation(symbols, input_str, position, get_tokens):
"""
Lexical disambiguation should return a single token that is
recognized at the given place in the input string.
"""
# Call default token recognition.
tokens = get_tokens()
if tokens:
# If default recognition succeeds use the result.
return tokens
else:
# If no tokens are found do the fuzzy match.
matchers = [
lambda x: difflib.SequenceMatcher(None, 'bar.', x.lower()),
lambda x: difflib.SequenceMatcher(None, 'baz.', x.lower())
]
symbols = [
grammar[0].get_terminal('Bar'),
grammar[0].get_terminal('Baz'),
]
# Try to do fuzzy match at the position
elem = input_str[position:position+4]
elem_num = input_str[position:]
number_matcher = re.compile('[^\d]*(\d+)')
number_match = number_matcher.match(elem_num)
ratios = []
for matcher in matchers:
ratios.append(matcher(elem).ratio())
max_ratio_index = ratios.index(max(ratios))
if ratios[max_ratio_index] > 0.7 and number_match.group(1):
return [Token(symbols[max_ratio_index], number_match.group())]
parser = Parser(
g, custom_lexical_disambiguation=custom_lexical_disambiguation)
# Bar and Baz will be recognized by a fuzzy match
result = parser.parse('bar. 56 Baz 12')
assert result == [['bar. 56', 'Baz 12'], None]
result = parser.parse('Buz. 34 bar 56')
assert result == [['Buz. 34', 'bar 56'], None]
result = parser.parse('Ba. 34 baz 56')
assert result == [['Ba. 34', 'baz 56'], None]
# But if Bar/Baz are too different from the correct pattern
# we get ParseError. In this case `bza` score is bellow 0.7
# for both Bar and Baz symbols.
with pytest.raises(ParseError):
parser.parse('Bar. 34 bza 56')