Disambiguation¶
At each step an LR parser has to decide which operation to execute: SHIFT (to consume the next token) or REDUCE (to reduce what it has seen so far to some higher-level concept). See the section on LR parsing and conflicts.
Defining a language by a CFG alone often leads to ambiguous languages. Sometimes this is what we want, i.e. our inputs indeed have multiple interpretations and we want all of them. This is usually true for natural languages. But when we deal with computer languages we want to avoid ambiguity, as there should be only one interpretation for each valid input. We want to define an unambiguous language.
To constrain our grammar and make it define an unambiguous language we use so-called disambiguation filters. These filters are in charge of choosing the right interpretation/tree when there is ambiguity in the grammar.
Even in the simple expression grammar there is ambiguity. For example, the expression `2 + 3 * 4`, in case we know nothing about the priorities of arithmetic operations, can be interpreted in two different ways: `(2 + 3) * 4` and `2 + (3 * 4)`. Without priorities we would be obliged to use parentheses everywhere to specify the right interpretation.
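A quick check in plain Python (independent of parglare) shows that the two readings really do produce different values:

```python
# The two possible readings of the ambiguous expression 2 + 3 * 4.
print((2 + 3) * 4)  # 20
print(2 + (3 * 4))  # 14
```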
Ambiguity is one source of conflicts in LR grammars. The other is limited lookahead. Whatever the source of the conflict, GLRParser can cope with it. In case of ambiguity the parser will return all possible interpretations. In case of limited lookahead the parser will investigate all possible paths and resolve to the correct interpretation further down the input stream.
If our grammar is ambiguous and our language is not, that means we need to constrain our grammar using disambiguation filters to better describe our language. Ideally, we strive for a grammar that describes all valid sentences in our language, with a single interpretation for each of them, and nothing more.
Static disambiguation filters¶
Static disambiguation filters are given in the grammar at the end of a production using the `{}` syntax. There is also a dynamic disambiguation filter, which is the most powerful and is specified as a Python function.
priority¶
Priority is probably the simplest form of disambiguation. It is also the strongest in parglare, as it is checked first. It is given as a numeric value, where the default is 10. When the parser can't decide which operation to use it will favor the one associated with the production with the higher priority.
For example:
E: E "*" E {2};
E: E "+" E {1};
This gives priority 2 to the production `E "*" E` and priority 1 to the production `E "+" E`. When parglare needs to decide, e.g. between shifting `+` or reducing the `*` it saw, it will choose reduce, as the multiplication production has higher priority.
Priority can also be given to terminal productions.
associativity¶
Associativity is used for disambiguation between productions of the same priority. In the grammar fragment above we still have ambiguity for the expression:
2 + 3 + 5
There are two interpretations: `(2 + 3) + 5` and `2 + (3 + 5)`. Of course, with the arithmetic `+` operation the result will be the same, but that's not true for every operation. In any case, the parse trees will be different, so some choice has to be made.
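Subtraction is a handy illustration of an operation where associativity does change the value (again plain Python, not parglare):

```python
# Left-associative vs right-associative reading of 2 - 3 - 5.
print((2 - 3) - 5)  # -6
print(2 - (3 - 5))  # 4
```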
In this situation associativity is used. Both `+` and `*` in arithmetic are left associative (i.e. the operation is evaluated from left to right).
E: E "*" E {2, left};
E: E "+" E {1, left};
Now, the expression above is not ambiguous anymore. It is interpreted as `(2 + 3) + 5`.
The associativity given in the grammar is either `left` or `right`. The default is no associativity, i.e. associativity is not used for the disambiguation decision.
Tip
Alternatively, you can use the keyword `shift` instead of `right` and `reduce` instead of `left`.
nops and nopse¶
These two are not actual filters but markers used to disable `prefer_shifts` (`nops`) and `prefer_shifts_over_empty` (`nopse`), set globally during parser construction, at the level of a single production. Productions using these markers are not influenced by the global parser setting, meaning that table construction will not eliminate possible reductions on these productions. Using these markers makes sense only for GLR parsing, as a deterministic LR parser can't be constructed anyway in case of conflicts.
For example:
Statements: Statements1 {nops}
| EMPTY;
prefer¶
This disambiguation filter is applicable to terminal productions only. It is used to choose the right recognizer/terminal in case of lexical ambiguity.
For example:
INT: /[-+]?[0-9]+\b/ {prefer};
FLOAT: /[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?\b/;
If the grammar allows both recognizers to be tried at some place in the input, both will succeed for input `23`, but we want `INT` to be chosen in this case.
Dynamic disambiguation filter¶
All previously described filters are of a static nature, i.e. they are compiled during LR table calculation (by removing erroneous automaton transitions) and they don't depend on the parsed input.
There are sometimes situations when the parsing decision depends on the input. For example, let's say that we need to parse arithmetic expressions, but the priority of an operation increases the later the operation is introduced in the input.
1 + 2 * 3 - 4 + 5
Should be parsed as:
((1 + (2 * (3 - 4))) + 5)
While
1 - 2 + 3 * 4 + 5
should be parsed as:
(1 - ((2 + (3 * 4)) + 5))
As you can see, operations that appear later in the input have higher priority.
In parglare you can implement this dynamic behavior in two steps:
First, mark productions in your grammar with the `dynamic` rule:
E: E op_sum E {dynamic}
| E op_mul E {dynamic}
| /\d+/;
op_sum: '+' {dynamic};
op_mul: '*' {dynamic};
This tells parglare that those productions are candidates for dynamic ambiguity resolution.
The second step is to register a predicate function during parser construction that will be used for resolution. This function operates as a filter for actions. It receives the parsing context for the action, the LR automaton states between which the transition is about to occur, and the sub-results in the case of a reduction. The function should return `True` if the transition is allowed or `False` otherwise. This function sometimes needs to maintain some kind of state. To initialize its state, at the beginning of parsing it is called with all parameters set to `None`.
parser = Parser(grammar, dynamic_filter=custom_disambiguation_filter)
The resolution function has the following form:
from parglare import SHIFT, REDUCE

def custom_disambiguation_filter(context, from_state, to_state, action,
                                 production, subresults):
    """
    Make the first operation that appears in the input the lowest
    priority. This demonstrates how the priority rule can change
    dynamically depending on the input.
    """
    global operations

    # At the start of parsing this function is called with action set
    # to None to give the strategy a chance to initialize.
    if action is None:
        operations = []
        return

    if action is SHIFT:
        operation = context.token.symbol
    else:
        operation = context.token_ahead.symbol

    actions = from_state.actions[operation]

    if operation not in operations and operation.name != 'STOP':
        operations.append(operation)

    if action is SHIFT:
        shifts = [a for a in actions if a.action is SHIFT]
        if not shifts:
            return False
        reductions = [a for a in actions if a.action is REDUCE]
        if not reductions:
            return True
        red_op = reductions[0].prod.rhs[1]
        return operations.index(operation) > operations.index(red_op)
    elif action is REDUCE:
        # The operation of the current reduction.
        red_op = production.rhs[1]
        # If the operation ahead is STOP or of lower or equal priority
        # -> reduce.
        return ((operation not in operations)
                or (operations.index(operation)
                    <= operations.index(red_op)))
This function is a predicate that will be called for each action on productions marked with `dynamic` (the SHIFT action for dynamic terminal productions and the REDUCE action for dynamic non-terminal productions). You are provided with enough information to make a custom decision on whether to perform or reject the operation.
Parameters are:
- context - the parsing context object.
- from_state/to_state - the `LRState` instances for the transition.
- action - either the SHIFT or REDUCE constant from the `parglare` module.
- production - the production used for the REDUCE operation. Valid only if the action is REDUCE.
- subresults (list) - the sub-results for the reduction. Valid only for REDUCE. The length of this list is equal to `len(production.rhs)`.
For details see test_dynamic_disambiguation_filters.py.
Disambiguation of a GLR Forest¶
For GLR forests there is an additional option for disambiguation. The GLR forest has a `disambiguate` method that accepts a callable with the following signature:
def disam_callable(parent):
It accepts a GLR GSS `Parent` object, which has a list of possibilities; the callable should remove all invalid possibilities from the list. The callable is called bottom-up for all `Parent` objects which have more than one possibility. For a full example with comments see this test.
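As a sketch, a disambiguation callable could simply prune the list of possibilities in place. The `possibilities` attribute name is an assumption based on the description above, and a dummy object stands in for a real GSS `Parent`:

```python
from types import SimpleNamespace

def keep_first(parent):
    # A hypothetical filter: keep only the first possibility.
    # A real filter would inspect each possibility and delete
    # only the invalid ones.
    del parent.possibilities[1:]

# A stand-in for a GLR GSS Parent node with two possibilities.
parent = SimpleNamespace(possibilities=['tree_a', 'tree_b'])
keep_first(parent)
print(parent.possibilities)  # ['tree_a']
```

With a real forest this would be registered as `forest.disambiguate(keep_first)`.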
Lexical ambiguities¶
There is another source of ambiguities.
Parglare uses an integrated scanner, thus tokens are determined on the fly. This gives greater lexical disambiguation power, but lexical ambiguities might nevertheless arise. A lexical ambiguity is a situation where, at some place in the input, more than one recognizer matches successfully.
For example, say that in the input we have `3.4` and we expect at this place either an integer or a float. Both of these recognizers can match the input: the integer recognizer would match `3` while the float recognizer would match `3.4`. Which should we use?
parglare has a lexical disambiguation strategy that uses priorities first. If this fails (i.e. all terminals have the same priority) it continues with the implicit disambiguation strategy, as follows:
- String recognizers are preferred over regexes (i.e. the most specific match).
- If we still have multiple matches, use the longest-match strategy.
- If multiple recognizers still match, use the `prefer` rule if given.
- If all else fails, raise an exception. In case of GLR, the ambiguity will be handled by parser forking, i.e. you will end up with all solutions/trees.
Thus, in terminal definition rules we can use priorities to favor some of the recognizers, or we can use `prefer` to favor a recognizer when there are multiple matches of the same length.
For example:
number: /\d+/ {15};
or:
number: /\d+/ {prefer};
Note
Implicit lexical disambiguation is controlled by the `lexical_disambiguation` parameter passed to the `Parser`/`GLRParser` constructor. By default, `Parser` uses implicit disambiguation while `GLRParser` doesn't.
In addition, you can also specify that the terminal takes part in dynamic disambiguation:
number: /\d+/ {dynamic};
Custom token recognition and lexical disambiguation¶
The previous section explained the built-in parglare lexical disambiguation strategy. There are use cases when this strategy is not sufficient, for example if we want to do a fuzzy match of tokens and choose the most similar token at the position.
parglare solves this problem by letting you implement custom token recognition: you register a callable during parser instantiation that will, during parsing, get all the symbols expected at the current location and return a list of tokens (instances of the `Token` class), or `None`/an empty list if no symbol is found at the location.
This callable is registered during parser instantiation as the `custom_token_recognition` parameter.
parser = Parser(
    grammar, custom_token_recognition=custom_token_recognition)
The callable accepts:
- context - the parsing context object.
- get_tokens - a callable used to get the tokens recognized using the default strategy. It is called without parameters. Custom disambiguation might decide to return this list if no change is necessary, reduce the list, or extend it with new tokens. See the example below for how to return a list with a token only if the default recognition doesn't succeed.

It returns a list of `Token` instances, or `None`/an empty list if no token is found.
To instantiate a `Token`, pass in the symbol and the value of the token. The value of the token is usually a sub-string of the input string.
In the following test, the `Bar` and `Baz` terminals are fuzzy matched. The terminal with the higher score wins, but only if the score is above 0.7.
import difflib
import re

import pytest
from parglare import Grammar, ParseError, Parser, Token

grammar = r"""
S: Element+;
Element: Bar | Baz | Number;
terminals
Bar: /Bar. \d+/;
Baz: /Baz. \d+/;
Number: /\d+/;
"""
g = Grammar.from_string(grammar)

def custom_token_recognition(context, get_tokens):
    """
    Custom token recognition should return a single token that is
    recognized at the given place in the input string.
    """
    # Call default token recognition.
    tokens = get_tokens()
    if tokens:
        # If default recognition succeeds use the result.
        return tokens
    else:
        # If no tokens are found do the fuzzy match.
        matchers = [
            lambda x: difflib.SequenceMatcher(None, 'bar.', x.lower()),
            lambda x: difflib.SequenceMatcher(None, 'baz.', x.lower()),
        ]
        symbols = [
            g.get_terminal('Bar'),
            g.get_terminal('Baz'),
        ]
        # Try to do a fuzzy match at the position.
        elem = context.input_str[context.position:context.position + 4]
        elem_num = context.input_str[context.position:]
        number_matcher = re.compile(r'[^\d]*(\d+)')
        number_match = number_matcher.match(elem_num)
        ratios = []
        for matcher in matchers:
            ratios.append(matcher(elem).ratio())
        max_ratio_index = ratios.index(max(ratios))
        if ratios[max_ratio_index] > 0.7 and number_match.group(1):
            return [Token(symbols[max_ratio_index], number_match.group())]

parser = Parser(
    g, custom_token_recognition=custom_token_recognition)

# Bar and Baz will be recognized by a fuzzy match.
result = parser.parse('bar. 56 Baz 12')
assert result == ['bar. 56', 'Baz 12']

result = parser.parse('Buz. 34 bar 56')
assert result == ['Buz. 34', 'bar 56']

result = parser.parse('Ba. 34 baz 56')
assert result == ['Ba. 34', 'baz 56']

# But if Bar/Baz are too different from the correct pattern
# we get ParseError. In this case `bza` scores below 0.7
# for both the Bar and Baz symbols.
with pytest.raises(ParseError):
    parser.parse('Bar. 34 bza 56')
Note
`custom_token_recognition` can be used to implement custom lexical disambiguation by calling `get_tokens` and then reducing the returned list to a list with a single result.
Tip
See the end of the parse trees section for a tip on how to investigate ambiguities in GLR parsing.