Recognizers¶
Parglare uses scannerless parsing. Actually, scanner is integrated in the parser. Each token is created/recognized in the input during parsing using so called recognizer which is connected to the grammar terminal symbol.
This gives us greater flexibility.
First, recognizing tokens during parsing eliminate lexical ambiguities that arise in separate scanning due to the lack of parsing context.
Second, having a separate recognizers for grammar terminal symbols allows us to parse not only text but a stream of anything as parsing is nothing more but constructing a tree (or some other form) out of a flat list of objects. Those objects are characters if text is parsed, but don't have to be.
Parglare has two built-in recognizers for textual parsing that can be specified in the grammar directly. Those are usually enough if text is parsed, but if a non-textual content is parsed you will have to supply your own recognizers that are able to recognize tokens in the input stream of objects.
Recognizers are Python callables of the following form:
def some_recognizer(context, input, pos):
...
...
return part of input starting at pos
where context
is the parsing context object
and is optional (e.g. you don't have to accept it in your recognizers), input
is the input string or list of objects and position
is the position in the
input where match should be performed. The recognizer should return the part of
the input that is recognized or None
if it doesn't recognize anything at the
current position. For example, if we have an input stream of objects that are
comparable (e.g. numbers) and we want to recognize ascending elements starting
at the given position but such that the recognized token must have at least two
object from the input, we could write the following:
def ascending_nosingle(input, pos):
"Match sublist of ascending elements. Matches at least two."
last = pos + 1
while last < len(input) and input[last] > input[last-1]:
last += 1
if last - pos >= 2:
return input[pos:last]
We register our recognizers during grammar construction. All terminal rules in the grammar that don't define string or regex match (i.e. they have empty bodies) must be augmented with custom recognizers for the parser to be complete.
In order to do that, create a Python dict where the key will be a rule name used
in the grammar references and the value will be recognizer callable. Pass the
dictionary as a recognizers
parameter to the parser.
recognizers = {
'ascending': ascending_nosingle
}
grammar = Grammar.from_file('mygrammar.pg', recognizers=recognizers)
In the file mygrammar.pg
you have to provide a terminal rule with empty body:
ascending: ;
Tip
You can also define recognizers in a separate Python file that accompanies your grammar file. In that case, recognizers will be automatically registered on the parser. For more information see grammar file recognizers.
There is a need sometimes to pass additional data from recognizers to appropriate actions. If you need this you can return additional information after the matched part of the input. For example:
def a_rec(input, pos):
m = re.compile(r'(\d+)')
result = m.match(input[pos:])
return result.group(), result
This recognizer returns both the string it matched and the resulting regex match so that action can use the match object to extract more information without repeating the match:
def a_act(context, value, match):
"""
Action will get the additional returned information from the a_rec
recognizer. In this case a regex match object.
"""
# Do something with `value` and `match` and create the result
You can return more than one additional element. Essentially if a tuple is returned by the recognizer, the first element has to be matched input while the rest is additional data.
If you are building parse tree, additional
information returned by recognizers is kept on NodeTerm
(and Token
) as
additional_data
attribute which is a list of all additional info returned by
the recognizer.
Tip
If you want more information you can investigate recognizer tests.