Some important changes in the upcoming releases of Arpeggio/textX parsers • Random bits and pieces

In PEG there is an ordered choice expression which operates by trying each alternative from left to right, takes the first that succeeds and proceeds never returning to the following alternatives in case the one chosen proves as incapable of finishing the parsing process.

Ordered choice might look deceptively similar to regular expressions alternative match but there is an important difference.

Let’s say that we are trying to match a | ab and the input is ab. Regular expression engine will try all alternatives and use the one that matches the longest part of the input, thus the second alternative (ab) succeeds. But, PEG will try alternatives from left to right and choose the first which succeeds, in this case a, leaving b in the input unconsumed. A consequence of this is that in PEG ab will never match! It is thus recommended to always order choice alternatives from the most specific to the most generic (e.g. ab | a in this case).

Another thing to consider are expressions that always succeed (let’s call them infallible) and expressions that may match empty (let’s call them nullable).

Infallible expressions are: Optional, ZeroOrMore, RegexMatch that may match empty, a Sequence of infallibles and OrderedChoice where at last one of the choices is infallible.

All infallible expressions must be nullable but not all nullable expressions are infallible. For example, lookahead expressions (also called syntactic predicates) And and Not never consume the input (thus are nullable) but may fail, thus are not infallible.

Currently, Arpeggio (and thus textX) tries to avoid some pitfalls and deviates slightly from PEG by using a sort of “soft failure”. E.g. consider this ordered choice:

a? | b

I’m using PEG/regex notation in examples for the sake of brevity.

And let’s say the input is b. Current version of Arpeggio tries to match a and then considers the failure to consume the input as a “soft failure” and continue to match b which succeeds.

The problem with this approach is that it clearly violates the PEG approach, thus defining different language. What is worse, it may misleadingly look to work as regular expressions which is far from the truth (as demonstrated above with a | ab) and may confuse both users with PEG as well as regex mental model.

In the upcoming 3.x version of Arpeggio I’ve changed the way matching is done to adhere to PEG rules. Thus, each expression either succeeds or not regardless of whether it consumed any input. This makes reasoning much easier but brings subtle differences that may break previous grammars. So be aware of these breaking changes!

First of all, when you have infallible expression as an alternative of an ordered choice (like in the example above), it will always succeed which would make further alternatives unreachable.

So, a? | b in PEG is always wrong as this expression will never match b because a? would succeed by matching empty and the parser would carry on with the next rule that follows the ordered choice. What you should write is (a | b)?.

Thus, the rule of the thumb is to never use infallible rules in choice alternatives.

The second, more serious problem, is using nullable expressions inside repetitions (Zero/OneOrMore). For example:

(a?)*

In the current version of Arpeggio, ZeroOrMore also uses “soft failure” and will terminate matching when there is no input consumed, which kinda shouldn’t introduce any problem. Indeed, I can’t think of any problem with this except that this grammar is not valid from the PEG standpoint as in PEG it introduces an infinite loop where ZeroOrMore repeatedly run nullable match which always succeeds while consuming nothing which leads the parser to the same state and thus produce the infinite loop. Also, using “soft failures” selectively complicates the design of Arpeggio and reasoning about grammars. It is so much easier to reason about the grammar when expressions either succeed or fail regardless of their surroundings and whether they consume the input or not.

Why I didn’t say using infallible expression inside repetitions then? Well, you can still make infinite loop by fallible expressions. Consider this:

(!a)*

This uses negative lookahead to assert that the next part of the input is not a. And since lookaheads don’t consume input, if this is true the parser will loop forever even though the negative lookahead is not infallible.

Luckily, all these invalid combinations are easy to catch (I think) so, before the next release, Arpeggio will have checks in place that will guide the users through fixings of the eventual problems in existing grammars that worked previously.

Thanks to Stanislav Pankevich who triggered interesting discussions which lead to these changes while working on the better error reporting in Arpeggio/textX.