# Lexers
Lexers are components that break an input sequence into its constituent parts called "tokens" or "lexemes". Rustemo implements context-aware lexing. Given the current parsing context, lexers return the next possible tokens from the input stream. This provides greater recognition strength as the lexing is done just for the tokens expected in a given parser state.
Generally, multiple tokens can match at a given location, in which case we say that we have a lexical ambiguity. Rustemo has a built-in mechanism for lexical disambiguation.
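For illustration, consider this small hypothetical grammar (not from the Rustemo tests):

```
Statement: 'let' Name '=' Number;
terminals
Let: 'let';
Name: /\w+/;
Eq: '=';
Number: /\d+/;
```

In the state reached after `=`, only `Number` is expected, so the lexer tries just that recognizer. A lexical ambiguity would arise in a state where both `Let` and `Name` are expected, since both recognizers match the input `let`; the strategies described in the disambiguation section below resolve such cases.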
To abstract over all possible inputs, Rustemo provides the trait `lexer::Input`, which must be implemented by any type that can be used as an input to the parsing process. The types `str` and `[u8]` implement this trait, so parsing a string or a sequence of bytes is possible out-of-the-box.
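For example, a parser generated from a grammar over string input can be used directly; the parser name below is illustrative:

```rust
// `MyGrammarParser` stands for any parser generated with the default
// string lexer; no lexer setup is needed.
let result = MyGrammarParser::new().parse("some input");
```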
String parsing is facilitated by so-called recognizers, which are string and regex patterns defined for each terminal in the `terminals` section of the grammar. Recognizers are used to configure the default string lexer. For parsing other types, you can provide a custom lexer.
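For instance, a `terminals` section can mix both kinds of recognizers: string recognizers are written in quotes, and regex recognizers between slashes (a hypothetical example):

```
terminals
If: 'if';        // string recognizer
Number: /\d+/;   // regex recognizer
```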
## Custom lexers
To create a custom lexer, implement the trait `rustemo::lexer::Lexer` for your type. The method `next_tokens` should return an iterator over the tokens found at the current position, given the current parsing context. The lexer type should be defined in a file `<grammar_name>_lexer.rs`, where `<grammar_name>` is the base name of the grammar file.
In the tests project, the `custom_lexer` test demonstrates the implementation of a custom lexer. The lexer implemented in this test tokenizes a sequence of varints, the variable-width integers used in Google's Protocol Buffers wire format. Our lexer must determine the next varint in the sequence of bytes. There are multiple ways to approach this; here is one (see the test for additional information).
We start with the following grammar in a file, say `custom_lexer_2.rustemo`:
```
VarInts: VarInt+;
VarInt: MSBByte* NonMSBByte;
terminals
MSBByte:;
NonMSBByte:;
```
We have defined that the input consists of one or more varints, and that each varint contains zero or more `MSBByte` tokens (bytes whose highest bit is set) followed by exactly one `NonMSBByte` (a byte whose highest bit is not set) at the end.
In the file `custom_lexer_2_lexer.rs`, we specify our lexer:
```rust
// Imports from `rustemo` and from the generated parser (State, TokenKind)
// are omitted here; see the full test for the complete file.

/// We are parsing a slice of bytes.
pub type Input = [u8];
pub type Ctx<'i> = LRContext<'i, Input, State, TokenKind>;

pub struct MyCustomLexer2();

impl MyCustomLexer2 {
    pub fn new() -> Self {
        MyCustomLexer2()
    }
}

/// In this custom lexer we are not recognizing a full VarInts but only its
/// constituents: MSBByte (if the highest bit is set) and NonMSBByte (if the
/// highest bit is not set). How these bytes are organized into VarInts is
/// defined by the grammar, and the transformation to a numeric value is done
/// in the actions.
impl<'i> Lexer<'i, Ctx<'i>, State, TokenKind> for MyCustomLexer2 {
    type Input = Input;

    fn next_tokens(
        &self,
        context: &mut Ctx<'i>,
        input: &'i Self::Input,
        _token_kinds: Vec<(TokenKind, bool)>,
    ) -> Box<dyn Iterator<Item = Token<'i, Self::Input, TokenKind>> + 'i> {
        let value;
        let kind: TokenKind;
        if context.position() >= input.len() {
            // At the end of the input, return the STOP token.
            value = &[][..];
            kind = TokenKind::STOP;
        } else {
            // Take a single byte and classify it by its highest bit.
            value = &input[context.position()..=context.position()];
            if value[0] & 0b1000_0000 != 0 {
                kind = TokenKind::MSBByte;
            } else {
                kind = TokenKind::NonMSBByte;
            };
        }

        Box::new(iter::once(Token {
            kind,
            value,
            location: Location {
                start: Position::Position(context.position()),
                end: Some(Position::Position(context.position())),
            },
        }))
    }
}
```
The job of the lexer in this case is to return `STOP` if at the end of the input; otherwise, it returns a token of `MSBByte` kind containing the single byte at the current position if the highest bit is set, or `NonMSBByte` otherwise.
The transformation of a sequence of `(Non)MSBByte` tokens to a varint is done in the semantic actions given in the file `custom_lexer_2_actions.rs`. The relevant (non-generated) part is this:
```rust
/// We are doing a conversion in this action. Other actions are generated.
/// `msbbyte0` is an option containing the first bytes of the VarInt;
/// `non_msbbyte` contains the last byte.
pub type VarInt = i128;

pub fn var_int_c1(
    _ctx: &Context,
    msbbyte0: MSBByte0,
    non_msbbyte: NonMSBByte,
) -> VarInt {
    let mut res: i128 = non_msbbyte as i128;
    if let Some(bytes) = msbbyte0 {
        bytes.iter().rev().for_each(|b| {
            res <<= 7;
            res |= (b & 0b0111_1111) as i128;
        });
    }
    res
}
```
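To see the arithmetic at work, here is the core of this conversion applied by hand to the classic Protocol Buffers example, where the number 150 is encoded as the bytes `[0x96, 0x01]` (an illustration only, not part of the test):

```rust
fn main() {
    // 0x96 has the highest bit set (an MSBByte) and carries the low 7 bits;
    // 0x01 is the final NonMSBByte and carries the most significant bits.
    let msb_bytes = [0x96u8];
    let non_msb_byte = 0x01u8;

    let mut res: i128 = non_msb_byte as i128;
    for b in msb_bytes.iter().rev() {
        res <<= 7;
        res |= (b & 0b0111_1111) as i128;
    }
    assert_eq!(res, 150);
}
```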
Next, we configure Rustemo to generate a parser which uses a custom lexer. We do this either through the API, by calling `lexer_type(LexerType::Custom)` (see the `build.rs` file in the tests project), or using `--lexer-type custom` if the parser is generated by the `rcomp` tool.
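A minimal `build.rs` along these lines might look as follows (a sketch assuming `rustemo-compiler` is declared as a build dependency; see the tests project for the actual file):

```rust
// build.rs
fn main() {
    rustemo_compiler::Settings::new()
        // Generate a parser that expects a user-supplied lexer.
        .lexer_type(rustemo_compiler::LexerType::Custom)
        .process_dir()
        .expect("parser generation failed");
}
```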
Finally, we need to configure the generated parser with our lexer as follows:
```rust
let bytes = std::fs::read(bytes_file).unwrap();
let result = CustomLexer2Parser::new(MyCustomLexer2::new()).parse(&bytes);
```
Notice the instance of the lexer passed into the `new` method during parser construction.
In this example, the lexer doesn't need to know which tokens are expected at the current location. But sometimes we need that information for more powerful context-aware lexing. To get it, you can import the `LEXER_DEFINITION` value from the generated parser, which implements the `LexerDefinition` trait. This trait provides a `recognizers` method that accepts the current LR state (available in the passed-in context) and returns an iterator of token recognizers, which provide information about the tokens expected at the location. For more info, see the full test for custom lexers.
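Inside `next_tokens`, this might be used roughly as follows (a sketch only: the way the current state is read from the context and the exact recognizer API are assumptions here; the full test has the precise code):

```rust
// Iterate over the recognizers for tokens expected in the current LR state.
for recognizer in LEXER_DEFINITION.recognizers(context.state()) {
    // Use the recognizer's information about an expected token to decide
    // what to try to match at the current position.
}
```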
Each lexer receives a vector of the tokens expected at the given location. The vector elements are pairs of `(token_kind, finish flag)` (e.g. `_token_kinds: Vec<(TokenKind, bool)>` above). The finish flag signifies that if the token is found, the lexer should not return any other token, i.e. lexing should terminate on the first successful match whose finish flag is true.
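A custom lexer honoring the flag could be structured like this (a sketch; `try_match` is a hypothetical helper that attempts to recognize a single token kind at the current position):

```rust
// Collect matches for the expected kinds, stopping early when a match
// whose finish flag is set is found.
let mut tokens = Vec::new();
for (kind, finish) in token_kinds {
    if let Some(token) = try_match(kind, context, input) {
        tokens.push(token);
        if finish {
            break;
        }
    }
}
```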
## Lexical disambiguation
When a lexical ambiguity is encountered during the parsing process, Rustemo uses several disambiguation strategies. Some of these strategies can be controlled by configuration parameters. The job of all these strategies is to reduce the set of possible tokens at the given location.

The following table gives an overview of the strategies, their defaults, and whether they can be disabled.
| Strategy      | Default LR | Default GLR | Can be disabled in LR | Can be disabled in GLR | Implemented in |
|---------------|------------|-------------|-----------------------|------------------------|----------------|
| Priorities    | yes        | yes         | no                    | no                     | lexer          |
| Most specific | yes        | yes         | yes                   | yes                    | lexer          |
| Longest match | yes        | yes         | yes                   | yes                    | parser         |
| Gram. order   | yes        | no          | no                    | yes                    | parser         |
Strategies are applied in the following order:
- Priorities - Expected tokens are sorted by priority. A first match in a priority group reduces further matching to that group only. You can specify the priority of a terminal inside curly braces; the default priority is 10. For example:

  ```
  terminals
  SomeTerminal: 'some match' {15};
  ```

- Most specific match - String matches take precedence over regex matches. String matches are ordered by length, and when the first string match succeeds, no further matches are tried. This strategy is implemented in `StringLexer`. An example illustrating this and the longest match strategy is given after this list.

- Longest match - All matches allowed by the previous strategies are found, and the longest one is used. There can still be multiple matches of the same length; further disambiguation is handled by the next strategy.

- Grammar order - Matches are tried in grammar order, and the first match wins. Since LR can't handle ambiguity, grammar order is a final resolution strategy which always resolves to a single token. For GLR, this strategy is not enabled by default, as we usually want to handle lexical ambiguity using the GLR machinery.
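For example, given the following hypothetical `terminals` section:

```
terminals
If: 'if';
Ident: /\w+/;
Int: /\d+/;
Float: /\d+\.\d+/;
```

On input `if`, both `If` and `Ident` match; the most specific strategy picks the string match `If`. On input `3.14`, `Int` matches `3` while `Float` matches `3.14`; the longest match strategy picks `Float`.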