# Lexers
Lexers are components that break an input sequence into its constituent parts called "tokens" or "lexemes". Rustemo implements context-aware lexing. Given the current parsing context, lexers return the next possible tokens from the input stream. This provides greater recognition strength as the lexing is done just for the tokens expected in a given parser state.
Generally, multiple tokens can match at a given location, in which case we say that we have a lexical ambiguity. Rustemo has a built-in mechanism for lexical disambiguation.
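For illustration, consider this small hypothetical grammar (not from the Rustemo tests):

```
Statement: 'let' Name '=' Number;
terminals
Let: 'let';
Name: /\w+/;
Eq: '=';
Number: /\d+/;
```

In the state reached after `=`, only `Number` is expected, so the lexer tries just that recognizer. A lexical ambiguity would arise in a state where both `Let` and `Name` are expected, since both recognizers match the input `let`; the strategies described in the disambiguation section below resolve such cases.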
To abstract over all possible inputs, Rustemo provides the trait `lexer::Input`, which must be implemented by any type that can be used as an input to the parsing process. The types `str` and `[u8]` implement this trait, so parsing a string or a sequence of bytes is possible out-of-the-box.
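For example, a parser generated from a grammar over string input can be used directly; the parser name below is illustrative:

```rust
// `MyGrammarParser` stands for any parser generated with the default
// string lexer; no lexer setup is needed.
let result = MyGrammarParser::new().parse("some input");
```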
String parsing is facilitated by so-called recognizers, which are string and regex patterns defined for each terminal in the `terminals` section of the grammar. Recognizers are used to configure the default string lexer. For parsing other types, you can provide a custom lexer.
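For instance, a `terminals` section can mix both kinds of recognizers: string recognizers are written in quotes, and regex recognizers between slashes (a hypothetical example):

```
terminals
If: 'if';        // string recognizer
Number: /\d+/;   // regex recognizer
```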
## Custom lexers
To create a custom lexer, implement the trait `rustemo::lexer::Lexer` for your type. The method `next_tokens` should return an iterator over the tokens found at the current position, given the current parsing context. The lexer type should be defined in a file `<grammar_name>_lexer.rs`, where `<grammar_name>` is the base name of the grammar file.
In the tests project, the `custom_lexer` test demonstrates the implementation of a custom lexer. The lexer implemented in this test tokenizes a sequence of varints, the variable-width integers used in Google's Protocol Buffers wire format. Our lexer must determine the next varint in the sequence of bytes. There are multiple ways to approach this; here is one (see the test for additional information).
We start with the following grammar in a file, say `custom_lexer_2.rustemo`:
```
VarInts: VarInt+;
VarInt: MSBByte* NonMSBByte;
terminals
MSBByte:;
NonMSBByte:;
```
We have defined that the input consists of one or more varints, and that each varint contains zero or more `MSBByte` tokens (bytes whose highest bit is set) followed by exactly one `NonMSBByte` (a byte whose highest bit is not set) at the end.
In the file `custom_lexer_2_lexer.rs`, we specify our lexer:
```rust
// Imports from `rustemo` and from the generated parser (State, TokenKind)
// are omitted here; see the full test for the complete file.

/// We are parsing a slice of bytes.
pub type Input = [u8];
pub type Ctx<'i> = LRContext<'i, Input, State, TokenKind>;

pub struct MyCustomLexer2();

impl MyCustomLexer2 {
    pub fn new() -> Self {
        MyCustomLexer2()
    }
}

/// In this custom lexer we are not recognizing a full VarInts but only its
/// constituents: MSBByte (if the highest bit is set) and NonMSBByte (if the
/// highest bit is not set). How these bytes are organized into VarInts is
/// defined by the grammar, and the transformation to a numeric value is done
/// in the actions.
impl<'i> Lexer<'i, Ctx<'i>, State, TokenKind> for MyCustomLexer2 {
    type Input = Input;

    fn next_tokens(
        &self,
        context: &mut Ctx<'i>,
        input: &'i Self::Input,
        _token_kinds: Vec<(TokenKind, bool)>,
    ) -> Box<dyn Iterator<Item = Token<'i, Self::Input, TokenKind>> + 'i> {
        let value;
        let kind: TokenKind;
        if context.position() >= input.len() {
            // At the end of the input, return the STOP token.
            value = &[][..];
            kind = TokenKind::STOP;
        } else {
            // Take a single byte and classify it by its highest bit.
            value = &input[context.position()..=context.position()];
            if value[0] & 0b1000_0000 != 0 {
                kind = TokenKind::MSBByte;
            } else {
                kind = TokenKind::NonMSBByte;
            };
        }

        Box::new(iter::once(Token {
            kind,
            value,
            location: Location {
                start: Position::Position(context.position()),
                end: Some(Position::Position(context.position())),
            },
        }))
    }
}
```
The job of the lexer in this case is to return `STOP` if at the end of the input; otherwise, it returns a token of `MSBByte` kind containing the single byte at the current position if the highest bit is set, or `NonMSBByte` otherwise.
The transformation of a sequence of `(Non)MSBByte` tokens to a varint is done in the semantic actions given in the file `custom_lexer_2_actions.rs`. The relevant (non-generated) part is this:
```rust
/// We are doing a conversion in this action. Other actions are generated.
/// `msbbyte0` is an option containing the first bytes of the VarInt;
/// `non_msbbyte` contains the last byte.
pub type VarInt = i128;

pub fn var_int_c1(
    _ctx: &Context,
    msbbyte0: MSBByte0,
    non_msbbyte: NonMSBByte,
) -> VarInt {
    let mut res: i128 = non_msbbyte as i128;
    if let Some(bytes) = msbbyte0 {
        bytes.iter().rev().for_each(|b| {
            res <<= 7;
            res |= (b & 0b0111_1111) as i128;
        });
    }
    res
}
```
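To see the arithmetic at work, here is the core of this conversion applied by hand to the classic Protocol Buffers example, where the number 150 is encoded as the bytes `[0x96, 0x01]` (an illustration only, not part of the test):

```rust
fn main() {
    // 0x96 has the highest bit set (an MSBByte) and carries the low 7 bits;
    // 0x01 is the final NonMSBByte and carries the most significant bits.
    let msb_bytes = [0x96u8];
    let non_msb_byte = 0x01u8;

    let mut res: i128 = non_msb_byte as i128;
    for b in msb_bytes.iter().rev() {
        res <<= 7;
        res |= (b & 0b0111_1111) as i128;
    }
    assert_eq!(res, 150);
}
```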
Next, we configure Rustemo to generate a parser which uses a custom lexer. We do this either through the API, by calling `lexer_type(LexerType::Custom)` (see the `build.rs` file in the tests project), or using `--lexer-type custom` if the parser is generated by the `rcomp` tool.
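A minimal `build.rs` along these lines might look as follows (a sketch assuming `rustemo-compiler` is declared as a build dependency; see the tests project for the actual file):

```rust
// build.rs
fn main() {
    rustemo_compiler::Settings::new()
        // Generate a parser that expects a user-supplied lexer.
        .lexer_type(rustemo_compiler::LexerType::Custom)
        .process_dir()
        .expect("parser generation failed");
}
```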
Finally, we need to configure the generated parser with our lexer as follows:
```rust
let bytes = std::fs::read(bytes_file).unwrap();
let result = CustomLexer2Parser::new(MyCustomLexer2::new()).parse(&bytes);
```
Notice the instance of the lexer passed into the `new` method during parser construction.
In this example, the lexer doesn't need to know which tokens are expected at the current location. But sometimes we need that information for more powerful context-aware lexing. To get it, you can import the `LEXER_DEFINITION` value from the generated parser, which implements the `LexerDefinition` trait. This trait provides a `recognizers` method that accepts the current LR state (available in the passed-in context) and returns an iterator of token recognizers, which provide information about the tokens expected at the location. For more info, see the full test for custom lexers.
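Inside `next_tokens`, this might be used roughly as follows (a sketch only: the way the current state is read from the context and the exact recognizer API are assumptions here; the full test has the precise code):

```rust
// Iterate over the recognizers for tokens expected in the current LR state.
for recognizer in LEXER_DEFINITION.recognizers(context.state()) {
    // Use the recognizer's information about an expected token to decide
    // what to try to match at the current position.
}
```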
Each lexer receives a vector of the tokens expected at the given location. The vector elements are pairs of `(token_kind, finish flag)` (e.g. `_token_kinds: Vec<(TokenKind, bool)>` above). The finish flag signifies that if the token is found, the lexer should not return any other token, i.e. lexing should terminate on the first successful match whose finish flag is true.
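A custom lexer honoring the flag could be structured like this (a sketch; `try_match` is a hypothetical helper that attempts to recognize a single token kind at the current position):

```rust
// Collect matches for the expected kinds, stopping early when a match
// whose finish flag is set is found.
let mut tokens = Vec::new();
for (kind, finish) in token_kinds {
    if let Some(token) = try_match(kind, context, input) {
        tokens.push(token);
        if finish {
            break;
        }
    }
}
```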
## Lexical disambiguation
When a lexical ambiguity is encountered during the parsing process, Rustemo uses several disambiguation strategies. Some of these strategies can be controlled by configuration parameters. The job of all these strategies is to reduce the set of possible tokens at the given location.

The following table gives an overview of the strategies, their defaults, and whether they can be disabled.
| Strategy      | Default LR | Default GLR | Can be disabled in LR | Can be disabled in GLR | Implemented in |
|---------------|------------|-------------|-----------------------|------------------------|----------------|
| Priorities    | yes        | yes         | no                    | no                     | lexer          |
| Most specific | yes        | yes         | yes                   | yes                    | lexer          |
| Longest match | yes        | yes         | yes                   | yes                    | parser         |
| Gram. order   | yes        | no          | no                    | yes                    | parser         |
Strategies are applied in the following order:
- Priorities - Expected tokens are sorted by priority. A first match in a priority group reduces further matching to that group only. You can specify the priority of a terminal inside curly braces; the default priority is 10. For example:

  ```
  terminals
  SomeTerminal: 'some match' {15};
  ```

- Most specific match - String matches take precedence over regex matches. String matches are ordered by length, and when the first string match succeeds, no further matches are tried. This strategy is implemented in `StringLexer`. An example illustrating this and the longest match strategy is given after this list.

- Longest match - All matches allowed by the previous strategies are found, and the longest one is used. There can still be multiple matches of the same length; further disambiguation is handled by the next strategy.

- Grammar order - Matches are tried in grammar order, and the first match wins. Since LR can't handle ambiguity, grammar order is a final resolution strategy which always resolves to a single token. For GLR, this strategy is not enabled by default, as we usually want to handle lexical ambiguity using the GLR machinery.
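For example, given the following hypothetical `terminals` section:

```
terminals
If: 'if';
Ident: /\w+/;
Int: /\d+/;
Float: /\d+\.\d+/;
```

On input `if`, both `If` and `Ident` match; the most specific strategy picks the string match `If`. On input `3.14`, `Int` matches `3` while `Float` matches `3.14`; the longest match strategy picks `Float`.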