A key idea behind Oak is to give a type to parsing expression. For example, we expect
e1 e2 to have the type
(T1, T2) if
e1 has type
e2 has type
T2. Indeed, it exists an obvious mapping between PEG combinators and traditional types found in programming language: choice is a sum type, sequence is a tuple, repetition is an array, etc. Oak was born to explore this mapping and to answer a question: Can we automatically generate an AST from a grammar description?
It turned out that generating the AST (data type included) was hard, mostly because we need to name types and that rules does not give enough information by themselves – how to name the variants of the sum type? Of course, we could annotate expressions with names but Oak is designed to describe a grammar in the cleanest way as possible in the first place, so this is the best solution. Also, the user will certainly want to use its own custom types and not arbitrary generated types, so a fully automatic generation is not such a good idea. Therefore, Oak relies on the return types of semantic actions to have a complete type inference scheme. That is, the user implicitly brings additional type information to Oak through semantic actions. This technique has at least two advantages over conventional methods:
number "+" number > addis a valid expression where
addis a semantic action called with two arguments of the type of
!numberwill only recognize the expression
numberbut semantic actions inside
numberwill not be called.
This chapter explains how Oak gives type to expression and how you can efficiently control and use it.
Despite the apparent simplicity of this idea, a direct mapping between expression and type is not very useful. Consider the following grammar for parsing variable identifier.
var_ident = !["0-9"] ["a-zA-Z0-9_"]+ spacing spacing = [" \n\r\t"]*Run
A straightforward mapping would give to this expression the type
(char, Vec<char>, Vec<char>) since the sequence has three sub expressions and thus forms a 3-tuple. Clearly, the only value of interest in
var_ident is the one returned by the expression
["a-zA-Z0-9_"]+ which has type
Vec<char> (note that we could use a semantic action to transform this value into a string). It is natural to think that the rule
var_ident will be of type
Vec<char> too. Oak infers this type if we tell him that we do not care about the value of spaces which is not something that it can guess by itself. We use the combinator
e -> (^) to inform to Oak that we do not want the value of
e to appear in the AST. There is two possible types: unit type
() and invisible type
(^), they both give the type unit to expressions but, in addition,
(^) propagates in the expression tree.
var_ident = !["0-9"] ["a-zA-Z0-9_"]+ spacing spacing = [" \n\r\t"]* -> (^)Run
The new type of
var_ident is now
(char, Vec<char>, (^)). The inference algorithm automatically reduces this type to
Vec<char> thanks to a few simplification rules:
?e) has type
(^). The new type is
((^), Vec<char>, (^)).
These type rewriting rules are intuitive because they produce the type the user expects! Type annotation is only needed to specify that we are not interested by the value, such as with spaces.
A type containing a unit type is simplified if it does not erase a piece of information. If we consider the following grammar which describe the optional presence of the
mut keyword on the left-hand side of a let-expression, the
mut_kw? type is not rewritten into
let_left = let_kw mut_kw? var_ident let_kw = "let" spacing mut_kw = "mut" spacing -> ()Run
-> () otherwise the expression would have the invisible type since literal string and, here spacing, have the type
Therefore, the type of the expression
Option<()> which is expected since the type
Option<()> carries a boolean information. As a rule of thumb, unit inference never erase a potential piece of information. In some cases, expression are only of a pure syntactic interest such as spaces or the first optional
| in OCaml pattern-matching. This is why we use the "invisible type" annotation
e -> (^) to indicate that the unit type must be propagated up since it does not carry any relevant semantic information.
match_expr = match_kw expr with_kw bar? cases cases = case (bar case)* bar = "|" spacingRun
match_expr, the expression
bar? have by default the type
(^). The circumflex symbol in
(^) indicates a bottom up propagation of unit in expressions. The propagation is only stopped if it is composed with a value of a relevant type. For example, the expression
bar? expr has type
(^) has been propagated across
Option<(^)> and then stopped by the tuple
We must distinguish recursive rules that are totally valid in Oak and recursive types that can not be automatically inferred. For example, the following grammar accepts strings in which any character at position
i is 'a' or 'b' if
i is even and is otherwise 'c' or 'd'.
ab = ["ab"] cd? cd = ["cd"] ab?Run
This is a totally valid grammar but it can not be typed without recursive types. Let's try anyway. Say that
ab has type
cd has type
U. We can infer that
T = (char, Option<U>) for rule
U = (char, Option<T>) for rule
cd. By substitution we get
T = (char, Option<(char, Option<T>)) and thus
T is defined by itself. It is a recursive type definition. You might think that this type is fine as long as we give an alias to the tuple types:
type T = (char, Option<U>); type U = (char, Option<T>);Run
However, the names
U are completely arbitrary and the user probably do not want types with random names. We would need name-annotations on expressions which is not our leitmotiv in the first place. It is cleaner and easier to let the user constructs the types by himself with semantic actions.
Nevertheless, we did not want to reject valid grammar because of recursive types. We have chosen to print a warning during compilation informing we reduced the types of rules involved in a type cycle to
(^). You can get rid of this warning by explicitly annotating one the rule in the cycle with