This section describes the regular expression library in the RegExp directory of the SML/NJ library. This directory includes a README file which has some brief notes on using the library. This section shows some examples.
To use the library, your CM files will need to include regexp-lib.cm (found in the same place as your smlnj-lib.cm file).
The regular expression library is designed to be very flexible. It is divided into:
a front-end section that implements the syntax of regular expressions;
a back-end section that implements the matching of regular expressions;
a glue section that joins the two together.
The idea is that you can have more than one style of syntax for regular expressions, e.g. Perl versus grep. The different syntaxes can be combined with different implementations of the matching algorithm. You can even feed your own regular expressions, written in the internal format, directly to the matching algorithms.
At the time of writing there is only one front-end, which implements an Awk-like syntax. There are two back-ends. One uses back-tracking and the other compiles the regular expression to a deterministic finite-state automaton (DFA). The back-tracking matcher is described as "slow, low memory footprint, low startup cost". The DFA matcher is described as "fast, but memory-intensive and high startup cost (the cost of constructing the automaton in the first place)".
The front-end and back-end are combined using the RegExpFn functor from the Glue/regexp-fn.sml source file. For example:

    structure RE = RegExpFn(structure P=AwkSyntax structure E=BackTrackEngine)
The resulting structure has this signature.

    signature REGEXP =
    sig
        (* The type of a compiled regular expression. *)
        type regexp

        (* Read an external representation of a regular expression
           from a stream. *)
        val compile: (char,'a) StringCvt.reader ->
                        (regexp, 'a) StringCvt.reader

        (* Read an external representation of a regular expression
           from a string. *)
        val compileString : string -> regexp

        (* Scan the stream for the first occurrence of the regular
           expression. *)
        val find: regexp -> (char,'a) StringCvt.reader ->
                    ({pos: 'a, len: int} option MatchTree.match_tree,'a)
                        StringCvt.reader

        (* Attempt to match the stream at the current position with
           the regular expression. *)
        val prefix: regexp -> (char,'a) StringCvt.reader ->
                    ({pos: 'a, len: int} option MatchTree.match_tree,'a)
                        StringCvt.reader

        (* Attempt to match the stream at the current position with one
           of the external representations of regular expressions and
           trigger the corresponding action. *)
        val match: (string *
                    ({pos:'a, len:int} option MatchTree.match_tree -> 'b)
                   ) list ->
                    (char,'a) StringCvt.reader -> ('b, 'a) StringCvt.reader
    end

Your program will first compile a regular expression using either the compile or compileString function. You can then use find or prefix to match a string against the compiled regular expression. The match function instead takes regular expressions in their external (string) form. The result of matching is a match tree. Here is the (partial) signature which defines the tree.

    signature MATCH_TREE =
    sig
        (* A match tree is used to represent the results of a nested
           grouping of regular expressions. *)
        datatype 'a match_tree = Match of 'a * 'a match_tree list

        (* Return the root (outermost) match in the tree. *)
        val root : 'a match_tree -> 'a

        (* return the nth match in the tree; matches are labeled in
           pre-order starting at 0. Raises Subscript *)
        val nth : ('a match_tree * int) -> 'a
        ...

Each node in the tree corresponds to a regular expression in parentheses (a group), except the root of the tree, which corresponds to the whole regular expression. Since groups can be nested you get a tree of matches. Each match tree node stores an optional pair of position and length (see the match_tree type in the REGEXP signature above). If the group was matched with part of the original string then this pair shows where. The pair is NONE if the group was not matched with anything, e.g. if it belongs to an alternative that was never taken.
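The pre-order numbering can be checked on a hand-built tree. The following is a minimal sketch, assuming a recent SML/NJ in which the library can be loaded interactively with CM.make. The string labels are hypothetical stand-ins for the position and length records, and the tree shape is the one you would expect for a pattern such as "(a(b))(c)".

```sml
(* Interactive use only; in a CM project list regexp-lib.cm instead. *)
val _ = CM.make "$/regexp-lib.cm";

structure MT = MatchTree

(* Hand-built tree: the root is the whole expression, group 2 is nested
   inside group 1, and group 3 follows group 1. Labels are illustrative. *)
val tree = MT.Match ("whole", [
               MT.Match ("group 1", [MT.Match ("group 2", [])]),
               MT.Match ("group 3", [])
           ])

val whole  = MT.root tree       (* the outermost match *)
val second = MT.nth (tree, 2)   (* pre-order: the nested group is visited first *)
val third  = MT.nth (tree, 3)   (* then the next group at the top level *)
```

Here nth (tree, 2) selects the nested node and nth (tree, 3) the node that follows it, which is the same order you get by counting left-parentheses in the pattern.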
The matching functions are designed to work with the StringCvt scanning infrastructure (see the section called Text Scanning in Chapter 3). So, for example, the expression (find regexp) is a function that maps a character reader to a reader of match trees. To match a string you will need to combine it with the StringCvt.scanString function.
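The difference between find and prefix shows up when the match does not start at the first character. The following is a small sketch, assuming a recent SML/NJ with interactive CM.make; the structure name RE, the helper succeeds and the pattern "[0-9]+" are chosen only for illustration. It ignores the contents of the match tree and only tests whether a match was found.

```sml
(* Interactive use only; in a CM project list regexp-lib.cm instead. *)
val _ = CM.make "$/regexp-lib.cm";

structure RE = RegExpFn(structure P = AwkSyntax
                        structure E = BackTrackEngine)

val digits = RE.compileString "[0-9]+"

(* Hypothetical helper: run a matcher over a string with scanString and
   report only whether it produced a match tree. *)
fun succeeds matcher s =
      isSome (StringCvt.scanString (matcher digits) s)

val atStart    = succeeds RE.prefix "42 apples"  (* digits at position 0 *)
val notAtStart = succeeds RE.prefix "apples 42"  (* prefix does not scan ahead *)
val anywhere   = succeeds RE.find   "apples 42"  (* find scans forward *)
```

With these inputs prefix should succeed only on the first string, while find should also succeed on the second.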
The match function takes a list of pairs of a regular expression (which will be compiled on the fly) and a function to post-process the match tree. It returns the post-processed result (of the type 'b in the REGEXP signature).
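As a rough illustration of match, the sketch below classifies the start of a string. It assumes a recent SML/NJ with interactive CM.make; the name classify and the two patterns are made up for the example, and the actions ignore the match tree and just return a label.

```sml
(* Interactive use only; in a CM project list regexp-lib.cm instead. *)
val _ = CM.make "$/regexp-lib.cm";

structure RE = RegExpFn(structure P = AwkSyntax
                        structure E = BackTrackEngine)

(* Written with an explicit getc argument so that classify stays
   polymorphic in the stream type (a plain val binding would fall foul
   of the value restriction). *)
fun classify getc = RE.match [
        ("[0-9]+", fn _ => "number"),
        ("[a-z]+", fn _ => "word")
    ] getc

val a = StringCvt.scanString classify "123"
val b = StringCvt.scanString classify "abc"
val c = StringCvt.scanString classify "?!"
```

Because the patterns are compiled when match is applied to the list, writing classify this way recompiles them each time it is applied to a reader. That is fine for a demonstration, but you may prefer find or prefix with a precompiled expression in a loop.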
All of this is very flexible but a bit verbose to use. The following sections will show some examples.
This test will match the regular expression "the.(quick|slow).brown" against the string "the quick brown fox". First I build some matchers to try out.

    structure BT  = RegExpFn(structure P=AwkSyntax structure E=BackTrackEngine)
    structure DFA = RegExpFn(structure P=AwkSyntax structure E=DfaEngine)

Here is the function to run the matching using the BT matcher.

    fun demo1BT msg =
    let
        val regexp = BT.compileString "the.(quick|slow).brown"
    in
        case StringCvt.scanString (BT.find regexp) msg of
          NONE      => print "demo1 match failed\n"
        | SOME tree => show_matches msg tree
    end

The scanString function is used to apply the matcher to the message. The show_matches function reports the parts of the string that were matched by each group in the regular expression. Here it is.

    (* Show the matches n=0, ... *)
    and show_matches msg tree =
    let
        val last = MatchTree.num tree

        fun find n =
        (
            case MatchTree.nth(tree, n) of
              NONE            => "<Unmatched>"
            | SOME {pos, len} => String.substring(msg, pos, len)
        )

        and loop n =
        (
            print(concat[Int.toString n, " => ", find n, "\n"]);
            if n >= last then () else loop(n+1)
        )
    in
        loop 0
    end

Groups are numbered by counting left-parentheses left to right from 1. Group 0 represents the entire regular expression. The nth function returns the match tree node for the nth group. The show_matches function just iterates for increasing values of n. The number of the last group is given by the num function. The output of this test is:

    Demo 1 using BT
    0 => the quick brown
    1 => quick

The front-end translates a regular expression to an intermediate form which is represented by the syntax datatype. This is defined in the following signature from FrontEnd/syntax-sig.sml. The RegExpSyntax structure implements this signature.

    signature REGEXP_SYNTAX =
    sig
        exception CannotParse
        exception CannotCompile

        structure CharSet : ORD_SET where type Key.ord_key = char

        datatype syntax =
            Group of syntax
          | Alt of syntax list
          | Concat of syntax list
          | Interval of (syntax * int * int option)
          | Option of syntax    (* == Interval(re, 0, SOME 1) *)
          | Star of syntax      (* == Interval(re, 0, NONE) *)
          | Plus of syntax      (* == Interval(re, 1, NONE) *)
          | MatchSet of CharSet.set
          | NonmatchSet of CharSet.set
          | Char of char
          | Begin               (* Matches beginning of stream *)
          | End                 (* Matches end of stream *)

        val addRange : CharSet.set * char * char -> CharSet.set
        val allChars : CharSet.set
    end

You can build regular expressions using this datatype. This intermediate form is further translated by the back-end to its own internal representation, for example the DFA for the DFA back-end. Each back-end has its own compile function to do this.
The following code shows the quick brown fox example from the previous section done this way.

    local
        structure RE = RegExpSyntax
        structure CS = RE.CharSet

        val dot = RE.NonmatchSet(CS.singleton #"\n")

        fun cvt_str s = RE.Concat(map RE.Char (explode s))
    in
        fun demo2BT msg =
        let
            (* "the.(quick|slow).brown" *)
            val regexp = BackTrackEngine.compile(RE.Concat[
                            cvt_str "the",
                            dot,
                            RE.Group(RE.Alt[
                                cvt_str "quick",
                                cvt_str "slow"]),
                            dot,
                            cvt_str "brown"
                            ])
        in
            case StringCvt.scanString (BT.find regexp) msg of
              NONE      => print "demo2 match failed\n"
            | SOME tree => show_matches msg tree
        end
    end

The dot in a regular expression usually means any character excluding the new-line character. I can achieve this with NonmatchSet, which matches every character except those in the set. Look at the ORD_SET signature for the available operations on character sets.
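Character classes can be built in the same way with addRange and the ORD_SET operations. The following is a small sketch, assuming a recent SML/NJ with interactive CM.make; the names lower, alnum and word are made up for the example, which builds the equivalent of the class [a-z0-9].

```sml
(* Interactive use only; in a CM project list regexp-lib.cm instead. *)
val _ = CM.make "$/regexp-lib.cm";

structure S  = RegExpSyntax
structure CS = S.CharSet

(* The class [a-z0-9], built by adding two ranges to the empty set. *)
val lower = S.addRange (CS.empty, #"a", #"z")
val alnum = S.addRange (lower, #"0", #"9")

(* One or more characters from the class, as a syntax value. *)
val word = S.Plus (S.MatchSet alnum)

(* Membership tests use the ordinary ORD_SET interface. *)
val hasM     = CS.member (alnum, #"m")
val hasUpper = CS.member (alnum, #"M")
```

A value such as word can then be placed inside a Concat or Group like any other syntax value.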
The cvt_str function converts a string to a sequence of character matchers. The syntax value is not the simplest since the cvt_str calls produce redundant nesting of Concats. If you were going to be doing a lot of this sort of thing it would be useful to write a normalising function that flattened nested Concats. The Group constructor signals a group of characters to be put into the match tree. The result is the same as before.

    Demo 2 using BT
    0 => the quick brown
    1 => quick

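The normalising function mentioned above could look something like the following. This is only a sketch, assuming a recent SML/NJ with interactive CM.make and the syntax datatype exactly as listed in the REGEXP_SYNTAX signature; the name flatten is hypothetical and is not part of the library.

```sml
(* Interactive use only; in a CM project list regexp-lib.cm instead. *)
val _ = CM.make "$/regexp-lib.cm";

structure S = RegExpSyntax

(* Flatten nested Concats: Concat[Concat[a, b], c] becomes Concat[a, b, c].
   Other constructors are rebuilt around their flattened sub-expressions. *)
fun flatten (S.Concat res) =
      let
          (* Flatten one element, then splice it in if it is a Concat. *)
          fun splice re =
                (case flatten re of
                   S.Concat inner => inner
                 | other          => [other])
      in
          S.Concat (List.concat (map splice res))
      end
  | flatten (S.Alt res)               = S.Alt (map flatten res)
  | flatten (S.Group re)              = S.Group (flatten re)
  | flatten (S.Interval (re, lo, hi)) = S.Interval (flatten re, lo, hi)
  | flatten (S.Option re)             = S.Option (flatten re)
  | flatten (S.Star re)               = S.Star (flatten re)
  | flatten (S.Plus re)               = S.Plus (flatten re)
  | flatten re                        = re   (* sets, Char, Begin, End *)

val nested = S.Concat [S.Concat [S.Char #"a", S.Char #"b"], S.Char #"c"]
val flat   = flatten nested
```

Groups are deliberately left in place, since removing a Group would change the numbering of the nodes in the match tree.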