TDParseKit
About
TDParseKit is a Mac OS X Framework written by Todd Ditchendorf in Objective-C 2.0 and released under the MIT Open Source License. The framework is an Objective-C port of the tools described in "Building Parsers with Java" by Steven John Metsker. Some changes have been made to the designs in the book to match common Cocoa/Objective-C design patterns and conventions. However, the changes are relatively superficial, and the book is the best documentation available for this framework.
Xcode Project
The Xcode project containing this framework consists of 4 targets:
- Framework : the TDParseKit Framework.
- Tests : a UnitTest Bundle containing many unit tests (actually, more correctly, interaction tests) for the framework as well as some example classes that serve as real-world usages of the framework.
- DemoApp : A simple Cocoa demo app that gives a visual presentation of the results of tokenizing text using the TDTokenizer class.
- DebugApp : A simple Cocoa app that exists only to run arbitrary test code through GDB with breakpoints for debugging (I was not able to do that with the UnitTest bundle).
TDParseKit Framework
Classes in the TDParseKit Framework offer 2 basic services of general use to Cocoa developers:
- Tokenization via a tokenizer class
- Parsing via a high-level parser-building toolkit
Tokenization
The API for tokenization is provided by the TDTokenizer class. Cocoa developers will be familiar with the NSScanner class in the Foundation Framework, which provides a similar service. However, the TDTokenizer class is much simpler, yet more configurable, flexible, and powerful.
Example usage:
NSString *s = @"\"It's 123 blast-off!\", she said, // watch out!\n"
              @"and <= 3.5 'ticks' later /* wince */, it's blast-off!";
TDTokenizer *t = [TDTokenizer tokenizerWithString:s];
TDToken *eof = [TDToken EOFToken];
TDToken *tok = nil;
while ((tok = [t nextToken]) != eof) {
    NSLog(@" (%@)", tok.stringValue);
}
outputs:
("It's 123 blast-off!")
(,)
(she)
(said)
(,)
(and)
(<=)
(3.5)
('ticks')
(later)
(,)
(it's)
(blast-off)
(!)
Each token produced is an object of class TDToken. TDTokens have a tokenType (Word, Symbol, Num, QuotedString, etc.) and both a stringValue and a floatValue.
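To make those token properties concrete, here is a minimal sketch that logs all three for every token. It reuses only the tokenizer calls from the example above. The umbrella header name and the helper name LogTokenDetails are assumptions for illustration, and the casts assume that tokenType is an integer enum and floatValue a floating-point scalar; check TDToken.h for the exact declarations.

```objc
#import <Foundation/Foundation.h>
#import <TDParseKit/TDParseKit.h> // assumed umbrella header; adjust to your project setup

// Hypothetical helper (not part of the framework): logs the type,
// string value, and float value of each token found in `s`.
static void LogTokenDetails(NSString *s) {
    TDTokenizer *t = [TDTokenizer tokenizerWithString:s];
    TDToken *eof = [TDToken EOFToken];
    TDToken *tok = nil;
    while ((tok = [t nextToken]) != eof) {
        // The casts are assumptions about the declared property types.
        NSLog(@"type=%ld string=%@ float=%f",
              (long)tok.tokenType, tok.stringValue, (double)tok.floatValue);
    }
}
```

Calling LogTokenDetails(@"and <= 3.5 'ticks' later") should log one line per token, which is a quick way to see how a given input is classified before building a parser on top of it.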
As the output of the example usage above shows, TDTokenizer is configured by default to handle several common parsing tasks:
- C- and C++-style comments
- single- and double-quoted string tokens
- common multiple-character symbols (<=)
- apostrophes, dashes and other symbol chars that should not signal the start of a new Symbol token, but rather be included in the current Word or Num token (it's, blast-off, 3.5)
All of those features are configurable. TDTokenizer may be configured to:
- recognize more (or fewer) multi-char symbols. ex:
[t.symbolState add:@"!="];
allows != to be recognized as a single Symbol token rather than two adjacent Symbol tokens
- add new internal symbol chars to be included in the current Word token or, conversely, make internal symbols like apostrophe and dash signal a new Symbol token rather than remain part of the current Word token. ex:
[t.wordState setWordChar:'_'];
allows Word tokens to contain internal underscores
- change which chars signal the start of a token of any given type. ex:
[t setTokenizerState:t.wordState from:'_' to:'_'];
allows Word tokens to start with underscore
- turn off recognition of "slash-slash" (//) comments. ex:
[t setTokenizerState:t.symbolState from:'/' to:'/'];
slash chars now produce a Symbol token rather than causing the tokenizer to strip text until the next newline char or, where appropriate (/*), to begin stripping a multiline comment
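The options above can be combined on a single tokenizer before the first call to nextToken. The following is only a sketch: it uses the configuration calls exactly as they are written in the list above, while the umbrella header name, the helper name LogConfiguredTokens, and the sample input are assumptions for illustration.

```objc
#import <Foundation/Foundation.h>
#import <TDParseKit/TDParseKit.h> // assumed umbrella header; adjust to your project setup

// Hypothetical helper (not part of the framework): applies the four
// configuration calls from the list above, then logs every token.
static void LogConfiguredTokens(void) {
    NSString *s = @"max_len != _limit / 2"; // sample input, chosen for illustration
    TDTokenizer *t = [TDTokenizer tokenizerWithString:s];

    // Configure before asking for the first token.
    [t.symbolState add:@"!="];                            // treat != as one Symbol token
    [t.wordState setWordChar:'_'];                        // allow _ inside Word tokens
    [t setTokenizerState:t.wordState from:'_' to:'_'];    // allow _ to start a Word token
    [t setTokenizerState:t.symbolState from:'/' to:'/'];  // treat / as a plain Symbol

    TDToken *eof = [TDToken EOFToken];
    TDToken *tok = nil;
    while ((tok = [t nextToken]) != eof) {
        NSLog(@" (%@)", tok.stringValue);
    }
}
```

Going by the descriptions above, max_len and _limit should each come back as a single Word token, != as a single Symbol token, and the slash as a Symbol token rather than the start of a comment.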
Parsing
TDParseKit also includes a collection of token parser subclasses (of the abstract TDParser class) including collection parsers such as TDAlternation, TDSequence, and TDRepetition as well as terminal parsers including TDWord, TDNum, TDSymbol, TDQuotedString, etc. Also included are parser subclasses which work on individual chars such as TDChar, TDDigit, and TDSpecificChar. These char parsers are useful for things like RegEx parsing. Generally speaking though, the token parsers will be more useful and interesting.
The parser classes follow the Composite pattern. Programs can build a composite parser in Objective-C (rather than in a separate language, as with lex and yacc) from a collection of terminal parsers composed into alternations, sequences, and repetitions to represent an infinite number of languages.
Parsers built from TDParseKit are non-deterministic, recursive descent parsers, which basically means they trade some performance for ease of user programming and simplicity of implementation.
Here is an example of how one might build a parser for a simple voice-search command language (note: TDParseKit does not include any kind of speech recognition technology). The language consists of:
search google for? <search-term>
...
[self parseString:@"search google 'iphone'"];
...
- (void)parseString:(NSString *)s {
    // The whole command is a sequence: "search" "google" ["for"] <search-term>
    TDSequence *parser = [TDSequence sequence];
    [parser add:[[TDLiteral literalWithString:@"search"] discard]];
    [parser add:[[TDLiteral literalWithString:@"google"] discard]];
    // "for" is optional: match either nothing or the literal
    TDAlternation *optionalFor = [TDAlternation alternation];
    [optionalFor add:[TDEmpty empty]];
    [optionalFor add:[TDLiteral literalWithString:@"for"]];
    [parser add:[optionalFor discard]];
    // The search term is a quoted string; the assembler callback fires when it matches
    TDParser *searchTerm = [TDQuotedString quotedString];
    [searchTerm setAssembler:self selector:@selector(workOnSearchTermAssembly:)];
    [parser add:searchTerm];
    TDAssembly *result = [parser bestMatchFor:[TDTokenAssembly assemblyWithString:s]];
    NSLog(@" %@", result);
    // output:
    // ['iphone']search/google/'iphone'^
}
...
- (void)workOnSearchTermAssembly:(TDAssembly *)a {
    TDToken *t = [a pop]; // a QuotedString token with a stringValue of 'iphone'
    [self doGoogleSearchForTerm:t.stringValue];
}
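The example above composes a sequence and an alternation. A repetition can be composed the same way. The sketch below builds a parser for a comma-separated list of numbers, that is, Num (',' Num)*. The factory methods [TDNum num], [TDSymbol symbolWithString:], and [TDRepetition repetitionWithSubparser:] are assumptions named by analogy with [TDQuotedString quotedString] and [TDLiteral literalWithString:] above, so check the framework headers for the exact selectors. The umbrella header name and both helper names are likewise hypothetical.

```objc
#import <Foundation/Foundation.h>
#import <TDParseKit/TDParseKit.h> // assumed umbrella header; adjust to your project setup

// Hypothetical helper: builds a parser for  Num (',' Num)*
static TDParser *MakeNumberListParser(void) {
    // One "tail" element: a comma (discarded) followed by a number.
    // symbolWithString: and num are assumed factory methods.
    TDSequence *commaNum = [TDSequence sequence];
    [commaNum add:[[TDSymbol symbolWithString:@","] discard]];
    [commaNum add:[TDNum num]];

    // The list: one number, then zero or more tail elements.
    // repetitionWithSubparser: is an assumed factory method.
    TDSequence *list = [TDSequence sequence];
    [list add:[TDNum num]];
    [list add:[TDRepetition repetitionWithSubparser:commaNum]];
    return list;
}

// Hypothetical helper: parses `s` and logs the resulting assembly.
static void ParseNumberList(NSString *s) {
    TDParser *parser = MakeNumberListParser();
    TDAssembly *result = [parser bestMatchFor:[TDTokenAssembly assemblyWithString:s]];
    NSLog(@" %@", result);
}
```

A call such as ParseNumberList(@"1, 2, 3") would then log the best-matching assembly, in the same stack/consumed-tokens format as the output shown in the voice-search example.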