W3 pm
Daniel Klein

Regular Expression Mastery

Almost everyone has written a regex that produced unexpected results. Sometimes regexes appear to hang forever, and it’s not clear what has gone wrong. Sometimes they behave differently in different utilities, and you can’t tell why. This class addresses all of these problems. The first section of the class will explore the matching algorithms used internally by common utilities such as grep and Perl. Understanding these algorithms will allow us to predict whether a regex will match, which of several possible matches will be found, and which regexes are likely to be faster than others, and to understand why all of these behaviors occur. We’ll learn why commonly used regex symbols such as “.”, “$”, and “\1” may not mean what you thought they did. In the second section, we’ll look at common matching disasters, a few practical parsing applications, and some advanced Perl features. We’ll finish with a discussion of optimizations that were added to Perl 5.6, and why you should avoid using “/i”.
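As a small taste of the surprises mentioned above, here is a minimal sketch. It assumes Python's standard re module rather than Perl or grep (the class itself does not use Python); the variable names are illustrative only, and other engines may differ in the details.

```python
import re

# "." does not match a newline unless the DOTALL flag is set.
dot_default = re.search(r"a.b", "a\nb")
dot_dotall = re.search(r"a.b", "a\nb", re.DOTALL)

# "$" matches at the very end of the string, but also just
# before a single trailing newline.
dollar_match = re.search(r"cat$", "cat\n")

# "\1" matches the exact text the first group captured,
# not "anything the group's pattern could match".
same_text = re.search(r"(\d)-\1", "3-3")
other_text = re.search(r"(\d)-\1", "3-4")

print(dot_default, dot_dotall, dollar_match, same_text, other_text)
```

In this sketch only the first and last searches fail; the others succeed, which is often not what people expect from “.”, “$”, and “\1”.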
Topics include:
- Inside the regex engine
- Regular expressions are programs
- Backtracking
- NFA vs. DFA
- POSIX and Perl
- Quantifiers
- Greed and anti-greed
- Anchors and assertions
- Backreferences
- Disasters and optimizations
- Where machines come from
- Disaster examples
- Tokenizing
- New optimizations
- Matching strings with balanced parentheses
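Two of the topics above, greed versus anti-greed and the difference between Perl-style and POSIX-style matching, can be previewed with a short sketch. It again assumes Python's re module, a backtracking engine chosen here only for illustration; the sample strings and variable names are hypothetical.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# A greedy quantifier takes as much as it can, then backs off
# only as far as needed for the rest of the pattern to match.
greedy = re.search(r"<.+>", text).group()

# A non-greedy ("anti-greedy") quantifier takes as little as it can.
lazy = re.search(r"<.+?>", text).group()

# A backtracking engine tries alternatives left to right and keeps
# the first one that works. A POSIX leftmost-longest matcher would
# report "ab" for the same pattern and input.
first_alternative = re.search(r"a|ab", "ab").group()

print(repr(greedy), repr(lazy), repr(first_alternative))
```

Here the greedy pattern spans the whole string, the non-greedy one stops at the first “>”, and the alternation returns “a” even though a longer match exists.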
Who should attend:
System administrators and users who use Perl, grep, sed, awk, procmail, vi, or emacs.