Regular Expression Recipes

Posted by timothy on Tuesday March 22, 2005 @03:45PM from the prune-talkin' dept.

r3lody writes "If you spend time writing applications that have to do pattern matches and/or replacements, you know about some of the intricacies of regular expressions. For many people they can be an arcane hodgepodge of odd characters that somehow manage to do wonderful things, but they don't have enough time (or interest) to really understand how to code them. Nathan A. Good has written Regular Expression Recipes: A Problem-Solution Approach for those people. In its relatively slim 289 pages, he offers 100 regular expressions in a cookbook format, tailored to solve problems in one of six broad categories (Words and Text, URLs and Paths, CSV and Tab-Delimited Files, Formatting and Validating, HTML and XML, and Coding and Using Commands)." Read on for the rest of Lodato's review.

Regular Expression Recipes: A Problem-Solution Approach
author	Nathan A. Good
pages	289
publisher	Apress
rating	8/10
reviewer	Raymond Lodato (rlodato AT yahoo DOT com)
ISBN	159059441X
summary	A cookbook of useful regular expressions for Perl, Python and more.

Regular expressions are not restricted to just the Perl or shell environments, so Nathan offers variations for Python, PHP, and VIM as well. In most cases the translation is relatively straightforward, but in a few cases a different environment may have (or lack) additional facilities, prompting a different expression to do the same task.

Before you even read chapter 1, Nathan provides a quick summary course on regular expressions, with detail given to each of the five environments you might utilize. He has written the syntax overview in a highly readable format, making it easy to understand the gobbledygook of the most bizarre concoctions you might encounter.

The first chapter (Words and Text) starts simply enough. He gives examples of how to find single words, multiple words, and repeated words, along with examples of how to replace various detected strings with others. In each case he gives an example of its use for each platform, followed by a bit-by-bit breakdown of how it works. Not every environment is given on every example, and in many cases the "How It Works" section refers to the first one, as most REs are identical between the platforms.

The next chapter (URLs and Paths) offers various methods of doing commonly needed parsing. Pulling out file names, query strings, and directories, as well as reconstructing them in useful fashions is covered in the 15 offerings given here. Validating, converting, and extracting fields of CSV and tab-delimited files are handled in chapter 3, while chapter 4 is concerned with validating field formats, as well as re-formatting text for the fields. Chapter 5 handles similar tasks for HTML and XML documents. The final chapter covers expressions that facilitate the management of program code, log files, and the output of selected commands.

First, I must admit that there are a number of useful solutions provided, especially for someone who is concerned with application and web development. However, I did feel a little cheated by the fact that several recipes covered essentially the same task, with only minor variations. It almost seemed as though the author was trying to pad out the solution count to the magic number 100. A simple example: three solutions in chapter one cover (a) replacing smart quotes with straight quotes, (b) replacing copyright symbols with the (c) trigraph, and (c) replacing trademark symbols with the (tm) sequence. In each case, the expression was simply "s/\xhh/ rep /g;". Did we really need three separate recipes for that? I don't think so.
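
For anyone who has not seen the pattern in question, here is a minimal Python sketch of the same idea handled in a single pass. The REPLACEMENTS table and the plain_ascii name are made up for illustration and are not taken from the book; the code points are the standard Unicode ones for the symbols named above.

```python
import re

# Hypothetical mapping (not from the book): each symbol and its plain-text stand-in.
REPLACEMENTS = {
    "\u201c": '"',     # left smart double quote
    "\u201d": '"',     # right smart double quote
    "\u00a9": "(c)",   # copyright sign
    "\u2122": "(tm)",  # trademark sign
}

# One character class covering every key, so a single pass handles all of them.
SYMBOLS = re.compile("[" + "".join(REPLACEMENTS) + "]")

def plain_ascii(text):
    """Replace each listed symbol with its stand-in."""
    return SYMBOLS.sub(lambda m: REPLACEMENTS[m.group(0)], text)
```

Calling plain_ascii on a string containing any of the four symbols returns the same string with the stand-ins substituted, which is arguably one recipe rather than three.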

Another quibble revolves around some of the coding of the expressions. Nathan has made liberal use of non-capturing groups (that is, (?: expr )) to ensure only the items that needed replacement were captured. While a worthy idea, in some cases the expression could have been simplified for easier understanding. Another issue is a slight error in searching for letters. In a number of expressions, Nathan uses [A-z] to capture all letters. Unfortunately, the special characters [, \, ], ^, _, and ` occur between upper-case Z and lower-case a, making it match too much. Either [[:alpha:]] or [A-Za-z] should have been used.
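
The [A-z] problem is easy to check for yourself. The snippet below is a small Python sketch (the variable names are mine, not the book's) that tests the six in-between characters against both classes. Python's re module has no [[:alpha:]], so only the [A-Za-z] alternative is shown.

```python
import re

too_wide = re.compile(r"[A-z]")
letters_only = re.compile(r"[A-Za-z]")

# The six ASCII characters that sit between 'Z' (0x5A) and 'a' (0x61).
in_between = "[\\]^_`"

# Collect which of those characters each class accepts.
matched_by_too_wide = [c for c in in_between if too_wide.fullmatch(c)]
matched_by_letters_only = [c for c in in_between if letters_only.fullmatch(c)]
```

All six characters end up in matched_by_too_wide, while matched_by_letters_only stays empty.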

Despite these quibbles, Regular Expression Recipes does provide a useful compendium of solutions for common problems developers face. Presenting the information in a cookbook fashion, along with ensuring that those using something other than Perl don't have to sweat translating the expressions to their target language, makes this a handy book to have. I wouldn't hesitate to recommend it.

You can purchase Regular Expression Recipes: A Problem-Solution Approach from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

This discussion has been archived. No new comments can be posted.

cant get used to them (Score:1, Informative)

by alienfluid ( 677872 ) writes: on Tuesday March 22, 2005 @03:48PM (#12014942) Homepage

regular expressions are nice and all but i still cant get used to them .. a good manual should be kept handy at all times. Visit Lafayette Linux Users Group at http://lug.lafayette.edu [lafayette.edu]. Suggestions are welcome.

Points (Score:5, Informative)

by 2.7182 ( 819680 ) writes: on Tuesday March 22, 2005 @03:49PM (#12014946)

I really liked this book, but

1. the binding broke
2. the index has a lot of typos.

Regular expressions in a cookbook? (Score:5, Informative)

by DeadSea ( 69598 ) * writes: on Tuesday March 22, 2005 @03:51PM (#12014962) Homepage Journal

Sounds like good eating. ;-)
Regular expressions are great, but once you know them and think you can conquer the world, I find they occasionally let you down. The text editor I was using had a rudimentary regular expression search that did not support non-greedy matching. I found that writing a regular expression to find C-style /* comments */ is quite tricky with only greedy matching [ostermiller.org]. I wrote it up as an article where I build the expression piece by piece, showing common things you might try that won't work.
If you want more of a challenge, try writing a regular expression that finds any <script></script> tags, along with anything in between, using only greedy matching. You will find that the length of your regular expression goes up exponentially with the length of your ending condition.
--
Calculator for Converting Currency [ostermiller.org]
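
One well-known way to write the comment matcher without lazy quantifiers is sketched below in Python. This is not necessarily the exact expression from the linked article, and it makes no attempt to handle comment markers that appear inside string literals.

```python
import re

# /\*                  opening delimiter
# (?:[^*]|\*+[^*/])*   body: any non-star character, or a run of stars
#                      followed by something that is neither star nor slash
# \*+/                 one or more stars, then the closing slash
C_COMMENT = re.compile(r"/\*(?:[^*]|\*+[^*/])*\*+/")

source = "int a; /* first */ int b; /** second **/ int c;"

# findall returns whole matches here because the only group is non-capturing.
comments = C_COMMENT.findall(source)
```

Every quantifier in the pattern is greedy, yet each match stops at the first closing delimiter, because the body is never allowed to consume a star that is followed by a slash.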

I personally... (Score:5, Informative)

by BlueCodeWarrior ( 638065 ) writes: <steevk@gmail.com> on Tuesday March 22, 2005 @03:55PM (#12015012) Homepage

...use 'Mastering Regular Expressions' [oreilly.com]. It's a good book on the topic as well.

add this book to your list (Score:3, Informative)

by yagu ( 721525 ) writes: <yayagu@gmai[ ]om ['l.c' in gap]> on Tuesday March 22, 2005 @03:55PM (#12015014) Journal

While I can't vouch for the quality of the reviewed book, if you want something definitive on regular expressions, Mastering Regular Expressions, Second Edition [amazon.com] by Jeffrey E. F. Friedl is an absolute must for your professional library. Jeffrey breaks down and then builds back up what regular expressions are and how they work, and offers an entire matrix breakout of the slightly different implementations among the most common utilities (grep, sed, awk, perl...). Not to shill for amazon, but if you select the reviewed book, the "buy this book too, and you get this great price" deal actually includes Mastering Regular Expressions, Second Edition. Get 'em both, you won't be sorry.

Re:A language in their own right. (Score:2, Informative)

by APDent ( 81994 ) writes: on Tuesday March 22, 2005 @04:05PM (#12015147)

Regular expressions are not Turing complete.

Re:A language in their own right. (Score:2, Informative)

by smoany ( 832744 ) writes: on Tuesday March 22, 2005 @04:06PM (#12015166)

Um, last time I checked, Reg. Exp's are not Turing complete. Take the language 0^n 1^n (n zeros followed by n ones), which can be recognized by Turing machines. If you can recognize that for me using a Regular Expression, you deserve a Turing Award. Regular expressions are DFA/NFA complete, not Turing complete... not even close!
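
To make the point concrete, here is a minimal Python sketch (the function name is made up for illustration). A regular expression can check the shape "zeros, then ones", but the equal-count condition has to be handled outside the pattern.

```python
import re

# A regular expression can check the shape "some zeros, then some ones" ...
SHAPE = re.compile(r"(0*)(1*)")

def is_0n_1n(s):
    """True if s is n zeros followed by exactly n ones (n may be 0)."""
    m = SHAPE.fullmatch(s)
    # ... but the equal-count test needs unbounded memory that a finite
    # automaton lacks, so it is done here in ordinary code.
    return m is not None and len(m.group(1)) == len(m.group(2))
```

The length comparison is the part with no counterpart in a classic regular expression: it requires counting to an arbitrary n.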

Re:A language in their own right. (Score:4, Informative)

by khrtt ( 701691 ) writes: on Tuesday March 22, 2005 @04:07PM (#12015173)

Regular expressions are probably the first Turing-complete language to be encapsulated in another Turing-complete language (C).

Don't you just love to sound like a StarTrek character, with all that fancy terminology?

Go look up your complexity book - if you have one - regexes are not even close to Turing-complete.

Re:Regular expressions in a cookbook? (Score:5, Informative)

by merlyn ( 9918 ) writes: on Tuesday March 22, 2005 @04:08PM (#12015194) Homepage Journal

Yup, regular expressions are not capable of a full-range of computing

That's the "classic" regular expressions, not the modern regular expressions accepted by PCRE, and Perl itself. In fact, Perl regular expressions are full Turing machines, with PCRE being a few steps behind that. So PCRE isn't really PCRE... it's P-likeCRE. {grin}

Re:Unacceptable mistakes (Score:3, Informative)

by hattmoward ( 695554 ) writes: on Tuesday March 22, 2005 @04:14PM (#12015253)

\w is [A-Za-z0-9_]. The reviewer mentions use of the POSIX character class [[:alpha:]], which is more in line with what you want, and will (is supposed to) match alpha characters in non-ASCII character sets.

Regexes are overused (Score:5, Informative)

by ryantate ( 97606 ) writes: <ryantate@ryantate.com> on Tuesday March 22, 2005 @04:14PM (#12015263) Homepage

Anyone who drops in regularly on a Perl discussion forum (like perlmonks.org) knows that programmers tend to over-use regular expressions.

Regexes are actually a pretty poor way to extract information from comma-delimited or tab-delimited files, for example. By the time you're done dealing with escaped commas, escaped tabs, quoting characters (which many CSV and TDT exporters use in addition to commas and tabs), escaped quote characters, escaped newlines, and escaped escape chars, you end up with a super-complicated regex.

HTML is even more complicated. You have HTML comments and nested tags on top of everything else.

To validate a simple email address, Jeffrey Friedl in his Mastering Regular Expressions book for O'Reilly writes an *11-page* regex.

Most of the time the correct answer is not "here is a regex recipe" but rather "here is a simple library to do the job properly with a parser", like Text::CSV or HTML::Parser in Perl.
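
The same point can be sketched in Python, using the standard-library csv module as a stand-in for Perl's Text::CSV. The sample line is invented for illustration.

```python
import csv
import io

# One record with a quoted comma and doubled (escaped) quote characters.
line = 'Lodato,"Good, Nathan A.","He said ""hi"""\n'

# Naive approach: split on every comma, which breaks inside quoted fields.
naive = line.strip().split(",")

# Parser approach: the csv module understands quoting and escaped quotes.
parsed = next(csv.reader(io.StringIO(line)))
```

The naive split yields four ragged pieces with stray quote characters, while the parser returns the three intended fields.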

F*ck this book and all others like it: (Score:1, Informative)

by stratjakt ( 596332 ) writes: on Tuesday March 22, 2005 @04:16PM (#12015280) Journal

All you need is regexlib.com and a copy of Regulator (I believe that's the free as in beer one) that will break out a regex into English steps like "capture (" "capture 3 or more 0's", and so on.. .NET has a regex facility that's slicker than greased pigeon shit, so I've been making heavy use of it lately.

Regex Coach helps building Regexp (Score:5, Informative)

by uss_valiant ( 760602 ) writes: on Tuesday March 22, 2005 @04:18PM (#12015294) Homepage

Regex Coach [weitz.de]

This program assists you building regular expressions. I've never used it (real men code regexp at once and it works). But some friends recommend it.

Re:Regexes are overused (Score:2, Informative)

by stratjakt ( 596332 ) writes: on Tuesday March 22, 2005 @04:20PM (#12015324) Journal

Of course, the compiled regex will likely be faster than any parsing library you write. So it all depends what you're doing.

For some sort of system that processes umpteen billion transactions per second, they can be a godsend. For parsing a .conf file once every six months when the machine is rebooted, it's a waste of time.

It's all about knowing how and when to use the tool. A pneumatic nailgun can save a carpenter hours on a jobsite, but it's a waste of time to set it all up if you only need to knock in one nailhead that's popped through the drywall.

Re:cant get used to them (Score:2, Informative)

by Anonymous Coward writes: on Tuesday March 22, 2005 @04:21PM (#12015339)

regular expressions are nice and all but i still cant get used to them .. a good manual should be kept handy at all times. [ ... ]
Suggestions are welcome.

I have a suggestion. Write a few regular expressions to get your brain refreshed on them, then go read this excellent article [plover.com] on how regular expressions work. At the very least, it will clear some confusing things up. Most likely you'll find that having a better understanding of the underlying concepts will make it easier for you to work with regular expressions day to day.

Also, it helps if you are familiar with finite state machines. I learned about them in a couple classes while getting my CS degree, but they're not that hard and most people should be able to grasp them without any kind of formal CS training.

Re:cant get used to them (Score:2, Informative)

by Waffle Iron ( 339739 ) writes: on Tuesday March 22, 2005 @04:22PM (#12015349)

regular expressions are nice and all but i still cant get used to them
They may be kind of hard to get used to, but not as hard as writing, debugging and maintaining a dozen or more lines of custom string parsing code for each case where you would use one.

Re:cant get used to them (Score:4, Informative)

by halber_mensch ( 851834 ) writes: on Tuesday March 22, 2005 @04:22PM (#12015350)

A good starting point is to understand finite automata and regular languages first. See http://en.wikipedia.org/wiki/Automata_theory/ [wikipedia.org] for a good first reference on automata. If you can grok automata, regular expressions will click with you.

Different flavors? (Score:4, Informative)

by dpbsmith ( 263124 ) writes: on Tuesday March 22, 2005 @04:30PM (#12015448) Homepage

In an average month, I use regular expressions as implemented in Microsoft Visual C++ 6.0, BBEdit Lite, TextWrangler, Apple MPW, and REALBasic. Every single one of them has _significant_ differences in syntax and semantics.

My understanding is that even the UNIX world sports several different flavors of regular expression in grep, egrep, fgrep, etc.

The biggest barrier to _my_ use of regular expressions is that every time I switch from one regular expression context to another, it takes me a good half hour to refresh my memory of what does and doesn't work in each environment.

Re:Regexes are overused (Score:3, Informative)

by smittyoneeach ( 243267 ) * writes: on Tuesday March 22, 2005 @04:31PM (#12015467) Homepage Journal

Consider the boost libraries http://boost.org/ [boost.org].

You get tokenizer, regex, and a parser library (spirit), sorted by increasing caliber.

It's all about the right tool for the job.

Re:Regular expressions in a cookbook? (Score:3, Informative)

by DeadSea ( 69598 ) * writes: on Tuesday March 22, 2005 @04:33PM (#12015496) Homepage Journal

Your expression fails for this case:
<script><scri</script>
It will match <scri< with your |</scri[^p] rule and then go on to match beyond the end of your regular expression.
But I acknowledge that it may be quadratic rather than exponential even with a correct regular expression.
--
Exchange Rate Calculator [ostermiller.org]

Re:Unacceptable mistakes (Score:3, Informative)

by Speare ( 84249 ) writes: on Tuesday March 22, 2005 @04:36PM (#12015532) Homepage Journal

No, [A-z] does not capture all letters. For example, "Å" and "é" are not usually included in the class [A-z], but they are often part of the class \w.
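
Here is how that plays out in one particular engine, Python 3's re module, where \w is Unicode-aware for str patterns unless the re.ASCII flag is given. Other engines and locales may behave differently, so treat this as an illustration rather than a rule.

```python
import re

# "\u00c5" is A-with-ring and "\u00e9" is e-acute, written as escapes for safety.
samples = ["A", "z", "\u00c5", "\u00e9", "_", "7"]

# [A-z] is a plain code point range from 0x41 to 0x7A.
a_to_z = [s for s in samples if re.fullmatch(r"[A-z]", s)]

# \w with default (Unicode) semantics, then restricted to ASCII.
word = [s for s in samples if re.fullmatch(r"\w", s)]
word_ascii = [s for s in samples if re.fullmatch(r"\w", s, re.ASCII)]
```

In this engine [A-z] accepts the underscore but neither accented letter, Unicode \w accepts all six samples, and ASCII-only \w drops the accented letters again.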

Free Alternative (Score:4, Informative)

by MudButt ( 853616 ) writes: on Tuesday March 22, 2005 @04:48PM (#12015661)

This is free... And interactive...
http://www.regexlib.com/ [regexlib.com]

Re:Another one? (Score:4, Informative)

by carnivore302 ( 708545 ) writes: on Tuesday March 22, 2005 @04:50PM (#12015675) Journal

I don't think there is a need for another book on regexps, since there is already the excellent Mastering Regular Expressions [amazon.com] by Jeffrey Friedl. What else than the best can you expect from an O'Reilly book?

Re:Unacceptable mistakes (Score:2, Informative)

by LordoftheWoods ( 831099 ) writes: on Tuesday March 22, 2005 @04:54PM (#12015731)

the uppercase letters A-Z are followed by a number of special symbols,

Indeed. If anyone is interested in why ASCII sticks a few characters in there, it's because it allows you to flip a bit to switch between cases.
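
A quick Python sketch of that bit trick, with a made-up helper name, plus the characters that land in the gap:

```python
def flip_case(ch):
    """Toggle the case of one ASCII letter by flipping bit 5 (0x20)."""
    if not (ch.isascii() and ch.isalpha()):
        raise ValueError("expected a single ASCII letter")
    return chr(ord(ch) ^ 0x20)

# 'A' is 0x41 and 'a' is 0x61, so the 26 capitals do not fill the 32-slot
# block and six other characters occupy the space between 'Z' and 'a'.
between = "".join(chr(c) for c in range(ord("Z") + 1, ord("a")))
```

The between string holds exactly the six characters the reviewer listed, which is why [A-z] picks them up.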

Re:Regex Coach helps building Regexp (Score:2, Informative)

by DigitalDeviation ( 857048 ) writes: on Tuesday March 22, 2005 @04:56PM (#12015747) Homepage

Regex Coach is nice for those long regexes where you may have missed an escape somewhere. I write most regexes myself, but I'm no guru at it. Regex Coach is a nice verification that the regex works (particularly for extracting something from a large string).

Re:cant get used to them (Score:4, Informative)

by B'Trey ( 111263 ) writes: on Tuesday March 22, 2005 @05:01PM (#12015801)

If you really want to understand regexes, get Jeffrey E. F. Friedl's "Mastering Regular Expressions" from O'Reilly. It's much deeper than the casual reader will ever need, but if you get through it you will certainly know how regexes work from both a user perspective and from a regex engine perspective.

ignorance is bliss (Score:3, Informative)

by RelliK ( 4466 ) writes: on Tuesday March 22, 2005 @05:04PM (#12015841)

If you want more of a challenge, try writing a regular expression that find any <script></script> tags along with anything in between using only greedy matching.
duh! Repeat after me: HTML is not a regular language. There is no regular expression that can match it. The problem arises when people try to use regular expressions without understanding what they are. But, as the saying goes, when the only tool you have is a hammer, everything looks like a nail...

Re:Regexes How2 (Score:5, Informative)

by softcoder ( 252233 ) writes: on Tuesday March 22, 2005 @05:23PM (#12016053)

In addition to a good book, or even INSTEAD of a good book, download and use THE REGEX COACH
http://www.weitz.de/regex-coach/

It is a very very nice interactive pgm that lets you debug REGEXES on the fly visually, by feeding them sample text.

Re:ignorance is bliss (Score:3, Informative)

by DeadSea ( 69598 ) * writes: on Tuesday March 22, 2005 @06:15PM (#12016632) Homepage Journal

> duh! Repeat after me: HTML is not a regular language. There is no regular expression that can match it.
Script tags cannot be nested, which makes that portion of HTML matchable by a regular expression.
--
Currency conversion calculator [ostermiller.org]
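
A rough Python sketch of that claim follows. It uses a lazy quantifier, which the original greedy-only challenge rules out, and it is only meant for simple input: a literal "</script>" inside a JavaScript string, for example, would end the match early. The sample page is invented, apart from the <script><scri</script> case raised earlier in the thread.

```python
import re

# Lazy .*? stops at the first </script>; DOTALL lets "." cross newlines.
SCRIPT = re.compile(r"<script\b[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE)

page = "<p>a</p><script><scri</script><p>b</p><script src='x.js'>\n</script>"

# Each element of blocks is one complete script element, tags included.
blocks = SCRIPT.findall(page)
```

Because script elements do not nest, stopping at the first closing tag is enough to pair each opening tag with its own close.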

Re:I personally... (Score:3, Informative)

by Bryson ( 112202 ) writes: on Tuesday March 22, 2005 @06:30PM (#12016815)

> use 'Mastering Regular Expressions . It's a good book on the topic as well.

I'm one of the few people who doesn't like Friedl's /Mastering
Regular Expressions/. (I have the first edition.)

First, he says that extended regexp engines, such as Perl's, use
nondeterministic finite automata (NFA). Not true; NFA's can
accept exactly the same languages as DFA's (deterministic finite
automata). The extended regexps use search-and-backtrack
engines.

Friedl gives some examples of (extended) regexps that have
catastrophic worst-case behavior, but doesn't present a
systematic method for recognizing or avoiding them. The naive
use of extended regexps, mostly by people who think they have
mastered them, is setting us up for denial-of-service attacks
based on the worst-case complexity of regular expressions.

Formal regular expressions describe exactly the languages DFA's and
NFA's can accept. A DFA can parse any string in time
proportional to the length of the string. Compiling the DFA may
be exponential time, and space, but at least we find out at
compile time, not when some attacker figures out a case we
missed.
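
To illustrate the linear-time claim, here is a tiny hand-built DFA in Python. The automaton (strings over a and b that end in "ab") and all names are hypothetical examples of mine, not something from Friedl's book.

```python
# Hypothetical example DFA: accepts strings over {a, b} that end in "ab".
# States: 0 = no progress, 1 = just saw "a", 2 = just saw "ab" (accepting).
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
ACCEPTING = {2}

def dfa_accepts(s):
    """Run the DFA: one table lookup per input character, no backtracking."""
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # character outside the alphabet
            return False
    return state in ACCEPTING
```

Each input character costs exactly one dictionary lookup, so the running time grows in proportion to the length of the string no matter what the input looks like.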

:help pattern (Score:4, Informative)

by digitect ( 217483 ) writes: <digitect AT dancingpaper DOT com> on Tuesday March 22, 2005 @10:23PM (#12019047)

Of course, if you use the one true text editor [vim.org], all you need to know about regular expressions is:

:help pattern

:)
