A Brief Guide to Regular Expressions

What is a regular expression?

Regular expressions are a way to find text that falls into specific patterns. When coupled with a search-and-replace engine, they make it far more powerful: a single pattern can do the work of hundreds of individual searches.

Consider a footnoted book. Typically, such books indicate a footnote reference with a superscript number. When converting these books to eBooks, the numbers have to turn into links:

<sup>1</sup>

has to become

<sup><a href="notes.htm#nt1">1</a></sup>

With normal search, all you can do is find each note and insert its link by hand. With regular expressions, every single reference in a file can have its link created in a single search-and-replace. No muss, no fuss, no mistakes.

Regular expressions have a reputation for being hard to use. Don’t let that scare you. The people who gave them this reputation are computer programmers, who typically work better with numbers than text. Most people who work with text for a living take to regular expressions quickly, and wonder how they ever lived without them.

How can I use regular expressions?

Many text editors allow regular-expression search-and-replace. EditPlus for Windows has this capability, as does BBEdit for the Macintosh.

EditPlus

The EditPlus search-replace window has a checkbox called “Regular expression.” To use regular expressions in your search, simply check this box.

BBEdit

BBEdit’s search-replace window also has such a checkbox; its label, however, is “Use grep”.

Grep. What an odd term. The word “grep” comes from the UNIX operating system, home of some of the first implementations of regular expressions. UNIX programmers delighted in reducing long commands to terse abbreviations; “grep” comes from g/re/p, a command in the UNIX line editor ed that means “globally search for a regular expression and print the matching lines.”

As best I can tell, the EditPlus and BBEdit implementations of regular expressions are essentially identical, so anything in this document should work with both programs. Do NOT, however, assume that regular expressions always work the same way everywhere. They don’t. When in doubt, read the manual or online help, and do your tests on a COPY of your file.

Defining regular expression patterns

Regular-expression patterns form a special little language in which ordinary symbols take on special meanings. This guide will go through the special meanings little by little, with examples. You will get the most from this guide if you read all the way through it. To be sure you do, I’ve left a very important piece of information, how to replace what you find, for the end.

Dot, question mark, star, plus, and backslash

Imagine that you have a book of letters, and you need to tag all the salutations. Salutations fall into a pattern: the word “Dear,” a name, and a colon (or possibly a comma, but we’ll stick with a colon for now). Obviously, the problem with finding this via ordinary search is that the name could be anything.

Regular expressions have a way of saying “any character”: the dot, or period. To find a three-letter word beginning and ending with “b”, for example, you could search on b.b. This would find “bib” or “bob” or “bub”, but not “bud” or “dub” or “bulb”.
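If you would like to try the dot outside an editor, here is a minimal sketch. It assumes Python's built-in re module rather than EditPlus or BBEdit, and the word list is simply the examples from the paragraph above.

```python
import re

# The dot stands for exactly one character of any kind (a hard return excepted).
pattern = re.compile(r"b.b")

words = ["bib", "bob", "bub", "bud", "dub", "bulb"]

# Keep only the words in which the pattern is found somewhere.
matches = [word for word in words if pattern.search(word)]
print(matches)
```

Only the first three words are kept; “bulb” fails because two characters, not one, sit between its two “b”s.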

Note that “whitespace” characters such as space or tab can also be located by the dot. So b.b would find words with a space or tab between them. (Quick exercise: where would b.b match in the preceding sentence? There are two possibilities!) The hard return, however, is not matched by a dot; more on hard returns later.

This still won’t solve our salutation problem, though: names are made up out of a variable number of letters, not just one.

Regular expressions have several ways to say “not just one”: the question mark (?), the star or asterisk (*), and the plus (+). The question mark means “zero or one,” the star means “zero or more,” and the plus means “one or more.” These marks are like adjectives; they modify other characters. What’s more, they’re like adjectives in some foreign languages, in that they come immediately after the character they modify.

So the regular expression Ba? will match B or Ba but not Baa or a. The regular expression Ba* will match B or Ba or Baa, up to any number of “a”s. The regular expression Ba+ will match Ba or Baa and on up, but it will not match B by itself, since the plus sign demands at least one “a.”
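The three modifiers can be compared side by side in a short sketch. It assumes Python's re module, not either editor, and whole_matches is a made-up helper name. It uses re.fullmatch, which insists that the pattern cover the whole string; an ordinary search would happily find the “Ba” at the front of “Baa”.

```python
import re

samples = ["B", "Ba", "Baa", "a"]

def whole_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

optional = whole_matches(r"Ba?", samples)      # zero or one "a"
any_number = whole_matches(r"Ba*", samples)    # zero or more "a"s
at_least_one = whole_matches(r"Ba+", samples)  # one or more "a"s

print(optional)      # B and Ba
print(any_number)    # B, Ba, and Baa
print(at_least_one)  # Ba and Baa, but not B
```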

Combining the dot with the plus or star solves our salutation problem. The regular expression Dear .+: will find any imaginable business-letter salutation.
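Here is the salutation search as a small sketch, again assuming Python's re module in place of an editor. The letter text is invented purely for illustration.

```python
import re

# Made-up sample text: two short letters, one line per sentence.
letter = (
    "Dear Ms. Smith:\n"
    "Thank you for your note.\n"
    "Dear Charles:\n"
    "The manuscript has arrived."
)

# "Dear", a space, one or more of any character, then a colon.
salutations = re.findall(r"Dear .+:", letter)
print(salutations)
```

Because the dot does not match a hard return, each match stays on its own line.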

But what if you actually want to look for a dot, a star, a question mark, or a plus? How can you find them, if they’ve got special meanings?

Any special regular-expression character loses its special meaning if there is a backslash (\) before it. So \. will find a real period, like the one at the end of this sentence. The backslash works on itself, too; to find a real backslash, put \\ in your search.
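A quick sketch of the backslash at work, assuming Python's re module (your editor's search box needs no such wrapping). The sample sentence is invented. The r before each quoted string tells Python to pass backslashes through to the pattern untouched.

```python
import re

# Made-up sample: a file path containing two backslashes and one period,
# followed by a sentence-ending period.
text = r"Save it as C:\notes\draft.txt today."

real_periods = re.findall(r"\.", text)      # backslashed dot: literal periods only
real_backslashes = re.findall(r"\\", text)  # doubled backslash: literal backslashes
any_chars = re.findall(r".", text)          # bare dot: every single character

print(len(real_periods), len(real_backslashes), len(any_chars))
```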

What we’ve learned so far:

Character  Regular-expression meaning
.          Any character, including space or tab
?          Zero or one of the preceding character
*          Zero or more of the preceding character
+          One or more of the preceding character
\          Negates the special meaning of the following character

Metacharacters

As you’ve learned, the backslash negates any special meaning that the character following it has to a regular expression. It has another function, too: it can turn ordinary characters into special ones.

Consider the tab. You don’t “see” it on the screen the way you see ordinary letters; you see what it does. If you turn on the show-invisibles function, however, you generally see an indication that there is a character there.

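Regular expressions let you access these “invisible” characters through backslash codes, which this guide calls “metacharacters” (other references call them escape sequences, and use “metacharacter” for special characters like the dot and star):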

Metacharacter  Meaning
\n             Newline (or paragraph mark, or however you think of it)
\t             Tab character
\s             Any “whitespace” character (tab, space, or newline)

For purposes of modifiers like star and plus, these metacharacters act like single characters. So \n+ finds one or more newlines.
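To see these metacharacters behave like single characters, here is a minimal sketch. It assumes Python's re module, and the sample text is invented.

```python
import re

# Made-up sample: three newlines in a row, plus one space and one tab.
text = "First paragraph.\n\n\nSecond\tparagraph."

# \n+ treats the whole run of newlines as one match, so it collapses to one.
collapsed = re.sub(r"\n+", "\n", text)

# \s matches the space, each newline, and the tab alike.
whitespace_count = len(re.findall(r"\s", text))

print(repr(collapsed))
print(whitespace_count)
```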

A special caution with BBEdit: Because of ancient OS wars, Macs and non-Macs have not always marked line endings the same way. Classic Mac files end each line with a carriage return (\r) rather than a line feed (\n). If a regular expression containing \n isn’t finding what you think it should, try replacing \n in your search pattern with \r.

Depending on your regular-expression engine or editing program, there may be other metacharacters available to you. Read the manual or help pages for details.

In addition, a few more special regular-expression characters provide useful functions. Remember that to look for the actual character, you must precede it with a backslash.

Character  Meaning
^          Beginning of a line
$          End of a line

So to find a tab at the beginning of a line (say, to tag paragraphs for HTML), look for ^\t.
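The paragraph-tagging idea can be sketched as follows, assuming Python's re module and an invented sample. One Python-specific wrinkle: ^ means “start of each line” only when the re.MULTILINE flag is given; without it, ^ matches only at the very start of the text.

```python
import re

# Made-up sample: two tab-indented paragraphs around an unindented heading.
text = "\tFirst paragraph.\nA heading\n\tSecond paragraph, with a\ttab inside."

# Replace a tab only where it begins a line; the tab in mid-line is left alone.
tagged = re.sub(r"^\t", "<p>", text, flags=re.MULTILINE)
print(tagged)
```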

The story so far:

Character(s)  Regular-expression meaning
.             Any character, including space or tab
?             Zero or one of the preceding character
*             Zero or more of the preceding character
+             One or more of the preceding character
\             Negates the special meaning of the following character
\n            Newline
\t            Tab character
\s            Any “whitespace” character (tab, space, or newline)
^             Beginning of a line
$             End of a line

Character classes

With the tricks you’ve learned so far, you can get a great deal done with regular expressions. Still, one key capability remains to be explored.

Remember the footnote-callout example? The pattern there is surely obvious: a number inside superscript tags. But using the dot in place of the number(s) means that you might catch superscripted letters (e.g. ordinals, like 4th) as well.

Fear not. Regular expressions allow you to specify exactly which characters will match in any given situation. Simply enclose them in square brackets. Yes, this means that square brackets are special characters, and need to be backslashed if you are looking for the actual character.

So to ensure that you only match a number between superscript tags, you can search for <sup>[0123456789]</sup>. This square-bracketed expression is called a “character class.”

For purposes of modifier characters such as star, plus, and question mark, a character class represents only one character. So to be sure you catch footnotes 10 and above as well as the one-digit ones, you can search for <sup>[0123456789]+</sup>.

One thing to note about character classes: most special characters are not special any more inside them! So there’s no need to backslash the dot, star, plus, or question mark inside a character class. The exceptions are the ] character (for obvious reasons), the backslash itself, and, in certain positions, the hyphen and the ^ character. Metacharacters can be included inside a character class, though; one way to reproduce the function of the metacharacter \s would be with the character class [ \t\n].

Right now, you’re probably saying “What a pain! I have to type all the digits in every character class where I want to use them!” No, you don’t. Character classes also allow you to specify character “ranges.” It’s even easy; just separate the beginning and end of the range with a hyphen. So you can find “one or more digits” by searching on [0-9]+, “one or more lowercase letters” with [a-z]+, and so on. So the easy way to find footnotes is <sup>[0-9]+</sup>.
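The three footnote searches discussed so far can be compared in a short sketch. It assumes Python's re module, and the HTML snippet is invented; note the superscripted ordinal, which none of the searches should catch.

```python
import re

# Made-up sample: footnotes 1 and 12, plus a superscripted ordinal ("4th").
html = "a claim<sup>1</sup> another<sup>12</sup> the 4<sup>th</sup> time"

one_digit = re.findall(r"<sup>[0123456789]</sup>", html)     # exactly one digit
many_digits = re.findall(r"<sup>[0123456789]+</sup>", html)  # one or more digits
with_range = re.findall(r"<sup>[0-9]+</sup>", html)          # same, using a range

print(one_digit)
print(many_digits)
print(with_range)
```

The single-digit class misses footnote 12; adding the plus catches it, and the range version finds exactly the same matches with less typing.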

You can combine ranges and single characters inside a character class however you like. To find lowercase letters and end-of-sentence punctuation characters, try [a-z.?!]. To include a hyphen in a character class, either backslash it or include it as the first character of the class, thus: [-0-9].
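Both classes from the paragraph above can be tried in a brief sketch, assuming Python's re module and an invented sample phrase. A plus is added after each class so that runs of matching characters come back as one piece.

```python
import re

# Made-up sample containing digits, a hyphen, lowercase letters, and punctuation.
text = "call 555-0199 now!"

# Hyphen first, so it is a literal hyphen; 0-9 is still a range.
digits_and_hyphens = re.findall(r"[-0-9]+", text)

# A range plus three single characters; the dot and question mark are literal here.
letters_and_punctuation = re.findall(r"[a-z.?!]+", text)

print(digits_and_hyphens)
print(letters_and_punctuation)
```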

Another truly cool thing about character classes is that they are negatable. Imagine that you need to find heads that are typed in all caps. You could try searching for ^[A-Z ]+$ (note that there’s a space inside that character class), but you’ll miss any head that happens to have a number or punctuation in it. What you’re really looking for is a line that doesn’t have any lowercase characters on it.

No problem. Outside a character class, the ^ character means the beginning of a line. When ^ is the first character inside a character class, however, it negates the class—that is, says to search for anything that is not specified in the character class. So to find those all-uppercase lines, try ^[^a-z]+$.
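The difference between the two all-caps searches shows up clearly in a small sketch. It assumes Python's re module, and the sample lines are invented. Each line is tested on its own, so ^ and $ mark the start and end of that line.

```python
import re

# Made-up sample lines: two all-caps heads (one with a digit and a colon)
# and two ordinary lines.
lines = [
    "CHAPTER 1: THE BEGINNING",
    "It was a dark night.",
    "A WORD ABOUT GREED",
    "Part Two",
]

# Capitals and spaces only: misses the head with a number and punctuation.
caps_only = [line for line in lines if re.search(r"^[A-Z ]+$", line)]

# Anything except lowercase letters: catches both heads.
no_lowercase = [line for line in lines if re.search(r"^[^a-z]+$", line)]

print(caps_only)
print(no_lowercase)
```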

Here’s another look at our famous table. This time, it has a new column, since some characters mean different things inside and outside character classes.

Character(s)  Meaning outside character class                                       Meaning inside character class
.             Any character, including space or tab                                 .
?             Zero or one of the preceding character                                ?
*             Zero or more of the preceding character                               *
+             One or more of the preceding character                                +
\             Negates special meaning of following character; marks metacharacters  Negates special meaning of following character; marks metacharacters
\n            Newline                                                               Newline
\t            Tab character                                                         Tab character
\s            Any “whitespace” character                                            Any “whitespace” character
^             Beginning of a line                                                   As first character, negates character class
$             End of a line                                                         $
[]            Mark a character class                                                n/a
-             -                                                                     Except as first character in class, mark a range of characters

A word about greed

One regular-expression quirk can trip you up badly if you’re not aware of it. It has to do with when a regular expression decides it’s found a match. Consider the marked-up text:

<i>Italic text.</i> Plain text. <i>More italic text.</i>

If you’re looking for stuff in italics, you’ll probably try to search on <i>.+</i>. You might think that this would get you each of the italic sections on the line above. But you’d be wrong. The search will match ONCE on the above line, and it will find the ENTIRE LINE. This is probably not what you want.

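By default, the star and plus grab as much text as they can while still allowing the overall pattern to match. (Many modern regular-expression engines let you turn this “feature” off, typically by adding a question mark after the star or plus, but not every program supports this; check your manual.) This is called “greed,” and it can cause all kinds of trouble.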

One way to sneak around it is using negation. For the above example, try searching on <i>[^<]+</i>. This will ensure that you get the matches you need.
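Greed and the negation workaround can be seen side by side in a minimal sketch, assuming Python's re module and using the marked-up line from above. The last search uses +?, a non-greedy plus that Python happens to support; treat it as an extra that your editor may or may not offer.

```python
import re

line = "<i>Italic text.</i> Plain text. <i>More italic text.</i>"

# Greedy: .+ runs to the LAST </i>, so there is one match covering the whole line.
greedy = re.findall(r"<i>.+</i>", line)

# Negation: [^<]+ cannot cross a "<", so each italic section matches separately.
negated = re.findall(r"<i>[^<]+</i>", line)

# Python-specific extra: the non-greedy +? stops at the FIRST </i> it can.
non_greedy = re.findall(r"<i>.+?</i>", line)

print(greedy)
print(negated)
print(non_greedy)
```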

Replacing what you find: parentheses

If you understand what you’ve read so far, you can find almost anything you want with a regular expression. By now, I’m sure one wrinkle has occurred to you: when you replace something you find, you often want to use part of what you found with the regular expression!

Remember the footnote example? To make it work, you need to be able to grab onto the number that the regular expression found, so that it goes into the replacement text.

To accomplish this, you need to do two things: identify each part of the regular-expression search that you want to preserve, and reference those parts on the replace line.

To preserve part of a regular expression for later, enclose it in parentheses. For the footnote example, instead of just searching on <sup>[0-9]+</sup>, you search on <sup>([0-9]+)</sup>. This tells the program to “remember” the number whenever it finds a match.

To pop something “remembered” into your replacement line, count your opening parentheses from the left, then use a backslash with the number of the appropriate parenthetical expression. If you’ve only got one set of parentheses, use \1.

So to find those footnote references and replace them with superscripted links, you would search on:

<sup>([0-9]+)</sup>

and replace with:

<sup><a href="notes.htm#nt\1">\1</a></sup>
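The complete footnote search-and-replace can be sketched in a few lines. This assumes Python's re module instead of an editor's replace dialog, and the sample sentence is invented; the search and replace patterns are the ones given above.

```python
import re

# Made-up sample with a one-digit and a two-digit footnote reference.
html = "Some claim<sup>1</sup> and another<sup>12</sup>."

search = r"<sup>([0-9]+)</sup>"                          # parentheses remember the number
replace = r'<sup><a href="notes.htm#nt\1">\1</a></sup>'  # \1 drops it back in, twice

linked = re.sub(search, replace, html)
print(linked)
```

Every reference in the text is converted in one pass, each with its own number in both the link target and the link text.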

Additional resources

The best book on regular expressions is Mastering Regular Expressions by Jeffrey Friedl. It’s published by O’Reilly and Associates, and just came out in a new edition.

The help files for both EditPlus and BBEdit contain a basic introduction to regular expressions. The manual for BBEdit has a longer discussion.