Project Advice

3. Regular expressions

3.1 Character classes

Character classes are a way of saying, “find any one of these”. The “any of these” part goes inside square brackets, so [abc123] finds any one of those six characters. If you want to find instances of the names Taylor and Tailor, the regular expression Ta[yi]lor will find both. Rather than type [abcdefghijklmnopqrstuvwxyz] you can give a range, such as [a-z] to find any lower-case letter, or [0-9] to find any digit.

An important feature of character classes is that they can be negated, with a caret (aka circumflex, or even hat) immediately after the opening square bracket: [^abc] means any one character that is not a, b, or c.
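As a rough illustration of the character classes described above, here is a minimal sketch using Python's standard re module. The sample text and variable names are invented for this example and are not part of the course material.

```python
import re

# Hypothetical sample text, invented for illustration.
text = "Taylor met Tailor and Tyler in room 42b."

# [yi] matches exactly one character: either y or i.
names = re.findall(r"Ta[yi]lor", text)

# [0-9] matches any single digit.
digits = re.findall(r"[0-9]", text)

# [^a-z] matches any single character that is NOT a lower-case letter.
not_lower = re.findall(r"[^a-z]", "ab1 C")

print(names)      # ['Taylor', 'Tailor']
print(digits)     # ['4', '2']
print(not_lower)  # ['1', ' ', 'C']
```

Note that Tyler is not found, because the class [yi] stands for a single character in a fixed position and the rest of the pattern must still match letter for letter.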

Quantifiers were mentioned in module 4, when discussing DTDs. They have exactly the same meaning in regular expressions as they do in a DTD, and each one applies to the single character (or character class) immediately before it. To recap:

*    One, many or none
+    One or many
?    One or none

They’re easier to follow with some examples. If you’re searching a text for the name Cook, you want to be sure that you match both Cook and Cooke. This is a situation for the ? quantifier:

Cooke?

That means Cook followed by one e or none, which is what we want.

If we also want to include the spelling Coke, this is where we might use the + quantifier:

Co+ke?

That means C, then one or more o’s, then k, followed by an optional e. As + matches one or many, it would also match Coook, or even Cooooooke, if they occur. It will also match Cok, because the e is optional. If you didn’t want Cok, you would have to exclude it by running the search twice with two slightly different expressions, for example Cooke? and then Coke.

Since * means one, none or many, the above search could have been done as:

Coo*ke?

That matches Co followed by zero, one or many further o’s, followed by k and an optional e.
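The three quantifier patterns above can be checked against a few spellings. This is a minimal sketch in Python; the candidate list and the helper full_matches are hypothetical names made up for this illustration. It uses re.fullmatch so that each word must match the pattern from start to end.

```python
import re

# Hypothetical list of spellings to test, invented for illustration.
candidates = ["Cook", "Cooke", "Coke", "Cok", "Coook", "Ck"]

def full_matches(pattern, words):
    """Return the words that match the pattern from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

# ? makes the final e optional.
with_question = full_matches(r"Cooke?", candidates)

# + requires at least one o between C and k.
with_plus = full_matches(r"Co+ke?", candidates)

# o followed by o* is the same as o+ (one o, then zero or more further o's).
with_star = full_matches(r"Coo*ke?", candidates)

print(with_question)  # ['Cook', 'Cooke']
print(with_plus)      # ['Cook', 'Cooke', 'Coke', 'Cok', 'Coook']
print(with_star)      # ['Cook', 'Cooke', 'Coke', 'Cok', 'Coook']
```

The last two results are identical, which is the equivalence described above, and Ck is rejected by both because at least one o is required.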

If you’re looking for all examples of an element in an XML text, you might not know whether it has an attribute or not. Here we’ll use the * quantifier, combined with a negated character class:

<head[^>]*>

This finds any head element, whether or not it has an attribute. What we are literally asking for is <head, followed by any number of characters that are not >, followed by >. Since we asked for none or many characters that aren’t >, we can match:

<head desc="main">

Here there are many characters after <head that are not >. But it also matches just

<head>

Here there is nothing at all between <head and the closing >.
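To see the head pattern at work, here is a minimal sketch in Python. The XML fragment is invented for illustration, and the variable names are assumptions of this example only.

```python
import re

# Hypothetical XML fragment, invented for illustration.
xml = '<div><head desc="main">Title</head><head>Other</head></div>'

# <head, then zero or more characters that are not >, then >.
tags = re.findall(r"<head[^>]*>", xml)
print(tags)  # ['<head desc="main">', '<head>']

# The pattern only checks that the tag starts with <head, so it would
# also accept a longer element name such as header.
also_header = re.search(r"<head[^>]*>", "<header>") is not None
print(also_header)  # True
```

The closing tags are not returned, because </head> begins with </ rather than <h. The second check shows a limit of this simple pattern: it is a quick search aid rather than an XML parser, so it is worth scanning the results for unintended element names.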