Project Advice

3. Regular expressions

If you are working with text, in almost any format, then learning regular expressions may well be the best investment of time you can make.

[Image: XKCD regular expressions superhero cartoon. Image from XKCD]

You may not save the day, but you’ll save yourself an enormous amount of time. We’ll focus on using regular expressions to edit XML, but they can be used almost anywhere that a large amount of text needs to be processed. To illustrate their time-saving power, suppose that we have an XML text of 20,000 paragraphs, each tagged with the p element, and we now need to number (or re-number) each paragraph. Doing this by hand would be tedious and error-prone, but a simple find-and-replace using a regular expression will do the job:

TextPad replace dialogue box

Find

<p>

Replace

<p n="\i">

This screenshot is from the popular text editor TextPad, which is free for evaluation purposes.
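For readers who would rather script the same job than use an editor, here is a minimal sketch in Python. It is an illustration, not the TextPad method shown above: the function name number_paragraphs and the sample text are made up for this example, and it assumes that paragraphs appear either as a bare <p> start tag or as a start tag whose only attribute is a numeric n.

```python
import itertools
import re


def number_paragraphs(xml_text: str, start: int = 1) -> str:
    """Give every <p> start tag a sequential n attribute.

    Assumption: a paragraph start tag is either <p> or <p n="digits">.
    Start tags carrying any other attributes are left untouched.
    """
    counter = itertools.count(start)
    # The optional group lets us re-number tags that were numbered before.
    pattern = re.compile(r'<p(?: n="\d+")?>')
    # The replacement function is called once per match, in document order.
    return pattern.sub(lambda match: f'<p n="{next(counter)}">', xml_text)


sample = "<body><p>First.</p><p>Second.</p><p>Third.</p></body>"
print(number_paragraphs(sample))
```

The pattern cannot match the end tag </p>, because there the < is followed by / rather than p. For real-world XML with mixed attributes or nested markup, an XML parser is usually the safer tool; the sketch only covers the simple case described here.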

Most text editors and XML editors, including free ones, support regular expressions; you’ll find the option in the find-and-replace box. The syntax is broadly the same from one program to the next, though details may vary slightly. The exception is Microsoft Word, which supports some regular-expression-like behaviour but calls it “pattern matching” and uses a completely different system.

The key parts of regular expressions are:

  • Character classes
  • Quantifiers
  • Special characters
  • Back references
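
As a rough illustration of how these four parts fit together, the sketch below uses Python's regular expression syntax, which may differ in small ways from your editor's. The function name dates_to_iso and the sample date are invented for this example; it assumes dates written as two digits, a full stop, two digits, a full stop and four digits.

```python
import re

# Character class: [0-9] matches any single digit.
# Quantifier: {2} and {4} say how many times the preceding item repeats.
# Special character: an unescaped . matches any character, so \. is used
# to match a literal full stop.
# Back references: each pair of parentheses captures a group, and \1, \2, \3
# in the replacement reuse whatever those groups matched.
DATE_PATTERN = re.compile(r"([0-9]{2})\.([0-9]{2})\.([0-9]{4})")


def dates_to_iso(text: str) -> str:
    """Rewrite dates such as 31.12.1999 as 1999-12-31."""
    return DATE_PATTERN.sub(r"\3-\2-\1", text)


print(dates_to_iso("<date>31.12.1999</date>"))
```

The pattern only checks the shape of the date, not whether the day and month are valid, so it is worth reviewing the results before saving them.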