Project Advice

3. Regular expressions

3.3 Back references

Back references are a very powerful part of regular expressions because they remember matches from your find expression and put them back where you specify in your replacement expression: this allows text to be moved around and copied.

The part to be remembered normally goes in round brackets, and the back reference is written $1, $2, etc., where the numbers refer to the order in which the sets of brackets appear in the find expression.

Find:

(Ebeneezer) (Scrooge)

Replace:

$2, $1

Gives you Scrooge, Ebeneezer.
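The same swap can be tried outside a text editor. The following is a minimal sketch in Python, which is an assumption here since the text above does not name a tool. Note that Python's re module writes back references in the replacement as \1, \2 (or \g<1>, \g<2>) rather than $1, $2; the variable names are illustrative only.

```python
import re

text = "Ebeneezer Scrooge"

# Two sets of round brackets, so two remembered groups.
# Python spells the back references \1 and \2 instead of $1 and $2.
swapped = re.sub(r"(Ebeneezer) (Scrooge)", r"\2, \1", text)

print(swapped)  # Scrooge, Ebeneezer
```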

A real-world example, which shows how you can begin to automate XML markup with regular expressions

Remember the Houndsditch example from the Parish Clerks' Memoranda? We kept the original spelling of Houndsditch but put the modern, standardised spelling into an attribute:

<place loc="Houndsditch, London">Hounsditch</place>

We could automate the markup of things like this, using regular expressions:

Find:

(Ho[^ ]+ch)

Replace:

<place loc="Houndsditch, London">$1</place>

The find part here is looking for Ho, followed by one or more characters other than a space (this ensures that we don’t match across multiple words), followed by ch. The brackets mean that everything found will be remembered. Then we use the back reference, $1, to replace the match with itself, this time with the requisite tagging placed around it.
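As a rough illustration of this find and replace, here is a short Python sketch. The sample sentence is invented for the example and is not taken from the Parish Clerks' Memoranda, and Python's \1 stands in for $1.

```python
import re

# Hypothetical sample text, invented for this illustration.
sample = "Buried a poor man from Hounsditch this day."

# Ho, then one or more non-space characters, then ch; the brackets remember the match.
pattern = re.compile(r"(Ho[^ ]+ch)")

# \1 puts the original spelling back, wrapped in the standardised markup.
tagged = pattern.sub(r'<place loc="Houndsditch, London">\1</place>', sample)

print(tagged)
```

The original spelling survives as the element content, while the attribute carries the modern form.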

The danger here, you may have spotted, is that we might be inadvertently matching things which are not Houndsditch but which fit the requested pattern, such as Hooch. We could narrow the expression by adding more letters, for example:

Find:

(Hou[^ ]+tch)

But then we run the opposite risk of not matching enough. The above expression would not find, say, Houndsdich or Hondsditch. The judgement is best made by the person who knows the data best, which on your project will be you, but in general it is better to match too much than too little, because false positives are easier to find than false negatives. You can always use your text editor to extract all of the matches and look through the list for false positives.
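One way to weigh the two expressions is to list what each one matches and compare. The sketch below does this in Python rather than in a text editor; the word list is made up from the spellings mentioned above and is not real project data.

```python
import re

# Hypothetical spellings, taken from the examples discussed above.
words = ["Hounsditch", "Houndsditch", "Houndsdich", "Hondsditch", "Hooch"]
text = " ".join(words)

# The broad expression and the narrowed one.
broad = re.findall(r"Ho[^ ]+ch", text)
narrow = re.findall(r"Hou[^ ]+tch", text)

# Anything the broad expression finds that the narrow one misses:
# a mix of false positives (Hooch) and genuine variants (Houndsdich, Hondsditch).
only_broad = [w for w in broad if w not in narrow]

print("broad: ", broad)
print("narrow:", narrow)
print("only in broad:", only_broad)
```

Reading through the "only in broad" list is the manual check for false positives; the variants it contains would have been lost silently with the narrower expression.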