My favourite regexp

Do you have a favourite regular expression? That might be a tricky question for some, like the benighted masses who haven’t yet heard the gospel of regular expressions. Or maybe you have so many dear to your heart that picking one is a real Sophie’s Choice? For me, it is easy: the first non-trivial one I wrote, for a task management system called TOM. Take a look and see if you can tell what it does. To help you out (?) I have left it in the context of the line of Perl it came from.

$string =~ s/(?=.{79,})((.{0,77}[\-,;:\/!?.\ \t])|(.{78}))/$1\r\n/g;

The answer is that it reformatted the input $string (called $email in the original code, but that was too much of a giveaway!) into hard-wrapped lines of no more than 78 characters, to build emails out of. Yeah, it used to be a real thing, a Campaign For Real Email (Remail?) arguing for pure email, unsullied by HTML. You already know who won that fight.

So the actual regexp is this: (?=.{79,})((.{0,77}[\-,;:\/!?.\ \t])|(.{78}))

(?=.{79,}) It looks ahead to see if there are 79 or more characters left on the current line (the dot does not match a newline). This is a zero-width positive lookahead assertion. It doesn’t “eat” any characters, it just kind of reads the future. It functions a bit like the test of an if statement.
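If you want to see the "doesn’t eat any characters" part for yourself, here is a minimal sketch. It is a Python port rather than the original Perl, on the assumption that Python's re module treats this lookahead the same way; the names LOOKAHEAD, long_line and short_line are made up for the illustration.

```python
import re

# The lookahead on its own: succeeds only if 79+ characters follow on this line.
LOOKAHEAD = re.compile(r"(?=.{79,})")

long_line = "x" * 79   # exactly 79 characters, so the assertion holds
short_line = "x" * 78  # one character short, so it fails

m = LOOKAHEAD.match(long_line)
print(m.span())                     # (0, 0): matched, but consumed nothing
print(LOOKAHEAD.match(short_line))  # None: fewer than 79 characters
```

The match on the long line starts and ends at position 0, which is what "zero-width" means: the test passed, and the rest of the pattern still gets to look at every character.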

.{0,77}[\-,;:\/!?.\ \t] If so, let’s remember as many of them as we can (up to 77) that are followed by some sort of symbol, such as a space, hyphen, or full stop etc., because that’s a great place to split the string and make a new line. Counting the symbol itself, that is at most 78 characters. It’s “as many of” because quantifiers in regexps are greedy. So it won’t match only 1 character if it could match 2, and not 76 if it could match 77.
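Greediness is easier to believe when you see it next to the alternative. The sketch below is again a Python port (assuming Python's re handles this character class like Perl does) and compares the greedy quantifier with a lazy one, written .{0,77}?, on a short sample line. The lazy variant is not part of the original regexp; it is only there for contrast.

```python
import re

# Greedy: take as many characters as possible before a break symbol.
GREEDY = re.compile(r".{0,77}[\-,;:\/!?.\ \t]")
# Lazy (for contrast only): take as few characters as possible.
LAZY = re.compile(r".{0,77}?[\-,;:\/!?.\ \t]")

line = "one two three four"

greedy = GREEDY.match(line)
lazy = LAZY.match(line)

print(repr(greedy.group()))  # 'one two three '  (stops after the LAST space)
print(repr(lazy.group()))    # 'one '            (stops after the FIRST space)
```

The greedy version is the one you want for wrapping, because it packs as many words as will fit onto each line before breaking.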

| …or…

.{78} If we couldn’t do that (there’s no good place to end a line in all of those 78 characters), well heck, just snap it off at character 78.

Of course, if there are fewer than 79 characters remaining to look at, nothing happens: no match is made, so no more newline characters are added, which is as it should be.
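To watch the whole thing at work, here is a rough Python port of the Perl substitution. It is a sketch, not the original TOM code: the function name hard_wrap and the sample strings are invented, and it assumes Python's re.sub behaves like Perl's s///g for this pattern.

```python
import re

# Same pattern as the Perl line: lookahead, then "break at a symbol" or "snap at 78".
WRAP = re.compile(r"(?=.{79,})((.{0,77}[\-,;:\/!?.\ \t])|(.{78}))")

def hard_wrap(text):
    """Insert CRLF so that no line runs past 78 characters."""
    # Equivalent of the Perl replacement $1\r\n, applied to every match.
    return WRAP.sub(lambda m: m.group(1) + "\r\n", text)

short = "Nothing to do here."
words = "word " * 30   # 150 characters with plenty of spaces to break at
solid = "x" * 100      # 100 characters and nowhere nice to break

print(hard_wrap(short) == short)                          # True: left untouched
print([len(l) for l in hard_wrap(words).split("\r\n")])   # [75, 75]
print([len(l) for l in hard_wrap(solid).split("\r\n")])   # [78, 22]
```

The wordy input breaks after a space, the solid run of characters gets snapped off at 78, and the short line passes through unchanged because the lookahead never succeeds.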
