How to Write Regular Expressions for Pattern Matching

Commands, Syntax, and Examples

The regular expression syntax allows you to do more complex searches than the simple "Find" searches typically available in software applications. For example, if you are searching for two or more words in a particular order that may have other words between them, you cannot use the ordinary Find function, which only lets you find occurrences of a fixed string.

The so-called wildcard characters allow you to perform a rudimentary type of pattern matching. For example, at the Bash shell prompt you can type "ls *.txt", which will list all files whose names end with ".txt", because the "*" matches any string of characters.

Regular expressions expand on this idea and enable you to specify almost any imaginable constraints between a sequence of characters.

Regular expressions are used in various commands (such as awk and sed), software utilities, and programming languages. The syntax and functionality may vary, but the basic concepts are the same. The programming language Perl provides one of the most powerful implementations of regular expressions.

Let's start by reviewing regular expressions in the GNU/Linux utility grep, which retrieves those lines from a file that match the specified regular expression.

Let's say you have a list of persons specified by their first, middle, and last names, and you want to find all individuals whose first name is "John" and whose last name is "Travolta", with any middle name. For this task you could use the following regular expression as the search string:

John .* Travolta

The period matches any single character, and the '*' (star) means: match the preceding character (in this case the '.') zero or more times, as many as necessary to make the regular expression match the line. If the star follows an expression enclosed in parentheses, that expression is matched as many times as necessary (in grep's basic syntax the grouping parentheses are written '\(' and '\)').

If the file name is guests.txt, the corresponding grep command would look like this:

grep 'John .* Travolta' guests.txt
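To try the command above without an existing guest list, here is a minimal sketch that builds a throwaway guests.txt first. The names in the file and the `workdir` and `matches` variables are invented for this illustration.

```shell
# Hypothetical sample data; the names are made up for this example.
workdir=$(mktemp -d)
printf '%s\n' \
  'John Joseph Travolta' \
  'John Q. Public' \
  'Ellen Travolta' \
  'John J Travolta' \
  'John Travolta' > "$workdir/guests.txt"

# Keep lines where "John " is later followed by " Travolta".
matches=$(grep 'John .* Travolta' "$workdir/guests.txt")
printf '%s\n' "$matches"
# Prints:
#   John Joseph Travolta
#   John J Travolta

rm -rf "$workdir"
```

Note that the line "John Travolta" is not printed: the '.*' may match nothing at all, but the pattern still requires one space after "John" and another before "Travolta".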

If only single letters are given as middle names you could use this command:

grep 'John . Travolta' guests.txt

where the '.' stands for any character.

If the middle initial is followed by an actual period, you need to put a backslash in front of the period so that it is treated as a literal character rather than with its special regular expression meaning:

grep 'John .\. Travolta' guests.txt
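The difference between the plain period and the escaped period can be seen side by side in the following sketch. The file contents and the variable names `any_char` and `with_period` are assumptions made for the example.

```shell
# Hypothetical sample data for comparing '.' with '\.'.
workdir=$(mktemp -d)
printf '%s\n' \
  'John J Travolta' \
  'John J. Travolta' \
  'John Joseph Travolta' > "$workdir/guests.txt"

# '.' matches exactly one character of any kind.
any_char=$(grep 'John . Travolta' "$workdir/guests.txt")
echo "$any_char"       # John J Travolta

# '.\.' matches one character followed by a literal period.
with_period=$(grep 'John .\. Travolta' "$workdir/guests.txt")
echo "$with_period"    # John J. Travolta

rm -rf "$workdir"
```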

The single quotes around the regular expression are used to make sure that the shell doesn't try to interpret any special characters inside the regular expression.

The characters '^' (caret) and '$' (dollar) can be used to specify the location of the start or end of the line in the regular expression. For example,

grep '^John' guests.txt

means that the lines to be extracted have to start with "John". Similarly,

grep 'Smith$' guests.txt

means that the line has to end with "Smith".
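A short sketch of both anchors on a small, made-up guest list follows. The names and the variables `starts` and `ends` are hypothetical.

```shell
# Hypothetical sample data; "Johnson" and "Smith Jones" are included
# to show that the anchors care about position, not just presence.
workdir=$(mktemp -d)
printf '%s\n' \
  'John Smith' \
  'Mary Johnson' \
  'Smith Jones' \
  'Anna Smith' > "$workdir/guests.txt"

# '^' anchors the match to the start of the line.
starts=$(grep '^John' "$workdir/guests.txt")
echo "$starts"                 # John Smith

# '$' anchors the match to the end of the line.
ends=$(grep 'Smith$' "$workdir/guests.txt")
printf '%s\n' "$ends"
# Prints:
#   John Smith
#   Anna Smith

rm -rf "$workdir"
```

"Mary Johnson" contains "John" but not at the start of the line, and "Smith Jones" contains "Smith" but not at the end, so neither is selected by the respective command.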

As mentioned above, you can use the backslash '\' to prevent any such special character from being interpreted as such. For example, if you want the '$' to match a '$' in the input file, you would write it as '\$' in the regular expression.
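As an illustration, the following sketch contrasts '$' used as the end-of-line anchor with '\$' used as a literal dollar sign. The file name payments.txt and its contents are invented for the example.

```shell
# Hypothetical sample data: one line ends in a dollar sign, one does not.
workdir=$(mktemp -d)
printf '%s\n' 'paid 15$' 'paid 15' > "$workdir/payments.txt"

# Unescaped '$' at the end of the pattern means "end of line".
anchored=$(grep '15$' "$workdir/payments.txt")
echo "$anchored"    # paid 15

# Escaped '\$' matches an actual dollar character.
literal=$(grep '15\$' "$workdir/payments.txt")
echo "$literal"     # paid 15$

rm -rf "$workdir"
```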

Another useful syntactic element is the pair of square brackets '[' and ']', which can be used to list a specific set of characters, such that any of the characters in this set may match a character in the line being matched. For example, if you are searching for the name "Berkeley", but you are not sure if it is spelled with a "k" or a "c", you can use the regular expression

Ber[kc]eley
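Passed to grep, this expression could be used as in the sketch below. The file places.txt and the spellings in it are assumptions made for the illustration.

```shell
# Hypothetical sample data with several spellings.
workdir=$(mktemp -d)
printf '%s\n' 'Berkeley' 'Berceley' 'Berseley' 'Berkley' > "$workdir/places.txt"

# [kc] matches exactly one character, which must be 'k' or 'c'.
matches=$(grep 'Ber[kc]eley' "$workdir/places.txt")
printf '%s\n' "$matches"
# Prints:
#   Berkeley
#   Berceley

rm -rf "$workdir"
```

"Berseley" is rejected because 's' is not in the set, and "Berkley" is rejected because the 'k' is not followed by "eley".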

Some versions of grep, such as egrep (also available as "grep -E"), can process disjunctions (the logical 'or'). For example,

egrep 'Br(ow|au)n' guests.txt

will retrieve any lines that contain either "Brown" or "Braun". The vertical bar '|' and parentheses are used to list the alternative substrings. Note that the square bracket notation mentioned above is effectively a disjunction of the characters listed inside the square brackets.
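The alternation can be tried with the following sketch. It uses "grep -E", which is the same extended syntax that egrep accepts; the guest names are invented for the example.

```shell
# Hypothetical sample data with similar-looking surnames.
workdir=$(mktemp -d)
printf '%s\n' \
  'Charlie Brown' \
  'Anna Braun' \
  'Tom Brawn' \
  'Lee Brun' > "$workdir/guests.txt"

# (ow|au) matches either the substring "ow" or the substring "au".
matches=$(grep -E 'Br(ow|au)n' "$workdir/guests.txt")
printf '%s\n' "$matches"
# Prints:
#   Charlie Brown
#   Anna Braun

rm -rf "$workdir"
```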

Finally, instead of the star '*' you can use the notation {x,y} to specify how many times an expression may be repeated. For example, {3,7} means that the preceding character may occur 3 to 7 times in a row. In grep's basic syntax the curly brackets '{' and '}' need to be escaped with the backslash '\' (with egrep they are written without the backslash). For example,

grep 'Br\{3,7\}' file1

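matches any line that contains a 'B' followed by 3 to 7 'r's. Because the pattern is not anchored, a line with more than 7 'r's in a row still matches; the upper limit only restricts how much of the line the match itself covers.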
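The following sketch shows the repetition count in action, including what happens when the whole line has to match. The contents of file1 are made up, and the -x option (match the entire line) is added here only for illustration.

```shell
# Hypothetical sample data: 'B' followed by 2, 3, 5, and more than 7 'r's.
workdir=$(mktemp -d)
printf '%s\n' 'Brr' 'Brrr' 'Brrrrr' 'Brrrrrrrrrr' > "$workdir/file1"

# Basic syntax: the braces are escaped. The match may sit anywhere in the
# line, so the longest line is also selected.
unanchored=$(grep 'Br\{3,7\}' "$workdir/file1")
printf '%s\n' "$unanchored"
# Prints:
#   Brrr
#   Brrrrr
#   Brrrrrrrrrr

# With -x the whole line must match, so the upper limit of 7 applies.
whole_line=$(grep -x 'Br\{3,7\}' "$workdir/file1")
printf '%s\n' "$whole_line"
# Prints:
#   Brrr
#   Brrrrr

# Extended syntax (grep -E): the same pattern without backslashes.
extended=$(grep -E 'Br{3,7}' "$workdir/file1")

rm -rf "$workdir"
```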

©2014 About.com. All rights reserved.