Regular expressions (regexpr)
Regular expressions are used for "pattern" matching in search, replace,
etc. They are often used with utilities (e.g., grep, sed) and programming
languages (e.g., perl). The file-name patterns (wildcards) used with shell
commands such as dir are a slightly modified flavour of regular
expressions (the two main differences are noted below). This brief writeup
covers almost all the features of standard regular expressions, which are
not as complicated as they might seem at first. They are definitely worth
a closer look.
In regular expressions, most characters just match themselves. So to
search for the string "peter", I would just use the search string "peter".
The exceptions are the so-called "special characters" ("metacharacters"),
which have a special meaning.
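A minimal sketch of literal matching, assuming a POSIX shell and grep. The sample lines and the variable names (sample, literal_matches) are made up for illustration.

```shell
# Hypothetical sample input: two of the three lines contain "peter".
sample='peter piper
saltpeter
paul'

# Ordinary characters match themselves, anywhere in the line.
literal_matches=$(printf '%s\n' "$sample" | grep 'peter')
printf '%s\n' "$literal_matches"
```

Note that grep prints every line that contains the pattern, so "saltpeter" is reported as well.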
The regexpr special characters are: "\" (backslash), "." (dot), "*"
(asterisk), "[" (bracket), "^" (caret, special only at the beginning of a
string), "$" (dollar sign, special only at the end of a string). A
character that delimits the pattern string (for example, "/" in a sed
command) is also special within that string.
The backslash, "\", is used as an "escape" character, i.e., to quote a
subsequent special character.
Thus, "\\" searches for a backslash, "\." searches for a dot, "\*" searches
for the asterisk, "\[" searches for the bracket, "\^" searches for the
caret even at the beginning of the string, "\$" searches for the dollar
sign even at the end of the string.
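To see what escaping changes, the sketch below compares an escaped dot with an unescaped one. It assumes a POSIX shell and grep; the sample lines and variable names are invented for this example.

```shell
# Hypothetical sample input.
sample='version 1.5
version 125
price: $5'

# Escaped dot: only a literal "." between 1 and 5 matches.
escaped_matches=$(printf '%s\n' "$sample" | grep '1\.5')

# Unescaped dot: any single character between 1 and 5 matches,
# so "125" is found too.
unescaped_matches=$(printf '%s\n' "$sample" | grep '1.5')

printf 'escaped:\n%s\n' "$escaped_matches"
printf 'unescaped:\n%s\n' "$unescaped_matches"
```

The single quotes around the pattern keep the shell from interpreting the backslash itself.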
Backslash
followed by a regular (non-special) character may gain a special meaning.
Thus, the symbols \< and \> match an
empty string at the beginning and the end of a word, respectively.
The symbol \b matches the empty string
at the edge of a word, and \B matches the empty string provided it's
not at the edge of a word.
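A short sketch of the word-edge symbols, assuming GNU grep (these backslash sequences are extensions and may not be available in every grep). The sample words and variable names are made up.

```shell
# Hypothetical sample input.
sample='cat
concatenate
the cat sat
category'

# \< and \> anchor the match to the start and end of a word.
whole_word=$(printf '%s\n' "$sample" | grep '\<cat\>')

# \B on both sides: "cat" must sit strictly inside a longer word.
inside_word=$(printf '%s\n' "$sample" | grep '\Bcat\B')

printf 'whole word:\n%s\n' "$whole_word"
printf 'inside a word:\n%s\n' "$inside_word"
```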
The dot, ".",
matches any single character. [The dir command uses "?" in this
place.] Thus, "m.a" matches "mpa" and "mea" but not "ma" or "mppa".
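The four strings from the paragraph above can be checked directly. This is a minimal sketch assuming a POSIX shell and grep; the variable names are illustrative.

```shell
# The four candidate strings mentioned in the text, one per line.
sample='mpa
mea
ma
mppa'

# "." stands for exactly one character between "m" and "a".
dot_matches=$(printf '%s\n' "$sample" | grep 'm.a')
printf '%s\n' "$dot_matches"
```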
Any string is matched by ".*" (dot and asterisk). [The dir command uses
"*" instead.] In general, any pattern followed by "*" matches zero or more
occurrences of this pattern. Thus, "m*" matches zero or more occurrences
of "m". To search for one or more "m", I could use "mm*".
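The difference between "m*" and "mm*" shows up when counting matching lines. A minimal sketch, assuming a POSIX shell and grep, with made-up sample lines and variable names:

```shell
# Hypothetical sample input.
sample='sum
summer
sun'

# "m*" also matches zero occurrences, so every line matches;
# -c prints the number of matching lines.
zero_or_more_count=$(printf '%s\n' "$sample" | grep -c 'm*')

# "mm*" needs at least one "m".
one_or_more=$(printf '%s\n' "$sample" | grep 'mm*')

printf 'lines matching m*: %s\n' "$zero_or_more_count"
printf 'lines matching mm*:\n%s\n' "$one_or_more"
```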
The * is a repetition operator. Other repetition operators are used less
often. Here is the full list:
*        the preceding item is to be matched zero or more times;
\+       the preceding item is to be matched one or more times;
\?       the preceding item is optional and matched at most once;
\{n\}    the preceding item is to be matched exactly n times;
\{n,\}   the preceding item is to be matched n or more times;
\{n,m\}  the preceding item is to be matched at least n times, but not
         more than m times.
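The repetition operators can be tried side by side as sketched below. The sketch assumes GNU grep, since \+ and \? are GNU extensions to basic regular expressions; the counted forms are written with the closing brace escaped as well (\{2\}), which is what grep's basic syntax expects. The -x option makes grep match whole lines only. Sample lines and variable names are invented.

```shell
# Hypothetical sample input: zero to three "a" characters before "b".
sample='b
ab
aab
aaab'

exactly_two=$(printf '%s\n' "$sample" | grep -x 'a\{2\}b')
two_to_three=$(printf '%s\n' "$sample" | grep -x 'a\{2,3\}b')
one_or_more=$(printf '%s\n' "$sample" | grep -x 'a\+b')   # GNU extension
optional=$(printf '%s\n' "$sample" | grep -x 'a\?b')      # GNU extension

printf 'exactly two:\n%s\n' "$exactly_two"
printf 'two to three:\n%s\n' "$two_to_three"
printf 'one or more:\n%s\n' "$one_or_more"
printf 'optional:\n%s\n' "$optional"
```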
The caret, "^", means "the beginning of the line". So "^a" means "find a
line starting with an 'a'".
The dollar sign, "$", means "the end of the line". So "a$" means "find a
line ending with an 'a'".
Example.
This command searches the file myfile for lines starting
with an "s" and ending with an "n", and prints them to the standard
output (screen):
cat myfile | grep '^s.*n$'
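The same pattern can be tried without a file by feeding grep a few lines directly. A minimal sketch, assuming a POSIX shell and grep; the sample words stand in for the contents of myfile and are made up.

```shell
# Hypothetical stand-in for the contents of myfile.
sample='sun
season
snow
moon'

# ^s anchors "s" to the line start, n$ anchors "n" to the line end,
# and .* allows anything in between.
anchored_matches=$(printf '%s\n' "$sample" | grep '^s.*n$')
printf '%s\n' "$anchored_matches"
```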
Any character that terminates the pattern string is special. Precede it
with a backslash if you want to use it within this string.
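One common place where this matters is sed, whose substitute command is usually delimited by "/". The sketch below assumes a POSIX shell and sed; the path and variable names are hypothetical.

```shell
# Hypothetical input containing the delimiter character "/".
path='/usr/bin'

# "/" delimits the pattern, so a literal "/" must be written as "\/".
escaped=$(printf '%s\n' "$path" | sed 's/\/usr/\/opt/')

# Alternative: choose another delimiter so "/" needs no escaping.
alternate=$(printf '%s\n' "$path" | sed 's|/usr|/opt|')

printf '%s\n%s\n' "$escaped" "$alternate"
```

Both commands produce the same result; picking a different delimiter usually reads better when the pattern is full of slashes.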
The bracket,
"[" introduces a set. Thus [abD] means: either a or b or D. [a-zA-C]
means any character from a to z or from A to C.
Take care with some characters inside sets. Within a set, the only special
characters are "[", "]", "-", and "^", and the combinations "[:", "[=",
and "[.". The backslash is not special within a set.
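A few set patterns, sketched under the assumption of a POSIX shell and grep. The sample lines and variable names are invented for illustration.

```shell
# Hypothetical sample input; the last line contains a backslash.
sample='bat
cat
Dat
rat
a\b'

# A set matches exactly one of the listed characters.
set_matches=$(printf '%s\n' "$sample" | grep '^[abD]at')

# A caret right after "[" negates the set: any character except b or c.
negated_matches=$(printf '%s\n' "$sample" | grep '^[^bc]at')

# Inside a set the backslash is an ordinary character.
backslash_matches=$(printf '%s\n' "$sample" | grep '[\]')

printf 'set:\n%s\n' "$set_matches"
printf 'negated:\n%s\n' "$negated_matches"
printf 'backslash:\n%s\n' "$backslash_matches"
```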
Useful categories of characters are (as defined by the POSIX standard):
[:upper:] = upper-case letters, [:lower:] = lower-case letters,
[:alpha:] = alphabetic (letters) meaning upper+lower, [:digit:] = 0 to 9,
[:alnum:] = alphanumeric meaning alpha+digits, [:space:] = whitespace
meaning <Space>+<Tab>+<Newline> and similar, [:graph:] = graphically
printable characters except space, [:print:] = printable characters
including space, [:punct:] = punctuation characters meaning graphical
characters minus alpha and digits, [:cntrl:] = control characters meaning
non-printable characters, [:xdigit:] = characters that are hexadecimal
digits.
Example.
This command scans the output of the dir command, and prints
lines containing a capital letter followed by a digit:
dir -l | grep '[[:upper:]][[:digit:]]'
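To try the same pattern without depending on what is in the current directory, a few made-up file names can be piped into grep instead. A minimal sketch, assuming a POSIX shell and grep; the names and variables are hypothetical.

```shell
# Hypothetical stand-in for a directory listing.
sample='report A1.txt
notes.txt
B-2.log
x9'

# The class names go inside a set, hence the doubled brackets.
class_matches=$(printf '%s\n' "$sample" | grep '[[:upper:]][[:digit:]]')
printf '%s\n' "$class_matches"
```

Only the first line has a capital letter immediately followed by a digit; in "B-2.log" a hyphen separates them, and "x9" starts with a lower-case letter.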