Regular Expression

From TBP Wiki

A regular expression, regex, or regexp (sometimes called a rational expression) is a sequence of characters that defines a search pattern. String-searching algorithms typically use such patterns for "find" or "find and replace" operations on strings, or for input validation. The technique was developed in theoretical computer science and formal language theory.

The concept arose in the 1950s when the American mathematician Stephen Cole Kleene formalized the description of a regular language. The concept came into common use with Unix text-processing utilities. Different syntaxes for writing regular expressions have existed since the 1980s, one being the POSIX standard and another, widely used, being the Perl syntax.

Regular expressions are used in search engines, in the search and replace dialogs of word processors and text editors, in text-processing utilities such as sed and AWK, and in lexical analysis. Many programming languages provide regex capabilities, either built in or via libraries.

Bracket expressions

Characters enclosed in square brackets "[]" constitute a bracket expression, which matches any one character listed within the brackets. For example, the regular expression b[aeiou]g matches the words bag, beg, big, bog, and bug.
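The bracket expression above can be tried out with a short Python sketch. The use of Python's standard re module and the extra test words "byg" and "baag" are assumptions made for illustration; they are not part of the original article.

```python
import re

# Bracket expression: exactly one of a, e, i, o, u between "b" and "g".
pattern = re.compile(r"b[aeiou]g")

# "byg" and "baag" are hypothetical non-matching words added for contrast.
words = ["bag", "beg", "big", "bog", "bug", "byg", "baag"]

# fullmatch requires the whole word to match the pattern.
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)
```

Only the five words with a single vowel between "b" and "g" are kept, because a bracket expression stands for one character, not several.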

Range expressions

A range expression is a variation of the bracket expression. Instead of listing every character that matches, a range expression lists the beginning and end points separated by a hyphen "-", as in a[3-5]z. This matches a3z, a4z, and a5z.
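As a minimal sketch of a range expression, the following Python snippet uses the range [3-5], which covers the digits 3, 4, and 5 inclusive. The candidate strings and the choice of Python's re module are assumptions made for illustration.

```python
import re

# Range expression: one digit from 3 through 5 (inclusive) between "a" and "z".
pattern = re.compile(r"a[3-5]z")

# Hypothetical candidates, including digits just outside the range.
candidates = ["a2z", "a3z", "a4z", "a5z", "a6z"]

matches = [c for c in candidates if pattern.fullmatch(c)]
print(matches)
```

Both end points of the range are included, and anything outside them (here a2z and a6z) is rejected.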

Single characters

A period "." represents any single character except a newline. The expression a.z matches a7z, adz, aRz, or any other three-character string that begins with "a" and ends with "z".
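The behaviour of the period can be checked with a small Python sketch. It assumes Python's re module with default flags, under which the period does not match a newline; the non-matching candidates are hypothetical additions.

```python
import re

# The period stands for exactly one character (any except a newline by default).
pattern = re.compile(r"a.z")

# "az" has no middle character, "a\nz" has a newline, "abcz" has too many.
candidates = ["a7z", "adz", "aRz", "az", "a\nz", "abcz"]

matches = [c for c in candidates if pattern.fullmatch(c)]
print(matches)
```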

Start and end of line

The caret "^" matches the start of a line and the dollar sign "$" matches the end of a line.
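A brief Python sketch of the two anchors follows. The sample lines are hypothetical, and each list entry is treated as one line; Python's re module is assumed.

```python
import re

# Hypothetical sample lines; each string is treated as a single line.
lines = ["cat nap", "tomcat", "cat"]

# ^ anchors the pattern to the start of the line.
starts_with_cat = [l for l in lines if re.search(r"^cat", l)]

# $ anchors the pattern to the end of the line.
ends_with_cat = [l for l in lines if re.search(r"cat$", l)]

# Both anchors together require the line to be exactly "cat".
only_cat = [l for l in lines if re.search(r"^cat$", l)]

print(starts_with_cat, ends_with_cat, only_cat)
```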

Repetition operators

A full or partial regular expression can be followed by a special symbol that specifies how many times the preceding item must occur. The asterisk "*" matches zero or more occurrences, the plus sign "+" matches one or more, and the question mark "?" matches zero or one. An asterisk is often combined with the period, as ".*", to match any sequence of characters within a string.
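The three repetition operators can be compared side by side in a short Python sketch. The patterns ab*c, ab+c, and ab?c and the sample strings are assumptions chosen for illustration.

```python
import re

# Hypothetical candidates with zero, one, and two "b" characters.
candidates = ["ac", "abc", "abbc"]

# * : zero or more "b"
star = [c for c in candidates if re.fullmatch(r"ab*c", c)]

# + : one or more "b"
plus = [c for c in candidates if re.fullmatch(r"ab+c", c)]

# ? : zero or one "b"
optional = [c for c in candidates if re.fullmatch(r"ab?c", c)]

# .* : any run of characters between "a" and "c"
wildcard = re.search(r"a.*c", "xx a--b--c yy").group()

print(star, plus, optional, wildcard)
```

Each operator applies only to the item immediately before it, which here is the single character "b" (or the period in the last pattern).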

Multiple strings

The pipe "|" separates alternative matches. The expression cat|dog matches either cat or dog.
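A minimal Python sketch of alternation, assuming Python's re module and a made-up sample sentence:

```python
import re

# Alternation: match either "cat" or "dog".
pattern = re.compile(r"cat|dog")

# findall returns every non-overlapping match, in order of appearance.
found = pattern.findall("The cat chased the dog past a bird.")
print(found)
```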

Parentheses

Parentheses "()" surround sub-expressions and are generally used to specify how operators are applied. Surrounding an alternation with parentheses treats the alternatives as a single group, so that (cat|dog)s matches cats or dogs, whereas cat|dogs matches cat or dogs.
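The effect of grouping can be seen by comparing a parenthesized alternation with an unparenthesized one. This Python sketch assumes the re module; the trailing "s" and the word list are hypothetical, added only to make the difference visible.

```python
import re

# With parentheses, the trailing "s" applies to both alternatives.
grouped = re.compile(r"(cat|dog)s")

# Without them, the alternatives are "cat" and "dogs".
ungrouped = re.compile(r"cat|dogs")

words = ["cats", "dogs", "cat", "dog"]

grouped_matches = [w for w in words if grouped.fullmatch(w)]
ungrouped_matches = [w for w in words if ungrouped.fullmatch(w)]

print(grouped_matches, ungrouped_matches)
```

The pipe has low precedence, so without parentheses it splits the whole expression rather than just the neighbouring words.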

Escaping

To match a special character literally, such as the period, escape it with a backslash "\". For example, to match the hostname ns1.domain.com exactly, escape the periods: ns1\.domain\.com. Left unescaped, each period would match any character.
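The difference between escaped and unescaped periods can be sketched in Python as follows. The look-alike hostnames are hypothetical, and re.escape is a helper from Python's standard re module that adds the backslashes automatically; neither appears in the original article.

```python
import re

# Unescaped periods match any character, so look-alike strings slip through.
unescaped = re.compile(r"ns1.domain.com")

# Escaped periods match only a literal ".".
escaped = re.compile(r"ns1\.domain\.com")

# Hypothetical inputs: the real hostname and two look-alikes.
hosts = ["ns1.domain.com", "ns1xdomain.com", "ns1-domain-com"]

unescaped_matches = [h for h in hosts if unescaped.fullmatch(h)]
escaped_matches = [h for h in hosts if escaped.fullmatch(h)]

# re.escape builds the escaped pattern from a literal string.
auto_escaped = re.escape("ns1.domain.com")

print(unescaped_matches, escaped_matches, auto_escaped)
```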