Regular Expression - Documentation

A regular expression is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

Metacharacters

Character What does it do?
\
  • Used to indicate that the next character should NOT be interpreted literally. For example, the character 'w' by itself will be interpreted as 'match the character w', but using '\w' signifies 'match an alpha-numeric character including underscore'.
  • Used to indicate that a metacharacter is to be interpreted literally. For example, the '.' metacharacter means 'match any single character but a new line', but if we would rather match a dot character instead, we would use '\.'.
^
  • Matches the beginning of the input. In multiline mode, it also matches immediately after each line break character, that is, at the start of every line.
  • When used at the start of a set pattern ([^abc]), it negates the set: it matches any character not enclosed in the brackets.
$ Matches the end of the input. In multiline mode, it also matches immediately before each line break character, that is, at the end of every line.
* Matches the preceding character 0 or more times.
+ Matches the preceding character 1 or more times.
?
  • Matches the preceding character 0 or 1 time.
  • When used after the quantifiers *, +, ? or {}, makes the quantifier non-greedy; it will match the minimum number of times as opposed to matching the maximum number of times.
. Matches any single character except the newline character.
(x) Matches 'x' and remembers the match. Also known as capturing parentheses.
(?:x) Matches 'x' but does NOT remember the match. Also known as NON-capturing parentheses.
x(?=y) Matches 'x' only if 'x' is followed by 'y'. Also known as a lookahead.
x(?!y) Matches 'x' only if 'x' is NOT followed by 'y'. Also known as a negative lookahead.
x|y Matches 'x' OR 'y'.
{n} Matches the preceding character exactly n times.
{n,m} Matches the preceding character at least n times and at most m times. m can be omitted ({n,}) to match at least n times with no upper limit; write {0,m} to match at most m times.
[abc] Matches any of the enclosed characters. Also known as a character set. You can create a range of characters using the hyphen character, such as A-Z (A to Z). Note that in character sets, special characters such as ., * and + do not have any special meaning.
[^abc] Matches anything NOT enclosed by the brackets. Also known as a negative character set.
[\b] Matches a backspace.
\b Matches a word boundary: a position where a word character is NOT followed or NOT preceded by another word character.
\B Matches a NON-word boundary: a position where the two adjacent characters are both word characters or both non-word characters.
\cX Matches a control character. X must be a letter from A to Z.
\d Matches a digit character. Same as [0-9] or [0123456789].
\D Matches a NON-digit character. Same as [^0-9] or [^0123456789].
\f Matches a form feed.
\n Matches a line feed.
\r Matches a carriage return.
\s Matches a single white space character, such as space, tab, form feed, line feed or carriage return.
\S Matches any single character that is NOT white space (see \s).
\t Matches a tab.
\v Matches a vertical tab.
\w Matches any alphanumeric character including underscore. Equivalent to [A-Za-z0-9_].
\W Matches anything OTHER than an alphanumeric character including underscore. Equivalent to [^A-Za-z0-9_].
\x A back reference to the substring matched by the x-th capturing parenthesis, where x is a positive integer (for example, \1 refers to the first group).
\0 Matches a NULL character.
\xhh Matches the character with the two-digit hexadecimal code hh.
\uhhhh Matches the character with the four-digit hexadecimal code hhhh.
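
The lookahead, back reference and non-greedy entries above can be tried out with a short Perl sketch (Perl is the language used in the Examples section of this document). The sample strings and variable names are made up for illustration and are not part of the original reference.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $prices = "100USD, 200EUR, 300USD";

# x(?=y): digits only where "USD" follows; "USD" itself is not consumed.
my @usd = $prices =~ /(\d+)(?=USD)/g;

# x(?!y): digits followed by neither "USD" nor another digit.
# The extra \d stops the engine from settling for a shorter run such as "10".
my @not_usd = $prices =~ /(\d+)(?!USD|\d)/g;

# Back reference: \1 must repeat whatever group 1 captured.
my ($letter) = "Hello World" =~ /(\w)\1/;

# Greedy vs. non-greedy: .+ takes as much as it can, .+? as little as it can.
my $tag = "<b>bold</b>";
my ($greedy) = $tag =~ /(<.+>)/;
my ($lazy)   = $tag =~ /(<.+?>)/;

print "USD amounts: @usd\n";         # USD amounts: 100 300
print "Other amounts: @not_usd\n";   # Other amounts: 200
print "Repeated letter: $letter\n";  # Repeated letter: l
print "Greedy: $greedy\n";           # Greedy: <b>bold</b>
print "Non-greedy: $lazy\n";         # Non-greedy: <b>
```

In list context, a match with the g modifier returns every captured group, which is why @usd and @not_usd can be filled in a single statement.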

Character classes

In the table below, - marks an empty cell.

POSIX       Non-standard  Perl/Tcl  Vim    Java             ASCII                               Description
-           [:ascii:]     -         -      \p{ASCII}        [\x00-\x7F]                         ASCII characters
[:alnum:]   -             -         -      \p{Alnum}        [A-Za-z0-9]                         Alphanumeric characters
-           [:word:]      \w        \w     \w               [A-Za-z0-9_]                        Alphanumeric characters plus "_"
-           -             \W        \W     \W               [^A-Za-z0-9_]                       Non-word characters
[:alpha:]   -             -         \a     \p{Alpha}        [A-Za-z]                            Alphabetic characters
[:blank:]   -             -         \s     \p{Blank}        [ \t]                               Space and tab
-           -             \b        \< \>  \b               (?<=\W)(?=\w)|(?<=\w)(?=\W)         Word boundaries
-           -             \B        -      \B               (?<=\W)(?=\W)|(?<=\w)(?=\w)         Non-word boundaries
[:cntrl:]   -             -         -      \p{Cntrl}        [\x00-\x1F\x7F]                     Control characters
[:digit:]   -             \d        \d     \p{Digit} or \d  [0-9]                               Digits
-           -             \D        \D     \D               [^0-9]                              Non-digits
[:graph:]   -             -         -      \p{Graph}        [\x21-\x7E]                         Visible characters
[:lower:]   -             -         \l     \p{Lower}        [a-z]                               Lowercase letters
[:print:]   -             -         \p     \p{Print}        [\x20-\x7E]                         Visible characters and the space character
[:punct:]   -             -         -      \p{Punct}        [][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]  Punctuation characters
[:space:]   -             \s        \_s    \p{Space} or \s  [ \t\r\n\v\f]                       Whitespace characters
-           -             \S        \S     \S               [^ \t\r\n\v\f]                      Non-whitespace characters
[:upper:]   -             -         \u     \p{Upper}        [A-Z]                               Uppercase letters
[:xdigit:]  -             -         \x     \p{XDigit}       [A-Fa-f0-9]                         Hexadecimal digits
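
As a rough illustration of the POSIX column, the sketch below uses three of the classes from Perl. It assumes Perl's rule that a POSIX class is only recognised inside a bracketed character class, so [:digit:] is written as [[:digit:]]. The input string and variable names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical input, chosen only for illustration.
my $input = "Room 42, Floor 7!";

# POSIX classes go inside an outer pair of brackets in Perl.
my @numbers = $input =~ /[[:digit:]]+/g;   # runs of digits
my @words   = $input =~ /[[:alpha:]]+/g;   # runs of letters
my @marks   = $input =~ /[[:punct:]]/g;    # single punctuation characters

print "Numbers: @numbers\n";   # Numbers: 42 7
print "Words: @words\n";       # Words: Room Floor
print "Marks: @marks\n";       # Marks: , !
```

For this ASCII-only input, [[:digit:]] finds the same matches as \d, which is the correspondence the table lists.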

Examples

Metacharacter(s) Description Example
. Normally matches any character except a newline.
Within square brackets the dot is literal.
$string1 = "Hello World\n";
    if ($string1 =~ m/...../) {
      print "$string1 has length >= 5.\n";
    }
    

Output:

Hello World
     has length >= 5.
    
( ) Groups a series of pattern elements to a single element.
When you match a pattern within parentheses, you can use any of $1, $2, ... later to refer to the previously matched pattern.
$string1 = "Hello World\n";
    if ($string1 =~ m/(H..).(o..)/) {
      print "We matched '$1' and '$2'.\n";
    }
    

Output:

We matched 'Hel' and 'o W'.
    
+ Matches the preceding pattern element one or more times.
$string1 = "Hello World\n";
    if ($string1 =~ m/l+/) {
      print "There are one or more consecutive letter \"l\"'s in $string1.\n";
    }
    

Output:

There are one or more consecutive letter "l"'s in Hello World
     .
    
? Matches the preceding pattern element zero or one time.
$string1 = "Hello World\n";
    if ($string1 =~ m/H.?e/) {
      print "There is an 'H' and a 'e' separated by ";
      print "0-1 characters (e.g., He Hue Hee).\n";
    }
    

Output:

There is an 'H' and a 'e' separated by 0-1 characters (e.g., He Hue Hee).
    
? When placed after a *, +, ? or {M,N} quantifier, makes that quantifier match as few times as possible.
$string1 = "Hello World\n";
    if ($string1 =~ m/(l.+?o)/) {
      print "The non-greedy match with 'l' followed by one or ";
      print "more characters is 'llo' rather than 'llo Wo'.\n";
    }
    

Output:

The non-greedy match with 'l' followed by one or more characters is 'llo' rather than 'llo Wo'.
    
* Matches the preceding pattern element zero or more times.
$string1 = "Hello World\n";
    if ($string1 =~ m/el*o/) {
      print "There is an 'e' followed by zero to many ";
      print "'l' followed by 'o' (e.g., eo, elo, ello, elllo).\n";
    }
    

Output:

There is an 'e' followed by zero to many 'l' followed by 'o' (e.g., eo, elo, ello, elllo).
    
{M,N} Denotes the minimum M and the maximum N match count.
N can be omitted and M can be 0: {M} matches "exactly" M times; {M,} matches "at least" M times; {0,N} matches "at most" N times.
x* y+ z? is thus equivalent to x{0,} y{1,} z{0,1}.
$string1 = "Hello World\n";
    if ($string1 =~ m/l{1,2}/) {
      print "There exists a substring with at least 1 ";
      print "and at most 2 l's in $string1\n";
    }
    

Output:

There exists a substring with at least 1 and at most 2 l's in Hello World
    
[…] Denotes a set of possible character matches.
$string1 = "Hello World\n";
    if ($string1 =~ m/[aeiou]+/) {
      print "$string1 contains one or more vowels.\n";
    }
    

Output:

Hello World
     contains one or more vowels.
    
| Separates alternate possibilities.
$string1 = "Hello World\n";
    if ($string1 =~ m/(Hello|Hi|Pogo)/) {
      print "$string1 contains at least one of Hello, Hi, or Pogo.";
    }
    

Output:

Hello World
     contains at least one of Hello, Hi, or Pogo.
    
\b Matches a zero-width boundary between a word-class character (see next) and either a non-word class character or an edge; same as

(^\w|\w$|\W\w|\w\W).

$string1 = "Hello World\n";
    if ($string1 =~ m/llo\b/) {
      print "There is a word that ends with 'llo'.\n";
    }
    

Output:

There is a word that ends with 'llo'.
    
\w Matches an alphanumeric character, including "_";
same as [A-Za-z0-9_] in ASCII, and
[\p{Alphabetic}\p{GC=Mark}\p{GC=Decimal_Number}\p{GC=Connector_Punctuation}]

in Unicode, where the Alphabetic property contains more than Latin letters, and the Decimal_Number property contains more than Arabic digits.

$string1 = "Hello World\n";
    if ($string1 =~ m/\w/) {
      print "There is at least one alphanumeric ";
      print "character in $string1 (A-Z, a-z, 0-9, _).\n";
    }
    

Output:

There is at least one alphanumeric character in Hello World
     (A-Z, a-z, 0-9, _).
    
\W Matches a non-alphanumeric character, excluding "_";
same as [^A-Za-z0-9_] in ASCII, and
[^\p{Alphabetic}\p{GC=Mark}\p{GC=Decimal_Number}\p{GC=Connector_Punctuation}]

in Unicode.

$string1 = "Hello World\n";
    if ($string1 =~ m/\W/) {
      print "The space between Hello and ";
      print "World is not alphanumeric.\n";
    }
    

Output:

The space between Hello and World is not alphanumeric.
    
\s Matches a whitespace character,
which in ASCII are tab, line feed, form feed, carriage return, and space;
in Unicode, also matches no-break spaces, next line, and the variable-width spaces (amongst others).
$string1 = "Hello World\n";
    if ($string1 =~ m/\s.*\s/) {
      print "In $string1 there are TWO whitespace characters, which may";
      print " be separated by other characters.\n";
    }
    

Output:

In Hello World
     there are TWO whitespace characters, which may be separated by other characters.
    
\S Matches anything but a whitespace.
$string1 = "Hello World\n";
    if ($string1 =~ m/\S.*\S/) {
      print "In $string1 there are TWO non-whitespace characters, which";
      print " may be separated by other characters.\n";
    }
    

Output:

In Hello World
     there are TWO non-whitespace characters, which may be separated by other characters.
    
\d Matches a digit;
same as [0-9] in ASCII;
in Unicode, same as the \p{Digit} or \p{GC=Decimal_Number} property, which is itself the same as the \p{Numeric_Type=Decimal} property.
$string1 = "99 bottles of beer on the wall.";
    if ($string1 =~ m/(\d+)/) {
      print "$1 is the first number in '$string1'\n";
    }
    

Output:

99 is the first number in '99 bottles of beer on the wall.'
    
\D Matches a non-digit;
same as [^0-9] in ASCII or \P{Digit} in Unicode.
$string1 = "Hello World\n";
    if ($string1 =~ m/\D/) {
      print "There is at least one character in $string1";
      print " that is not a digit.\n";
    }
    

Output:

There is at least one character in Hello World
     that is not a digit.
    
^ Matches the beginning of a line or string.
$string1 = "Hello World\n";
    if ($string1 =~ m/^He/) {
      print "$string1 starts with the characters 'He'.\n";
    }
    

Output:

Hello World
     starts with the characters 'He'.
    
$ Matches the end of a line or string.
$string1 = "Hello World\n";
    if ($string1 =~ m/rld$/) {
      print "$string1 is a line or string ";
      print "that ends with 'rld'.\n";
    }
    

Output:

Hello World
     is a line or string that ends with 'rld'.
    
\A Matches the beginning of a string (but not an internal line).
$string1 = "Hello\nWorld\n";
    if ($string1 =~ m/\AH/) {
      print "$string1 is a string ";
      print "that starts with 'H'.\n";
    }
    

Output:

Hello
    World
     is a string that starts with 'H'.
    
\z Matches the end of a string (but not an internal line).
$string1 = "Hello\nWorld\n";
    if ($string1 =~ m/d\n\z/) {
      print "$string1 is a string ";
      print "that ends with 'd\\n'.\n";
    }
    

Output:

Hello
    World
     is a string that ends with 'd\n'.
    
[^…] Matches every character except the ones inside brackets.
$string1 = "Hello World\n";
    if ($string1 =~ m/[^abc]/) {
      print "$string1 contains a character other than ";
      print "a, b, and c.\n";
    }
    

Output:

Hello World
     contains a character other than a, b, and c.
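
The difference between ^ and $ on the one hand and \A and \z on the other only shows up on strings with internal line breaks. The following is a minimal sketch, assuming Perl's m (multiline) modifier and the same two-line string used in the \A and \z examples; the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "Hello\nWorld\n";

# Without /m, ^ matches only at the very start of the string.
my @default_starts   = $text =~ /^(\w)/g;
# With /m, ^ also matches right after each internal line break.
my @multiline_starts = $text =~ /^(\w)/mg;
# \A ignores /m and matches only at the start of the string.
my @string_starts    = $text =~ /\A(\w)/mg;

# $ matches at the end of the string or just before a final newline;
# with /m it also matches before internal newlines. \z matches only at the very end.
my $hello_dollar   = ($text =~ /Hello$/)  ? "yes" : "no";
my $hello_dollar_m = ($text =~ /Hello$/m) ? "yes" : "no";
my $world_dollar   = ($text =~ /World$/)  ? "yes" : "no";
my $world_z        = ($text =~ /World\z/) ? "yes" : "no";

print "^ without /m: @default_starts\n";      # ^ without /m: H
print "^ with /m: @multiline_starts\n";       # ^ with /m: H W
print "\\A with /m: @string_starts\n";        # \A with /m: H
print "Hello\$ without /m: $hello_dollar\n";  # Hello$ without /m: no
print "Hello\$ with /m: $hello_dollar_m\n";   # Hello$ with /m: yes
print "World\$: $world_dollar\n";             # World$: yes
print "World\\z: $world_z\n";                 # World\z: no
```

World\z fails here because the string ends with a newline after "World", which is why the \z example above writes the newline explicitly as d\n\z.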