Tuesday, November 4, 2014

JavaScript Regular Expression

1. Defining Regular Expressions in JavaScript

In JavaScript, regular expressions are represented by RegExp objects. RegExp objects may be created with the RegExp() constructor or with a literal syntax. Regular expression literals are specified as characters within a pair of slash characters.
For example:
var regexp = /^Hello JavaScript$/;
This regular expression could have equivalently been defined with the RegExp() constructor like this:
var regexp = new RegExp('^Hello JavaScript$');
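The equivalence of the two forms can be checked with a minimal sketch like the one below. The variable names are illustrative only. One detail worth noting: when a pattern containing a backslash is passed to RegExp() as a string, the backslash has to be doubled, because the string literal consumes one level of escaping before the pattern is compiled.

```javascript
// Literal and constructor forms of the same pattern.
var literal = /^Hello JavaScript$/;
var constructed = new RegExp('^Hello JavaScript$');

console.log(literal.source === constructed.source); // true
console.log(constructed.test('Hello JavaScript'));  // true
console.log(constructed.test('Hello JavaScript!')); // false

// With the constructor, "\d" must be written as "\\d" inside the string.
var digitsConstructed = new RegExp('^\\d+$');
console.log(digitsConstructed.test('2014'));  // true
console.log(digitsConstructed.test('20a14')); // false
```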

1.1 Character Representations
All alphabetic characters and digits match themselves literally in regular expressions.
JavaScript regular-expression syntax also supports character shorthands. For example, the sequence \n matches a literal newline character in a string.

Character Sequence    Matches
\0                    Null character, \x00.
\n                    Newline, \x0A.
\r                    Carriage return, \x0D.
\f                    Form feed, \x0C.
\t                    Horizontal tab, \x09.
\v                    Vertical tab, \x0B.
\xhh                  The Latin-1 character specified by the two-digit hexadecimal code hh.
\uhhhh                The Unicode character specified by the four-digit hexadecimal code hhhh.
\cX                   The control character ^X; for example, \cJ is equivalent to \n.
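As a quick check of the escape sequences in the table, the following sketch tests each one against a string that contains the corresponding character. The array name escapeChecks is made up for this example.

```javascript
// Each entry tests one escape sequence against a string containing that character.
var escapeChecks = [
  /\t/.test('a\tb'),            // tab
  /\x41/.test('A'),             // \x41 is the letter "A"
  /\u00e9/.test('caf\u00e9'),   // \u00e9 is "e" with an acute accent
  /\cJ/.test('line1\nline2'),   // Control-J is the newline character, \x0A
  /\0/.test('a\0b')             // null character
];

console.log(escapeChecks); // every entry is true
```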

A number of punctuation characters have special meanings in regular expressions. They are:
^ $ . * + ? = ! : | \ / ( ) [ ] { }
The meanings of these characters are discussed in the sections that follow.
If you want to include any of these punctuation characters literally in a regular expression, you must precede them with a \.
If you can’t remember exactly which punctuation characters need to be escaped with a backslash, you may safely place a backslash before any punctuation character. On the other hand, some letters and numbers have special meaning when preceded by a backslash, so do not escape letters or numbers you want to match literally.
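The effect of escaping can be seen in a small sketch. The patterns and variable names below are illustrative and not taken from any particular program.

```javascript
// Unescaped, "+" is a repetition operator: "1+" means one or more "1" characters.
var loose = /1+1=2/;
// Escaped, "+" is matched literally.
var strict = /1\+1=2/;

console.log(loose.test('11=2'));   // true
console.log(loose.test('1+1=2'));  // false: there is no run of "1"s followed by "1=2"
console.log(strict.test('1+1=2')); // true
console.log(strict.test('11=2'));  // false

// "$" and "." also need a backslash to be matched literally.
var price = /\$9\.99/;
console.log(price.test('$9.99')); // true
console.log(price.test('$9x99')); // false
console.log(/9.99/.test('9x99')); // true: an unescaped "." matches "x"
```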

1.2 Character Classes
Individual literal characters can be combined into character classes by placing them within square brackets. A character class matches any one character that is contained within it.
/[abc]/
Negated character classes match any character except those contained within the brackets.
/[^abc]/
Character classes use a hyphen to indicate a range of characters. For example, the following matches any letter of the Latin alphabet or any digit:
/[a-zA-Z0-9]/
The JavaScript regular-expression syntax includes special class shorthands to represent commonly used classes. For example, \s matches the space character, the tab character, and any other Unicode whitespace character.

Character    Matches
.            Any character except newline or another Unicode line terminator.
\w           Any ASCII word character. Equivalent to [a-zA-Z0-9_]
\W           Any character that is not an ASCII word character. Equivalent to [^a-zA-Z0-9_]
\s           Any Unicode whitespace character
\S           Any character that is not Unicode whitespace
\d           Any ASCII digit. Equivalent to [0-9]
\D           Any character other than an ASCII digit. Equivalent to [^0-9]
[\b]         A literal backspace
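The classes and shorthands above can be tried out with a short sketch. The variable names and test strings are chosen only for illustration.

```javascript
var vowel = /[aeiou]/;        // any one of the listed characters
var nonDigit = /[^0-9]/;      // any character that is not a digit
var alnum = /[a-zA-Z0-9]/;    // a letter or a digit

console.log(vowel.test('sea'));      // true
console.log(vowel.test('sky'));      // false
console.log(nonDigit.test('2014'));  // false: every character is a digit
console.log(nonDigit.test('2014a')); // true
console.log(alnum.test('-'));        // false

// Shorthand classes.
console.log(/\w\s\d/.test('a 1'));   // true: word character, whitespace, digit
console.log(/\s/.test('\u00a0'));    // true: a no-break space counts as Unicode whitespace
console.log(/./.test('\n'));         // false: "." does not match a newline
console.log(/[\b]/.test('\b'));      // true: literal backspace
```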

1.3 Repetition
With the syntax covered so far, you can describe a two-digit number as /\d\d/, but you have no way to describe, for example, a string of three letters followed by an optional digit. Patterns like this need syntax that specifies how many times an element may be repeated.
The table below summarizes the repetition syntax.

Character    Matches
{n,m}        Match the previous item at least n times but no more than m times.
{n,}         Match the previous item n or more times.
{n}          Match exactly n occurrences of the previous item.
?            Match zero or one occurrences of the previous item. Equivalent to {0,1}.
+            Match one or more occurrences of the previous item. Equivalent to {1,}.
*            Match zero or more occurrences of the previous item. Equivalent to {0,}.

For example:
/\d{2,4}/             // Match between two and four digits
/\s+javascript\s+/    // Match "javascript" with one or more whitespace characters before and after
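The repetition operators can be exercised with a sketch like the following. The ^ and $ anchors are added here so that the whole string has to match; the variable names are illustrative.

```javascript
// Between two and four digits, and nothing else.
var twoToFour = /^\d{2,4}$/;
console.log(twoToFour.test('7'));     // false
console.log(twoToFour.test('42'));    // true
console.log(twoToFour.test('2014'));  // true
console.log(twoToFour.test('20145')); // false

// Three letters followed by an optional digit.
var lettersThenDigit = /^[a-zA-Z]{3}\d?$/;
console.log(lettersThenDigit.test('abc'));   // true
console.log(lettersThenDigit.test('abc1'));  // true
console.log(lettersThenDigit.test('ab1'));   // false
console.log(lettersThenDigit.test('abc12')); // false

// "javascript" with whitespace on both sides.
var spaced = /\s+javascript\s+/;
console.log(spaced.test('I like  javascript  a lot')); // true
console.log(spaced.test('javascript'));                // false
```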

1.4 Alternation, Grouping, and References
The regular-expression grammar includes special characters for specifying alternatives, grouping subexpressions, and referring to previous subexpressions.

The | character separates alternatives. For example, /he|she|it/ matches the string “he” or the string “she” or the string “it”.

Parentheses have several purposes in regular expressions. One purpose is to group separate items into a single subexpression so that the items can be treated as a single unit by |, *, +, ?, and so on. For example, /java(script)?/ matches “java” followed by the optional “script”.

Another purpose of parentheses in regular expressions is to define subpatterns within the complete pattern. When a regular expression is successfully matched against a target string, it is possible to extract the portions of the target string matched by any particular parenthesized subpattern. For example, in /[a-z]+(\d+)/ the parentheses make it possible to extract the digits that follow one or more lowercase letters.
A related use of parenthesized subexpressions is to allow you to refer back to a subexpression later in the same regular expression.

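Subexpressions are numbered by the position of their left parenthesis, counting from left to right. For example, in the following pattern the nested subexpression ([Ss]cript) is referred to as \2: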
/([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/
A reference to a previous subexpression of a regular expression does not refer to the pattern for that subexpression but rather to the text matched by the subpattern. For example, the following regular expression matches zero or more characters within single or double quotes, and requires the closing quote to be the same as the opening quote.
/(['"])[^'"]*\1/
It is not legal to use a reference within a character class, so you cannot write:
/(['"])[^\1]*\1/
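To see what the backreference adds, the sketch below compares the pattern with a version that has no reference. The pattern anyQuotes is introduced only for comparison, and both patterns are anchored with ^ and $ so that the whole string has to match.

```javascript
// Without a reference: the opening and closing quotes need not be the same.
var anyQuotes = /^['"][^'"]*['"]$/;
// With \1: the closing quote must be the same character that group 1 matched.
var sameQuotes = /^(['"])[^'"]*\1$/;

var mismatched = '"hello\'';  // opens with a double quote, closes with a single quote

console.log(anyQuotes.test(mismatched));  // true
console.log(sameQuotes.test(mismatched)); // false
console.log(sameQuotes.test('"hello"'));  // true
console.log(sameQuotes.test("'hello'"));  // true
```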
You can also group items without remembering the text they match, in which case you cannot refer back to the matched characters later. Instead of simply grouping the items within ( and ), begin the group with (?: and end it with ). For example:
/([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/
Here, the subexpression (?:[Ss]cript) is used simply for grouping and to apply the ? repetition character to the group. In this regular expression, \2 refers to the text matched by (fun\w*).

The table below summarizes the alternation, grouping, and referencing operators.

Character    Meaning
|            Alternation. Match either the subexpression to the left or the subexpression to the right.
(...)        Grouping. Group items into a single unit that can be used with *, +, ?, |, and so on, and remember the characters that match this group for use with later references.
(?:...)      Grouping only. Group items into a single unit, but do not remember the characters that match this group.
\n           Match the same characters that were matched when group number n was first matched (for example, \1). Groups are numbered by their left parentheses; groups formed with (?: are not numbered.
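Group numbering can be observed with the exec() method, which returns an array whose element 0 is the whole match and whose later elements hold the text of each capturing group. The sketch below runs the patterns from this section against a sample sentence chosen for illustration.

```javascript
// Extracting a parenthesized subpattern.
var digits = /[a-z]+(\d+)/.exec('abc123');
console.log(digits[0]); // "abc123"
console.log(digits[1]); // "123"

// Three capturing groups: ([Ss]cript) is group 2.
var capturing = /([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/;
var m1 = capturing.exec('JavaScript is funny');
console.log(m1[1]); // "JavaScript"
console.log(m1[2]); // "Script"
console.log(m1[3]); // "funny"

// With (?:...), the inner group is not numbered, so (fun\w*) becomes group 2.
var nonCapturing = /([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/;
var m2 = nonCapturing.exec('JavaScript is funny');
console.log(m2[1]);     // "JavaScript"
console.log(m2[2]);     // "funny"
console.log(m2.length); // 3
```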

1.5 Specifying Match Position
Some regular expression elements match the positions between characters, instead of actual characters. For example, \b matches a word boundary—the boundary between a \w (ASCII word character) and a \W (nonword character), or the boundary between an ASCII word character and the beginning or end of a string. Sometimes these elements are called regular expression anchors because they anchor the pattern to a specific position in the search string. The most commonly used anchor elements are ^, which ties the pattern to the beginning of the string, and $, which anchors the pattern to the end of the string.
For example, to match the word “JavaScript” on a line by itself, you can use the regular expression /^JavaScript$/.

If you want to search for “Java” as a word by itself, you can use the pattern /\bJava\b/. You might instead try the pattern /\sJava\s/, which requires a space before and after the word, but this has two drawbacks. First, it does not match “Java” at the beginning or the end of a string, but only where it appears with a space on either side. Second, when this pattern finds a match, the matched string includes the leading and trailing spaces.

The element \B anchors the match to a location that is not a word boundary. Thus, the pattern /\B[Ss]cript/ matches “JavaScript” and “VBScript”, but not “script” or “Scripting”.
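The difference between these anchors can be checked with a short sketch. The sample strings are chosen for illustration.

```javascript
var wordJava = /\bJava\b/;
var spacedJava = /\sJava\s/;

console.log(wordJava.test('Java is fun'));   // true: the start of the string is a word boundary
console.log(spacedJava.test('Java is fun')); // false: there is no space before "Java"
console.log(wordJava.test('JavaScript'));    // false: no boundary between "Java" and "S"

// The \s version includes the surrounding spaces in the matched text.
var found = spacedJava.exec('I like Java a lot');
console.log(JSON.stringify(found[0])); // " Java "

// \B requires a position that is not a word boundary.
var insideWord = /\B[Ss]cript/;
console.log(insideWord.test('JavaScript')); // true
console.log(insideWord.test('VBScript'));   // true
console.log(insideWord.test('script'));     // false
console.log(insideWord.test('Scripting'));  // false
```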

A lookahead assertion is written by enclosing a pattern between (?= and ). It requires that the enclosed pattern match at the current position, but the characters it matches are not included in the overall match. For example, to match the name of a common programming language, but only if it is followed by whitespace and an ampersand, you could use:
/[Jj]ava([Ss]cript)?(?=\s\&)/
This pattern matches the word “JavaScript” in “JavaScript & jQuery”, but it does not match “JavaScript” in “JavaScript: The Good Parts”, because there it is not followed by an ampersand.

The table below summarizes regular-expression anchors.

Character    Meaning
^            Match the beginning of the string and, in multiline searches, the beginning of a line.
$            Match the end of the string and, in multiline searches, the end of a line.
\b           Match a word boundary. That is, match the position between a \w character and a \W character or between a \w character and the beginning or end of a string.
\B           Match a position that is not a word boundary.
(?=p)        A positive lookahead assertion. Require that the following characters match the pattern p, but do not include those characters in the match.
(?!p)        A negative lookahead assertion. Require that the following characters do not match the pattern p.
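Both kinds of lookahead can be tried with a sketch like the one below. The positive pattern is the one discussed in this section; the negative pattern notScript is a hypothetical example added here for illustration.

```javascript
// Positive lookahead: the name must be followed by whitespace and "&".
var followedByAmp = /[Jj]ava([Ss]cript)?(?=\s\&)/;
var hit = followedByAmp.exec('JavaScript & jQuery');
console.log(hit[0]); // "JavaScript": the lookahead text is not part of the match
console.log(followedByAmp.exec('JavaScript: The Good Parts')); // null

// Negative lookahead: "Java" followed by a capitalized word, unless that word starts with "Script".
var notScript = /Java(?!Script)([A-Z]\w*)/;
console.log(notScript.test('JavaBeans'));  // true
console.log(notScript.test('JavaScript')); // false
console.log(notScript.test('Javanese'));   // false: "n" is not an uppercase letter
```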

1.6 Flags
This section covers three regular-expression flags, which are written after the closing slash of a regular expression literal. The i flag specifies that pattern matching should be case-insensitive. The g flag specifies that pattern matching should be global: this means that it finds all existing matches within the searched string. The m flag performs pattern matching in multiline mode. In this mode, if the string to be searched contains newlines, the ^ and $ anchors match the beginning and end of a line in addition to matching the beginning and end of a string. For example, the pattern /java$/im matches “java” as well as “Java\nis fun”.

The table below summarizes the regular-expression flags.

Character    Meaning
i            Perform case-insensitive matching.
g            Perform a global match; that is, find all matches rather than stopping after the first match.
m            Multiline mode. ^ matches beginning of line or beginning of string, and $ matches end of line or end of string.
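The effect of each flag can be seen in a brief sketch. The sample strings and variable names are illustrative. With the RegExp() constructor, flags are passed as a second string argument.

```javascript
// m: "$" also matches at the end of a line.
console.log(/java$/im.test('java'));         // true
console.log(/java$/im.test('Java\nis fun')); // true
console.log(/java$/i.test('Java\nis fun'));  // false: without m, "$" matches only at the end of the string

// i and g with String.prototype.match().
var text = 'Java, java, JAVA';
var all = text.match(/java/gi);   // every match, ignoring case
var first = text.match(/java/i);  // first match only
var exact = text.match(/java/g);  // every case-sensitive match

console.log(all);      // ["Java", "java", "JAVA"]
console.log(first[0]); // "Java"
console.log(exact);    // ["java"]

// Flags passed to the constructor.
var built = new RegExp('java', 'gi');
console.log(built.global, built.ignoreCase); // true true
```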
