Tuesday, July 29, 2014

Regular expressions Introduction

1. What is a regular expression?

"A regular expression is a pattern which specifies a set of strings of characters; it is said to match certain strings."[Ken Thompson]

From the English dictionary:
  ●   pattern = (noun) a regularly repeated arrangement of something.
  ●   to match = (verb) to be equal.

A regex may or may not match a string and, if it does, it can match the string once or multiple times.

2. Regular Expression sample applications

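The Unix and MS-DOS shells use a simplified pattern notation, known as glob patterns, to match file names. Glob patterns are a close relative of regular expressions rather than regular expressions proper, but they illustrate the same idea. The Unix command: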
 $ rm my_file*.txt 
deletes all .txt files whose names start with my_file and continue with zero or more characters. The MS-DOS command:
 > del file_num?.txt 
deletes all .txt files whose names are file_num plus a character.

With filename patterns, a few characters have special meaning: the star means match anything, and the question mark means match any one character.
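The two filename patterns above can be tried out without deleting anything. The following is a minimal sketch using Python's standard fnmatch module; the helper glob_matches and the file names are made up for illustration.

```python
from fnmatch import fnmatchcase

def glob_matches(pattern, names):
    """Return the names that match a shell-style glob pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]

# Hypothetical directory listing, used only for this example.
names = [
    "my_file.txt",
    "my_file_old.txt",
    "my_file.doc",
    "file_num1.txt",
    "file_num12.txt",
]

# The star stands for zero or more characters.
print(glob_matches("my_file*.txt", names))

# The question mark stands for exactly one character.
print(glob_matches("file_num?.txt", names))
```

Note that file_num12.txt is not selected by the second pattern, because the question mark accepts one character only.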

The Unix program egrep is a utility that scans a specified file line by line and displays only the lines that match a given regular expression. The egrep command:
 $ egrep -n '(log.info|log.debug)' CustomerBean.java 
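prints every line of the Java file that contains the string 'log.info' or 'log.debug', prefixed with its line number (the -n option). Strictly speaking, the unescaped dot matches any character, so a line containing 'log_info' would be printed too.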
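To make the behaviour of the egrep command concrete, here is a rough Python sketch of the same idea: scan the text line by line and keep the numbered lines in which the pattern is found. The function grep_n and the sample source lines are assumptions for illustration, not the real CustomerBean.java.

```python
import re

def grep_n(pattern, text):
    """Return (line_number, line) pairs for lines in which the pattern is found."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]

# Hypothetical file contents, standing in for CustomerBean.java.
sample = "\n".join([
    "public class CustomerBean {",
    '    log.info("loading customer");',
    "    int retries = 3;",
    '    log.debug("retries=" + retries);',
    "}",
])

for number, line in grep_n(r"(log.info|log.debug)", sample):
    print(f"{number}:{line}")
```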

3. The literal Regular Expression

If a regular expression doesn't use metacharacters, it effectively becomes a simple plain text search. For example, searching for "cat" finds all occurrences with the three letters c-a-t in a row.
Note that regular-expression searching is not done on a word basis: the search also finds "cat" inside longer words such as "vacation". A regex engine can understand the concept of lines in a string, but it has no idea of human-language words, sentences, or paragraphs.
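A short Python sketch of a literal search, using the standard re module. The sample sentence is made up; it shows that "cat" is found wherever the three letters appear in a row, including inside longer words.

```python
import re

text = "The cat sat on the catalog during vacation."

# A pattern without metacharacters behaves like a plain text search.
positions = [match.start() for match in re.finditer("cat", text)]

# Three hits: the word "cat", the start of "catalog", the middle of "vacation".
print(len(positions), positions)
```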

4. Metacharacters and Literal characters

Regular expressions are composed of two types of characters:
  • the 15 metacharacters .^$*+?|(){}[]\-
  • literal or normal text characters
The 15 metacharacters are used to encode a pattern. The following sections explain what each metacharacter is used for.

5. Start and End of the Line

The ^ (caret) and $ (dollar) metacharacters represent the start and end of the input string that is being checked.
The regular expression ^cat matches only if the cat is at the beginning of the string. The expression dogs$ finds dogs only at the end of the string.
The caret and dollar are special in that they match a position in the string rather than text characters themselves.

Example 1: matching the whole input string with success

Example 2: matching the whole input string with failure

Example 3: matching the start of the input string

Example 4: matching the end of the input string
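The four cases named in the captions above can be sketched in Python as follows. The test strings and the helper found are assumptions chosen for illustration; they are not the original examples.

```python
import re

whole = re.compile(r"^cat$")   # the whole input must be "cat"
start = re.compile(r"^cat")    # "cat" must be at the start of the input
end = re.compile(r"dogs$")     # "dogs" must be at the end of the input

def found(regex, text):
    """Return True if the regex matches somewhere in the text."""
    return regex.search(text) is not None

print(found(whole, "cat"))        # whole input, success
print(found(whole, "cats"))       # whole input, failure
print(found(start, "catalog"))    # start of the input
print(found(end, "hot dogs"))     # end of the input
```

The anchors consume no characters: ^cat still matches only the three letters c-a-t, it just restricts where they may appear.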

6. Character Classes

6.1 Matching any one of several characters
The regular-expression construct [], usually called a character class, allows you to list the characters you want at a certain point in the matching string.

Example: the regular expression h[aeu]llo matches the different spellings of hello

As another example, the regular expression reali[sz]e matches both spellings: realize and realise.
Notice that outside a class, literal characters (like the "r" and "e" of reali[sz]e) have an implied AND between them, whereas the contents of a character class are a list of alternatives, so the implication is OR.

Example: matching capitalization of a word's first letter

You can list as many characters in the class as you like. For example, the regular expression <H[123456]> matches any of the HTML headers <H1>, <H2>,..., <H6>.
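The character classes discussed in this section can be tried with Python's re module. This is only a sketch; the word lists and sample strings are made up for illustration.

```python
import re

# h[aeu]llo: one of a, e or u is required in the second position.
hello = re.compile(r"h[aeu]llo")
greetings = [w for w in ["hello", "hallo", "hullo", "hillo"] if hello.fullmatch(w)]
print(greetings)

# reali[sz]e: either spelling is accepted.
spellings = re.findall(r"reali[sz]e", "I realize that you realise it too")
print(spellings)

# <H[123456]>: any of the six HTML header tags, but not <H7>.
headers = re.findall(r"<H[123456]>", "<H1> <H6> <H7>")
print(headers)
```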

6.2 Character Class and Metacharacters

Within a character class, the metacharacter - (dash) indicates a range of characters: <H[1-6]> is identical to <H[123456]>. Note that a dash is a metacharacter only within a character class, and even there it is not considered a metacharacter if it is the first character listed. Along the same lines, the star, the question mark and the period are normally regular-expression metacharacters, but not when they appear within a class.

[A-F] the dash is a metacharacter.
[-ABCDEF] the dash isn't a metacharacter.
.[A-F]*? the star, dot and question mark are metacharacters.
[A-F.*?] the star, dot and question mark aren't metacharacters.
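The four cases listed above can be checked with a small Python sketch. The input strings are assumptions picked so that the difference is visible.

```python
import re

# [A-F]: the dash defines a range, so a literal "-" is not matched.
range_hits = re.findall(r"[A-F]", "A-C")

# [-ABCDEF]: the leading dash is an ordinary character.
literal_dash_hits = re.findall(r"[-ABCDEF]", "A-C")

# [A-F.*?]: inside the class, dot, star and question mark are literal.
literal_meta_hits = re.findall(r"[A-F.*?]", "A.*?z")

# <H[1-6]> and <H[123456]> accept exactly the same tags.
tags = "<H1> <H3> <H6> <H7>"
same = re.findall(r"<H[1-6]>", tags) == re.findall(r"<H[123456]>", tags)

print(range_hits, literal_dash_hits, literal_meta_hits, same)
```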

6.3 Negated character classes
A negated character class uses [^] instead of [] and matches any character that isn't listed within the brackets. For example, [^abc] matches any one character other than the letters a, b or c. Also [^6-9] matches a character that's not 6 through 9.

Example part 1 of 2: matching words containing the letters "tt".

Example part 2 of 2: matching words containing the letters "tt" followed by something other than the "l" letter.

In the previous example the words little and Matt are not matched. In little, "tt" is followed by an "l". In Matt, "tt" is followed by the end of the line, and a negated class still has to match a character: it means "a character that is not l", not "no l".
A convenient way to view a negated class is that it is simply a shorthand for a normal class that includes all possible characters except those that are listed.
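A sketch of the two-part example in Python. The original word list is not available, so the list below is an assumption that includes little and Matt.

```python
import re

words = ["little", "Matt", "butter", "attic", "cat"]

# Part 1: words containing "tt".
with_tt = [w for w in words if re.search(r"tt", w)]

# Part 2: "tt" followed by one character that is not "l".
with_tt_not_l = [w for w in words if re.search(r"tt[^l]", w)]

print(with_tt)
print(with_tt_not_l)
```

Matt drops out in part 2 because nothing follows its "tt", and [^l] must consume exactly one character.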

7. Dot Character

7.1 Dot Matches Any Character
The metacharacter . (called dot or point) matches any single character (in most flavors, any character except a newline). It is useful when you want to put a placeholder for any character in a regular expression.

Example 1: matching different date separators with metacharacter dot

The above regex is easy to read, but it is imprecise: it also matches unrelated strings such as lottery numbers.

Example 2: matching different date separators with a character class

The above regex is harder to read but more precise, because it matches only the three date strings.

If you want to match a date such as 28/01/70, 28-01-70, or 28.01.70, you may use the simple regular expression 28.01.70, where the dot metacharacter matches any kind of separator, or you may build a regular expression containing the character class [-./] to allow a limited list of characters between each number. Notice that within a character class the dot is not a metacharacter, and that the dash is listed first so that it is taken literally.
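The trade-off between the two date expressions can be seen in a short Python sketch. The candidate strings, including the digit string 28301470, are made up for illustration.

```python
import re

with_dot = re.compile(r"28.01.70")              # any separator at all
with_class = re.compile(r"28[-./]01[-./]70")    # only -, . or / as separator

candidates = ["28/01/70", "28-01-70", "28.01.70", "28 01 70", "28301470"]

dot_hits = [c for c in candidates if with_dot.fullmatch(c)]
class_hits = [c for c in candidates if with_class.fullmatch(c)]

print(dot_hits)     # all five candidates
print(class_hits)   # only the three dates
```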

8. Alternation

8.1 Matching any one of several subexpressions
The | metacharacter (OR) allows you to combine two or more regular expressions into a single expression that matches any string matched by one of the original expressions.

Example 1: matching both words "cat" and "dog"

When combined this way, the subexpressions are called alternatives.

Looking back to the example using the character class h[ea]llo, it is interesting to notice that it can be written as hello|hallo, and even h(e|a)llo.

Example 2: alternation reaches far as character classes

With h(e|a)llo, the parentheses are required because without them he|allo means he or allo.

Here's an example involving an alternate spelling of the name Steven. Compare and contrast the following three expressions, which are all effectively the same:


Example 3: matching multiple spellings with alternation

Note: although the examples h[ea]llo and h(a|e)llo might blur the distinction, be careful not to confuse the concepts of alternation and character class. A character class can match just a single character in the target string. With alternations each sub-expression is a complete regular expression that can match an arbitrary amount of text in the target string.
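The difference between the expressions discussed in this section can be checked with a small Python sketch. The sample strings are assumptions; the patterns are the ones from the text.

```python
import re

# cat|dog: either alternative may match.
animals = re.findall(r"cat|dog", "a cat and a dog")

# h[ea]llo, hello|hallo and h(e|a)llo accept the same two words.
patterns = [r"h[ea]llo", r"hello|hallo", r"h(e|a)llo"]
equivalent = all(
    re.fullmatch(p, word) is not None
    for p in patterns
    for word in ["hello", "hallo"]
)

# Without parentheses, he|allo means "he" or "allo".
no_parens = re.compile(r"he|allo")
matches_allo = no_parens.fullmatch("allo") is not None
matches_hello = no_parens.fullmatch("hello") is not None

print(animals, equivalent, matches_allo, matches_hello)
```

A character class always stands for one character, so an alternation such as cat|dog has no character-class equivalent.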

9. Word Boundaries

A common problem is that a regular expression matches the searched word not only when it stands by itself but also when it is embedded in a longer word. Most regular expression flavors address this problem by offering word recognition: the ability to match the boundary of a word (where a word begins or ends).

Example 1: matching the word "cat" by itself

Example 2: finding words ending with "cations".

The regular expression \bcat\b uses the metasequences \b to match the word cat by itself.

The expression \bcat\b means: match if you can find a start-of-word position, followed by letters c-a-t, followed by an end-of-word position. The start of a word is simply the position where a sequence of alphanumeric characters begins, the end of word is where such a sequence ends. The figure below shows these positions marked.

Note: in regular expressions \b normally matches a word boundary, but within a character class, it matches a backspace.
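A sketch of the word-boundary examples in Python. The sample sentences are made up, and the pattern \b\w*cations\b is one possible way to find words ending with "cations", not necessarily the one used in the original example. Raw strings (r"...") are used because in an ordinary Python string "\b" already means a backspace.

```python
import re

# \bcat\b matches "cat" only when it stands by itself.
alone = re.findall(r"\bcat\b", "cat catalog bobcat cat.")

# Words ending with "cations" (assumed pattern, see the note above).
endings = re.findall(r"\b\w*cations\b", "vacations and applications, not cationsX")

# Inside a character class, \b is a backspace character.
backspace_found = re.search(r"[\b]", "a\bb") is not None

print(alone, endings, backspace_found)
```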

10. Optional Items

Noah Webster, the father of the American dictionary, modified the spelling of various words and dropped the u in words like colour and favour. If you wish to match both spellings of words such as colour/color or favour/favor, which differ only by a u, you can use the metacharacter ? (question mark), which means optional.

Example 1: matching both words armour and armor.

The metacharacter ? placed after the u means that the regular expression will successfully match whether the u exists in that position of target string or not.
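The optional u can be tried with a short Python sketch; the sample text is an assumption.

```python
import re

# The ? makes the preceding u optional.
armour = re.findall(r"armou?r", "armour, armor and armr")
colour = re.findall(r"colou?r", "colour or color")

print(armour)
print(colour)
```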

11. Other Quantifiers: Repetition

The metacharacters + (plus) and * (star), like the question mark, affect the quantity of the immediately-preceding item. The metacharacter + means one or more of the preceding item, and * means any number, including none, of the preceding item.

The regular expression with star: A*BACUS means "try to match 'A' as many times as possible, but it's OK to settle for nothing".

The regular expression with plus: A+BACUS means "try to match 'A' as many times as possible, but fail if you can't match at least once".

The three metacharacters, question mark, plus, and star, are called quantifiers because they influence the quantity of what they apply to.
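The difference between star and plus shows up only when the quantified item is absent. A minimal Python sketch, with made-up input strings:

```python
import re

star = re.compile(r"A*BACUS")   # zero or more A, then BACUS
plus = re.compile(r"A+BACUS")   # at least one A, then BACUS

inputs = ["BACUS", "ABACUS", "AAABACUS"]

star_hits = [s for s in inputs if star.fullmatch(s)]
plus_hits = [s for s in inputs if plus.fullmatch(s)]

print(star_hits)   # all three inputs
print(plus_hits)   # BACUS is rejected: there is no A to match
```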

Example: adding support for space in HTML headings regular expression.

We modify the previous <H[1-6]> regular expression example, which matches HTML headers, to add support for optional space. The HTML specification says that spaces are allowed immediately before the closing >, such as with <h1   >.
We insert  * into the regular expression where we want to allow (but not require) spaces and we get:
<H[1-6] *>

Example: searching for <hr> tag

Continuing with quantifiers, suppose we want to search for an HTML tag such as <HR SIZE=14>. As in the headings example, optional spaces are allowed before the closing angle bracket and on either side of the equal sign. Additionally, at least one space is required between HR and SIZE, although more are allowed.
To suit all these requirements we use the regular expression:
 <HR +SIZE *= *14 *> 

Now we want to find <hr> tags with any value of the size attribute. To accomplish this, we replace the 14 with an expression that matches a number of one or more digits. This leaves us with:
 <HR +SIZE *= *[0-9]+ *> 
We are using the case-insensitive option of the RegexBuddy tool (the equivalent of egrep's -i), so we don't have to write "[Hh][Rr]" instead of "HR".

Continuing with the <hr> example, we want to make the size attribute optional so that the plain <hr> tag matches as well. Optional means using "?", so we get:
 <HR( +SIZE *= *[0-9]+)? *> 
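The final expressions of this section can be exercised in Python. This is a sketch: re.IGNORECASE plays the role of the case-insensitive option, and the sample tags are made up for illustration.

```python
import re

heading = re.compile(r"<H[1-6] *>", re.IGNORECASE)
hr_tag = re.compile(r"<HR( +SIZE *= *[0-9]+)? *>", re.IGNORECASE)

def accepts(regex, tag):
    """Return True if the whole tag is matched by the regex."""
    return regex.fullmatch(tag) is not None

good_tags = ["<hr>", "<HR SIZE=14>", "<hr size = 7 >", "<HR   SIZE=3>"]
bad_tags = ["<HRSIZE=14>", "<hr size=>"]

print([accepts(hr_tag, t) for t in good_tags])   # all accepted
print([accepts(hr_tag, t) for t in bad_tags])    # all rejected
print(accepts(heading, "<h1   >"))               # spaces before > are allowed
```

The parentheses make the question mark apply to the whole size attribute, not only to the last character before it.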

11.2 Defined range of matches: intervals
Some flavours of regular expression support a metasequence for providing your own minimum and maximum: C{min,max}. This is called the interval quantifier. For example, A{3,12} tries to match the "A" up to 12 times if possible, but settles for three.

The regular expression [a-zA-Z]{1,5} matches a US stock ticker (from one to five letters).

The notation {0,1} has the same meaning as question mark quantifier.
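A sketch of the interval quantifier in Python, which supports the {min,max} notation. The ticker symbols and input strings are only examples.

```python
import re

# A{3,12}: at least three, at most twelve.
twelve = re.match(r"A{3,12}", "A" * 20).group()
four = re.match(r"A{3,12}", "AAAA").group()
too_few = re.match(r"A{3,12}", "AA")

# One to five letters, as in the stock ticker expression.
ticker = re.compile(r"[a-zA-Z]{1,5}")
valid = [s for s in ["IBM", "GOOGL", "TOOLONG", ""] if ticker.fullmatch(s)]

# {0,1} behaves like the question mark.
same = re.findall(r"colou{0,1}r", "colour color") == re.findall(r"colou?r", "colour color")

print(len(twelve), len(four), too_few, valid, same)
```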

12. Parentheses and Backreferences

So far, we have seen two uses for parentheses: to create sub-patterns with alternation and to create units of characters with quantifiers. There is another use for parentheses: in many regular expression flavors, parentheses can remember text matched by the subexpression they enclose. Suppose that you are given the task of checking HTML pages for doubled words. You could make use of this text-remembering feature to build a regular expression that solves the doubled-words problem.

If you were searching for a doubled word such as "the the", the regular expression \bthe +the\b could find it. In this expression we used the word-boundary metasequences \b...\b, and a space followed by + to allow one or more spaces between the two words.
If you want to build a regular expression that matches generic doubled words, you have to use backreferencing: the capacity of a regular expression engine to match new text that is the same as some text matched earlier in the expression.

The regular expression:
 \b([A-Za-z]+) +\1\b 
matches doubled words: [A-Za-z]+ matches a general word, the parentheses around it tell the regex engine to remember the text that the subexpression [A-Za-z]+ matches, and the metasequence \1 represents that text later in the regular expression.
If you have more than one set of parentheses in a regular expression, you can use "\1", "\2", "\3", etc., to refer to the first, second, third, etc. sets.

If you want to use this regular expression with egrep, you can run the following command:
 $ egrep -i '\b([a-z]+) +\1\b' files 
The command uses the case-insensitive -i option. Note that this approach has a limitation: since egrep considers each line in isolation, it cannot find a word at the end of one line that is repeated at the beginning of the next.
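The doubled-word expression can be tried in Python, whose re module supports \1 backreferences. The sample sentences are made up, and re.IGNORECASE stands in for the -i option.

```python
import re

doubled = re.compile(r"\b([A-Za-z]+) +\1\b")

# findall returns the text remembered by the parentheses for each match.
repeats = doubled.findall("Paris in the the spring is is nice")

# "the theory" is not a doubled word: the trailing \b rejects it.
false_alarm = doubled.search("the theory")

# Case-insensitive variant, as in the egrep command.
mixed_case = re.search(r"\b([a-z]+) +\1\b", "The the cat", re.IGNORECASE)

print(repeats, false_alarm, mixed_case is not None)
```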

13. The Escape Character

Sometimes you need to match a string containing a character that is itself a metacharacter. For example, you may want to search for the internet hostname "sega.etr.com". If you use the regular expression sega.etr.com, you find the correct string but also wrong ones such as "segaretrocomputers.com", because the dot is a metacharacter matching any character.

The metasequence to match an actual period is \., a period preceded by a backslash. The regular expression sega\.etr\.com correctly matches the hostname "sega.etr.com" and no longer matches the wrong hostnames.

The sequence \. is called an escaped period or escaped dot, and the backslash in it is called an escape. You can do this with all the metacharacters, except in a character class. When a metacharacter is escaped with a backslash, it loses its special meaning and becomes a literal character.

As another example, if you want to match a number within parentheses such as (555), a phone prefix, you could use the regular expression: \([0-9]+\).
The backslashes in the \( and \) sequences remove the special interpretation of the parentheses, making them literal characters that match actual parentheses in the target string.
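The effect of escaping can be verified with a short Python sketch. The phone text is made up; re.escape is a standard-library helper that adds the backslashes for you.

```python
import re

wrong_host = "segaretrocomputers.com"

# Unescaped dots match any character, so the wrong host is found too.
unescaped_hit = re.search(r"sega.etr.com", wrong_host) is not None

# Escaped dots match only a real period.
escaped_hit = re.search(r"sega\.etr\.com", wrong_host) is not None
correct_hit = re.search(r"sega\.etr\.com", "sega.etr.com") is not None

# re.escape builds the escaped pattern from the literal text.
escaped_pattern = re.escape("sega.etr.com")

# Escaped parentheses match actual parentheses.
prefixes = re.findall(r"\([0-9]+\)", "call (555) 1234")

print(unescaped_hit, escaped_hit, correct_hit, escaped_pattern, prefixes)
```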

14. Regular expressions examples

The following examples provide an insight into regular expressions, even though they may be improved.
Example: String between double quotes
Example: Dollar amount
Example: The simplest regex for URLs 
Example: More complete regex for URLs
Example: Time of day regex 1 
Example: Time of day regex 2 
Example: Time of day for 24H clock - 1/2
Example: Time of day for 24H clock 2/2
