Regular Expression Guide

2019-12-01 Author: www.8455.com

Regular expressions are a powerful language for matching text patterns. This page gives a basic introduction to regular expressions themselves, sufficient for our Python exercises, and shows how regular expressions work in Python. The Python "re" module provides regular expression support.

The following script exercises the main methods of a compiled regular expression object on a small multi-line string:

import re 
r1 = re.compile(r'(?im)(?P<name></html>)$') 
content = """
        <HTML>
boxsuch as 'box' and 'boxes', but not 'inbox'. In other words
box
<html>dsafdsafdas  </html> </ahtml>
</html>
  </HTML> 
""" 
 
reobj = re.compile(r"(?im)(?P<name></.*?html>)$")
for match in reobj.finditer(content):
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group()
    print("start>>", match.start())
    print("end>>", match.end())
    print("span>>", match.span())
    print("match.group()>>", match.group())

print("*" * 20)

if r1.match(content): print('match succeeds')
else: print('match fails')                          # prints: match fails

if r1.search(content): print('search succeeds')     # prints: search succeeds
else: print('search fails')

print(r1.flags)
print(r1.groupindex)
print(r1.pattern)

l = r1.split(content)
print("l>>", l)

for item in r1.findall(content):
    print("item>>", item)

s = r1.sub("aa", content)
print("s>>", s)

s_subn, s_sub_count = r1.subn("aaaaaaaaaaaa", content)
print("s_subn>>", s_subn)
print("s_sub_count>>", s_sub_count)

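    (Content mainly excerpted from JavaScript: The Definitive Guide, 5th edition)
    Validating page input on the client side with regular expressions, using the methods JavaScript provides, is a common and efficient practice. To test a string against a given regular expression pattern, you can use either certain methods of the string itself or the methods of the regular expression object (RegExp).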

In Python a regular expression search is typically written as:

  match = re.search(pat, str)

9.7 Regular Expressions and the re Module
A regular expression is a string that represents a pattern. With regular expression functionality, you can compare that pattern to another string and see if any part of the string matches the pattern.
The re module supplies all of Python's regular expression functionality. The compile function builds a regular expression object from a pattern string and optional flags. The methods of a regular expression object look for matches of the regular expression in a string and/or perform substitutions. Module re also exposes functions equivalent to a regular expression's methods, but with the regular expression's pattern string as their first argument.
Regular expressions can be difficult to master, and this book does not purport to teach them; I cover only the ways in which you can use them in Python. For general coverage of regular expressions, I recommend the book Mastering Regular Expressions, by Jeffrey Friedl (O'Reilly). Friedl's book offers thorough coverage of regular expressions at both the tutorial and advanced levels.
9.7.1 Pattern-String Syntax
The pattern string representing a regular expression follows a specific syntax:
Alphabetic and numeric characters stand for themselves. A regular expression whose pattern is a string of letters and digits matches the same string.
Many alphanumeric characters acquire special meaning in a pattern when they are preceded by a backslash (\).
Punctuation works the other way around. A punctuation character is self-matching when escaped, and has a special meaning when unescaped.
The backslash character itself is matched by a repeated backslash (i.e., the pattern \\).
Since regular expression patterns often contain backslashes, you generally want to specify them using raw-string syntax (covered in Chapter 4). Pattern elements (e.g., r'\t', which is equivalent to the non-raw string literal '\\t') do match the corresponding special characters (e.g., the tab character '\t'). Therefore, you can use raw-string syntax even when you do need a literal match for some such special character.
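The relationship between raw strings and pattern escapes can be checked directly. The following is a minimal sketch for Python 3; the sample strings are made up for illustration.

```python
import re

# r'\t' is two characters (backslash, t); '\\t' spells the same two characters.
raw_pattern = r'\t'
escaped_pattern = '\\t'
same_pattern = raw_pattern == escaped_pattern

# The re module itself interprets backslash-t as "match a tab character".
tab_match = re.search(raw_pattern, 'a\tb')

# A literal backslash in the target needs a doubled backslash in the pattern.
backslash_match = re.search(r'\\', 'C:\\temp')

print(same_pattern, len(raw_pattern))
print(tab_match.span())
print(backslash_match.group())
```

The raw-string spelling keeps the pattern readable because Python's own string-literal escaping never gets a chance to consume the backslashes before re sees them.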
Table 9-2 lists the special elements in regular expression pattern syntax. The exact meanings of some pattern elements change when you use optional flags, together with the pattern string, to build the regular expression object. The optional flags are covered later in this chapter.
Table 9-2. Regular expression pattern syntax

Definition and Syntax of Regular Expressions

Some Simple Examples

The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None. Therefore, the search is usually immediately followed by an if-statement to test if the search succeeded, as shown in the following example which searches for the pattern 'word:' followed by a 3 letter word (details below):


Element       Meaning
.             Matches any character except \n (if DOTALL, also matches \n)
^             Matches start of string (if MULTILINE, also matches after \n)
$             Matches end of string (if MULTILINE, also matches before \n)
*             Matches zero or more cases of the previous regular expression; greedy (match as many as possible)
+             Matches one or more cases of the previous regular expression; greedy (match as many as possible)
?             Matches zero or one case of the previous regular expression; greedy (match one if possible)
*?, +?, ??    Non-greedy versions of *, +, and ? (match as few as possible)
{m,n}         Matches m to n cases of the previous regular expression (greedy)
{m,n}?        Matches m to n cases of the previous regular expression (non-greedy)
[...]         Matches any one of a set of characters contained within the brackets
|             Matches expression either preceding it or following it
(...)         Matches the regular expression within the parentheses and also indicates a group
(?iLmsux)     Alternate way to set optional flags; no effect on match
(?:...)       Like (...), but does not indicate a group
(?P<id>...)   Like (...), but the group also gets the name id
(?P=id)       Matches whatever was previously matched by group named id
(?#...)       Content of parentheses is just a comment; no effect on match
(?=...)       Lookahead assertion; matches if regular expression ... matches what comes next, but does not consume any part of the string
(?!...)       Negative lookahead assertion; matches if regular expression ... does not match what comes next, and does not consume any part of the string
(?<=...)      Lookbehind assertion; matches if there is a match for regular expression ... ending at the current position (... must match a fixed length)
(?<!...)      Negative lookbehind assertion; matches if there is no match for regular expression ... ending at the current position (... must match a fixed length)
\number       Matches whatever was previously matched by group numbered number (groups are automatically numbered from 1 up to 99)
\A            Matches an empty string, but only at the start of the whole string
\b            Matches an empty string, but only at the start or end of a word (a maximal sequence of alphanumeric characters; see also \w)
\B            Matches an empty string, but not at the start or end of a word
\d            Matches one digit, like the set [0-9]
\D            Matches one non-digit, like the set [^0-9]
\s            Matches a whitespace character, like the set [ \t\n\r\f\v]
\S            Matches a non-white character, like the set [^ \t\n\r\f\v]
\w            Matches one alphanumeric character; unless LOCALE or UNICODE is set, \w is like [a-zA-Z0-9_]
\W            Matches one non-alphanumeric character, the reverse of \w
\Z            Matches an empty string, but only at the end of the whole string
\\            Matches one backslash character
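A few of the elements in the table are easier to grasp side by side. The sketch below, for Python 3, contrasts greedy and non-greedy repetition and shows a named back reference and a lookahead; the sample strings are invented for illustration.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .* grabs as much as possible, so the match runs to the last '>'.
greedy = re.search(r'<.*>', text).group()

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.search(r'<.*?>', text).group()

# Named group plus (?P=tag): the closing tag must repeat the opening tag name.
paired = re.findall(r'<(?P<tag>\w+)>.*?</(?P=tag)>', text)

# Lookahead: match digits only when followed by 'px', without consuming 'px'.
sizes = re.findall(r'\d+(?=px)', '12px 5em 300px')

print(greedy)
print(lazy)
print(paired)
print(sizes)
```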
9.7.2 Common Regular Expression Idioms
'.*' as a substring of a regular expression's pattern string means "any number of repetitions (zero or more) of any character." In other words, '.*' matches any substring of a target string, including the empty substring. '.+' is similar, but it matches only a non-empty substring. For example:
'pre.*post'
matches a string containing a substring 'pre' followed by a later substring 'post', even if the latter is adjacent to the former (e.g., it matches both 'prepost' and 'pre23post'). On the other hand:
'pre.+post'
matches only if 'pre' and 'post' are not adjacent (e.g., it matches 'pre23post' but does not match 'prepost'). Both patterns also match strings that continue after the 'post'.
To constrain a pattern to match only strings that end with 'post', end the pattern with \Z. For example:
r'pre.*post\Z'
matches 'prepost', but not 'preposterous'. Note that we need to express the pattern with raw-string syntax (or escape the backslash by doubling it into \\), as it contains a backslash. Using raw-string syntax for all regular expression pattern literals is good practice in Python, as it's the simplest way to ensure you'll never fail to escape a backslash.
Another frequently used element in regular expression patterns is \b, which matches a word boundary. If you want to match the word 'his' only as a whole word and not its occurrences as a substring in such words as 'this' and 'history', the regular expression pattern is:
r'\bhis\b'
with word boundaries both before and after. To match the beginning of any word starting with 'her', such as 'her' itself but also 'hermetic', but not words that just contain 'her' elsewhere, such as 'ether', use:
r'\bher'
with a word boundary before, but not after, the relevant string. To match the end of any word ending with 'its', such as 'its' itself but also 'fits', but not words that contain 'its' elsewhere, such as 'itsy', use:
r'its\b'
with a word boundary after, but not before, the relevant string. To match whole words thus constrained, rather than just their beginning or end, add a pattern element \w* to match zero or more word characters. For example, to match any full word starting with 'her', use:
r'\bher\w*'
And to match any full word ending with 'its', use:
r'\w*its\b'
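These idioms can be tried against the very words the text mentions. This is a small Python 3 sketch; the sample sentence is an assumption assembled from those words.

```python
import re

sample = 'this is his history; her hermetic ether; its fits itsy'

# Whole word only: 'this' and 'history' are skipped.
whole_his = re.findall(r'\bhis\b', sample)

# Full words starting with 'her': 'ether' is skipped.
starts_her = re.findall(r'\bher\w*', sample)

# Full words ending with 'its': 'itsy' is skipped.
ends_its = re.findall(r'\w*its\b', sample)

# \Z rejects strings that continue after 'post'.
anchored = [bool(re.match(r'pre.*post\Z', s))
            for s in ('prepost', 'pre23post', 'preposterous')]

print(whole_his, starts_her, ends_its, anchored)
```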
9.7.3 Sets of Characters
You denote sets of characters in a pattern by listing the characters within brackets ([ ]). In addition to listing single characters, you can denote a range by giving the first and last characters of the range separated by a hyphen (-). The last character of the range is included in the set, which is different from other Python ranges. Within a set, special characters stand for themselves, except \, ], and -, which you must escape (by preceding them with a backslash) when their position is such that, unescaped, they would form part of the set's syntax. In a set, you can also denote a class of characters by escaped-letter notation, such as \d or \S. However, \b in a set denotes a backspace character, not a word boundary. If the first character in the set's pattern, right after the [, is a caret (^), the set is complemented. In other words, the set matches any character except those that follow ^ in the set pattern notation.
A frequent use of character sets is to match a word, using a definition of what characters can make up a word that differs from \w's default (letters and digits). To match a word of one or more characters, each of which can be a letter, an apostrophe, or a hyphen, but not a digit (e.g., 'Finnegan-O'Hara'), use:
r"[a-zA-Z'\-]+"
It's not strictly necessary to escape the hyphen with a backslash in this case, since its position makes it syntactically unambiguous. However, the backslash makes the pattern somewhat more readable, by visually distinguishing the hyphen that you want to have as a character in the set from those used to denote ranges.
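As a quick illustration of sets and complemented sets, here is a Python 3 sketch; the input strings are hypothetical.

```python
import re

# Letters, apostrophes, and hyphens make up a "word"; digits do not.
name_word = re.compile(r"[a-zA-Z'\-]+")
names = name_word.findall("Finnegan-O'Hara met R2-D2")

# A leading ^ complements the set: runs of anything except vowels and whitespace.
non_vowels = re.findall(r'[^aeiou\s]+', 'regular expression')

print(names)
print(non_vowels)
```

Note how the digits in 'R2-D2' break the word into separate pieces, since a digit is not a member of the set.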
9.7.4 Alternatives
A vertical bar (|) in a regular expression pattern, used to specify alternatives, has low precedence. Unless parentheses change the grouping, | applies to the whole pattern on either side, up to the start or end of the string, or to another |. A pattern can be made up of any number of subpatterns joined by |. To match such a regular expression, the first subpattern is tried first, and if it matches, the others are skipped. If the first subpattern does not match, the second subpattern is tried, and so on. | is neither greedy nor non-greedy, as it doesn't take into consideration the length of the match.
If you have a list L of words, a regular expression pattern that matches any of the words is:
'|'.join([r'\b%s\b' % word for word in L])
If the items of L can be more-general strings, not just words, you need to escape each of them with function re.escape, covered later in this chapter, and you probably don't want the \b word boundary markers on either side. In this case, use the regular expression pattern:
'|'.join(map(re.escape,L))
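Both ways of building an alternation can be sketched as follows in Python 3. The lists words and literals and the target strings are made-up examples.

```python
import re

# Alternation of whole words, each wrapped in word boundaries.
words = ['cat', 'dog']
word_pattern = '|'.join([r'\b%s\b' % word for word in words])
found_words = re.findall(word_pattern, 'cat catalog dog hotdog')

# Alternation of arbitrary literal strings, escaped so '+' and '(' lose their special meaning.
literals = ['a+b', '(x)']
literal_pattern = '|'.join(map(re.escape, literals))
found_literals = re.findall(literal_pattern, 'a+b = (x) but aab is not')

# Alternatives are tried left to right; the first one that matches wins,
# even when a later alternative would match more text.
first_wins = re.search(r'a|ab', 'ab').group()

print(found_words, found_literals, first_wins)
```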
9.7.5 Groups
A regular expression can contain any number of groups, from none up to 99 (any number is allowed, but only the first 99 groups are fully supported). Parentheses in a pattern string indicate a group. Element (?P<id>...) also indicates a group, and in addition gives the group a name, id, that can be any Python identifier. All groups, named and unnamed, are numbered from left to right, 1 to 99, with group number 0 indicating the whole regular expression.
For any match of the regular expression with a string, each group matches a substring (possibly an empty one). When the regular expression uses |, some of the groups may not match any substring, although the regular expression as a whole does match the string. When a group doesn't match any substring, we say that the group does not participate in the match. An empty string '' is used to represent the matching substring for a group that does not participate in a match, except where otherwise indicated later in this chapter.
For example:
r'(.+)\1+\Z'
matches a string made up of two or more repetitions of any non-empty substring. The (.+) part of the pattern matches any non-empty substring (any character, one or more times), and defines a group thanks to the parentheses. The \1+ part of the pattern matches one or more repetitions of the group, and the \Z anchors the match to end-of-string.
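A short Python 3 sketch of this repeated-substring pattern in action; the test strings are invented.

```python
import re

repeated = re.compile(r'(.+)\1+\Z')

# 'abc' repeated three times: group 1 captures the repeated unit.
m = repeated.match('abcabcabc')
unit = m.group(1)

# No prefix of this string repeats all the way to the end, so there is no match.
no_match = repeated.match('abcabd')

print(unit, no_match)
```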
9.7.6 Optional Flags
A regular expression pattern element with one or more of the letters "iLmsux" between (? and ) lets you set regular expression options within the regular expression's pattern, rather than by the flags argument to function compile of module re. Options apply to the whole regular expression, no matter where the options element occurs in the pattern. For clarity, options should always be at the start of the pattern. Placement at the start is mandatory if x is among the options, since x changes the way Python parses the pattern.
Using the explicit flags argument is more readable than placing an options element within the pattern. The flags argument to function compile is a coded integer, built by bitwise ORing (with Python's bitwise OR operator, |) one or more of the following attributes of module re. Each attribute has both a short name (one uppercase letter), for convenience, and a long name (an uppercase multiletter identifier), which is more readable and thus normally preferable:
I or IGNORECASE
Makes matching case-insensitive
L or LOCALE
Causes \w, \W, \b, and \B matches to depend on what the current locale deems alphanumeric
M or MULTILINE
Makes the special characters ^ and $ match at the start and end of each line (i.e., right after/before a newline), as well as at the start and end of the whole string
S or DOTALL
Causes the special character . to match any character, including a newline
U or UNICODE
Makes \w, \W, \b, and \B matches depend on what Unicode deems alphanumeric
X or VERBOSE
Causes whitespace in the pattern to be ignored, except when escaped or in a character set, and makes a # character in the pattern begin a comment that lasts until the end of the line
For example, here are three ways to define equivalent regular expressions with function compile, covered later in this chapter. Each of these regular expressions matches the word "hello" in any mix of upper- and lowercase letters:
import re
r1 = re.compile(r'(?i)hello')
r2 = re.compile(r'hello', re.I)
r3 = re.compile(r'hello', re.IGNORECASE)
The third approach is clearly the most readable, and thus the most maintainable, even though it is slightly more verbose. Note that the raw-string form is not necessary here, since the patterns do not include backslashes. However, using raw strings is still innocuous, and is the recommended style for clarity.
Option re.VERBOSE (or re.X) lets you make patterns more readable and understandable by appropriate use of whitespace and comments. Complicated and verbose regular expression patterns are generally best represented by strings that take up more than one line, and therefore you normally want to use the triple-quoted raw-string format for such pattern strings. For example:
repat_num1 = r'(0[0-7]*|0x[\da-fA-F]+|[1-9]\d*)L?\Z'
repat_num2 = r'''(?x)            # pattern matching integer numbers
              (0 [0-7]*        | # octal: leading 0, then 0+ octal digits
               0x [\da-fA-F]+  | # hex: 0x, then 1+ hex digits
               [1-9] \d*       ) # decimal: leading non-0, then 0+ digits
               L?\Z              # optional trailing L, then end of string
              '''
The two patterns defined in this example are equivalent, but the second one is made somewhat more readable by the comments and the free use of whitespace to group portions of the pattern in logical ways.
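One way to gain confidence that a compact pattern and its verbose spelling agree is to run both over the same inputs. The sketch below does that in Python 3; the list samples is an assumption chosen to hit each alternative and a few non-matches.

```python
import re

repat_num1 = r'(0[0-7]*|0x[\da-fA-F]+|[1-9]\d*)L?\Z'
repat_num2 = r'''(?x)            # pattern matching integer numbers
              (0 [0-7]*        | # octal: leading 0, then 0+ octal digits
               0x [\da-fA-F]+  | # hex: 0x, then 1+ hex digits
               [1-9] \d*       ) # decimal: leading non-0, then 0+ digits
               L?\Z              # optional trailing L, then end of string
              '''

samples = ['0', '0755', '0x1F', '42', '42L', '08', '0x', 'abc', '']
results1 = [bool(re.match(repat_num1, s)) for s in samples]
results2 = [bool(re.match(repat_num2, s)) for s in samples]

print(results1 == results2)
print(dict(zip(samples, results1)))
```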
9.7.7 Match Versus Search
So far, we've been using regular expressions to match strings. For example, the regular expression with pattern r'box' matches strings such as 'box' and 'boxes', but not 'inbox'. In other words, a regular expression match can be considered as implicitly anchored at the start of the target string, as if the regular expression's pattern started with \A.
Often, you're interested in locating possible matches for a regular expression anywhere in the string, without any anchoring (e.g., find the r'box' match inside such strings as 'inbox', as well as in 'box' and 'boxes'). In this case, the Python term for the operation is a search, as opposed to a match. For such searches, you use the search method of a regular expression object, while the match method only deals with matching from the start. For example:
import re
r1 = re.compile(r'box')
if r1.match('inbox'): print('match succeeds')
else: print('match fails')                          # prints: match fails
if r1.search('inbox'): print('search succeeds')     # prints: search succeeds
else: print('search fails')
9.7.8 Anchoring at String Start and End
The pattern elements ensuring that a regular expression search (or match) is anchored at string start and string end are \A and \Z respectively. More traditionally, elements ^ for start and $ for end are also used in similar roles. ^ is the same as \A, and $ is the same as \Z, for regular expression objects that are not multiline (i.e., that do not contain pattern element (?m) and are not compiled with the flag re.M or re.MULTILINE). For a multiline regular expression object, however, ^ anchors at the start of any line (i.e., either at the start of the whole string or at the position right after a newline character \n). Similarly, with a multiline regular expression, $ anchors at the end of any line (i.e., either at the end of the whole string or at the position right before \n). On the other hand, \A and \Z anchor at the start and end of the whole string whether the regular expression object is multiline or not. For example, here's how to check if a file has any lines that end with digits:
import re
digatend = re.compile(r'\d$', re.MULTILINE)
if digatend.search(open('afile.txt').read()): print("some lines end with digits")
else: print("no lines end with digits")
A pattern of r'\d\n' would be almost equivalent, but in that case the search would fail if the very last character of the file were a digit not followed by a terminating end-of-line character. With the example above, the search succeeds if a digit is at the very end of the file's contents, as well as in the more usual case where a digit is followed by an end-of-line character.
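The difference between the anchors is easy to see on an in-memory string rather than a file. This is a Python 3 sketch; the three-line text is a made-up stand-in for file contents.

```python
import re

text = 'line one 1\nline two\nline three 3'

# Without MULTILINE, $ only anchors at the end of the whole string.
plain_dollar = re.findall(r'\d$', text)

# With MULTILINE, $ also anchors right before each newline, and ^ right after it.
multi_dollar = re.findall(r'\d$', text, re.MULTILINE)
multi_caret = re.findall(r'^\w+', text, re.MULTILINE)

# \Z stays anchored at the end of the whole string even with MULTILINE.
multi_Z = re.findall(r'\d\Z', text, re.MULTILINE)

# Requiring an explicit newline misses the digit at the very end of the text.
newline_only = re.findall(r'\d\n', text)

print(plain_dollar, multi_dollar, multi_caret, multi_Z, newline_only)
```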
9.7.9 Regular Expression Objects
A regular expression object r has the following read-only attributes detailing how r was built (by function compile of module re, covered later in this chapter):
flags
The flags argument passed to compile, or 0 when flags is omitted
groupindex
A dictionary whose keys are group names as defined by elements (?P<id>); the corresponding values are the named groups' numbers
pattern
The pattern string from which r is compiled
These attributes make it easy to get back from a compiled regular expression object to its pattern string and flags, so you never have to store those separately.
A regular expression object r also supplies methods to locate matches for r's regular expression within a string, as well as to perform substitutions on such matches. Matches are generally represented by special objects, covered in the later Section 9.7.10.
findall 
r.findall(s)
 
When r has no groups, findall returns a list of strings, each a substring of s that is a non-overlapping match with r. For example, here's how to print out all words in a file, one per line:
import re
reword = re.compile(r'\w+')
for aword in reword.findall(open('afile.txt').read()):
    print(aword)
When r has one group, findall also returns a list of strings, but each is the substring of s matching r's group. For example, if you want to print only words that are followed by whitespace (not punctuation), you need to change only one statement in the previous example:
reword = re.compile(r'(\w+)\s')
When r has n groups (where n is greater than 1), findall returns a list of tuples, one per non-overlapping match with r. Each tuple has n items, one per group of r, the substring of s matching the group. For example, here's how to print the first and last word of each line that has at least two words:
import re
first_last = re.compile(r'^\W*(\w+)\b.*\b(\w+)\W*$',
                                      re.MULTILINE)
for first, last in first_last.findall(open('afile.txt').read()):
    print(first, last)
match 
r.match(s,start=0,end=sys.maxint)
 
Returns an appropriate match object when a substring of s, starting at index start and not reaching as far as index end, matches r. Otherwise, match returns None. Note that match is implicitly anchored at the starting position start in s. To search for a match with r through s, from start onwards, call r.search, not r.match. For example, here's how to print all lines in a file that start with digits:
import re
digs = re.compile(r'\d+')
for line in open('afile.txt'):
    if digs.match(line): print(line, end='')
search 
r.search(s,start=0,end=sys.maxint)
 
Returns an appropriate match object for the leftmost substring of s, starting not before index start and not reaching as far as index end, that matches r. When no such substring exists, search returns None. For example, to print all lines containing digits, one simple approach is as follows:
import re
digs = re.compile(r'\d+')
for line in open('afile.txt'):
    if digs.search(line): print(line, end='')
split 
r.split(s,maxsplit=0)
 
Returns a list L of the splits of s by r (i.e., the substrings of s that are separated by non-overlapping, non-empty matches with r). For example, to eliminate all occurrences of substring 'hello' from a string, in any mix of lowercase and uppercase letters, one way is:
import re
rehello = re.compile(r'hello', re.IGNORECASE)
astring = ''.join(rehello.split(astring))
When r has n groups, n more items are interleaved in L between each pair of splits. Each of the n extra items is the substring of s matching r's corresponding group in that match, or None if that group did not participate in the match. For example, here's one way to remove whitespace only when it occurs between a colon and a digit:
import re
re_col_ws_dig = re.compile(r'(:)\s+(\d)')
astring = ''.join(re_col_ws_dig.split(astring))
If maxsplit is greater than 0, at most maxsplit splits are in L, each followed by n items as above, while the trailing substring of s after maxsplit matches of r, if any, is L's last item. For example, to remove only the first occurrence of substring 'hello' rather than all of them, change the last statement in the first example above to:
astring = ''.join(rehello.split(astring, 1))
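To see exactly what split returns in each of these cases, it helps to print the list before joining it. The following Python 3 sketch uses invented input strings.

```python
import re

rehello = re.compile(r'hello', re.IGNORECASE)

# No groups: only the pieces between matches are returned.
pieces = rehello.split('say Hello and HELLO again')

# maxsplit=1: split once, the rest of the string is the last item.
first_only = rehello.split('say Hello and HELLO again', 1)

# Two groups: the text captured by each group is interleaved between the pieces,
# so joining the list drops only the whitespace matched outside the groups.
re_col_ws_dig = re.compile(r'(:)\s+(\d)')
with_groups = re_col_ws_dig.split('a:  1 b: x c:\t2')
rejoined = ''.join(with_groups)

print(pieces)
print(first_only)
print(with_groups)
print(rejoined)
```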
sub 
r.sub(repl,s,count=0)
 
Returns a copy of s where non-overlapping matches with r are replaced by repl, which can be either a string or a callable object, such as a function. An empty match is replaced only when not adjacent to the previous match. When count is greater than 0, only the first count matches of r within s are replaced. When count equals 0, all matches of r within s are replaced. For example, here's another way to remove only the first occurrence of substring 'hello' in any mix of cases:
import re
rehello = re.compile(r'hello', re.IGNORECASE)
astring = rehello.sub('', astring, 1)
Without the final 1 argument to method sub, this example would remove all occurrences of 'hello'.
When repl is a callable object, repl must accept a single argument (a match object) and return a string to use as the replacement for the match. In this case, sub calls repl, with a suitable match-object argument, for each match with r that sub is replacing. For example, to uppercase all occurrences of words starting with 'h' and ending with 'o' in any mix of cases, you can use the following:
import re
h_word = re.compile(r'\bh\w+o\b', re.IGNORECASE)
def up(mo): return mo.group(0).upper()
astring = h_word.sub(up, astring)
Method sub is a good way to get a callback to a callable you supply for every non-overlapping match of r in s, without an explicit loop, even when you don't need to perform any substitution. The following example shows this by using the sub method to build a function that works just like method findall for a regular expression without groups:
import re
def findall(r, s):
    result = [  ]
    def foundOne(mo): result.append(mo.group(  ))
    r.sub(foundOne, s)
    return result
The example needs Python 2.2, not just because it uses lexically nested scopes, but because in Python 2.2 re tolerates repl returning None and treats it as if it returned '', while in Python 2.1 re was more pedantic and insisted on repl returning a string.
When repl is a string, sub uses repl itself as the replacement, except that it expands back references. A back reference is a substring of repl of the form \g<id>, where id is the name of a group in r (as established by syntax (?P<id>) in r's pattern string), or \dd, where dd is one or two digits, taken as a group number. Each back reference, whether named or numbered, is replaced with the substring of s matching the group of r that the back reference indicates. For example, here's how to enclose every word in braces:
import re
grouped_word = re.compile(r'(\w+)')
astring = grouped_word.sub(r'{\1}', astring)
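The three kinds of replacement (numbered back reference, named back reference, and callable) can be compared in one place. This is a Python 3 sketch; the date pattern, the helper shout, and the sample strings are illustrative assumptions, not part of the original text.

```python
import re

# Numbered back reference: wrap every word in braces.
grouped_word = re.compile(r'(\w+)')
braced = grouped_word.sub(r'{\1}', 'hello big world')

# Named back references: reorder the parts of a date.
date = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
reordered = date.sub(r'\g<day>/\g<month>/\g<year>', 'released 2019-12-01')

# Callable replacement: receives the match object, returns the new text.
def shout(mo):
    return mo.group(0).upper()

h_word = re.compile(r'\bh\w+o\b', re.IGNORECASE)
shouted = h_word.sub(shout, 'Hello, hero of Hanoi')

print(braced)
print(reordered)
print(shouted)
```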
subn 
r.subn(repl,s,count=0)
 
subn is the same as sub, except that subn returns a pair (new_string, n) where n is the number of substitutions that subn has performed. For example, to count the number of occurrences of substring 'hello' in any mix of cases, one way is:
import re
rehello = re.compile(r'hello', re.IGNORECASE)
junk, count = rehello.subn('', astring)
print('Found', count, 'occurrences of "hello"')
9.7.10 Match Objects
Match objects are created and returned by methods match and search of a regular expression object. They are also implicitly created by methods sub and subn when argument repl is callable, since in that case a suitable match object is passed as the actual argument on each call to repl. A match object m supplies the following attributes detailing how m was created:
pos
The start argument that was passed to search or match (i.e., the index into s where the search for a match began)
endpos
The end argument that was passed to search or match (i.e., the index into s before which the matching substring of s had to end)
lastgroup
The name of the last-matched group (None if the last-matched group has no name, or if no group participated in the match)
lastindex
The integer index (1 and up) of the last-matched group (None if no group participated in the match)
re
The regular expression object r whose method created m
string
The string s passed to match, search, sub, or subn
A match object m also supplies several methods.
end, span, start 
m.end(groupid=0)
m.span(groupid=0)
m.start(groupid=0)
 
These methods return the delimiting indices, within m.string, of the substring matching the group identified by groupid, where groupid can be a group number or name. When the matching substring is m.string[i:j], m.start returns i, m.end returns j, and m.span returns (i, j). When the group did not participate in the match, i and j are -1.
expand 
m.expand(s)
 
Returns a copy of s where escape sequences and back references are replaced in the same way as for method r.sub, covered in the previous section.
group 
m.group(groupid=0,*groupids)
 
When called with a single argument groupid (a group number or name), group returns the substring matching the group identified by groupid, or None if that group did not participate in the match. The common idiom m.group( ), also spelled m.group(0), returns the whole matched substring, since group number 0 implicitly means the whole regular expression.
When group is called with multiple arguments, each argument must be a group number or name. group then returns a tuple with one item per argument, the substring matching the corresponding group, or None if that group did not participate in the match.
groups 
m.groups(default=None)
 
Returns a tuple with one item per group in r. Each item is the substring matching the corresponding group, or default if that group did not participate in the match.
groupdict 
m.groupdict(default=None)
 
Returns a dictionary whose keys are the names of all named groups in r. The value for each name is the substring matching the corresponding group, or default if that group did not participate in the match.
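The attributes and methods of a match object can be explored on a small example. The sketch below is for Python 3; the key=value pattern and the input strings are hypothetical, and the second group is made optional so that it can fail to participate.

```python
import re

pair = re.compile(r'(?P<key>\w+)=(?P<value>\w+)?')

# Both groups participate in this match.
m = pair.search('  color=red;')
whole = m.group()
span = m.span()
key_span = (m.start('key'), m.end('key'))
both = m.group('key', 'value')
as_dict = m.groupdict()
last = (m.lastgroup, m.lastindex)

# Here the optional 'value' group does not participate.
empty = pair.search('size=;')
missing = empty.group('value')
missing_span = empty.span('value')
defaults = empty.groups('n/a')

print(whole, span, key_span, both, as_dict, last)
print(missing, missing_span, defaults)
```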
9.7.11 Functions of Module re
The re module supplies the attributes listed in the earlier Section 9.7.6. It also provides a function that corresponds to each method of a regular expression object (findall, match, search, split, sub, and subn), each with an additional first argument, a pattern string that the function implicitly compiles into a regular expression object. It's generally preferable to compile pattern strings into regular expression objects explicitly and call the regular expression object's methods, but sometimes, for a one-off use of a regular expression pattern, calling functions of module re can be slightly handier. For example, to count the number of occurrences of substring 'hello' in any mix of cases, one function-based way is:
import re
junk, count = re.subn(r'(?i)hello', '', astring)
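print('Found', count, 'occurrences of "hello"')
In cases such as this one, the example encodes the regular expression option (here, case insensitivity) as a pattern element (here, (?i)). In the Python versions this text was written for, that was required because subn and similar functions of module re did not accept a flags argument; current Python versions do accept an optional flags argument there (for example, re.subn(r'hello', '', astring, flags=re.IGNORECASE)).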
Module re also supplies error, the class of exceptions raised upon errors (generally, errors in the syntax of a pattern string), and two additional functions.
compile 
compile(pattern,flags=0)
 
Creates and returns a regular expression object, parsing string pattern as per the syntax covered in Section 9.7.1, and using integer flags as in Section 9.7.6, both earlier in this chapter.
escape 
escape(s)
 
Returns a copy of string s where each non-alphanumeric character is escaped (i.e., preceded by a backslash \). This is handy when you need to match string s literally as part (or all) of a regular expression pattern string.
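A minimal Python 3 sketch of why escaping matters follows; the string literal is an invented example. Exactly which characters receive a backslash differs between Python versions, so the sketch checks the matching behavior rather than the escaped text itself.

```python
import re

literal = '1+1=2 (really?)'

# Used as a pattern directly, '+', '(' and '?' act as metacharacters,
# so the string does not even match itself.
unescaped_hit = re.search(literal, literal)

# After re.escape, the same text matches literally inside a larger string.
escaped_hit = re.search(re.escape(literal), 'so 1+1=2 (really?) yes')

print(unescaped_hit)
print(escaped_hit.group())
```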
Searching for Elvis

Suppose you spend all your free time scanning documents looking for evidence that Elvis is still alive. You could search with the following regular expression:

1. elvis Find elvis

This is a perfectly valid regular expression that searches for an exact sequence of characters. In .NET, you can easily set options to ignore the case of characters, so this expression will match "Elvis", "ELVIS", or "eLvIs". Unfortunately, it will also match the last five letters of the word "pelvis". We can improve the expression as follows:

2. \belvis\b Find elvis as a whole word

Now things are getting a little more interesting. The "\b" is a special code that means, "match the position at the beginning or end of any word". This expression will only match complete words spelled "elvis" with any combination of lower case or capital letters.

Suppose you want to find all lines in which the word "elvis" is followed by the word "alive." The period or dot "." is a special code that matches any character other than a newline. The asterisk "*" means repeat the previous term as many times as necessary to guarantee a match. Thus, ".*" means "match any number of characters other than newline". It is now a simple matter to build an expression that means "search for the word 'elvis' followed on the same line by the word 'alive'."

3. \belvis\b.*\balive\b Find text with "elvis" followed by "alive"

With just a few special characters we are beginning to build powerful regular expressions and they are already becoming hard for us humans to read.
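The three Elvis expressions can be tried in Python as well. This is only a sketch: the sample sentences are invented, and re.IGNORECASE stands in for the .NET ignore-case option mentioned above.

```python
import re

text = 'Elvis was seen. My pelvis hurts. elvis is alive!'   # made-up sample

# 1. Plain "elvis" also hits the tail of "pelvis".
plain = re.findall(r'elvis', text, re.IGNORECASE)

# 2. Word boundaries keep only the whole word.
whole = re.findall(r'\belvis\b', text, re.IGNORECASE)

# 3. "elvis" followed later on the same line by "alive".
same_line = re.findall(r'\belvis\b.*\balive\b', text, re.IGNORECASE)

# "." does not cross a newline, so this search finds nothing.
split_lines = re.search(r'\belvis\b.*\balive\b', 'elvis left\nalive')

print(plain)
print(whole)
print(same_line)
```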

Let's try another example.

  str = 'an example word:cat!!'
  match = re.search(r'word:\w\w\w', str)
  # If-statement after search() tests if it succeeded
  if match:
    print 'found', match.group() ## 'found word:cat'
  else:
    print 'did not find'


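    In JavaScript, you can create a regular expression object with the RegExp() constructor. More commonly, you define one with literal syntax: just as a string is written between quotation marks, the text of a regular expression is written between a pair of slashes (/).
   
    Literal Characters
   
    Characters that begin with a backslash have special meanings.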

Determining the Validity of Phone Numbers

Suppose your web page collects a customer's seven-digit phone number and you want to verify that the phone number is in the correct format, "xxx-xxxx", where each "x" is a digit. The following expression will search through text looking for such a string:

4. \b\d\d\d-\d\d\d\d Find seven-digit phone number

Each "\d" means "match any single digit". The "-" has no special meaning and is interpreted literally, matching a hyphen. To avoid the annoying repetition, we can use a shorthand notation that means the same thing:

5. \b\d{3}-\d{4} Find seven-digit phone number a better way

The "{3}" following the "\d" means "repeat the preceding character three times".
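As a quick check in Python, the sketch below runs examples 4 and 5 against an invented sample string, and also shows what happens when the leading "\b" is dropped.

```python
import re

text = 'Call 555-1212 or write to ZIP 98052-1234.'   # made-up sample

long_form = re.findall(r'\b\d\d\d-\d\d\d\d', text)    # example 4
short_form = re.findall(r'\b\d{3}-\d{4}', text)       # example 5

# Without \b, the tail of the ZIP code looks like a phone number.
no_boundary = re.findall(r'\d{3}-\d{4}', text)

print(long_form)
print(short_form)
print(no_boundary)
```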

Let's learn how to test this expression.

The code match = re.search(pat, str) stores the search result in a variable named "match". Then the if-statement tests the match -- if true the search succeeded and match.group() is the matching text (e.g. 'word:cat'). Otherwise if the match is false (None to be more specific), then the search did not succeed, and there is no matching text.
The 'r' at the start of the pattern string designates a python "raw" string which passes through backslashes without change which is very handy for regular expressions (Java needs this feature badly!). I recommend that you always write pattern strings with the 'r' just as a habit.


Character               Matches
Alphanumeric character  Itself
\0                      The NUL character (\u0000)
\t                      Tab (\u0009)
\n                      Newline (\u000A)
\v                      Vertical tab (\u000B)
\f                      Form feed (\u000C)
\r                      Carriage return (\u000D)
\xnn                    The Latin character specified by the hexadecimal number nn; for example, \x0A is the same as \n
\uxxxx                  The Unicode character specified by the hexadecimal number xxxx; for example, \u0009 is the same as \t
\cX                     The control character ^X; for example, \cJ is equivalent to the newline character \n

Expresso

If you don't find regular expressions hard to read you are probably an idiot savant or a visitor from another planet. The syntax can be imposing for anyone, including those who use regular expressions frequently. This makes errors common and creates a need for a simple tool for building and testing expressions. Many such tools exist, but I'm partial to my own, Expresso, originally launched on the CodeProject. Version 2.0 is shown here. For later versions, check the Ultrapico website.

To get started, install Expresso and select the Tutorial from the Windows Program menu. Each example can be selected using the tab labeled "Expression Library".


Figure 1. Expresso running example 5

Start by selecting the first example, "1. Find Elvis". Click Run Match and look at the TreeView on the right. Note there are several matches. Click on each to show the location of the match in the sample text. Run the second and third examples, noting that the word "pelvis" no longer matches. Finally, run the fourth and fifth examples; both should match the same numbers in the text. Try removing the initial "\b" and note that part of a Zip Code matched the format for a phone number.

Note: match.group() returns the matched text as a string (type str).


    A number of punctuation characters also have special meanings:
       ^ $ . * + ? = ! : | \ / ( ) [ ] { }
   
    Character Classes

Basics of .NET Regular Expressions

Let's explore some of the basics of regular expressions in .NET.

Basic Patterns

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

  • a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
  • . (a period) -- matches any single character except newline '\n'
  • \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
  • \b -- boundary between word and non-word
  • \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
  • \t, \n, \r -- tab, newline, return
  • \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
  • ^ = start, $ = end -- match the start or end of the string
  • \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.

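    Individual literal characters can be combined into a character class by placing them within square brackets. A character class matches any one of the characters it contains, and only one character. For example, /[abc]/ matches any one of the letters a, b, or c. The caret ^ negates a class: /[^abc]/ matches any one character other than a, b, or c. A hyphen indicates a range of characters: /[a-z]/ matches any one lowercase letter from a to z.
    Because certain character classes are used so often, JavaScript defines special escape sequences to represent them.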

Special Characters

You should get to know a few characters with special meaning. You already met "\b", ".", "*", and "\d". To match any whitespace characters, like spaces, tabs, and newlines, use "\s". Similarly, "\w" matches any alphanumeric character.

Let's try a few more examples:

6. \ba\w*\b Find words that start with the letter a

This works by searching for the beginning of a word (\b), then the letter "a", then any number of repetitions of alphanumeric characters (\w*), then the end of a word (\b).

7. \d+ Find repeated strings of digits

Here, the "+" is similar to "*", except it requires at least one repetition.

8. \b\w{6}\b Find six letter words

Try these in Expresso and start experimenting by inventing your own expressions. Here is a table of some of the characters with special meaning:

 

.   Match any character except newline
\w  Match any alphanumeric character
\s  Match any whitespace character
\d  Match any digit
\b  Match the beginning or end of a word
^   Match the beginning of the string
$   Match the end of the string

 

Table 1. Commonly used special characters for regular expressions

Basic Features

The basic rules of regular expression search for a pattern within a string are:

  • The search proceeds through the string from start to end, stopping at the first match found
  • All of the pattern must be matched, but not all of the string
  • If match = re.search(pat, str) is successful, match is not None and in particular match.group() is the matching text


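Character  Matches
[...]      Any one character between the brackets.
[^...]     Any one character not between the brackets.
.          Any character except newline or another Unicode line terminator.
\w         Any ASCII word character. Equivalent to [a-zA-Z0-9_].
\W         Any character that is not an ASCII word character. Equivalent to [^a-zA-Z0-9_].
\s         Any Unicode whitespace character.
\S         Any character that is not Unicode whitespace. Note that \w and \S are not the same thing.
\d         Any ASCII digit. Equivalent to [0-9].
\D         Any character other than an ASCII digit. Equivalent to [^0-9].
[\b]       A literal backspace (special case).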

In the beginning

The special characters "^" and "$" are used when looking for something that must start at the beginning of the text and/or end at the end of the text. This is especially useful for validating input in which the entire text must match a pattern. For example, to validate a seven-digit phone number, you might use:

9. ^\d{3}-\d{4}$ Validate a seven-digit phone number

This is the same as example (5), but forced to fill the whole text string, with nothing else before or after the matched text. By setting the "Multiline" option in .NET, "^" and "$" change their meaning to match the beginning and end of a single line of text, rather than the entire text string. The Expresso example uses this option.
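In Python, the flag re.MULTILINE plays the role that the "Multiline" option plays in .NET. The sketch below uses invented input strings to show the same pattern with and without it.

```python
import re

pattern = r'^\d{3}-\d{4}$'

# The whole string must be a phone number.
whole_ok = re.search(pattern, '555-1212')
whole_bad = re.search(pattern, 'tel 555-1212')

# Three lines of text; only the middle one is a phone number.
lines = 'abc\n555-1212\nxyz'
single = re.search(pattern, lines)                 # ^ and $ anchor to the whole text
multi = re.search(pattern, lines, re.MULTILINE)    # ^ and $ anchor to each line

print(bool(whole_ok), bool(whole_bad), single, multi.group())
```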

Repetition

Things get more interesting when you use + and * to specify repetition in the pattern

  • + -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
  • * -- 0 or more occurrences of the pattern to its left
  • ? -- match 0 or 1 occurrences of the pattern to its left

Leftmost & Largest

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").

  ## i+ = one or more i's, as many as possible.
  match = re.search(r'pi+', 'piiig')  # found, match.group() == "piii"

  ## Finds the first/leftmost solution, and within it drives the +
  ## as far as possible (aka 'leftmost and largest').
  ## In this example, note that it does not get to the second set of i's.
  match = re.search(r'i+', 'piigiiii')  # found, match.group() == "ii"

  ## \s* = zero or more whitespace chars
  ## Here look for 3 digits, possibly separated by whitespace.
  match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')  # found, match.group() == "1 2   3"
  match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx')  # found, match.group() == "12  3"
  match = re.search(r'\d\s*\d\s*\d', 'xx123xx')  # found, match.group() == "123"

  ## ^ = matches the start of string, so this fails:
  match = re.search(r'^b\w+', 'foobar')  # not found, match == None
  ## but without the ^ it succeeds:
  match = re.search(r'b\w+', 'foobar')  # found, match.group() == "bar"

  ## Search for pattern 'iii' in string 'piiig'.
  ## All of the pattern must match, but it may appear anywhere.
  ## On success, match.group() is matched text.
  match = re.search(r'iii', 'piiig')  # found, match.group() == "iii"
  match = re.search(r'igs', 'piiig')  # not found, match == None

  ## . = any char but \n
  match = re.search(r'..g', 'piiig')  # found, match.group() == "iig"

  ## \d = digit char, \w = word char
  match = re.search(r'\d\d\d', 'p123g')  # found, match.group() == "123"
  match = re.search(r'\w\w\w', '@@abcd!!')  # found, match.group() == "abc"

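    Escape sequences can be used within square brackets. Note the special case of \b: inside square brackets it means the backspace character, but used directly outside square brackets it matches a word boundary.

    Repetition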

Escaped characters

A problem occurs if you actually want to match one of the special characters, like "^" or "$". Use the backslash to remove the special meaning. Thus, "\^", "\.", and "\\" match the literal characters "^", ".", and "\", respectively.

Emails Example

Suppose you want to find the email address inside the string 'purple alice-b@google.com monkey dishwasher'. We'll use this as a running example to demonstrate more regular expression features. Here's an attempt using the pattern r'\w+@\w+':

  str = 'purple alice-b@google.com monkey dishwasher'
  match = re.search(r'\w+@\w+', str)
  if match:
    print match.group()  ## 'b@google'

The search does not get the whole email address in this case because the \w does not match the '-' or '.' in the address. We'll fix this using the regular expression features below.


Repetitions

You've seen that "{3}" and "*" can be used to indicate repetition of a single character. Later, you'll see how the same syntax can be used to repeat entire subexpressions. There are several other ways to specify a repetition, as shown in this table:

 

*      Repeat any number of times
+      Repeat one or more times
?      Repeat zero or one time
{n}    Repeat n times
{n,m}  Repeat at least n, but no more than m times
{n,}   Repeat at least n times

 

Table 2. Commonly used quantifiers

Let's try a few more examples:

10. \b\w{5,6}\b Find all five and six letter words

11. \b\d{3}\s\d{3}-\d{4} Find ten digit phone numbers

12. \d{3}-\d{2}-\d{4} Social security number

13. ^\w* The first word in the line or in the text

Try the last example with and without setting the "Multiline" option, which changes the meaning of "^".

Square Brackets

Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot. For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:

  match = re.search(r'[\w.-]+@[\w.-]+', str)
  if match:
    print match.group()  ## 'alice-b@google.com'

You can also use a dash to indicate a range, so

  1. [a-z] matches all lowercase letters.
  2. To use a dash without indicating a range, put the dash last, e.g. [abc-].
  3. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.

With the regular expression syntax you've learned so far, you can describe a two-digit number as /\d\d/ and a four-digit number as /\d\d\d\d/. But you don't have any way to describe, for example, a number that can have any number of digits or a string of three letters followed by an optional digit. These more complex patterns use regular-expression syntax that specifies how many times an element of a regular expression may be repeated.

Character Classes

It is simple to find alphanumerics, digits, and whitespace, but what if we want to find anything from some other set of characters? This is easily done by listing the desired characters within square brackets. Thus, "[aeiou]" matches any vowel and "[.?!]" matches the punctuation at the end of a sentence. In this example, notice that the "." and "?" lose their special meanings within square brackets and are interpreted literally. We can also specify a range of characters, so "[a-z0-9]" means, "match any lowercase letter of the alphabet, or any digit".

Let's try a more complicated expression that searches for telephone numbers.

14. \(?\d{3}[) ]\s?\d{3}[- ]\d{4} A ten digit phone number

This expression will find phone numbers in several formats, like "(800) 325-3535" or "650 555 1212". The "\(?" searches for zero or one left parentheses, "[) ]" searches for a right parenthesis or a space. The "\s?" searches for zero or one whitespace characters. Unfortunately, it will also find cases like "650) 555-1212" in which the parenthesis is not balanced. Below, you'll see how to use alternatives to eliminate this problem.

Group Extraction

The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match, instead they establish logical "groups" inside of the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.

  str = 'purple alice-b@google.com monkey dishwasher'
  match = re.search(r'([\w.-]+)@([\w.-]+)', str)
  if match:
    print match.group()   ## 'alice-b@google.com' (the whole match)
    print match.group(1)  ## 'alice-b' (the username, group 1)
    print match.group(2)  ## 'google.com' (the host, group 2)

A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.

The characters that specify repetition always follow the pattern to which they are being applied. Because certain types of repetition are quite commonly used, there are special characters to represent these cases. For example, + matches one or more occurrences of the previous pattern. Table 11-3 summarizes the repetition syntax.

Negation

Sometimes we need to search for a character that is NOT a member of an easily defined class of characters. The following table shows how this can be specified.

 

\W        Match any character that is NOT alphanumeric
\S        Match any character that is NOT whitespace
\D        Match any character that is NOT a digit
\B        Match a position that is NOT the beginning or end of a word
[^x]      Match any character that is NOT x
[^aeiou]  Match any character that is NOT one of the characters aeiou

 

Table 3. How to specify what you don't want

15. \S+ All strings that do not contain whitespace characters

Later, we'll see how to use "lookahead" and "lookbehind" to search for the absence of more complex patterns.

findall

findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds all the matches and returns them as a list of strings, with each string representing one match.

  ## Suppose we have a text with many email addresses
  str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

  ## Here re.findall() returns a list of all the found email strings
  emails = re.findall(r'[\w.-]+@[\w.-]+', str) ## ['alice@google.com', 'bob@abc.com']
  for email in emails:
    # do something with each found email string
    print email
Table 11-3. Regular expression repetition characters

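Character  Meaning
{n,m}      Match the previous item at least n times but no more than m times.
{n,}       Match the previous item n or more times.
{n}        Match exactly n occurrences of the previous item.
?          Match zero or one occurrences of the previous item. That is, the previous item is optional. Equivalent to {0,1}.
+          Match one or more occurrences of the previous item. Equivalent to {1,}.
*          Match zero or more occurrences of the previous item. Equivalent to {0,}.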

Here are some examples:

/\d{2,4}/     // Match between two and four digits
/\w{3}\d?/    // Match exactly three word characters and an optional digit
/\s+java\s+/  // Match "java" with one or more spaces before and after
/[^"]*/       // Match zero or more non-quote characters

Be careful when using the * and ? repetition characters. Since these characters may match zero instances of whatever precedes them, they are allowed to match nothing. For example, the regular expression /a*/ actually matches the string "bbbb" because the string contains zero occurrences of the letter a!
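Python's re module behaves the same way here, so the caution can be checked with a short sketch (the pattern is the JavaScript one without its slashes):

```python
import re

# a* is allowed to match zero a's, so it "succeeds" with an empty match.
star_match = re.search(r'a*', 'bbbb')
print(repr(star_match.group()), star_match.span())

# a+ needs at least one a, so it finds nothing.
plus_match = re.search(r'a+', 'bbbb')
print(plus_match)
```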

    Alternation, Grouping, and References

The regular-expression grammar includes special characters for specifying alternatives, grouping subexpressions, and referring to previous subexpressions. The | character separates alternatives. For example, /ab|cd|ef/ matches the string "ab" or the string "cd" or the string "ef". And /\d{3}|[a-z]{4}/ matches either three digits or four lowercase letters.

Note that alternatives are considered left to right until a match is found. If the left alternative matches, the right alternative is ignored, even if it would have produced a "better" match. Thus, when the pattern /a|ab/ is applied to the string "ab", it matches only the first letter.
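The left-to-right rule also holds in Python's re module, which makes it easy to try. A minimal sketch:

```python
import re

# The left alternative "a" matches first, so "ab" is never tried.
short_first = re.search(r'a|ab', 'ab').group()

# Putting the longer alternative first changes the result.
long_first = re.search(r'ab|a', 'ab').group()

print(short_first)
print(long_first)
```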

Parentheses have several purposes in regular expressions. One purpose is to group separate items into a single subexpression so that the items can be treated as a single unit by |, *, +, ?, and so on. For example, /java(script)?/ matches "java" followed by the optional "script". And /(ab|cd)+|ef/ matches either the string "ef" or one or more repetitions of either of the strings "ab" or "cd".

Another purpose of parentheses in regular expressions is to define subpatterns within the complete pattern. When a regular expression is successfully matched against a target string, it is possible to extract the portions of the target string that matched any particular parenthesized subpattern. (You'll see how these matching substrings are obtained later in the chapter.) For example, suppose you are looking for one or more lowercase letters followed by one or more digits. You might use the pattern /[a-z]+\d+/. But suppose you only really care about the digits at the end of each match. If you put that part of the pattern in parentheses (/[a-z]+(\d+)/), you can extract the digits from any matches you find, as explained later.

A related use of parenthesized subexpressions is to allow you to refer back to a subexpression later in the same regular expression. This is done by following a \ character by a digit or digits. The digits refer to the position of the parenthesized subexpression within the regular expression. For example, \1 refers back to the first subexpression, and \3 refers to the third. Note that, because subexpressions can be nested within others, it is the position of the left parenthesis that is counted. In the following regular expression, for example, the nested subexpression ([Ss]cript) is referred to as \2:

/([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/

A reference to a previous subexpression of a regular expression does not refer to the pattern for that subexpression but rather to the text that matched the pattern. Thus, references can be used to enforce a constraint that separate portions of a string contain exactly the same characters. For example, the following regular expression matches zero or more characters within single or double quotes. However, it does not require the opening and closing quotes to match (i.e., both single quotes or both double quotes):

/['"][^'"]*['"]/

To require the quotes to match, use a reference:

/(['"])[^'"]*\1/

The \1 matches whatever the first parenthesized subexpression matched. In this example, it enforces the constraint that the closing quote match the opening quote. This regular expression does not allow single quotes within double-quoted strings or vice versa. It is not legal to use a reference within a character class, so you cannot write:

/(['"])[^\1]*\1/

Later in this chapter, you'll see that this kind of reference to a parenthesized subexpression is a powerful feature of regular-expression search-and-replace operations.
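The quote-matching patterns above carry over to Python's re module unchanged apart from the delimiters. The sketch below, with invented test strings, compares the pattern without a reference to the one with \1.

```python
import re

# Without a reference: any quote may close any quote.
loose = re.compile(r'''['"][^'"]*['"]''')
# With \1: the closing quote must equal the opening quote.
strict = re.compile(r'''(['"])[^'"]*\1''')

mismatched = 'say "hi\' there'      # opens with " but closes with '
balanced = 'say "hi" there'

loose_hit = loose.search(mismatched).group()
strict_miss = strict.search(mismatched)
strict_hit = strict.search(balanced).group()

print(loose_hit)
print(strict_miss)
print(strict_hit)
```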

In JavaScript 1.5 (but not JavaScript 1.2), it is possible to group items in a regular expression without creating a numbered reference to those items. Instead of simply grouping the items within ( and ), begin the group with (?: and end it with ). Consider the following pattern, for example:

/([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/

Here, the subexpression (?:[Ss]cript) is used simply for grouping, so the ? repetition character can be applied to the group. These modified parentheses do not produce a reference, so in this regular expression, \2 refers to the text matched by (fun\w*).
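Python supports the same (?:...) syntax, so the effect on group numbering can be seen directly. In this sketch the sample sentence is made up; groups() lists the numbered groups in order.

```python
import re

text = 'JavaScript is fun'

capturing = re.search(r'([Jj]ava([Ss]cript)?)\sis\s(fun\w*)', text)
non_capturing = re.search(r'([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)', text)

# With (...), "Script" is group 2 and "fun" is group 3.
print(capturing.groups())
# With (?:...), "Script" is not numbered, so "fun" becomes group 2.
print(non_capturing.groups())
```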

Table 11-4 summarizes the regular-expression alternation, grouping, and referencing operators.

Alternatives

To select between several alternatives, allowing a match if either one is satisfied, use the pipe "|" symbol to separate the alternatives. For example, Zip Codes come in two flavors, one with 5 digits, the other with 9 digits and a hyphen. We can find either with this expression:

16. \b\d{5}-\d{4}\b|\b\d{5}\b Five and nine digit Zip Codes

When using alternatives, the order is important since the matching algorithm will attempt to match the leftmost alternative first. If the order is reversed in this example, the expression will only find the 5 digit Zip Codes and fail to find the 9 digit ones. We can use alternatives to improve the expression for ten digit phone numbers, allowing the area code to appear either delimited by whitespace or parentheses:

17. (\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4} Ten digit phone numbers, a better way
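Both points can be checked with a short Python sketch. The sample text is invented; phone_loose is example 14 and phone_strict is example 17.

```python
import re

text = 'ZIP 98052-1234 and 10001'   # made-up sample

# Example 16: the nine-digit alternative comes first.
zip_good = re.findall(r'\b\d{5}-\d{4}\b|\b\d{5}\b', text)
# Reversed order: the five-digit alternative wins and the "-1234" is lost.
zip_reversed = re.findall(r'\b\d{5}\b|\b\d{5}-\d{4}\b', text)

phone_loose = re.compile(r'\(?\d{3}[) ]\s?\d{3}[- ]\d{4}')
phone_strict = re.compile(r'(\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4}')

unbalanced = '650) 555-1212'
loose_hit = phone_loose.search(unbalanced)      # accepts the stray ")"
strict_hit = phone_strict.search(unbalanced)    # rejects it

print(zip_good)
print(zip_reversed)
print(loose_hit.group(), strict_hit)
```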

findall With Files

For files, you may be in the habit of writing a loop to iterate over the lines of the file, and you could then call findall() on each line. Instead, let findall() do the iteration for you -- much better! Just feed the whole file text into findall() and let it return a list of all the matches in a single step (recall that f.read() returns the whole text of a file in a single string):

  # Open file
  f = open('test.txt', 'r')
  # Feed the file text into findall(); it returns a list of all the found strings
  strings = re.findall(r'some pattern', f.read())


Table 11-4. Regular expression alternation, grouping, and reference characters

Character  Meaning
|          Alternation. Match either the subexpression to the left or the subexpression to the right.
(...)      Grouping. Group items into a single unit that can be used with *, +, ?, |, and so on. Also remember the characters that match this group for use with later references.
(?:...)    Grouping only. Group items into a single unit, but do not remember the characters that match this group.
\n         Match the same characters that were matched when group number n was first matched. Groups are subexpressions within (possibly nested) parentheses. Group numbers are assigned by counting left parentheses from left to right. Groups formed with (?: are not numbered.

    Specifying Match Position

    Specifying where a match must begin and end is also important for precise matching.

   

As described earlier, many elements of a regular expression match a single character in a string. For example, \s matches a single character of whitespace. Other regular expression elements match the positions between characters, instead of actual characters. \b, for example, matches a word boundary: the boundary between a \w (ASCII word character) and a \W (nonword character), or the boundary between an ASCII word character and the beginning or end of a string.[*] Elements such as \b do not specify any characters to be used in a matched string; what they do specify, however, is legal positions at which a match can occur. Sometimes these elements are called regular-expression anchors because they anchor the pattern to a specific position in the search string. The most commonly used anchor elements are ^, which ties the pattern to the beginning of the string, and $, which anchors the pattern to the end of the string.

[*] Except within a character class (square brackets), where \b matches the backspace character.

For example, to match the word "JavaScript" on a line by itself, you can use the regular expression /^JavaScript$/. If you want to search for "Java" used as a word by itself (not as a prefix, as it is in "JavaScript"), you can try the pattern /\sJava\s/, which requires a space before and after the word. But there are two problems with this solution. First, it does not match "Java" if that word appears at the beginning or the end of a string, but only if it appears with space on either side. Second, when this pattern does find a match, the matched string it returns has leading and trailing spaces, which is not quite what's needed. So instead of matching actual space characters with \s, match (or anchor to) word boundaries with \b. The resulting expression is /\bJava\b/. The element \B anchors the match to a location that is not a word boundary. Thus, the pattern /\B[Ss]cript/ matches "JavaScript" and "postscript", but not "script" or "Scripting".

In JavaScript 1.5 (but not JavaScript 1.2), you can also use arbitrary regular expressions as anchor conditions. If you include an expression within (?= and ) characters, it is a lookahead assertion, and it specifies that the enclosed characters must match, without actually matching them. For example, to match the name of a common programming language, but only if it is followed by a colon, you could use /[Jj]ava([Ss]cript)?(?=:)/. This pattern matches the word "JavaScript" in "JavaScript: The Definitive Guide", but it does not match "Java" in "Java in a NutShell" because it is not followed by a colon.

If you instead introduce an assertion with (?!, it is a negative lookahead assertion, which specifies that the following characters must not match. For example, /Java(?!Script)([A-Z]\w*)/ matches "Java" followed by a capital letter and any number of additional ASCII word characters, as long as "Java" is not followed by "Script". It matches "JavaBeans" but not "Javanese", and it matches "JavaScrip" but not "JavaScript" or "JavaScripter".
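Python's re module accepts the same (?=...) and (?!...) syntax, so the two lookahead examples can be reproduced as a sketch (the variable names are ours):

```python
import re

# Positive lookahead: the colon must follow but is not part of the match.
before_colon = re.compile(r'[Jj]ava([Ss]cript)?(?=:)')
title_hit = before_colon.search('JavaScript: The Definitive Guide').group()
title_miss = before_colon.search('Java in a Nutshell')

# Negative lookahead: "Java" must not be followed by "Script".
not_script = re.compile(r'Java(?!Script)([A-Z]\w*)')
results = {}
for word in ['JavaBeans', 'Javanese', 'JavaScrip', 'JavaScript', 'JavaScripter']:
    results[word] = not_script.search(word) is not None

print(title_hit, title_miss)
print(results)
```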

Table 11-5 summarizes regular-expression anchors.

Grouping

Parentheses may be used to delimit a subexpression to allow repetition or other special treatment. For example:

18. (\d{1,3}\.){3}\d{1,3} A simple IP address finder

The first part of the expression searches for a one to three digit number followed by a literal period ".". This is enclosed in parentheses and repeated three times using the "{3}" quantifier, followed by the same expression without the trailing period.

Unfortunately, this example allows IP addresses with arbitrary one, two, or three digit numbers separated by periods even though a valid IP address cannot have numbers larger than 255. It would be nice to arithmetically compare a captured number N to enforce N<256, but this is not possible with regular expressions alone. The next example tests various alternatives based on the starting digits to guarantee the limited range of numbers by pattern matching. This shows that an expression can become cumbersome even when looking for a pattern that is simple to describe.

19. ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?) IP finder
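To see the difference between examples 18 and 19, both patterns can be run over a few candidate strings. The sketch below uses Python's re.fullmatch() so that the whole string has to be an address; the candidate list is invented for illustration.

```python
import re

loose = re.compile(r"(\d{1,3}\.){3}\d{1,3}")
strict = re.compile(
    r"((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)"
)

candidates = ["192.168.1.255", "999.999.999.999", "256.1.1.1"]

# fullmatch() succeeds only if the entire string fits the pattern.
loose_ok = [c for c in candidates if loose.fullmatch(c)]
strict_ok = [c for c in candidates if strict.fullmatch(c)]

# Without anchoring, the strict pattern still finds a valid-looking
# address inside an invalid one by skipping the leading "2".
partial = strict.search("256.1.1.1").group()
print(loose_ok, strict_ok, partial)
```

The last line is a reminder that a "finder" pattern used with search() may match part of a longer number, so anchoring or word boundaries are usually needed when validating input.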

findall and Groups

The parenthesis ( ) group mechanism can be combined with findall(). If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, findall() returns a list of tuples. Each tuple represents one match of the pattern, and inside the tuple is the group(1), group(2) .. data. So if 2 parenthesis groups are added to the email pattern, then findall() returns a list of tuples, each length 2 containing the username and host, e.g. ('alice', 'google.com').

  import re

  str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
  tuples = re.findall(r'([\w.-]+)@([\w.-]+)', str)
  print(tuples)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
  for tuple in tuples:
    print(tuple[0])  ## username
    print(tuple[1])  ## host

Once you have the list of tuples, you can loop over it to do some computation for each tuple. If the pattern includes no parenthesis, then findall() returns a list of found strings as in earlier examples. If the pattern includes a single set of parenthesis, then findall() returns a list of strings corresponding to that single group.

Obscure optional feature:
Sometimes the pattern contains paren ( ) groupings that you need for structure but do not want to extract. In that case, write the parens with a ?: at the start, e.g. (?: ), and that left paren will not count as a group result.
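The return shape of findall() for zero, one, and two groups, and the effect of a non-capturing (?: ) group, can be compared side by side. This is a small sketch reusing the sample text from above; the variable names are illustrative.

```python
import re

text = "purple alice@google.com, blah monkey bob@abc.com blah dishwasher"

# No groups: a list of the full matches.
whole = re.findall(r"[\w.-]+@[\w.-]+", text)

# One group: a list of strings for that single group.
users = re.findall(r"([\w.-]+)@[\w.-]+", text)

# Two groups: a list of tuples, one tuple per match.
pairs = re.findall(r"([\w.-]+)@([\w.-]+)", text)

# (?: ) groups without capturing, so only the host group is returned.
hosts = re.findall(r"(?:[\w.-]+)@([\w.-]+)", text)

print(whole, users, pairs, hosts)
```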


    str = 'purple alice-b@google.com monkey dishwasher'
    match = re.search(r'\w+@\w+', str)
    if match:
        print(match.group())  ## 'b@google'

Table 11-5. Regular-expression anchor characters

Character   Meaning
^           Match the beginning of the string and, in multiline searches, the beginning of a line.
$           Match the end of the string and, in multiline searches, the end of a line.
\b          Match a word boundary. That is, match the position between a \w character and a \W character or between a \w character and the beginning or end of a string. (Note, however, that [\b] matches backspace.)
\B          Match a position that is not a word boundary.
(?=p)       A positive lookahead assertion. Require that the following characters match the pattern p, but do not include those characters in the match.
(?!p)       A negative lookahead assertion. Require that the following characters do not match the pattern p.

Flags

The last element of regular-expression grammar is flags. There are three flags, listed in the table below:

Character   Meaning
i           Perform case-insensitive matching.
g           Perform a global match; that is, find all matches rather than stopping after the first match.
m           Multiline mode. ^ matches beginning of line or beginning of string, and $ matches end of line or end of string.

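String methods for pattern matching

    JavaScript strings provide four methods that use regular expressions:
    String.search();
    String.replace();
    String.match();
    String.split();
    search() takes a regular expression as its argument. If the argument is not a regular expression, it is first passed to the RegExp() constructor and converted into one.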

        "JavaScript".search(/script/i);

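    search() ignores the g flag. It does not perform a global search; it returns the character position at which the first match starts, or -1 if there is no match. In the example above, it returns 4.

    replace() performs a search-and-replace operation. The first argument is a regular expression and the second is the replacement string.
    replace() is very useful. For example, the following replaces the double quotes around quoted text with two single quotes on each side: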

        var quote = /"([^"]*)"/g;
        text.replace(quote, "''$1''");

    match() is the most commonly used of these methods.
       "1 plus 2 equals 3".match(/\d+/g) // returns ["1", "2", "3"]
    If the regular expression does not have the g flag, match() does not search globally. It stops at the first match and returns an array. The first element, array[0], holds the matched string. The next element, array[1], holds the substring that matched the first parenthesized expression, and so on for later elements.
    To draw a parallel with replace(), array[n] holds the contents of $n.
    For example:

            var url = /(\w+):\/\/([\w.]+)\/(\S*)/;
            var text = "Visit my blog at http://www.example.com/~david";
            var result = text.match(url);
            if (result != null)
            {
                var fullurl = result[0]; // Contains "http://www.example.com/~david"
                var protocol = result[1]; // Contains "http"
                var host = result[2]; // Contains "www.example.com"
                var path = result[3]; // Contains "~david"
            }

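    If the regular expression has the g flag, match() searches globally, and each element of the returned array holds one string that matched the regular expression.

    The argument to split() is usually a delimiter used to break up the string. For example:

       "123,456,789".split(","); // Returns ["123","456","789"]

    The argument can also be a regular expression, and this is what makes the method really useful. For example, you can use a regular expression to discard the whitespace on either side of the delimiter:

       "1, 2, 3, 4, 5".split(/\s*,\s*/); // Returns ["1","2","3","4","5"]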

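For readers following the Python parts of this page, the four JavaScript string methods above have rough counterparts in the re module. The mapping below is a sketch rather than an exact equivalence, and the sample sentence passed to re.sub() is invented.

```python
import re

# String.search(): index of the first match, or -1 if there is none.
m = re.search(r"script", "JavaScript", re.IGNORECASE)
position = m.start() if m else -1

# String.replace() with the g flag: re.sub() replaces every match,
# and \1 plays the role of $1.
quoted = re.sub(r'"([^"]*)"', r"''\1''", 'He said "stop" and "go"')

# String.match() with the g flag: re.findall() returns all matches.
numbers = re.findall(r"\d+", "1 plus 2 equals 3")

# String.split() with a regular expression as the separator.
fields = re.split(r"\s*,\s*", "1, 2, 3, 4, 5")

print(position, quoted, numbers, fields)
```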
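RegExp object methods for pattern matching


    RegExp objects can also be created with the RegExp() constructor, which is a good way to build a RegExp dynamically. It takes one or two string arguments. The first argument is the body of the regular expression, and the second is the flags, such as g, i, and m.

            // Find all five-digit numbers in a string. Note the double \\ in this case.
            var zipcode = new RegExp("\\d{5}", "g");

    RegExp objects have two methods for testing whether a string matches a pattern. The first is exec(), which is similar to match(). Unlike match(), exec() returns the same kind of array whether or not the g flag is set. The first element, array[0], holds the fully matched string, and the following elements hold, in order, the substrings that matched the parenthesized subexpressions. When the pattern has the g flag, exec() sets the lastIndex property of the RegExp object to the position immediately after the matched substring.
    When exec() is called again on the same regular expression, it starts searching at lastIndex instead of at position 0. If exec() finds no match, it resets lastIndex to 0. This behavior makes it easy to loop over a whole string and find every matching substring.
    You can also set lastIndex back to 0 yourself at any time before the last match has been found, and then use the same RegExp object on a different string.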

    var pattern = /Java/g;
    var text = "JavaScript is more fun than Java!";
    var result;
    while((result = pattern.exec(text)) != null)
    {
        alert("Matched '" + result[0] + "'" + " at position " + result.index + "; next search begins at " + pattern.lastIndex);
    }
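In Python, the closest counterpart to looping over exec() with the g flag is re.finditer(), where each match object reports its own start and end. The sketch below mirrors the loop above; treating m.end() as the analogue of lastIndex is an interpretation, not something the JavaScript API defines.

```python
import re

pattern = re.compile(r"Java")
text = "JavaScript is more fun than Java!"

# finditer() walks the string for us; m.end() is where the next search
# resumes, which is the role lastIndex plays in the JavaScript loop.
found = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]

for matched, start, end in found:
    print("Matched %r at position %d; next search begins at %d" % (matched, start, end))
```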

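    The other RegExp method for matching is test(), which is much simpler than exec(). It takes a string as its only argument and returns true if the string contains a match and false otherwise. When the RegExp has the g flag, test() updates lastIndex in the same way exec() does.

Examples:

    Pass the textbox to the checkDate method as the value of Object. The method checks whether the date entered in the textbox is in mm/dd/yyyy format: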

function checkDate(Object)
{
    var strValue = Object.value;
    // A regular-expression literal cannot span several lines, so the
    // alternatives are kept as strings (note the doubled backslashes)
    // and joined with "|".
    var pattern = new RegExp([
        // Months with 31 days
        "(^(10|12|0?[13578])([/])(3[01]|[12][0-9]|0?[1-9])([/])((1[8-9]\\d{2})|([2-9]\\d{3}))$)",
        // Months with 30 days
        "(^(11|0?[469])([/])(30|[12][0-9]|0?[1-9])([/])((1[8-9]\\d{2})|([2-9]\\d{3}))$)",
        // February, days 1 to 28
        "(^(0?2)([/])(2[0-8]|1[0-9]|0?[1-9])([/])((1[8-9]\\d{2})|([2-9]\\d{3}))$)",
        // February 29 in leap years
        "(^(0?2)([/])(29)([/])([2468][048]00)$)",
        "(^(0?2)([/])(29)([/])([3579][26]00)$)",
        "(^(0?2)([/])(29)([/])([1][89][0][48])$)",
        "(^(0?2)([/])(29)([/])([2-9][0-9][0][48])$)",
        "(^(0?2)([/])(29)([/])([1][89][2468][048])$)",
        "(^(0?2)([/])(29)([/])([2-9][0-9][2468][048])$)",
        "(^(0?2)([/])(29)([/])([1][89][13579][26])$)",
        "(^(0?2)([/])(29)([/])([2-9][0-9][13579][26])$)"
    ].join("|"));

    return strValue.match(pattern) != null;
}
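The date pattern above shows how quickly a regular expression grows when it has to encode calendar rules. A different technique, sketched here in Python, is to use a short pattern only for the mm/dd/yyyy shape and let a date parser decide whether the day exists. The function name check_date and the min_year parameter are hypothetical, and the 1800 lower bound is an assumption read off the year ranges in the pattern above.

```python
import re
from datetime import datetime

# Shape only: one or two digits, slash, one or two digits, slash, four digits.
DATE_SHAPE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

def check_date(value, min_year=1800):
    """Return True if value is a real calendar date written as mm/dd/yyyy."""
    if not DATE_SHAPE.fullmatch(value):
        return False
    try:
        # strptime() rejects impossible dates such as 4/31 or 2/29 in a non-leap year.
        parsed = datetime.strptime(value, "%m/%d/%Y")
    except ValueError:
        return False
    return parsed.year >= min_year

print(check_date("02/29/2024"), check_date("2/29/2023"))
```

This trades the single self-contained pattern for two simpler steps, which is usually easier to read and to test.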

    Check whether the textbox input is in email format:

function checkEmail(Object)
{
    var strValue = Object.value;

    // The pattern is not anchored, so it succeeds if any part of the
    // input looks like an email address.
    var pattern = /\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*/;

    return strValue.match(pattern) != null;
}

Expresso Analyzer View

Figure 2. Expresso's analyzer view showing example 17

Expresso has a feature that diagrams expressions in a Tree structure, explaining what each piece means. When debugging an expression, this can help zoom in on the part that is causing trouble. Try this by selecting example (17) and then using the Analyze button. Select nodes in the tree and expand them to explore the structure of this regular expression as shown in the figure. After highlighting a node, you can also use the Partial Match or Exclude Match buttons to run a match using just the highlighted portion of the regular expression or using the regular expression with the highlighted portion excluded.

When subexpressions are grouped with parentheses, the text that matches the subexpression is available for further processing in a computer program or within the regular expression itself. By default, groups are numbered sequentially as encountered in reading from left to right, starting with 1. This automatic numbering can be seen in Expresso's skeleton view or in the results shown after a successful match.

A "backreference" is used to search for a recurrence of previously matched text that has been captured by a group. For example, "\1" means, "match the text that was captured by group 1". Here is an example:

20. \b(\w+)\b\s*\1\b Find repeated words

This works by capturing a string of at least one alphanumeric character within group 1 "(\w+)", but only if it begins and ends a word. It then looks for any amount of whitespace "\s*" followed by a repetition of the captured text "\1" ending at the end of a word.

It is possible to override the automatic numbering of groups by specifying an explicit name or number. In the above example, instead of writing the group as "(\w+)", we can write it as "(?<Word>\w+)" to name this capture group "Word". A backreference to this group is written "\k<Word>". Try this example:

21. \b(?<Word>\w+)\b\s*\k<Word>\b Capture repeated word in a named group
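Examples 20 and 21 can also be tried in Python. Numbered backreferences are written the same way, but Python's re module spells named groups as (?P<Word>...) and named backreferences as (?P=Word) instead of the .NET forms shown above. The sample sentence below is made up.

```python
import re

text = "this is is a test test"

# Numbered group and backreference, as in example 20.
repeated = re.findall(r"\b(\w+)\b\s*\1\b", text)

# Named group and named backreference in Python's syntax.
m = re.search(r"\b(?P<Word>\w+)\b\s*(?P=Word)\b", text)
first_repeat = m.group("Word")

print(repeated, first_repeat)
```

The "is" inside "this" is not reported because \b requires the captured word to start at a word boundary.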

Test this in Expresso and expand the match results to see the contents of the named group.

Using parentheses, there are many special purpose syntax elements available. Some of the most common are summarized in this table:

 

Captures
(exp)           Match exp and capture it in an automatically numbered group
(?<name>exp)    Match exp and capture it in a group named name
(?:exp)         Match exp, but do not capture it

Lookarounds
(?=exp)         Match any position preceding a suffix exp
(?<=exp)        Match any position following a prefix exp
(?!exp)         Match any position after which the suffix exp is not found
(?<!exp)        Match any position before which the prefix exp is not found

Comment
(?#comment)     Comment

Table 4. Commonly used Group Constructs

We've already talked about the first two. The third "(?:exp)" does not alter the matching behavior, it just doesn't capture it in a named or numbered group like the first two.

The result here is clearly not ideal; the examples below improve on it.

Positive Lookaround

The next four are so-called lookahead or lookbehind assertions. They look for things that go before or after the current match without including them in the match. It is important to understand that these expressions match a position like "^" or "\b" and never match any text. For this reason, they are known as "zero-width assertions". They are best illustrated by example:

"(?=exp)" is the "zero-width positive lookahead assertion". It matches a position in the text that precedes a given suffix, but doesn't include the suffix in the match:

22. \b\w+(?=ing\b) The beginning of words ending with "ing"

"(?<=exp)" is the "zero-width positive lookbehind assertion". It matches the position following a prefix, but doesn't include the prefix in the match:

23. (?<=\bre)\w+\b The end of words starting with "re"

Here is an example that could be used repeatedly to insert commas into numbers in groups of three digits:

24. (?<=\d)\d{3}\b Three digits at the end of a word, preceded by a digit

Here is an example that looks for both a prefix and a suffix:

25. (?<=\s)\w+(?=\s) Alphanumeric strings bounded by whitespace
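Examples 22 to 24 work unchanged in Python, since the lookbehinds involved have a fixed width. The sketch below shows what each one returns and one possible way to apply example 24 "repeatedly"; the helper add_commas and the sample strings are invented for illustration, and the pattern is wrapped in a group so the digits can be reinserted.

```python
import re

# Example 22: the part of a word that precedes a final "ing".
stems = re.findall(r"\b\w+(?=ing\b)", "singing and running")

# Example 23: the part of a word that follows a leading "re".
tails = re.findall(r"(?<=\bre)\w+\b", "rerun the report")

def add_commas(digits):
    """Apply example 24 until nothing changes, inserting commas right to left."""
    while True:
        updated = re.sub(r"(?<=\d)(\d{3})\b", r",\1", digits)
        if updated == digits:
            return digits
        digits = updated

with_commas = add_commas("1234567")
print(stems, tails, with_commas)
```

Each pass inserts one comma, because after the first insertion the comma itself creates the next word boundary.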

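Square brackets: []
Square brackets stand for one character out of a set: [abc] means a or b or c. A '.' inside square brackets means a literal '.'. So [\w.-] here means a letter or digit, or a '.', or a '-'.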

Negative Lookaround

Earlier, I showed how to search for a character that is not a specific character or the member of a character class. What if we simply want to verify that a character is not present, but don't want to match anything? For example, what if we are searching for words in which the letter "q" is not followed by the letter "u"? We could try:

26. \b\w*q[^u]\w*\b Words with "q" followed by NOT "u"

Run the example and you will see that it fails when "q" is the last letter of a word, as in "Iraq". This is because "[^u]" always matches a character. If "q" is the last character of the word, it will match the whitespace character that follows, so in the example the expression ends up matching two whole words. Negative lookaround solves this problem because it matches a position and does not consume any text. As with positive lookaround, it can also be used to match the position of an arbitrarily complex subexpression, rather than just a single character. We can now do a better job:

27. \b\w*q(?!u)\w*\b Search for words with "q" not followed by "u"

We used the "zero-width negative lookahead assertion", "(?!exp)", which succeeds only if the suffix "exp" is not present. Here is another example:

28. \d{3}(?!\d) Three digits not followed by another digit

Similarly, we can use "(?<!exp)", the "zero-width negative lookbehind assertion", to search for a position in the text at which the prefix "exp" is not present:

29. (?<![a-z ])\w{7} Strings of 7 alphanumerics not preceded by a letter or space

Here is one more example using lookaround:

30. (?<=<(\w+)>).*(?=</\1>) Text between HTML tags

This searches for an HTML tag using lookbehind and the corresponding closing tag using lookahead, thus capturing the intervening text but excluding both tags.
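The difference between consuming a character with a negated class and merely checking a position with negative lookahead (examples 26 to 28) is easy to observe in Python. The sample text is invented for illustration.

```python
import re

text = "Iraq quit Benq queue"

# Example 26: [^u] must consume a character, so the space after the
# final "q" is swallowed and the match runs on into the next word.
consuming = re.findall(r"\b\w*q[^u]\w*\b", text)

# Example 27: (?!u) only inspects the next position and consumes nothing.
lookahead = re.findall(r"\b\w*q(?!u)\w*\b", text)

# Example 28: three digits not followed by another digit.
last_three = re.findall(r"\d{3}(?!\d)", "12345")

print(consuming, lookahead, last_three)
```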

## Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot. For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:

Comments please

Another use of parentheses is to include comments using the "(?#comment)" syntax. A better method is to set the "Ignore Pattern Whitespace" option, which allows whitespace to be inserted in the expression and then ignored when the expression is used. With this option set, anything following a number sign "#" at the end of each line of text is ignored. For example, we can format the preceding example like this:

31. Text between HTML tags, with comments

 

(?<=      # Search for a prefix, but exclude it
  <(\w+)> # Match a tag of alphanumerics within angle brackets
)         # End the prefix
.*        # Match any text
(?=       # Search for a suffix, but exclude it
  </\1>   # Match the previously captured tag preceded by "/"
)         # End the suffix
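Python offers the same facility through the re.VERBOSE flag, which ignores whitespace in the pattern and treats "#" as the start of a comment. The sketch below is an adaptation rather than a port of example 31: Python's lookbehind must have a fixed width, so the tags are matched and the inner text is captured in a group instead. The name TAG_TEXT and the sample HTML are illustrative.

```python
import re

TAG_TEXT = re.compile(r"""
    <(\w+)>     # an opening tag; the tag name becomes group 1
    (.*?)       # the text between the tags, captured as group 2
    </\1>       # the closing tag for the captured name
""", re.VERBOSE)

m = TAG_TEXT.search("<b>bold</b> and <i>italic</i>")
inner = m.group(2)
print(inner)
```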

     match = re.search(r'[\w.-]+@[\w.-]+', str)
     if match:
         print(match.group())  ## 'alice-b@google.com'

Greedy and Lazy

When a regular expression has a quantifier that can accept a range of repetitions (like ".*"), the normal behavior is to match as many characters as possible. Consider the following regular expression:

32. a.*b The longest string starting with a and ending with b

If this is used to search the string "aabab", it will match the entire string "aabab". This is called "greedy" matching. Sometimes, we prefer "lazy" matching in which a match using the minimum number of repetitions is found. All the quantifiers in Table 2 can be turned into "lazy" quantifiers by adding a question mark "?". Thus "*?" means "match any number of repetitions, but use the smallest number of repetitions that still leads to a successful match". Now let's try the lazy version of example (32):

33. a.*?b The shortest string starting with a and ending with b

If we apply this to the same string "aabab" it will first match "aab" and then "ab".
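The greedy and lazy versions can be compared on the same string with a couple of lines of Python; this is only a quick check of the behavior described above.

```python
import re

# Greedy: .* takes as much as it can while still ending on a "b".
greedy = re.findall(r"a.*b", "aabab")

# Lazy: .*? takes as little as it can, so two shorter matches are found.
lazy = re.findall(r"a.*?b", "aabab")

print(greedy, lazy)
```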

 

*?        Repeat any number of times, but as few as possible
+?        Repeat one or more times, but as few as possible
??        Repeat zero or one time, but as few as possible
{n,m}?    Repeat at least n, but no more than m times, but as few as possible
{n,}?     Repeat at least n times, but as few as possible

Table 5. Lazy quantifiers

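Ranges can also be used inside square brackets: [a-z] means one lowercase letter.
In [abc-], however, the dash does not denote a range.
In addition, [^ab] means any character except a or b.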

What did we leave out?

I've described a rich set of elements with which to begin building regular expressions; but I left out a few things that are summarized in the following table. Many of these are illustrated with additional examples in the project file. The example number is shown in the left-hand column of this table.

 

#    Syntax                         Description
     \a                             Bell character
     \b                             Normally a word boundary, but within a character class it means backspace
     \t                             Tab
34   \r                             Carriage return
     \v                             Vertical tab
     \f                             Form feed
35   \n                             New line
     \e                             Escape
36   \nnn                           Character whose ASCII octal code is nnn
37   \xnn                           Character whose hexadecimal code is nn
38   \unnnn                         Character whose Unicode is nnnn
39   \cN                            Control N character, for example carriage return (Ctrl-M) is \cM
40   \A                             Beginning of a string (like ^ but does not depend on the multiline option)
41   \Z                             End of string or before \n at end of string (ignores multiline)
     \z                             End of string (ignores multiline)
42   \G                             Beginning of the current search
43   \p{name}                       Any character from the Unicode class named name, for example \p{IsGreek}
     (?>exp)                        Greedy subexpression, also known as a non-backtracking subexpression. This is matched only once and then does not participate in backtracking.
44   (?<x>-<y>exp) or (?-<y>exp)    Balancing group. This is complicated but powerful. It allows named capture groups to be manipulated on a push down/pop up stack and can be used, for example, to search for matching parentheses, which is otherwise not possible with regular expressions. See the example in the project file.
45   (?im-nsx:exp)                  Change the regular expression options for the subexpression exp
46   (?im-nsx)                      Change the regular expression options for the rest of the enclosing group
     (?(exp)yes|no)                 The subexpression exp is treated as a zero-width positive lookahead. If it matches at this point, the subexpression yes becomes the next match, otherwise no is used.
     (?(exp)yes)                    Same as above but with an empty no expression
     (?(name)yes|no)                This is the same syntax as the preceding case. If name is a valid group name, the yes expression is matched if the named group had a successful match, otherwise the no expression is matched.
47   (?(name)yes)                   Same as above but with an empty no expression

Table 6. Everything we left out. The left-hand column shows the number of an example in the project file that illustrates this construct.

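Group extraction:
The group feature of a regular expression lets you pick out the parts of the matching text that you want. In the email example, suppose we want to extract the username and the host as two separate parts. This is done with parentheses ( ),
like this: r'([\w.-]+)@([\w.-]+)'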

Conclusion

We've given many examples to illustrate the essential features of .NET regular expressions, emphasizing the use of a tool like Expresso to test, experiment, and learn by example. If you get hooked, there are many online resources available to help you go further. You can start your search at the Ultrapico web site. If you want to read a book, I suggest the latest edition of Mastering Regular Expressions, by Jeffrey Friedl.

There are also a number of nice articles on The Code Project including the following tutorials:

  • An Introduction to Regular Expressions by Uwe Keim
  • Microsoft Visual C# .NET Developer's Cookbook: Chapter on Strings and Regular Expressions

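The parentheses do not change what the pattern will match; they establish logical groups inside the match text.
On a successful match, match.group(1) returns the text matched inside the first left parenthesis,
match.group(2) returns the text matched inside the second left parenthesis,
and match.group() is still the whole match text.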


     str = 'purple alice-b@google.com monkey dishwasher'
     match = re.search(r'([\w.-]+)@([\w.-]+)', str)
     if match:
         print(match.group())   ## 'alice-b@google.com' (the whole match)
         print(match.group(1))  ## 'alice-b' (the username, group 1)
         print(match.group(2))  ## 'google.com' (the host, group 2)

   ##Tips: A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.

findall:
As the name suggests, findall() finds all the matches. re.search() above finds only the first match.
findall() finds every match and returns them in a list of strings.

## findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.

     ## Suppose we have a text with many email addresses
     str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

     ## Here re.findall() returns a list of all the found email strings
     emails = re.findall(r'[\w.-]+@[\w.-]+', str) ## ['alice@google.com', 'bob@abc.com']
     for email in emails:
         # do something with each found email string
         print(email)

findall with files:
## For files, you may be in the habit of writing a loop to iterate over the lines of the file, and you could then call findall() on each line. Instead, let findall() do the iteration for you -- much better! Just feed the whole file text into findall() and let it return a list of all the matches in a single step (recall that f.read() returns the whole text of a file in a single string):

     # Open file
     f = open('test.txt', 'r')
     # Feed the file text into findall(); it returns a list of all the found strings
     strings = re.findall(r'some pattern', f.read())

This way, all the matches in a file are returned in a single step.

Groups and findall:
If the pattern contains two or more groups, findall() returns a list of tuples, where each tuple corresponds to one match of the pattern.

     str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
     tuples = re.findall(r'([\w.-]+)@([\w.-]+)', str)
     print(tuples)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
     for tuple in tuples:
         print(tuple[0])  ## username
         print(tuple[1])  ## host

Options:

##The re functions take options to modify the behavior of the pattern match. The option flag is added as an extra argument to search(), findall(), etc., e.g. re.search(pat, str, re.IGNORECASE).

IGNORECASE ##-- ignore upper/lowercase differences for matching, so 'a' matches both 'a' and 'A'.
DOTALL ##-- allow dot (.) to match newline -- normally it matches anything but newline. This can trip you up -- you think .* matches everything, but by default it does not go past the end of a line. Note that \s (whitespace) includes newlines, so if you want to match a run of whitespace that may include a newline, you can just use \s*
MULTILINE ##-- Within a string made of many lines, allow ^ and $ to match the start and end of each line. Normally ^/$ would just match the start and end of the whole string.
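The effect of each option is easiest to see by running the same pattern with and without the flag. The following is a small sketch; the two-line sample string and the variable names are made up.

```python
import re

# IGNORECASE: 'a' matches 'A'.
ignore_case = re.search(r"a", "A", re.IGNORECASE) is not None

two_lines = "first line\nsecond line"

# Without DOTALL the dot stops at the newline, so this finds nothing.
default_dot = re.search(r"first.*second", two_lines)

# With DOTALL the dot also matches the newline.
dotall = re.search(r"first.*second", two_lines, re.DOTALL)

# MULTILINE lets ^ match at the start of every line, not only the string.
string_start = re.findall(r"^\w+", two_lines)
line_starts = re.findall(r"^\w+", two_lines, re.MULTILINE)

print(ignore_case, default_dot, dotall is not None, string_start, line_starts)
```

Several flags can be combined with the | operator, for example re.IGNORECASE | re.MULTILINE.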

Substitution:

## The re.sub(pat, replacement, str) function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include '\1', '\2' which refer to the text from group(1), group(2), and so on from the original matching text.

     str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
     ## re.sub(pat, replacement, str) -- returns new string with all replacements,
     ## \1 is group(1), \2 group(2) in the replacement
     print(re.sub(r'([\w.-]+)@([\w.-]+)', r'\1@yo-yo-dyne.com', str))
     ## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
