Regular expressions are a powerful language for matching text
patterns. This page gives a basic introduction to regular expressions
themselves sufficient for our Python exercises and shows how regular
expressions work in Python. The Python "re" module provides regular
expression support.
In Python a regular expression search is typically written as:
match = re.search(pat, str)
The following script exercises the main methods of a compiled regular expression on a small block of HTML-like text:
import re

r1 = re.compile(r'(?im)(?P<name></html>)$')
content = """
<HTML>
boxsuch as 'box' and 'boxes', but not 'inbox'. In other words
box
<html>dsafdsafdas </html> </ahtml>
</html>
</HTML>
"""

# Iterate over every closing tag that ends a line.
reobj = re.compile(r"(?im)(?P<name></.*?html>)$")
for match in reobj.finditer(content):
    # match.start(): start index of the match
    # match.end(): end index of the match (exclusive)
    # match.group(): matched text
    print "start>>", match.start()
    print "end>>", match.end()
    print "span>>", match.span()
    print "match.group()>>", match.group()
    print "*" * 20

# match() is anchored at the start of the string; search() is not.
if r1.match(content):
    print 'match succeeds'
else:
    print 'match fails'        # prints: match fails
if r1.search(content):
    print 'search succeeds'    # prints: search succeeds
else:
    print 'search fails'

# Read-only attributes of the compiled object.
print r1.flags
print r1.groupindex
print r1.pattern

l = r1.split(content)
print "l>>", l
for item in r1.findall(content):
    print "item>>", item
s = r1.sub("aa", content)
print "s>>", s
s_subn, s_sub_count = r1.subn("aaaaaaaaaaaa", content)
print "s_subn>>", s_subn
print "s_sub_count>>", s_sub_count
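(This part of the article is mainly excerpted from "JavaScript-The Definitive Guide", 5th edition.)
Using the methods that JavaScript provides to validate page input on the client side with regular expressions is a common and efficient practice. To test text against a given regular expression pattern, you can use certain methods provided by strings, or methods provided by the regular expression (RegExp) object.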
9.7 Regular Expressions and the re Module
A regular expression is a string that represents a pattern. With regular
expression functionality, you can compare that pattern to another string
and see if any part of the string matches the pattern.
The re module supplies all of Python's regular expression functionality.
The compile function builds a regular expression object from a pattern
string and optional flags. The methods of a regular expression object
look for matches of the regular expression in a string and/or perform
substitutions. Module re also exposes functions equivalent to a regular
expression's methods, but with the regular expression's pattern string
as their first argument.
Regular expressions can be difficult to master, and this book does not
purport to teach them; I cover only the ways in which you can use them in
Python. For general coverage of regular expressions, I recommend the
book Mastering Regular Expressions, by Jeffrey Friedl (O'Reilly).
Friedl's book offers thorough coverage of regular expressions at both
the tutorial and advanced levels.
9.7.1 Pattern-String Syntax
The pattern string representing a regular expression follows a specific
syntax:
Alphabetic and numeric characters stand for themselves. A regular
expression whose pattern is a string of letters and digits matches the
same string.
Many alphanumeric characters acquire special meaning in a pattern when
they are preceded by a backslash (\).
Punctuation works the other way around. A punctuation character is
self-matching when escaped, and has a special meaning when unescaped.
The backslash character itself is matched by a repeated backslash (i.e.,
the pattern \\).
Since regular expression patterns often contain backslashes, you
generally want to specify them using raw-string syntax (covered in
Chapter 4). Pattern elements (e.g., r'\t', which is equivalent to the
non-raw string literal '\\t') do match the corresponding special
characters (e.g., the tab character '\t'). Therefore, you can use
raw-string syntax even when you do need a literal match for some such
special character.
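The difference between the two spellings is easy to check interactively. The following is a minimal sketch in Python 3 syntax; the variable names and sample strings are made up for illustration.

```python
import re

# Non-raw literal: Python turns \t into a real tab before re ever sees it.
plain = '\t'
# Raw literal: re receives a backslash followed by 't' and reads it as "tab".
raw = r'\t'
print(len(plain), len(raw))              # 1 2

# Either way, the pattern matches a tab character in the target string.
target = 'a\tb'
print(bool(re.search(plain, target)))    # True
print(bool(re.search(raw, target)))      # True

# A literal backslash needs the pattern \\, written r'\\' as a raw string.
path = 'C:\\temp'                        # the seven-character string C:\temp
print(re.findall(r'\\', path))           # ['\\']
```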
Table 9-2 lists the special elements in regular expression pattern
syntax. The exact meanings of some pattern elements change when you use
optional flags, together with the pattern string, to build the regular
expression object. The optional flags are covered later in this
chapter.
Some Simple Examples
The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None. Therefore, the search is usually immediately followed by an if-statement to test whether the search succeeded. The example later on this page searches for the pattern 'word:' followed by a 3 letter word.

Definition and Syntax of Regular Expressions
Table 9-2. Regular expression pattern syntax
Element        Meaning
.              Matches any character except \n (if DOTALL, also matches \n)
^              Matches start of string (if MULTILINE, also matches after \n)
$              Matches end of string (if MULTILINE, also matches before \n)
*              Matches zero or more cases of the previous regular expression;
               greedy (match as many as possible)
+              Matches one or more cases of the previous regular expression;
               greedy (match as many as possible)
?              Matches zero or one case of the previous regular expression;
               greedy (match one if possible)
*?, +?, ??     Non-greedy versions of *, +, and ? (match as few as possible)
{m,n}          Matches m to n cases of the previous regular expression (greedy)
{m,n}?         Matches m to n cases of the previous regular expression
               (non-greedy)
[...]          Matches any one of a set of characters contained within the
               brackets
|              Matches expression either preceding it or following it
(...)          Matches the regular expression within the parentheses and also
               indicates a group
(?iLmsux)      Alternate way to set optional flags; no effect on match
(?:...)        Like (...), but does not indicate a group
(?P<id>...)    Like (...), but the group also gets the name id
(?P=id)        Matches whatever was previously matched by group named id
(?#...)        Content of parentheses is just a comment; no effect on match
(?=...)        Lookahead assertion; matches if regular expression ... matches
               what comes next, but does not consume any part of the string
(?!...)        Negative lookahead assertion; matches if regular expression ...
               does not match what comes next, and does not consume any part
               of the string
(?<=...)       Lookbehind assertion; matches if there is a match for regular
               expression ... ending at the current position (... must match
               a fixed length)
(?<!...)       Negative lookbehind assertion; matches if there is no match for
               regular expression ... ending at the current position (... must
               match a fixed length)
\number        Matches whatever was previously matched by group numbered
               number (groups are automatically numbered from 1 up to 99)
\A             Matches an empty string, but only at the start of the whole
               string
\b             Matches an empty string, but only at the start or end of a word
               (a maximal sequence of alphanumeric characters; see also \w)
\B             Matches an empty string, but not at the start or end of a word
\d             Matches one digit, like the set [0-9]
\D             Matches one non-digit, like the set [^0-9]
\s             Matches a whitespace character, like the set [ \t\n\r\f\v]
\S             Matches a non-whitespace character, like the set [^ \t\n\r\f\v]
\w             Matches one alphanumeric character; unless LOCALE or UNICODE is
               set, \w is like [a-zA-Z0-9_]
\W             Matches one non-alphanumeric character, the reverse of \w
\Z             Matches an empty string, but only at the end of the whole string
\\             Matches one backslash character
9.7.2 Common Regular Expression Idioms
'.*' as a substring of a regular expression's pattern string means "any
number of repetitions (zero or more) of any character." In other words,
'.*' matches any substring of a target string, including the empty
substring. '.+' is similar, but it matches only a non-empty substring.
For example:
'pre.*post'
matches a string containing a substring 'pre' followed by a later
substring 'post', even if the latter is adjacent to the former (e.g., it
matches both 'prepost' and 'pre23post'). On the other hand:
'pre.+post'
matches only if 'pre' and 'post' are not adjacent (e.g., it matches
'pre23post' but does not match 'prepost'). Both patterns also match
strings that continue after the 'post'.
To constrain a pattern to match only strings that end with 'post', end
the pattern with \Z. For example:
r'pre.*post\Z'
matches 'prepost', but not 'preposterous'. Note that we need to express
the pattern with raw-string syntax (or escape the backslash by
doubling it into \\), as it contains a backslash. Using raw-string
syntax for all regular expression pattern literals is good practice in
Python, as it's the simplest way to ensure you'll never fail to escape a
backslash.
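The three patterns discussed above can be compared side by side. This is a small sketch in Python 3 syntax; the helper name check is hypothetical and exists only for this illustration.

```python
import re

any_gap = re.compile(r'pre.*post')      # zero or more characters in between
some_gap = re.compile(r'pre.+post')     # at least one character in between
at_end = re.compile(r'pre.*post\Z')     # 'post' must end the string

def check(text):
    """Return (any_gap, some_gap, at_end) results for text as booleans."""
    return (bool(any_gap.search(text)),
            bool(some_gap.search(text)),
            bool(at_end.search(text)))

for text in ('prepost', 'pre23post', 'preposterous'):
    print(text, check(text))
```

Only 'pre23post' satisfies all three patterns; 'preposterous' satisfies just the unanchored '.*' form.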
Another frequently used element in regular expression patterns is \b,
which matches a word boundary. If you want to match the word 'his' only
as a whole word and not its occurrences as a substring in such words as
'this' and 'history', the regular expression pattern is:
r'\bhis\b'
with word boundaries both before and after. To match the beginning of
any word starting with 'her', such as 'her' itself but also 'hermetic',
but not words that just contain 'her' elsewhere, such as 'ether', use:
r'\bher'
with a word boundary before, but not after, the relevant string. To
match the end of any word ending with 'its', such as 'its' itself but
also 'fits', but not words that contain 'its' elsewhere, such as 'itsy',
use:
r'its\b'
with a word boundary after, but not before, the relevant string. To
match whole words thus constrained, rather than just their beginning or
end, add a pattern element \w* to match zero or more word characters.
For example, to match any full word starting with 'her', use:
r'\bher\w*'
And to match any full word ending with 'its', use:
r'\w*its\b'
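The word-boundary idioms can be tried on a single sentence that contains all the words mentioned. The sketch below uses Python 3 syntax, and the sample sentence is invented for the purpose.

```python
import re

text = 'this is his history; her hermetic ether; its fits itsy'

# Whole word only: boundaries on both sides.
print(re.findall(r'\bhis\b', text))      # ['his']
# Full words that start with 'her'.
print(re.findall(r'\bher\w*', text))     # ['her', 'hermetic']
# Full words that end with 'its'.
print(re.findall(r'\w*its\b', text))     # ['its', 'fits']
```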
9.7.3 Sets of Characters
You denote sets of characters in a pattern by listing the characters
within brackets ([ ]). In addition to listing single characters, you
can denote a range by giving the first and last characters of the range
separated by a hyphen (-). The last character of the range is included
in the set, which is different from other Python ranges. Within a set,
special characters stand for themselves, except \, ], and -, which you
must escape (by preceding them with a backslash) when their position is
such that, unescaped, they would form part of the set's syntax. In a
set, you can also denote a class of characters by escaped-letter
notation, such as \d or \S. However, \b in a set denotes a backspace
character, not a word boundary. If the first character in the set's
pattern, right after the [, is a caret (^), the set is complemented. In
other words, the set matches any character except those that follow ^ in
the set pattern notation.
A frequent use of character sets is to match a word, using a definition
of what characters can make up a word that differs from \w's default
(letters, digits, and the underscore). To match a word of one or more
characters, each of which can be a letter, an apostrophe, or a hyphen,
but not a digit (e.g., 'Finnegan-O'Hara'), use:
r"[a-zA-Z'\-]+"
It's not strictly necessary to escape the hyphen with a backslash in
this case, since its position makes it syntactically unambiguous.
However, the backslash makes the pattern somewhat more readable, by
visually distinguishing the hyphen that you want to have as a character
in the set from those used to denote ranges.
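A short sketch in Python 3 syntax shows the set from this section at work, along with a complemented set. The sample strings and the name name_word are made up for illustration.

```python
import re

# Letters, apostrophes, and hyphens, but no digits; the hyphen is escaped
# only for readability.
name_word = re.compile(r"[a-zA-Z'\-]+")
print(name_word.findall("Finnegan-O'Hara met agent 007"))
# ["Finnegan-O'Hara", 'met', 'agent']

# A leading caret complements the set: runs of characters that are neither
# digits nor whitespace.
print(re.findall(r'[^\d\s]+', 'room 101, floor 3'))
# ['room', ',', 'floor']
```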
9.7.4 Alternatives
A vertical bar (|) in a regular expression pattern, used to specify
alternatives, has low precedence. Unless parentheses change the
grouping, | applies to the whole pattern on either side, up to the start
or end of the string, or to another |. A pattern can be made up of any
number of subpatterns joined by |. To match such a regular expression,
the first subpattern is tried first, and if it matches, the others are
skipped. If the first subpattern does not match, the second subpattern
is tried, and so on. | is neither greedy nor non-greedy, as it doesn't
take into consideration the length of the match.
If you have a list L of words, a regular expression pattern that matches
any of the words is:
'|'.join([r'\b%s\b' % word for word in L])
If the items of L can be more-general strings, not just words, you need
to escape each of them with function re.escape, covered later in this
chapter, and you probably don't want the \b word boundary markers on
either side. In this case, use the regular expression pattern:
'|'.join(map(re.escape,L))
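Both ways of building an alternation can be tried on small lists. This is a sketch in Python 3 syntax; the lists words and literals are hypothetical sample data.

```python
import re

# Whole words: wrap each word in \b markers before joining with |.
words = ['cat', 'dog']
word_pattern = '|'.join([r'\b%s\b' % word for word in words])
print(word_pattern)                                          # \bcat\b|\bdog\b
print(re.findall(word_pattern, 'cat, dogs, dog, concat'))    # ['cat', 'dog']

# Arbitrary strings may contain special characters, so escape each one first.
literals = ['a+b', '1.5']
literal_pattern = '|'.join(map(re.escape, literals))
print(re.findall(literal_pattern, 'a+b = 1.5, not aab or 125'))
# ['a+b', '1.5']
```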
9.7.5 Groups
A regular expression can contain any number of groups, from none up to
99 (any number is allowed, but only the first 99 groups are fully
supported). Parentheses in a pattern string indicate a group. Element
(?P<id>...) also indicates a group, and in addition gives the
group a name, id, that can be any Python identifier. All groups, named
and unnamed, are numbered from left to right, 1 to 99, with group number
0 indicating the whole regular expression.
For any match of the regular expression with a string, each group
matches a substring (possibly an empty one). When the regular expression
uses |, some of the groups may not match any substring, although the
regular expression as a whole does match the string. When a group
doesn't match any substring, we say that the group does not participate
in the match. An empty string '' is used to represent the matching
substring for a group that does not participate in a match, except where
otherwise indicated later in this chapter.
For example:
r'(.+)\1+\Z'
matches a string made up of two or more repetitions of any non-empty
substring. The (.+) part of the pattern matches any non-empty substring
(any character, one or more times), and defines a group thanks to the
parentheses. The \1+ part of the pattern matches one or more
repetitions of the group, and the \Z anchors the match to
end-of-string.
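The back reference is easier to follow with concrete strings. The sketch below is in Python 3 syntax; the helper name repeated_unit is hypothetical.

```python
import re

repeated = re.compile(r'(.+)\1+\Z')

def repeated_unit(text):
    """Return the substring captured by group 1, or None when text does not match."""
    m = repeated.match(text)
    return m.group(1) if m else None

for text in ('abab', 'xyzxyzxyz', 'abc'):
    print(text, repeated_unit(text))
```

For 'abab' the group captures 'ab', for 'xyzxyzxyz' it captures 'xyz', and 'abc' does not match at all.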
9.7.6 Optional Flags
A regular expression pattern element with one or more of the letters
"iLmsux" between (? and ) lets you set regular expression options within
the regular expression's pattern, rather than by the flags argument to
function compile of module re. Options apply to the whole regular
expression, no matter where the options element occurs in the pattern.
For clarity, options should always be at the start of the pattern.
Placement at the start is mandatory if x is among the options, since x
changes the way Python parses the pattern.
Using the explicit flags argument is more readable than placing an
options element within the pattern. The flags argument to function
compile is a coded integer, built by bitwise ORing (with Python's
bitwise OR operator, |) one or more of the following attributes of
module re. Each attribute has both a short name (one uppercase letter),
for convenience, and a long name (an uppercase multiletter identifier),
which is more readable and thus normally preferable:
I or IGNORECASE
Makes matching case-insensitive
L or LOCALE
Causes \w, \W, \b, and \B matches to depend on what the current
locale deems alphanumeric
M or MULTILINE
Makes the special characters ^ and $ match at the start and end of each
line (i.e., right after/before a newline), as well as at the start and
end of the whole string
S or DOTALL
Causes the special character . to match any character, including a
newline
U or UNICODE
Makes \w, \W, \b, and \B matches depend on what Unicode deems
alphanumeric
X or VERBOSE
Causes whitespace in the pattern to be ignored, except when escaped or
in a character set, and makes a # character in the pattern begin a
comment that lasts until the end of the line
For example, here are three ways to define equivalent regular
expressions with function compile, covered later in this chapter. Each
of these regular expressions matches the word "hello" in any mix of
upper- and lowercase letters:
import re
r1 = re.compile(r'(?i)hello')
r2 = re.compile(r'hello', re.I)
r3 = re.compile(r'hello', re.IGNORECASE)
The third approach is clearly the most readable, and thus the most
maintainable, even though it is slightly more verbose. Note that the
raw-string form is not necessary here, since the patterns do not include
backslashes. However, using raw strings is still innocuous, and is the
recommended style for clarity.
Option re.VERBOSE (or re.X) lets you make patterns more readable and
understandable by appropriate use of whitespace and comments.
Complicated and verbose regular expression patterns are generally best
represented by strings that take up more than one line, and therefore
you normally want to use the triple-quoted raw-string format for such
pattern strings. For example:
repat_num1 = r'(0[0-7]*|0x[\da-fA-F]+|[1-9]\d*)L?\Z'
repat_num2 = r'''(?x)             # pattern matching integer numbers
      (0 [0-7]*         |     # octal: leading 0, then 0+ octal digits
       0x [\da-fA-F]+   |     # hex: 0x, then 1+ hex digits
       [1-9] \d*        )     # decimal: leading non-0, then 0+ digits
       L?\Z                   # optional trailing L, then end of string
      '''
The two patterns defined in this example are equivalent, but the second
one is made somewhat more readable by the comments and the free use of
whitespace to group portions of the pattern in logical ways.
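One way to gain confidence that a compact pattern and its verbose rewrite agree is to run both over the same inputs. The sketch below (Python 3 syntax) passes re.VERBOSE as a flag instead of embedding (?x); the sample strings are made up for illustration.

```python
import re

compact = re.compile(r'(0[0-7]*|0x[\da-fA-F]+|[1-9]\d*)L?\Z')
verbose = re.compile(r'''
    (0 [0-7]*          |   # octal: leading 0, then 0+ octal digits
     0x [\da-fA-F]+    |   # hex: 0x, then 1+ hex digits
     [1-9] \d*         )   # decimal: leading non-0, then 0+ digits
    L?\Z                   # optional trailing L, then end of string
    ''', re.VERBOSE)

samples = ['0', '017', '0x1F', '42', '42L', '09', 'x42', '']
for s in samples:
    # Both compiled objects should give the same answer for every sample.
    print(repr(s), bool(compact.match(s)), bool(verbose.match(s)))
```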
9.7.7 Match Versus Search
So far, we've been using regular expressions to match strings. For
example, the regular expression with pattern r'box' matches strings such
as 'box' and 'boxes', but not 'inbox'. In other words, a regular
expression match can be considered as implicitly anchored at the start
of the target string, as if the regular expression's pattern started
with \A.
Often, you're interested in locating possible matches for a regular
expression anywhere in the string, without any anchoring (e.g., find the
r'box' match inside such strings as 'inbox', as well as in 'box' and
'boxes'). In this case, the Python term for the operation is a search,
as opposed to a match. For such searches, you use the search method of a
regular expression object, while the match method only deals with
matching from the start. For example:
import re
r1 = re.compile(r'box')
if r1.match('inbox'):
    print 'match succeeds'
else:
    print 'match fails'        # prints: match fails
if r1.search('inbox'):
    print 'search succeeds'    # prints: search succeeds
else:
    print 'search fails'
9.7.8 Anchoring at String Start and End
The pattern elements ensuring that a regular expression search (or
match) is anchored at string start and string end are \A and \Z
respectively. More traditionally, elements ^ for start and $ for end are
also used in similar roles. ^ is the same as \A, and $ is the same as
\Z, for regular expression objects that are not multiline (i.e., that
do not contain pattern element (?m) and are not compiled with the flag
re.M or re.MULTILINE). For a multiline regular expression object,
however, ^ anchors at the start of any line (i.e., either at the start
of the whole string or at the position right after a newline character
\n). Similarly, with a multiline regular expression, $ anchors at the
end of any line (i.e., either at the end of the whole string or at the
position right before \n). On the other hand, \A and \Z anchor at the
start and end of the whole string whether the regular expression object
is multiline or not. For example, here's how to check if a file has any
lines that end with digits:
import re
digatend = re.compile(r'\d$', re.MULTILINE)
if digatend.search(open('afile.txt').read( )):
    print "some lines end with digits"
else:
    print "no lines end with digits"
A pattern of r'\d\n' would be almost equivalent, but in that case the
search would fail if the very last character of the file were a digit
not followed by a terminating end-of-line character. With the example
above, the search succeeds if a digit is at the very end of the file's
contents, as well as in the more usual case where a digit is followed by
an end-of-line character.
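The effect of the MULTILINE flag on $ and \Z can be seen without a file by using an in-memory string. This is a sketch in Python 3 syntax; the sample text is invented.

```python
import re

text = 'line one 1\nline two\nline three 3'

# Without MULTILINE, $ anchors only at the end of the whole string.
print(re.findall(r'\d$', text))                  # ['3']
# With MULTILINE, $ anchors at the end of every line.
print(re.findall(r'\d$', text, re.MULTILINE))    # ['1', '3']
# \Z anchors at the end of the whole string regardless of MULTILINE.
print(re.findall(r'\d\Z', text, re.MULTILINE))   # ['3']
```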
9.7.9 Regular Expression Objects
A regular expression object r has the following read-only attributes
detailing how r was built (by function compile of module re, covered
later in this chapter):
flags
The flags argument passed to compile, or 0 when flags is omitted
groupindex
A dictionary whose keys are group names as defined by elements
(?P<id>...); the corresponding values are the named groups' numbers
pattern
The pattern string from which r is compiled
These attributes make it easy to get back from a compiled regular
expression object to its pattern string and flags, so you never have to
store those separately.
A regular expression object r also supplies methods to locate matches
for r's regular expression within a string, as well as to perform
substitutions on such matches. Matches are generally represented by
special objects, covered in the later Section 9.7.10.
findall
r.findall(s)
When r has no groups, findall returns a list of strings, each a
substring of s that is a non-overlapping match with r. For example,
here's how to print out all words in a file, one per line:
import re
reword = re.compile(r'\w+')
for aword in reword.findall(open('afile.txt').read( )):
    print aword
When r has one group, findall also returns a list of strings, but each
is the substring of s matching r's group. For example, if you want to
print only words that are followed by whitespace (not punctuation), you
need to change only one statement in the previous example:
reword = re.compile(r'(\w+)\s')
When r has n groups (where n is greater than 1), findall returns a list
of tuples, one per non-overlapping match with r. Each tuple has n items,
one per group of r, the substring of s matching the group. For example,
here's how to print the first and last word of each line that has at
least two words:
import re
first_last = re.compile(r'^\W*(\w+)\b.*\b(\w+)\W*$',
                        re.MULTILINE)
for first, last in first_last.findall(open('afile.txt').read( )):
    print first, last
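The three shapes of findall's return value (no group, one group, several groups) can be compared on one in-memory string instead of a file. The sketch uses Python 3 syntax and made-up sample text.

```python
import re

text = 'alpha beta gamma\nsolo\none two'

# No groups: a list of whole matches.
print(re.findall(r'\w+', text))
# One group: a list of what the group matched (words followed by whitespace).
print(re.findall(r'(\w+)\s', text))
# Two groups: a list of tuples, one per line that has at least two words.
first_last = re.compile(r'^\W*(\w+)\b.*\b(\w+)\W*$', re.MULTILINE)
print(first_last.findall(text))
```

The single-word line 'solo' yields no tuple, because the pattern needs two separate words on the same line.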
match
r.match(s,start=0,end=sys.maxint)
Returns an appropriate match object when a substring of s, starting at
index start and not reaching as far as index end, matches r. Otherwise,
match returns None. Note that match is implicitly anchored at the
starting position start in s. To search for a match with r through s,
from start onwards, call r.search, not r.match. For example, here's how
to print all lines in a file that start with digits:
import re
digs = re.compile(r'\d+')
for line in open('afile.txt'):
    if digs.match(line): print line,
search
r.search(s,start=0,end=sys.maxint)
Returns an appropriate match object for the leftmost substring of s,
starting not before index start and not reaching as far as index end,
that matches r. When no such substring exists, search returns None. For
example, to print all lines containing digits, one simple approach is as
follows:
import re
digs = re.compile(r'\d+')
for line in open('afile.txt'):
    if digs.search(line): print line,
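The optional start and end arguments of match and search are easy to overlook. The sketch below (Python 3 syntax, with a made-up sample string) shows how they shift the anchoring position and limit how far a match can reach.

```python
import re

digs = re.compile(r'\d+')
s = 'ab12cd345'

print(digs.match(s))                  # None: no digits at index 0
print(digs.match(s, 2).group())       # '12': match is anchored at index 2
print(digs.search(s, 4).group())      # '345': search begins at index 4
print(digs.search(s, 0, 3).group())   # '1': the match may not reach index 3
```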
split
r.split(s,maxsplit=0)
Returns a list L of the splits of s by r (i.e., the substrings of s that
are separated by non-overlapping, non-empty matches with r). For
example, to eliminate all occurrences of substring 'hello' from a
string, in any mix of lowercase and uppercase letters, one way is:
import re
rehello = re.compile(r'hello', re.IGNORECASE)
astring = ''.join(rehello.split(astring))
When r has n groups, n more items are interleaved in L between each pair
of splits. Each of the n extra items is the substring of s matching r's
corresponding group in that match, or None if that group did not
participate in the match. For example, here's one way to remove
whitespace only when it occurs between a colon and a digit:
import re
re_col_ws_dig = re.compile(r'(:)\s+(\d)')
astring = ''.join(re_col_ws_dig.split(astring))
If maxsplit is greater than 0, at most maxsplit splits are in L, each
followed by n items as above, while the trailing substring of s after
maxsplit matches of r, if any, is L's last item. For example, to remove
only the first occurrence of substring 'hello' rather than all of them,
change the last statement in the first example above to:
astring = ''.join(rehello.split(astring, 1))
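Printing the intermediate lists makes it clearer what split returns in each case. This is a sketch in Python 3 syntax with invented sample strings; maxsplit is passed by keyword here purely as a style choice.

```python
import re

rehello = re.compile(r'hello', re.IGNORECASE)
text = 'Hello world, hello again, HELLO!'

print(rehello.split(text))               # ['', ' world, ', ' again, ', '!']
print(rehello.split(text, maxsplit=1))   # ['', ' world, hello again, HELLO!']

# With groups in the pattern, the groups' matches are interleaved in the list,
# so joining the pieces drops only the whitespace between colon and digit.
re_col_ws_dig = re.compile(r'(:)\s+(\d)')
pieces = re_col_ws_dig.split('time:  12, note: none')
print(pieces)                            # ['time', ':', '1', '2, note: none']
print(''.join(pieces))                   # time:12, note: none
```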
sub
r.sub(repl,s,count=0)
Returns a copy of s where non-overlapping matches with r are replaced by
repl, which can be either a string or a callable object, such as a
function. An empty match is replaced only when not adjacent to the
previous match. When count is greater than 0, only the first count
matches of r within s are replaced. When count equals 0, all matches of
r within s are replaced. For example, here's another way to remove only
the first occurrence of substring 'hello' in any mix of cases:
import re
rehello = re.compile(r'hello', re.IGNORECASE)
astring = rehello.sub('', astring, 1)
Without the final 1 argument to method sub, this example would remove
all occurrences of 'hello'.
When repl is a callable object, repl must accept a single argument (a
match object) and return a string to use as the replacement for the
match. In this case, sub calls repl, with a suitable match-object
argument, for each match with r that sub is replacing. For example, to
uppercase all occurrences of words starting with 'h' and ending with 'o'
in any mix of cases, you can use the following:
import re
h_word = re.compile(r'\bh\w+o\b', re.IGNORECASE)
def up(mo): return mo.group(0).upper( )
astring = h_word.sub(up, astring)
Method sub is a good way to get a callback to a callable you supply for
every non-overlapping match of r in s, without an explicit loop, even
when you don't need to perform any substitution. The following example
shows this by using the sub method to build a function that works just
like method findall for a regular expression without groups:
import re
def findall(r, s):
    result = [ ]
    def foundOne(mo): result.append(mo.group( ))
    r.sub(foundOne, s)
    return result
The example needs Python 2.2, not just because it uses lexically nested
scopes, but because in Python 2.2 re tolerates repl returning None and
treats it as if it returned '', while in Python 2.1 re was more pedantic
and insisted on repl returning a string.
When repl is a string, sub uses repl itself as the replacement, except
that it expands back references. A back reference is a substring of repl
of the form \g<id>, where id is the name of a group in r (as
established by syntax (?P<id>...) in r's pattern string), or \dd,
where dd is one or two digits, taken as a group number. Each back
reference, whether named or numbered, is replaced with the substring of
s matching the group of r that the back reference indicates. For
example, here's how to enclose every word in braces:
import re
grouped_word = re.compile(r'(\w+)')
astring = grouped_word.sub(r'{\1}', astring)
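The callable and string forms of repl can be tried on short strings. The following is a sketch in Python 3 syntax; the sample text and the group name word are assumptions made for this illustration.

```python
import re

text = 'hello Halo, hippo says hero'

# Callable replacement: uppercase words that start with 'h' and end with 'o'.
h_word = re.compile(r'\bh\w+o\b', re.IGNORECASE)
def up(mo):
    return mo.group(0).upper()
print(h_word.sub(up, text))                          # HELLO HALO, HIPPO says HERO

# String replacement with a numbered back reference.
grouped_word = re.compile(r'(\w+)')
print(grouped_word.sub(r'{\1}', 'two words'))        # {two} {words}

# The same replacement with a named group and a \g<id> back reference.
named_word = re.compile(r'(?P<word>\w+)')
print(named_word.sub(r'{\g<word>}', 'two words'))    # {two} {words}
```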
subn
r.subn(repl,s,count=0)
subn is the same as sub, except that subn returns a pair (new_string,
n) where n is the number of substitutions that subn has performed. For
example, to count the number of occurrences of substring 'hello' in any
mix of cases, one way is:
import re
rehello = re.compile(r'hello', re.IGNORECASE)
junk, count = rehello.subn('', astring)
print 'Found', count, 'occurrences of "hello"'
9.7.10 Match Objects
Match objects are created and returned by methods match and search of a
regular expression object. They are also implicitly created by methods
sub and subn when argument repl is callable, since in that case a
suitable match object is passed as the actual argument on each call to
repl. A match object m supplies the following attributes detailing how m
was created:
pos
The start argument that was passed to search or match (i.e., the index
into s where the search for a match began)
endpos
The end argument that was passed to search or match (i.e., the index
into s before which the matching substring of s had to end)
lastgroup
The name of the last-matched group (None if the last-matched group has
no name, or if no group participated in the match)
lastindex
The integer index (1 and up) of the last-matched group (None if no group
participated in the match)
re
The regular expression object r whose method created m
string
The string s passed to match, search, sub, or subn
A match object m also supplies several methods.
end, span, start
m.end(groupid=0)
m.span(groupid=0)
m.start(groupid=0)
These methods return the delimiting indices, within m.string, of the
substring matching the group identified by groupid, where groupid can be
a group number or name. When the matching substring is m.string[i:j],
m.start returns i, m.end returns j, and m.span returns (i, j). When the
group did not participate in the match, i and j are -1.
expand
m.expand(s)
Returns a copy of s where escape sequences and back references are
replaced in the same way as for method r.sub, covered in the previous
section.
group
m.group(groupid=0,*groupids)
When called with a single argument groupid (a group number or name),
group returns the substring matching the group identified by groupid, or
None if that group did not participate in the match. The common idiom
m.group( ), also spelled m.group(0), returns the whole matched
substring, since group number 0 implicitly means the whole regular
expression.
When group is called with multiple arguments, each argument must be a
group number or name. group then returns a tuple with one item per
argument, the substring matching the corresponding group, or None if
that group did not participate in the match.
groups
m.groups(default=None)
Returns a tuple with one item per group in r. Each item is the substring
matching the corresponding group, or default if that group did not
participate in the match.
groupdict
m.groupdict(default=None)
Returns a dictionary whose keys are the names of all named groups in r.
The value for each name is the substring matching the corresponding
group, or default if that group did not participate in the match.
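The match-object methods described above can be explored on a pattern with one group that always participates and one that may not. This sketch uses Python 3 syntax; the pattern, the group names key and value, and the sample strings are all hypothetical.

```python
import re

pattern = re.compile(r'(?P<key>\w+)=(?P<value>\d+)?')

m = pattern.search('size=42; color=')
print(m.group())                    # size=42
print(m.group('key'), m.group(2))   # size 42
print(m.span('value'))              # (5, 7)
print(m.groups())                   # ('size', '42')
print(m.groupdict())                # {'key': 'size', 'value': '42'}
print(m.lastgroup, m.lastindex)     # value 2

# Here the optional 'value' group does not participate in the match.
m2 = pattern.search('color=')
print(m2.group('value'))            # None
print(m2.span('value'))             # (-1, -1)
print(m2.groups(default=''))        # ('color', '')
print(m2.expand(r'\g<key> has no value'))   # color has no value
```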
9.7.11 Functions of Module re
The re module supplies the attributes listed in the earlier section
Section 9.7.6. It also provides a function that corresponds to each
method of a regular expression object (findall, match, search, split,
sub, and subn), each with an additional first argument, a pattern string
that the function implicitly compiles into a regular expression object.
It's generally preferable to compile pattern strings into regular
expression objects explicitly and call the regular expression object's
methods, but sometimes, for a one-off use of a regular expression
pattern, calling functions of module re can be slightly handier. For
example, to count the number of occurrences of substring 'hello' in any
mix of cases, one function-based way is:
import re
junk, count = re.subn(r'(?i)hello', '', astring)
print 'Found', count, 'occurrences of "hello"'
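In cases such as this one, regular expression options (here, for
example, case insensitivity) can be encoded as regular expression
pattern elements (here, (?i)). In the Python versions this section was
written for, functions sub and subn of module re did not accept a flags
argument; Python 2.7 and 3.1 added an optional flags argument to both.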
Module re also supplies error, the class of exceptions raised upon
errors (generally, errors in the syntax of a pattern string), and two
additional functions.
compile
compile(pattern,flags=0)
Creates and returns a regular expression object, parsing string pattern
as per the syntax covered in Section 9.7.1, and using integer flags as
in the section Section 9.7.6, both earlier in this chapter.
escape
escape(s)
Returns a copy of string s where each non-alphanumeric character is
escaped (i.e., preceded by a backslash \). This is handy when you need
to match string s literally as part (or all) of a regular expression
pattern string.
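The two module-level extras, escape and error, can be seen together in a few lines. This is a sketch in Python 3 syntax; the sample strings are made up.

```python
import re

literal = '1+1=2 (really?)'
haystack = 'is 1+1=2 (really?)'

# Unescaped, the +, (, ), and ? are read as pattern syntax, so the literal
# text is not found.
print(re.search(literal, haystack))                      # None
# Escaped, every special character matches itself.
print(re.search(re.escape(literal), haystack).group())   # 1+1=2 (really?)

# re.error is raised when a pattern string is malformed.
try:
    re.compile('(unclosed')
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print('bad pattern:', exc)
```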
Searching for Elvis
Suppose you spend all your free time scanning documents looking for evidence that Elvis is still alive. You could search with the following regular expression:
1. elvis
Find elvis
This is a perfectly valid regular expression that searches for an exact sequence of characters. In .NET, you can easily set options to ignore the case of characters, so this expression will match "Elvis", "ELVIS", or "eLvIs". Unfortunately, it will also match the last five letters of the word "pelvis". We can improve the expression as follows:
2. \belvis\b
Find elvis as a whole word
Now things are getting a little more interesting. The "\b" is a
special code that means, "match the position at the beginning or end of
any word". This expression will only match complete words spelled
"elvis" with any combination of lower case or capital letters.
Suppose you want to find all lines in which the word "elvis" is followed
by the word "alive." The period or dot "." is a special code that
matches any character other than a newline. The asterisk "*" means
repeat the previous term as many times as necessary to guarantee a
match. Thus, ".*" means "match any number of characters other than
newline". It is now a simple matter to build an expression that means
"search for the word 'elvis' followed on the same line by the word
'alive'."
3. \belvis\b.*\balive\b
Find text with "elvis" followed by "alive"
With just a few special characters we are beginning to build powerful regular expressions, and they are already becoming hard for us humans to read.
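The three Elvis expressions behave the same way in Python's re module, with re.IGNORECASE standing in for the case-insensitive option. The sketch below uses Python 3 syntax and a made-up three-line sample text.

```python
import re

text = 'Elvis is alive\nmy pelvis hurts\nELVIS was seen; alive and well'

# 1. Plain character sequence: also hits the tail of "pelvis".
print(re.findall(r'elvis', text, re.IGNORECASE))
# ['Elvis', 'elvis', 'ELVIS']

# 2. Whole word only.
print(re.findall(r'\belvis\b', text, re.IGNORECASE))
# ['Elvis', 'ELVIS']

# 3. "elvis" followed on the same line by "alive".
print(re.findall(r'\belvis\b.*\balive\b', text, re.IGNORECASE))
# ['Elvis is alive', 'ELVIS was seen; alive']
```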
Let's try another example.
import re
str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)
# If-statement after search() tests if it succeeded
if match:
    print 'found', match.group() ## 'found word:cat'
else:
    print 'did not find'
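In JavaScript, you can construct a regular expression object with the RegExp constructor, RegExp(). More commonly, you can define a regular expression object with literal syntax. As with a string, the content of the expression is delimited on both sides, here by slashes (/).
Literal characters
Characters that begin with a backslash have special meaning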
Determining the Validity of Phone Numbers
Suppose your web page collects a customer's seven-digit phone number and you want to verify that the phone number is in the correct format, "xxx-xxxx", where each "x" is a digit. The following expression will search through text looking for such a string:
4. \b\d\d\d-\d\d\d\d
Find seven-digit phone number
Each "\d" means "match any single digit". The "-" has no
special meaning and is interpreted literally, matching a hyphen. To
avoid the annoying repetition, we can use a shorthand notation that
means the same thing:
5. \b\d{3}-\d{4}
Find seven-digit phone number a better way
The "{3}" following the "\d" means "repeat the preceding
character three times".
Let's learn how to test this expression.
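One way to test the two phone-number expressions is to run them over the same text with Python's re module. This is a sketch in Python 3 syntax; the sample text and the names long_form and short_form are made up for illustration.

```python
import re

long_form = re.compile(r'\b\d\d\d-\d\d\d\d')
short_form = re.compile(r'\b\d{3}-\d{4}')

text = 'Call 555-1234 or 5551234, not x555-9876.'

# Both spellings find the same number; the run without a hyphen and the
# number glued to a letter (no word boundary before it) are skipped.
print(long_form.findall(text))    # ['555-1234']
print(short_form.findall(text))   # ['555-1234']
```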
The code match = re.search(pat, str) stores the search result in a
variable named "match". Then the if-statement tests the match -- if true
the search succeeded and match.group() is the matching text (e.g.
'word:cat'). Otherwise if the match is false (None to be more specific),
then the search did not succeed, and there is no matching text.
The 'r' at the start of the pattern string designates a Python "raw"
string, which passes backslashes through without change. This is very
handy for regular expressions (Java needs this feature badly!). I
recommend that you always write pattern strings with the 'r' just as a
habit.
| Character | Matches |
|---|---|
| Letters and digits | Itself |