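Patterns are already familiar from the shell, e.g., "*" in the shell's *.txt.

In what follows, REs are written in guillemets, as in «*» and «+», and strings in double quotes.

The simplest RE is a fixed string, which re.search finds anywhere in the text (much like the in operator).

fixed_match.py:

    import re

    # Each entry pairs a DNA sequence with the dragon species it came from.
    dragons = [
        ['CTAGGTGTACTGATG',    'Antipodean Opaleye'],
        ['AAGATGCGTCCGTAT',    'Common Welsh Green'],
        ['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
        ['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
        ['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
    ]

    # Report every species whose DNA contains the fixed string 'ATGCGT'.
    for (dna, name) in dragons:
        if re.search('ATGCGT', dna):
            print(name)

fixed_match.out:

    Common Welsh Green
    Hungarian Horntail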
or_match.py:

    import re

    dragons = [
        ['CTAGGTGTACTGATG',    'Antipodean Opaleye'],
        ['AAGATGCGTCCGTAT',    'Common Welsh Green'],
        ['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
        ['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
        ['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
    ]

    # Report every species whose DNA contains either 'ATGCGT' or 'GCT'.
    for (dna, name) in dragons:
        if re.search('ATGCGT|GCT', dna):
            print(name)

or_match.out:

    Common Welsh Green
    Hebridean Black
    Hungarian Horntail
    Norwegian Ridgeback
«|» means “or”: «ATGCGT|GCT» matches "ATGCGT" or "GCT".

What about matching "ATA" or "ATC" (both of which code for isoleucine)?

«ATA|C» will not work: it matches either "ATA" or "C".

«ATA|ATC» will work, but it's a bit redundant.

Parentheses group the alternatives, as in «AT(A|C)».

precedence.py:

    import re

    # Each entry pairs a test string with whether AT(A|C) should match it.
    tests = [
        ['ATA',   True],
        ['xATCx', True],
        ['ATG',   False],
        ['AT',    False],
        ['ATAC',  True]
    ]

    for (dna, expected) in tests:
        actual = re.search('AT(A|C)', dna) is not None
        assert actual == expected

The asserts will crash the program if any of the tests fail.

What about matching a literal "|", "(", or ")"? Use «\|», «\(», or «\)» in the RE, and «\\» to match a backslash.

Python strings use backslash as an escape character too, so in Python source these are written "\\|", "\\(", "\\)", or "\\\\".

Figure 10.1: Double Compilation of Regular Expressions
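The two layers of escaping can be checked directly. The following is a minimal sketch with made-up sample strings (they are not from the lecture): Python's string parser first reduces each doubled backslash to a single one, and the RE engine then interprets what is left.

```python
import re

# Python's string parser turns each doubled backslash into one backslash,
# so the RE engine receives the two characters \ and |.
pattern = "\\|"
assert len(pattern) == 2

# The RE engine then reads \| as "a literal vertical bar".
assert re.search(pattern, "a|b") is not None
assert re.search(pattern, "ab") is None

# Matching one literal backslash takes four backslashes in the source code.
backslash = "\\\\"
assert len(backslash) == 2
assert re.search(backslash, "C:\\temp") is not None
assert re.search(backslash, "no slashes here") is None
```

Checking the length of each pattern string is a quick way to see what the RE engine actually receives.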
Raw strings are written r'abc' or r"this\nand\nthat". Backslashes in a raw string are left alone, so r'\n' is a string containing the two characters "\" and "n", not a newline.

In the shell, "*" matches zero or more characters. In an RE, «*» is an operator that means, “match zero or more occurrences of a pattern”.

For example, «TTAG*CTA» matches strings in which "TTA" and "CTA" are separated by any number of "G".

star.py:

    import re

    tests = [
        ['TTACTA',    True],   # separated by zero G's
        ['TTAGCTA',   True],   # separated by one G
        ['TTAGGGCTA', True],   # separated by three G's
        ['TTAXCTA',   False],  # an X in the way
        ['TTAGCGCTA', False],  # an embedded C in the way
    ]

    for (dna, expected) in tests:
        actual = re.search('TTAG*CTA', dna) is not None
        assert actual == expected

The RE matches "TTACTA" because «G*» can match zero occurrences of "G".

Figure 10.2: Zero or More
«+» matches one or more (i.e., won't match the empty string).

plus.py:

    import re

    # G* accepts zero G's between TTA and CTA; G+ requires at least one.
    assert re.search('TTAG*CTA', 'TTACTA')
    assert not re.search('TTAG+CTA', 'TTACTA')

Figure 10.3: One or More
The «?» operator means “optional”.

optional.py:

    import re

    # C? accepts zero or one C between A and T, but not two.
    assert re.search('AC?T', 'AT')
    assert re.search('AC?T', 'ACT')
    assert not re.search('AC?T', 'ACCT')

Figure 10.4: Zero or One
Use «[]» to match sets of characters.

«[abcd]» matches exactly one "a", "b", "c", or "d". The range can be abbreviated as «[a-d]».

Sets can be combined with «*», «+», or «?»: «[aeiou]+» matches any non-empty sequence of vowels.

find_numbers.py:

    import re

    lines = [
        "Charles Darwin (1809-82)",
        "Darwin's principal works, The Origin of Species (1859)",
        "and The Descent of Man (1871) marked a new epoch in our",
        "understanding of our world and ourselves. His ideas",
        "were shaped by the Beagle's voyage around the world in",
        "1831-36."
    ]

    # Print every line that contains at least one digit.
    for line in lines:
        if re.search('[0-9]+', line):
            print(line)

find_numbers.out:

    Charles Darwin (1809-82)
    Darwin's principal works, The Origin of Species (1859)
    and The Descent of Man (1871) marked a new epoch in our
    1831-36.
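As a quick check of the set notation, the sketch below uses illustrative sample strings (not from the lecture) to confirm that «[a-d]» and «[abcd]» accept the same characters. It also uses re.findall, which this lecture introduces later, to list the runs matched by «[aeiou]+».

```python
import re

# The range [a-d] is shorthand for the explicit set [abcd].
for ch in 'abcdeA':
    in_set = re.search('[abcd]', ch) is not None
    in_range = re.search('[a-d]', ch) is not None
    assert in_set == in_range

# [aeiou]+ matches each non-empty run of vowels.
vowel_runs = re.findall('[aeiou]+', 'sequoia tree')
print(vowel_runs)
```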
Sequence | Equivalent | Explanation |
---|---|---|
«\d» | «[0-9]» | Digits |
«\s» | «[ \t\n\r\f\v]» | Whitespace |
«\w» | «[a-zA-Z0-9_]» | Word characters (i.e., those allowed in variable names) |
Table 10.1: Regular Expression Escapes in Python
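The equivalences in Table 10.1 can be spot-checked on a small sample. This is a sketch, assuming plain ASCII text (the sample string is made up): in Python 3, «\d», «\s», and «\w» also match non-ASCII digits, whitespace, and letters in ordinary strings unless the re.ASCII flag is given, so the explicit sets are only equivalent for ASCII input.

```python
import re

text = 'x_1 = 42;\ty2 = 7'

digits = re.findall(r'\d+', text)   # runs of digits
words = re.findall(r'\w+', text)    # runs of word characters
pieces = re.split(r'\s+', text)     # split on runs of whitespace

# For ASCII text, the escapes behave like the explicit character sets.
assert digits == re.findall('[0-9]+', text)
assert words == re.findall('[a-zA-Z0-9_]+', text)
assert pieces == re.split('[ \t\r\n]+', text)
```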
«[^abc]» means “anything except the characters in this set”.

«.» means “any character except the end of line”, i.e., it is equivalent to «[^\n]».

«\b» matches the break between word and non-word characters.

Figure 10.5: Word/Non-Word Breaks

The example below uses string.split to break the text on spaces and newlines before applying the RE.

end_in_vowel.py:

    import re

    words = '''Born in New York City in 1918, Richard Feynman earned a
    bachelor's degree at MIT in 1939, and a doctorate from Princeton in
    1942. After working on the Manhattan Project in Los Alamos during
    World War II, he became a professor at CalTech in 1951. Feynman won
    the 1965 Nobel Prize in Physics for his work on quantum
    electrodynamics, and served on the commission investigating the
    Challenger disaster in 1986.'''.split()

    # Collect the distinct words in which a vowel is followed by a word break.
    end_in_vowel = set()
    for w in words:
        if re.search(r'[aeiou]\b', w):
            end_in_vowel.add(w)
    for w in end_in_vowel:
        print(w)

end_in_vowel.out (sets are unordered, so the order may vary):

    a
    Prize
    degree
    became
    doctorate
    the
    he
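re.search(r'\s*', line) will match "start end", because «\s*» can match zero characters and re.search does not require the match to cover the whole string.

«^» matches the beginning of the string; «$» matches the end.

Figure 10.6: Anchoring Matches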
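The anchor behaviour summarized in Table 10.2 can be checked directly. The sketch below copies the table's four rows into a list and asserts the stated result for each. The final check is an added illustration of why the anchors matter.

```python
import re

# (pattern, text, should_match) rows taken from Table 10.2.
rows = [
    ('b+',   'abbc',  True),
    ('^b+',  'abbc',  False),
    ('c$',   'abbc',  True),
    ('^a*$', 'aabaa', False),
]

results = [re.search(p, t) is not None for (p, t, _) in rows]
for (pattern, text, expected), actual in zip(rows, results):
    assert actual == expected

# Without anchors, a* succeeds on the same text by matching only a prefix.
unanchored = re.search('a*', 'aabaa')
assert unanchored.group() == 'aa'
```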
Pattern | Text | Result |
---|---|---|
«b+» | "abbc" | Matches |
«^b+» | "abbc" | Fails (string doesn't start with "b") |
«c$» | "abbc" | Matches (string ends with "c") |
«^a*$» | "aabaa" | Fails (something other than "a" between start and end of string) |
Table 10.2: Regular Expression Anchors in Python

A comment starts with "#", and extends to the end of the line. A first attempt splits each line on "#".

comment_err.py:

    import re

    lines = '''Date: 2006-03-07
    On duty: HP # 01:30 - 03:00
    Observed: Common Welsh Green
    On duty: RW #03:00-04:30
    Observed: none
    On duty: HG # 04:30-06:00
    Observed: Hebridean Black
    '''.split('\n')

    # Keep whatever follows the first '#' on each line that has one.
    for line in lines:
        if re.search('#', line):
            comment = line.split('#')[1]
            print(comment)

comment_err.out (each comment keeps any whitespace that followed the "#"):

     01:30 - 03:00
    03:00-04:30
     04:30-06:00
Using split followed by strip seems clumsy.

The value returned by re.search is actually a match object that records what matched, and where. For a match object mo, mo.group() returns the whole string that matched the RE, and mo.start() and mo.end() are the indices of the match's location.

match_object.py:

    import re

    text = 'abbcb'
    for pattern in ['b+', 'bc*', 'b+c+']:
        match = re.search(pattern, text)
        # Show the matched text and the indices where it starts and ends.
        print('%s / %s => "%s" (%d, %d)' %
              (pattern, text, match.group(), match.start(), match.end()))

match_object.out:

    b+ / abbcb => "bb" (1, 3)
    bc* / abbcb => "b" (1, 2)
    b+c+ / abbcb => "bbc" (1, 4)
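One caveat about match_object.py: every pattern in it happens to match, so calling group() on the result is safe. When nothing matches, re.search returns None, and calling a method on None raises AttributeError. The following is a small sketch of a guard. The helper name describe is hypothetical, introduced here only for illustration.

```python
import re

def describe(pattern, text):
    """Return (matched_text, start, end), or None if pattern is not found."""
    mo = re.search(pattern, text)
    if mo is None:
        return None
    return (mo.group(), mo.start(), mo.end())

print(describe('b+', 'abbcb'))   # a match: text plus its start and end indices
print(describe('x+', 'abbcb'))   # no match: None
```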
Each parenthesized subexpression in the RE is recorded as a numbered group. For a match object mo, mo.group(3) is the text that matched the third subexpression, and mo.start(3) is where it started.

comment.py:

    import re

    lines = '''Date: 2006-03-07
    On duty: HP # 01:30 - 03:00
    Observed: Common Welsh Green
    On duty: RW #03:00-04:30
    Observed: none
    On duty: HG # 04:30-06:00
    Observed: Hebridean Black
    '''.split('\n')

    # Skip the '#' and any whitespace after it; capture the rest of the line.
    for line in lines:
        match = re.search(r'#\s*(.+)', line)
        if match:
            comment = match.group(1)
            print(comment)

comment.out:

    01:30 - 03:00
    03:00-04:30
    04:30-06:00
reverse_cols.py:

    import re

    def reverse_columns(line):
        # Swap the columns only if the line holds exactly two numbers.
        match = re.search(r'^\s*(\d+)\s+(\d+)\s*$', line)
        if not match:
            return line
        return match.group(2) + ' ' + match.group(1)

    tests = [
        ['10 20',    'easy case'],
        [' 30 40 ',  'padding'],
        ['60 70 80', 'too many columns'],
        ['90 end',   'non-numeric']
    ]

    for (fixture, title) in tests:
        actual = reverse_columns(fixture)
        print('%s: "%s" => "%s"' % (title, fixture, actual))

reverse_cols.out:

    easy case: "10 20" => "20 10"
    padding: " 30 40 " => "40 30"
    too many columns: "60 70 80" => "60 70 80"
    non-numeric: "90 end" => "90 end"
Figure 10.7: Regular Expressions as Finite State Machines
Use re.compile(pattern) to get the compiled RE. The compiled object has methods that mirror the functions in the re module: matcher.search(text) searches text for matches to the RE that was compiled to create matcher.

title_case.py:

    import re

    # Put pattern outside 'find_all' so that it's only compiled once.
    pattern = re.compile(r'\b([A-Z][a-z]*)\b(.*)')

    def find_all(line):
        result = []
        match = pattern.search(line)
        while match:
            result.append(match.group(1))
            # Search again in the text that followed the previous match.
            match = pattern.search(match.group(2))
        return result

    lines = [
        'This has several Title Case words',
        'on Each Line (Some in parentheses).'
    ]
    for line in lines:
        print(line)
        for word in find_all(line):
            print('\t' + word)

title_case.out:

    This has several Title Case words
        This
        Title
        Case
    on Each Line (Some in parentheses).
        Each
        Line
        Some
The findall method finds every match in one call.

findall.py:

    import re

    lines = [
        'This has several Title Case words',
        'on Each Line (Some in parentheses).'
    ]
    pattern = re.compile(r'\b([A-Z][a-z]*)\b')
    for line in lines:
        print(line)
        # findall returns the text of the group for every match in the line.
        for word in pattern.findall(line):
            print('\t' + word)

findall.out:

    This has several Title Case words
        This
        Title
        Case
    on Each Line (Some in parentheses).
        Each
        Line
        Some
Pattern | Matches | Doesn't Match | Explanation |
---|---|---|---|
«a*» | "", "a", "aa", … | "A", "b" | «*» means “zero or more”; matching is case sensitive |
«b+» | "b", "bb", … | "" | «+» means “one or more” |
«ab?c» | "ac", "abc" | "a", "abbc" | «?» means “optional” (zero or one) |
«[abc]» | "a", "b", or "c" | "ab", "d" | «[…]» means “one character from a set” |
«[a-c]» | "a", "b", or "c" | | Character ranges can be abbreviated |
«[abc]*» | "", "ac", "baabcab", … | | Operators can be combined: zero or more choices from "a", "b", or "c" |
Table 10.3: Regular Expression Operators
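The rows of Table 10.3 can be turned into a small test. The sketch below assumes that “matches” in the table means the pattern accounts for the whole string, so it uses re.fullmatch (available from Python 3.4) rather than re.search. The helper name whole_match is hypothetical.

```python
import re

# (pattern, strings that match in full, strings that do not)
table = [
    ('a*',     ['', 'a', 'aa'],       ['A', 'b']),
    ('b+',     ['b', 'bb'],           ['']),
    ('ab?c',   ['ac', 'abc'],         ['a', 'abbc']),
    ('[abc]',  ['a', 'b', 'c'],       ['ab', 'd']),
    ('[a-c]',  ['a', 'b', 'c'],       []),
    ('[abc]*', ['', 'ac', 'baabcab'], []),
]

def whole_match(pattern, text):
    """True if pattern matches all of text (re.fullmatch needs Python 3.4+)."""
    return re.fullmatch(pattern, text) is not None

for (pattern, good, bad) in table:
    for text in good:
        assert whole_match(pattern, text)
    for text in bad:
        assert not whole_match(pattern, text)

# re.search is more permissive: a* finds an empty match even in "b".
assert re.search('a*', 'b') is not None
```

The last assertion shows why the distinction matters: with re.search, «a*» succeeds on any string at all, because it can match zero characters.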
Method | Purpose | Example | Result |
---|---|---|---|
split | Split a string on a pattern. | re.split('\\s*,\\s*', 'a, b ,c , d') | ['a', 'b', 'c', 'd'] |
findall | Find all matches for a pattern. | re.findall('\\b[A-Z][a-z]*', 'Some words in Title Case.') | ['Some', 'Title', 'Case'] |
sub | Replace matches with new text. | re.sub('\\d+', 'NUM', 'If 123 is 456') | "If NUM is NUM" |
Table 10.4: Regular Expression Object Methods
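The three examples in Table 10.4 can be run as written. The sketch below restates them with raw strings, which avoid the doubled backslashes, and then repeats two of the operations as methods of a compiled RE. The variable names are illustrative.

```python
import re

# The table's examples, written with raw strings instead of doubled backslashes.
parts = re.split(r'\s*,\s*', 'a, b ,c , d')
titles = re.findall(r'\b[A-Z][a-z]*', 'Some words in Title Case.')
masked = re.sub(r'\d+', 'NUM', 'If 123 is 456')

# A raw string and its doubled-backslash form are the same pattern.
assert r'\d+' == '\\d+'

# Compiled REs offer the same operations as methods.
number = re.compile(r'\d+')
assert number.sub('NUM', 'If 123 is 456') == masked
assert number.findall('If 123 is 456') == ['123', '456']
```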
Use «pat{N}» to match exactly N occurrences of a pattern; «pat{M,N}» matches between M and N occurrences.

Exercise 10.1:
By default, regular expression matches are greedy: the first term in the RE matches as much as it can, then the second part, and so on. As a result, if you apply the RE «X(.*)X(.*)» to the string "XaX and XbX", the first group will contain "aX and Xb", and the second group will be empty.
It's also possible to make REs match reluctantly, i.e., to have the parts match as little as possible, rather than as much. Find out how to do this, and then modify the RE in the previous paragraph so that the first group winds up containing "a", and the second group " and XbX".
Exercise 10.2:
What is the easiest way to write a case-insensitive regular expression? (Hint: read the documentation on compilation options.)
Exercise 10.3:
What does the VERBOSE option do when compiling a regular expression? Use it to rewrite some of the REs in this lecture in a more readable way.
Exercise 10.4:
What does the DOTALL option do when compiling a regular expression? Use it to get rid of the call to string.split in the example that finds words ending in vowels.
Copyright © 2005-06 Python Software Foundation.