Python: Strings

Date: 2023-03-09 03:04:07

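I. The concept of a sequence

A sequence is a container type. As the name suggests, you can picture its members standing in an ordered line, each labeled with an index starting from 0: 0, 1, 2, 3, ... One or several members can then be accessed by index, much like an array in C.

II. Sequence type operators (the operators below apply to all sequence types)

1. Membership operators (in, not in)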

'x' in 'china'        # returns False
'e' not in 'pity'     # returns True
12 in [13, 32, 4, 0]  # returns False
'green' not in ['red', 'yellow', 'white']  # returns True
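As a small sketch of how the membership operators behave across sequence types (the variable names here are illustrative and not from the original): on strings, `in` also matches substrings, while on lists and tuples it compares against whole elements.

```python
# Membership tests on three sequence types.
word = 'china'
colors = ['red', 'yellow', 'white']
point = (3, 4)

has_substring = 'chi' in word      # strings also match substrings
has_green = 'green' in colors      # lists compare whole elements
missing_five = 5 not in point      # tuples behave like lists here
partial_match = 're' in colors     # False: 're' is not an element of the list
```

Note the last line: a fragment of an element does not count as a member of a list.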

2. Concatenation operator "+" (only sequences of the same type can be concatenated)

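str1 = 'aaa'
str2 = 'bbb'
str2 = str1 + str2
str2      # returns 'aaabbb'; str2 now refers to a newly created object,
          # because a string is an immutable scalar; you can check this with id()
numList = [1, 3, 5]
numList += [6, 8]
numList   # returns [1, 3, 5, 6, 8]; numList still refers to the original
          # object, because a list is a mutable container; you can check this with id()
(1, 3) + (5, 7)  # returns (1, 3, 5, 7); note that a tuple is an immutable container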
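The object-identity claims in the comments above can be checked with a short sketch. Instead of comparing id() values directly, it keeps a second reference to each original object and uses `is`; all names are illustrative assumptions.

```python
# Rebinding vs. in-place update with "+=".
text = 'aaa'
text_before = text            # second reference to the original string
text += 'bbb'                 # strings are immutable: a new object is created

numbers = [1, 3, 5]
numbers_before = numbers      # second reference to the same list
numbers += [6, 8]             # lists are mutable: extended in place

same_string_object = text is text_before       # False
same_list_object = numbers is numbers_before   # True

# Mixing sequence types with "+" raises TypeError.
try:
    [1, 2] + (3, 4)
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Because `numbers_before` is the same list, it also sees the appended items, while `text_before` still holds 'aaa'.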

3. Repetition operator "*"

"*" repeats a sequence a given number of times, for example:

s = 'hello'   # named s rather than str, to avoid shadowing the built-in str()
s *= 3
s      # returns 'hellohellohello'

alphaList = ['a', 'b', 'c']
alphaList *= 2
alphaList   # returns ['a', 'b', 'c', 'a', 'b', 'c']

('ha', 'ya') * 3  # returns ('ha', 'ya', 'ha', 'ya', 'ha', 'ya')
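A few more uses of "*", as a hedged sketch with made-up names: building a separator line, pre-filling a list, and the edge case of a zero count.

```python
# Repetition builds a new sequence out of repeated copies of the operand.
separator = '-' * 10          # '----------'
zeros = [0] * 5               # [0, 0, 0, 0, 0]
pair = ('ha', 'ya') * 2       # ('ha', 'ya', 'ha', 'ya')
empty = 'abc' * 0             # a count of zero gives an empty sequence
commuted = 3 * 'ab'           # the count may also come first: 'ababab'
```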

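4. Slice operators ([], [:], [::])

Slicing gives access to one or more members of a sequence. As in C, you must make sure the index you access actually exists; otherwise Python raises an exception (IndexError, the counterpart of an out-of-bounds array access in C). If you have used Matlab, slicing will feel very familiar.

Accessing a single member with "[]"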

s = "hello,world"
s[0]    # returns 'h'
s[10]   # returns 'd'
s[11]   # error: index 11 is out of range and raises IndexError

s[-1]   # returns 'd'; negative indices count from the end, a handy idiom worth using
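To make the out-of-range behavior concrete, here is a minimal sketch that catches the exception and also shows that `s[-1]` is shorthand for `s[len(s) - 1]`. The variable names are illustrative.

```python
s = 'hello,world'

# A negative index counts from the end of the sequence.
last_index = len(s) - 1
same_member = s[-1] == s[last_index]

# Indexing one past the end raises IndexError.
try:
    s[len(s)]
    error_name = None
except IndexError as exc:
    error_name = type(exc).__name__
```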

Accessing a contiguous run of members with "[starting_index : ending_index]"

s = 'hello,world'
s[:]    # returns 'hello,world', i.e. all members
s[:5]   # returns 'hello'; the omitted first value defaults to 0
s[5:]   # returns ',world'; the omitted second value defaults to the end

s[1:-1] # returns 'ello,worl'
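A slice includes the member at the starting index and excludes the one at the ending index. The sketch below (illustrative names) shows a consequence of that rule: splitting at any index and rejoining gives back the original.

```python
s = 'hello,world'

head = s[:5]              # indices 0..4
tail = s[5:]              # indices 5..end
rejoined = head + tail    # the two halves never overlap or leave a gap

middle = s[1:-1]          # drops the first and last members
slice_length = len(s[2:7])   # ending index minus starting index = 5
```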

Accessing members at evenly spaced indices with "[starting_index : ending_index : step_length]"

(1, 2, 3, 4, 5, 6)[0:6:2]  # returns (1, 3, 5)
s = "abcdefg"
s[::-1]   # returns 'gfedcba', an instant reversal; omitted bounds default to the start and end

numList = [1, 34, 23, 0, 342]
numList[-100:100:1]  # returns [1, 34, 23, 0, 342]
                     # in a slice, out-of-range bounds do not raise an error
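A short sketch of stepped slices and of how slice bounds are clamped rather than rejected. The names are illustrative, and the `slice` object at the end is just another spelling of `[::-1]`.

```python
letters = 'abcdefg'

every_other = letters[::2]        # 'aceg'
backwards_by_two = letters[::-2]  # 'geca'

# Out-of-range bounds are clamped to the sequence, unlike single indices.
clamped = letters[-100:100]       # 'abcdefg'
beyond = letters[50:60]           # '' (nothing there, but no error)

# The same reversal written with an explicit slice object.
reverse = slice(None, None, -1)
reversed_letters = letters[reverse]
```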

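III. Built-in functions for sequences

The built-in functions list(), str() and tuple() "convert" between the various sequence types (conversion is only the appearance; what really happens is that a new object is created, since these are all "factory" functions):

list(iter)           converts an iterable to a list

str(obj)            converts obj to a string

unicode(obj)   converts obj to a Unicode string (Python 2 only)

tuple(iter)       converts an iterable to a tuple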

tuple1 = (1, 2, 3, 4)
list1 = list(tuple1)
list1  # returns [1, 2, 3, 4]

list2 = ['gr', 'ct', 'ga']
tuple2 = tuple(list2)
tuple2 # returns ('gr', 'ct', 'ga')

str(tuple2)     # returns "('gr', 'ct', 'ga')"
unicode(tuple2) # returns u"('gr', 'ct', 'ga')"
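The point that these factory functions create new objects can be checked directly. This is a minimal sketch with illustrative names; it leaves out unicode() so that it also runs on Python 3.

```python
original = [1, 2, 3, 4]

as_tuple = tuple(original)     # a new tuple with the same members
duplicate = list(original)     # a new list, not another name for original
duplicate.append(5)            # changing the copy leaves original alone

chars = list('abc')            # a string is iterable, member by member
text = str(as_tuple)           # the printable form, not a sequence of numbers
```

After this runs, `original` still has four members and `duplicate` has five.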

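Some other commonly used functions:

len(seq)                returns the length of the sequence

max()                    returns the largest value in a sequence or among its arguments; see the examples

min()                    returns the smallest value in a sequence or among its arguments; see the examples

sum()                    returns the sum of the elements of a list of numbers

enumerate(iter)   takes an iterable and returns an enumerate object that yields, for each member of iter, a tuple of its index and its item; see the examples

reversed(seq)      returns a reverse iterator over a sequence

sorted()               sorts a sequence and returns a new sorted list; the ordering can be customized, see help(sorted)

zip()                    returns a list of tuples (in Python 2), where the i-th tuple holds the i-th member of each argument; see the examples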

# example of enumerate()
numList = [1, 3, 5, 7]
enumExample = enumerate(numList)
list(enumExample)  # returns [(0, 1), (1, 3), (2, 5), (3, 7)]
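In practice enumerate() is mostly used directly in a for loop, unpacking each (index, item) tuple as it goes. A small sketch with made-up data:

```python
fruits = ['apple', 'banana', 'cherry']

labels = []
for index, item in enumerate(fruits):
    # Each iteration receives one (index, item) tuple, unpacked into two names.
    labels.append('%d: %s' % (index, item))
```

This avoids keeping a separate counter variable alongside the loop.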
# example of len()
st = 'hello'
len(st)                  # returns 5
listDemo = [1, 43, 4, 54]
len(listDemo)            # returns 4

# example of max()
max(listDemo)            # returns 54
max(234, 34, 34)         # returns 234

# example of min(), which works just like max()
min(listDemo)            # returns 1

# example of sum()
sum(listDemo)            # returns 102

# example of reversed()
tupleDemo = (1, 6, 4)
tupleIter = reversed(tupleDemo)
list(tupleIter)          # returns [4, 6, 1]

# example of sorted()
sorted([1, 90, 23, 8])   # returns [1, 8, 23, 90]

# example of zip()
st = "i love you"
zipDemo = zip(st[0], st[-1], st[-2])
zipDemo                  # returns [('i', 'u', 'o')]
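The function list mentions that sorted() accepts a custom ordering. As a hedged sketch (the data is made up), the `key` and `reverse` keyword arguments cover the common cases, and max(), min() and reversed() work on strings too.

```python
words = ['pear', 'fig', 'banana']

by_length = sorted(words, key=len)                   # shortest first
descending = sorted([1, 90, 23, 8], reverse=True)    # largest first
words_untouched = words == ['pear', 'fig', 'banana'] # sorted() returns a new list

largest_char = max('hello')                # characters compare by code point
smallest_char = min('hello')
backwards = ''.join(reversed('hello'))     # join the reverse iterator back into a string
```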
# another example of zip()
keys = 'abcdefg'
values = '!@#$%^&'
for key, value in zip(keys, values):
    print key, '--->', value
# Output from the interpreter (zip() returns a list object):
# a ---> !
# b ---> @
# c ---> #
# d ---> $
# e ---> %
# f ---> ^
# g ---> &
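Two more things zip() is commonly used for, sketched with illustrative names: building a dictionary from parallel sequences, and undoing the pairing again. Wrapping the result in list() keeps the sketch valid on Python 3, where zip() returns an iterator instead of a list.

```python
keys = 'abc'
values = [1, 2, 3, 4]            # one more value than there are keys

# zip() stops at the shortest argument, so the extra 4 is dropped.
pairs = list(zip(keys, values))

# A list of 2-tuples converts directly into a dictionary.
mapping = dict(pairs)

# zip(*pairs) transposes the pairs back into two tuples.
letters, numbers = zip(*pairs)
```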

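IV. Strings

In Python there is no difference between creating a string with double quotes "" and with single quotes '' (in some languages, escape sequences only take effect inside double quotes). Python 2 has three string types: Unicode strings (unicode), regular strings (str), and basestring. unicode and str are subclasses of basestring, which is an abstract class and cannot be instantiated.

1. Creating strings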

string1 = 'welcome!'  # using single quotes ''
string2 = "welcome!"  # using double quotes ""
string3 = str("hello!")  # created with the factory function

# What if the string contains a single quote?
string4 = "What's the matter?"
# What if the string contains a double quote?
string5 = '" " can be used to create strings!'
# What if it contains both single and double quotes?
print """' ' and "" can be used to create strings!"""
print '''' ' and "" can be used to create strings!'''
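Besides switching quote styles, a quote character can be escaped with a backslash. The sketch below (illustrative strings) shows that all of these spellings produce the same values.

```python
# Escaping the quote vs. choosing the other quote style.
single = 'What\'s the matter?'
double = "What's the matter?"

# Both kinds of quote in one string: escape one, or use triple quotes.
escaped = 'He said "it\'s fine"'
triple = '''He said "it's fine"'''

same_simple = single == double
same_mixed = escaped == triple
```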

Of course, there are quite a few special cases; for the details, see Chapter 1 of the Python Cookbook.

# One was left out above; here it is:
string7 = unicode("I'm unicode string!")

2. Accessing string values

strDemo = "hello"
strDemo              # returns 'hello'
strDemo[2]           # returns 'l'
strDemo[:]           # returns 'hello'

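3. Changing strings

A string is an immutable scalar and cannot be modified in place; you can only "update" it by assigning a new value to the name (what really happens is that a new object is created).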

stringDemo = "hello"
stringDemo += " python"
stringDemo = stringDemo + " i love you!"
stringDemo   # returns 'hello python i love you!'
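To see the immutability itself rather than just the workaround, this minimal sketch tries to assign to one character and then builds the "updated" string from slices instead. The names are illustrative.

```python
greeting = 'hello'

# Item assignment is not supported on strings.
try:
    greeting[0] = 'J'
    assigned = True
except TypeError:
    assigned = False

# The usual workaround: build a new string from pieces of the old one.
updated = 'J' + greeting[1:]
```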

4. Deleting characters and strings

Deleting a string is simple:

stringDemo = ''   # this empties it
del stringDemo    # this removes the name altogether

As stressed several times above, a string is immutable, so characters cannot be deleted from it in place. There is still a trick, though:

word = 'conxgratulations!'
word = word[:3] + word[4:]
word      # returns 'congratulations!', with the character 'x' removed
          # omitting the end index avoids having to count the length of word
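The slicing trick can be wrapped in a small helper. `remove_at` is a hypothetical function written for this sketch (it assumes a non-negative index), and `str.replace` with a count of 1 is an alternative when you know the character rather than its position.

```python
def remove_at(text, index):
    """Return a copy of text without the character at a non-negative index.

    Hypothetical helper for illustration; not part of the standard library.
    """
    return text[:index] + text[index + 1:]


word = 'conxgratulations!'
fixed_by_index = remove_at(word, 3)          # drop the character at index 3
fixed_by_value = word.replace('x', '', 1)    # drop the first 'x' only
```

Either way, `word` itself is unchanged; both calls return new strings.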

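5. Operators

The standard type operators, slicing, membership (in, not in), concatenation (+) and repetition (*) were covered under "Sequence type operators" and are not repeated here.

Compile-time string concatenation: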

1
2
3
4
5
6
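>>> import urllib
>>> f = urllib.urlopen('http://'   # protocol
... 'localhost'                    # hostname
... ':8000'                        # port
... '/cgi-bin/friends2.py')        # file

# Adjacent string literals are joined at compile time, so strings can also be concatenated this way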
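The same idea can be tried without opening a network connection. The following is a minimal sketch; the variable name url is only illustrative, and the snippet is written so that it runs on Python 3 as well as Python 2.

```python
# Adjacent string literals are merged into one string when the code is
# compiled, so every piece can carry its own comment.
url = ('http://'                # protocol
       'localhost'              # hostname
       ':8000'                  # port
       '/cgi-bin/friends2.py')  # file

print(url)
```

No + operator is involved here: the parentheses only let the expression span several lines.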

Concatenating an ordinary string with a Unicode string triggers a type conversion, and the result is a Unicode string:

'hello world!' + u''   # returns u'hello world!'

String formatting: if you are familiar with printf() in C, this is straightforward:

"%x" % 108                # returns '6c'
"%#X" % 108               # returns '0X6C'

'%f' % 1234.567890        # returns '1234.567890'
'%.2f' % 1234.567890      # returns '1234.57'
'%E' % 1234.567890        # returns '1.234568E+03'

"%+d" % 4                 # returns '+4'
"%+d" % -4                # returns '-4'
"we are at %d%%" % 100    # returns 'we are at 100%'

'Your host is: %s' % 'earth'          # returns 'Your host is: earth'
'Host: %s\tPort: %d' % ('mars', 80)   # returns 'Host: mars\tPort: 80'
"MM/DD/YY = %02d/%02d/%d" % (2, 15, 67)
                          # returns 'MM/DD/YY = 02/15/67'

w, p = 'Web', 'page'
'http://xxx.yyy.zzz/%s/%s.html' % (w, p)
                   # returns 'http://xxx.yyy.zzz/Web/page.html'
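# String formatting symbols
'''
%c      convert to a character (an integer code, or a string of length 1)
%r      convert to a string using repr()
%s      convert to a string using str()
%d/%i   convert to a signed decimal integer
%u      convert to a decimal integer (obsolete; behaves like %d)
%o      convert to an octal integer
%x/%X   convert to a hexadecimal integer (x/X selects lowercase/uppercase digits)
%e/%E   convert to scientific notation (e/E selects the case of the exponent marker)
%f/%F   convert to a floating-point number
%g/%G   shorthand for %e or %f (%E or %F), whichever gives the shorter output
%%      output a literal %
'''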
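# Auxiliary directives for the format operator
'''
*       take the width or the precision from the argument list
-       left-justify
+       show a sign (+ or -) in front of numbers
<sp>    put a space in front of positive numbers
#       show a leading 0 for octal and 0x/0X for hexadecimal
0       pad numbers with zeros instead of spaces
%       '%%' outputs a single %, as in C
(var)   map a variable by name (dictionary argument)
m.n     m is the minimum total width, n is the number of digits after the decimal point
'''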
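Several of the auxiliary directives above are not exercised by the earlier examples. The following is a small sketch of them; the dictionary record and all variable names are made up for illustration, and the snippet behaves the same on Python 2 and Python 3.

```python
# (var): pull values out of a dictionary by key
record = {'host': 'mars', 'port': 80}
by_key = '%(host)s:%(port)d' % record      # 'mars:80'

# *: take the field width or the precision from the argument tuple
star_width = '%*d' % (6, 42)               # '    42'
star_prec = '%.*f' % (2, 3.14159)          # '3.14'

# -: left-justify inside the field; 0: pad with zeros instead of spaces
left = '%-6d|' % 42                        # '42    |'
zero = '%06.2f' % 3.14159                  # '003.14'

# #: alternate form for hexadecimal
alt_hex = '%#x' % 255                      # '0xff'

print(by_key)
```

The dictionary form is handy when the same value appears several times in the format string, since each key can be referenced as often as needed.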

String templates

String templates make string formatting simpler and more intuitive. If you have worked with Web templates, this is the same idea:

from string import Template
s = Template('There are ${howmany} ${lang}?')

print s.substitute(lang = 'Python', howmany = 3)
    # prints: There are 3 Python?

print s.substitute(lang = 'Python')
    # raises KeyError, because the key 'howmany' is missing

print s.safe_substitute(lang = 'Python')
    # prints: There are ${howmany} Python?
    # safe_substitute leaves a placeholder as is when its key is missing
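The difference between substitute() and safe_substitute() is easier to see when the exception is caught. A minimal sketch follows; the template text and the variable names are hypothetical, and the code avoids the print statement so it runs on Python 3 too.

```python
from string import Template

# Hypothetical template text, chosen only for illustration
t = Template('There are ${howmany} ${lang} examples')

full = t.substitute(lang='Python', howmany=3)

try:
    t.substitute(lang='Python')          # 'howmany' is not supplied
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]            # name of the missing placeholder

partial = t.safe_substitute(lang='Python')

print(full)
print(partial)
```

safe_substitute() is a reasonable choice when a template is filled in several passes; substitute() is safer when a missing key should be treated as a bug.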

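Raw strings (r/R)

In an ordinary string, certain character combinations are treated as escape sequences. In a raw string every character is taken literally, with no escapes and no non-printable characters, which is very useful for regular expressions and file paths: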

>>> print 'C:\neroburing'
C:
eroburing                # see? \n was treated as an escape sequence!

>>> print r'C:\neroburing'
C:\neroburing            # this is what you wanted!
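The point about regular expressions can be checked directly. This is a small sketch with made-up sample text; it relies only on the standard re module and runs on both Python 2 and Python 3.

```python
import re

plain = 'C:\neroburing'    # \n is a single newline character here
raw = r'C:\neroburing'     # backslash and 'n' stay two separate characters

length_difference = len(raw) - len(plain)    # 1

# In a pattern, r'\bcat\b' means "the whole word cat". Without the r
# prefix, '\b' would be a backspace character and nothing would match.
whole_words = re.findall(r'\bcat\b', 'cat concat cat')
no_match = re.findall('\bcat\b', 'cat concat cat')

print(whole_words)
```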

To get a raw string of Unicode type, write:

ur'hello,world!'

6. Built-in functions that work on strings

raw_input() reads a line of user input and returns it as an ordinary string:

>>> name = raw_input('Input your name:')
Input your name:Green
>>> print 'hello', name
hello Green

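str() and unicode() are factory functions that create ordinary string objects and Unicode string objects.

chr() takes an integer from 0 to 255 and returns the corresponding one-character string (the ASCII character for values 0 to 127).

unichr() takes a Unicode code point and returns the corresponding Unicode character.

ord() is the inverse of chr() and unichr(): it takes a single character and returns its integer code.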
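The relationship between chr() and ord() can be sketched in a few lines. The variable names are illustrative only; the calls used here behave the same on Python 2 and Python 3 for ASCII values.

```python
code = ord('A')              # 65
letter = chr(code + 1)       # 'B'

# chr() and ord() undo each other for every ASCII code
round_trip_ok = all(ord(chr(n)) == n for n in range(128))

# A tiny use of the pair: shift each letter forward by one position
shifted = ''.join(chr(ord(c) + 1) for c in 'hal')    # 'ibm'

print(shifted)
```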

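7. String built-in methods

Do not confuse this part with the previous one, "Built-in functions that work on strings". Those functions live in the __builtin__ module and are built into Python itself, whereas the functions here are methods of the string type, as the calling syntax shows: string_object.method(arguments)

There are too many methods to list one by one; help(str) or dir(str) shows them all: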

str_demo = 'welcome to Python!'
str_demo.capitalize()        # returns 'Welcome to python!' (the rest is lowercased)

str_demo.count('o')          # returns 3

str_demo.endswith('thon!')   # returns True

...
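A few more commonly used methods, as a rough sketch rather than a complete tour. The variable names are illustrative; every method shown exists on str in both Python 2 and Python 3, and each call returns a new object and leaves s unchanged.

```python
s = 'welcome to Python!'

upper = s.upper()                          # 'WELCOME TO PYTHON!'
words = s.split()                          # ['welcome', 'to', 'Python!']
where = s.find('Python')                   # 11
swapped = s.replace('Python', 'strings')   # 'welcome to strings!'
joined = '-'.join(words)                   # 'welcome-to-Python!'

print(swapped)
```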

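8. Special features of strings

Escape characters: ordinary Python strings support escape sequences, used the same way as in C, so they are not repeated here. A few are worth stressing: \\ stands for \, \" stands for ", and \' stands for '.

Triple quotes: we have used triple quotes several times already. A string wrapped in three single quotes or three double quotes frees the programmer from the swamp of quoting and special characters and makes string editing "what you see is what you get":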

html = '''                                                                                                       
<html>                                                                                                       
  <head><title>My Website</title></head>                                                                                                       
<body>                                                                                                       
  <p>Hello, Welcome to my website!</p>                                                                                                       
</body>                                                                                                       
</html>     '''
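What "what you see is what you get" means in practice is that line breaks typed inside the triple quotes become real newline characters. A minimal sketch, with sample text invented for the purpose:

```python
block = '''line one
line two'''

# The same string written with an explicit escape sequence
same = 'line one\nline two'

# Quotes need no escaping inside a triple-quoted string
quoted = '''She said "it's fine"'''

# An escaped backslash is a single character
backslash_length = len('\\')     # 1

print(block == same)
```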

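V. Unicode strings

1. Concept

Unicode, like ASCII, is a character encoding standard used on computers. ASCII's capacity is limited: it is a 7-bit code and can represent only 128 characters (8-bit extensions reach 256). Unicode assigns every character a code point in the range U+0000 to U+10FFFF, which leaves room for more than a million characters. Programs usually handle far more than English text; Unicode encodes the scripts and symbols of the world in a single scheme, which greatly simplifies the processing of information.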

2. Creating Unicode strings: two ways

uni_demo = u'hello'
uni_demo2 = unicode('hello')

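3. CODEC

CODEC stands for Coder/Decoder; it defines how text is converted to and from binary data. Because a Unicode character generally needs more than one byte, several different encodings are in use; the familiar ones are ASCII, UTF-8 and UTF-16. UTF-8 is currently the most widely used.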

4. An encoding and decoding example

#!/usr/bin/env python
'''
Example: read a string from the user, convert it to Unicode,
write it to a file as UTF-8 and read it back
'''

CODEC = 'utf-8'
FILE = 'unicode.txt'

infor_str = raw_input("Enter string:")
infor_str = unicode(infor_str)       # convert to unicode (works for ASCII input only)

bytes_out = infor_str.encode(CODEC)  # unicode -> UTF-8 bytes

f = open(FILE, "wb")                 # binary mode, since we write encoded bytes
f.write(bytes_out)
f.close()

f = open(FILE, "rb")
bytes_in = f.read()
f.close()

infor_back = bytes_in.decode(CODEC)  # UTF-8 bytes -> unicode
print infor_back,

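Because Unicode supports several encodings, you should choose one explicitly when writing a string to a file and decode with the same one when reading it back, which gives you the corresponding Unicode object. The example above carries out exactly this round trip.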
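The same round trip can be sketched without user input, which makes it easy to test with non-ASCII text. This is only an illustration under stated assumptions: the sample text and the temporary file location are made up, and the code uses io.open and a u'' literal so that it runs on Python 3 as well as Python 2.

```python
import io
import os
import tempfile

CODEC = 'utf-8'
text = u'caf\u00e9 \u4e2d\u6587'           # sample text, made up for this sketch

# Write into a fresh temporary directory so nothing is overwritten
path = os.path.join(tempfile.mkdtemp(), 'unicode.txt')

raw = text.encode(CODEC)                   # text -> bytes
with io.open(path, 'wb') as f:             # binary mode: we write bytes
    f.write(raw)

with io.open(path, 'rb') as f:
    text_back = f.read().decode(CODEC)     # bytes -> text

print(len(text), len(raw))
```

The two lengths differ because UTF-8 spends more than one byte on each non-ASCII character, which is exactly why the bytes must be decoded with the codec that produced them.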

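5. Points to note in practice

Using Unicode wherever possible in your code does pay off. Your code may talk to a database or run inside a Web framework, so you will face encoding issues constantly, and they can be a real headache. The following advice is worth taking:

  • Always prefix string literals in your program with u

  • Do not use str(); use unicode() instead

  • Do not use the obsolete string module: it does not support Unicode, and it will mess everything up if you pass it a Unicode string

  • Do not encode or decode Unicode in your code unless you must: call encode() when writing to a file, a database or the network, and decode() when reading the data back

Most standard-library modules support Unicode; pickle is the well-known exception. Its original text protocol is ASCII-based, so prefer a binary protocol (the default in Python 3). To match the binary format, the corresponding database column should be declared as BLOB when pickled data is stored in a database.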

VI. Related modules

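# Modules related to the string types
'''
string      string manipulation functions and utilities (obsolete, not recommended)
re          regular expressions
struct      conversion between strings and binary data
c/StringIO  string buffer objects
base64      Base16, Base32 and Base64 data encoding and decoding
codecs      codec registry and base classes
crypt       one-way encryption
difflib     finding differences between sequences
hashlib     API to many secure hash and message digest algorithms
hmac        Python implementation of the HMAC message authentication algorithm
md5         RSA's MD5 message digest algorithm (superseded by hashlib)
rotor       multi-platform encryption and decryption services
sha         NIST's Secure Hash Algorithm, SHA (superseded by hashlib)
stringprep  preparing Unicode strings for use in Internet protocols
textwrap    text wrapping and filling
unicodedata the Unicode database
'''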

To get help on any of these modules, import it first and then use help() or dir(), or read module_name.__doc__.
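As a quick sketch of that workflow, here are two of the modules from the table. The choice of textwrap and hashlib and the sample inputs are arbitrary; both modules exist in Python 2.5+ and Python 3.

```python
import hashlib
import textwrap

summary = textwrap.__doc__                 # the module's docstring
has_fill = 'fill' in dir(textwrap)         # dir() lists the module's names

# Wrap a sentence to lines of at most 12 characters
wrapped = textwrap.fill('strings are immutable sequences', width=12)

# Hash functions work on bytes, hence the b prefix
digest = hashlib.sha256(b'hello').hexdigest()

print(wrapped)
```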

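VII. Summary and final remarks

Unlike in C, str strings and unicode strings are standard types provided by Python and are created by factory functions, which reflects the object-oriented nature of the language. A C string ends with the terminator '\0'; Python strings have no such thing. It bears repeating that strings are immutable. Every operation that creates or "modifies" a string (concatenation, repetition, ...) actually creates a new object. If you have doubts, check with id()!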
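One way to run that check is sketched below. A second name, b, is kept bound to the original string so that the comparison does not depend on how the interpreter reuses memory; the names are illustrative and the code runs on both Python 2 and Python 3.

```python
a = 'abc'
b = a                        # a second name for the same string object
same_before = a is b         # True: one object, two names

a += 'def'                   # builds a new object and rebinds a
different_after = id(a) != id(b)

try:
    b[0] = 'X'               # item assignment is not allowed on strings
    mutated = True
except TypeError:
    mutated = False

print(a, b)
```

After the += the name b still refers to the untouched 'abc', which shows that the concatenation produced a new object and did not change the old one.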

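Strings are the most used data type in programming, and string handling is often the most complicated part of the job. Studying the first chapter of the Python Cookbook carefully is a good way to strengthen your string skills!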