http://blog.csdn.net/pipisorry

The pickle module in the standard library

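As we already know, file input and output deal in strings: to store an object in a file we first have to convert it to a string, and what we read back from the file is again a string, from which the object has to be rebuilt. But suppose we do not care about the format in which the object is stored and only want some string that represents the object, for storage or for sending over a network. Is there a better way?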

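That is what the pickle module in the Python standard library is for. It provides a set of algorithms for serializing a Python object (turning it into a string) and de-serializing it (rebuilding the object from that string). These two processes are called pickling and unpickling.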

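Python's pickle module implements basic data serialization and deserialization. Serialization lets us save the state of objects in a running program to a file for permanent storage; deserialization lets us recreate, from that file, the objects the program saved last time.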

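pickle (like the cPickle module described below) also handles self-referencing types sensibly: it does not recurse forever on an object that refers to itself, and an object that is referenced several times is serialized only once. The pickle format is Python-specific, and its serialization rules are defined by Python itself. Newer Python versions are meant to keep reading older pickles, but an older interpreter cannot read a protocol newer than the ones it supports, so choose the protocol with the oldest reader in mind. In Python 2 the default protocol (0) is text-based, so the serialized output can be inspected in a text editor; the binary protocols produce smaller output, and in Python 3 the default is already binary. For more detail, see the pickle module section of the Python manual.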
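The behaviour with self-references and shared references can be checked with a small round trip. This is only a sketch; the names `shared`, `data` and `cycle` are made up for illustration.

```python
import pickle

# One inner list referenced twice, plus a list that contains itself.
shared = [1, 2, 3]
data = {"first": shared, "second": shared}
cycle = []
cycle.append(cycle)

restored = pickle.loads(pickle.dumps(data))
restored_cycle = pickle.loads(pickle.dumps(cycle))

# Both keys still point at one and the same list object after the round trip.
print(restored["first"] is restored["second"])
# The restored list still contains itself.
print(restored_cycle[0] is restored_cycle)
```

Because the shared list is written once and referenced afterwards, identity between the two dictionary values survives the round trip.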

Basic interface

Note: the Python 3 and Python 2 interfaces differ considerably, and reading a pickle written by a different version may fail.

pickle.dump(obj, file, protocol=None, *, fix_imports=True)

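Description: saves the object obj to the file file, that is, serializes obj into file.

Parameters

protocol: the protocol version used for serialization. 0: the ASCII protocol, in which the serialized object is represented with printable ASCII characters; 1: the old binary protocol; 2: a new binary protocol introduced in Python 2.3, more efficient than the earlier ones. Protocols 0 and 1 are compatible with old versions of Python. Python 3 adds protocols 3 and higher.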

pickle.HIGHEST_PROTOCOL

An integer, the highest protocol version available. This value can be passed as a protocol value to functions dump() and dumps() as well as the Pickler constructor. Passing -1 selects it as well.

pickle.DEFAULT_PROTOCOL

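An integer, the default protocol version used for pickling. May be less than HIGHEST_PROTOCOL. Currently the default protocol is 3, a new protocol designed for Python 3. (Newer Python 3 releases use a higher default: protocol 4 from Python 3.8 on.)

file: the file-like object the object is saved to. If protocol >= 1, the file object must be opened in binary mode!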

Python 2: file must have a write() method. It can be a file opened in 'w' mode, a StringIO object, or any other object that implements write().

Python 3 differs: the file argument must have a write() method that accepts a single bytes argument. It can thus be an on-disk file opened for binary writing, an io.BytesIO instance, or any other custom object that meets this interface.

fix_imports: If fix_imports is true and protocol is less than 3, pickle will try to map the new Python 3 names to the old module names used in Python 2, so that the pickle data stream is readable with Python 2.
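To make the protocol and file parameters concrete, here is a small sketch that pickles the same object under every protocol the running interpreter supports and then writes to an in-memory io.BytesIO instead of a disk file. The sample `data` dictionary is an assumption chosen for illustration.

```python
import io
import pickle

data = {"numbers": list(range(100)), "text": "hello"}

# Size of the same object under every protocol this interpreter supports.
sizes = {
    proto: len(pickle.dumps(data, protocol=proto))
    for proto in range(pickle.HIGHEST_PROTOCOL + 1)
}
for proto, size in sizes.items():
    print("protocol %d: %d bytes" % (proto, size))

# Any object with a write() method that accepts bytes works as `file`.
buffer = io.BytesIO()
pickle.dump(data, buffer, protocol=pickle.HIGHEST_PROTOCOL)
buffer.seek(0)
restored = pickle.load(buffer)  # the protocol is detected automatically
```

The text-based protocol 0 should come out larger than the binary protocols for data like this, which is the size difference mentioned earlier.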

Note: to dump several objects together, wrap them in a tuple: pickle.dump((x_train, y_train), open(r'd:\tmp\train.pkl', 'wb'))
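A slightly fuller sketch of the same idea, using a with-block so the file is closed and a temporary directory instead of a fixed path. The contents of `x_train` and `y_train` are hypothetical stand-ins for real training data.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the training data mentioned above.
x_train = [[0.0, 1.0], [1.0, 0.0]]
y_train = [1, 0]

path = os.path.join(tempfile.mkdtemp(), "train.pkl")

# One dump call, one tuple; the with-block closes the file for us.
with open(path, "wb") as f:
    pickle.dump((x_train, y_train), f)

# Loading returns the tuple, which unpacks back into the two objects.
with open(path, "rb") as f:
    x_loaded, y_loaded = pickle.load(f)
```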

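pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict")

Description: reads a pickled representation from file and reconstructs the original Python object. Each call decodes the next object, starting at the file's current position.
file: a file-like object with read() and readline() methods.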
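Since each call to load decodes only the next object, several objects dumped one after another can be read back in a loop. The sketch below assumes an in-memory buffer and three arbitrary records; it stops when load raises EOFError at the end of the data.

```python
import io
import pickle

buffer = io.BytesIO()
for record in ({"id": 1}, [1, 2, 3], "done"):
    pickle.dump(record, buffer)   # three separate pickles, back to back

buffer.seek(0)
records = []
while True:
    try:
        records.append(pickle.load(buffer))  # next object from current position
    except EOFError:
        break                                # no more pickles in the stream

print(records)
```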

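Note

When deserializing, the definition of the corresponding class must be available, otherwise deserialization fails. For instance, if a class Person is deleted with del Person before its pickled instance is loaded, an AttributeError is raised at run time saying that Person cannot be found in the current module.
Like marshal, pickle cannot serialize every type. For example, pickling an instance of a nested class fails in Python 2 (Python 3.4 and later can pickle nested classes with protocol 4 and above):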

import pickle

class A(object):
    class B(object):
        def __init__(self, name):
            self.name = name

    def __init__(self):
        print('init A')

b = A.B("my name")
print(b)
c = pickle.dumps(b, 0)  # fails in Python 2: class B cannot be found by name
print(pickle.loads(c))
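The first point in the note, that the class definition must be available when unpickling, can be reproduced with a short sketch. The Person class and the name "Ann" are illustrative; `del Person` simulates a program that never defined the class.

```python
import pickle

class Person(object):
    def __init__(self, name):
        self.name = name

payload = pickle.dumps(Person("Ann"))

# While the class exists, loading works: the payload only says where
# to find the class, it does not contain the class itself.
restored = pickle.loads(payload)

del Person  # simulate a program that does not define Person

try:
    pickle.loads(payload)
    error = None
except (AttributeError, pickle.UnpicklingError) as exc:
    error = exc  # the class can no longer be looked up by name

print(restored.name, type(error).__name__)
```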

The pickle module section of the Python manual covers more advanced topics, such as customizing the serialization process.

pickle.dumps(obj, protocol=None, *, fix_imports=True)

Return the pickled representation of the object as a bytes object, instead of writing it to a file.

[pickle.dumps]
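A minimal round trip with dumps and its counterpart loads, with no file involved. The `record` dictionary is an arbitrary example.

```python
import pickle

record = {"name": "pickle", "versions": (2, 3), "ratio": 0.5}

blob = pickle.dumps(record)   # a bytes object in Python 3
copy = pickle.loads(blob)     # a new, equal object

print(type(blob).__name__, len(blob))
print(copy == record, copy is record)
```

The result of loads is equal to the original but is a distinct object, which also makes this pair a simple way to deep-copy picklable data.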

Types that pickle can serialize

The following types can be pickled:

None, True, and False
integers, floating point numbers, complex numbers
strings, bytes, bytearrays
tuples, lists, sets, and dictionaries containing only picklable objects
functions defined at the top level of a module (using def, not lambda)
built-in functions defined at the top level of a module
classes that are defined at the top level of a module
instances of such classes whose __dict__ or the result of calling __getstate__() is picklable (see section Pickling Class Instances for details).

[What can be pickled and unpickled?]
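One way to probe the list above is to try pickling a few values and record which ones succeed. `is_picklable` is a hypothetical helper written for this sketch, not part of the pickle module, and the candidates are arbitrary.

```python
import pickle

def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds (hypothetical helper)."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True

candidates = {
    "None": None,
    "complex": 4 + 6j,
    "bytes": b"raw",
    "nested containers": {"k": [1, (2, 3), {4, 5}]},
    "built-in function": len,
    "lambda": lambda x: x,
    "generator": (n for n in range(3)),
}
results = {label: is_picklable(obj) for label, obj in candidates.items()}
for label, ok in results.items():
    print("%s: %s" % (label, ok))
```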

BUGfix

Attribute scope of pickled objects

pickle.load(f) AttributeError: Can’t get attribute ‘’ on <module ‘pyspark.daemon’

pickle doesn’t actually store information about how a class/object is constructed, and needs access to the class when unpickling. See wiki on using Pickle for more details.

The class_def.py module, which pickles an object:

import pickle
class Foo(object):
    def __init__(self, name):
        self.name = name
def main():
    foo = Foo('a')
    with open('test_data.pkl', 'wb') as f:
        pickle.dump([foo], f, -1)
if __name__=='__main__':
    main()

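You run the above to generate the pickle data. Because class_def.py is run as a script, Foo is recorded in the pickle as __main__.Foo.

The main_module.py module, which fails when it tries to unpickle the object pickled by the previous module: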

import pickle
import class_def
if __name__=='__main__':
    with open('test_data.pkl', 'rb') as f:
        users = pickle.load(f)

Two solutions
1. You make the class available within the namespace of the top-level module (i.e. GUI or main_module) through an explicit import, or
2. You create the pickle file from the same top-level module as the one that you will open it in (i.e. call Settings.addUser from GUI, or class_def.main from main_module). This means that the pkl file will save the objects as Settings.Manager or class_def.Foo, which can then be found in the GUI/main_module namespace.

Option 1 example:

Add the definition of the pickled object's class to the current namespace.

import pickle
import class_def
from class_def import Foo # Import Foo into main_module's namespace explicitly
if __name__=='__main__':
    with open('test_data.pkl', 'rb') as f:
        users = pickle.load(f)

Option 2 example:

import pickle
import class_def
if __name__=='__main__':
    class_def.main() # Objects are being pickled with main_module as the top-level
    with open('test_data.pkl', 'rb') as f:
        users = pickle.load(f)

[Unable to load files using pickle and multipile modules]
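The root of this error is that a pickle records only the module name and class name of an object's class, not the class body. A rough way to see this, reusing the Foo class from above, is to look for those names in the pickled bytes; searching the raw bytes is just an illustration, not something pickle offers as an API.

```python
import pickle

class Foo(object):
    def __init__(self, name):
        self.name = name

payload = pickle.dumps(Foo("a"))

# The recorded module name is whatever Foo.__module__ was at pickling time;
# it is "__main__" when the defining file is run directly as a script.
module_name = Foo.__module__
print(module_name)
print(module_name.encode() in payload, b"Foo" in payload)
```

When unpickling, pickle looks up that class name in that module, which is why the class has to be reachable under the same module name in the loading program.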

ValueError: insecure string pickle

out = open('xxx.dmp', 'w')
cPickle.dump(d, out)  # out is never closed, so the data may not reach the disk
k = cPickle.load(open('xxx.dmp', 'r'))
Traceback (most recent call last):
File ““, line 1, in
ValueError: insecure string pickle

The cause is simply that the file was not closed after writing.

[ValueError: insecure string pickle]
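Here is a Python 3 sketch of the same mistake and its fix, assuming a small dictionary `d` and a temporary file path (both made up here). In Python 3 the symptom is usually EOFError or UnpicklingError rather than the Python 2 ValueError shown above.

```python
import os
import pickle
import tempfile

d = {"a": 1, "b": [1, 2, 3]}
path = os.path.join(tempfile.mkdtemp(), "xxx.dmp")

out = open(path, "wb")
pickle.dump(d, out)  # the data may still sit in the write buffer

try:
    with open(path, "rb") as f:
        pickle.load(f)
    failed_before_close = False
except (EOFError, pickle.UnpicklingError):
    failed_before_close = True  # incomplete data on disk

out.close()  # flushes the buffer to disk

with open(path, "rb") as f:
    k = pickle.load(f)  # now the whole pickle is there

print(failed_before_close, k == d)
```

Opening the file in a with-block for writing avoids the problem entirely, because the file is closed as soon as the block ends.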

Pickling error: AttributeError: Can't pickle local object

AttributeError: Can’t pickle local object ‘buildLocKDTree..‘

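The main cause is that the object being pickled was given a lambda function as one of its arguments.

The fix is to replace the lambda (moving the lambda to the top level does not help either) with a def function, and the def must be at the top level of the .py file, not inside another function, i.e. it cannot be a closure.

For example: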

# dist = lambda i, j: distance.vincenty(i, j).miles # unpicklable
def dist(i, j): return distance.vincenty(i, j).miles
loc_kdtree = neighbors.BallTree(l_array, metric='pyfunc', func=dist)

The following does not work either: instances of a class defined inside a function cannot be pickled.

>>> def f():
...     class A: pass
...     return A
... 
>>> LocalA = f()
>>> la = LocalA()

[I can “pickle local objects” if I use a derived class?]
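The difference between top-level and local definitions can be checked directly. In this sketch `dist` is a simplified stand-in for the distance function above, and `can_pickle` and `make_local_objects` are hypothetical helpers; the exception type raised for local objects varies between Python versions, so both are caught.

```python
import pickle

def dist(i, j):
    """Top-level stand-in for the distance function (hypothetical)."""
    return abs(i - j)

def make_local_objects():
    def local_dist(i, j):      # nested function
        return abs(i - j)
    class LocalA:              # class defined inside a function
        pass
    return local_dist, LocalA()

def can_pickle(obj):
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError):
        return False
    return True

local_func, local_instance = make_local_objects()
results = {
    "top-level def": can_pickle(dist),
    "nested def": can_pickle(local_func),
    "instance of local class": can_pickle(local_instance),
}
print(results)
```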

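Spark broadcast variable error

cannot pickle an object with a function parameter

A Spark broadcast variable first has to load this pickled object (or the BallTree object directly), but one of the object's parameters refers to an external function, so pickle.load cannot find the function meetDist.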

return pickle.load(f) AttributeError: Can’t get attribute ‘meetDist‘ on

loc_tree = neighbors.BallTree(ltu_df[[0, 1, 'Time(GMT)']][0:10], metric='pyfunc', func=self.meetDist)
loc_tree_bc = sc.broadcast(loc_tree)

皮皮blog

pickle usage examples

Python 3 example

Note: in Python 3, files written or read with pickle must be opened in 'b' (binary) mode; otherwise you get TypeError: must be str, not bytes.

import pickle
# An arbitrary collection of objects supported by pickle.
data = {
    'a': [1, 2.0, 3, 4+6j],
    'b': ("character string", b"byte string"),
    'c': set([None, True, False])
}
with open('data.pickle', 'wb') as f:
    # Pickle the 'data' dictionary using the highest protocol available.
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)

The following example reads the resulting pickled data.

import pickle
with open('data.pickle', 'rb') as f:
    # The protocol version used is detected automatically, so we do not
    # have to specify it.
    data = pickle.load(f)

Python 2 example

import pickle
d = {1:'a', 2:'b', 3:'c'}
f = open("newfile", "wb+")
pickle.dump(d, f)
del d[2]
pickle.dump(d, f)
f.seek(0)
d2 = pickle.load(f)   # pickle can tell where one object ends and the next begins
d3 = pickle.load(f)
print(d2, d3)
f.close()

Run with Python 2.7, this prints:

({1: 'a', 2: 'b', 3: 'c'}, {1: 'a', 3: 'c'})

So what does the file "newfile" contain? cat newfile gives the following:

(dp0
I1
S'a'
p1
sI2
S'b'
p2
sI3
S'c'
p3
s.(dp0
I1
S'a'
p1
sI3
S'c'
p2

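Note:

It is hard to tell at a glance that these are two dict objects, but the output does show that pickle applies its own encoding.
If a test script keeps failing with an error saying that the pickle module has no dump method, the script itself is probably named pickle.py, so the standard pickle module was never imported.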
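For a more readable view of such data, the standard pickletools module can disassemble a pickle into named opcodes. The sketch below pickles the same dictionary with protocol 0 under Python 3, so the exact opcodes differ slightly from the Python 2 dump shown above.

```python
import io
import pickle
import pickletools

d = {1: "a", 2: "b", 3: "c"}
raw = pickle.dumps(d, protocol=0)  # text-based protocol, as in the dump above

# Write the opcode listing into a string instead of printing it directly.
listing = io.StringIO()
pickletools.dis(raw, out=listing)
print(listing.getvalue())

# If the shadowing problem is suspected, this shows which file was imported.
print(pickle.__file__)
```

Every pickle ends with the STOP opcode, written as a single ".", which is how load knows where one object ends and the next one begins.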

Loading a Python 2 pickle file in Python 3

Python 2:

with open('a.pickle', 'wb') as f:
    pickle.dump(rf_preds, f)

Python 3:

with open('a.pickle', 'rb') as f:
    rf_preds = pickle.load(f)

This may raise: UnicodeDecodeError: 'ascii' codec can't decode byte 0xb1 in position 8: ordinal not in range(128).

Solution:

with open('a.pickle', 'rb') as f:
    rf_preds = pickle.load(f, encoding='latin1')
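The effect of the encoding argument can be reproduced without a Python 2 interpreter. The bytes below are hand-written to match what Python 2 produces with protocol 0 for the one-byte string '\xb1'; treat that payload as an assumption standing in for a real Python 2 file.

```python
import pickle

# Hand-written protocol 0 data, as Python 2 would emit for the byte
# string '\xb1' (an assumption standing in for a real Python 2 file).
py2_payload = b"S'\\xb1'\np0\n."

try:
    pickle.loads(py2_payload)            # default encoding="ASCII"
    default_failed = False
except UnicodeDecodeError:
    default_failed = True                # byte 0xb1 is not valid ASCII

as_text = pickle.loads(py2_payload, encoding="latin1")  # decoded to str
as_bytes = pickle.loads(py2_payload, encoding="bytes")  # kept as bytes

print(default_failed, repr(as_text), repr(as_bytes))
```

encoding="latin1" maps every byte to a character, so it never fails; encoding="bytes" is the alternative when the Python 2 strings held binary data rather than text.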


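Other object serialization modules

pickle's compatibility problems have long drawn criticism: no language other than Python uses pickle, and pickle data is not interchangeable by default even between Python versions.

The dill serialization module

dill can be installed directly with pip and is just as simple to use: it is a drop-in replacement for pickle with the same interface. Unlike pickle, dill's serialized objects can be passed across modules; in fact, dill was designed for passing Python objects around in distributed computing.

dill supports almost all Python data (including nested functions, lambdas, cells and so on). Not yet supported are frames, generators (generator objects, because they contain frame state) and tracebacks (again because frame state cannot be saved).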

[dill for serializing Python objects]

[Other serialization]

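The cPickle module

cPickle (Python 2) is implemented in C, so it runs faster than pickle. However, the types defined in cPickle cannot be subclassed (most of the time there is no need to subclass them anyway). cPickle and pickle follow the same serialization and deserialization rules, so an object serialized with pickle can be deserialized with cPickle. In Python 3 there is no separate cPickle module; pickle uses the C implementation automatically when it is available.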
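Code that has to run on both Python 2 and Python 3 commonly imports the faster implementation when it exists and falls back to pickle otherwise. A minimal sketch, with an arbitrary `data` value:

```python
try:
    import cPickle as pickle  # Python 2: C implementation
except ImportError:
    import pickle             # Python 3: C accelerator is used automatically

data = {"a": [1, 2.0, 3], "b": ("text", None)}

# Protocol 2 is the highest protocol that both Python 2 and 3 understand.
restored = pickle.loads(pickle.dumps(data, 2))
```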

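The marshal module

marshal is comparatively limited: it supports serialization and deserialization of only some built-in data types and cannot handle user-defined types. Older versions of the marshal format also do not support self-referencing (recursive) objects. Using marshal directly for serialization is therefore often inconvenient.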
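The limitation with user-defined types is easy to demonstrate. The Point class below is a made-up example.

```python
import marshal

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

# Built-in types round-trip without trouble.
builtin_data = {"a": [1, 2.5, "text"], "b": (None, True)}
restored = marshal.loads(marshal.dumps(builtin_data))

# An instance of a user-defined class is rejected.
try:
    marshal.dumps(Point(1, 2))
    instance_ok = True
except ValueError:
    instance_ok = False  # marshal reports an unmarshallable object

print(restored == builtin_data, instance_ok)
```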

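The pickle module also defines two classes, used to serialize and deserialize objects respectively.

class pickle.Pickler(file[, protocol]):
This class serializes objects. The file parameter is a file-like object in which the serialized result is stored. The optional protocol parameter selects the serialization protocol. It defines two methods:
dump(obj): serializes the object and writes it to the file-like object. The obj parameter is the object to serialize.
clear_memo(): clears the pickler's "memo". While serializing, a Pickler instance remembers references to the objects it has already serialized, so calling dump(obj) repeatedly on the same object does not naively serialize it again.

class pickle.Unpickler(file):
This class deserializes objects. The file parameter is a file-like object from which Unpickler reads the data to deserialize.
load(): deserializes an object. It automatically selects the appropriate protocol based on the serialized data stream.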
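The memo can be observed by measuring how many bytes each dump call writes. This is a sketch under assumptions: the list `data`, the helper `dump_size` and the choice of protocol 2 are all illustrative.

```python
import io
import pickle

data = list(range(1000))
buffer = io.BytesIO()
pickler = pickle.Pickler(buffer, protocol=2)

def dump_size(obj):
    """Dump obj with the shared pickler and return the bytes written."""
    start = buffer.tell()
    pickler.dump(obj)
    return buffer.tell() - start

first = dump_size(data)    # full serialization
second = dump_size(data)   # already in the memo: only a short back-reference
pickler.clear_memo()
third = dump_size(data)    # memo cleared: full serialization again

# Read everything back with a single Unpickler, because the second pickle
# refers to the memo entry created while loading the first one.
buffer.seek(0)
unpickler = pickle.Unpickler(buffer)
loaded = [unpickler.load() for _ in range(3)]

print(first, second, third)
```

A consequence of the memo is that changes made to an object between two dump calls on the same Pickler are not written out; calling clear_memo() first, or using a fresh Pickler, avoids that.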