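Saving data in CSV format is a sensible choice: most programming languages and applications can handle it, which makes exchanging data easy. It is not very space-efficient, however, because CSV and other plain-text formats store every number as a string of characters, which takes up a lot of space. Compression formats such as zip, bzip, and gzip reduce the size considerably.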
First, import the modules:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from tempfile import NamedTemporaryFile
In [4]: from os.path import getsize
Here we use NamedTemporaryFile from the Python standard library to store the data; these temporary files are deleted automatically once they are closed.
Next, write an array to a CSV file and check the size of that file:
In [5]: np.random.seed(42)
In [6]: a = np.random.randn(365, 4)
In [7]: tmpf = NamedTemporaryFile()
In [8]: np.savetxt(tmpf, a, delimiter=',')
In [9]: print("Size CSV file", getsize(tmpf.name))
Size CSV file 36693
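To connect the CSV size with the earlier remark about compression, here is a minimal sketch that is not part of the original session. It assumes an in-memory `io.BytesIO` buffer instead of a temporary file, so the byte count cannot be affected by data still sitting in a file object's write buffer, and it uses `gzip.compress` from the standard library as one possible compressor. The variable names are illustrative only.

```python
import gzip
import io

import numpy as np

np.random.seed(42)
a = np.random.randn(365, 4)

# Write the CSV text into an in-memory buffer so the byte count is exact.
buf = io.BytesIO()
np.savetxt(buf, a, delimiter=',')
csv_bytes = buf.getvalue()

# Compress the same bytes with gzip (an assumption; any compressor would do).
gz_bytes = gzip.compress(csv_bytes)

print("Size CSV bytes", len(csv_bytes))
print("Size gzip-compressed CSV bytes", len(gz_bytes))
```

When writing to a real file object instead, calling `tmpf.flush()` before `getsize(tmpf.name)` should make sure the reported size includes everything that was written.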
Now save the array in NumPy's .npy format, load it back into memory, and check the shape of the array and the size of the .npy file:
In [10]: tmpf = NamedTemporaryFile()
In [11]: np.save(tmpf, a)
In [12]: tmpf.seek(0)
Out[12]: 0
In [13]: loaded = np.load(tmpf)
In [14]: print("Shape", loaded.shape)
Shape (365, 4)
In [15]: print("Size .npy file", getsize(tmpf.name))
Size .npy file 11760
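As a side check on the round trip above, the following sketch compares the raw size of the array data with the size of the serialized .npy bytes and confirms that the loaded array is identical to the original. It assumes an in-memory buffer rather than a temporary file; the exact header size depends on the NumPy version, so no specific total is claimed here.

```python
import io

import numpy as np

np.random.seed(42)
a = np.random.randn(365, 4)

# Serialize the array into an in-memory buffer in .npy format.
buf = io.BytesIO()
np.save(buf, a)
npy_size = len(buf.getvalue())

# Rewind and load the array back.
buf.seek(0)
loaded = np.load(buf)

# 365 * 4 float64 values at 8 bytes each; the .npy file adds a small header.
print("Raw array bytes", a.nbytes)
print("Size .npy bytes", npy_size)
print("Round trip exact", np.array_equal(a, loaded))
```

The binary format stores each float64 value in 8 bytes, whereas the CSV text spends roughly 25 characters per value, which is consistent with the size difference reported above.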
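The .npy file is only about one third the size of the CSV file. In fact, Python can store arbitrarily complex data structures, and pandas DataFrame and Series objects can also be stored in a serialized format.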
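In Python, pickle is the format used to store Python objects on disk or another storage medium; converting an object to this format is called serialization. Later, we can reconstruct the Python object from storage; this reverse process is called deserialization. Not every Python object can be pickled, but modules such as dill allow more kinds of Python objects to be serialized.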
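The serialization and deserialization steps described above can be sketched with the standard-library `pickle` module. This is an illustrative example, not code from the original session: the `record` dictionary and the `try_pickle` helper are hypothetical names, and dill is deliberately not used so that the sketch runs without extra packages.

```python
import pickle

# A nested structure made of built-in types pickles without trouble.
record = {"name": "sample", "values": [1.5, 2.5], "nested": {"ok": True}}

blob = pickle.dumps(record)    # serialization: object -> bytes
restored = pickle.loads(blob)  # deserialization: bytes -> object
print("Round trip equal", restored == record)


def try_pickle(obj):
    """Return True if obj can be pickled, False otherwise (hypothetical helper)."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


print("dict picklable", try_pickle(record))
# A lambda is a typical object that plain pickle cannot serialize.
print("lambda picklable", try_pickle(lambda x: x + 1))
```

Lambdas are a common case of an object that plain pickle rejects; this is the kind of limitation that modules such as dill aim to lift.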
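First create a DataFrame from the NumPy array generated earlier, then write it to a pickle file with the to_pickle() method, and finally retrieve the DataFrame from that file with the read_pickle() function: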
In [16]: tmpf.name
Out[16]: '/tmp/tmpyy06safp'
In [17]: df = pd.DataFrame(a)
In [18]: df.to_pickle(tmpf.name)  # writes the DataFrame to /tmp/tmpyy06safp
In [19]: print("Size pickled dataframes", getsize(tmpf.name))
Size pickled dataframes 12250
In [20]: tmpf.name
Out[20]: '/tmp/tmpyy06safp'
In [21]: print("DF from pickle\n", pd.read_pickle(tmpf.name))
DF from pickle
             0         1         2         3
0    0.496714 -0.138264  0.647689  1.523030
1   -0.234153 -0.234137  1.579213  0.767435
2   -0.469474  0.542560 -0.463418 -0.465730
3    0.241962 -1.913280 -1.724918 -0.562288
4   -1.012831  0.314247 -0.908024 -1.412304
5    1.465649 -0.225776  0.067528 -1.424748
6   -0.544383  0.110923 -1.150994  0.375698
7   -0.600639 -0.291694 -0.601707  1.852278
8   -0.013497 -1.057711  0.822545 -1.220844
9    0.208864 -1.959670 -1.328186  0.196861
10   0.738467  0.171368 -0.115648 -0.301104
11  -1.478522 -0.719844 -0.460639  1.057122
12   0.343618 -1.763040  0.324084 -0.385082
13  -0.676922  0.611676  1.031000  0.931280
14  -0.839218 -0.309212  0.331263  0.975545
15  -0.479174 -0.185659 -1.106335 -1.196207
16   0.812526  1.356240 -0.072010  1.003533
17   0.361636 -0.645120  0.361396  1.538037
18  -0.035826  1.564644 -2.619745  0.821903
19   0.087047 -0.299007  0.091761 -1.987569
20  -0.219672  0.357113  1.477894 -0.518270
21  -0.808494 -0.501757  0.915402  0.328751
22  -0.529760  0.513267  0.097078  0.968645
23  -0.702053 -0.327662 -0.392108 -1.463515
24   0.296120  0.261055  0.005113 -0.234587
25  -1.415371 -0.420645 -0.342715 -0.802277
26  -0.161286  0.404051  1.886186  0.174578
27   0.257550 -0.074446 -1.918771 -0.026514
28   0.060230  2.463242 -0.192361  0.301547
29  -0.034712 -1.168678  1.142823  0.751933
..        ...       ...       ...       ...
335  0.160574  0.003046  0.436938  1.190646
336  0.949554 -1.484898 -2.553921  0.934320
337 -1.366879 -0.224765 -1.170113 -1.801980
338  0.541463  0.759155 -0.576510 -2.591042
339 -0.546244  0.391804 -1.478912  0.183360
340 -0.015310  0.579291  0.119580 -0.973069
341  1.196572 -0.158530 -0.027305 -0.933268
342 -0.443282 -0.884803 -0.172946  1.711708
343 -1.371901 -1.613561  1.471170 -0.209324
344 -0.669073  1.039905 -0.605616  1.826010
345  0.677926 -0.487911  2.157308 -0.605715
346  0.742095  0.299293  1.301741  1.561511
347  0.032004 -0.753418  0.459972 -0.677715
348  2.013387  0.136535 -0.365322  0.184680
349 -1.347126 -0.971614  1.200414 -0.656894
350 -1.046911  0.536653  1.185704  0.718953
351  0.996048 -0.756795 -1.421811  1.501334
352 -0.322680 -0.250833  1.328194  0.556230
353  0.455888  2.165002 -0.643518  0.927840
354  0.057013  0.268592  1.528468  0.507836
355  0.538296  1.072507 -0.364953 -0.839210
356 -1.044809 -1.966357  2.056207 -1.103208
357 -0.221254 -0.276813  0.307407  0.815737
358  0.860473 -0.583077 -0.167122  0.282580
359 -0.248691  1.607346  0.490975  0.734878
360  0.662881  1.173474  0.181022 -1.296832
361  0.399688 -0.651357 -0.528617  0.586364
362  1.238283  0.021272  0.308833  1.702215
363  0.240753  2.601683  0.565510 -1.760763
364  0.753342  0.381158  1.289753  0.673181

[365 rows x 4 columns]
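The session above prints the restored DataFrame but does not check it against the original. A minimal sketch of such a check follows. It assumes a temporary directory and a hypothetical file name `frame.pkl` rather than the NamedTemporaryFile used in the session, and it relies on `DataFrame.equals` for the comparison.

```python
import os
import tempfile

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randn(365, 4))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "frame.pkl")  # hypothetical file name
    df.to_pickle(path)                         # serialize the DataFrame
    size = os.path.getsize(path)               # size of the pickle file
    restored = pd.read_pickle(path)            # deserialize it again

print("Size pickled DataFrame", size)
print("Round trip equal", restored.equals(df))
```

The exact file size will vary with the pandas version and pickle protocol, so it may not match the 12250 bytes reported in the session.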