近期在做爬虫时有时会遇到网站只提供pdf的情况,这样就不能使用scrapy直接抓取页面内容了,只能通过解析PDF的方式处理,目前的解决方案大致只有pyPDF和PDFMiner。因为据说PDFMiner更适合文本的解析,而我需要解析的正是文本,因此最后选择使用PDFMiner(这也就意味着我对pyPDF一无所知了)。
首先说明的是解析PDF是非常蛋疼的事,即使是PDFMiner对于格式不工整的PDF解析效果也不怎么样,所以连PDFMiner的开发者都吐槽PDF is evil. 不过这些并不重要。官方文档在此:http://www.unixuser.org/~euske/python/pdfminer/index.html
一.安装:
1.首先下载源文件包 http://pypi.python.org/pypi/pdfminer/,解压,然后命令行安装即可:python setup.py install
2.安装完成后使用该命令行测试:pdf2txt.py samples/simple1.pdf,如果显示以下内容则表示安装成功:
Hello World Hello World H e l l o W o r l d H e l l o W o r l d
3.如果要使用中日韩文字则需要先编译再安装:
# make cmappython tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txtreading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...writing 'CNS1_H.py'......(this may take several minutes) # python setup.py install
二.使用
由于解析PDF是一件非常耗时和内存的工作,因此PDFMiner使用了一种称作lazy parsing的策略,只在需要的时候才去解析,以减少时间和内存的使用。要解析PDF至少需要两个类:PDFParser 和 PDFDocument,PDFParser 从文件中提取数据,PDFDocument保存数据。另外还需要PDFPageInterpreter去处理页面内容,PDFDevice将其转换为我们所需要的。PDFResourceManager用于保存共享内容例如字体或图片。
Figure 1. Relationships between PDFMiner classes
比较重要的是Layout,主要包括以下这些组件:
LTPage
Represents an entire page. May contain child objects like LTTextBox, LTFigure, LTImage, LTRect, LTCurve and LTLine.
LTTextBox
Represents a group of text chunks that can be contained in a rectangular area. Note that this box is created by geometric analysis and does not necessarily represents a logical boundary of the text. It contains a list of LTTextLine objects. get_text() method returns the text content.
LTTextLine
Contains a list of LTChar objects that represent a single text line. The characters are aligned either horizontaly or vertically, depending on the text's writing mode. get_text() method returns the text content.
LTChar
LTAnno
Represent an actual letter in the text as a Unicode string. Note that, while a LTChar object has actual boundaries, LTAnno objects does not, as these are "virtual" characters, inserted by a layout analyzer according to the relationship between two characters (e.g. a space).
新闻热点
疑难解答