Python Web Scraping: A Guide to the Beautiful Soup Module

2020-02-15 22:11:54

Source: reposted

Contributed by: a site user

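Scraping a web page generally follows these steps:

1. Choose the URL you want to scrape.
2. Open that URL from Python (with urlopen, requests, or similar).
3. Read the page content (for example, by calling read()).
4. Pass the content you read to BeautifulSoup.
5. Use BeautifulSoup to select tags and extract the information you need.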
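The steps above can be sketched roughly as follows. This is only a minimal illustration: the helper names `fetch_html` and `extract_title` are hypothetical, the sample markup is made up for the demo, and the parsing step runs on an inline string so that the example works without network access.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Steps 2-3: open the URL and read the response body as text.

    Hypothetical helper; falls back to UTF-8 when the server declares no charset.
    """
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Steps 4-5: hand the text to BeautifulSoup and select a tag."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the document has no <title> tag
    return soup.title.string if soup.title else None


# Made-up markup standing in for a downloaded page
sample_html = "<html><head><title>Example page</title></head><body></body></html>"
print(extract_title(sample_html))

# With network access, the two helpers could be chained like this:
# print(extract_title(fetch_html("http://example.com/")))
```

Keeping the download and the parsing in separate functions makes the parsing logic easy to try out on saved or hand-written HTML.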

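As you can see, fetching the page is not the hard part. The hard part is filtering the data, that is, getting exactly the pieces you want out of the page. This article walks through how to use BeautifulSoup for that.

The official BeautifulSoup documentation describes it as follows:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.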

1 Installation

You can install it directly with pip:

$ pip install beautifulsoup4

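BeautifulSoup supports the HTML parser in Python's standard library (html.parser) as well as several third-party parsers, such as lxml (which provides both an HTML parser and an XML parser) and html5lib. The third-party parsers must be installed separately. If you do not install any of them, BeautifulSoup uses Python's built-in parser. Of these, lxml is more powerful and faster, so installing it is recommended.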

$ pip install html5lib
$ pip install lxml
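Because the third-party parsers are optional, a script may want to prefer lxml but still run where it is not installed. The sketch below shows one way to do that; `make_soup` is a hypothetical helper name, and it relies on bs4 raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>soup</b></p>")
print(soup.b.string)
```

Different parsers can build slightly different trees from invalid markup, so naming the parser explicitly keeps results consistent across machines.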

2 Basic usage of BeautifulSoup

First, define a string that the rest of this article uses to demonstrate BeautifulSoup.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Parsing this document with BeautifulSoup gives you a BeautifulSoup object, which can be printed as a neatly indented structure:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "lxml")
>>> print(soup.prettify())

To save space, the output is not shown here.
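To get a feel for what prettify() produces without printing the whole document, it can be tried on a much smaller snippet. The markup below is made up for this illustration, and html.parser is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# A tiny, made-up fragment instead of the full html_doc
small_soup = BeautifulSoup("<p><b>Hi</b></p>", "html.parser")

pretty = small_soup.prettify()
print(pretty)

# Each tag and each string ends up on its own line, indented by nesting depth
stripped_lines = [line.strip() for line in pretty.splitlines()]
```

With lxml as the parser, the same fragment would additionally be wrapped in html and body tags, so the printed structure would be longer.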

Here are a few simple ways to navigate the parsed data structure:

>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> soup.title.string
"The Dormouse's story"
>>> soup.p['class']
['title']
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find(id='link1')
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
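In practice these lookups are usually combined to pull specific values out of the tags. The following sketch collects the text and href of every link; it repeats a shortened copy of the sample document so that it runs on its own, and it uses html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document used in this article
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns every matching tag; get() reads an attribute value
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
for text, href in links:
    print(text, href)

# Filtering by CSS class uses class_ because class is a Python keyword
sisters = soup.find_all("a", class_="sister")

# find returns only the first match
first_id_text = soup.find(id="link1").get_text()
```

Using get("href") rather than indexing with ["href"] returns None instead of raising an error when a tag lacks the attribute.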
