python爬虫之xpath的基本使用详解

2020-02-22 23:42:48

字体：大中小

来源：转载

供稿：网友

一、简介

XPath 是一门在 XML 文档中查找信息的语言。XPath 可用来在 XML 文档中对元素和属性进行遍历。XPath 是 W3C XSLT 标准的主要元素，并且 XQuery 和 XPointer 都构建于 XPath 表达之上。

二、安装

pip3 install lxml

三、使用

1、导入

from lxml import etree

2、基本使用

from lxml import etreewb_data = """    <div>      <ul>         <li class="item-0"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>         <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>         <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>         <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>         <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a>       </ul>     </div>    """html = etree.HTML(wb_data)print(html)result = etree.tostring(html)print(result.decode("utf-8"))

从下面的结果来看，我们打印机html其实就是一个python对象，etree.tostring(html)则是不全里html的基本写法，补全了缺胳膊少腿的标签。

 <Element html at 0x39e58f0><html><body><div>      <ul>         <li class="item-0"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>         <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>         <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>         <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>         <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a>       </li></ul>     </div>    </body></html>

3、获取某个标签的内容(基本使用)，注意，获取a标签的所有内容，a后面就不用再加正斜杠，否则报错。

写法一

html = etree.HTML(wb_data)html_data = html.xpath('/html/body/div/ul/li/a')print(html)for i in html_data:  print(i.text)<Element html at 0x12fe4b8>first itemsecond itemthird itemfourth itemfifth item

上一篇：使用python读取txt文件的内容,并删除重复的行数方法

下一篇：对Python中range()函数和list的比较

学习交流

笔记本开机提示error loading os错误的问

笔记本开机提示error loading os错误的问题怎么解决...

热门图片

猜你喜欢的新闻

猜你喜欢的关注