首页 > 编程 > Python > 正文

python爬虫之xpath的基本使用详解

2020-02-22 23:42:48
字体:
来源:转载
供稿:网友

一、简介

XPath 是一门在 XML 文档中查找信息的语言。XPath 可用来在 XML 文档中对元素和属性进行遍历。XPath 是 W3C XSLT 标准的主要元素,并且 XQuery 和 XPointer 都构建于 XPath 表达之上。 

二、安装

pip3 install lxml 

三、使用

1、导入

from lxml import etree 

2、基本使用

from lxml import etreewb_data = """    <div>      <ul>         <li class="item-0"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>         <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>         <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>         <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>         <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a>       </ul>     </div>    """html = etree.HTML(wb_data)print(html)result = etree.tostring(html)print(result.decode("utf-8")) 

从下面的结果来看,我们打印机html其实就是一个python对象,etree.tostring(html)则是不全里html的基本写法,补全了缺胳膊少腿的标签。

 <Element html at 0x39e58f0><html><body><div>      <ul>         <li class="item-0"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>         <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>         <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>         <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>         <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a>       </li></ul>     </div>    </body></html> 

3、获取某个标签的内容(基本使用),注意,获取a标签的所有内容,a后面就不用再加正斜杠,否则报错。

写法一

html = etree.HTML(wb_data)html_data = html.xpath('/html/body/div/ul/li/a')print(html)for i in html_data:  print(i.text)<Element html at 0x12fe4b8>first itemsecond itemthird itemfourth itemfifth item             
发表评论 共有条评论
用户名: 密码:
验证码: 匿名发表