
A detailed example of basic usage of scrapy, the Python crawler library

2020-02-15 21:17:20

Recently a project required me to write a crawler to scrape some question banks. Until then I had always written my crawlers in Node or PHP. Having often heard that Python is well suited to writing crawlers, I picked up scrapy, Python's crawler framework.

Below is a brief introduction to scrapy's directory structure and usage.

First, install the scrapy framework:

pip install scrapy
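
If the install succeeds, the scrapy command-line tool becomes available; a quick way to confirm is:

scrapy version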

Then create a crawler project with the scrapy command:

scrapy startproject questions
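
Note that startproject only creates the project skeleton; the spider file itself (questions/spiders/xueersi.py below) still has to be added. The article doesn't say how it was created, but you can either write it by hand or let scrapy generate a template for you:

scrapy genspider xueersi tiku.xueersi.com

This drops a stub spider into questions/spiders/ with the name and allowed domain already filled in, which you then flesh out.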

Overview of the generated files:

scrapy.cfg: the project's configuration file.

questions/: the project's Python module; this is where your code goes.

questions/items.py: the project's item definitions (a minimal sketch follows this list).

questions/pipelines.py: the project's pipelines.

questions/settings.py: the project's settings.

questions/spiders/: the directory holding the spider code.

questions/spiders/xueersi.py: the main spider code.
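
The spider below does from questions.items import QuestionsItem, so items.py must declare every field the spider assigns. The article doesn't show that file; a minimal sketch, with the field names inferred from the spider code, would be:

# -*- coding: utf-8 -*-
import scrapy

class QuestionsItem(scrapy.Item):
  source = scrapy.Field()     # raw question HTML
  content = scrapy.Field()    # question text with tags stripped
  subject = scrapy.Field()    # 英语 / 语文 / 数学
  level = scrapy.Field()      # difficulty, 1 (easier) to 3 (harder)
  options = scrapy.Field()    # the A-D answer choices
  answer = scrapy.Field()     # the correct answer
  analysis = scrapy.Field()   # the explanation, possibly empty
  answer_url = scrapy.Field() # URL of the answer page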

xueersi.py, the spider itself:

# -*- coding: utf-8 -*-
import re

import scrapy

from questions.items import QuestionsItem


class xueersiSpider(scrapy.Spider):
  name = "xueersi"  # spider name
  allowed_domains = ["tiku.xueersi.com"]  # target domain
  # the target URLs to crawl
  start_urls = [
    "http://tiku.xueersi.com/shiti/list_1_1_0_0_4_0_1",
    "http://tiku.xueersi.com/shiti/list_1_2_0_0_4_0_1",
    "http://tiku.xueersi.com/shiti/list_1_3_0_0_4_0_1",
  ]
  levels = ['偏易', '中档', '偏难']   # difficulty labels (easier, medium, harder)
  subjects = ['英语', '语文', '数学']  # subjects (English, Chinese, Math)

  # When the spider starts, start_requests is called automatically;
  # if it is not defined, parse is called instead.
  # def start_requests(self):
  #   yield scrapy.Request('http://tiku.xueersi.com/shiti/list_1_2_0_0_4_0_39', callback=self.getquestion)

  # parse is called automatically because start_requests is absent
  def parse(self, response):
    # XPath selector syntax is not covered here; see the official docs
    arr = response.xpath("//ul[@class='pagination']/li/a/text()").extract()
    total_page = arr[3]  # total number of pages
    # walk the pagination; pages appear to be 1-indexed (the start URLs end in _1)
    for index in range(int(total_page)):
      # issue a new request to fetch every question on each page
      yield scrapy.Request(response.url.replace('_0_0_4_0_1', '_0_0_4_0_' + str(index + 1)),
                 callback=self.getquestion)

  # extract the questions
  def getquestion(self, response):
    for res in response.xpath('//div[@class="main-wrap"]/ul[@class="items"]/li'):
      item = QuestionsItem()  # instantiate the Item class
      # extract the question body
      questions = res.xpath('./div[@class="content-area"]').re(
        r'<div class="content-area">?([\s\S]+?)<(table|/td|div|br)')
      if len(questions):
        question = questions[0].strip()
        item['source'] = question  # raw HTML of the question
        dr = re.compile(r'<[^>]+>', re.S)
        question = dr.sub('', question)  # strip HTML tags
        content = res.extract()
        item['content'] = question
        # derive the subject from the URL
        subject = re.findall(r'http://tiku\.xueersi\.com/shiti/list_1_(\d+)', response.url)
        item['subject'] = self.subjects[int(subject[0]) - 1]
        # extract the difficulty level (use a relative ./ path, not //,
        # so the match stays inside the current <li>)
        levels = res.xpath('./div[@class="info"]').re(r'难度:([\s\S]+?)<')
        item['level'] = self.levels.index(levels[0]) + 1
        # extract the options
        options = re.findall(r'[A-D]\.([\s\S]+?)<(/td|/p|br)', content)
        item['options'] = options
        if len(options):
          url = res.xpath('./div[@class="info"]/a/@href').extract()[0]
          request = scrapy.Request(url, callback=self.getanswer)
          request.meta['item'] = item  # stash the item; it is handed to the next request
          yield request

  # extract the answer
  def getanswer(self, response):
    res = response.xpath('//div[@class="part"]').re(r'<td>([\s\S]+?)</td>')
    con = re.findall(r'([\s\S]+?)<br>[\s\S]+?([A-D])', res[0])  # an answer that comes with an analysis
    if con:
      answer = con[0][1]
      analysis = con[0][0]  # the analysis text
    else:
      answer = res[0]
      analysis = ''
    if answer:
      item = response.meta['item']  # retrieve the item stashed earlier
      item['answer'] = answer.strip()
      item['analysis'] = analysis.strip()
      item['answer_url'] = response.url
      yield item  # yield the item; the pipeline (pipelines.py) receives it automatically
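
The final yield item hands each record to the item pipeline. The article doesn't show its pipelines.py either; as a minimal sketch (the class name QuestionsPipeline and the output file questions.jl are placeholders of my own, and Python 3 is assumed), a pipeline that appends each item to a JSON Lines file could look like this:

# -*- coding: utf-8 -*-
import json

class QuestionsPipeline(object):
  def open_spider(self, spider):
    # open the output file once when the spider starts
    self.file = open('questions.jl', 'w', encoding='utf-8')

  def close_spider(self, spider):
    self.file.close()

  def process_item(self, item, spider):
    # one JSON object per line; ensure_ascii=False keeps the Chinese text readable
    self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
    return item

To activate it, register it in settings.py:

ITEM_PIPELINES = {
  'questions.pipelines.QuestionsPipeline': 300,
}

Then start the crawl with:

scrapy crawl xueersi

(If you don't need a custom pipeline at all, Scrapy's built-in feed export can dump the items directly: scrapy crawl xueersi -o questions.json.)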