1、Create a project
win+R --> cmd --> cd Desktop --> scrapy startproject tutorial  # this step creates a tutorial folder on the desktop
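For reference, scrapy startproject tutorial generates the standard Scrapy scaffolding, roughly like this (layout as of Scrapy 1.x, the version current when this was written):

```
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (step 2)
        pipelines.py
        settings.py
        spiders/          # spiders go here (step 3)
            __init__.py
```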
2、Define an item
open the items.py file; the code:
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
```

3、Start a spider
create a new file named dmoz_spider.py (under tutorial/spiders/) then start coding:
```python
import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/'
    ]

    def parse(self, response):
        # first version: save each page body to a file named after the URL
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
```

A second version comments out the file writing and extracts data with XPath instead:

```python
import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/'
    ]

    def parse(self, response):
        # filename = response.url.split("/")[-2]
        # with open(filename, 'wb') as f:
        #     f.write(response.body)
        sel = scrapy.selector.Selector(response)
        sites = sel.xpath('//div[@class="title-and-desc"]')
        for site in sites:
            title = site.xpath('a/div[@class="site-title"]/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('div[@class="site-descr "]/text()').extract()
            print(title, link, desc)
```

4、cmd:
-->cd desktop
-->cd tutorial
-->scrapy crawl dmoz
when the run ends, you will find two new files named Books and Resources in the desktop/tutorial folder (created by the file-writing version of parse)
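Those filenames come straight from the URL: response.url.split("/")[-2] takes the second-to-last path segment. A quick check in plain Python:

```python
# The spider names each output file after the second-to-last
# path segment of the page URL (the trailing slash makes the
# last split segment an empty string).
url = 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
filename = url.split("/")[-2]
print(filename)  # -> Books
```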
5、cmd (optional):
-->Desktop/tutorial>scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books"  # this step fetches the page and gives you a response object to experiment with
-->response.body
>>> response.headers
{'Cteonnt-Length': ['46147'], 'Content-Language': ['en'], 'Set-Cookie': ['JsessionID=CDE6228CA4B21EA2DE64C22A5578133C; Path=/; HttpOnly'], 'Server': ['Apache'], 'Date': ['Sun, 19 Feb 2017 12:50:19 GMT'], 'Content-Type': ['text/html;charset=UTF-8']}
>>> response.xpath('//title')
[<Selector xpath='//title' data=u'<title>DMOZ - Computers: Programming: La'>]
>>> response.xpath('//title').extract()
[u'<title>DMOZ - Computers: Programming: Languages: Python: Books</title>']
>>>
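The XPath expressions used in the spider can be tried without Scrapy at all. Below is a stdlib-only sketch of the same extraction pattern using xml.etree.ElementTree, which supports a limited XPath subset (note it has no @href step, so the attribute is read with .get()). The HTML snippet is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet mimicking a DMOZ listing entry.
html = '''<html><body>
  <div class="title-and-desc">
    <a href="http://example.com/book1">
      <div class="site-title">Sample Python Book</div>
    </a>
    <div class="site-descr ">A short description.</div>
  </div>
</body></html>'''

root = ET.fromstring(html)
results = []
for site in root.findall('.//div[@class="title-and-desc"]'):
    title = site.find('a/div[@class="site-title"]').text
    link = site.find('a').get('href')  # no @href selection in ElementTree
    desc = site.find('div[@class="site-descr "]').text
    results.append((title, link, desc))
print(results)
```

Scrapy's selectors (built on lxml) handle real-world, non-well-formed HTML; this sketch only shows the shape of the extraction logic.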
6、dmoz_spider.py code:
```python
import scrapy
from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/'
    ]

    def parse(self, response):
        sel = scrapy.selector.Selector(response)
        sites = sel.xpath('//div[@class="title-and-desc"]')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/div[@class="site-title"]/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('div[@class="site-descr "]/text()').extract()
            items.append(item)
        return items
```

7、cmd
-->cd tutorial
-->scrapy crawl dmoz -o items.json -t json
"items.json" file will be created