A Power Toolkit for Writing Web Crawlers

2024-04-27 14:34:15

Source: reposted

Contributed by: a reader

A Power Toolkit for Writing Web Crawlers - Groovy + Jsoup + Sublime

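I have written quite a few small crawlers, and until now I mostly used C# + Html Agility Pack for the job. The .NET FCL only offers the "low-level" HttpWebRequest and the "mid-level" WebClient, so HTTP work still takes a lot of code. On top of that, writing C# means running Visual Studio, which is a very "heavy" tool, so my development efficiency stayed low for a long time.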

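Recently a project introduced me to a wonderful language, Groovy: a dynamic language that accepts nearly all Java syntax and adds a large number of extra language features. Combine it with the open-source Jsoup project, a lightweight library that parses HTML using CSS selectors, and writing crawlers becomes a real pleasure.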

A script that grabs the news titles from the cnblogs home page

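    // The jsoup version below is an assumption; any recent release should work
    @Grab('org.jsoup:jsoup:1.17.2')
    import org.jsoup.Jsoup

    // Print the text of every title link on the home page
    Jsoup.connect("http://cnblogs.com").get()
        .select("#post_list > div > div.post_item_body > h3 > a")
        .each { println it.text() }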

Output: (screenshot in the original post)

Grabbing the news details from the cnblogs home page

    // The jsoup version below is an assumption; any recent release should work
    @Grab('org.jsoup:jsoup:1.17.2')
    import org.jsoup.Jsoup

    // Take the first five posts; selectors starting with ">" are relative to each post
    Jsoup.connect("http://cnblogs.com").get().select("#post_list > div").take(5).each {
        def url = it.select("> div.post_item_body > h3 > a").attr("href")
        def title = it.select("> div.post_item_body > h3 > a").text()
        def description = it.select("> div.post_item_body > p").text()
        def author = it.select("> div.post_item_body > div > a").text()
        def comments = it.select("> div.post_item_body > div > span.article_comment > a").text()
        def view = it.select("> div.post_item_body > div > span.article_view > a").text()

        println ""
        println "News: $title"
        println "Link: $url"
        println "Description: $description"
        println "Author: $author, Comments: $comments, Views: $view"
    }

Output: (screenshot in the original post)

Pretty convenient, isn't it? If it feels like writing front-end JavaScript and jQuery code, that is exactly the point!

Here is a tip: when writing CSS selectors, you can let Google Chrome's developer tools help you (the original post showed a screenshot here).
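Before pointing a selector at the live site, it can help to try it on a small HTML string first. The sketch below is only an illustration: the markup is a hypothetical stand-in that mimics the structure the selectors in the two scripts above expect (the real cnblogs page may differ), and the jsoup version in `@Grab` is an assumption.

```groovy
// The jsoup version is an assumption; any recent release should work
@Grab('org.jsoup:jsoup:1.17.2')
import org.jsoup.Jsoup

// Hypothetical markup mimicking the structure the selectors expect
def html = '''
<div id="post_list">
  <div class="post_item">
    <div class="post_item_body">
      <h3><a href="http://example.com/a">First title</a></h3>
      <p>First summary</p>
    </div>
  </div>
  <div class="post_item">
    <div class="post_item_body">
      <h3><a href="http://example.com/b">Second title</a></h3>
      <p>Second summary</p>
    </div>
  </div>
</div>
'''

// Jsoup.parse builds a Document from a string, so no network access is needed
def doc = Jsoup.parse(html)

// Same selector as the title script, collected into a list of strings
def titles = doc.select("#post_list > div > div.post_item_body > h3 > a")*.text()

// Selectors that start with ">" are evaluated relative to each item
def posts = doc.select("#post_list > div").collect { item ->
    [title: item.select("> div.post_item_body > h3 > a").text(),
     url  : item.select("> div.post_item_body > h3 > a").attr("href")]
}

println titles
posts.each { println "${it.title} -> ${it.url}" }
```

Once the selector returns what you expect on the sample, swapping `Jsoup.parse(html)` for `Jsoup.connect(url).get()` runs the same query against the real page.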

Now let's look at how quickly Groovy handles JSON and XML. In one phrase: it could hardly be more convenient.

Grabbing the cnblogs feed

    // Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)
    import groovy.xml.XmlSlurper

    new XmlSlurper().parse("http://feed.cnblogs.com/blog/sitehome/rss").with { xml ->
        def title = xml.title.text()
        def subtitle = xml.subtitle.text()
        def updated = xml.updated.text()

        println "feeds"
        println "title -> $title"
        println "subtitle -> $subtitle"
        println "updated -> $updated"

        // Collect the first three entries as [id, subject, summary, author, published]
        def entryList = xml.entry.take(3).collect {
            def id = it.id.text()
            def subject = it.title.text()
            def summary = it.summary.text()
            def author = it.author.name.text()
            def published = it.published.text()
            [id, subject, summary, author, published]
        }.each {
            println ""
            println "article -> ${it[1]}"
            println it[0]
            println "author -> ${it[3]}"
        }
    }

Output: (screenshot in the original post)
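The same GPath navigation can be tried without a network connection by feeding XmlSlurper a string. The following is a minimal sketch under stated assumptions: the Atom document is a made-up sample, not the real cnblogs feed, and each entry is collected into a map instead of the positional list used above, which is a swapped-in choice so that fields can be read by name.

```groovy
// Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)
import groovy.xml.XmlSlurper

// Hypothetical Atom document standing in for the live feed
def atom = '''<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <updated>2024-01-01T00:00:00Z</updated>
  <entry>
    <id>urn:example:1</id>
    <title>First post</title>
    <author><name>alice</name></author>
  </entry>
  <entry>
    <id>urn:example:2</id>
    <title>Second post</title>
    <author><name>bob</name></author>
  </entry>
</feed>'''

// parseText reads XML from a String; parse(uri) in the script above reads it from a URL
def feed = new XmlSlurper().parseText(atom)

// feed.title only matches direct children, so entry titles are not mixed in
def feedTitle = feed.title.text()

// One map per <entry>, so later code can use it.subject instead of it[1]
def entries = feed.entry.collect { e ->
    [id: e.id.text(), subject: e.title.text(), author: e.author.name.text()]
}

println "title -> $feedTitle"
entries.each { println "${it.subject} by ${it.author}" }
```

Maps cost a few more characters than lists, but they remove the need to remember which index holds which field.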

Grabbing the product category information for MSDN subscriptions

    import groovy.json.JsonSlurper

    new JsonSlurper().parse(new URL("http://msdn.microsoft.com/en-us/subscriptions/json/GetProductCategories?brand=MSDN&localeCode=en-us")).with { rs ->
        // Print the Name field of every category
        println rs.collect { it.Name }
    }

Output: (screenshot in the original post)
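To see what JsonSlurper hands back, the same call can be run on an inline string. This is a sketch only: the payload below is hypothetical and assumes the service returns an array of objects with a `Name` field, which is the shape the script above relies on; the `Id` field is invented for illustration.

```groovy
import groovy.json.JsonSlurper

// Hypothetical payload; only the Name field is taken from the script above
def json = '''[
  {"Id": 1, "Name": "Applications"},
  {"Id": 2, "Name": "Operating Systems"}
]'''

// parseText turns JSON arrays into Lists and JSON objects into Maps
def categories = new JsonSlurper().parseText(json)

// Map keys can be read with property syntax, just like it.Name above
def names = categories.collect { it.Name }

println names
```

Because the result is made of ordinary lists and maps, the usual Groovy collection methods such as `collect`, `find`, and `take` apply to it directly.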