Crawling Kugou playlists with the PHPCrawl crawler library: a worked example

2020-03-22 19:44:16

Source: reposted

Contributed by: a site user

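This article shows how to crawl Kugou playlists with the PHPCrawl crawler library. It covers basic use of PHPCrawl and the regular-expression matching used to extract the data. Readers who need something similar can use it as a reference.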

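This article walks through an example of crawling Kugou playlists with the PHPCrawl crawler library. It is shared here for reference; the details are as follows: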

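After watching some videos about web crawlers, I was itching to crawl something. The sticker war on Facebook has been fierce lately, so my first idea was to download all of the stickers, but I could not find a suitable VPN at the time. Instead, I crawled Kugou's featured songs from the past month, along with their short descriptions, and saved them locally. The code is a bit messy and I am not very happy with it, so I was reluctant to post it and embarrass myself. On second thought, this is my first crawler after all, so... here is the rather unsightly code. (The amount of data is small, so I did not bother with multiple processes and the like. I did look through the PHPCrawl documentation, though, and found that the library already wraps every feature I could think of, so this would be easy to add.)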

<?php
header('Content-type:text/html;charset=utf-8');

// It may take a while to crawl a site ...
set_time_limit(10000);

include('libs/PHPCrawler.class.php');

class MyCrawler extends PHPCrawler
{
    // Called by PHPCrawl for every document it receives.
    function handleDocumentInfo($DocInfo)
    {
        $url = $DocInfo->url;

        // Only playlist detail pages are parsed.
        $pat = '/http:\/\/www\.kugou\.com\/yy\/special\/single\/\d+\.html/';
        if (preg_match($pat, $url) > 0) {
            $this->parseSonglist($DocInfo);
        }
        flush();
    }

    public function parseSonglist($DocInfo)
    {
        $content = $DocInfo->content;
        $songlistArr = array();
        $songlistArr['raw_url'] = $DocInfo->url;

        // Parse the playlist name (the text after the "名称：" label).
        $matches = array();
        $pat = '/<span>名称：<\/span>([^<]+)<br/';
        $ret = preg_match($pat, $content, $matches);
        if ($ret > 0) {
            $songlistArr['title'] = $matches[1];
        } else {
            $songlistArr['title'] = '';
        }

        // Parse the songs (title attribute of each song link).
        $pat = '/<a title="([^"]+)" hidefocus="/';
        $matches = array();
        preg_match_all($pat, $content, $matches);
        $songlistArr['songs'] = array();
        for ($i = 0; $i < count($matches[0]); $i++) {
            $song_title = $matches[1][$i];
            array_push($songlistArr['songs'], array('title' => $song_title));
        }

        echo '<pre>';
        print_r($songlistArr);
        echo '</pre>';
    }
}

$crawler = new MyCrawler();

// URL to crawl
$start_url = 'http://www.kugou.com/yy/special/index/1-0-2.html';
$crawler->setURL($start_url);

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule('#text/html#');

// Follow only playlist detail pages and playlist index pages
$crawler->addURLFollowRule('#http://www\.kugou\.com/yy/special/single/\d+\.html$#i');
$crawler->addURLFollowRule('#http://www\.kugou\.com/yy/special/index/\d+-\d+-2\.html$#i');

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// No traffic limit (0 = unlimited; the value is in bytes)
$crawler->setTrafficLimit(0);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

// Detect the linebreak for output ("\n" in CLI mode, otherwise "<br />")
if (PHP_SAPI == 'cli') $lb = "\n";
else $lb = '<br />';

echo 'Summary:' . $lb;
echo 'Links followed: ' . $report->links_followed . $lb;
echo 'Documents received: ' . $report->files_received . $lb;
echo 'Bytes received: ' . $report->bytes_received . ' bytes' . $lb;
echo 'Process runtime: ' . $report->process_runtime . ' sec' . $lb;
?>
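The two extraction patterns can be tried without running a crawl at all. The following is only a rough sketch: the helper name parseSonglistHtml and the sample HTML are invented for illustration, the patterns are modelled on those in parseSonglist(), and the real Kugou markup may differ or may have changed since the article was written.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): applies the two
// extraction patterns to a plain HTML string.
function parseSonglistHtml(string $content): array
{
    $result = array('title' => '', 'songs' => array());

    // Playlist name: text between "<span>名称：</span>" and the next "<br".
    if (preg_match('/<span>名称：<\/span>([^<]+)<br/', $content, $m) > 0) {
        $result['title'] = trim($m[1]);
    }

    // Song titles: title attribute of each <a title="..." hidefocus="..."> link.
    if (preg_match_all('/<a title="([^"]+)" hidefocus="/', $content, $m) > 0) {
        foreach ($m[1] as $songTitle) {
            $result['songs'][] = array('title' => $songTitle);
        }
    }

    return $result;
}

// Made-up sample markup, NOT copied from the real site.
$sampleHtml = '<p><span>名称：</span>Monthly picks<br /></p>'
    . '<a title="Song A" hidefocus="true" href="#">Song A</a>'
    . '<a title="Song B" hidefocus="true" href="#">Song B</a>';

$parsed = parseSonglistHtml($sampleHtml);
print_r($parsed);
```

Keeping the parsing in a function that takes a string makes it easy to save one real page to disk and adjust the patterns against it before starting a full crawl.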

PS: Here are two handy regular-expression tools for reference:

JavaScript regular expression online tester:
http://tools.jb51.net/regex/javascript

Online regular expression generator:
http://tools.jb51.net/regex/create_reg
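Besides online tools, the URL follow rules can also be checked locally with a few lines of PHP. This is a minimal sketch under stated assumptions: the helper shouldFollow is hypothetical, the patterns are modelled on the two addURLFollowRule() calls in the script, and only the first sample URL comes from the article (the other two are invented).

```php
<?php
// Patterns modelled on the two addURLFollowRule() calls in the crawler script.
$followRules = array(
    '#http://www\.kugou\.com/yy/special/single/\d+\.html$#i',
    '#http://www\.kugou\.com/yy/special/index/\d+-\d+-2\.html$#i',
);

// Hypothetical helper: true if the URL matches at least one follow rule.
function shouldFollow(string $url, array $rules): bool
{
    foreach ($rules as $rule) {
        if (preg_match($rule, $url) === 1) {
            return true;
        }
    }
    return false;
}

// Sample URLs: the first is the start URL from the article, the others are invented.
$samples = array(
    'http://www.kugou.com/yy/special/index/1-0-2.html',
    'http://www.kugou.com/yy/special/single/12345.html',
    'http://www.kugou.com/yy/rank/home/1-8888.html',
);

foreach ($samples as $url) {
    echo (shouldFollow($url, $followRules) ? 'follow' : 'skip') . '  ' . $url . "\n";
}
```

Running a handful of known URLs through the rules this way is a cheap check that the crawler will stay inside the playlist pages.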
