Extracting Web Page Information with BeautifulSoup and Saving It Automatically

2019-11-14 11:37:47


Source: reposted

Contributed by: site user

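The instance methods and attributes of the BeautifulSoup class are not covered again here. Instead, this article works through an example to show how BeautifulSoup can extract information from a website and store it automatically.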

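The example below creates a folder named after the last path segment of the supplied URL (appcpzs for the URL used here) and saves the image files it finds into that folder.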

import os
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

import requests
from bs4 import BeautifulSoup

# Optional: take the URL from the command line instead of hard-coding it.
# import sys
# if len(sys.argv) < 2:
#     print("Usage: python bs4FileTest.py URL")
#     sys.exit(1)

url = 'http://www.abvedu.com/appcpzs'

# Fetch the page and decode it as GBK (the original had the typo 'bgk')
resp = requests.get(url)
resp.encoding = 'gbk'
# The page source as a single string
html = resp.text

# Parse the HTML and build a BeautifulSoup instance named bs
bs = BeautifulSoup(html, 'html.parser')

# Search for <a> and <img> tags: an image can appear as an img src
# or as the target of a link's href
all_tags = bs.find_all(['a', 'img'])
# all_tags = bs.find_all(['img'])

# Folder name: the last path segment of the URL
image_dir = url.rstrip('/').split('/')[-1]

for tag in all_tags:
    # Read the attribute values; get() returns None when an attribute is absent
    src = tag.get('src')
    href = tag.get('href')
    for t in (src, href):
        if t is None:
            continue
        # urljoin handles absolute, root-relative and page-relative links
        full_path = urljoin(url, t)
        filename = os.path.basename(urlparse(full_path).path)
        ext = os.path.splitext(filename)[1].lower()
        if ext not in ('.jpg', '.png', '.gif'):
            continue
        print(full_path)
        # Create the target folder if it does not exist yet
        os.makedirs(image_dir, exist_ok=True)
        # Download the image and save it under its original name and extension
        with urlopen(full_path) as image, \
                open(os.path.join(image_dir, filename), 'wb') as fp:
            fp.write(image.read())
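
The link handling in this example is the part most likely to go wrong, and it can be checked without any network access. The following is a minimal sketch, using only the standard library, of one way to turn a src or href value into an absolute URL and a local filename. The helper name image_target and the sample links are hypothetical and chosen only for illustration. urljoin resolves both root-relative and page-relative links, which plain concatenation of domain and path does not.

```python
import os
from urllib.parse import urljoin, urlparse

# The same three image types the example looks for
IMAGE_EXTENSIONS = ('.jpg', '.png', '.gif')


def image_target(page_url, link):
    """Return (absolute_url, filename) for an image link, or None otherwise.

    Hypothetical helper for illustration; it is not part of BeautifulSoup.
    """
    if not link:
        return None
    # Resolve the link against the page it was found on
    absolute = urljoin(page_url, link)
    # Work on the path only, so a query string such as ?v=2 is ignored
    path = urlparse(absolute).path
    ext = os.path.splitext(path)[1].lower()
    if ext not in IMAGE_EXTENSIONS:
        return None
    return absolute, os.path.basename(path)


page = 'http://www.abvedu.com/appcpzs'
samples = ['/images/logo.png', 'pics/a.jpg',
           'http://example.com/b.gif?v=2', '/about.html', None]
for link in samples:
    print(link, '->', image_target(page, link))
```

Comparing the parsed extension, rather than testing whether '.jpg' occurs anywhere in the string, avoids matching page URLs that merely contain an image name, and it keeps the original extension so a .gif file is not saved as .png.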
