Scraping Cover Images from Douban's Fiction Tag Page with Python (Multithreaded Version)

Python and Web Crawlers | Haran | 2016-09-13

URL: https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4
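The percent-encoded part of the URL is simply the UTF-8 encoding of the tag name 小说 ("fiction"). A minimal sketch using the standard library's `urllib.parse` shows the round trip; the variable names are only illustrative:

```python
from urllib.parse import quote, unquote

# The tag segment exactly as it appears in the URL above
encoded_tag = '%E5%B0%8F%E8%AF%B4'

# Decode the percent-escapes back into the original characters
decoded_tag = unquote(encoded_tag)

# Encoding the tag again gives back the same escaped form
reencoded_tag = quote(decoded_tag)

if __name__ == '__main__':
    print(decoded_tag)
    print(reencoded_tag)
```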

This post converts the earlier single-threaded version into a multithreaded one.

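Result: the run finishes in under 1 minute, compared with 5 to 6 minutes for the single-threaded version, so it is roughly 6 times faster.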
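The speedup comes from the work being I/O-bound: while one thread waits on the network, the others can proceed. The sketch below illustrates the effect without touching the network. `fake_download` is a hypothetical stand-in that only sleeps, so the timings it prints are not the ones reported above.

```python
import threading
import time


def fake_download(delay=0.05):
    # Hypothetical stand-in for one page fetch: sleeping mimics network wait
    time.sleep(delay)


def run_sequential(n):
    # Run n fake downloads one after another and return the elapsed seconds
    start = time.perf_counter()
    for _ in range(n):
        fake_download()
    return time.perf_counter() - start


def run_threaded(n):
    # Run n fake downloads in n threads and return the elapsed seconds
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_download) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == '__main__':
    print("sequential: %.2fs" % run_sequential(10))
    print("threaded:   %.2fs" % run_threaded(10))
```

The sequential run takes about the sum of all the waits, while the threaded run takes about as long as the longest single wait.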


The scraper's source code is as follows:

from bs4 import BeautifulSoup
import requests
import os
import urllib.request
import random
import time
import threading

# User-Agent strings to rotate between requests. Adjacent string literals are
# joined, so every piece except the last ends with an explicit space.
user_agent = [
    ('Mozilla/5.0 (Windows NT 6.1) '
     'AppleWebKit/537.11 (KHTML, like Gecko) '
     'Chrome/23.0.1271.64 Safari/537.11'),
    ('Mozilla/5.0 (Windows NT 6.1; WOW64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/47.0.2526.106 Safari/537.36'),
    ('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) '
     'Gecko/20100101 Firefox/31.0'),
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.17 (KHTML, like Gecko) '
     'Chrome/24.0.1312.56 Safari/537.17'),
    ('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) '
     'Gecko/20100101 Firefox/35.0'),
]


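def downloadurl(url):
    """Download every cover image found on one listing page."""
    time.sleep(1)  # brief pause before requesting the page
    header = {
        'Connection': 'Keep-Alive',
        'Accept': 'text/html, application/xhtml+xml, */*',
        'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
        # Pick a random User-Agent for each page
        'User-Agent': random.choice(user_agent),
    }

    soup = BeautifulSoup(requests.get(url, headers=header).text, "html.parser")
    # Each book on the page is an <li class="subject-item">
    items = soup('li', 'subject-item')

    for count, item in enumerate(items, start=1):
        src = item.find('div', 'pic').img.get('src')
        title = item.find('div', 'info').a.get('title')
        # Save as "<title>.jpg" in the current directory; basename() drops
        # anything before a path separator in the title
        urllib.request.urlretrieve(src, os.path.basename(title + '.jpg'))
        print(count)
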

if __name__ == '__main__':
    threads = []
    # 50 listing pages of 20 books each: start=0, 20, ..., 980
    for i in range(0, 1000, 20):
        geturl = 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start={0}&type=T'.format(i)
        print("now to get " + geturl)
        # One thread per page; args must be a tuple
        t = threading.Thread(target=downloadurl, args=(geturl,))
        threads.append(t)
    # Start all threads, then wait for every one of them to finish
    for t in threads:
        t.start()
    for t in threads:
        t.join()
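Starting one thread per page means 50 requests hit the site at nearly the same moment. As a possible alternative (not what the code above does), a bounded thread pool from `concurrent.futures` caps how many pages are fetched at once. This is a sketch under assumptions: `build_page_urls`, `run_in_pool` and the `max_workers=8` default are hypothetical names and values chosen for illustration, and the demo uses `len` as a stand-in worker so it runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

# Same URL pattern as the script above
BASE = 'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start={0}&type=T'


def build_page_urls(total=1000, step=20):
    # Hypothetical helper: one URL per listing page (start=0, 20, ..., 980)
    return [BASE.format(start) for start in range(0, total, step)]


def run_in_pool(worker, urls, max_workers=8):
    # Hypothetical helper: call worker(url) for every URL using at most
    # max_workers threads, returning the results in input order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, urls))


if __name__ == '__main__':
    urls = build_page_urls()
    # Stand-in worker for the demo; pass downloadurl instead to fetch pages
    results = run_in_pool(len, urls)
    print(len(results), "pages processed")
```

The `with` block waits for all submitted work to finish, which takes the place of the explicit `join()` loop.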

If you have any questions, leave a comment below the article or email me (haran.huang@ichdata.com).