
Python Web Crawler Study Notes (1)

2023-09-14 09:07:06 160

These notes cover the basics of writing a web crawler in Python and are shared here for reference. The details follow.

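(I) Three ways to scrape a web page

1. Regular expressions

Python's re module is implemented in C, so it is fast. Scraping with regular expressions is brittle, though: a pattern written against one version of a page may stop matching as soon as the page's markup changes.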
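To illustrate that brittleness, here is a minimal sketch using only the standard library. The two HTML fragments are invented for this example (they are not taken from a real page), and the pattern deliberately assumes that the class attribute comes right before src.

```python
import re

# Hypothetical HTML fragments, made up for this illustration.
old_html = '<img class="BDE_Image" src="a.jpg">'
new_html = '<img src="a.jpg" class="BDE_Image">'  # same tag, attributes reordered

# A pattern that assumes class comes immediately before src.
pattern = re.compile(r'<img class="BDE_Image" src="(.*?)">')

old_matches = pattern.findall(old_html)
new_matches = pattern.findall(new_html)

print(old_matches)  # ['a.jpg']
print(new_matches)  # []
```

The second fragment describes the same image, but the pattern no longer matches it. A parser-based approach that looks up the element by its class is not affected by attribute order.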

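2. BeautifulSoup

The module is written in Python, so it is slower.

Installation:

pip install beautifulsoup4

3. Lxml

The module is written in C. It is both fast and robust, and is usually the best choice.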

(II) Installing Lxml

pip install lxml

To use lxml's CSS selectors, also install the following module:

pip install cssselect

(III) An lxml example

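import urllib.error
import urllib.request
import lxml.html

# Download a page and return its HTML as bytes, or None on failure
def download(url, user_agent='Socrates', num=2):
    print('Downloading: ' + url)
    # Set the User-Agent request header
    headers = {'User-Agent': user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        # Download the page
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        # e.reason is not always a string, so convert it before concatenating
        print('Download failed: ' + str(e.reason))
        html = None
        if num > 0:
            # On a 5XX error, call download() recursively to retry, at most 2 times
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, num - 1)
    return html

html = download('https://tieba.baidu.com/p/5475267611')
if html is None:
    raise SystemExit('Could not download the page')
# Parse the HTML into a consistent tree structure
tree = lxml.html.fromstring(html)
# CSS selector alternative (needs cssselect); returns elements, not src values
# img = tree.cssselect('img.BDE_Image')
# Use lxml's XPath support to read the src attribute values; returns a list
img = tree.xpath('//img[@class="BDE_Image"]/@src')
# Iterate over the list and save each image in the current directory
for x, i in enumerate(img):
    urllib.request.urlretrieve(i, '%s.jpg' % x)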
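The retry-on-5XX idea in download() can also be written as a loop instead of recursion. The following is only a sketch under stated assumptions: fetch_with_retries and its fetch parameter are hypothetical names introduced here (they are not part of the original notes or of urllib), and the fetch function is passed in so the logic can be tried without network access.

```python
import urllib.error


def fetch_with_retries(fetch, url, retries=2):
    """Call fetch(url); retry up to `retries` more times on HTTP 5XX errors.

    `fetch` is any callable that returns the page body or raises a urllib error
    (hypothetical parameter, used here so the sketch runs offline).
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as e:
            # Retry only server errors, and only while attempts remain.
            if not (500 <= e.code < 600) or attempt == retries:
                return None
        except urllib.error.URLError:
            # No HTTP status available (for example a DNS failure): give up.
            return None
    return None


# Stand-in for a real download: fails twice with 503, then succeeds.
attempts = []

def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise urllib.error.HTTPError(url, 503, 'Service Unavailable', None, None)
    return b'<html></html>'

result = fetch_with_retries(flaky_fetch, 'http://example.invalid/')
```

With retries=2 the stand-in is called three times in total, which matches the "at most 2 retries" rule in the code above. HTTPError is caught before URLError because it is a subclass of URLError.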

