[Notes] urllib Study Notes

Preface

Study notes on the urllib crawler library.

Importing the package

  • Python 3

```python
import urllib.request
```

  • Python 2

```python
import urllib2
```

Defining the request URL and parameter list

Without parameters

```python
url = "http://www.baidu.com"
response = urllib.request.urlopen(url)
```

With parameters

GET

```python
url = "http://www.baidu.com"
params = "wd=张三"
# join the query string onto the URL with "?"
url = url + "?" + params
# note: the Chinese value still has to be percent-encoded
# (see the section on Chinese characters below) before this request works
response = urllib.request.urlopen(url)
```
Dict

```python
url = "http://www.baidu.com"
params = {
    "wd": "张三"
}
# convert the dict parameters to a query string
params = urllib.parse.urlencode(params)
url = url + "?" + params
response = urllib.request.urlopen(url)
```
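The dict-to-query-string step can be checked without a network; a minimal sketch (the `/s` search path is an assumption for illustration):

```python
import urllib.parse

# dict parameters, as in the note above
params = {"wd": "张三"}

# urlencode turns the dict into a percent-encoded query string
query = urllib.parse.urlencode(params)
print(query)  # wd=%E5%BC%A0%E4%B8%89

# join with "?" before sending the request
url = "http://www.baidu.com/s" + "?" + query
print(url)    # http://www.baidu.com/s?wd=%E5%BC%A0%E4%B8%89
```

Note that urlencode() also percent-encodes the Chinese value, so no extra quote() call is needed for dict parameters.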

POST

```python
url = "http://www.baidu.com"
params = "wd=张三"
# the request body must be bytes, so encode the string
params = params.encode("utf-8")
# passing a data argument switches urlopen to a POST request
response = urllib.request.urlopen(url, params)
```
Dict

```python
url = "http://www.baidu.com"
params = {
    "wd": "张三"
}
# convert the dict parameters to a query string
params = urllib.parse.urlencode(params)
# encode the string to bytes for the request body
params = params.encode("utf-8")
response = urllib.request.urlopen(url, params)
```
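The body-encoding step can be verified offline: urlencode() produces a str, and encode() turns it into the bytes that urlopen's data argument requires. A minimal sketch:

```python
import urllib.parse

params = {"wd": "张三"}
# query string first, then bytes for the POST body
body = urllib.parse.urlencode(params).encode("utf-8")
print(type(body))  # <class 'bytes'>
print(body)        # b'wd=%E5%BC%A0%E4%B8%89'
# urllib.request.urlopen(url, body) would now send a POST request
```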

Chinese characters

  • A URL may only contain ASCII characters, so Chinese characters cannot be sent in a URL as-is
  • Parameter strings containing Chinese therefore have to be escaped with URL encoding first

Importing the packages

```python
import urllib.parse
import string
```

Escaping

params: the parameter string containing Chinese characters

```python
urllib.parse.quote(params, safe=string.printable)
```
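A quick sketch of what quote() does with safe=string.printable: every printable ASCII character (including : / ? =) is kept as-is, and only the Chinese characters are percent-encoded (the `/s` path is illustrative):

```python
import string
import urllib.parse

url = "http://www.baidu.com/s?wd=张三"
# keep printable ASCII untouched; escape everything else
safe_url = urllib.parse.quote(url, safe=string.printable)
print(safe_url)  # http://www.baidu.com/s?wd=%E5%BC%A0%E4%B8%89
```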

Handling the response

  • Decode the response data

  • decode() without an argument defaults to UTF-8

response: the response object that came back

```python
data = response.read().decode()
```

Specifying the charset

```python
data = response.read().decode("utf-8")
```

Converting crawled data types

  • Crawled data comes back as either str or bytes

  • If the data comes back as bytes but you need str when writing it out:

```python
data.decode("utf-8")
```

  • If the data comes back as str but you need bytes when writing it out:

```python
data.encode("utf-8")
```
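The two conversions are inverses of each other; a minimal round-trip sketch with a stand-in string:

```python
# str → bytes, e.g. what a POST body or a file opened in "wb" mode needs
html_str = "<html>百度</html>"
html_bytes = html_str.encode("utf-8")
print(type(html_bytes))  # <class 'bytes'>

# bytes → str, e.g. what response.read() returns before text processing
back = html_bytes.decode("utf-8")
print(back == html_str)  # True
```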

Persistence

  • Write the crawled page to a file

data: the processed response data

```python
with open("baidu.html", "w", encoding="utf-8") as f:
    f.write(data)
```
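A self-contained sketch of the write-then-verify cycle, using a temp directory and a stand-in data string so it runs without a real crawl:

```python
import os
import tempfile

data = "<html>demo page</html>"  # stand-in for the decoded response

path = os.path.join(tempfile.gettempdir(), "baidu.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(data)

# read it back to confirm the file was persisted intact
with open(path, "r", encoding="utf-8") as f:
    print(f.read() == data)  # True
```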

Request and response headers

Creating a request object

url: the address to visit

```python
request = urllib.request.Request(url)
```

Request headers

  • Get the request header info from the request object

```python
print(request.headers)
```

Response headers

  • Send the request via the request object, then read the response headers

```python
response = urllib.request.urlopen(request)
print(response.headers)
```

Adding header parameters to the request

  • Add a User-Agent parameter

```python
url = "http://www.baidu.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}
request = urllib.request.Request(url, headers=headers)
print(request.headers)
```

Printing a specific request header parameter

User-agent: get_header() expects the name with only the first letter capitalized and the rest lowercase

```python
request_headers = request.get_header("User-agent")
print(request_headers)
```
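The capitalization rule can be seen offline: Request normalizes stored header names with str.capitalize(), so only a first-letter-uppercase lookup matches. A minimal sketch with a dummy agent string:

```python
import urllib.request

req = urllib.request.Request(
    "http://www.baidu.com",
    headers={"User-Agent": "test-agent"},  # dummy value for illustration
)
# the stored key is normalized to "User-agent"
print(req.headers)                   # {'User-agent': 'test-agent'}
print(req.get_header("User-agent"))  # test-agent
print(req.get_header("User-Agent"))  # None - wrong capitalization
```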

Adding header data dynamically

```python
url = "http://www.baidu.com"
request = urllib.request.Request(url)
request.add_header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")
print(request.headers)
```

Getting the full URL

```python
request = urllib.request.Request(url)
request.get_full_url()
```
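No network is needed to see this; get_full_url() simply returns the URL the request object was built with, query string included:

```python
import urllib.request

req = urllib.request.Request("http://www.baidu.com/s?wd=test")
print(req.get_full_url())  # http://www.baidu.com/s?wd=test
```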

Random User-Agent

Importing the package

```python
import random
```

Random User-Agent

```python
url = "http://www.baidu.com"
user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; Tablet PC 2.0; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.33 Safari/534.3 SE 2.X MetaSr 1.0",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1 QQBrowser/6.9.11079.201",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E) QQBrowser/6.9.11079.201",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
]
# pick a random agent from the pool for each request
random_user_agent = random.choice(user_agent_list)
request = urllib.request.Request(url)
request.add_header("User-Agent", random_user_agent)
response = urllib.request.urlopen(request)
print(request.get_header("User-agent"))
```
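The pattern above can be wrapped in a small helper so each request picks its own agent; a sketch with a shortened, hypothetical pool (the name build_request is illustrative):

```python
import random
import urllib.request

# shortened stand-in for the full pool above
user_agent_list = [
    "agent-a/1.0",
    "agent-b/2.0",
    "agent-c/3.0",
]

def build_request(url):
    """Build a Request carrying a randomly chosen User-Agent."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", random.choice(user_agent_list))
    return req

req = build_request("http://www.baidu.com")
# note the capitalized lookup key
print(req.get_header("User-agent") in user_agent_list)  # True
```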

Done
