A simple crawler: Ajax analysis + saving with pymongo

When scraping web pages, you may have run into pages where the HTML returned by a direct request does not contain the data you need, i.e. the content you actually see in the browser.

That is because this information is loaded via Ajax and rendered by JavaScript.



Analyzing the site

Open the browser's DevTools ("Inspect"), switch to the Network tab and filter by XHR, and find the base URL and its request parameters, e.g. https://www.guokr.com/apis/minisite/article.json?retrieve_type=by_subject&limit=20&offset=18.

import requests
from urllib.parse import urlencode
from requests.exceptions import RequestException

base_url = 'https://www.guokr.com/apis/minisite/article.json?'  # base API URL

# Request a single index page of the article list.
def get_page_index(offset):
    data = {
        'retrieve_type': 'by_subject',
        'limit': '20',
        'offset': offset
    }
    try:
        url = base_url + urlencode(data)
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Request failed!')
        return None
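
As a quick sanity check, you can fetch one index page and print the start of the returned JSON; the offset 18 here just mirrors the value seen in the browser's XHR requests:

if __name__ == '__main__':
    text = get_page_index(18)
    print(text[:200] if text else 'no response')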

Open the Preview tab: under result there are multiple entries, and each entry's url field is the address of one article.

import json

# Fetch the raw HTML of one article page.
def get_page(url):
    try:
        resp = requests.get(url)
        if resp.status_code == 200:
            return resp.text
        return None
    except RequestException:  # the built-in ConnectionError would miss requests' errors
        print('Error.')
        return None

# Parse the JSON index and yield each article URL.
def parse_json(text):
    try:
        result = json.loads(text)
        if result:
            for item in result.get('result'):
                print(item.get('url'))
                yield item.get('url')
    except (TypeError, json.JSONDecodeError):
        pass
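
Chaining the two steps: one index request yields a batch of article URLs (parse_json is a generator, so wrap it in list() to consume it):

html = get_page_index(18)
urls = list(parse_json(html)) if html else []
print(len(urls), 'article URLs')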

Next, open an article's page source and locate the data we want to extract.

from bs4 import BeautifulSoup

# Extract title, author, paragraphs and image URLs from an article page.
def parse_page(response, url):
    try:
        soup = BeautifulSoup(response, 'lxml')
        content = soup.find('div', class_='content')
        title = content.find('h1', id='articleTitle').get_text()
        author = content.find('div', class_='content-th-info').find('a').get_text()
        article_content = content.find('div', class_='document').find_all('p')
        all_p = []
        img = []
        for p in article_content:
            # Keep plain-text paragraphs; collect image sources separately.
            if not p.find('img') and not p.find('a'):
                all_p.append(p.get_text())
            elif p.find('img'):
                img.append(p.find('img').get('src'))
        data = {
            'url': url,
            'title': title,
            'author': author,
            'article': all_p,
            'images': img
        }
        return data
    except AttributeError:
        # A missing tag means the page layout differs; skip this article.
        return None
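
To test the parser in isolation, feed it a single article page. The URL below is only a placeholder; substitute one printed by parse_json:

test_url = 'https://www.guokr.com/article/442500/'  # hypothetical article URL
html = get_page(test_url)
if html:
    print(parse_page(html, test_url))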

Connecting to MongoDB

import pymongo

client = pymongo.MongoClient(host='localhost', port=27017)
db = client['guoke_ajax']

# Save one article document into the 'test' collection.
def save_data(data):
    collection = db['test']
    if collection.insert_one(data):  # insert() was removed in PyMongo 4; use insert_one()
        print('save data successful')
        return True
    return False
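
One caveat: re-running the crawler will insert duplicate documents. A sketch of one way around that, keyed on the article URL (this replaces the insert above with an upsert; it is an alternative, not what the original code does):

def save_data_dedup(data):
    collection = db['test']
    # Upsert keyed on the article URL, so re-crawls update instead of duplicating.
    result = collection.update_one({'url': data['url']},
                                   {'$set': data},
                                   upsert=True)
    return result.acknowledged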

The main function

def main(offset):
    html = get_page_index(offset)
    if not html:
        return
    for url in parse_json(html):
        response = get_page(url)
        if response:
            data = parse_page(response, url)
            if data:
                save_data(data)

Multiprocessing and looping over offsets

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()
    # Offsets observed in the XHR requests: 18, 38, 58, ...
    offsets = [i * 20 + 18 for i in range(500)]
    pool.map(main, offsets)
    pool.close()
    pool.join()
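
One caveat with this setup: PyMongo's MongoClient is not fork-safe, so a client created at import time is shared across the pooled worker processes. A minimal sketch of one fix, creating the client once per worker through the pool's initializer (init_worker is a name introduced here, not part of the original code):

from multiprocessing import Pool
import pymongo

db = None

def init_worker():
    # Runs once in each child process, after the fork.
    global db
    db = pymongo.MongoClient(host='localhost', port=27017)['guoke_ajax']

if __name__ == '__main__':
    pool = Pool(initializer=init_worker)
    pool.map(main, [i * 20 + 18 for i in range(500)])
    pool.close()
    pool.join()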

Note that the site has anti-scraping measures and may challenge the crawler with a CAPTCHA (a Turing test).
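
A common mitigation (a sketch only, not guaranteed to pass the check) is to send a browser-like User-Agent and throttle the request rate:

import time
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}  # example UA string

def polite_get(url):
    time.sleep(1)  # throttle to roughly one request per second
    return requests.get(url, headers=HEADERS, timeout=10)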