
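Stop Relying Only on Requests! An Introduction to Asynchronous Scraping with Aiohttp: Coroutines and Performance Gains, Using a Novel Site as the Example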

news 2026/6/11 8:32:45

Breaking Through the Requests Performance Bottleneck: Aiohttp Asynchronous Scraping in Practice and Rethinking with Coroutines

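When your scraper is fetching hundreds of novel chapter pages, have you ever suffered through this? The progress bar crawls along at a snail's pace while CPU usage stays pitifully low, and waiting on network requests accounts for more than 90% of the total run time. This is the ceiling of a synchronous request library such as Requests: the code behaves like a queue at a bubble tea shop, where each request must finish completely before the next one can start. The asynchronous model that Aiohttp brings is more like a VIP fast lane: many requests can be in flight concurrently on a single thread, so the time spent waiting on one response is used to send and receive others.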

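1. The essential difference between synchronous and asynchronous: a shift in thinking from queuing to concurrency

A traditional synchronous scraper works like a single-threaded download manager: send request → wait for response → process data → next request. In I/O-bound tasks this linear flow wastes a great deal of resources. Take a job that fetches 100 novel pages as an example: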

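```python
# Typical synchronous pattern (sketch; parse_content is a placeholder for your parser)
import requests

for page in range(1, 101):
    resp = requests.get(f"https://example.com/novel/page_{page}", timeout=10)
    parse_content(resp.text)  # runs only after the request above has finished
```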

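An asynchronous scraper does away with this serial logic:

```python
# Typical asynchronous pattern (sketch)
import asyncio
import aiohttp

async def fetch_page(session, page):
    async with session.get(f"https://example.com/novel/page_{page}") as resp:
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_page(session, page) for page in range(1, 101)]
        return await asyncio.gather(*tasks)  # all requests are in flight concurrently

pages = asyncio.run(main())
```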

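Key performance metrics compared (indicative figures; actual results depend on the target site, network latency, and the concurrency limit):

Metric	Requests (synchronous)	Aiohttp (asynchronous)
Time for 100 pages	~50 s	~5 s
Network utilization	10%-15%	70%-90%
CPU usage	20%-30%	40%-60%
Memory consumption	Low (~100 MB)	Moderate (~300 MB)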

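Tip: the real advantage of asynchronous programming is not the speed of a single request but the ability to handle many requests concurrently. Total time for the synchronous version grows linearly with the number of requests, whereas the asynchronous version is bounded mainly by the slowest response and the concurrency limit, so the gap widens steadily as the request count grows and is already pronounced beyond about 50 requests.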
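The scaling argument can be checked without touching a real website. The sketch below is a simulation only: `fake_fetch` and the `LATENCY` value are hypothetical stand-ins for a network request, not measurements from the article.

```python
import asyncio
import time

LATENCY = 0.05  # assumed per-request latency in seconds (simulated, not measured)

async def fake_fetch(page: int) -> str:
    """Stand-in for a network request: waits, then returns a dummy body."""
    await asyncio.sleep(LATENCY)
    return f"page {page}"

async def run_sequential(n: int) -> float:
    """Await each request before starting the next, like a synchronous loop."""
    start = time.perf_counter()
    for page in range(n):
        await fake_fetch(page)
    return time.perf_counter() - start

async def run_concurrent(n: int) -> float:
    """Start all requests at once and wait for them together."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_fetch(page) for page in range(n)))
    return time.perf_counter() - start

if __name__ == "__main__":
    n = 20
    print(f"sequential: {asyncio.run(run_sequential(n)):.2f}s")
    print(f"concurrent: {asyncio.run(run_concurrent(n)):.2f}s")
```

The sequential time is roughly `n * LATENCY`, while the concurrent time stays close to a single `LATENCY`, which is why the gap keeps growing with `n`.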

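2. Aiohttp core architecture: the event loop and coroutine scheduling

Understanding Aiohttp starts with three core concepts:

Event loop: the heart of an asynchronous program, responsible for scheduling the execution of all coroutines
Coroutine: an asynchronous function defined with async/await that can be suspended and resumed
Future/Task: a Future is a container for the result of an asynchronous operation; a Task is a Future that wraps a coroutine and is driven by the event loop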
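To make the three concepts concrete, here is a minimal sketch using only the standard library. The function name `greet` is made up for illustration; the point is the difference between a coroutine object, which does nothing until awaited, and a Task, which the event loop schedules immediately.

```python
import asyncio

async def greet(name: str) -> str:
    await asyncio.sleep(0)  # yield control to the event loop once
    return f"hello {name}"

async def main() -> list:
    coro = greet("a")                       # only creates a coroutine object; nothing runs yet
    task = asyncio.create_task(greet("b"))  # wraps a coroutine in a Task and schedules it
    done_at_start = task.done()             # the loop has not had a chance to run the task
    first = await coro                      # awaiting drives the coroutine to completion
    second = await task                     # the Task holds its result once finished
    return [done_at_start, first, second, task.done()]

if __name__ == "__main__":
    print(asyncio.run(main()))  # asyncio.run creates the event loop and runs main() on it
```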

A typical Aiohttp scraper consists of the following components:

```python
import asyncio
import aiohttp

# Placeholder list of pages to fetch
urls = [f"https://example.com/novel/page_{i}" for i in range(1, 11)]

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    # Create one ClientSession and share it across all requests
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(fetch(session, url)) for url in urls]
        return await asyncio.gather(*tasks)

# Start the event loop
asyncio.run(main())
```

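A reasonable starting point for connection pool and timeout configuration:

```python
import asyncio
import aiohttp

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder request headers

async def main():
    # Create the connector inside a coroutine so it binds to the running loop
    conn = aiohttp.TCPConnector(
        limit=100,                  # maximum total connections
        limit_per_host=20,          # maximum connections per host
        enable_cleanup_closed=True, # abort SSL transports that fail to close cleanly
                                    # (may be unnecessary on recent Python/aiohttp versions)
        force_close=False,          # keep connections alive for reuse (the default)
    )
    timeout = aiohttp.ClientTimeout(
        total=30,      # overall limit for the whole request
        connect=10,    # limit for acquiring a connection from the pool
        sock_read=15,  # limit between reads of response data
    )
    async with aiohttp.ClientSession(
        connector=conn, timeout=timeout, headers=headers
    ) as session:
        ...  # application code goes here

asyncio.run(main())
```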

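3. Hands-on: the full workflow for asynchronously scraping a novel ranking list

Let's take the 24-hour bestseller list of a literature website as an example and build a complete asynchronous scraping solution. The case needs to handle:

Dynamic generation of paginated URLs
Asynchronous concurrent requests
HTML parsing and data cleaning
Exception handling and a retry mechanism
Data storage and export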

Project structure:

```
novel_spider/
├── __init__.py
├── config.py    # configuration
├── crawler.py   # core crawling logic
├── models.py    # data models
├── storage.py   # storage module
└── utils.py     # utility functions
```

Core crawling logic:

```python
# crawler.py
import asyncio
import logging

import aiohttp
from bs4 import BeautifulSoup

# Placeholder; in the layout above this would live in config.py
HEADERS = {"User-Agent": "Mozilla/5.0"}
MAX_RETRIES = 3  # cap on 429 retries so a rate-limited page cannot loop forever

async def fetch_page(session: aiohttp.ClientSession, page: int, attempt: int = 1):
    url = f"https://www.example.com/rank/hotsales/page{page}/"
    try:
        async with session.get(url) as resp:
            if resp.status == 200:
                return await resp.text()
            if resp.status == 429 and attempt < MAX_RETRIES:
                await asyncio.sleep(10)  # rate limited: wait before retrying
                return await fetch_page(session, page, attempt + 1)
            logging.warning(f"Page {page} failed: {resp.status}")
    except Exception as e:
        logging.error(f"Error fetching page {page}: {e}")
    return None

def parse_html(html: str):
    # Parsing is CPU-bound and never awaits, so it is a plain function
    soup = BeautifulSoup(html, "lxml")
    books = []
    for item in soup.select(".book-list li"):
        books.append({
            "title": item.select_one(".title").text.strip(),
            "author": item.select_one(".author").text.strip(),
            "score": float(item.select_one(".score").text),
            "update": item.select_one(".update").text.strip(),
        })
    return books

async def crawl_all_pages(max_page: int = 50):
    async with aiohttp.ClientSession(headers=HEADERS) as session:
        tasks = [fetch_page(session, p) for p in range(1, max_page + 1)]
        pages = await asyncio.gather(*tasks)
    # Skip pages that failed (fetch_page returned None)
    results = [parse_html(p) for p in pages if p]
    return [book for sublist in results for book in sublist]
```
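The article lists data storage and export as a requirement but does not show storage.py. As one possible sketch, assuming CSV output is acceptable, the book dictionaries returned by the crawler could be written with the standard library. The function name `save_books_csv` and the `FIELDS` constant are hypothetical, not part of the original project.

```python
import csv
from pathlib import Path

# Keys of the book dictionaries built by parse_html
FIELDS = ["title", "author", "score", "update"]

def save_books_csv(books: list, path: str) -> int:
    """Write book dicts to a CSV file and return the number of rows written."""
    with Path(path).open("w", newline="", encoding="utf-8") as fh:
        # extrasaction="ignore" drops any keys that are not listed in FIELDS
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(books)
    return len(books)
```

Because this is blocking file I/O, inside a coroutine it would typically be called once at the end of the crawl, or through `asyncio.to_thread(save_books_csv, books, "books.csv")`.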

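Performance optimization tips:

Use asyncio.Semaphore to cap concurrency and reduce the chance of triggering anti-scraping measures
Implement a retry strategy with exponential backoff
Stream response bodies instead of loading them whole, to reduce memory usage
Reuse TCP connections to avoid repeated handshake overhead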
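The first two tips can be sketched with the standard library alone. The helper names `retry_with_backoff` and `bounded_gather` are hypothetical, and the delays are illustrative defaults rather than recommended values.

```python
import asyncio
import random

async def retry_with_backoff(make_call, retries: int = 4, base_delay: float = 0.5):
    """Call make_call() until it succeeds, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return await make_call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            # base_delay, 2x, 4x, ... plus a little jitter so retries do not align
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))

async def bounded_gather(factories, limit: int = 10):
    """Run zero-argument coroutine factories with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(factory):
        async with semaphore:  # waits here while `limit` calls are already running
            return await factory()

    return await asyncio.gather(*(run_one(f) for f in factories))
```

In a scraper, each factory would wrap one page fetch that raises on failure, for example `lambda p=p: retry_with_backoff(lambda: fetch(session, p))`, where `fetch` is your own request coroutine.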

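4. Pitfalls of asynchronous programming and how to handle them

Although asynchronous programming can greatly improve performance, it also introduces some new challenges:

Common problems and countermeasures:

Symptom	Root cause	Solution
A coroutine never runs; Python warns "coroutine ... was never awaited"	Coroutine called but not awaited	Make sure every coroutine is awaited or wrapped in a task
Leaked connections and "Unclosed client session" warnings	ClientSession not closed	Manage the session with async with
Banned for opening too many connections	No concurrency control	Limit concurrency with a Semaphore
Some requests fail with timeouts	No sensible timeout configured	Configure ClientTimeout
CPU pegged or the event loop unresponsive while throughput stays low	Blocking synchronous code mixed into coroutines	Offload it with asyncio.to_thread or run_in_executor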

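Special techniques for debugging asynchronous code:

Enable debug mode with asyncio.run(main(), debug=True) or by setting the environment variable PYTHONASYNCIODEBUG=1

Add log points inside coroutines:

```python
logger.debug(f"Start fetching {url}, running in {asyncio.current_task().get_name()}")
```

Monitor the state of the event loop:

```python
# Call from inside a running coroutine
print(f"Pending tasks: {len(asyncio.all_tasks())}")
```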

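Note: avoid calling synchronous I/O (such as file reads and writes or database queries) directly inside an async function, because it blocks the entire event loop. When it cannot be avoided, use asyncio.to_thread or loop.run_in_executor.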
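A small sketch of that advice, assuming Python 3.9 or later for `asyncio.to_thread`. The functions `blocking_save` and `heartbeat` are made-up stand-ins: one simulates synchronous I/O with `time.sleep`, the other shows that the event loop keeps running while the blocking call is in a worker thread.

```python
import asyncio
import time

def blocking_save(record: str) -> str:
    """Stand-in for synchronous I/O such as a file write or a database call."""
    time.sleep(0.1)  # blocks the thread it runs on
    return f"saved {record}"

async def heartbeat(ticks: list) -> None:
    """Records three ticks; it can only advance while the event loop is free."""
    for _ in range(3):
        await asyncio.sleep(0.02)
        ticks.append(time.perf_counter())

async def main():
    ticks = []
    saved, _ = await asyncio.gather(
        asyncio.to_thread(blocking_save, "chapter-1"),  # runs in a worker thread
        heartbeat(ticks),                                # keeps ticking meanwhile
    )
    return saved, len(ticks)

if __name__ == "__main__":
    print(asyncio.run(main()))
```

Calling `blocking_save` directly inside `main` would instead freeze the loop for the full 0.1 seconds, and every other coroutine would stall with it.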

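5. Going further: distributed asynchronous scraper architecture

When a single machine hits its performance ceiling, consider the following ways to scale out:

Distributed architecture design:

Message queue + worker pattern:
- Use Redis/RabbitMQ as the task queue
- Multiple scraper nodes consume tasks
- Results are stored centrally in a database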
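The shape of the queue-and-worker pattern can be tried in a single process before bringing in real infrastructure. The sketch below substitutes an in-process `asyncio.Queue` for Redis/RabbitMQ and a plain list for the database; both substitutions, and the names `worker` and `run_workers`, are assumptions made for illustration.

```python
import asyncio

async def worker(name: str, queue: asyncio.Queue, results: list) -> None:
    """Consume URLs from the queue until cancelled."""
    while True:
        url = await queue.get()
        try:
            await asyncio.sleep(0)        # placeholder for the real fetch
            results.append((name, url))   # placeholder for the database write
        finally:
            queue.task_done()             # lets queue.join() know this item is finished

async def run_workers(urls: list, worker_count: int = 3) -> list:
    queue = asyncio.Queue()
    results = []
    for url in urls:
        queue.put_nowait(url)
    workers = [
        asyncio.create_task(worker(f"w{i}", queue, results))
        for i in range(worker_count)
    ]
    await queue.join()   # wait until every queued URL has been processed
    for w in workers:
        w.cancel()       # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

In a distributed setup each worker would run on its own node and pull from the shared broker, but the consume, process, acknowledge loop stays the same.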

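Scrapy + Aiohttp integration:

```python
import aiohttp
import scrapy

class AsyncSpider(scrapy.Spider):
    name = "async_spider"

    # Requires Scrapy's asyncio reactor in settings:
    # TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    async def parse(self, response):
        async with aiohttp.ClientSession() as session:
            api_data = await fetch_api(session)  # fetch_api: your own helper
        yield process_data(api_data)             # process_data: your own helper
```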

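Containerized deployment:

```dockerfile
FROM python:3.9
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy the application code (including main.py) into the image
COPY . .
CMD ["python", "main.py", "--concurrency=100"]
```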

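Performance monitoring metrics:

```python
import time

# Instrumentation counters updated by the scraper
metrics = {
    "request_count": 0,
    "success_count": 0,
    "failure_count": 0,
    "avg_response_time": 0.0,
    "qps": 0.0,  # not updated here; derive it from request_count over elapsed time
}

async def fetch_with_metrics(session, url):
    start = time.perf_counter()
    metrics["request_count"] += 1
    try:
        async with session.get(url) as resp:
            body = await resp.text()
        metrics["success_count"] += 1  # count success only after the body is read
        return body
    except Exception:
        metrics["failure_count"] += 1
        raise
    finally:
        duration = time.perf_counter() - start
        n = metrics["request_count"]
        # Running mean of response time (approximate while requests overlap)
        metrics["avg_response_time"] += (duration - metrics["avg_response_time"]) / n
```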

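In real projects we run into all kinds of unexpected edge cases. For example, during one crawl the target site turned out to use dynamic rendering, so requesting the HTML directly did not return the key data. In that case you need something like:

```python
from pyppeteer import launch

async def fetch_dynamic_content(url):
    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        await page.goto(url, {"waitUntil": "networkidle2"})
        return await page.content()
    finally:
        await browser.close()  # always release the browser process
```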

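This hybrid approach of combining an asynchronous HTTP client with a headless browser keeps the efficiency of async while coping with the complexity of modern web applications. Remember: a good scraper must not only run fast but also be robust and flexible.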
