I Built a Web Data Collection System in Python: From Anti-Scraping Workarounds to Structured Storage, with 5 Practical Cases - 尧图网络科技

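For developers and operations staff who need to collect data (articles, products, comments, and so on) from websites in bulk.
This article builds a complete web scraping system with Python + Playwright + BeautifulSoup, and covers anti-scraping workarounds plus 5 real-world cases.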

Background: why scrape web pages

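Content creators and operations staff need data from the web every day:

Collecting competitors' article titles and view counts
Monitoring product price changes
Gathering user comments for analysis
Pulling industry news as a reference for topic planning

Copying and pasting by hand is extremely slow. Automated with Python, the same work can collect tens of thousands of records a day.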

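Choosing the tools

Option	Best suited for	Resistance to anti-scraping	Learning cost
requests + BeautifulSoup	Static pages	Weak	⭐
Playwright	Dynamic pages (JS-rendered)	Strong	⭐⭐
Scrapy	Large-scale scraping	Medium	⭐⭐⭐
Selenium	Compatibility with legacy systems	Medium	⭐⭐

My combination: Playwright (handles dynamic pages) + BeautifulSoup (parses HTML) + SQLite (storage)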

Module 1: Basic scraper

    import random
    import sqlite3
    import time

    from bs4 import BeautifulSoup
    from playwright.sync_api import sync_playwright


    class WebScraper:
        """Web page scraper."""

        def __init__(self, db_path="scraper.db"):
            self.db = sqlite3.connect(db_path)
            self._init_db()

        def _init_db(self):
            self.db.execute("""
                CREATE TABLE IF NOT EXISTS scraped (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    url TEXT,
                    title TEXT,
                    content TEXT,
                    scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP
                )
            """)
            self.db.commit()

        def fetch(self, url, wait_for=None):
            """Fetch the page HTML (JS-rendered pages are supported)."""
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                try:
                    # Set the User-Agent on the page itself so that both the request
                    # header and navigator.userAgent look like a real browser
                    page = browser.new_page(
                        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
                    )
                    page.goto(url, wait_until="networkidle")
                    if wait_for:
                        page.wait_for_selector(wait_for)
                    return page.content()
                finally:
                    # Close the browser even if navigation fails
                    browser.close()

        def parse(self, html, selectors):
            """Parse the HTML and extract the text of each configured field."""
            soup = BeautifulSoup(html, "html.parser")
            results = []
            for item in soup.select(selectors["container"]):
                data = {}
                for key, selector in selectors["fields"].items():
                    elem = item.select_one(selector)
                    data[key] = elem.get_text(strip=True) if elem else ""
                results.append(data)
            return results

        def save(self, url, data_list):
            """Save to the database (only the title and content fields are stored)."""
            for data in data_list:
                self.db.execute(
                    "INSERT INTO scraped (url, title, content) VALUES (?, ?, ?)",
                    (url, data.get("title", ""), data.get("content", "")),
                )
            self.db.commit()
            print(f"Saved {len(data_list)} records")

        def scrape(self, url, selectors, delay=2, wait_for=None):
            """Run the full pipeline: fetch, parse, save."""
            print(f"Scraping: {url}")
            html = self.fetch(url, wait_for=wait_for)
            data = self.parse(html, selectors)
            self.save(url, data)
            # Random delay to reduce the chance of being blocked
            time.sleep(delay + random.uniform(0, 2))
            return data
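The save method writes only the title and content fields, so other fields returned by parse (views, date, summary) are not persisted. As a rough sketch of one way to keep every field, the records could be serialized as JSON and deduplicated with a UNIQUE constraint. The table name scraped_items and the helpers init_items_table and save_items are hypothetical and not part of the original system.

```python
import hashlib
import json
import sqlite3


def init_items_table(db):
    """Create a (hypothetical) table that stores each record as a JSON document."""
    db.execute("""
        CREATE TABLE IF NOT EXISTS scraped_items (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT NOT NULL,
            fingerprint TEXT NOT NULL,
            data TEXT NOT NULL,
            scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP,
            UNIQUE (url, fingerprint)
        )
    """)
    db.commit()


def save_items(db, url, data_list):
    """Insert records, skipping ones already stored; return how many were inserted."""
    before = db.total_changes
    for data in data_list:
        # sort_keys makes the JSON (and therefore the fingerprint) stable
        payload = json.dumps(data, ensure_ascii=False, sort_keys=True)
        fingerprint = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        db.execute(
            "INSERT OR IGNORE INTO scraped_items (url, fingerprint, data) VALUES (?, ?, ?)",
            (url, fingerprint, payload),
        )
    db.commit()
    return db.total_changes - before
```

With this layout, running the same scrape twice inserts nothing the second time, and any field can be read back with json.loads on the data column.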

Module 2: Anti-scraping workarounds

Approach 1: Random User-Agent

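    import random

    # Full User-Agent strings; abbreviated ones are easy for a server to spot
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
    ]


    def get_random_ua():
        return random.choice(USER_AGENTS)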

Approach 2: Rotating proxy IPs

    import random

    from playwright.sync_api import sync_playwright

    PROXY_LIST = [
        "http://proxy1:8080",
        "http://proxy2:8080",
        "http://proxy3:8080",
    ]


    def fetch_with_proxy(url, proxy_list):
        """Fetch a page through a randomly chosen proxy."""
        proxy = random.choice(proxy_list)
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True, proxy={"server": proxy})
            try:
                page = browser.new_page()
                page.goto(url, wait_until="networkidle")
                return page.content()
            finally:
                # Close the browser even if the proxy fails
                browser.close()
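Picking a proxy with random.choice keeps selecting proxies that have already stopped working. A minimal sketch of a pool that rotates in order and retires a proxy after repeated failures might look like this. The class name ProxyPool and the max_failures parameter are assumptions made for illustration.

```python
import itertools


class ProxyPool:
    """Round-robin proxy rotation that skips proxies with too many failures."""

    def __init__(self, proxies, max_failures=3):
        # Remove duplicates while keeping the original order
        self._proxies = list(dict.fromkeys(proxies))
        self.failures = {proxy: 0 for proxy in self._proxies}
        self.max_failures = max_failures
        self._cycle = itertools.cycle(self._proxies)

    def get(self):
        """Return the next healthy proxy, or raise if none are left."""
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if self.failures[proxy] < self.max_failures:
                return proxy
        raise RuntimeError("no healthy proxies left")

    def report_failure(self, proxy):
        """Record one failed request for this proxy."""
        self.failures[proxy] += 1

    def report_success(self, proxy):
        """Reset the failure count after a successful request."""
        self.failures[proxy] = 0
```

The value returned by get() can be passed as proxy={"server": ...} when launching the browser, with report_failure called from the except branch when the page fails to load.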

Approach 3: Simulating human behavior

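    import random
    import time


    def human_like_scroll(page):
        """Scroll the page the way a person would."""
        for _ in range(random.randint(3, 8)):
            page.mouse.wheel(0, random.randint(200, 600))
            time.sleep(random.uniform(0.5, 1.5))


    def human_like_click(page, selector):
        """Click like a person would (at a random offset inside the element)."""
        elem = page.locator(selector)
        box = elem.bounding_box()
        if box:
            # Stay away from the edges; shrink the margin for very small elements
            margin_x = min(5, box["width"] / 2)
            margin_y = min(5, box["height"] / 2)
            x = box["x"] + random.uniform(margin_x, box["width"] - margin_x)
            y = box["y"] + random.uniform(margin_y, box["height"] - margin_y)
            page.mouse.click(x, y)
            time.sleep(random.uniform(0.3, 0.8))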

Approach 4: Persisting cookies

    import json
    import os


    def save_cookies(context, path="cookies.json"):
        """Save the login cookies of a browser context."""
        cookies = context.cookies()
        with open(path, "w") as f:
            json.dump(cookies, f)


    def load_cookies(context, path="cookies.json"):
        """Load previously saved cookies into a browser context."""
        if os.path.exists(path):
            with open(path) as f:
                cookies = json.load(f)
            context.add_cookies(cookies)

Approach 5: Controlling request frequency

    import time


    class RateLimiter:
        """Request rate limiter."""

        def __init__(self, max_requests_per_minute=20):
            self.max_rpm = max_requests_per_minute
            self.requests = []

        def wait_if_needed(self):
            """Wait if requests are coming too fast."""
            now = time.time()
            # Drop records older than one minute
            self.requests = [t for t in self.requests if now - t < 60]

            if len(self.requests) >= self.max_rpm:
                wait_time = 60 - (now - self.requests[0])
                print(f"Rate limit reached, waiting {wait_time:.1f} s")
                time.sleep(wait_time)

            self.requests.append(time.time())
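To make the sliding-window arithmetic concrete, the wait time can be computed by a small pure function that takes the timestamps and the current time as arguments. This is only an illustrative sketch: seconds_until_allowed is a hypothetical helper and not part of the limiter class.

```python
def seconds_until_allowed(timestamps, now, max_per_window, window=60.0):
    """Return how long to wait before one more request fits in the window."""
    # Keep only the requests that are still inside the window
    recent = sorted(t for t in timestamps if now - t < window)
    if len(recent) < max_per_window:
        return 0.0
    # The next request is allowed once this entry has left the window
    oldest_blocking = recent[len(recent) - max_per_window]
    return window - (now - oldest_blocking)
```

For example, with requests at 0, 10, and 20 seconds, a limit of 3 per minute, and the current time at 30 seconds, the function returns 30.0: the request made at second 0 leaves the window at second 60.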

Module 3: 5 practical cases

Case 1: Scraping the CSDN hot list

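    def scrape_csdn_hot():
        """Scrape the CSDN artificial intelligence hot list."""
        scraper = WebScraper()
        selectors = {
            "container": ".blog-list-item-top",
            "fields": {
                "title": ".blog-list-item-top a",
                "views": ".blog-list-item-top .view-num",
                # Note: parse() returns element text only. A selector such as
                # ".blog-list-item-top a@href" is not valid CSS and raises an
                # error in select_one, so the link is not collected here.
            },
        }
        url = "https://blog.csdn.net/nav/ai"
        data = scraper.scrape(url, selectors)

        for item in data[:10]:
            # parse() stores "" for missing fields, so fall back with "or"
            print(f"{item['title']} - {item.get('views') or 'N/A'} views")
        return data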
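The field selectors handed to parse are plain CSS, so they can only yield element text. One possible way to also collect attributes such as href is a "selector@attr" convention that is split before the CSS part reaches BeautifulSoup. The helpers split_field_selector and extract_field below are hypothetical. They assume that item is a BeautifulSoup Tag, or any object with a compatible select_one method.

```python
def split_field_selector(spec):
    """Split 'css@attr' into (css, attr); attr is None when no suffix is given."""
    css, sep, attr = spec.rpartition("@")
    if not sep:
        return spec.strip(), None
    return css.strip(), attr.strip() or None


def extract_field(item, spec):
    """Return the attribute value if spec ends in '@attr', else the element text."""
    css, attr = split_field_selector(spec)
    elem = item.select_one(css)
    if elem is None:
        return ""
    if attr:
        # Tag.get returns the attribute value, or the default when it is absent
        return elem.get(attr, "")
    return elem.get_text(strip=True)
```

Inside parse, the line that fills data[key] could then call extract_field(item, selector), which would let a field such as "link": "a@href" return the URL instead of raising a selector error.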

Case 2: Scraping news

    def scrape_news(url, selectors):
        """Generic news scraper."""
        scraper = WebScraper()
        return scraper.scrape(url, selectors)


    # Example: selectors for a technical blog
    selectors = {
        "container": "article.post",
        "fields": {
            "title": "h2 a",
            "summary": ".post-excerpt",
            "date": ".post-date",
        },
    }

Case 3: Monitoring product prices

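    import re


    def monitor_price(url, price_selector, product_name):
        """Monitor price changes for a product."""
        scraper = WebScraper()
        # The base scraper only creates the "scraped" table, so create "prices" here
        scraper.db.execute("""
            CREATE TABLE IF NOT EXISTS prices (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                product TEXT,
                price REAL,
                recorded_at DATETIME
            )
        """)

        html = scraper.fetch(url)
        soup = BeautifulSoup(html, "html.parser")
        price_elem = soup.select_one(price_selector)
        if price_elem is None:
            return None

        price_text = price_elem.get_text(strip=True)
        # Extract the number; the match must start with a digit so that a
        # stray comma in the text is never captured on its own
        match = re.search(r"\d[\d,]*(?:\.\d+)?", price_text)
        if match is None:
            return None

        price_val = float(match.group().replace(",", ""))
        print(f"{product_name}: ¥{price_val}")

        # Store in the database
        scraper.db.execute(
            "INSERT INTO prices (product, price, recorded_at) VALUES (?, ?, datetime('now'))",
            (product_name, price_val),
        )
        scraper.db.commit()
        return price_val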
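Price text on real pages often carries currency symbols, thousands separators, and stray punctuation, so the number extraction is worth checking on its own. The following sketch isolates that step in a hypothetical helper named parse_price. Its pattern requires the match to begin with a digit, which prevents a lone comma from being captured and passed to float.

```python
import re

# A digit, then digits or thousands separators, then an optional decimal part
PRICE_PATTERN = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Return the first number in text as a float, or None if there is none."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(parse_price("¥1,299.00"))    # 1299.0
print(parse_price("Price, ¥199"))  # 199.0
print(parse_price("Sold out"))     # None
```

Only the first number is returned, so a page that shows an original price before the discounted one needs a more specific selector rather than a smarter pattern.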

Case 4: Scraping comments

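    def scrape_comments(url, comment_selector, max_pages=5):
        """Scrape comments across multiple pages."""
        scraper = WebScraper()
        all_comments = []

        for page_num in range(1, max_pages + 1):
            # Assumes url has no query string of its own
            page_url = f"{url}?page={page_num}"
            html = scraper.fetch(page_url)
            soup = BeautifulSoup(html, "html.parser")

            comments = soup.select(comment_selector)
            print(f"Page {page_num}: {len(comments)} comments")
            if not comments:
                # An empty page usually means the last page has been passed
                break
            for comment in comments:
                all_comments.append(comment.get_text(strip=True))

            # Random delay between pages
            time.sleep(random.uniform(2, 5))

        print(f"Collected {len(all_comments)} comments in total")
        return all_comments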
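Appending "?page=N" to the URL only works when the URL has no query string yet. As a sketch of a more general approach, the page parameter can be set through urllib.parse so that existing parameters are kept. The helper name with_page and the default parameter name "page" are assumptions, since sites name their pagination parameter differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page_num, param="page"):
    """Return url with the pagination parameter set to page_num."""
    parts = urlsplit(url)
    # Keep every existing parameter except an old value of the page parameter
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, str(page_num)))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_page("https://example.com/item/1", 2))
# https://example.com/item/1?page=2
print(with_page("https://example.com/item?id=5&sort=new", 3))
# https://example.com/item?id=5&sort=new&page=3
```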

Case 5: Scheduled scraping + change notifications

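    import time

    import schedule

    # Placeholder selectors; "content" must be present because it is the field compared below
    selectors = {
        "container": "article.post",
        "fields": {"title": "h2 a", "content": ".post-excerpt"},
    }


    def send_notification(items):
        """Placeholder: replace with email, WeCom, or another channel."""
        for item in items:
            print(f"New: {item.get('title', '')}")


    def scheduled_scrape():
        """Scrape on a schedule and check for changes."""
        scraper = WebScraper()
        url = "https://example.com/data"

        # Load the stored content BEFORE scraping: scrape() saves the fresh rows,
        # so querying afterwards would make every item match itself
        last = scraper.db.execute(
            "SELECT content FROM scraped WHERE url = ?", (url,)
        ).fetchall()
        last_contents = {row[0] for row in last}

        # Scrape the current data
        current = scraper.scrape(url, selectors)

        # Check for new content
        new_items = [item for item in current if item.get("content") not in last_contents]
        if new_items:
            print(f"Found {len(new_items)} new items")
            # Send a notification (email, WeCom, etc.)
            send_notification(new_items)


    # Scrape once an hour
    schedule.every(1).hours.do(scheduled_scrape)

    # schedule only runs jobs when run_pending() is called, so keep a loop alive
    while True:
        schedule.run_pending()
        time.sleep(60)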
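Comparing on the raw content string means that two items with the same text but different titles count as the same item. A minimal sketch of an alternative is to fingerprint several fields together and keep a set of fingerprints already seen. The helper names fingerprint and find_new_items, and the choice of title and content as the compared fields, are assumptions made for illustration.

```python
import hashlib


def fingerprint(item, keys=("title", "content")):
    """Hash the chosen fields of an item into a stable identifier."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc")
    raw = "\x1f".join(str(item.get(key, "")) for key in keys)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def find_new_items(current, seen):
    """Return items not seen before and record their fingerprints in seen."""
    new_items = []
    for item in current:
        fp = fingerprint(item)
        if fp not in seen:
            seen.add(fp)
            new_items.append(item)
    return new_items
```

The seen set lives in memory here. To survive restarts it would need to be stored, for example in an extra column of the SQLite table.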