Getting Started with Python Web Scraping: Fetch Data from Any Web Page in 10 Lines of Code - 冉冉博客

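Web scraping is one of Python's most popular use cases. Once you know how to write a scraper, you can collect data automatically, monitor prices, and download resources in bulk. This article gets you started in the simplest way possible.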

01 Environment Setup

Two libraries are all you need:

pip install requests beautifulsoup4

requests sends the HTTP requests, and BeautifulSoup parses the HTML that comes back.

02 The Simplest Scraper: 10 Lines of Code

import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com'
response = requests.get(url, timeout=10)  # give up after 10 s instead of hanging
soup = BeautifulSoup(response.text, 'html.parser')

titles = soup.select('.titleline > a')  # direct child only, skips the nested site-name link
for title in titles[:10]:
    print(title.text)

These 10 lines (blank lines included) are already enough to fetch the top story titles from the Hacker News front page.
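
To collect links along with titles, the same selector approach can feed a list of dicts. The sketch below runs offline against a hypothetical HTML snippet that is only assumed to resemble the Hacker News markup; `SAMPLE_HTML` and `extract_items` are illustrative names, not part of any library. It uses the child combinator `>` so that only the headline link is matched, not the site-name link nested deeper inside the same element.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, assumed to resemble the markup the selector targets
SAMPLE_HTML = """
<table>
  <tr><td><span class="titleline">
    <a href="https://example.com/a">First story</a>
    <span class="sitebit"> (<a href="from?site=example.com">example.com</a>)</span>
  </span></td></tr>
  <tr><td><span class="titleline">
    <a href="https://example.com/b">Second story</a>
  </span></td></tr>
</table>
"""


def extract_items(html):
    """Return one {'title', 'url'} dict per headline link in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    items = []
    # '.titleline > a' matches only <a> tags that are direct children
    for link in soup.select('.titleline > a'):
        items.append({
            'title': link.get_text(strip=True),
            'url': link.get('href'),
        })
    return items


data = extract_items(SAMPLE_HTML)
for item in data:
    print(item['title'], item['url'])
```

With a live page, `response.text` would take the place of `SAMPLE_HTML`.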

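03 Handling Anti-Scraping Checks: Add Request Headers

Many sites inspect the User-Agent header, and requests sent with the library's default python-requests identifier are often rejected. Sending a browser-like User-Agent usually gets past this basic check: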

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers)
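
When several requests need the same headers, one option is a `requests.Session`, which attaches them to every request it sends. The sketch below is a minimal illustration, not the only way to do it: `HEADERS` and `fetch` are hypothetical names, and the 10-second timeout is an arbitrary choice. It prepares a request without sending it, so the outgoing User-Agent can be checked offline.

```python
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

# A Session keeps its headers and reuses them for every request
session = requests.Session()
session.headers.update(HEADERS)

# prepare_request builds the outgoing request without sending anything
prepared = session.prepare_request(requests.Request('GET', 'https://example.com/'))
print(prepared.headers['User-Agent'])


def fetch(url, timeout=10):
    """Download a page with the session headers and return its HTML text."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text
```

Calling `fetch(url)` then replaces the bare `requests.get(url, headers=headers)` call, and a rejected request surfaces as an exception rather than as an empty result.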

04 Scraping Multiple Pages

Most sites paginate their content, so use a loop to fetch the pages in bulk:

for page in range(1, 6):  # pages 1 through 5
    url = f'https://example.com/list?page={page}'
    response = requests.get(url, headers=headers)
    # Parse the current page's data here
    print(f'Finished scraping page {page}')
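
A loop like this can send requests in quick succession, so it is common to pause between pages and to skip a page that fails rather than abort the whole run. The sketch below shows one way to structure that, under stated assumptions: `page_urls`, `crawl_pages`, and `fake_fetch` are hypothetical helpers, the `?page=` URL pattern is taken from the example above, and the one-second default delay is arbitrary. The download function is passed in as a parameter so the sketch runs without network access.

```python
import time


def page_urls(base_url, first, last):
    """Build the list-page URLs for pages first..last inclusive."""
    return [f'{base_url}?page={page}' for page in range(first, last + 1)]


def crawl_pages(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests.

    A page that raises is reported and skipped so one error does not stop the run.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # pause between requests, not before the first
        try:
            results[url] = fetch(url)
        except Exception as exc:
            print(f'Skipping {url}: {exc}')
    return results


def fake_fetch(url):
    # Stand-in for a real download so the sketch runs offline
    return f'<html>{url}</html>'


pages = crawl_pages(page_urls('https://example.com/list', 1, 3), fake_fetch, delay=0)
print(len(pages))
```

In real use, `fake_fetch` would be swapped for a function that calls `requests.get` and returns `response.text`.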

05 Saving Data to CSV

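import csv

# `data` is assumed to be a list of dicts with 'title' and 'url' keys built while parsing
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Link'])  # header row
    for item in data:
        writer.writerow([item['title'], item['url']])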
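
Since each scraped item is already a dict, `csv.DictWriter` is an alternative that maps keys to columns directly. The sketch below is a minimal illustration with made-up sample rows; `write_items` is a hypothetical helper. It writes to an in-memory buffer and reads the result back, which also shows that a title containing a comma survives the round trip.

```python
import csv
import io


def write_items(items, fileobj):
    """Write dicts with 'title' and 'url' keys as CSV rows, header first."""
    writer = csv.DictWriter(fileobj, fieldnames=['title', 'url'], extrasaction='ignore')
    writer.writeheader()
    writer.writerows(items)


# Made-up sample rows standing in for scraped data
items = [
    {'title': 'First story, with a comma', 'url': 'https://example.com/a'},
    {'title': 'Second story', 'url': 'https://example.com/b'},
]

buffer = io.StringIO(newline='')  # in-memory file, so nothing is written to disk
write_items(items, buffer)

buffer.seek(0)
rows = list(csv.DictReader(buffer))  # read the CSV back as dicts
print(rows[0]['title'])
```

To write a real file, pass the handle from `open('data.csv', 'w', newline='', encoding='utf-8')` to `write_items` in place of the buffer.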