Scraping Baidu: How to Extract Organic Search Results
Baidu is China’s answer to Google, a search engine with immense reach and valuable data. But scraping its search results? That’s not a walk in the park. Between CAPTCHAs, IP bans, and dynamic content, Baidu makes it hard for bots to grab its data. If you’re looking to scrape Baidu’s organic search results using Python, we’re here to guide you through the process.
What You’ll Find on Baidu’s SERP
Baidu’s Search Engine Results Page (SERP) packs several distinct sections into one page. Let’s break down the main ones (a quick parsing sketch follows the list):
- Organic Results: The backbone of Baidu’s search—these are the links that best match the search intent.
- Paid Results: These results carry a "广告" (ad) label. Companies pay for these positions to appear at the top.
- Related Searches: Need more? Baidu suggests related queries, often at the bottom of the page.
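As a rough illustration, here is how those sections might be told apart with BeautifulSoup. All three selectors are assumptions based on commonly observed Baidu markup; verify them against live HTML, since Baidu changes its structure frequently:

from bs4 import BeautifulSoup

def classify_serp(html: str) -> dict:
    """Rough split of a Baidu SERP into sections (all selectors are assumptions)."""
    soup = BeautifulSoup(html, "html.parser")
    organic = soup.select("div.c-container[id]")  # same selector the parser below relies on
    # Paid results carry a visible "广告" (ad) label somewhere inside the block
    paid = [div for div in soup.select("div") if div.find(string="广告")]
    # Related searches often sit in a container near the bottom of the page
    related = [a.get_text(strip=True) for a in soup.select("div#rs a")]
    return {"organic": organic, "paid": paid, "related": related}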
The Hurdles of Scraping Baidu
Baidu isn’t keen on giving up its data. Its anti-scraping defenses include CAPTCHAs, an HTML structure that changes frequently, and IP bans. Without the right tools, scraping Baidu can quickly turn into a hassle.
The key? Stay nimble. You’ll need a scraper that adapts to these changes. Enter Swiftproxy API, which provides the resources you need to gather public information without the headache.
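As a taste of what “adapting” looks like in plain Python, here is a minimal retry-with-backoff sketch. The header and delay values are illustrative assumptions, not Baidu-specific settings:

import time
import requests

def fetch_with_retries(url: str, max_attempts: int = 3) -> requests.Response:
    """Retry a GET with exponential backoff, giving up after max_attempts."""
    headers = {"User-Agent": "Mozilla/5.0"}  # illustrative; rotate real UA strings in practice
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts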
Step-by-Step Guide to Scrape Baidu Search Results
We’ll use the Swiftproxy API and Python to gather the data. If you’re ready to get started, here’s the breakdown.
1. Set Up Your Environment
Install the necessary third-party libraries (pprint ships with Python’s standard library, so it needs no install):
pip install requests beautifulsoup4
2. Import Libraries
In your Python file, import the libraries you'll need:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
3. Configure the API URL
Here’s the API endpoint we’ll hit to scrape the data:
url = 'https://realtime.swiftproxy.net/v1/queries'
4. Set Your Authentication Credentials
You’ll need your Swiftproxy username and password. Once you’ve got those, plug them into your script like this:
auth = ('your_api_username', 'your_api_password')
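Hardcoding credentials is fine for a quick test, but reading them from environment variables is safer. A minimal sketch (the variable names SWIFTPROXY_USERNAME and SWIFTPROXY_PASSWORD are arbitrary, not something the API mandates):

import os

# Arbitrary variable names; export them in your shell before running the script
auth = (os.environ["SWIFTPROXY_USERNAME"], os.environ["SWIFTPROXY_PASSWORD"])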
5. Set Up the Payload
The payload contains all the parameters for the Baidu query:
payload = {
    'source': 'universal',
    'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
    'geo_location': 'United States',
}
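In the Baidu URL above, wd is the search keyword, rn caps the number of results (50 here), and ie=utf-8 sets the input encoding. For arbitrary keywords, especially Chinese ones, it’s safer to build the URL programmatically than to hand-write it. A small sketch, with build_baidu_url as a hypothetical helper:

from urllib.parse import urlencode

def build_baidu_url(keyword: str, results: int = 50) -> str:
    """Build a Baidu search URL with a safely encoded keyword."""
    params = {"ie": "utf-8", "wd": keyword, "rn": results}
    return f"https://www.baidu.com/s?{urlencode(params)}"

payload["url"] = build_baidu_url("nike")  # same URL as above, built programmatically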
6. Send the Request
Time to make the API request:
response = requests.post(url, json=payload, auth=auth, timeout=180)
response.raise_for_status() # Check for errors
7. Load and Validate Data
Check if the request returned results:
json_data = response.json()
if not json_data["results"]:
    print("No results found.")
    return  # assumes this check runs inside a function, as in main() below
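If the check passes, the rendered HTML lives in the first result’s content field, the same path the full script below uses:

html_content = json_data["results"][0]["content"]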
Parsing Baidu Search Results
Once you’ve got the raw HTML, it’s time to parse it. We’ll use BeautifulSoup to extract useful data like titles and URLs.
Parsing Function:
def parse_baidu_html_results(html_content: str) -> list[dict]:
    parsed_results = []
    soup = BeautifulSoup(html_content, "html.parser")
    result_blocks = soup.select("div.c-container[id]")
    for block in result_blocks:
        title_tag = block.select_one("h3.t a") or block.select_one("h3.c-title-en a")
        if not title_tag:
            continue
        title_text = title_tag.get_text(strip=True)
        href = title_tag.get("href")
        if title_text and href:
            parsed_results.append({"title": title_text, "url": href})
    return parsed_results
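A quick way to sanity-check the function is to run it on the HTML from the previous step and pretty-print the first few entries (this is where the pprint import comes in):

parsed_data = parse_baidu_html_results(html_content)
pprint(parsed_data[:3])  # e.g. [{'title': '...', 'url': '...'}, ...]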
Storing Data in CSV Format
You can store the parsed data in a CSV file using pandas. Install pandas first:
pip install pandas
Store to CSV:
import pandas as pd

def store_to_csv(data: list[dict]):
    df = pd.DataFrame(data)
    df.to_csv("baidu_results.csv", index=False)
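One caveat: Baidu titles are usually Chinese, and Excel often misreads UTF-8 CSV files that lack a byte-order mark. If you plan to open the file in Excel, writing with the utf-8-sig encoding helps:

def store_to_csv(data: list[dict]):
    df = pd.DataFrame(data)
    # utf-8-sig adds a BOM so Excel detects UTF-8 and renders Chinese titles correctly
    df.to_csv("baidu_results.csv", index=False, encoding="utf-8-sig")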
Full Example Code
Let’s combine everything into one cohesive script.
import requests
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint


def store_to_csv(data: list[dict]):
    """Store parsed data to CSV"""
    df = pd.DataFrame(data)
    df.to_csv("baidu_results.csv", index=False)


def parse_baidu_html_results(html_content: str) -> list[dict]:
    """Parse Baidu HTML content into a list of dictionaries"""
    parsed_results = []
    soup = BeautifulSoup(html_content, "html.parser")
    result_blocks = soup.select("div.c-container[id]")
    for block in result_blocks:
        title_tag = block.select_one("h3.t a") or block.select_one("h3.c-title-en a")
        if not title_tag:
            continue
        title_text = title_tag.get_text(strip=True)
        href = title_tag.get("href")
        if title_text and href:
            parsed_results.append({"title": title_text, "url": href})
    return parsed_results


def main():
    url = "https://realtime.swiftproxy.net/v1/queries"
    payload = {
        "source": "universal",
        "url": "https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50",
        "geo_location": "United States",
    }
    auth = ("your_api_username", "your_api_password")
    response = requests.post(url, json=payload, auth=auth, timeout=180)
    response.raise_for_status()
    json_data = response.json()
    if not json_data["results"]:
        print("No results found for the given query.")
        return
    html_content = json_data["results"][0]["content"]
    parsed_data = parse_baidu_html_results(html_content)
    pprint(parsed_data[:3])  # quick sanity check on the first few results
    store_to_csv(parsed_data)


if __name__ == "__main__":
    main()
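Fill in your real credentials and run the script; a baidu_results.csv file with the titles and URLs of the organic results should appear next to it.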
Scraping with Residential Proxies
Want to avoid being blocked by Baidu? As an alternative to the scraper API, you can route plain requests through Residential Proxies, which use real residential IPs and make your traffic harder to detect. Note that with this approach you request Baidu directly instead of calling the API endpoint.
Adding Residential Proxies:
proxy_entry = "http://customer-<your_username>:<your_password>@pr.swiftproxy.net:10000"
proxies = {"http": proxy_entry, "https": proxy_entry}

# Request Baidu directly; the proxy network supplies the exit IP
url = "https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50"
response = requests.get(url, proxies=proxies, timeout=180)
Conclusion
Baidu’s anti-scraping measures can make it a tough nut to crack. But with the right tools, like the Swiftproxy API and Residential Proxies, you’ll be scraping like a pro in no time.
Whether you use the API for ease or Residential Proxies for more control, remember that scraping Baidu is not only about collecting data but about doing it the right way.