Scraping Baidu: How to Extract Organic Search Results


Baidu is China's answer to Google, a search engine with immense reach and valuable data. But scraping its search results is no walk in the park: between CAPTCHAs, rotating IPs, and dynamic content, Baidu makes it hard for bots to grab its information. If you're looking to scrape Baidu's organic search results with Python, this guide walks you through the process.

What to Find on Baidu’s SERP

Baidu's Search Engine Results Page (SERP) packs in several distinct sections. Let's break them down:

  • Organic Results: The backbone of Baidu’s search—these are the links that best match the search intent.
  • Paid Results: These results come with a "广告" ("advertisement") tag. Companies pay for these positions to appear at the top (a quick way to filter them out is sketched after this list).
  • Related Searches: Need more? Baidu suggests related queries, often at the bottom of the page.
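
Once you have a SERP's raw HTML, the "广告" label is a practical way to tell paid blocks from organic ones, as noted above. Here is a minimal sketch with BeautifulSoup; treat the div.c-container selector as an assumption, since Baidu's markup changes over time:

from bs4 import BeautifulSoup

def split_paid_and_organic(html: str) -> tuple[list, list]:
    # Assumption: result blocks are div.c-container elements and paid
    # ones contain a visible "广告" (advertisement) label
    soup = BeautifulSoup(html, "html.parser")
    paid, organic = [], []
    for block in soup.select("div.c-container"):
        (paid if "广告" in block.get_text() else organic).append(block)
    return paid, organic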

The Hurdles of Scraping Baidu

Baidu isn't keen on giving up its data. Its anti-scraping defenses include CAPTCHAs, markup that changes frequently, and IP bans, which makes scraping a hassle without the right tools.
The key? Stay nimble. You'll need a scraper that adapts to these changes. Enter Swiftproxy API, which provides the resources you need to gather public information without the headache.
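
Whichever tooling you choose, build in resilience. A simple retry with exponential backoff is a common first line of defense against transient blocks; here is a minimal sketch (the delays and the bare status-code check are generic choices, not Baidu specifics):

import time

import requests

def fetch_with_retries(url: str, max_retries: int = 3) -> requests.Response:
    # Retry transient failures with exponential backoff: 1s, 2s, 4s, ...
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 200:
                return response
        except requests.RequestException:
            pass  # network hiccup or block; fall through to the backoff
        time.sleep(2 ** attempt)
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")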

Step-by-Step Guide to Scrape Baidu Search Results

We’ll use Swiftproxy API and Python to gather the data. If you're ready to get started, here’s the breakdown.

1. Set Up Your Environment

Install the necessary libraries (pprint ships with Python's standard library, so only two packages are needed):

pip install requests beautifulsoup4
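
Optionally, create a virtual environment first so these dependencies stay isolated from your system Python:

python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate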

2. Import Libraries

In your Python file, import the libraries you'll need:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

3. Configure the API URL

Here’s the API endpoint we’ll hit to scrape the data:

url = 'https://realtime.swiftproxy.net/v1/queries'

4. Add Your Authentication Credentials

You’ll need your Swiftproxy username and password. Once you’ve got those, plug them into your script like this:

auth = ('your_api_username', 'your_api_password')
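
Hardcoding credentials is fine for a quick test, but for anything you commit or share, reading them from environment variables is safer. A small sketch; the variable names here are arbitrary:

import os

# Assumes you exported SWIFTPROXY_USERNAME and SWIFTPROXY_PASSWORD beforehand
auth = (os.environ["SWIFTPROXY_USERNAME"], os.environ["SWIFTPROXY_PASSWORD"])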

5. Set Up the Payload

The payload contains all the parameters for the Baidu query:

payload = {
    'source': 'universal',
    'url': 'https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50',
    'geo_location': 'United States',
}
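
In the url parameter, ie=utf-8 sets the query encoding, wd is the search term, and rn asks Baidu for up to 50 results per page. If you want to scrape arbitrary keywords rather than hardcoding the URL, urllib.parse can handle the escaping; a small sketch (build_baidu_url is a helper of our own, not part of any API):

from urllib.parse import urlencode

def build_baidu_url(keyword: str, results_per_page: int = 50) -> str:
    # wd is the query, ie the input encoding, rn the results-per-page count
    params = {'ie': 'utf-8', 'wd': keyword, 'rn': results_per_page}
    return f"https://www.baidu.com/s?{urlencode(params)}"

payload['url'] = build_baidu_url('nike')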

6. Send the Request

Time to make the API request:

response = requests.post(url, json=payload, auth=auth, timeout=180)
response.raise_for_status()  # Check for errors
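
raise_for_status() turns HTTP errors into exceptions, which is fine for a quick script; in a longer-running job you may prefer to catch them and decide what to do. A minimal sketch:

try:
    response = requests.post(url, json=payload, auth=auth, timeout=180)
    response.raise_for_status()
except requests.Timeout:
    print("Request timed out after 180 seconds")
    raise
except requests.HTTPError as err:
    print(f"API returned an error: {err}")
    raise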

7. Load and Validate Data

Check whether the request returned any results. (The early return below assumes this code lives inside a function, as it does in the full script later.)

json_data = response.json()

if not json_data["results"]:
    print("No results found.")
    return  # early exit; assumes this snippet runs inside a function such as main()
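
If you are unsure what the response body looks like, peek at its keys before digging in. The field names below match the structure used later in this guide:

pprint(list(json_data.keys()))                # top-level keys, e.g. ['results', ...]
pprint(list(json_data["results"][0].keys()))  # expect a 'content' field holding raw HTML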

Parsing Baidu Search Results

Once you’ve got the raw HTML, it’s time to parse it. We’ll use BeautifulSoup to extract useful data like titles and URLs.

Parsing Function:

def parse_baidu_html_results(html_content: str) -> list[dict]:
    parsed_results = []
    soup = BeautifulSoup(html_content, "html.parser")
    # Each organic result sits in a div.c-container block with an id attribute
    result_blocks = soup.select("div.c-container[id]")

    for block in result_blocks:
        # The title link appears under h3.t or, for some results, h3.c-title-en
        title_tag = block.select_one("h3.t a") or block.select_one("h3.c-title-en a")
        if not title_tag:
            continue
        
        title_text = title_tag.get_text(strip=True)
        href = title_tag.get("href")

        if title_text and href:
            parsed_results.append({"title": title_text, "url": href})

    return parsed_results
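
To try the function on the response fetched earlier, pull the HTML out of the JSON and preview a few parsed rows; this is where the pprint import earns its keep:

html_content = json_data["results"][0]["content"]
parsed = parse_baidu_html_results(html_content)
pprint(parsed[:3])  # preview the first three title/url pairs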

Storing Data in CSV Format

You can store the parsed data in a CSV file using pandas. Install pandas first:

pip install pandas

Store to CSV:

import pandas as pd

def store_to_csv(data: list[dict]):
    df = pd.DataFrame(data)
    df.to_csv("baidu_results.csv", index=False)
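
If you plan to open the file in Excel, note that Chinese titles display more reliably when the CSV starts with a BOM; utf-8-sig is a common workaround. An optional variant, not required for the rest of the guide:

def store_to_csv_excel_friendly(data: list[dict]):
    # utf-8-sig prepends a BOM so Excel detects the encoding and
    # renders Chinese titles correctly
    pd.DataFrame(data).to_csv("baidu_results.csv", index=False, encoding="utf-8-sig")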

Full Example Code

Let’s combine everything into one cohesive script.

import requests
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint

def store_to_csv(data: list[dict]):
    """Store parsed data to CSV"""
    df = pd.DataFrame(data)
    df.to_csv("baidu_results.csv", index=False)

def parse_baidu_html_results(html_content: str) -> list[dict]:
    """Parse Baidu HTML content into a list of dictionaries"""
    parsed_results = []
    soup = BeautifulSoup(html_content, "html.parser")
    result_blocks = soup.select("div.c-container[id]")

    for block in result_blocks:
        title_tag = block.select_one("h3.t a") or block.select_one("h3.c-title-en a")
        if not title_tag:
            continue
        
        title_text = title_tag.get_text(strip=True)
        href = title_tag.get("href")

        if title_text and href:
            parsed_results.append({"title": title_text, "url": href})

    return parsed_results

def main():
    url = "https://realtime.swiftproxy.net/v1/queries"
    payload = {
        "source": "universal",
        "url": "https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50",
        "geo_location": "United States",
    }
    
    auth = ("your_api_username", "your_api_password")
    
    response = requests.post(url, json=payload, auth=auth, timeout=180)
    response.raise_for_status()
    
    json_data = response.json()
    
    if not json_data["results"]:
        print("No results found for the given query.")
        return
    
    html_content = json_data["results"][0]["content"]
    parsed_data = parse_baidu_html_results(html_content)
    pprint(parsed_data[:3])  # quick preview before writing to disk
    store_to_csv(parsed_data)

if __name__ == "__main__":
    main()
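
Save the script as, say, baidu_scraper.py (the name is arbitrary) and run it:

python baidu_scraper.py

If everything works, you'll find baidu_results.csv next to the script.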

Scraping with Residential Proxies

Want to avoid being blocked by Baidu? You can route requests through Residential Proxies, which use real residential IPs and make your traffic harder to distinguish from ordinary users. With this approach you request Baidu directly instead of going through the scraping API.

Adding Residential Proxies:

proxy_entry = "http://customer-<your_username>:<your_password>@pr.swiftproxy.net:10000"
proxies = {"http": proxy_entry, "https": proxy_entry}

# Fetch the Baidu SERP directly; the proxy handles the IP rotation
baidu_url = "https://www.baidu.com/s?ie=utf-8&wd=nike&rn=50"
response = requests.get(baidu_url, proxies=proxies, timeout=180)
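
To make direct requests look less bot-like, you can also send a browser-style User-Agent header (the value below is just an example) and reuse the parser from earlier:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

response = requests.get(baidu_url, proxies=proxies, headers=headers, timeout=180)
response.raise_for_status()
pprint(parse_baidu_html_results(response.text)[:3])  # reuse the parser from above

Keep in mind that even through a residential proxy, Baidu may still serve a CAPTCHA, so check the response body before trusting the parse.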

Conclusion

Baidu's anti-scraping measures make it a tough nut to crack, but with the right tools, such as the Swiftproxy API and Residential Proxies, you'll be scraping like a pro in no time.
Whether you pick the API for convenience or Residential Proxies for more control, remember that scraping Baidu is not just about collecting data, but about collecting it responsibly.