Scrape Instagram Data: Python Practices and Tools


Imagine that you need data from Instagram to understand trends, analyze engagement, or gather insights for marketing strategies—but scraping Instagram isn't exactly straightforward. With their anti-bot measures and complex login requirements, getting that data can feel like navigating a maze. But don't worry; there’s a solution that can save you time and effort. Let’s dive into how you can efficiently scrape Instagram data using Python.

Set Up Your Tools

Before diving into the code, make sure you have the necessary Python libraries installed:

pip install requests python-box
  • Requests: The workhorse for making HTTP requests.
  • Python-box: Makes dealing with complex JSON data easier by converting it into Python objects that you can access using dot notation.

Now, let's break this down into digestible chunks: sending API requests, parsing the data, using proxies, and simplifying the JSON handling with Box. This is where the magic happens.

Step 1: Build the API Request

Instagram hides much of its data behind complex front-end security, but the backend? That's a different story. Instagram’s backend API allows us to access detailed profile information without needing to authenticate. Here's how to get that information.
Explanation:

  • Headers: Instagram can detect bot-like activity by analyzing request headers. Sending the x-ig-app-id header along with a realistic browser User-Agent makes the request look like it’s coming from Instagram’s own web app rather than a script.
  • API Endpoint: The URL https://i.instagram.com/api/v1/users/web_profile_info/?username={username} is your goldmine. It returns everything you need about a public profile, from follower counts to bio details.

Here’s how you can set it up in Python:

import requests

# Headers to mimic a real browser request
headers = {
    "x-ig-app-id": "936619743392459", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}

# Replace this with the username you want to scrape
username = 'testtest'

# Send the request to Instagram's backend
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers)
response_json = response.json()  # Parse the response into a JSON object
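
One practical tip: if Instagram blocks the request or the profile doesn’t exist, response.json() can fail or return an error payload. A small sanity check before relying on the parsed data is cheap insurance; here’s a minimal sketch:

# Guard against blocked or failed requests before relying on the JSON
if response.status_code != 200:
    raise RuntimeError(f'Request failed with status {response.status_code}')
if 'application/json' not in response.headers.get('Content-Type', ''):
    raise RuntimeError('Unexpected response type; the request may have been blocked')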

Step 2: Bypass Rate-Limiting with Proxies

Instagram isn’t a fan of repeated requests from the same IP address. So, if you’re scraping at any real scale, proxies are your best friend. Routing your requests through a pool of different IPs reduces the chances of detection and rate-limiting.
Setting Up Proxies:

proxies = {
    # Most providers expose the proxy over plain HTTP, even for HTTPS traffic,
    # so both entries usually take an http:// proxy URL
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
}

response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers, proxies=proxies)
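
If you have a whole pool of proxies, rotating through them per request is straightforward. Here’s a minimal sketch, assuming a hypothetical PROXY_POOL list filled with proxy URLs from your provider:

import random

# Hypothetical pool of proxy URLs; replace with your provider's endpoints
PROXY_POOL = [
    'http://<proxy_username>:<proxy_password>@<proxy_ip_1>:<proxy_port>',
    'http://<proxy_username>:<proxy_password>@<proxy_ip_2>:<proxy_port>',
]

def get_with_rotating_proxy(url, headers):
    # Pick a random proxy for each request to spread requests across IPs
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, headers=headers, proxies={'http': proxy, 'https': proxy}, timeout=30)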

Step 3: Parsing JSON with Ease Using Box

Instagram’s API returns complex, nested JSON data. Navigating this with traditional dictionary access can be a pain. Enter Box: This library turns JSON into an object that you can access with simple dot notation, making data extraction a breeze.
Using Box for Simplicity:

from box import Box

response_json = Box(response.json())  # Convert the response into a Box object

# Extract profile data
user_data = {
    'full_name': response_json.data.user.full_name,
    'followers': response_json.data.user.edge_followed_by.count,
    'biography': response_json.data.user.biography,
    'profile_pic_url': response_json.data.user.profile_pic_url_hd,
}
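
One small option worth knowing: not every profile fills in every field (empty bios, missing HD profile pictures). Creating the Box with default_box=True makes missing keys come back as an empty Box instead of raising an error:

# default_box=True: missing keys return an empty Box instead of raising
response_json = Box(response.json(), default_box=True)
biography = response_json.data.user.biography or ''  # falls back cleanly if the field is absent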

Step 4: Scrape Videos and Timeline Data

Once you have profile data, it's time to scrape Instagram posts and videos. The API response includes view counts, likes, comments, and even video durations; the snippets below pull out a few of the key fields.
Here’s how to extract the timeline data:

# Extract video data
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    video_data = {
        'id': element.node.id,
        'video_url': element.node.video_url,
        'view_count': element.node.video_view_count,
    }
    profile_video_data.append(video_data)

# Extract timeline media (photos and videos)
profile_timeline_media_data = []
for element in response_json.data.user.edge_owner_to_timeline_media.edges:
    media_data = {
        'id': element.node.id,
        'media_url': element.node.display_url,
        'like_count': element.node.edge_liked_by.count,
    }
    profile_timeline_media_data.append(media_data)
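
If you also want the media files themselves rather than just their URLs, plain requests covers that too. A short sketch that saves each video locally, assuming the video_url values collected above are reachable with the same headers:

# Download each video; filenames are built from the post id
for video in profile_video_data:
    media_response = requests.get(video['video_url'], headers=headers, timeout=60)
    if media_response.ok:
        with open(f"{video['id']}.mp4", 'wb') as f:
            f.write(media_response.content)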

Step 5: Writing Data to JSON Files

Once you've gathered the data, it’s time to store it. Python’s json module lets you easily write the data to files, ready for further analysis.

import json

# Save user data to JSON
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)

# Save video data to JSON
with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)

# Save timeline media data to JSON
with open(f'{username}_timeline_media_data.json', 'w') as file:
    json.dump(profile_timeline_media_data, file, indent=4)
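
Bios and captions frequently contain emoji or non-Latin text. If you’d rather keep those characters readable in the output files instead of \u escape sequences, json.dump accepts ensure_ascii=False; just open the file with UTF-8 encoding:

# Keep emoji and non-Latin characters readable in the saved file
with open(f'{username}_profile_data.json', 'w', encoding='utf-8') as file:
    json.dump(user_data, file, indent=4, ensure_ascii=False)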

Full Code Example

Now that you have all the building blocks, here’s the complete script that scrapes Instagram user profile data, video data, and timeline media, using proxies and handling the complexities of the data format:

import requests
from box import Box
import json

# Define headers, proxies, and the target username as before...

# Send the request and parse the response
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers, proxies=proxies)
response_json = Box(response.json())

# Extract user data, videos, and timeline media as shown earlier...

# Save the extracted data to JSON files
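
Filled in with the snippets from the earlier steps, that skeleton looks like this (the proxy credentials are placeholders for your own provider):

import requests
from box import Box
import json

# Headers that mimic Instagram's own web app
headers = {
    "x-ig-app-id": "936619743392459",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}

# Proxy credentials are placeholders; fill in your own
proxies = {
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
}

username = 'testtest'  # Replace with the username you want to scrape

# Request the profile from Instagram's backend API
response = requests.get(
    f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}',
    headers=headers,
    proxies=proxies,
)
response.raise_for_status()
response_json = Box(response.json())

# Profile data
user_data = {
    'full_name': response_json.data.user.full_name,
    'followers': response_json.data.user.edge_followed_by.count,
    'biography': response_json.data.user.biography,
    'profile_pic_url': response_json.data.user.profile_pic_url_hd,
}

# Video data
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    profile_video_data.append({
        'id': element.node.id,
        'video_url': element.node.video_url,
        'view_count': element.node.video_view_count,
    })

# Timeline media (photos and videos)
profile_timeline_media_data = []
for element in response_json.data.user.edge_owner_to_timeline_media.edges:
    profile_timeline_media_data.append({
        'id': element.node.id,
        'media_url': element.node.display_url,
        'like_count': element.node.edge_liked_by.count,
    })

# Save everything to JSON files
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)

with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)

with open(f'{username}_timeline_media_data.json', 'w') as file:
    json.dump(profile_timeline_media_data, file, indent=4)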

Final Thoughts

Scraping Instagram data with Python is a powerful skill that can help you gather valuable insights—whether you’re tracking user engagement, understanding influencer activity, or analyzing trends. Remember, though, to comply with Instagram’s terms of service. Always scrape responsibly and ensure that your efforts align with their policies.