How to Crawl Sitemaps with Python
Some websites have hundreds of thousands of pages. Scraping them one by one? Nightmare. But what if you could access every URL in minutes instead of hours? That’s exactly what sitemaps allow. Think of them as a website’s blueprint—every page a site wants search engines to index, neatly listed and structured.
But here’s the catch. Many websites use nested index sitemaps where one sitemap links to others, which in turn link to even more. Trying to parse all of these manually is exhausting and prone to errors.
Enter ultimate-sitemap-parser (usp). This Python library handles the mess for you. Here’s how to use it to crawl sitemaps and extract URLs quickly and reliably.
Things You’ll Need
Before we jump in, ensure your environment is ready.
1. Install Python
Python is required to run the scripts. If it’s not installed yet:
Download the latest version from python.org.
Verify the installation:
python3 --version
2. Install ultimate-sitemap-parser
Grab the library via pip:
pip install ultimate-sitemap-parser
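Not sure whether pip installed it into the environment you’re actually using? pip can tell you:
pip show ultimate-sitemap-parser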
How to Scrape Sitemaps with ultimate-sitemap-parser
With usp installed, extracting URLs is surprisingly simple. Let’s break it down.
1. Fetch Sitemaps and Extract URLs
Forget XML headaches. usp does the heavy lifting in one step:
from usp.tree import sitemap_tree_for_homepage

url = "https://www.asos.com/"

# Discover and parse every sitemap the site exposes (robots.txt and common locations)
tree = sitemap_tree_for_homepage(url)

# Iterate over every page found across those sitemaps
for page in tree.all_pages():
    print(page.url)
Every URL ASOS lists in its sitemaps is now at your fingertips. Simple. Fast. Reliable.
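The page objects carry more than just the URL. In current usp releases, each entry from all_pages() also exposes sitemap metadata such as priority and last_modified where the site provides them; treat those attribute names as assumptions and check your installed version. A minimal sketch that reads them defensively:

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage("https://www.asos.com/")

for page in tree.all_pages():
    # priority and last_modified come from the sitemap entries themselves;
    # getattr keeps this working even if a field is missing in your usp version
    priority = getattr(page, "priority", None)
    last_modified = getattr(page, "last_modified", None)
    print(page.url, priority, last_modified)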
2. Deal with Nested Sitemaps Automatically
Big websites often split their sitemaps into sections: products, categories, blogs. Normally, you’d have to fetch each manually. With usp? It’s automatic.
It will:
Detect index sitemaps.
Fetch child sitemaps without extra code.
Recursively extract all URLs.
Result: a complete dataset, effortlessly.
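Curious what that hierarchy actually looks like? You can walk it yourself. The sketch below assumes each node in the tree exposes a url and that index-style nodes expose a sub_sitemaps list (true in current usp releases); getattr falls back to an empty list on leaf nodes:

from usp.tree import sitemap_tree_for_homepage

def walk(sitemap, depth=0):
    # Print this sitemap's own URL, indented by nesting depth,
    # then recurse into any child sitemaps an index node may hold.
    print("  " * depth + sitemap.url)
    for child in getattr(sitemap, "sub_sitemaps", []):
        walk(child, depth + 1)

tree = sitemap_tree_for_homepage("https://www.asos.com/")
walk(tree)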
3. Extract Just the URLs You Need
Want only product pages? Filter by URL patterns:
product_urls = [page.url for page in tree.all_pages() if "/product/" in page.url]
for url in product_urls:
    print(url)
Targeted extraction. Maximum efficiency. Zero wasted effort.
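Substring checks cover the simple cases. When the pattern is more specific, Python’s built-in re module gives you tighter control. A minimal sketch with a hypothetical pattern (the /prd/ path and numeric ID are assumptions; adapt them to the site you’re crawling):

import re

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage("https://www.asos.com/")

# Hypothetical pattern: paths containing /prd/ followed by a numeric product ID.
product_pattern = re.compile(r"/prd/\d+")

product_urls = [page.url for page in tree.all_pages() if product_pattern.search(page.url)]

print(f"Matched {len(product_urls)} product URLs")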
4. Save URLs for Future Use
Instead of printing URLs, store them for analysis:
import csv

from usp.tree import sitemap_tree_for_homepage

url = "https://www.asos.com/"
tree = sitemap_tree_for_homepage(url)

# Collect every URL listed across the site's sitemaps
urls = [page.url for page in tree.all_pages()]

csv_filename = "asos_sitemap_urls.csv"
with open(csv_filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["URL"])
    for page_url in urls:  # avoid reusing the name "url" from above
        writer.writerow([page_url])

print(f"Extracted {len(urls)} URLs and saved to {csv_filename}")
Now you’ve got a clean, reusable CSV ready for SEO audits, scraping, or analysis.
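And when you come back to that file later, the standard csv module reads it straight back in. A minimal sketch, assuming the file produced above:

import csv

with open("asos_sitemap_urls.csv", newline="", encoding="utf-8") as file:
    reader = csv.reader(file)
    next(reader)  # skip the "URL" header row
    saved_urls = [row[0] for row in reader]

print(f"Loaded {len(saved_urls)} URLs from disk")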
Wrapping Up
Using ultimate-sitemap-parser makes crawling sitemaps simple and efficient. You can pull all URLs with just a few lines of code, manage nested sitemaps automatically, and keep only the links you actually need. Whether your goal is site audits, competitor research, or building a scraping workflow, usp converts tedious hours of work into a smooth, repeatable routine.