How to Crawl Sitemaps with Python
Some websites have hundreds of thousands of pages. Scraping them one by one? Nightmare. But what if you could access every URL in minutes instead of hours? That’s exactly what sitemaps allow. Think of them as a website’s blueprint—every page a site wants search engines to index, neatly listed and structured.
But here’s the catch. Many websites use nested index sitemaps where one sitemap links to others, which in turn link to even more. Trying to parse all of these manually is exhausting and prone to errors.
Enter ultimate-sitemap-parser (usp). This Python library handles the mess for you. Here’s how to use it to crawl sitemaps and extract URLs quickly and reliably.
Things You’ll Need
Before we jump in, ensure your environment is ready.
1. Install Python
Python is required to run the scripts. If it’s not installed yet:
Download the latest version from python.org.
Verify the installation:
python3 --version
2. Install ultimate-sitemap-parser
Grab the library via pip:
pip install ultimate-sitemap-parser
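Not sure whether pip installed it into the environment you’re actually using? pip can tell you:
pip show ultimate-sitemap-parser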
How to Scrape Sitemaps with ultimate-sitemap-parser
With usp installed, extracting URLs is surprisingly simple. Let’s break it down.
1. Fetch Sitemaps and Extract URLs
Forget XML headaches. usp does the heavy lifting in one step:
from usp.tree import sitemap_tree_for_homepage

url = "https://www.asos.com/"

# Discover and parse every sitemap the site exposes (robots.txt and common locations)
tree = sitemap_tree_for_homepage(url)

# Iterate over every page found across those sitemaps
for page in tree.all_pages():
    print(page.url)
Every URL ASOS lists in its sitemaps is now at your fingertips. Simple. Fast. Reliable.
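The page objects carry more than just the URL. In current usp releases, each entry from all_pages() also exposes sitemap metadata such as priority and last_modified where the site provides them; treat those attribute names as assumptions and check your installed version. A minimal sketch that reads them defensively:

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage("https://www.asos.com/")

for page in tree.all_pages():
    # priority and last_modified come from the sitemap entries themselves;
    # getattr keeps this working even if a field is missing in your usp version
    priority = getattr(page, "priority", None)
    last_modified = getattr(page, "last_modified", None)
    print(page.url, priority, last_modified)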
2. Deal with Nested Sitemaps Automatically
Big websites often split their sitemaps into sections: products, categories, blogs. Normally, you’d have to fetch each manually. With usp? It’s automatic.
It will:
Detect index sitemaps.
Fetch child sitemaps without extra code.
Recursively extract all URLs.
Result: a complete dataset, effortlessly.
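Curious what that hierarchy actually looks like? You can walk it yourself. The sketch below assumes each node in the tree exposes a url and that index-style nodes expose a sub_sitemaps list (true in current usp releases); getattr falls back to an empty list on leaf nodes:

from usp.tree import sitemap_tree_for_homepage

def walk(sitemap, depth=0):
    # Print this sitemap's own URL, indented by nesting depth,
    # then recurse into any child sitemaps an index node may hold.
    print("  " * depth + sitemap.url)
    for child in getattr(sitemap, "sub_sitemaps", []):
        walk(child, depth + 1)

tree = sitemap_tree_for_homepage("https://www.asos.com/")
walk(tree)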
3. Extract Just the URLs You Need
Want only product pages? Filter by URL patterns:
product_urls = [page.url for page in tree.all_pages() if "/product/" in page.url]
for url in product_urls:
    print(url)
Targeted extraction. Maximum efficiency. Zero wasted effort.
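Substring checks cover the simple cases. When the pattern is more specific, Python’s built-in re module gives you tighter control. A minimal sketch with a hypothetical pattern (the /prd/ path and numeric ID are assumptions; adapt them to the site you’re crawling):

import re

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage("https://www.asos.com/")

# Hypothetical pattern: paths containing /prd/ followed by a numeric product ID.
product_pattern = re.compile(r"/prd/\d+")

product_urls = [page.url for page in tree.all_pages() if product_pattern.search(page.url)]

print(f"Matched {len(product_urls)} product URLs")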
4. Save URLs for Future Use
Instead of printing URLs, store them for analysis:
import csv

from usp.tree import sitemap_tree_for_homepage

url = "https://www.asos.com/"
tree = sitemap_tree_for_homepage(url)

# Collect every URL listed across the site's sitemaps
urls = [page.url for page in tree.all_pages()]

csv_filename = "asos_sitemap_urls.csv"
with open(csv_filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["URL"])
    for page_url in urls:  # avoid reusing the name "url" from above
        writer.writerow([page_url])

print(f"Extracted {len(urls)} URLs and saved to {csv_filename}")
Now you’ve got a clean, reusable CSV ready for SEO audits, scraping, or analysis.
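And when you come back to that file later, the standard csv module reads it straight back in. A minimal sketch, assuming the file produced above:

import csv

with open("asos_sitemap_urls.csv", newline="", encoding="utf-8") as file:
    reader = csv.reader(file)
    next(reader)  # skip the "URL" header row
    saved_urls = [row[0] for row in reader]

print(f"Loaded {len(saved_urls)} URLs from disk")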
Wrapping Up
Using ultimate-sitemap-parser makes crawling sitemaps simple and efficient. You can pull all URLs with just a few lines of code, manage nested sitemaps automatically, and keep only the links you actually need. Whether your goal is site audits, competitor research, or building a scraping workflow, usp converts tedious hours of work into a smooth, repeatable routine.