Extract URLs by Crawling Sitemaps with Python

Every website’s sitemap is a goldmine waiting to be tapped. It’s the blueprint websites hand to search engines, revealing every important URL. Instead of chasing links endlessly, why not grab the entire map in one sweep?
However, many sites use index sitemaps: big master files pointing to dozens of smaller sitemaps. Parsing these manually is a pain, with hundreds or even thousands of URLs spread across layers of XML files. It’s tedious and error-prone.
Enter ultimate-sitemap-parser (usp), a Python library that cuts through this complexity. It fetches, parses, and dives into nested sitemaps automatically. With a simple function call, you extract every URL without breaking a sweat.
Let’s break down how you can start using usp to crawl the ASOS sitemap efficiently and cleanly.

Requirements

1. Python Installed
Make sure Python is ready on your system. No Python? Download it from the official site. Confirm installation with:

python3 --version

2. ultimate-sitemap-parser Library
Install usp via pip:

pip install ultimate-sitemap-parser

Step 1: Grab Every URL with Minimal Code

Forget manual XML headaches. usp makes it effortless:

from usp.tree import sitemap_tree_for_homepage

url = "https://www.asos.com/"
# Fetch the site's sitemaps and recursively parse every nested layer
tree = sitemap_tree_for_homepage(url)

# all_pages() yields every page entry found anywhere in the tree
for page in tree.all_pages():
    print(page.url)

This snippet fetches the sitemap, parses every nested layer, and spits out all URLs. No extra coding needed. Simple. Elegant. Powerful.
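
Each entry yielded by all_pages() is more than a bare string. Here’s a minimal sketch of the extra metadata, assuming usp’s page objects expose priority, last_modified, and change_frequency attributes (values can be None or defaults when a sitemap omits them):

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage("https://www.asos.com/")

for page in tree.all_pages():
    # Page objects carry optional sitemap metadata alongside the URL;
    # last_modified and change_frequency are None when the site omits them
    print(page.url, page.priority, page.last_modified, page.change_frequency)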

Step 2: Handle Nested Sitemaps Like a Boss

Large sites often split their URLs by type (product pages, blog posts, category listings), each with its own sitemap. Manually pulling each one is a nightmare.

usp automatically detects index sitemaps, follows their child links, and pulls URLs recursively. You get a full list, all nested depths included, with zero extra work.
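
If you want to see that hierarchy for yourself, a short sketch like the one below walks the parsed tree and prints each sitemap indented by depth. It assumes index sitemaps expose their children via a sub_sitemaps attribute, as in usp’s object model; leaf sitemaps simply lack it, which getattr handles:

from usp.tree import sitemap_tree_for_homepage

def print_sitemap_tree(sitemap, depth=0):
    # Print this sitemap's URL, indented to show nesting depth
    print("  " * depth + sitemap.url)
    # Index sitemaps hold children in sub_sitemaps; leaf sitemaps have none
    for child in getattr(sitemap, "sub_sitemaps", []):
        print_sitemap_tree(child, depth + 1)

tree = sitemap_tree_for_homepage("https://www.asos.com/")
print_sitemap_tree(tree)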

Step 3: Target Specific URLs

Need only product pages or blog posts? Filtering is straightforward. For example, to extract only ASOS product pages where URLs include /product/:

# Reuse the tree from Step 1; keep only URLs that contain "/product/"
product_urls = [page.url for page in tree.all_pages() if "/product/" in page.url]

for url in product_urls:
    print(url)

Pinpoint your crawl. Waste no time on irrelevant links.
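
Need several slices at once? You can bucket URLs in a single pass instead of re-walking the tree once per filter. A rough, self-contained sketch; the "/fashion-news/" substring is an illustrative guess, not a documented ASOS URL scheme:

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage("https://www.asos.com/")

# One bucket per substring filter; "/fashion-news/" is a hypothetical pattern
buckets = {"/product/": [], "/fashion-news/": []}

for page in tree.all_pages():
    for pattern, matches in buckets.items():
        if pattern in page.url:
            matches.append(page.url)

for pattern, matches in buckets.items():
    print(f"{pattern}: {len(matches)} URLs")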

Step 4: Save Your Results for Later Use

Instead of dumping URLs on screen, save them neatly into a CSV file. Here’s how:

import csv
from usp.tree import sitemap_tree_for_homepage

url = "https://www.asos.com/"
tree = sitemap_tree_for_homepage(url)
# Materialize the generator into a list so we can count and write the URLs
urls = [page.url for page in tree.all_pages()]

with open("asos_sitemap_urls.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["URL"])
    for page_url in urls:
        writer.writerow([page_url])

print(f"Extracted {len(urls)} URLs and saved to asos_sitemap_urls.csv")

Now your data is stored and ready for any analysis, SEO audit, or scraping project.
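
If freshness matters for a follow-up crawl, a variation like this one also records each page’s last-modified timestamp. It’s a sketch under the assumption that usp populates last_modified as a datetime when the sitemap provides a lastmod value, and None otherwise:

import csv
from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage("https://www.asos.com/")

with open("asos_sitemap_urls_with_dates.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["URL", "Last Modified"])
    for page in tree.all_pages():
        # last_modified can be None; write an empty cell in that case
        last_mod = page.last_modified.isoformat() if page.last_modified else ""
        writer.writerow([page.url, last_mod])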

Wrapping It Up

ultimate-sitemap-parser transforms sitemap crawling from a tedious chore into a clean, automated flow. No wrestling with XML. No juggling nested indexes. Just straight-up URL extraction, fast and reliable.
If you work with SEO, data scraping, or site auditing, usp should be your go-to Python tool.