Scraping the Web With Python

rabm (30) in #programming • 7 years ago (edited)

Good day. Today I was working on a project that required some web scraping: pulling remote data from web pages, which is just HTML code, and parsing it to extract what you need. With the HTML parsed, I had to find and extract some information contained in table rows, to be used later for data analysis / data science. I thought that maybe somebody could benefit from the simple, easy-to-use Python code I wrote (well, part of it) to achieve something similar, in the same or a different scenario.

Let's set up a simple scenario. Say I have a big bunch of images that I composed in Gimp and used in my Busy blog. The cat pulled the power cord, my laptop fell and smashed into the ground, and now my mechanical hard drive is dead. Yeah, I know we have SSDs now, but anyway, imagine the situation :-). Oh no! I just remembered that I didn't back up these images! What do I do, what do I do!... Oh, I still have these images in my blog. Hmmm, I have two options: either copy the photos from their IPFS links and spend a few minutes saving each photo, one by one (remember, I used a big bunch of images), or write a super easy Python script that will pull the photos for me. Yeah, I think it is better to use Python, right? So let's do it!

So, which Python packages should I use? Well, I need to pull remote data from the web, so the best option for me is Requests. Then I need to parse that information in a way that makes it easy to extract what I need; a great option for that is BeautifulSoup. I will also need to read the image data, which is binary, and work with images, so I will use Pillow for the images and the io module to feed the data easily into Pillow.

Alright. That's basically it. Now let's code!

Let's import the packages first:
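The original snippet did not survive in this copy of the post, so here is a minimal sketch of what the imports could look like, assuming the four packages named above are installed (Requests, BeautifulSoup and Pillow come from PyPI as requests, beautifulsoup4 and Pillow; io ships with Python).

```python
import requests                # HTTP client used to download pages and images
from bs4 import BeautifulSoup  # HTML parser (PyPI package: beautifulsoup4)
from PIL import Image          # Pillow, imported under its historical PIL name
from io import BytesIO         # in-memory binary stream from the standard library
```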

Ok, we have our packages in place. Now let's define a string variable with the URL I want to pull information from. In this case, I will use my blog post URL:
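The actual blog post address is not preserved here, so the value below is only a placeholder I made up for illustration; swap in the page you really want to scrape.

```python
# Placeholder address, NOT the author's real post: replace it with your own page.
url = "https://example.com/my-blog-post"
```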

Perfect. Next, some very complicated code! Let's pull all the HTML code from that URL into a string variable we will call "html", by using Requests!
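The original line is missing from this copy, but it was probably little more than a single requests.get call. Here is a hedged sketch: the helper name fetch_html, the timeout value and the raise_for_status check are my own additions, not something taken from the post.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML source as a string."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.text         # body decoded to str using the detected encoding
```

With the url variable from the previous step, html = fetch_html(url) gives you the page source, ready to be parsed.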

Amazing. Now we have our HTML code! I had to write so much, I need a coffee! But there's still work to do. Now we need to parse our HTML, so we can then pull all img tags into a Python list:
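A sketch of those two lines, using a small hand-written HTML string as a stand-in for the downloaded page (the sample markup and its example.com links are made up for illustration). I am assuming Python's built-in "html.parser" backend here; the post does not say which parser was used.

```python
from bs4 import BeautifulSoup

# Stand-in for the page source; in the post this string comes from Requests.
html = """
<html><body>
  <p>Some text</p>
  <img src="https://example.com/ipfs/QmExampleHashOne" alt="first">
  <img src="https://example.com/ipfs/QmExampleHashTwo" alt="second">
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # parse the markup into a navigable tree
images = soup.find_all("img")              # list-like collection of every <img> tag
```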

Done! We used the amazing BeautifulSoup package to parse and extract all img tags into a python list, in just 2 lines!

Alright. Now, with all the tags in our "images" list variable, we need to pull the URL from each src attribute. So let's declare a new list that will contain all the URLs, then use a for loop to go through all the items in the list and pull the src attribute content using BeautifulSoup's awesome syntax:
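Roughly, that loop could look like the sketch below. The sample tags are made up, and I use image.get("src") rather than image["src"] so that a tag without a src attribute is skipped instead of raising a KeyError; the original most likely used the shorter image["src"] form.

```python
from bs4 import BeautifulSoup

# Made-up sample tags; the middle one has no src on purpose.
html = (
    '<img src="https://example.com/ipfs/QmExampleHashOne">'
    '<img alt="no source here">'
    '<img src="https://example.com/ipfs/QmExampleHashTwo">'
)
images = BeautifulSoup(html, "html.parser").find_all("img")

urls = []
for image in images:
    src = image.get("src")  # returns None when the attribute is missing
    if src:
        urls.append(src)
```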

Great! At this point we have all of our IPFS URLs in a Python list. Now, for the final part, we loop over the new list with a for loop and, for each URL, make a Requests request to get its content. We then create a Pillow image object from the binary data returned by the request: because Pillow's Image.open method accepts a file object opened in binary mode, we can wrap the data in an in-memory binary stream, easily, thanks to Python's io module. Then we extract the "hash" from the URL, simply using split, and lastly we save the image as JPEG, keeping the quality at 100%... Let's do it!
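The final loop is also missing from this copy, so what follows is a sketch of the steps just described, not the author's exact code. The function names (hash_from_url, save_as_jpeg, download_images), the out_dir parameter and the timeout are my own, and the convert("RGB") call is an extra I added because JPEG cannot store transparency.

```python
from io import BytesIO
from pathlib import Path

import requests
from PIL import Image


def hash_from_url(url):
    """Return the last path segment, which for an IPFS link is the hash."""
    return url.rstrip("/").split("/")[-1]


def save_as_jpeg(data, name, out_dir="."):
    """Open raw image bytes with Pillow and save them as <name>.jpg."""
    image = Image.open(BytesIO(data))  # BytesIO makes the bytes look like a binary file
    path = Path(out_dir) / f"{name}.jpg"
    # JPEG has no alpha channel, so RGBA or palette images are converted first.
    image.convert("RGB").save(path, "JPEG", quality=100)
    return path


def download_images(urls, out_dir="."):
    """Fetch every URL and store it as a JPEG named after its hash."""
    saved = []
    for url in urls:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        saved.append(save_as_jpeg(response.content, hash_from_url(url), out_dir))
    return saved
```

Calling download_images(urls) with the list built in the previous step writes one JPEG per URL into the current folder.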

And that's it, guys! This is the basic code required to scrape my web site, extract the images, and save them locally. If everything goes well (do you have an Internet connection? ;-), we will end up with a set of images named "<hash>.jpg", where <hash> is the IPFS hash.

Isn't that amazing? How easy it is to do this in Python with the help of all these awesome packages? And you can reuse this over and over... Of course, this is a limited script and you might run into some issues, like when Pillow is not able to recognize or load the image format, in which case you will need to handle those cases with a few conditions using if statements, but nothing crazy. Also, there's an IPFS package for Python that you can use to figure out the real names of the files, and things like that.
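As one possible way to deal with images Pillow cannot read, here is a small sketch. Note that it uses try/except rather than the if statements mentioned above, since Pillow signals these problems by raising exceptions; the helper name open_image_or_none is made up, and UnidentifiedImageError assumes a reasonably recent Pillow release.

```python
from io import BytesIO

from PIL import Image, UnidentifiedImageError


def open_image_or_none(data):
    """Return a loaded Pillow image, or None if the bytes are not a usable image."""
    try:
        image = Image.open(BytesIO(data))
        image.load()  # force decoding now, so truncated files fail here
        return image
    except UnidentifiedImageError:
        return None   # Pillow does not recognize the format at all
    except OSError:
        return None   # format recognized, but the data is truncated or corrupt
```

In the download loop you could then skip any URL for which this returns None and carry on with the rest.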

If you want to try this code, I wrote it and shared it publicly on my Repl.it account.

Peace and love to everybody,
And, thanks for reading
Roberto.

#tecnology #python #development #scraping

crokkon (68) 7 years ago

Hey @rabm, for scraping general webpages, BS is certainly a good candidate. For things like Steem/Busy-related content, there are Python libraries to directly access the blockchain: check out beem, lightsteem or steem-python! No need to parse HTML with those :)

rabm (30) 7 years ago

Thanks, I didn't know about Steem-specific libraries, good info! But my idea behind this was just to share a simple how-to on web scraping to extract data (for data science, or anything else), and I used the post as an example of how to extract the images. But again, good comment. Thanks!

crokkon (68) 7 years ago

Yes, for general web scraping, your approach is of course correct. You used a Busy post as an example, so I wasn't sure in which direction this was going :)

rabm (30) 7 years ago (edited)

So, I came to clarify, because somebody, somewhere, asked me why I was using PIL if I wasn't doing anything with the images... I know, I know. I did it this way just so you know there's PIL, and that it lets you process images... But yeah, if it's just about pulling files, in this case images, and saving them locally, you can do something like this:
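The snippet from this comment is not preserved either, so this is only a sketch of the idea: write the downloaded bytes straight to disk, no Pillow involved. The name save_raw, the out_dir parameter and the choice to use the bare hash as the file name (no extension, since the format is never inspected) are my assumptions.

```python
from pathlib import Path

import requests


def save_raw(url, out_dir="."):
    """Download a file and write its bytes to disk unchanged."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    name = url.rstrip("/").split("/")[-1]  # last URL segment, i.e. the IPFS hash
    path = Path(out_dir) / name
    path.write_bytes(response.content)     # raw bytes, no decoding or re-encoding
    return path
```

Looping over the list of URLs and calling save_raw(url) on each one gives you the files exactly as they were uploaded.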

If you want to test it: https://repl.it/@iBobX/pullingremotewebimages

So, there you go. ;)
Thanks.
Roberto.
