Scraping with Python

I made a tool based on the Beautiful Soup package to scrape web pages in a Pythonic way.

Demo

I made a quick and dirty scraper with the library. You point it at a URL and it extracts all links or image URLs, for example:

python3 pextractor.py -u https://mytarget.com -e links -o links.txt
python3 pextractor.py -u https://mytarget.com -e img -o images.txt

See the repo
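
Under the hood, a scraper like this only needs a few lines. Here is a minimal sketch of the same idea (not the actual pextractor code; the URL is just a placeholder) using requests and Beautiful Soup:

# Minimal sketch of a link/image extractor, not the actual pextractor code
import requests
from bs4 import BeautifulSoup

def extract(url, element):
    """Fetch a page and return its link targets or image sources."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    if element == "links":
        return [a.get("href") for a in soup.find_all("a") if a.get("href")]
    if element == "img":
        return [img.get("src") for img in soup.find_all("img") if img.get("src")]
    return []

print("\n".join(extract("https://mytarget.com", "links")))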

Why?

I wanted to test how long it takes to build a basic scraper in Python. I’m not surprised by how easy it is, as the ecosystem is extremely powerful, but the possibilities for reconnaissance, brute-force attacks, and footprinting are remarkable.

It’s very precise: you can target specific elements anywhere in the HTML tree, and even specific attributes:

python3 pextractor.py -u https://mytarget.com -e generator

The above command extracts the meta generator tag if the page exposes one.
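
In Beautiful Soup terms, that lookup is a single targeted find() call; here is a rough sketch of how it could be done (the URL is a placeholder):

# Sketch: look up the meta generator tag with Beautiful Soup
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://mytarget.com", timeout=10).text, "html.parser")
meta = soup.find("meta", attrs={"name": "generator"})
if meta:
    print(meta.get("content"))  # e.g. the CMS name and version, if the site exposes it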

More advanced usage

There are many more options, such as prettify, which is helpful when you don’t know what you’re looking for, and handy helpers like the one that extracts the raw text. You can also scrape entire websites recursively if you combine it with other packages.
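
Both of those helpers ship with Beautiful Soup itself; a quick sketch of what they give you (URL is a placeholder):

# Sketch: built-in Beautiful Soup helpers for exploring a page
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://mytarget.com", timeout=10).text, "html.parser")
print(soup.prettify())   # the HTML tree, nicely indented, handy when exploring an unknown page
print(soup.get_text())   # the raw text with every tag stripped
# combined with urllib.parse.urljoin and a queue of discovered links, this becomes a recursive crawler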

Types of scrapers

Here are a few examples:

  • Spiders
  • HTML parsers ^^
  • Screenscrapers (e.g., PhantomJS)
  • Human copycat (plagiarism)

How to protect against unwanted scraping

As you can see, it’s relatively easy to set up, and many existing scripts actually rely on this library.

Preventing scraping is not an easy task. Many techniques involve special cookies or JavaScript-based solutions that display content conditionally, but these can create more problems than they solve.

Likewise, obfuscating data and using CAPTCHAs (or honeypots) are not necessarily the best approaches and can harm accessibility and user experience.

What you can do is rate-limit requests and keep logs of users’ requests (authentication is not required). This way, you may “identify” scrapers or, at least, make their life significantly harder. If they reach the rate limit, they’ll get a 429 error (Too Many Requests).
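
A naive version of that rate limit fits in a few lines. The sketch below keeps per-IP timestamps in memory; a real deployment would rather rely on the web server, a reverse proxy, or a shared store, and the limits here are arbitrary:

# Sketch: naive in-memory, per-IP rate limiter (limits are arbitrary)
import time
from collections import defaultdict, deque

WINDOW = 60         # seconds
MAX_REQUESTS = 100  # per IP per window
hits = defaultdict(deque)

def allow(ip):
    """Return False when the caller should answer 429 Too Many Requests."""
    now = time.time()
    timestamps = hits[ip]
    while timestamps and now - timestamps[0] > WINDOW:
        timestamps.popleft()
    if len(timestamps) >= MAX_REQUESTS:
        return False
    timestamps.append(now)
    return True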

You can also target specific IP ranges to block them more effectively and stop those who use known bypasses like IP rotation.
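
Python’s standard library can already express such a rule; a sketch using ipaddress (the blocked range is purely illustrative):

# Sketch: check a client IP against blocked ranges (the range is illustrative)
import ipaddress

BLOCKED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]

def is_blocked(client_ip):
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_RANGES)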

There are ways to catch automated traffic (bots), especially using metrics like unusually high request volumes. If your adversaries are more determined, you can try to block traffic by User-Agent (including empty User-Agent strings) and more complex patterns.
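
The User-Agent check itself can stay very simple; a sketch, with a purely illustrative block list:

# Sketch: flag empty or obviously automated User-Agent strings (list is illustrative)
BLOCKED_AGENTS = ("python-requests", "curl", "wget", "scrapy")

def is_suspicious(user_agent):
    if not user_agent:  # empty User-Agent
        return True
    ua = user_agent.lower()
    return any(bot in ua for bot in BLOCKED_AGENTS)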

You may also modify your HTML structure from time to time to break scrapers that rely on fixed markup, and see if they are still happy with that.

Last but not least, I would recommend being extra careful with your APIs and other JSON/XML feeds. While they are handy for many purposes, they can also feed scrapers if they are too open or misconfigured.

Wrap up

Web scraping is not evil, but black hats and other malicious actors use it too. Active monitoring is recommended.

See Also