Tuesday, February 19, 2013

Scraping the Web Without a Proxy on Heroku

403 Forbidden: one of the biggest issues when scraping websites.  After you bombard any reasonably intelligent site with hundreds of requests per minute, it will eventually cut you off for a period of time, if not ban you outright.  The common workaround has usually been to get a list of proxies and rotate your requests through them.  That way, your traffic appears to come from different places and is less noticeable.  However, there are a few issues with this.

Proxies are slow

Routing through a proxy adds an extra hop, which can easily double your latency.  Instead of going from A to B, you need to go from A to C to B.  Furthermore, you're probably not the only one using it.  Most public proxies get swarmed with requests, which adds bandwidth issues to the mix.

Proxies only accept certain requests

Most public proxies accept only GET requests, and may limit the domains you can access for a variety of reasons.  This isn't the case with all of them, but it could easily become an issue.

Proxies expire

When using proxy servers, you'll need to keep a constantly updated list of available servers.  They go down without notice and new servers surface all the time.

A Better Solution 

We can get around these issues by using Heroku Scheduler.  The beauty of Heroku is that each dyno has a different IP address.  They're distributed around Amazon Web Services, which has hundreds of thousands, if not millions, of IP addresses.  Every time you spin up a new dyno, you get a new IP address.

Another advantage is that Heroku prorates to the second.  It doesn't matter how many dynos you spin up, just how long they stay alive.  I've found it usually takes a Rails dyno about 10 seconds to start up, which is a pretty small penalty since you can usually run one for a few minutes before being blocked.  You'll easily make up that cost by not wasting time waiting on slow proxies.

To take full advantage of this, write your scripts to fail fast.  After a few unsuccessful requests, kill the dyno.  Then set up your scheduling to run constantly.  The scheduler has a minimum interval of 10 minutes, but you can set up multiple jobs at that interval.  This way, you'll actually be able to run through thousands of different IP addresses a day without fear of getting cut off.
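As a rough illustration of the fail-fast idea, here is a minimal sketch.  It is written in Python purely for brevity (the same shape would work as a Ruby script or rake task), and the names `fetch`, `scrape`, and `MAX_CONSECUTIVE_FAILURES`, along with the threshold of 3, are assumptions made for this example rather than anything Heroku provides.

```python
import urllib.request

# Assumed threshold: how many failures in a row mean "this IP is blocked".
MAX_CONSECUTIVE_FAILURES = 3


def fetch(url, timeout=10):
    """Return the response body, or None if the request failed.

    urlopen raises HTTPError for statuses such as 403; HTTPError, URLError
    and socket timeouts are all OSError subclasses, so one except covers them.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError:
        return None


def scrape(urls, fetch_page=fetch, max_failures=MAX_CONSECUTIVE_FAILURES):
    """Fetch urls in order, giving up after max_failures failures in a row.

    Returns (results, remaining): bodies keyed by URL, plus the URLs that
    still need to be fetched by a later run.
    """
    results = {}
    pending = []   # URLs that failed and should be retried later
    failures = 0   # consecutive failures so far
    for i, url in enumerate(urls):
        body = fetch_page(url)
        if body is None:
            pending.append(url)
            failures += 1
            if failures >= max_failures:
                # Probably blocked: stop now and hand back the unfinished work.
                return results, pending + list(urls[i + 1:])
            continue
        failures = 0
        results[url] = body
    return results, pending
```

In a scheduled job, you would save `remaining` somewhere the next run can find it (a database table or queue, for instance) and then let the process exit so the dyno shuts down.  The next run starts on a fresh dyno and picks up where this one left off.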
