Tuesday, February 19, 2013

Scraping the Web Without a Proxy on Heroku

403 Forbidden is one of the biggest obstacles when scraping websites.  After you bombard any reasonably intelligent site with hundreds of requests per minute, it will eventually cut you off for a period of time, if not ban you outright.  The common workaround has been to get a list of proxies and rotate your requests through them, so your traffic appears to come from different places and is less noticeable.  However, there are a few issues with this.

Proxies are slow

Using a proxy adds a hop, which can easily double your latency.  Instead of going from A to B, you go from A to C to B.  Furthermore, you're probably not the only one using it.  Most public proxies get swarmed with requests, which adds bandwidth problems to the mix.

Proxies only accept certain requests

Most public proxies only accept GET requests, and may limit the domains you can access for a variety of reasons.  This isn't the case with all of them, but it could easily be an issue.

Proxies expire

When using proxy servers, you'll need to keep a constantly updated list of available servers.  They go down without notice and new servers surface all the time.

A Better Solution 

We can get around these issues by using Heroku Scheduler.  The beauty of Heroku is that dynos come up with different IP addresses.  They're distributed around Amazon Web Services, which has hundreds of thousands, if not millions, of IP addresses.  Every time you spin up a new dyno, you're likely to get a new IP address.

Another advantage is that Heroku prorates dyno time to the second.  It doesn't matter how many dynos you spin up, just how long they stay alive.  I've found it usually takes a Rails dyno about 10 seconds to start up, which is a pretty small penalty since you can usually run one for a few minutes before being blocked.  You'll easily save money by not spending dyno time waiting on slow proxies.

To take full advantage of this, write your scripts to fail fast.  After a few unsuccessful requests, kill the dyno.  Then set up your scheduling to run constantly.  The scheduler has a minimum interval of 10 minutes, but you can set up multiples of 10 minutes.  This way, you can run through thousands of different IP addresses a day without fear of getting cut off.
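
The fail-fast idea can be sketched in plain Ruby.  This is only a rough illustration, not the script I actually run: the names scrape, DEFAULT_FETCHER and MAX_CONSECUTIVE_FAILURES are made up for the example, and the threshold of 3 is an assumption standing in for "a few unsuccessful requests".

```ruby
require "net/http"
require "uri"

# Hypothetical threshold: give up after this many consecutive failed requests.
MAX_CONSECUTIVE_FAILURES = 3

# Default fetcher: returns the response body on a 2xx status, nil otherwise
# (for example on a 403 once the site starts blocking this IP address).
DEFAULT_FETCHER = lambda do |url|
  begin
    response = Net::HTTP.get_response(URI(url))
    response.is_a?(Net::HTTPSuccess) ? response.body : nil
  rescue StandardError
    nil # treat timeouts and connection errors as failures too
  end
end

# Fetches urls in order and stops as soon as the failure streak hits the limit.
# Returns [pages, blocked]: pages maps url => body, and blocked is true when
# the loop bailed out early.
def scrape(urls, fetcher: DEFAULT_FETCHER, max_failures: MAX_CONSECUTIVE_FAILURES)
  pages = {}
  failures = 0
  urls.each do |url|
    body = fetcher.call(url)
    if body
      pages[url] = body
      failures = 0 # any success resets the streak
    else
      failures += 1
      return [pages, true] if failures >= max_failures
    end
  end
  [pages, false]
end
```

In a scheduled job you would call something like pages, blocked = scrape(urls), save whatever came back, and exit right away when blocked is true so the dyno shuts down and the next run starts on a fresh one.  The fetcher is passed in as a lambda so the loop can be exercised without hitting a real site.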

27 comments:

  2. The information on this blog is very useful and very interesting. If someone needs to know about the just click
    Mp3Juices UK proxy

  3. How can I change my IP address with the httparty gem?

  6. I don't care that much about privacy; I am more interested in bypassing IP blocks from both our ISP and the server. I use Hide My Ass (HMA) and I am pretty satisfied with it. Sometimes things work faster over their VPN than without it: less ping in some games, faster YouTube, and access to some IP-restricted games such as Vindictus EU and PSO2 (Asia). Dark Souls 2 had issues with our ISP firewall as well, and using the VPN worked like a charm.

    My ISP sucks, but I can't change it, it is the only one available in my area, the other ISP also uses the same infrastructure and partially owned by the first, a ruse to hide monopoly existence.

    VPN Services
    Best Dark net markets

  7. A small business (online or local) is considered a fragile investment. Almost everything has to work as planned or else the investment will fail without notice. Funds, logistics and advertising have to work as expected as everything is essential to its success. dig this

  8. Thanks for providing nice tips and tricks to use this Proxy Sites for YouTube to unblock sites.

  9. This post has helped me for an article which I am writing. Thank you for giving me another point of view on this topic. Now I can easily complete my article. Cheers UK

  10. Site-to-site and remote access are two kinds of VPN services. Remote access refers to a LAN connection that is utilized by an organization so that its employees can connect from remote locations to the private network. why use VPN

  11. I admit, I have not been on this web page in a long time... however it was another joy to see Microleaves It is such an important topic and ignored by so many, even professionals. I thank you to help making people more aware of possible issues.

  12. Certain proxy sites enable you to surf the web for nothing, while some need a login. mexico proxy

  13. Proxy websites are available for free and many people use proxies to make money. Certain proxy websites allow you to surf the internet for free, while some need a login.Web Proxy

  14. Verify that the User Name you entered is right and retype the Password before attempting the association once more.https://novavpn.com/blog/popcorn-time/

  16. Regular visits listed here are the easiest method to appreciate your energy, which is why why I am going to the website everyday, searching for new, interesting info. Many, thank you  click here

  17. I admit, I have not been on this web page in a long time... however it was another joy to see It is such an important topic and ignored by so many, even professionals. I thank you to help making people more aware of possible issues scopri di piu

  18. Thanks for the tips guys. They were all great. I have been having issues with being fat both mentally and physically. Thanks to you guys i have been showing improvements. Do post more. besuche die Website

  19. There are a lot of blogs and articles out there on this topic, but you have acquired another side of the subject. This is reliable content thank you for sharing it. https://allertaprivacy.it

  20. This article is an appealing wealth of informative data that is interesting and well-written. I commend your hard work on this and thank you for this information. You’ve got what it takes to get attention. privacy online

  21. Great post! I am actually getting ready to across this information, is very helpful my friend. Also great blog here with all of the valuable information you have. Keep up the good work you are doing here. privacyenbescherming

  22. Your work here on this blog has been top notch from day 1. You've been continously providing amazing articles for us all to read and I just hope that you keep it going on in the future as well. Cheers! weneedprivacy.com

  23. It was extremely all around composed and straightforward. Not at all like different online journals I have perused which are truly not that good.Thanks a lot https://internetprivatsphare.ch

  24. Pretty good post. I just stumbled upon your blog and wanted to say that I have really enjoyed reading your blog posts. Any way I’ll be subscribing to your feed and I hope you post again soon. lesmeilleursvpn

  25. I respect this article for the very much investigated substance and magnificent wording. I got so included in this material that I couldn't quit perusing. I am awed with your work and aptitude. Much obliged to you to such an extent. schweiz vpn

  26. An Android VPN will give you an additional layer of security to complete things without stressing over uncovering individual data. can isp track vpn
