Monday, November 25, 2013

NCAA Football Rankings Algorithm

Released an NCAA football rankings algorithm today.  It's pretty simple, but remarkably effective and unbiased.  You can read more about it on GitHub.

Thursday, May 16, 2013

New Website Launched

Threw up a new site today that posts information about developers.  Right now, you can only find people by programming language, but I'll add location too.  The design's really a disaster at the moment; maybe I'll fix that.

Find ya some devs

Sunday, March 31, 2013

Google Analytics April Fools

Seems the International Space Station has been surfing all my sites today :)  The bubble moves around to wherever the station is orbiting at the moment.  You can see it in the Real-Time overview (https://www.google.com/analytics/web/?hl=en&pli=1#realtime/rt-overview)


Saturday, March 23, 2013

A Response to a Complaint by Bitcoin Socially

The owner of Bitcoin Socially has complained about this page on one of my websites.  Here's his email:
You did not ask to use our images nor to post our email address on badappreviews.com/apps/150044
You will be given 10 days to take down our content or our lawyers will attempt to have the entire site taken down via the Digital Millennium Copyright Act. We will also file a suit for the illegally used content.
Please take this matter seriously,
Bitcoin Socially
Since his email server appears not to be functioning and I can't reply to him, I've decided to post my response here.

To start, I find it highly unlikely that I've violated any of your copyrights.  Email addresses are likely not copyrightable.  According to the US Copyright Office, "Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed."  An email address appears to me to be a fact, just simply an address at which someone can be contacted.  Regardless, I've written to them regarding the matter and should hear back next week.

My use of your images seems to be protected based on the decision in Perfect 10 v. Amazon.com.  According to this decision, "the owner of a computer that does not store and serve the electronic information to a user is not displaying that information, even if such owner in-line links to or frames the electronic information."  This is exactly the case on my website.  My servers neither store nor serve these images; I simply provide browser instructions to display images Bitcoin Socially has made publicly available.

Update:

The US Copyright Office got back to me.  As I expected, you cannot copyright an email address.

Saturday, February 23, 2013

Bad App Reviews Now Has iOS Apps

Bad App Reviews now has iOS apps.  We've got about 90k of them listed now, but we're still filling in reviews.  About 5k apps have reviews right now, and we're adding them at a rate of 3k apps/day.  You can see them under the search or index, right alongside their Android counterparts.

Mashable's HTML Intro

Noticed this ASCII art at the top of Mashable's HTML today.  Seems it gets sent for every page on their site.



<!--
o o     o     +              o
+   +     +             o     +       +
            +
o  +    +        o  +           +        +
     __  __           _           _     _
~_,-|  \/  | __ _ ___| |__   __ _| |__ | | ___
    | |\/| |/ _` / __| '_ \ / _` | '_ \| |/ _ \,-~_,- - - ,
~_,-| |  | | (_| \__ \ | | | (_| | |_) | |  __/    |   /\_/\
    |_|  |_|\__,_|___/_| |_|\__,_|_.__/|_|\___|  ~=|__( ^ .^)
~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,-~_,""   ""
o o     o     +              o
+   +     +             o     +       +
            +
o  +    +        o  +           +        +
-->

Tuesday, February 19, 2013

Scraping the Web Without a Proxy on Heroku

403 Forbidden: one of the biggest issues when scraping websites.  After you bombard any reasonably intelligent site with hundreds of requests per minute, it's eventually going to cut you off for a period of time, if not ban you outright.  The common workaround for this has usually been to get a list of proxies and rotate your requests through them.  That way, your traffic appears to come from different places and is less noticeable.  However, there are a few issues with this.

Proxies are slow

By its nature, using a proxy should at least double your latency.  Instead of going from A to B, you need to go from A to C to B.  Furthermore, you're not likely to be the only one using it.  Most public proxies get swarmed with requests, and this adds bandwidth issues into the mix.

Proxies only accept certain requests

Most public proxies only accept GET requests, and may limit the domains you can access for a variety of reasons.  This isn't the case with all of them, but it could easily be an issue.

Proxies expire

When using proxy servers, you'll need to keep a constantly updated list of available servers.  They go down without notice and new servers surface all the time.

A Better Solution 

We can get around these issues by using Heroku Scheduler.  The beauty of Heroku is that each dyno has a different IP address.  They're distributed around Amazon Web Services, which contains hundreds of thousands, if not millions, of IP addresses.  Every time you spin up a new dyno, you get a new IP address.

Another advantage is that Heroku prorates to the second.  It doesn't matter how many dynos you spin up, just how long they stay alive.  I've found it usually takes a Rails dyno about 10 seconds to start up, which is a pretty small penalty since you can usually run them for a few minutes before being blocked.  You'll easily make up that cost by not killing time waiting on proxies.
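To put a rough number on that startup penalty, here's a quick back-of-the-envelope calculation in Python.  The 10 seconds is the startup time mentioned above; the 180-second run is just an assumed stand-in for "a few minutes", and the function name is made up for illustration.

```python
def startup_overhead(startup_s, run_s):
    """Fraction of billed dyno time spent booting rather than scraping."""
    return startup_s / (startup_s + run_s)


# 10 s startup (from the post), 180 s of scraping (an assumption).
overhead = startup_overhead(10, 180)
print(f"{overhead:.1%}")  # prints 5.3%
```

Under those assumptions, only about one billed second in twenty goes to booting; shorter runs before a block would push that share up.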

To take full advantage of this, write your scripts to fail fast.  After a few unsuccessful requests, kill the dyno.  Then set up your scheduling to run constantly.  There's a minimum time interval of 10 minutes for the scheduler, but you can set up multiples of 10 minutes.  This way, you'll actually be able to run through thousands of different IP addresses a day without fear of getting cut off.
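Here's a minimal sketch of what "fail fast" might look like, written in Python rather than Rails just to keep it short.  It's not the script I actually run: the names `fetch` and `scrape`, the threshold of 3 consecutive failures, and the 10-second timeout are all assumptions for illustration.

```python
import urllib.error
import urllib.request

MAX_FAILURES = 3  # assumed threshold; "a few unsuccessful requests"


def fetch(url, timeout=10):
    """Return (status, body); HTTP errors such as 403 become a status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""
    except OSError:
        # Covers URLError, timeouts, and connection failures.
        return None, b""


def scrape(urls, fetch=fetch, max_failures=MAX_FAILURES):
    """Fetch URLs in order, giving up after max_failures failures in a row.

    Returns (pages, blocked): pages maps url -> body for each success,
    and blocked is True if the run was abandoned early.
    """
    pages = {}
    failures = 0
    for url in urls:
        status, body = fetch(url)
        if status == 200:
            pages[url] = body
            failures = 0  # a success resets the streak
        else:
            failures += 1
            if failures >= max_failures:
                return pages, True  # probably blocked; stop burning dyno time
    return pages, False
```

The scheduled job would call `scrape` on its batch of URLs, save whatever came back, and exit as soon as `blocked` is true so the dyno shuts down and the next run starts on a fresh one.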