Monday, October 01, 2007

Google captured me again

D'oh. I've been CAPTCHA'D.

Sitting here in Texspresso after work. I decided to let my program run at top speed. I wasn't sure whether it would take a fixed amount of time to catch me, or whether it's mainly based on the number of page hits. I reduced my sleep time so that I get a new web page every two seconds. It only took them twenty minutes to make me stop, so the speed at which I hit them is definitely a big factor.
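For anyone curious, the pacing between page requests could look something like the sketch below. This is not my actual program; the `Throttle` class and its parameters are made-up names for illustration. It assumes all you want is a minimum gap between requests, and it counts time already spent on other work (parsing, database writes) toward that gap.

```python
import time


class Throttle:
    """Keep successive requests at least min_interval seconds apart (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be exercised without real waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Sleep only for whatever part of the interval has not already elapsed.

        Returns the number of seconds actually slept.
        """
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept
```

You would call `throttle.wait()` right before each page fetch. With `Throttle(2.0)`, a loop that already spends three seconds on other work per page never sleeps at all.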

Oh well. In that time I managed to collect 1100 new clusters, which finishes off the month of September 2006 (the month that Paris Hilton got arrested, which makes for some entertaining analysis). But I only managed to pick up 100 stories, so I've got more to do.

Nephlm mentioned a program called Tor that hides your IP address, so maybe I'll try that and see if it works.

Update: Tor works! It works like a charm! Nephlm, I owe you a beer. Come to Austin sometime and I'll pay up.

Tor is a project backed by the Electronic Frontier Foundation, and what it does is route your web requests through various remote servers so that the Google server can't tell where you're really coming from.
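Here is a rough sketch of what sending a program's requests through Tor might look like in Python. It assumes a local HTTP proxy (Privoxy is a common choice) is listening on 127.0.0.1:8118 and forwarding to Tor; that address, and the names `build_tor_opener` and `fetch`, are assumptions for illustration, not a description of my actual setup.

```python
import urllib.request

# Assumption: a local HTTP proxy at this address forwards traffic into Tor.
TOR_HTTP_PROXY = "http://127.0.0.1:8118"


def build_tor_opener(proxy_url=TOR_HTTP_PROXY):
    """Return a urllib opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, opener, timeout=30):
    """Fetch url through the given opener and return the raw response bytes."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Because only this opener uses the proxy, the rest of the machine (the browser included) keeps talking to the network directly.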

But an amusing side effect is: When I logged in to Blogger, everything was in German. I must be sending requests through a host in Germany somewhere, and now Blogger sees my apparent location and thinks I want the German version of Google.

Oh well, who cares, as long as I'm getting my data. :) "Post veröffentlichen" means "publish this post," right?


  1. I, personally, am fond of Googling in BorkBorkBork

    And sometimes for amusement, I choose the Spanish option at the ATM; however, there is a new ATM near RLM that has a plethora of language choices -- French, Russian, Japanese, etc. I could really do a lot of damage to my account by choosing the wrong language for the transaction.

  2. I'm glad Tor worked. Though as a courtesy you probably shouldn't run at FULL speed. Throw a sleep() in there for half a second or so.

    You probably only want to run Tor while you need its features. There is the language side effect, and the encryption and routing slow network connections quite a bit.

    Mmmmm.... Beer.

  3. Yeah, actually I thought better of it and I'm sleeping for 2 seconds.

    But also, my database access rate is kind of slow anyway (I upload 20 new items per page, plus association rows, plus checking against existing rows to avoid duplicates). So in practice the sleep is actually much shorter than the natural delay anyway.

  4. Oh, and also, I had to set up the database to run through a proxy, but I don't have to use a proxy in Firefox. So, no more German.

    Lucky for me that Google News doesn't seem to redirect you to German news sources, or I'd be in trouble.
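The duplicate check mentioned in comment 3 can be sketched roughly as follows. This is only an illustration under loud assumptions: it swaps in SQLite for whatever database is really in use, and the `stories` table, its columns, and the function names are all hypothetical. The idea is to let a uniqueness constraint reject rows that are already stored instead of querying for each one first.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a database with a hypothetical stories table keyed by URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stories (url TEXT PRIMARY KEY, title TEXT)"
    )
    return conn


def count_stories(conn):
    """Return the number of rows currently stored."""
    return conn.execute("SELECT COUNT(*) FROM stories").fetchone()[0]


def add_stories(conn, stories):
    """Insert (url, title) pairs, skipping URLs already present.

    Returns how many rows were actually new.
    """
    before = count_stories(conn)
    # INSERT OR IGNORE silently skips rows that would violate the primary key.
    conn.executemany(
        "INSERT OR IGNORE INTO stories (url, title) VALUES (?, ?)", stories
    )
    conn.commit()
    return count_stories(conn) - before
```

Uploading a page of 20 items then costs one batched statement, and the return value shows how many of them were genuinely new.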