My web skimming program has been having a field day with the Google news archive. I'm currently pulling stories from as far back as a year and a half ago. Before dinner tonight, I picked up 2000 new Google clusters on "John Edwards." I was pretty cheered by this progress.
When I got home, I fired up the program again and started searching the year for clusters of "Anna Nicole Smith"... and got nothing. Not a single hit.
This was kind of bewildering to me. I tried a few more times, digging through it with the debugger. Nothing. So finally I pulled out the URL of the search page my program was looking at, and pasted it into my browser. I got this message:
... but your query looks similar to automated requests from a computer
virus or spyware application. To protect our users, we can't process
your request right now.
We'll restore your access as quickly as possible, so try again soon.
In the meantime, if you suspect that your computer or network has been
infected, you might want to run a virus checker or spyware remover to
make sure that your systems are free of viruses and other spurious
We apologize for the inconvenience, and hope we'll see you again on
To continue searching, please type the characters you see below:
[Typical captcha text returned]
Uh-oh. I experienced a bit of temporary jumpiness as I realized that Google had noticed I'd been hitting their server really hard and really fast. I typed in the confirmation text, of course, and it let me view the page. But I tried the program again, and it still didn't work.
I did some research, winding up at this post. I don't really get the details, but it sounds like Google has been targeted by malicious spyware programs in the past, which do tons of web searches that somehow uncover target servers that are vulnerable to attack. Then they install copies of themselves on those target servers, which in turn do more malicious searches on Google's site.
So, yeah, that's pretty neato that they catch bad guys. Unfortunately, they also catch me. That's bad. I have a thesis that needs finishing.
I decided to wait a few hours, and in the meantime I put in some code that makes it pause for five seconds before it gets a web page. I don't want to annoy them.
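For the curious, the pause amounts to something like the sketch below. This is a minimal Python illustration and not my actual program: the names polite_fetch and fetch_url are made up for the example, and the fetch and sleep arguments are only there so the delay can be tried out without touching the network.

```python
import time
import urllib.request


def fetch_url(url):
    """Download a page and return its body as text (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def polite_fetch(url, delay_seconds=5.0, fetch=fetch_url, sleep=time.sleep):
    """Pause for delay_seconds, then fetch the page.

    The five-second default matches the pause described above;
    everything else here is an assumption made for illustration.
    """
    sleep(delay_seconds)  # wait first, so no two requests go out back to back
    return fetch(url)
```

Changing the wait is then just a matter of passing a different delay_seconds.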
A few hours later, the spam catcher stopped harassing me. I let the program run for a while longer, and it managed to walk through a couple thousand more clusters, all from the month of March. But then it stopped again, with the same message. This time I had a break in place, so I could kill the program before it failed a bunch more challenges.
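Stopping at the first challenge page could be automated along these lines. Again, this is a hedged Python sketch rather than my real code: the marker phrase is taken from the message quoted earlier, while looks_blocked, crawl_until_blocked, and the fetch argument are hypothetical names invented for the example.

```python
# Phrase taken from the block message quoted earlier in this post.
BLOCK_MARKER = "your query looks similar to automated requests"


def looks_blocked(html):
    """Return True if the page text contains the block message."""
    # Collapse whitespace so line wrapping in the page cannot hide the phrase.
    normalized = " ".join(html.lower().split())
    return BLOCK_MARKER in normalized


def crawl_until_blocked(urls, fetch):
    """Fetch URLs in order and stop at the first page that looks blocked.

    Returns (pages, blocked_url): pages is a list of (url, html) pairs
    fetched successfully, and blocked_url is the URL that triggered the
    block message, or None if every page came back normally.
    """
    pages = []
    for url in urls:
        html = fetch(url)
        if looks_blocked(html):
            return pages, url  # stop here instead of failing more challenges
        pages.append((url, html))
    return pages, None
```

Returning the blocked URL means a later run can pick up from exactly that point once access comes back.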
This is going to be a slow process. I want my data. Now. I might consider bumping the delay up to thirty seconds in the morning.
Also, I suspect that Google is using my ISP to mark me as an evildoer. If that's the case, then maybe I can get around it by wandering around town with my laptop. I'll go from one wireless hotspot to the next, grabbing a few thousand entries here and there, until I've got the whole year's worth of material.