Saturday, September 29, 2007

An unexpected hazard of mining other people's websites for information:

Sorry for the deluge of long computer-sciency posts. The thing is, blogging about my thesis is helping me write it. Earlier this week, when I posted some comments about my research, I pasted the whole post into my paper and got another three pages out of it. Awesome! It needs some editing, but there's plenty of solid material in there. So let's see if I can get away with writing the whole paper just by blogging. You, dear readers, will just have to decide whether to suffer through these posts or skip them. Unfortunately, tonight's commentary is about a big setback.

My web skimming program has been having a field day with the Google News archive. I'm currently pulling stories dating back a year and a half. Before dinner tonight, I picked up 2000 new Google clusters on "John Edwards." I was pretty cheered by this progress.

When I got home, I fired up the program again and started searching the year for clusters of "Anna Nicole Smith"... and got nothing. Not a single hit.

This was kind of bewildering to me. I tried a few more times, digging through it with the debugger. Nothing. So finally I pulled out the URL of the search page my program was looking at, and pasted it into my browser. I got this message:

403 Forbidden

We're sorry...

... but your query looks similar to automated requests from a computer
virus or spyware application. To protect our users, we can't process
your request right now.

We'll restore your access as quickly as possible, so try again soon.
In the meantime, if you suspect that your computer or network has been
infected, you might want to run a virus checker or spyware remover to
make sure that your systems are free of viruses and other spurious
software.

We apologize for the inconvenience, and hope we'll see you again on
Google.

To continue searching, please type the characters you see below:
[Typical captcha text returned]

Uh-oh. I experienced a bit of temporary jumpiness as I realized that Google had noticed I'd been hitting their server really hard and really fast. I typed in the confirmation text, of course, and it let me view the page. But when I tried the program again, it still didn't work.

I did some research, winding up at this post. I don't really get the details, but it sounds like Google has been targeted by malicious spyware programs in the past, which run tons of web searches to uncover target servers that are vulnerable to attack. They then install copies of themselves on those servers, which in turn run more malicious searches against Google's site.

So, yeah, that's pretty neato that they catch bad guys. Unfortunately, they also catch me. That's bad. I have a thesis that needs finishing.

I decided to wait a few hours, and in the meantime I put in some code that makes it pause for five seconds before it gets a web page. I don't want to annoy them.
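The pause I put in is about as simple as throttling gets, but for the curious, here's a rough sketch of the idea in Python. This isn't my actual thesis code; the `Throttle` class and its names are made up for illustration. It only sleeps for whatever portion of the delay hasn't already passed, so slow page parses don't add dead time on top.

```python
import time

class Throttle:
    """Enforce a minimum delay between successive page fetches."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.last_request = None  # no request made yet

    def wait(self):
        # Sleep only for whatever remains of the delay since the last request.
        if self.last_request is not None:
            elapsed = time.monotonic() - self.last_request
            if elapsed < self.delay:
                time.sleep(self.delay - elapsed)
        self.last_request = time.monotonic()
```

The scraper would call `throttle.wait()` right before each page fetch; bumping the delay up (to thirty seconds, say) is then a one-line change.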

A few hours later, the spam catcher stopped harassing me. I let the program run for a while longer, and it managed to walk through a couple thousand more clusters, all from the month of March. But then it stopped again, with the same message. This time I had a breakpoint in there, so I could kill the program before it failed a bunch more challenges.

This is going to be a slow process. I want my data. Now. I might consider bumping the delay up to thirty seconds in the morning.

Also, I suspect that Google is making a note of my IP address and flagging me as an evildoer. If that's the case, then maybe I can get around it by wandering around town with my laptop, hopping from one wireless hotspot to the next and grabbing a few thousand entries at each, until I've got the whole year's worth of material.


  1. Is there a reason you only use Google? There are other search engines.

  2. You could try to ask Google for permission to crawl. They might even hire you if your data-mining goes well.

  3. John, I'm not using Google as a search engine for web sites, I'm using Google News as an aggregator for mainstream news stories. And yes, there is a reason I'm only searching on one site.

    While there may be a few other news sites that do the same thing, it's not a trivial matter to start searching multiple sites. Each time I add a site, I have to write specific code that recognizes exactly what format its results come in, so I can identify the headlines, dates, and target URLs of the stories listed.

    Bugsoup, I did post a message on the Google help forum, as well as emailing the group owner. Whether they'll ever get around to responding to me is another matter.

  4. Your posts are relatively empty of jargon, and I found this one pretty easy to follow.

    You're talking me out of doing further postgrad study at the moment, and my wallet is thanking you.

  5. If whatever you're using to actually do the harvest (wget, curl, etc.) allows it (most likely it does), I recommend changing your agent string regularly.

    If that doesn't do it, you could also route your traffic through the anonymous Tor network so its point of origin changes at least once a minute, if not once per request.

    Also, make sure your query string looks as much as possible like a normal query string generated through the UI.

    I suspect one or both of those will get you past the filter.

    Good luck.
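Since a couple of people asked what the per-site extraction code actually involves: here's a rough sketch of the kind of thing I mean. The HTML format below is completely made up (each site's real markup is different, which is exactly the problem), but the shape of the code is representative: one hand-written pattern per site, pulling out headline, date, and target URL.

```python
import re

# Hypothetical result format for one imaginary news site, e.g.:
#   <a class="headline" href="URL">TITLE</a> <span class="date">DATE</span>
# A real site needs its own pattern matched to its own markup.
RESULT_PATTERN = re.compile(
    r'<a class="headline" href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>\s*'
    r'<span class="date">(?P<date>[^<]+)</span>'
)

def extract_results(html):
    """Pull (title, date, url) tuples out of one site's results page."""
    return [(m.group("title"), m.group("date"), m.group("url"))
            for m in RESULT_PATTERN.finditer(html)]
```

Multiply that by every site you want to cover, plus maintenance whenever a site redesigns its results page, and you can see why I'm sticking to one source for now.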
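And on the agent-string suggestion above: for anyone wondering what "changing your agent string regularly" looks like in practice, here's a minimal sketch. The agent strings are fake placeholders, not real browser signatures, and `fetch_with_rotating_agent` is a made-up name; the point is just that each request presents a different User-Agent header.

```python
import itertools
import urllib.request

# Placeholder agent strings for illustration only; a real rotation
# would use strings copied from actual browsers.
AGENT_STRINGS = itertools.cycle([
    "Mozilla/5.0 (Windows; U; Windows NT 5.1) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; PPC Mac OS X) ExampleBrowser/2.0",
])

def fetch_with_rotating_agent(url):
    """Fetch a page, presenting the next User-Agent in the rotation."""
    request = urllib.request.Request(
        url, headers={"User-Agent": next(AGENT_STRINGS)})
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Whether this actually helps against a server that's keying on IP address and request rate, rather than headers, is another question.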