Monday, September 24, 2007

Data mining the news (ongoing work)

My thesis is about using data mining to analyze the relative emphasis that traditional media outlets give to various types of stories. Then I'll be comparing this data to the emphasis that actual news consumers give to the same stories. My point is to discover which types of stories are overplayed or underplayed, and come to some conclusions about which types of news sources best reflect the public interest.

To that end, I've written a big Java program around an online MySQL database. In the last few days I've cataloged about 22,000 news pages, although only a small number of them will ultimately turn out to be important to the study. I've labeled roughly a dozen web sites and a dozen news topics as "interesting." The sites are:
The stories are:
  1. Rudolph Giuliani
  2. Anna Nicole Smith
  3. Harry Potter
  4. Tiger Woods
  5. Rupert Murdoch
  6. Barack Obama
  7. Gulf Coast
  8. Mitt Romney
  9. New Orleans
  10. Hillary Clinton
  11. Britney Spears
  12. Blackwater
  13. Ron Paul
Crazy lists, aren't they? There is some method to this madness. With the stories, I tried to get a reasonable sample of popular topics, some of which are serious and some of which are decidedly unserious. I have a lot of presidential candidates in there since I'll be especially interested to compare who's being covered vs. who people WANT to be covered. For instance, my hunch is that Ron Paul is a topic of interest much more for Digg readers than for media outlets. Ron Paul seems to have some kind of word-of-mouth campaign going on where libertarian fans of his call shows like C-Span and post on blogs all over the place, whereas the news seems to be largely ignoring him. I'm not a Paul supporter, except to the extent that I think he's clearly the least evil Republican in the race.

With the web sites, the idea is to have a variety of media sources. Some are considered serious news sites; some are "fluff" news (I picked USA Today specifically for that reason, and it's possible that CNN will tend to fall in that category as well); and several are explicitly right wing rags. To be fair, I really would like to have included left wing rags, but the only ones I can identify are blogs, which are not treated much as news sources. The news is all pulled from searches: I search for the topics of interest, then read the resulting stories more or less indiscriminately and identify which site each one comes from.
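For illustration, here is a minimal sketch of the "identify which site each one comes from" step. It is not the actual code from the program; the class name `SiteIdentifier`, the method `siteOf`, and the choice to treat the URL's host name (minus a leading "www.") as the site are all assumptions made for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper: maps a story URL to a site label.
class SiteIdentifier {

    // Returns the lower-cased host of the URL without a leading "www.",
    // or null if the string cannot be parsed as a URL with a host.
    static String siteOf(String storyUrl) {
        try {
            String host = new URI(storyUrl.trim()).getHost();
            if (host == null) {
                return null; // e.g. a relative path with no host part
            }
            host = host.toLowerCase(Locale.ROOT);
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (URISyntaxException e) {
            return null; // malformed URL: leave the story unlabeled
        }
    }

    public static void main(String[] args) {
        // Prints "example.com"
        System.out.println(siteOf("http://www.example.com/politics/story.html"));
    }
}
```

Grouping by host name is crude (a site with several subdomains would be split into several "sources"), but it is enough to sort stories into buckets.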

Based on this, I have a total of nearly 2,000 "news" sources, ordered by the number of stories found in searches since I started collecting data. In the stories I've pulled so far, after about three days of serious searches on the 13 topics, the New York Times and the Washington Post (my main "serious news" sites) each account for 104 stories. But one site has shown up zero times, so I guess there's a master list that it's clearly not on. TPM Muckraker and TPM Cafe both show up, and those are both explicitly liberal sites, but there are only 8 stories from them. "The Nation": 9 stories. So, liberal sites = small sample size. No use.
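The "ordered by the number of stories" ranking can be sketched like this. Again, this is an illustration under assumptions, not the real program: in practice the counting would presumably happen in the MySQL database, but since the schema isn't shown here, the sketch just tallies an in-memory list of site labels. The class name `SourceTally` and the method `rank` are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts stories per site and orders sites by count.
class SourceTally {

    // Takes one site label per story; returns site -> story count,
    // iterating from most stories to fewest (ties broken alphabetically).
    static Map<String, Integer> rank(List<String> storySites) {
        Map<String, Integer> counts = new HashMap<>();
        for (String site : storySites) {
            counts.merge(site, 1, Integer::sum); // add 1 to this site's tally
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // LinkedHashMap keeps the sorted order when iterated.
        Map<String, Integer> ranked = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            ranked.put(e.getKey(), e.getValue());
        }
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Integer> ranked =
                rank(Arrays.asList("a.example", "b.example", "a.example"));
        // Prints "a.example: 2" then "b.example: 1"
        for (Map.Entry<String, Integer> e : ranked.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```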

By contrast, townhall, whose "about" page proudly announces that they were founded as a "conservative web community," accounts for 123 stories. Yes, you read that right: for the topics I picked, townhall is treated as "news" more often than either the New York Times or the Washington Post. So, bottom line, I get to pick on right wing news sources more than left wing news sources, simply because left wing news isn't "news."

Almost time for the Daily Show now, so I've managed to procrastinate this long. Go me!

If anyone would like to make further contributions, feel free to suggest other story topics that are in the news. Anna Nicole Smith and Harry Potter aren't actually generating very many headlines these days, so I need more unserious topics that the media uses as padding. Suggestions? And if you have more right-wing, left-wing, or "mainstream" news sources that I should be looking at, let me know. I'll check my database and see if there are enough stories represented to get something useful out of them.
