Sunday, September 30, 2007

Paper abstract

For anyone who's interested. I want to take this opportunity to repeat my thanks to those people who suggested directions to go in when I asked for help earlier this year.

In recent years, major news corporations seem to dedicate an increasing amount of time and space to "fluff," reporting on celebrities, entertainment and crime stories, rather than more essential national and international news. As such news content is increasingly gathered online, it has become feasible to aggregate large amounts of data from a wide range of sites. This report proposes a model for collecting information from news agencies, then applying the techniques of Data Mining to organize this reporting in a way that identifies the priorities of individual organizations.

In addition, the rise of user-based taxonomies has made it possible to evaluate broadly the interests of people who actively read and recommend news. In the final analysis, data collected from users of Digg.com are compared with data collected from media sites. This provides a benchmark for determining whether the delivery of "fluff" news is a fair response to popular demand, or whether typical news readers are dissatisfied with the level of serious event coverage found in the media.

Thesis saga continues: It's working

As per my last post, I'm now sitting here in Schlotzsky's Deli, which has free wi-fi, and my data mining program is just blazing along. I have my flag set to yell at me immediately if I get the "Are you human?" warning.

Maybe it's just because I've been listening to Harry Potter and the Deathly Hallows on tape, but my current frame of mind is that this is kind of exciting. Sort of like I'm sneaking around to my safe houses in order to avoid being apprehended by the authorities. It's the nerdiest cloak-and-dagger story you've ever heard, I bet. And by comical coincidence, I just checked my progress and it's looking at a story about Harry Potter from March '06 right now.

In a weird kind of way, this has actually helped me refocus my attention on how to attack the problem, a bit. Previously I was just indiscriminately grabbing all kinds of data, without regard to whether it was useful or not. Now that I know that my time is limited and I could be "captcha'd" at any moment, I've tightened my focus in a way that makes a lot of sense. I'm focusing on stories only within a specific time range, and only bothering to look at clusters of approximately average size. This way, I know that even if I'm interrupted in the middle and can't collect any more data at all, I'll still have plenty of information to work with.
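For the curious, here's a minimal sketch of what that narrowing might look like in Java. It is not the actual thesis code: the `ClusterFilter` and `Cluster` names are invented for illustration, and "approximately average size" is assumed to mean "within some fraction of the mean story count," which is only one possible reading.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: keeps clusters inside a date window and near the average size. */
final class ClusterFilter {

    /** Illustrative stand-in for one news cluster; the fields are assumptions. */
    static final class Cluster {
        final String topic;
        final LocalDate date;
        final int storyCount;

        Cluster(String topic, LocalDate date, int storyCount) {
            this.topic = topic;
            this.date = date;
            this.storyCount = storyCount;
        }
    }

    /**
     * Keeps clusters dated from..to (inclusive) whose story count lies within
     * tolerance * mean of the mean size of the in-range clusters.
     */
    static List<Cluster> select(List<Cluster> all, LocalDate from, LocalDate to, double tolerance) {
        List<Cluster> inRange = new ArrayList<>();
        long total = 0;
        for (Cluster c : all) {
            if (!c.date.isBefore(from) && !c.date.isAfter(to)) {
                inRange.add(c);
                total += c.storyCount;
            }
        }
        if (inRange.isEmpty()) {
            return inRange;
        }
        double mean = (double) total / inRange.size();
        List<Cluster> kept = new ArrayList<>();
        for (Cluster c : inRange) {
            if (Math.abs(c.storyCount - mean) <= tolerance * mean) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Cluster> sample = new ArrayList<>();
        sample.add(new Cluster("Harry Potter", LocalDate.of(2006, 3, 10), 10));
        sample.add(new Cluster("Harry Potter", LocalDate.of(2006, 3, 15), 50));
        List<Cluster> kept = select(sample, LocalDate.of(2006, 3, 1), LocalDate.of(2006, 3, 31), 0.7);
        System.out.println("Clusters kept: " + kept.size());
    }
}
```

The point of filtering before fetching anything else is that a crawl cut short by a captcha still leaves a usable, evenly scoped sample.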

This has also given me some new ideas on how to interpret the data, and I'm looking forward to analyzing it later. Eventually I won't need to worry about what Google thinks of me, because I can just read their stuff from my own private database.

Saturday, September 29, 2007

An unexpected hazard of mining other people's websites for information:

Sorry for the deluge of long computer sciency posts. The thing is, it's helping me to blog about my thesis. Earlier this week when I posted some comments about my research, I pasted the whole post into my paper and got another three pages out of it. Awesome! It needs some editing, but there's plenty of solid material in there. So, let's see if I can get away with writing the whole paper just by blogging. You, dear readers, will just have to decide whether to suffer through these posts or skip them. Unfortunately, tonight's commentary is about a big setback.

My web skimming program has been having a field day with the Google news archive. I'm currently pulling stories from as far back as a year and a half ago. Before dinner tonight, I picked up 2000 new Google clusters on "John Edwards." I was pretty cheered by this progress.

When I got home, I fired up the program again and started searching the year for clusters of "Anna Nicole Smith"... and got nothing. Not a single hit.

This was kind of bewildering to me. I tried a few more times, digging through it with the debugger. Nothing. So finally I pulled out the URL of the search page my program was looking at, and pasted it into my browser. I got this message:

403 forbidden

We're sorry...

... but your query looks similar to automated requests from a computer
virus or spyware application. To protect our users, we can't process
your request right now.

We'll restore your access as quickly as possible, so try again soon.
In the meantime, if you suspect that your computer or network has been
infected, you might want to run a virus checker or spyware remover to
make sure that your systems are free of viruses and other spurious
software.

We apologize for the inconvenience, and hope we'll see you again on
Google.
To continue searching, please type the characters you see below:
[Typical captcha text returned]

Uh-oh. I experienced a bit of temporary jumpiness as I realized that Google had noticed I've been hitting their server really hard and really fast. I typed in the confirmation text, of course, and it let me view the page. But I tried the program again, and it still didn't work.

I did some research, winding up at this post. I don't really get the details, but it sounds like Google has been targeted by malicious spyware programs in the past, which do tons of web searches that somehow uncover target servers that are vulnerable to attack. Then they install copies of themselves on those target servers, which in turn do more malicious searches on Google's site.

So, yeah, that's pretty neato that they catch bad guys. Unfortunately, they also catch me. That's bad. I have a thesis that needs finishing.

I decided to wait a few hours, and in the meantime I put in some code that makes it pause for five seconds before it gets a web page. I don't want to annoy them.
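A pause like that can be sketched in a few lines of Java. This is only an illustration, not the code from the thesis program: `PoliteThrottle` and its method names are made up, and it enforces a minimum gap between requests rather than a flat sleep, so time already spent parsing a page counts toward the wait.

```java
/** Hypothetical throttle: keeps consecutive requests at least a fixed interval apart. */
final class PoliteThrottle {
    private final long minIntervalMillis;
    private long lastRequestMillis = -1; // -1 means no request has been made yet

    PoliteThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Returns how many milliseconds the caller still has to wait at the given clock time. */
    long millisToWait(long nowMillis) {
        if (lastRequestMillis < 0) {
            return 0;
        }
        long elapsed = nowMillis - lastRequestMillis;
        return Math.max(0, minIntervalMillis - elapsed);
    }

    /** Records that a request was made at the given clock time. */
    void markRequest(long nowMillis) {
        lastRequestMillis = nowMillis;
    }

    /** Sleeps until the interval has passed, then records the new request time. */
    void awaitTurn() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        markRequest(System.currentTimeMillis());
    }

    public static void main(String[] args) throws InterruptedException {
        // A short interval keeps the demo quick; the post uses five seconds (5000 ms).
        PoliteThrottle throttle = new PoliteThrottle(200);
        for (int i = 1; i <= 3; i++) {
            throttle.awaitTurn();
            System.out.println("Would fetch page " + i + " now");
        }
    }
}
```

Calling `awaitTurn()` right before each page fetch is all it takes; bumping the delay to thirty seconds is then a one-number change.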

A few hours later, the spam catcher stopped harassing me. I let the program run for a while longer, and it managed to walk through a couple thousand more clusters, all from the month of March. But then it stopped again, with the same message. This time I had a break in there to kill the program before it started failing a bunch more challenges.
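Spotting the block page in code is not hard, at least in sketch form. The Java below is an assumption-laden illustration rather than the real program: `BlockDetector` is an invented name, and the phrases it looks for are simply lifted from the message quoted above, so they would need updating if the wording changed.

```java
import java.util.Locale;

/** Hypothetical check for the "are you human?" page described in this post. */
final class BlockDetector {

    /** True if the response looks like the block page rather than real search results. */
    static boolean looksBlocked(int statusCode, String body) {
        if (statusCode == 403) {
            return true; // the block page arrived with "403 forbidden"
        }
        String text = (body == null) ? "" : body.toLowerCase(Locale.ROOT);
        // Phrases taken from the message quoted in the post.
        return text.contains("automated requests")
                || text.contains("type the characters you see below");
    }

    public static void main(String[] args) {
        String blockPage = "We're sorry... but your query looks similar to automated requests";
        if (looksBlocked(200, blockPage)) {
            System.out.println("Blocked: stop the crawl and try again later");
        }
    }
}
```

Checking every response this way lets the crawl stop at the first challenge instead of piling up failed requests.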

This is going to be a slow process. I want my data. Now. I might consider bumping the delay up to thirty seconds in the morning.

Also, I suspect that Google is making a note of my ISP to determine that I am an evildoer. If that's the case, then maybe I can get around it by wandering around town with my laptop. I'll go from one wireless hotspot to the next, grabbing a few thousand entries here and there, until I've got the whole year's worth of material.

Monday, September 24, 2007

Data mining the news (ongoing work)

My thesis is about using data mining to analyze the relative emphasis that traditional media outlets give to various types of stories. Then I'll be comparing this data to the emphasis that actual news consumers who inhabit Digg.com give to the same stories. My point is to discover which types of stories are overplayed or underplayed, and come to some sort of conclusions about which types of news sources best reflect the public interest.

To that end, I've written a big Java program around an online MySQL database. In the last few days I've cataloged about 22,000 news pages, although only a small number of them will ultimately turn out to be important to the study. I've labeled roughly a dozen web sites and a dozen news topics as "interesting." The sites are:
  1. www.washingtonpost.com
  2. www.nytimes.com
  3. www.foxnews.com
  4. www.guardian.co.uk
  5. online.wsj.com
  6. www.usatoday.com
  7. www.cnn.com
  8. www.townhall.com
  9. www.washingtontimes.com
The stories are:
  1. Rudolph Giuliani
  2. Anna Nicole Smith
  3. Harry Potter
  4. Tiger Woods
  5. Rupert Murdoch
  6. Barack Obama
  7. Gulf Coast
  8. Mitt Romney
  9. New Orleans
  10. Hillary Clinton
  11. Britney Spears
  12. Blackwater
  13. Ron Paul
Crazy lists, aren't they? There is some method to this madness. With the stories, I tried to get a reasonable sample of popular topics, some of which are serious and some of which are decidedly unserious. I have a lot of presidential candidates in there since I'll be especially interested to compare who's being covered vs. who people WANT to be covered. For instance, my hunch is that Ron Paul is a topic of interest much more for Digg readers than for media outlets. Ron Paul seems to have some kind of word of mouth campaign going on where libertarian fans of his call shows like C-Span and post on blogs all over the place, whereas the news seems to be largely ignoring him. I'm not a Paul supporter, except to the extent that I think he's clearly the least evil Republican in the race.

With the web sites, the idea is to have a variety of media sources. Some are considered serious news sites; some are "fluff" news (I picked USA Today specifically for that reason, and it's possible that CNN will tend to fall in that category as well); and several are explicitly right wing rags. To be fair, I really would like to have included left wing rags, but the only ones I can identify are blogs, which are not treated much as news sources. The news is all pulled off of news.google.com. I search for the topics of interest, then read the resulting stories more or less indiscriminately and identify which site each one comes from.

Based on this, I have a total of nearly 2000 "news" sources, ordered by the number of stories found in searches since I started collecting data. In the stories I've pulled so far, after about three days of serious searches on the 13 topics, the New York Times and the Washington Post (my main "serious news" sites) each account for 104 stories. But dailykos.com has shown up zero times, so I guess there's a master list that they're clearly not on. TPM Muckraker and TPM Cafe both show up, and those are both explicitly liberal sites, but there are only 8 stories from them. "The Nation": 9 stories. So, liberal sites = small sample size. No use.

By contrast, townhall.com, whose "about" page proudly announces that they were founded as a "conservative web community," accounts for 123 stories. Yes, you read that right: for the topics I picked, townhall is treated as "news" more often than either the New York Times or the Washington Post. So, bottom line, I get to pick on right wing news sources more than left wing news sources, simply because left wing news isn't "news."
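The per-site counts above boil down to a tally of story URLs by host name. Here is a minimal Java sketch of that idea. It is not the thesis program (which keeps its data in MySQL); `SourceTally` is a made-up name, the URLs in `main` are placeholders, and real story links may need more cleanup than this.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical tally of how many collected stories come from each site. */
final class SourceTally {

    /** Counts stories per host name, skipping URLs that cannot be parsed. */
    static Map<String, Integer> countByHost(List<String> storyUrls) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : storyUrls) {
            String host;
            try {
                host = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                continue; // malformed URL: ignore it
            }
            if (host == null) {
                continue; // no host component (e.g. a relative link)
            }
            counts.merge(host.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder URLs, not real story links.
        List<String> urls = Arrays.asList(
                "http://www.townhall.com/story-a",
                "http://www.townhall.com/story-b",
                "http://www.nytimes.com/story-c");
        System.out.println(countByHost(urls));
    }
}
```

Sorting the resulting map by count gives the ordered list of sources; comparing the entries for two hosts is then a simple lookup.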

Almost time for the Daily Show now, so I've managed to procrastinate this long. Go me!

If anyone would like to make further contributions, feel free to suggest other story topics that are in the news. Anna Nicole Smith and Harry Potter aren't actually generating very many headlines these days, so I need more unserious topics that the media uses as padding. Suggestions? And if you have more right-wing, left-wing, or "mainstream" news sources that I should be looking at, make some suggestions. I'll check my database and see if there are enough stories represented to get something useful out of them.

This is it. I'm officially in grad school hell.

Bless me father, for I have sinned. It has been two weeks since my last blog post.

So -- ha ha -- did I think that semesters like this one and this one were tough? Bugger that, this one takes the cake. The first draft of my 50-ish page Master's Report is supposed to be done in early October, so I've been focused on that for the week since my last class. Meanwhile, in my next class weekend I have one homework and two midterm exams.

I spent an entire weekend working non-stop on my thesis, then I got to enjoy going back to work fresh on Monday. My boss gave me Friday afternoon off, which was a nice gesture, except of course for the fact that I used it to do schoolwork.

I spent most of Saturday at a coffee shop on campus. Actually driving to campus was a stupid plan, because apparently there was this little football game going on that I wasn't thinking about. I was originally planning to go to the library and renew my TexShare card, but parking turned out to be impossible. So, coffee shop. Nice thing about UT is that it's so wired you can actually get wireless internet from everywhere, including some parking lots.

My work's really taking shape now. I've filled out the 14-page template for my report, which feels like I've accomplished some real work even though only two pages of actual double spaced text are written.

I meant to start working on the homework tonight; however, I've been so brain-fried that I mostly just ran the data collection program, stared at the news for a while, and did a whole lot of nothin' else. Blogging is just another form of procrastination, which I think I will continue to do until the Daily Show starts, at which point I will concede defeat for the evening. There's always tomorrow.

I was going to write more about my thesis in this post, but I'd rather keep this one strictly a post wherein I bitch about the trials of being a grad student, and cleanly separate the stuff about what kind of work I'm doing into a separate post. I think blogging will help me overcome writer's block in adding more detail to the report, so humor me, dear readers. See you in the next post.

Monday, September 10, 2007

I sure did fall for that one!

This weekend I got email from a "Paulraj JY" in India. It said:

Subject: Greetings from a New Friend fm India

Dear Sir and Mr.Russell,

Greetings from India. It was a surprise for me to read your blog and it
is full of surprises. Though you addressed yourself as an athiest, you are full of human virtues and you are a nice person to befriend.
Though I'm a Christian Missionary, I'm interested in your views,
thoughts and way of life.
When you consider yourself to be an athiest it may not be meaningful for me to ask you to pray for our service among children those who are left uncared. These Children are very special to us and we enjoy working with them. You, as I estimate, full of good values for human values and relationship, would be happy to hear more about our work among such children. We would be very happy to win your heart and have you as one of our wellwishers of our program.
You have a beautiful family. Please convey our greetings to them. Bye for now....
With special love and regards
Paulraj JY

Now, it's actually not all that uncommon that I get email from other countries. I am a regular contributor to two podcasts that have some small measure of notoriety, and I get people I never heard of commenting on my blog from time to time.

Nevertheless, there WAS that small voice that said to me, "Hey, this sounds a little bit like the stilted language in some of those Nigerian scam emails." But then I thought, "Nah, those guys mostly shotgun form emails when they're looking for new suckers. This guy was very specific about my blog and my atheism. Be a good atheist emissary and answer him."

So I wrote:

Dear Paul,

Part of the reason why I openly identify myself as an atheist is because theists rarely encounter people who are willing to say that they don't believe in God, and so they may have a lot of misperceptions. While I don't believe that atheists are better people than Christians, I do think that we are just as likely to care about humanity and have compassion.

In any case, thank you for your friendly email and have a good day.

Then he wrote:

Dear Friend Russell,
Thanks for your prompt response, which makes me happy. I just appreciate your openness. Please receive our special love and we really feel proud about your heart full of compassion for mankind....
Since you have a great concern for the betterment of mankind, I think it may not be improper to let you know that we are working with AIDS orphans and we've a formal inagural function our Grace Foster Home on the fifteenth of this month. Please remember us on this special day....
Kindly find an attached picture of our special children with we love to fellowship with and care for their better future.... I'm sure that you will appreciate our work... Bye for now...
lovingly yours friend from India
Paulraj JY

Well, that's a shame. So finally I wrote:

Dear Paul,

I have to give you credit for making the extra effort to personalize your scam email. However, since I now believe that you are a Nigerian con man attempting to perpetrate a 419 fraud, I would like to invite you to kindly go to hell. Of course, I don't believe in hell. But since you are in Nigeria, I reckon that's close enough.

Sincerely,
Russell Glasser

Presidential candidate or Buffy villain?

Yeah, we already know I'm a sucker for those side-by-side similarity pictures. The similarity in these pictures at AmericaBlog really is impressive.

If you're not a Buffy fan, you can look here to see who the creepy guys are.

Saturday, September 08, 2007

Paradox of omniscience and free will

Lots of theological debates center around the religious idea of free will. Some varieties of theists, e.g. Calvinists, don't believe in free will at all. Some atheists (like my friend Denis Loubet) don't believe in free will either, believing that the notion is incompatible with a completely materialistic universe.

Those are all interesting topics, but one issue I find equally interesting is whether "God," as Christians define him, can have free will. I think I'm borrowing this line of reasoning from an old Raymond Smullyan book, although I can't remember exactly where.

God is supposed to be omniscient. He knows everything about the past, present, and future. In fact, his knowledge is so complete that he must know every action that he himself will take in the future.

Now, suppose you yourself were granted the power of omniscience -- not omnipotence or any of the other useful attributes, but you know everything. Suppose it comes time to make a fairly mundane decision, like what you will eat for breakfast. You can have scrambled eggs or oatmeal. So you wonder, what am I in the mood for? Scrambled eggs, or oatmeal? But this is an easy decision: you are omniscient! Simply use your unlimited knowledge to peer a few minutes into the future, and see what it is that you will have for breakfast. And when you look at your future self, you know, as a matter of absolute certainty, whether you will be eating eggs or oatmeal.

But wait a minute. What if you are in a perverse frame of mind and wish to exercise your free will? So you say to yourself "Okay, here's what I'll do. I'll check the future, but I won't do what it says. If I see myself eating oatmeal, then I'll pick scrambled eggs. If I see myself eating eggs, it'll be oatmeal."

Now what does that mean for your powers? If your vision is guaranteed to be accurate, then you don't have the free will to change your decision. But if you can change your decision, then your vision was wrong, and you are no longer omniscient.

This is one reason why I conclude that no being can be both omniscient and free.

Wednesday, September 05, 2007

Surprise, we have bigots for neighbors

We have a couple of agave plants decorating the sidewalk on both sides of our driveway. They're sharp spiky plants, but that's not so unusual; there are a number of neighbors around the block who have a cactus or two.

We also have a couple of very prickly neighbors. They're an old retired couple living two doors down from us. We've been living at our current residence for nearly six years, and those folks haven't spoken a word to us in about five. Ginny says she smiles and waves at them and they scowl back at her.

When we first moved in, they were friendly and invited us to church, which we politely declined. We used to host a regular gaming night with our mostly atheist friends. They started asking "Why are there so many cars here on Mondays, and what are all those bumper stickers about?" So my wife told them. And that's about the time they stopped talking to us. I never felt like it was outright hostility, but she did. In any case, we haven't had much contact.

We have a couple of our own bumper stickers. She has a Darwin fish and a "Freedom from religion" sticker. Mine is more humorous; it says "Knowledge is Power. Power corrupts. Study hard, be evil."

This weekend one agave was cut. I don't mean carefully trimmed, I mean completely hacked up all across the front. Ginny has some pictures on her blog. We found pieces of spiny leaves in another neighbor's trash can on trash day, but we knew that they had been away on vacation so it wasn't them. Ginny was sure it had been this unpleasant couple. She was angry about it. Since I tend to have a bit of a more diplomatic approach to people than she does, she asked me to go over there and talk to them. I wasn't looking forward to it, but I wanted to hear their side of the story without prejudging them, hoping it was perhaps a big misunderstanding.

So I rang the bell and greeted them in as friendly a manner as possible, all smiles. I reintroduced myself to the woman and asked if she perhaps knew anything about the chopped plant. Despite giving me a fairly frosty reception, she invited me in and called her husband down. I had a seat on their couch, they took positions opposite me, and the husband had his arms folded the whole time and a very sullen scowl on his face.

Yes, he cut down the agave. I received a lecture on how dangerous it is to the neighborhood kids, and all sorts of gruesome scenarios about eyes being poked out. But what struck Ginny and me as weird later was when we realized that they hadn't cut any of the spines facing the sidewalk -- only the side on the street. (Again, see the picture.)

They then went on to lecture me about the general awful nature of our yard. Now, our yard may not be the most beautiful and well-kept in the neighborhood, but it is mowed regularly and there are quite a few houses that look worse than ours. I'm not a gardener myself, and I'm really busy with school, but I think Ginny does a reasonable job with it.

I took all this politely and said I understood their concerns, and is there anything else? Then we got into the bumper stickers. The wife said several times that they "make her sick" and she is very angry that we disrespect her religion. That she could never be friends with someone who doesn't "share her values." That she is firmly set in her beliefs and would never change them.

I said I don't want or expect her to change her beliefs, I have never asked her to. I don't proselytize to people who haven't approached me about the subject. And while I sympathized with her feelings, the very fact that she is willing to announce that the bumper stickers sicken her is unfortunately one of the chief reasons why we feel the need to express ourselves in this way. That Christians -- not you, I stated -- feel that it's acceptable to go door-to-door inviting people to their religion, and that we are expected to keep quiet about our opinions because they are supposedly offensive. We are sad that you view our bumper stickers that way, but we see it as a small but legitimate exercise of our free speech.

I then went on to state that while I understand the safety concerns regarding the spikes, it would have been polite if he had come over and brought them up with my wife. Then perhaps they could discuss the appearance and come up with an effective way of trimming it, or let her handle it. His wife restated the fact that they could never be friends with us. I said "I would never refuse to be friends with somebody just because of their beliefs. Only their attitude would make it difficult." Then I said I am not asking to be their friend; I'm only asking them to be friendly as neighbors and be a little more willing to open up lines of communication with us before taking it upon themselves to redecorate our property. I nicely asked him to come over some morning and discuss his concerns with my wife so that she can understand them as well. He agreed, but I'm pretty sure he didn't like it.

As I mentioned before, I'm the more diplomatic one in the family. Just for good measure, this morning Ginny called the local police to talk about the incident, describing it as trespassing and vandalism. Before I left for work we were visited by a very cheerful and friendly cop, who got to hear all about the history and laughed at the notion that our yard would be an eyesore to anyone.

We didn't want to file charges. He offered to go over there and talk with them, even give a warning that they could be arrested if they were on our property again. We declined that too. I said I'm still hoping that the husband will come over and work things out amicably.

But I did happen to glance over at the neighbor's house while the cop car was in our driveway, and I saw the window blinds being pulled up. It was bright outside and I didn't get to see the expression on her face as she watched us talk to the policeman, clearly discussing our plant. But I have a pretty good imagination and I have to admit, it was kind of satisfying.

Saturday, September 01, 2007

Long live the laptop

I bought a new laptop computer at Fry's just over a month ago. My old one was enough to surf the web and take notes, but impossible to use for any programming work. (Or, let's be honest, games either. But I didn't focus on that. Honest.)

Anyway, I've been fairly happy with my new laptop, an HP Pavilion. But last week one of the buttons on the touchpad stopped working -- to be precise, it was the equivalent of the left mouse button. Now, this was not a huge issue, because you can click on something by tapping the touchpad itself, and also I bought a small wireless mouse that does the job better anyway. But it was annoying, especially since the button felt like it should be working just fine. I thought it might even be a software problem.

So I brought it back to Fry's and asked them to take a look at it. They took a few minutes and then said "We'll get you a replacement." That's it. So I quickly wiped all my personal data (they say they wipe the hard drive immediately, but I figure you ought to be careful) and then they just walked up, pulled a replacement fresh out of the box, and slipped it in my carrying case. It took about 30 minutes of paperwork, but not too bad.

Now on the one hand, I appreciate Fry's replacement policy, and think that was extremely handy. On the other hand, this episode doesn't make me very confident in the quality of my purchase. I spent much of today reinstalling all my essential software (Eclipse, MySQL data manager, Firefox, Thunderbird, Google Earth... and yeah, World of Warcraft) and that was a huge pain in the butt. I hope that I don't just need to keep returning to Fry's for replacements every month.

On the other other hand, this incident does make me appreciate the new decentralized way of managing data that I've gotten used to. I didn't have to go home and back up my work, because every document I need is in a briefcase or source control program on my desktop. My contacts are online in Plaxo; my bookmarks are in Del.icio.us; my web feeds are on reader.google.com. All the work was to get the programs running, and mostly you can quickly download the latest versions of everything straight from the web without inserting any discs. That's awfully convenient.