Saturday, April 27, 2013

Watson is Trivial (a crazy-man delusions-of-grandeur conversation with my friend Colt)

Hey Colt,

I believe that I could replicate something like IBM's Watson in a matter of days, coding in BASIC, on my laptop. So could you, though, and I have a party to throw tonight, and an English for Call Centers program to design by Tuesday. And you'd probably be better at it.

I'm going to cut right to the chase.

Wikipedia and the internet itself are semantic networks. Play Hitler Hops. Hitler is one degree away from Rommel, and 4 hops from toenails. Looking for a single, linear shortest path actually doesn't do justice to the way that concepts are linked. Hitler and Rommel have a bajillion links in common. This is all child's play to quantify. I say this because I'm pretty much a child, and it's the kind of thing I've done before.

This isn't my doing, but it's a cool picture.


The connections between concepts are pretty well mapped, at least in a subjective, human way (I imagine that if you had a pseudo-Watson drawing information from much of the web, it would associate alien abductions with government secrets and Justin Bieber with "fags"--this is maybe not the kind of cold objectivity you would want in a godlike personal assistant, especially if you believe that the masses are wrong about a lot of stuff).
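For what it's worth, quantifying that Hitler-Rommel overlap is only a few lines. Here's a rough sketch that hits the live Wikipedia API instead of a local dump (it assumes you have the requests package installed, and it lazily ignores pagination, so it only sees the first 500 or so links per page):

import requests

API = "https://en.wikipedia.org/w/api.php"

def outgoing_links(title):
    """Titles of the articles that `title` links to (first batch only)."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "pllimit": "max",    # up to 500 links per request
        "plnamespace": 0,    # article namespace only, no Talk: or Category: pages
        "redirects": 1,      # follow redirects like Toenail -> Nail (anatomy)
        "format": "json",
    }
    page = next(iter(requests.get(API, params=params).json()["query"]["pages"].values()))
    return {link["title"] for link in page.get("links", [])}

def shared_links(a, b):
    """How many pages do a and b both link to?"""
    return len(outgoing_links(a) & outgoing_links(b))

print(shared_links("Adolf Hitler", "Erwin Rommel"))  # a bajillion (well, a lot)
print(shared_links("Adolf Hitler", "Toenail"))       # not a bajillion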

Anyhow, I'm thinking about Watson's famous game. If I were writing a ghetto-rigged laptop-Watson, nothing I could ever produce could ascertain that "Chicks Dig Me" was a category about female archaeologists. But a pretty amateurish AI using a Wikipedia SQL dump and the connections therein would be kicking ass in the category after seeing the first question. And it'd be lightning fast. I think I could still beat Ken Jennings.

My cousin and soul's-friend Levi took a Wikipedia SQL dump (haha, took a dump) with him when he went to Namibia for the Peace Corps. The download is surprisingly small. Levi's also done language stuff with Wikipedia dumps, back when he was competing for the Netflix Prize, so it's doable, processing-wise. Also, this has nothing to do with anything, but when he was in Namibia, living in a mud hut with no internet connection and a solar charger, he wrote a blind chess AI, took it to the ICGA Olympiad in Pamplona, and won the silver medal with it, beating the Beijing University computer science team. I love that shit. It's like Iron Man inventing that suit in a cave.

It's like this: after we have the first question, we can isolate the words that fall below a certain frequency threshold in a corpus of written English. I do this all the time to isolate domain-specific vocabulary for English for Special Purposes curricula.

Here are the 5,000 most common English words. You'd want a different threshold than that (I'd start with 10,000), but we could tinker and find out where the bar needs to be. You can build a list like this using a few lines of Python and some Project Gutenberg texts, or you can just download a better one.
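For the curious, the list-building part really is just a few lines (this assumes you've dumped some plain-text Project Gutenberg books into a gutenberg/ folder; the folder name and the 10,000 cutoff are just my starting guesses):

import re
from collections import Counter
from pathlib import Path

def build_common_words(folder="gutenberg", top_n=10000):
    """The top_n most frequent words across the .txt files in folder."""
    counts = Counter()
    for path in Path(folder).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        counts.update(re.findall(r"[a-z']+", text))
    return {word for word, _ in counts.most_common(top_n)}

def rare_words(clue, common):
    """Everything in the clue that didn't make the common-word cut."""
    return [w for w in re.findall(r"[A-Za-z']+", clue) if w.lower() not in common]

common = build_common_words()
print(len(common))  # should be 10,000, assuming the corpus has that many distinct words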

Now we have our question:


Kathleen Kenyon's excavation of this city mentioned in Joshua showed the walls had been repaired 17 times. 

Once we strip the common words we have:


Kathleen Kenyon, excavation, Joshua

If we were to look for the page with the strongest linkage, Hitler Hops-wise, to these three things, we'd probably (we would, I just ran a test) already have the answer "Jericho." And since our search can function by starting with the Wikipedia pages associated with these words and spidering out, we're not doing an ungodly amount of processing; we're not wasting our time combing through pages about the Andromeda Galaxy and snakes indigenous to South America.
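Here's roughly what I mean by spidering out, in the same API-instead-of-dump style as before. The seed titles are my guesses at the right page names ("Excavation (archaeology)" in particular might need adjusting), and a one-hop vote count stands in for real linkage scoring:

from collections import Counter
import requests

API = "https://en.wikipedia.org/w/api.php"

def outgoing_links(title):
    """Titles of the articles that `title` links to (first batch only)."""
    params = {"action": "query", "prop": "links", "titles": title,
              "pllimit": "max", "plnamespace": 0, "redirects": 1, "format": "json"}
    page = next(iter(requests.get(API, params=params).json()["query"]["pages"].values()))
    return {link["title"] for link in page.get("links", [])}

def best_linked_pages(seeds, top=10):
    """Score every page one hop away by how many of the seed pages link to it."""
    votes = Counter()
    for seed in seeds:
        for target in outgoing_links(seed):
            votes[target] += 1
    for seed in seeds:
        votes.pop(seed, None)  # a seed shouldn't answer itself
    return votes.most_common(top)

print(best_linked_pages(["Kathleen Kenyon", "Excavation (archaeology)", "Book of Joshua"]))
# Jericho ought to be at or near the top of this list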

But just in case we're not already to Jericho, Jeopardy questions give us an extra category key. In most questions, we look for the noun phrase that follows the word "this", or we look for the words "she" or "he", which indicate that the answer should be a person. (There are other kinds of Jeopardy questions, like fill-in-the-blank answers, which are pain-in-the-ass exceptions that have to be treated with their own algorithms, but they're thankfully finite.)
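The easy version of that category key is a dumb regex, which I'd expect to cover questions like this one and whiff on plenty of others:

import re

def category_key(clue):
    """A crude guess at what kind of thing the answer should be."""
    m = re.search(r"\bthis (\w+)", clue, re.IGNORECASE)
    if m:
        # just the single word after "this"; a real version needs noun-phrase
        # chunking so that "this mystery author" doesn't come back as "mystery"
        return m.group(1).lower()
    if re.search(r"\b(he|she|his|her|hubby)\b", clue, re.IGNORECASE):
        return "person"
    return None

print(category_key("Kathleen Kenyon's excavation of this city mentioned in "
                   "Joshua showed the walls had been repaired 17 times."))  # city
print(category_key("He painted the ceiling of the Sistine Chapel."))        # person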

So we have:

Kathleen Kenyon's excavation of this city mentioned in Joshua showed the walls had been repaired 17 times. 

Nodes to spider out from: Kathleen Kenyon, excavation, Joshua. Category: city. The answer has to be a city.

Now this kind of categorization is something that I can't sketch a flow-chart for, but it's been done. If it requires a huge database and a lot of processing, it's the Achilles' heel of my process, and this whole idea is wrong (unless we don't need the category).
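If I had to fake the category check, I'd probably just test whether the category word (or a crude plural of it) shows up in the candidate's Wikipedia category names. That's a loud assumption (that category names are a decent proxy for "is this thing a city?"), and real Watson does something far fancier, but as a sketch:

import requests

API = "https://en.wikipedia.org/w/api.php"

def page_categories(title):
    """The category names a page sits in, lowercased."""
    params = {"action": "query", "prop": "categories", "titles": title,
              "cllimit": "max", "redirects": 1, "format": "json"}
    page = next(iter(requests.get(API, params=params).json()["query"]["pages"].values()))
    return [c["title"].lower() for c in page.get("categories", [])]

def matches_category(candidate, key):
    """Does the category key (or its crude plural) appear in any category name?"""
    forms = {key, key[:-1] + "ies" if key.endswith("y") else key + "s"}
    return any(form in cat for cat in page_categories(candidate) for form in forms)

print(matches_category("Jericho", "city"))          # hopefully True, via "...cities..."
print(matches_category("Kathleen Kenyon", "city"))  # hopefully False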

I look at most questions, and this method seems to kick ass. I look at some (like the next question in Chicks Dig Me), and it fails:

This mystery author and her archeologist hubby dug in hopes of finding a lost Syrian city of Arkesh.
Hubby is rare enough to pass my threshold test, but takes gravity away from the semantic area we need to be in, and won't have any better connection than random to the answer (Agatha Christie). (Thankfully, though, it also doesn't have a Wikipedia page, and we're saved from it by good luck.) Even if I could handle categories pretty well, I probably wouldn't have them down to a resolution that could handle "mystery authors" without having a fuckload of data and processing power. Worst of all, Arkesh doesn't exist anywhere on the internet. Like, seriously. And there are big pages about Agatha Christie and her archaeology. It almost makes me think that the Jeopardy question is wrong. Whatever the case, Watson got it, and with linking only from "archaeologist" and "Syrian" and no category, I'd be toast.

But for most questions, I'd still kick some ass.

The next one:

At the Olduvai Gorge in 1959 she and hubby Louis found a 1.75 million year old Australopithecus boisei skull. 
I'd kill it.

The next:

Harriet Boyd Hawes was the first woman to discover and excavate a Minoan settlement on this island.
Yep, I'd kill it.

So I'm wondering if this is a Pareto Principle thing, where Watson needs three million dollars and a supercomputer and a team of PhDs working for years--thousands of times the effort needed for my system--just to pick up an extra 20% in accuracy. Or maybe the problem is that they were incredibly silly: their system is godlike at turning Jeopardy questions into database queries, and they have an awesome database, but they totally neglected to go for the low-hanging fruit of all the meta-textual semantic associations we've piled up in places like Wikipedia.

Whatever the case, if I can get a normal professorship where I get summers off, I'd like to try to make myself a Star Trek computer/Watson/Jarvis/HAL thing. Well, at least one that can sometimes answer Jeopardy questions. I mean, we'll all have them in a few years, but it'd be cool to be one of the first.
