Introduction to Traitor


Uploaded by xsysstar on 17.01.2013

Transcript:
Hello. We want to introduce Traitor to you: our submission for the Norvig Web Data Science
Award. This Award requires us to use 'Big Data' in some interesting way. Over 25TB
of web pages were made available to us to process with a cluster of computers running
programs that we design.
We want to tell you something about the background of Traitor and then give you a small demonstration.
You can try Traitor yourself at http://evilgeniuses.ophanus.net.
What do some concepts have in common? Or how are they different? How can we refine search
queries? What is the context of some sentence? How can we categorise concepts in taxonomies?
To answer such questions, we need to know about associations between words (or concepts).
We humans do this intuitively and generally rather well. But can a computer do this for
you too?
We created Traitor to find out.
Sentences are among the smallest text units that humans use to make 'statements' about
some 'objects' or 'concepts'. Other people can interpret sentences, deduce their meaning
and form associations, but computers are notoriously bad at this. So, Traitor takes a different
approach.
For some sentence, Traitor assumes that words therein are, 'in some way', associated
with each other. Well, only the *non-trivial* words. For example, the sentence "Mary had
a little lamb and now it is gone" is reduced to "Mary lamb gone". Then, Traitor associates
"Mary" with "lamb", "lamb" with "gone" and "Mary" with "gone". We call these 'co-occurring
word pairs'.
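
As a rough illustration of this step, here is a minimal Python sketch. It is not Traitor's actual code: the stop-word list and the function name cooccurring_pairs are assumptions made for this example, since the transcript does not say how trivial words are recognised.

```python
from itertools import combinations

# Hypothetical stop-word list (an assumption); just enough to reduce
# the example sentence to "Mary lamb gone".
STOP_WORDS = {"a", "and", "had", "is", "it", "little", "now"}


def cooccurring_pairs(sentence):
    """Return the unordered pairs of non-trivial words in one sentence."""
    # Lower-case each word and strip surrounding punctuation.
    words = [w.strip('.,;:!?"\'').lower() for w in sentence.split()]
    # Keep each non-trivial word once, in sorted order, so that
    # ("lamb", "mary") and ("mary", "lamb") count as the same pair.
    kept = sorted({w for w in words if w and w not in STOP_WORDS})
    return list(combinations(kept, 2))


pairs = cooccurring_pairs("Mary had a little lamb and now it is gone")
for left, right in pairs:
    print(left, right)
```

Sorting the words before pairing them is one simple way to treat a pair as unordered; other representations would work equally well.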
This may seem a rather imprecise way of associating words, but 'the power of numbers'
applies here. Traitor scans over 25TB of web pages containing a tremendous number of unique
sentences. By aggregating and counting all co-occurring word pairs for *all* sentences
in *all* HTML-documents, some word combinations should occur far more often than others. So,
if a *lot* of sentences contain both 'Samsung' and 'Apple', Traitor concludes there must
be a strong association between these two words. Conversely, a sentence that contains
both 'bicycle' and 'muesli' is pretty rare, so these words are not associated.
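
The counting idea can be sketched in a few lines of Python. This is only a single-machine illustration under stated assumptions: the four sentences and the stop-word list are invented for the example, and the real system processed the web pages on a cluster rather than in one loop.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "and", "the", "is", "for", "with", "my", "i", "at"}

# Invented toy sentences; these are not taken from the real data set.
SENTENCES = [
    "Samsung and Apple compete for the smartphone market",
    "Apple sues Samsung",
    "Samsung overtakes Apple",
    "I ate muesli and fixed my bicycle",
]


def sentence_pairs(sentence):
    """Yield the unordered pairs of non-trivial words in one sentence."""
    words = {w.strip('.,;:!?"\'').lower() for w in sentence.split()}
    kept = sorted(w for w in words if w and w not in STOP_WORDS)
    return combinations(kept, 2)


def count_pairs(sentences):
    """Aggregate pair counts over all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence_pairs(sentence))
    return counts


counts = count_pairs(SENTENCES)
print(counts[("apple", "samsung")])   # pair seen in three toy sentences
print(counts[("bicycle", "muesli")])  # pair seen in one toy sentence
```

With enough sentences, frequent pairs stand out from incidental ones by their counts alone.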
Traitor first populates a database with 'co-occurring word pairs' and their counts. Then, we can
use a Traitor web application to search and visualize this data. So, let's try using it
to answer some questions.
Q1: What do Apple, Samsung, Microsoft and IBM have in common, if anything?
We will use our search interface to simply search for "apple samsung microsoft ibm".
This causes Traitor to search for *other* words that are associated with *each* word
in our query. Not surprisingly, Traitor deduced they have something to do with software.
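
A query of this kind could be sketched in Python as an intersection of association sets. The association table below is invented toy data, and ranking by the weakest link is an assumption made for this example; the transcript does not describe how Traitor orders its results.

```python
# Toy association table (invented): word -> {associated word: pair count}.
ASSOCIATIONS = {
    "apple": {"software": 40, "iphone": 90, "fruit": 15},
    "samsung": {"software": 25, "galaxy": 80, "phone": 60},
    "microsoft": {"software": 95, "windows": 85},
    "ibm": {"software": 30, "mainframe": 20},
}


def common_associations(query, table):
    """Return words associated with *each* word in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Intersect the association sets of all query words.
    shared = set.intersection(*(set(table.get(t, {})) for t in terms))
    shared -= set(terms)
    # Assumed ranking: strongest weakest-link first, ties alphabetical.
    return sorted(shared, key=lambda w: (-min(table[t][w] for t in terms), w))


result = common_associations("apple samsung microsoft ibm", ASSOCIATIONS)
print(result)
```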
Q2: What does Java have that PHP does not have?
Searching for "java -php" gives us the answer; the "minus symbol" tells Traitor to *avoid*
associations that also have anything to do with PHP. The result from this query is
interesting: associations with the Java programming language (applets, jvm), Java coffee and the
Java island are among the results. Traitor does not distinguish between different meanings
of words.
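
The "minus symbol" can be read as a set difference. The Python sketch below shows that reading on an invented toy table; the function name and the exact parsing of the query are assumptions for this example, not Traitor's implementation.

```python
# Toy association table (invented): word -> {associated word: pair count}.
ASSOCIATIONS = {
    "java": {"applets": 30, "jvm": 25, "coffee": 12, "island": 8,
             "mysql": 20, "web": 40},
    "php": {"mysql": 50, "web": 45, "wordpress": 35},
}


def query_with_exclusions(query, table):
    """Shared associations of the plain words, minus those of '-' words."""
    include, exclude = [], []
    for token in query.lower().split():
        if token.startswith("-"):
            exclude.append(token[1:])
        else:
            include.append(token)
    if not include:
        return []
    shared = set.intersection(*(set(table.get(t, {})) for t in include))
    # Drop everything that is also associated with an excluded word.
    for term in exclude:
        shared -= set(table.get(term, {}))
    shared -= set(include) | set(exclude)
    return sorted(shared)


result = query_with_exclusions("java -php", ASSOCIATIONS)
print(result)
```

Note that "coffee" and "island" survive in this toy result as well, because nothing in the sketch separates the different meanings of "java".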
Q3: Which extra terms could refine the search query "lance armstrong"?
Entering "lance armstrong" yields results such as "tour", "cancer", "france", "foundation",
"cycling" and "livestrong". As you can see, these are different aspects of "lance armstrong".
Suppose we are interested in "livestrong". Querying for "lance armstrong livestrong"
yields "cancer" as top result, which seems a reasonable association.
Q4: What is the context of a sentence, such as "Food, games and shelter are just as
important as health"? We separate the words in the sentence by "plus
symbols" to tell Traitor that *every* association of each individual word contributes to the
overall context of the sentence. Without "plus symbols" Traitor would only return associations
that *all* words had in common. The example sentence returns "care", which is somewhat
reasonable.
A more convincing example is the query "haskell + is + better + than + python + and + java".
Traitor deduces this sentence is about "code", "language", "applications" and "programming"--all
true.
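
One way to picture the "plus symbol" is as a union in which every word contributes its own associations. In the Python sketch below the table is invented toy data, and adding up the counts per associated word is an assumption; the transcript only says that every association contributes to the overall context.

```python
from collections import Counter

# Toy association table (invented): word -> {associated word: pair count}.
ASSOCIATIONS = {
    "haskell": {"code": 10, "language": 12, "programming": 9},
    "python": {"code": 30, "language": 20, "snake": 5},
    "java": {"code": 25, "applications": 18, "coffee": 6},
}


def context_of(query, table):
    """Combine the associations of every '+'-separated word in the query."""
    terms = [t.strip().lower() for t in query.split("+") if t.strip()]
    totals = Counter()
    for term in terms:
        # Words missing from the table simply contribute nothing.
        totals.update(table.get(term, {}))
    # The query words themselves are not part of the answer.
    for term in terms:
        totals.pop(term, None)
    return totals.most_common()


context = context_of("haskell + python + java", ASSOCIATIONS)
for word, score in context[:3]:
    print(word, score)
```

Unlike the query without "plus symbols", a word such as "applications" can appear here even though only one of the query words is associated with it.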
Q5: How can brand names be organized in taxonomies? Let's see the associations of various brands.
As you can see, the companies that are linked with each other tend to be competitors or
colleagues in some way. Google, Yahoo and Microsoft have search engines. Oracle, IBM and Microsoft
offer business solutions. AMD, Intel and Nvidia produce computer hardware. Honda, Toyota,
Mercedes and BMW produce cars.
This concludes our introduction and demonstration of Traitor. If you wish to try Traitor yourself,
please visit http://evilgeniuses.ophanus.net/. The "About" page contains more information,
the source code and our paper (with our e-mail addresses).
Thanks for listening!