Authors@Google: Ben Shneiderman


Uploaded by AtGoogleTalks on 09.08.2011

Transcript:
>>Ben Shneiderman: A little bit of a souvenir from Maryland's Human Computer Interaction
Lab, a T-shirt in our bright green so it goes with your usability green leaves.
>>Male Speaker: Thank you very much.
>>Ben Shneiderman: Thank you for hosting me here, and I guess I'll just at least show you
the Maryland caps so we're very proud of the relationship of Maryland with Google. Of course
Sergey Brin was one of our graduates in 1993. His father's a professor of math at Maryland
and Sergey graduated from Maryland and he did go to Stanford, didn't graduate from there,
but he is a graduate of Maryland. So we're pleased by all that and [laughs] I'm pleased
with the connections to Google and those have been expanded recently thanks to Dan Russell
for his help in making some of those connections get stronger.
And I'm a professor of computer science; I did traditional database file design optimization
techniques as my early work, but I've grown up to do something a little bit different
and cross over to the side of integrating psychology with the computing. And I'm proud
of the 28 year old, or 28 year young, Human Computer Interaction Laboratory that's interdisciplinary,
shared between the Computer Science Department and the College of Information
Studies, but also has these other wonderful connections around campus with different departments
including the wonderfully titled Maryland Institute for Technology and the Humanities
(MITH). And so you can visit our website, more than 600 papers, 200 videos, 40 pieces
of software and so on. And we'll take a visual look a little bit later at some of the representation
of those tech reports.
I hope some of you know me from "Designing the User Interface," the book. The first edition
was in 1986 and it's grown and changed and tried to represent what happened. And my collaborator
for the fourth and fifth editions is Catherine Plaisant who's been working with me as a researcher
for more than 20 years. And this fifth edition represents a substantial change, even from
the times of the fourth edition when YouTube was just getting started, Facebook didn't
exist, Twitter was not even an idea, and now we have five billion cell phone users and
a remarkable transformation where the social media are bridging out to be evermore impactful.
So a lot of our effort has shifted towards studying these social media and trying to
understand the complex networks that are embedded in these social media with the goal of course
of facilitating the beneficial outcomes and limiting the negative ones: the malicious
attacks and use for nefarious purposes that endanger the whole venture of social media. And yet
the attractions are very strong, and that's where we're gonna go today. And I'm gonna
focus narrowly on the use of information visualization and analysis for understanding these complex
networks.
So we had started in this book, now 12 years old, tried to define the topic, and it had
the subtitle given by my co-authors Stu Card and Jock Mackinlay to call it Using Vision
to Think. That is, your eyes are not just input devices but ways of solving problems;
the optic nerve and your mind are well designed to pick out, in a few hundred milliseconds,
trends, clusters, gaps, and outliers if the visual representations are appropriate
in terms of colors, size, shape, and proximity.
So there remains a set of challenges, but people are getting better at understanding this.
The topic of scientific visualization has been around for 50 years, but information
visualization only about 16 or 18 years by way of the conferences that use that title.
Visual analytics has grown up in the last six years as a major topic and it really provides
an encompassing view where information visualization's usually seen as the technology but there's
a framework of getting the problem, getting the data, choosing the hypotheses, working
on the insights, developing and understanding, preparing a decision, making a presentation.
So that's the sort of the end to end effort.
I'll sort of start by showing off a little, at the risk of showing off a little, but this was
done by somebody else, Katy Borner at Indiana University. These are all the papers in the
world of InfoVis as of the 2004 collection of papers. And what you see here are the authors.
The number of papers they wrote is the size of their circles, and you see the number of
co-authorships. There's a very tight working group of my partners from PARC, Jock Mackinlay,
Stu Card, and George Robertson, who subsequently went to Microsoft.
And there's my circle and I work with lots of students. So that's the difference between
an industrial research group that's stable, funded, and produces good work over many years
and an academic group which has a lot of traveling people who come and go. Now and then you might
see Martin Wattenberg who I work with who's now working at Google and I think he's out
there somewhere, but I'm sure he's probably writing good code and trying to take the good
ideas he's developed and put them to work for Google.
You may also notice other familiar names like Mark Weiser who was a faculty member at Maryland
before he went to PARC and did a lot of important work there, but a lot of people you'll see
around here.
You can also distinguish the clusters that may be relevant or interesting to you. There's
the Carnegie Mellon bunch over here; we have UC Berkeley; you have University of Minnesota,
there's Georgia Tech. You come to know these and finding the meaningful clusters becomes
one of the natural things that you wanna do.
So this slide is sort of the motivator for the idea of thinking about networks and deriving
visual analytics from a single graphic design. Now of course you're going to apply statistical
methods as well to not only develop the visualization but confirm the kind of hypotheses that you
might be getting from these visualizations.
That book, the InfoVis book, brings the link between myself and Card and Mackinlay.
Alright, so just a further few words about information visualization, few examples, and
then my main focus today is about network visualization using the tool that we've developed
over the last four years called NodeXL. So I hope visualization's well in place here,
but to push it along there's the companies that got started five and ten years ago, got
bought by the medium-sized companies, and then the medium ones by the big companies.
Google bought Gapminder and maybe other visualization companies that I don't know about, but there's
been a whole cascade of the purchase and the integration of visualization into many companies.
Our own work was by way of Spotfire. Any Spotfire users here? Alright! Great! And Spotfire was
the work of myself and Chris Ahlberg and published in CHI Conference of 1994, remains one of
the most highly cited papers in that literature and he formed the company in '97. I was part
of the board of directors for five years, it grew to 200 people, and was bought in 2007
by TIBCO.
It remains probably the premier visualization tool and this slide encompasses a rather simple
version which shows you one of its features, The Guide where the author here, Nick Thomas,
wrote a little explanation of what's here and provided some guidelines by which you
could do further analyses.
The multiple coordinated windows are linked: a 3D scattergram, a heat map over here, a plate
map over here. And the significant discovery here that Nick Thomas made from this slide
was the unusual or surprising activity of the RBP1 gene which was not known to be as
active and it's important because it influences retinol and retinol in turn influences embryonic
development and vision.
The other aspects of the sort of standard visualization tool kits are sets of sliders
and controls. Here are the double box sliders that we call dynamic queries and have propagated
into lots of places but not everywhere. They're still not part of standard tool kits, but
they're widely used on Web applications and others, some of which I'll show.
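The dynamic queries idea, where each double box slider bounds one attribute and the display updates as you drag, can be sketched as a simple range filter. This is a hypothetical illustration, not Spotfire's actual code, and the field names are invented:

```python
# Minimal sketch of dynamic queries: each "slider" is a (low, high)
# range on one attribute; a record is shown only if every attribute
# falls inside its current range. Field names are hypothetical.
def dynamic_query(records, ranges):
    """Return the records that satisfy every slider range."""
    def visible(rec):
        return all(lo <= rec[field] <= hi for field, (lo, hi) in ranges.items())
    return [rec for rec in records if visible(rec)]

homes = [
    {"price": 250, "bedrooms": 3},
    {"price": 480, "bedrooms": 4},
    {"price": 320, "bedrooms": 2},
]

# Dragging the price slider to 200..400 and bedrooms to 3..5
# immediately narrows the display to one matching home.
print(dynamic_query(homes, {"price": (200, 400), "bedrooms": (3, 5)}))
```

The point of the technique is that filtering is continuous and reversible: every slider movement reruns the query and redraws, so the user explores by direct manipulation rather than by composing query strings.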
And so the sort of basic ideas were put in place, but Spotfire's a high end tool used
by a lot of ambitious organizations, intelligence, and analysts. Its main success story was pharmaceutical
drug discovery. All 25 of the leading pharma companies use that as their process for weeding
out the hundred thousand pharmaceutical compounds into the three that they're gonna run clinical
trials on. A company like Pfizer has some 3,000 people
who use Spotfire in rather systematic ways; similar for AstraZeneca or Novartis and so
on.
But visualization's used in lots of places and getting more pixels is the desirable thing.
As I walked around the floor here I saw people with two large displays each, that's probably
about right. But getting higher resolution we've moved towards larger, not larger in
inches but larger in pixels, so that we can get more on the screen and read them carefully
without having to move your body too far left or right and still be able to read all the
text, but when the detail gets rich and interesting you can see it all there at once. So these
larger displays are of growing fascination for some people, and of actually growing benefit
to them.
Also visualization on small displays is becoming an important topic and so people are looking
for the ways to be able to explore data on megapixel or smaller displays.
So the lessons we learned in that 10 year process are summarized in this phrase that
I playfully labeled as the mantra and I wrote it in a paper just like this, 10 copies of
it or 12, 10 copies where each line represented a project where we struggled for weeks or
months to come up with a design. And it was overview first, show the user the whole data
set whether it's a million or a billion items, see the whole data set, see the whole board:
where has the data clustered, where are the outliers, where are the anomalies. Ok? And
then allow the user to zoom in on what they want, filter out what they don't want, and
then click for details on demand; that became the principle.
And the application of this principle went through a variety of domains so the scientific
visualization I've mentioned has been around a long time, but it had a very different flavor
than these newer InfoVis problems. The sci vis problems were meant to show two and three
dimensional worlds and the question was where? Where does the constriction in the arteries
influence blood circulation? Where is the turbulence greatest in the air flow over the
wing of an aircraft? And you wanted to know things were left or right or up or down or
east or west or inside or outside.
By contrast, InfoVis is a very different set of problems and it's not so much where but
you get a new set of questions. So I've talked about multivariate data, which is what Spotfire
and its competitors, now including Tableau, handle; there's a variety of other tools out there. They try
to show you, when you have high dimensional data sets, a series of multiple two-dimensional
projections to give you an understanding of what the relationships, the clusters,
gaps, and outliers might be.
For temporal data you might have patterns that rise: stock market information, gene
expression data, and when Google goes up does Microsoft go up or does Microsoft go down?
Or does Apple go up or down? Is there a relationship that might be interesting, that might be useful,
that might be significant? So that is most commonly seen in temporal data that has continuous
values like stock market closing prices, but as I'll show you we also look at places where
there's categorical data like patient event histories. Patients are admitted to a hospital,
they have a surgery, they get treated, they receive the medication, they're sent home.
And trying to track the patterns in categorical event streams like Web Log Data we think is
a significant and important problem.
Tree structured data is another common space I'm gonna show that quickly. But my main focus
today is network data.
So just to motivate the idea of information visualization I give you this little challenge:
Anscombe was a statistician at Yale and he developed this little example: four groups
of 11 data points each. And I'll just pause a moment and ask you, do you see anything
interesting in these four groups?
[pause]
It's a little audience participation here.
[pause]
>>male #2: [inaudible]
>>Ben Shneiderman: There's a big outlier in four and I think you're pointing out to this
19 here, right? That one stands out. How did you catch that?
>>male #2: Very visually distinct.
>>Ben Shneiderman: It's visually distinct, right? It wasn't just that the number was
distinct, if that was seven I think you would have had a hard time, or six maybe you would
have had a hard time seeing it. All the other ones here are eight which you might comment
on also. Anything else, anybody see anything interesting?
[pause]
>>voice in audience: [inaudible]
>>Ben Shneiderman: Very good. The X columns in one, two, and three are exactly the
same. So that takes a while to spot. Can you, would you be able to draw these curves?
[pause]
It'd be a little tough and it'll take a while. You probably could, but it would take a while.
So you say, "I'm a statistician, I can do data mining, I can do statistics." So you're gonna
apply statistics, and the means, standard deviations, variances, and correlation coefficients are
all the same. So the point of this, of Anscombe's Quartet, was that the statistics don't always
tell you what's happening and that the blind application of statistical methods and data
mining and machine learning is not enough.
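The identical summary statistics are easy to verify. Here is a quick sketch using Anscombe's published values, with the statistics computed from scratch rather than with any statistics library:

```python
# Anscombe's Quartet: four data sets with nearly identical summary
# statistics but completely different shapes when plotted.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

for xs, ys in quartet:
    # Every group: mean x = 9.0, mean y ~ 7.50, correlation ~ 0.816.
    print(sum(xs) / 11, round(sum(ys) / 11, 2), round(pearson(xs, ys), 2))
```

All four groups print the same summary numbers, which is exactly why only the plot reveals the linear, quadratic, and outlier-dominated shapes.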
These four are not special; there are 40 more or 400 million more that I could create that
would have the same statistical properties. Okay? And so it may be helpful to look at
them and if you take a look in a few hundred milliseconds, you ready? One, two, three,
there we go. You immediately grasp and potentially will remember what each of these patterns
are about.
The first one being a sort of low, linear, positive correlation; the second being a very
high correlation, quadratic in nature though; and then the third one is a straight linear
correlation with one outlier or anomaly? I would call it an anomaly. Outliers are usually
defined as three standard deviations from mean in a normal distribution, but I almost
never get to see a normal distribution and so having something weird like this indicates
probably, I would say, an error in the data.
And I would say visualization is beneficial even if you only use it for data cleaning.
The majority of data sets that people bring to me in my office have something wrong with
them that people just don't know about.
I had 6,400 hospital admissions or hospital emergency room visits and the age was recorded
and studied by male and female, admitted and sent home, insurance, non-insurance. It turned
out three of those patients were 999 years old and nobody knew it and they ran the data
analysis on that.
Other cases, I had years' worth of data given to me, time series data, but April was
missing and they never knew this.
Other cases, [clears throat] few hundred lines of data [clears throat] and 41 lines were
copied exactly the same twice in the data set. So the data sets we get to see have a
lot of varied kind of errors and the visualization will help people understand them.
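Each of those three problems, impossible values, missing periods, and duplicated rows, is cheap to check for programmatically before any analysis runs. A sketch with made-up records (the column layout and values here are invented, not the actual hospital data):

```python
# Three quick sanity checks on a (made-up) admissions data set,
# mirroring the errors above: impossible ages, a missing month,
# and exact duplicate rows.
from collections import Counter

rows = [
    ("2006-01", 34), ("2006-02", 55), ("2006-03", 999),  # impossible age
    ("2006-05", 41), ("2006-05", 41),                    # duplicate row; April absent
]

bad_ages = [r for r in rows if not 0 <= r[1] <= 120]
months_seen = {r[0] for r in rows}
missing = sorted({f"2006-{m:02d}" for m in range(1, 6)} - months_seen)
dupes = [row for row, n in Counter(rows).items() if n > 1]

print(bad_ages)   # the 999-year-old patient
print(missing)    # the month nobody noticed was gone
print(dupes)      # the row that appears twice
```

Visualization makes the same defects leap out at a glance, but checks like these make the cleaning repeatable.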
I'm not gonna do demos here but we could take a look or you can download it for free, we've
worked on time series data that looks like stock market data or others and finding peaks
and valleys, finding stable portions, finding growing ranges were all things we were able
to do in a very systematic way by visual presentation.
We extended this in TimeSearcher 2, also free to download, so that you could look at tens
of thousands of points even on a thousand point wide display and search for common patterns.
Here we're looking at some weather data in Italian cities. So we're looking at the sunlight,
this is over a five year period. You can see the seasonal variations and here's a two month
period selected from the end of August to the end of October, that's expanded over here.
We've got two cloudy days and seven sunny days in between. We search for patterns like
that and the triangles in the overview show me where those things occur and I can go looking
for things that are similar.
We also looked for patterns, as I mentioned, in temporal event sequences; this is our sort
of hot breakthrough area these days, although this early example goes back now more than
10 years of looking at patient visits to physicians, hospitalizations, lab tests, then medications.
And so if you click on the sonogram you see the sonogram, there's the fetus. If you click
on the x-ray you'll get to see the x-ray. And the today line shows you where we are:
this patient is pregnant and is predicted to be pregnant for another two, three months;
has been diagnosed as having diabetes; receiving insulin; increased dose of insulin; and last,
likes to counteract the side effects.
So getting an overview of a patient in one screen rather than 103 pages of documentation
is proving to be a significant advantage and as we move to electronic medical records that's
become a big issue for us in trying to study how to build the right kind of designs.
We've also taken on the problem of, if I have a million or 10 million patients, how do I
find patterns? And so each of these rows here represents one patient, and here we were looking
for radiology contrast patterns [clears throat], and the pattern we were looking for was low creatinine
before the radiology injection and then high creatinine afterwards. We were able to look
in a group of some 4,000 patients and find 151, I think, who had this particular
pattern of behavior.
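The kind of query behind that finding, one event type followed later by others in a fixed order within each patient's record, can be sketched as a subsequence match. The event names are hypothetical and this is not the LifeLines 2 implementation:

```python
# Find patients whose event sequence contains "low_creatinine",
# then "contrast_injection", then "high_creatinine", in that order;
# other events may occur in between. Event names are hypothetical.
def matches(events, pattern):
    it = iter(events)
    # "step in it" advances the iterator, so the pattern steps must
    # be found in order for all() to succeed.
    return all(step in it for step in pattern)

patients = {
    "p1": ["admit", "low_creatinine", "contrast_injection", "high_creatinine"],
    "p2": ["admit", "contrast_injection", "low_creatinine"],
    "p3": ["low_creatinine", "x_ray", "contrast_injection", "lab", "high_creatinine"],
}

pattern = ["low_creatinine", "contrast_injection", "high_creatinine"]
hits = [pid for pid, ev in patients.items() if matches(ev, pattern)]
print(hits)  # ['p1', 'p3']
```

Scaling this scan to millions of records, and letting analysts pose such queries visually instead of in code, is the hard part the tools address.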
So again, free to download, LifeLines 2. And I don't think I have it here, but the new tool
which is really a breakthrough on top of this, called LifeFlow, lets you look at even
a million of any kind of event sequences; you can see 'em on a single
screen. That's proven to be a huge payoff and we're working on a variety of applications
for that, mainly medical, but we've just received support from Oracle to go after that and expand
that topic as well.
Okay, I can't resist showing you some tree maps. If you haven't seen the tree map idea
this goes back now some 20 years and that was our tree map for the Gene Ontology which
was 14,000 genes in a 23 level hierarchy.
I hope you're more familiar with Martin Wattenberg's tree map for SmartMoney, the Map of the Market done in
1999. I was a consultant working with them at the time. And each of the companies
has a rectangle whose size is a function of the market capitalization, and the color indicates:
green means it's rising, red means it's falling. They're organized into 11 industry groups:
health care, financial, energy, technology, and then broken down into two more levels
of hierarchy so you can zoom in on those.
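The core treemap layout is simple to state: divide a rectangle among items in proportion to their sizes, alternating the split direction at each level of the hierarchy. Here is a sketch of the original slice-and-dice scheme with invented sector sizes; the Map of the Market uses a more refined squarified layout, so this is illustrative only:

```python
# Slice-and-dice treemap: split the rectangle among children in
# proportion to size, alternating horizontal/vertical per level.
def treemap(items, x, y, w, h, vertical=True):
    """items: list of (label, size). Returns (label, x, y, w, h) tuples."""
    total = sum(size for _, size in items)
    rects, offset = [], 0.0
    for label, size in items:
        frac = size / total
        if vertical:   # slice along the x axis
            rects.append((label, x + offset, y, w * frac, h))
            offset += w * frac
        else:          # slice along the y axis
            rects.append((label, x, y + offset, w, h * frac))
            offset += h * frac
    return rects

# Four sectors sized by hypothetical market capitalization; a real
# treemap would recurse into each rectangle with vertical flipped.
for r in treemap([("tech", 50), ("health", 25), ("energy", 15), ("finance", 10)],
                 0, 0, 100, 100):
    print(r)
```

Because area encodes size and color encodes change, one screen can carry hundreds of companies, which is what makes the single bright green dot findable.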
And one day this is what the market looked like, so here's another bad day for the market,
pretty red. Anybody see anything interesting?
[pause]
Yeah, I see some fingers pointing. If you take a moment it won't take you long even,
whoops, it won't take you long to see there's one bright green dot, it's not just a little
green, it's bright green. And when I click on that as you click on any one of these stocks
you get a 20 year trading history and a mountain of data about their previous transactions,
insider trading, and so on. And that green dot was the A&P Tea Company, which that
day had bought the Pathmark chain of grocery stores. They were up 6% when everybody else
was down.
And so I like this slide because it makes the point about visualization, which is that
it gives you answers to questions you didn't know you had, okay? Things leap out when your
eye becomes trained to use this, and it does take a little training to understand and see
the hierarchical structure in this two-dimensional space-filling representation.
Other days the patterns are a little more subtle, mostly green for technology, mostly
green for energy, mostly green for financial, mostly red for financial, and mostly red for
health care except one [bad audio] over there.
So you can begin to discern these patterns. I like this one which was a green day if you
like happier news and, oh sorry, over here we see the contrarian gold stocks go differently.
When the market's up the gold stocks go down typically, not always but that's a pattern
I've come to understand. And others, basic materials, defense industries are over here,
and so on, Google over here. I've come to know where the favorites are over time. You
can do this, this is free at smartmoney.com/marketmap.
This was an interesting one not too long ago when the market was pretty green but Sprint
had a huge decline of 16%, that's a lot for one day. And does anybody remember why that
happens?
[pause]
>>voice in audience: [inaudible]
>>Ben Shneiderman: Right, AT&T bought, what did they buy, they bought?
>>male #3: T-Mobile.
>>Ben Shneiderman: T-Mobile right? So that was bad news for Sprint and it tumbled badly
on that day.
So if you know enough of the background domain knowledge you can begin to understand what's
going on and that's kind of the power of these tools.
Others have done tree maps of many kinds. This is Marcos Weskamp's news map which shows
all the world's news based on the Google news aggregator. It simply takes the values from
the Google news aggregator, shows them here U.S. only organized by world news, international,
national, technology, sports, entertainment, and health. And then you can compare U.S.
to U.K., Spain, Netherlands, Mexico, or 20 different countries. So you can see how a
story plays in one country versus another country.
This one I just added last week was the Encyclopedia of Life project you may know about. It's two
million species on planet earth and the idea is to get a webpage for every one of them
and get photographs and data, scientific data, for every one of them as a kind of Wikipedia
project and of course it takes from Wikipedia, but this was the way they organized, or one
of the ways they organized, by a tree map. So I think chordate are where humans are right,
am I right the chordate are spine and then all these other, and you can see the number
of species represents the size.
Spotfire added tree maps a few years ago and that's become a regular staple and very useful
part of that tool extending its value.
New York Times began using tree maps in 2007 and continues to do that for a variety of
applications including the more sort of esoteric ones like the Voronoi Treemap, a nice invention
of a German researcher named Michael Balzer, who within a year of publishing it appeared
in the New York Times, showing the consumer price index in a more organic-looking form.
For scalability there's a million node tree map; it shows all the files and directories on
the servers in the computer science department at Maryland and there's an hour's worth of
discussion of insights here, but this large gray area was about 7% of the storage space
that was not being garbage collected and was just wasted space, which was a surprise.
Other obvious things are three large directories that were simply copied and shouldn't
have been. There are other places where there are six copies of the same directories, and a variety
of other interesting things.
Okay, so that's all the preface. Here's what the talk's about today: analyzing social media.
And I chose Manuel Lima's page, ironically entitled "VisualComplexity," where he has
cataloged more than 700 ways of showing networks. And many are very beautiful, but many are
very tangled bowls of spaghetti, as they're often called, and so while they may look pretty
they don't always provide you the kind of insights you're after.
So we began to develop our own strategies, and this is the Ph.D. work of Adam Perer called
"SocialAction," which we still use. It integrates statistical methods with a search
process, a visual analytics process called "systematic yet flexible," and gives you control
over the visualization.
I'm gonna show one little video, this is my favorite one minute story. It shows the Senate
voting data for the year 2007, the red Republicans and the blue Democrats and every senator is
connected to every other senator by a link so there's about 5,000 links here. The strength
of each link is the number of times they voted the same way. Okay, the number of times, there
were 330 roll call votes and the most similar votes were Cardin and Mikulski, who you may
know are both Maryland senators, they voted the same way 303 times. And then we go down
295, 285, down to 100 or so.
We used the Fruchterman-Reingold layout algorithm and we start jiggling around and we're now
gonna start filtering out the low value ones: 110, 120, 130, 140, 150. So those who aren't
that similar to the others break out and McCain,
[laughter]
who was running for President didn't vote enough times to be similar to anybody, okay?
And then you get this remarkable, unbelievable, too good to be true visualization. Brownback
was running for President and he drifts off separately and it's just amazingly you get
this nice strong separation going on here.
And I'm gonna stop it right here just to draw attention to a few things. You do see the
Democrats have a strong group and the two independents, Lieberman and Sanders, they
may call themselves independent but they're pretty much Democrat and Sanders very much
so, very far from the Republican positions. And way out on the side are Dodd, Biden, Kerry,
and Obama, okay this is before Obama went more center, he was far from the Republican
positions.
Now in the middle you have Olympia Snowe of Maine, Arlen Specter of Pennsylvania, and Susan
Collins, and those three as you can see sometimes get close to the Democrats. In fact
those are the three senators who two years later, in January 2009, voted in favor of the
stimulus bill with the Democratic senators.
So that was a pretty strong, compelling result here and that showed this remarkable, I didn't
expect this strong separation, but there it was and yet you could see somehow these crossover
patterns which really leap out at you.
Any questions about this? Do you see this? Do you understand what you're looking at?
I'm gonna continue rolling it and we're gonna freeze the topology, turn off Fruchterman-Reingold
and now continue filtering to get rid of the similarities that are 190, 200, 210, 220,
and the really similar voting patterns and now you see another remarkable pattern emerging,
and there it is. The Republicans during that voting season had lost their party cohesion,
whereas the Democrats were still pretty well connected, although Dodd, Biden, Kerry, and
Obama were not connected to the Democratic mainstream. If you do a deeper analysis you'll
find that this linkage is tied up a lot with a group of Midwestern Democratic senators
who caucus together and often vote as a bloc.
Within the Republicans there's one left, so any algorithm that was looking for no connection
wouldn't have found it, but there's the one between Isakson and Chambliss, who still have a connection
there. But as you can see the Republican Party did not have the cohesion.
Others have since followed up and done other years going back in time, forward in time,
and this has become a sort of standard method. Chris Wilson was the journalist who brought
this to us for Slate Magazine and he wrote about it in Slate with Adam Perer as well.
Any questions about this before I leave it?
Again I wanna sort of convey to you that when you do have the right layouts, when you do
have the right organization of information, then the insights essentially leap out at
you. It does take some training in getting your eye accustomed to what the normal patterns
are, but once you do that you really can begin to make these discoveries that can be significant.
Okay, so we did that, that was Adam Perer's work.
We still use SocialAction as part of other tools, but today I wanted to take you down
the path and talk about the social media analysis and I liken it to footprints in the sand.
Here's a social media of walking on the beach and with a little skill you could begin to
read these: which direction are they walking, is it a young person, an old person, are they
heavy, are they light, are they walking together, did the same pair of footprints return two
hours later, are they walking a dog, what's going on? How long did it take? So we are just
beginning to learn how to interpret or track these social media.
And the tool we've been building for four years is called NodeXL. It's embedded inside
Excel 2007 and 2010; sorry to say, that means it runs on the PC side. The Mac version of
Excel, even the 2011 version, does not have the foundational tools we need. So we'll have to work on that
and get that to happen.
So the idea is very simple; in the spreadsheet side you just put Ann invited Bob, Bob invited
Carol, Carol invited Dave, and so on and then you say if you can make a bar chart or a pie
chart you can make a network drawing. And it will draw the network and then you can
change the visual properties and make the girls pink and the boys blue and add labels,
etcetera.
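That edge-list-to-network step, two spreadsheet columns of vertex names becoming a graph, is easy to sketch. This shows the general idea, not NodeXL's code; a real drawing would then assign positions with a layout algorithm:

```python
# From a spreadsheet-style edge list to an adjacency structure,
# the first step before a network can be laid out and drawn.
edge_list = [("Ann", "Bob"), ("Bob", "Carol"), ("Carol", "Dave")]

adjacency = {}
for inviter, invitee in edge_list:
    # Treat the invitation as an undirected tie between the two people.
    adjacency.setdefault(inviter, set()).add(invitee)
    adjacency.setdefault(invitee, set()).add(inviter)

# Degree (number of neighbors) is the kind of metric that would then
# drive a visual property such as vertex size.
degree = {name: len(nbrs) for name, nbrs in adjacency.items()}
print(degree)  # {'Ann': 1, 'Bob': 2, 'Carol': 2, 'Dave': 1}
```

The charm of the spreadsheet framing is exactly this: the user only ever types the two-column edge list, and everything else, metrics and drawing, is derived.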
If you do the Senate voting data it works very nicely; it comes with the tool when you
download it, it's free to download, open source and free, and so you can do the Senate voting
and there again we have Specter, Collins, and Snowe right in the middle in this case. Okay?
So I just wanna give you a little tour of the possibilities to tantalize you. The copies
of the book are on the table here; they give you more details and show you how to do it, motivate
it, and walk you through the application in many different places.
Part of NodeXL's charm is the importers, or as we call them spigots, that let you import
your email, import your Flickr, Twitter, YouTube, Facebook, variety of other sources and we're
working with a variety of publishers and organizations who wanna create spigots either for their
internal use or to publish within the NodeXL framework or sometimes an independent one.
So the VOSON system lets you map websites, you give it a seed of whatever 10 or 15 websites
and it will fan out and collect all the network connections of a set of websites and then
you can browse them in NodeXL.
The particular charm is the Twitter importer which has become very hot and I'm gonna show
you a bunch of the Twitter streams and then show you a bunch of other applications.
This is one of our early ones at NYU just a few blocks from here. September 2009 there
was a conference at the Business School called A Workshop on Information in Networks. And
about 80 people assembled and there was a real feeling there was a bunch of sociologists
talking over here and a bunch of computer scientists over there. And my buddy Mark Smith
went to the Twitter stream and downloaded all the Tweets that had the hash tag win09.
And sure enough when you plot the follower network you find the sociologists
right over here and you find the computer scientists right over here and one MIT graduate
student as the bridge between them. It was a very encouraging and fortuitous initial
discovery that we were able to find this pretty interesting pattern. Some of you may recognize
sociologists like Barry Wellman, a sort of important figure in the field, they're sized
by the number of followers.
So we've done this over the few years, this is the Worldwide Web 2010 Conference, a very
different pattern, somewhat larger one where you can see it's a very cohesive group. These
people all know each other quite well and follow each other, they're a large group.
On the bottom we put up the people who are not connected. So only about 5% of the group
were not connected; this is a very different pattern and this nexus of tight connections
over here shows a well-organized, coherent group that actually talks to each other.
By 2011, we had polished our techniques still further and so we're able to apply clustering
algorithms and then put the different groups into separate boxes whose size was determined
by the number of nodes in each box, and so you get a set of researchers including our
own Jimmy Lin over here sized by the number of followers. Tim Berners-Lee has got a big
set of followers so he's over here, but some of his collaborators you may know, Jim Hendler
and Nigel Shadbolt have a separate group. The Brazilians were over here and we were able
to discern patterns and I'll show you some other examples of those, of how we began to
be able to organize and present the information inside of a bowl of spaghetti into meaningful
ways you could see clusters and groups and outliers.
This is --
>>(Dan): So Ben --
>>Ben Shneiderman: Yeah.
>>(Dan): can I interrupt real quick?
>>Ben Shneiderman: Yeah.
>>(Dan): I got a quick question about that one --
>>Ben Shneiderman: Yeah.
>>(Dan): go back one.
>>Ben Shneiderman: Right, there you are.
>>(Dan): The boxes are interesting but they have no apparent connection to the obvious
underlying structure of the visualization. How did you get the boxes?
>>Ben Shneiderman: If you look within each box all the nodes are the same color so we
apply a clustering algorithm to it and we have several embedded in NodeXL, so we'll
do the clustering for you and then you've got one, two, three, four, six, seven, eight
clusters here, and then we'll draw the boxes sized so that the biggest one, the one with
the most nodes in the selection from Twitter, is in the upper left. In this case, this
one's a little messy because it shows the edges that go across the boxes as well.
I'll show you another one later where there's a check box you can use to eliminate the
edges that go across the boxes, so you can look at the structure within each cluster.
So Dan does that explain it?
[pause]
It's a pretty nice --
>>Dan: Yeah --
>>Ben Shneiderman: feature.
>>Dan: I assume that there's a way to actually create a cluster that follows that dramatic
diagonal stripe, the sort of light from the sun god up there in the center --
>>Ben Shneiderman: Yeah --
>>Dan: radiating down below --
>>Ben Shneiderman: Tim Berners-Lee, I'm sorry, that's Tim Berners-Lee. The way it's going
out, these are his close collaborators. The clustering algorithm actually put them in a
separate cluster, not together with him; that's the way clustering algorithms go. But you
can see the preponderance: he's connected to these people very well, Jim Hendler, Nigel
Shadbolt, and others I can't see right now. But we tend to do
these on 3K by 4K displays. On Mark's Flickr Website you can download the full original
one of these and take a look at reading the labels.
Does that get you what you want?
>>Dan: That's, that's --
>>Ben Shneiderman: We do drop all the singletons in a separate bin over here and so we're able
to discern, we did a nice one with soccer teams and they had interesting patterns so
that was kind of fun. I'll show you another one in a little while.
But this feature's our hot new feature and we're just writing a paper on this. This
group-in-a-box idea turns out to be, I think, a very significant way of doing layout better.
When you do this by straight Fruchterman-Reingold you really can't see much, but when you throw
them into clusters and then into the boxes, it really opens up your understanding.
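The group-in-a-box layout described above can be sketched in a few lines of Python. This is a hedged illustration, not NodeXL's actual code: it assumes some clustering algorithm has already assigned each node a cluster label.

```python
from collections import Counter

def group_in_a_box(cluster_of, edges, show_cross_edges=True):
    """Order clusters by size so the biggest box lands in the upper left,
    and optionally drop the edges that cross box boundaries (the check
    box mentioned in the talk). `cluster_of` maps node -> cluster label."""
    sizes = Counter(cluster_of.values())
    box_order = [label for label, _ in sizes.most_common()]  # largest first
    kept = [(a, b) for a, b in edges
            if show_cross_edges or cluster_of[a] == cluster_of[b]]
    return box_order, kept

# Toy example: two clusters joined by one bridging edge.
cluster_of = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
edges = [("a", "b"), ("b", "c"), ("d", "e"), ("c", "d")]
boxes, kept = group_in_a_box(cluster_of, edges, show_cross_edges=False)
print(boxes)  # [1, 2] -- the three-node cluster gets the first, biggest box
print(kept)   # the bridging edge ("c", "d") has been eliminated
```

With the cross-box edges removed, what remains is exactly the within-cluster structure each box displays.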
Alright, this was a simple one where we searched on the term "HCI" and you can see it has
two actual clusters: this one is actually not human computer interaction but human capital
index. And then all of these people are people we knew, including Ed Chi who's just joined.
Is Ed over there? But Ed's joined Google now as well. And then this huge one we sized by
the number of followers is RKMT some of you may know him, it's Jun Rekimoto who's the
head of the Sony Computer Science Lab in Tokyo. And I sent him a note and I said, "Jun, how
did you get 160,000 followers?" And he said well, at one point he was a recommended follower
and so he got 160,000 followers. That's way beyond anyone else in this screen. Although
[inaudible] of Twitter has of course even more than that.
So this showed you a less well developed structure and a lot of singletons,
a lot of people who mention HCI in a variety of circumstances. But they're not
connected up as a group; the community of HCI is not well structured, okay? On the other
hand, the CHI 2010 community did have a pretty good structure and here we're showing it in
yet a different way where we have the number of followers and number of Tweets logarithmically
on the axes, and it's pretty clear that there's a strong cluster in the middle and
a dramatic outlier and that is Zephoria some of you may know her. Danah Boyd, she's a senior
researcher for Microsoft but she has the most number of followers even though she doesn't
have the most number of Tweets. But her Tweets are interesting to read and a lot of people,
about 45,000 people follow Danah Boyd and care about what she says.
The place you wanna be is below the diagonal. My buddy Mark Smith's right over here; he
gets attended to quite well. At this point, I was only a moderately active Twitter person;
I'm over here still below the diagonal; that's the good place to be. The ones above the diagonal
we call losers because --
[laughter]
they Tweet a lot but nobody listens really. And over here is this voice bot which Tweets
a lot but nobody listens. So you can begin to see these patterns of what's going on --
[pause]
So if you Tweet in a forest, does anybody hear? Right.
So this is another way you can begin to look at the data and NodeXL lets you represent
the X, Y positions.
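The diagonal reading of that chart reduces to a simple rule: more followers than Tweets puts you below the diagonal, the good place to be. A minimal sketch in Python, with made-up counts purely for illustration:

```python
def split_by_diagonal(users):
    """Partition users by whether followers exceed Tweets: below the
    diagonal (heard) versus above it (Tweeting into the void).
    `users` is a list of (name, followers, tweets) tuples."""
    heard, unheard = [], []
    for name, followers, tweets in users:
        (heard if followers > tweets else unheard).append(name)
    return heard, unheard

# Hypothetical counts, only for illustration.
users = [
    ("zephoria", 45000, 8000),   # many followers, fewer Tweets
    ("voice_bot", 40, 90000),    # Tweets a lot but nobody listens
]
heard, unheard = split_by_diagonal(users)
print(heard)    # ['zephoria']
print(unheard)  # ['voice_bot']
```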
Now we began to study controversial communities. So last year the oil spill was a controversial
topic and oil spill's sort of an open term so a lot of people talked about it but were
not connected to anybody. In fact, I would say more than 50% of the people were not in
any organized group. But here we see the pattern of controversy: you have one group
that's well connected here which is USA Today and NPR and other climate related groups,
whereas the group below are the climate skeptics, so they provide a different perspective. They're
well connected, with only a few bridging edges that take them across.
So concern about polarization in the political sphere is often raised, and you can
see it lightly over here and more dramatically in this case. In January we began
to do work on the political issues with just the keyword GOP, Grand Old Party for
those who don't know, the Republican Party: all the Tweets that had GOP in them.
And then we used the clustering algorithm and the layout to produce this dramatic and
very stereotypic picture of controversy. You have one large cluster which we made red,
which are all the Republicans who are well connected with each other and then the people
who were saying negative things about GOP, and we painted them blue, they were the Democrats.
They're a smaller bunch in this case; not a surprise because we searched for the term
GOP and so you do get some bridging things.
And the largest node, here we rank by the betweenness centrality metric, is over here,
Politico, which is the political blog in Washington and it's followed both by Democrats and Republicans.
The clustering algorithm painted it with the blue Democrats but as you can see it's a good
bridging organization. The other high betweenness centrality groups beyond that are
conservative groups. And we've repeated this for the Pew Internet and American Life
Project, doing a study of the State of the Union address and the Tweets related to Obama,
pro and negative, and then the follow-on Michele Bachmann and Paul Ryan presentations
that followed. So that's about to appear; it'll be on the Pew Internet site soon.
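Ranking nodes the way Politico was ranked above uses betweenness centrality, which can be computed with Brandes' standard algorithm. The sketch below is an illustrative stdlib-only Python version, not NodeXL's implementation, run on a made-up miniature of the two-cluster pattern:

```python
from collections import deque, defaultdict

def betweenness(adj):
    """Brandes' algorithm for betweenness centrality on an unweighted
    graph given as {node: set(neighbors)}."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], defaultdict(list)
        sigma = dict.fromkeys(adj, 0.0); sigma[s] = 1.0   # shortest-path counts
        dist = dict.fromkeys(adj, -1); dist[s] = 0
        queue = deque([s])
        while queue:                                       # BFS from s
            v = queue.popleft(); stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                                       # accumulate dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += (sigma[v] / sigma[w]) * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc  # halve the values to normalize for undirected graphs

# Hypothetical miniature: one bridging node between two tight pairs.
adj = {
    "politico": {"d1", "d2", "r1", "r2"},
    "d1": {"politico", "d2"}, "d2": {"politico", "d1"},
    "r1": {"politico", "r2"}, "r2": {"politico", "r1"},
}
bc = betweenness(adj)
print(max(bc, key=bc.get))  # 'politico', the bridge between the clusters
```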
So we're getting to understand how to take this rich stream of social media data and
be able to filter it in a way, get an overview, then filter and zoom in on the key things
that are important, that could lead to interesting decisions.
Here's another sort of great example of the clustering. On Flickr, we took all the tags
that had the word "mouse" in them and then did the cluster and layout and you find a
yellow group over here which is the computer mouse and the blue group which is the animal
mouse and the red group which is Mickey Mouse and these are pictures that come from each
of those so clearly you can see the mouse there.
So this is what natural language processing people would usually describe as word sense
disambiguation and here we're showing that you can do that often just by the network
analysis of co-occurrence of terms. So that was sort of a nice idea.
Here I show you my personal Flickr, it's a little washed out on this display, but you
can see there's two natural groups here. On the right hand side is my family, my two daughters,
Anna and Sarah. There are 28 pictures in which I appear with Anna, 25 with daughter Sarah,
39 of them together. And Sarah got married a few years ago to Mark and I have 93 pictures
of them together. And my sister's here with her husband and children and then other cousins
are over here.
On the professional side, the lab people, Catherine Plaisant and Jenny Preece and Ben
Bederson, Allison Druin, all those appear in one cluster. And a few other of our close
buddies like George Robertson and Jakob Nielsen also got colored green. Then another bunch:
Austin Henderson, Wendy Kellogg, and John Thomas. And then you'll see other
people who I often photograph at CHI Conferences, Terry Winograd may be familiar to some of
you and a variety of other people.
But the clustering structure and algorithms are very powerful to tease out structure in
complex networks.
This was done by a student just this semester as part of a homework assignment, but it was
so nice I wanted to include it here. He took the 600 tech reports from the HCIL Website,
downloaded them, scraped them all, put 'em into a co-occurrence, a co-authorship network,
and you can see that within that community of 600 papers Catherine Plaisant and I have
written a lot of papers together.
But she has lots of other groups that she works with separately. There's some groups
and people we've worked with together and there are other groups that I've worked with
on my own.
Similarly Allison Druin and Ben Bederson work closely together; they're a little younger
so maybe fewer papers, but Bederson has his own group and Allison has lots of people.
One of Allison's students is right here as well [laughs]. So I'm not sure, were you one
of the co-authors on one of the papers there? But sometimes they work together.
On the other hand there are very few papers, or in fact there's no paper where we are all together.
And if you tease out by filtering a little bit more and look for only the links where
there's at least five co-authored papers, you see an even more distinct pattern here where
Catherine and I are linked by the work we've done together. Here color represents the age
or the number of years we've been part of the lab so Bederson and Druin are newer. Only
Anne Rose who works in our lab has collaborated with all of us, okay? So we have pretty
much independent bunches. I've worked with Bederson on some things, but I've never written
a paper with Allison Druin okay or I haven't written five anyway.
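The co-authorship weighting and the at-least-five-papers filter just described can be sketched like this; the names and counts in the example are hypothetical, not the real HCIL data:

```python
from collections import Counter
from itertools import combinations

def coauthor_weights(papers):
    """Weight each author pair by the number of papers they share.
    `papers` is a list of author lists."""
    weights = Counter()
    for authors in papers:
        for pair in combinations(sorted(set(authors)), 2):
            weights[pair] += 1
    return weights

def filter_edges(weights, min_papers=5):
    """Keep only the links backed by at least `min_papers` shared papers."""
    return {pair: w for pair, w in weights.items() if w >= min_papers}

# Hypothetical miniature dataset.
papers = [["Plaisant", "Shneiderman"]] * 6 + [["Bederson", "Druin"]] * 2
weights = coauthor_weights(papers)
strong = filter_edges(weights, min_papers=5)
print(strong)  # only the six-paper link survives the filter
```

Filtering the edge weights this way is what makes the distinct collaboration pattern stand out in the co-authorship picture.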
And you can see some other patterns which we'd like to show even more strongly. These three
should be together as a group because they have the same relationship and similarly over
here Kang, Chao, and Marchionini have the same relationship, but these others have a
different one. These others would produce a fan, we'd like to put them together in a
group. So we're still working on the ways to more clearly represent the relationships
that are important, so that your eye can pick them out in a second instead of 10 seconds.
Francois Guimbretière, one of Terry Winograd's students, spent a few years at Maryland but
did not publish a lot with the rest of us; he was pretty much more independent. Okay?
And the one I just grabbed yesterday to show you. We're studying the Nation of Neighbors
which is a community safety system; it's got 160 communities around the country; and we
were looking at the postings within the different groups. So we had all of that
data and when we looked at it just in Fruchterman-Reingold or Harel-Koren layouts it was
just kind of a mess, so we threw it into the group-in-a-box thing and lo
and behold it showed a pretty strong structure here, and we eliminated the links across them
so the post and the response pattern stands out very apparently here, the biggest ones
in the upper left and then all the small ones in the lower right. But you quickly get a
very memorable understanding of the pattern of relationships within these groups, okay?
So I just close by saying that's the book and there are copies here in New York; maybe
some we can get to Mountain View, but I appreciate that. The first three chapters are an
introduction to the theme of social media networks, and then four, five, and six are very
much how-to, walking you through the use of the tool. And then chapters 8 through 15 apply
these things and some of the examples I've shown you are from the book, but there's ways
to be able to study email and threaded networks and then Twitter, Facebook, Worldwide Web,
Flickr, YouTube, and Wiki networks as well.
And as we go down the road we find lots of other groups that are interested in building
a spigot, so that's sort of the natural thing; then either they can work internally or we
can make it public for them.
The group itself is supported by Microsoft External Research, but Maryland, Stanford, and
Cornell are involved in this. Mark Smith and we have formed the Social Media Research
Foundation, which promotes open data, open tools, and open scholarship.
So we'd like to hope that NodeXL will be as widely used as other popular tools like Firefox
and so on that are free and open on the Net and like the Mozilla Foundation or Linux Foundation,
we'd like to make these things open and accessible, that's our goal and if you can help join us
with that that'd be great.
I'd like to just say we wanna put all these things to good work to make the world
a better place in ways such as helping the Millennium Development Goals, but that's a
longer talk.
And I'll just sort of close with saying we just had our 28th annual lab symposium; we had
more than 250 people attending and a much larger format. And the next event will be
the summer Social Webshop. We're bringing about 20 to 25 speakers, about 40 students, doctoral
students from around the country. And just two days ago I'm pleased to report that Google
helps us along by sponsoring so we can bring some more students and speakers in. And it's
mainly sponsored by National Science Foundation, but I'm real pleased and you can visit our
website there and see who the speakers are and I hope you'll come join and find it to
be a good event.
So the summary of this and the take-away message is: networks have been around for a long time,
and visualization of networks goes back 75 years. Computerization of such network visualizations
has 15, 20, 25 years of history, but only now are we and others developing the
strategies by which we can filter, do the layouts, and be able to interpret these data
and apply statistical methods so that we can extract the insights that allow you to make
important decisions.
So we're quite excited about that. We're on to Version 170 of NodeXL. We drop a new version
about every two weeks or three weeks. And so it's of continuing satisfaction to see
how that evolves and there's just a lot of support and good people who are using it all
around the world.
So we're very pleased. Thank you for the opportunity. The books are here. I'd be glad to sign them
for those who like and I'm open for questions. Thank you.
[applause]
[pause]
In the back, yes.
>>voice in audience: [inaudible]
>>Ben Shneiderman: No, I think that was the idea. Oh, the question was
why did we choose Excel as a sort of platform? The answer is that all of us who had done
network analysis with tools like UCINet or Pajek or any number of other tools you wind
up doing an enormous amount of cleaning up and transforming and organizing the data so it
tends to be you're using some kind of spreadsheet or database, then you import to the visualization
tool, you look at what you get, and you say, "Oh, I need to go back and change that." You
go back out and you're going back and forth. This way we integrated the process, not perfectly,
but we did integrate it in a platform that makes it go.
We'd love to do it on Google spreadsheet and make a web-based viewer, that would be a nice
project to do together. We're sorely, let's put it this way, we're very eager to make
a web-based version of this and we have, although we're very appreciative I should say of Microsoft
research for having supported this for several years. They encourage us and we are eager
and I spoke at Mountain View in November about this and I'm speaking here, we'd be delighted
to have a partnership with Google and if somebody wanted to pick up and help make a, based on
the Google doc spreadsheet, a browser on the Web, hey that's very high on the list.
[pause]
The source is open. It's C# on .NET, but that [laughs] that could change too [laughs].
Right? We've included other platforms; for example, Jure Leskovec at Stanford has this
SNAP toolkit, which has algorithms for very large networks, and so we're happy to
integrate those or work with other tools. We have no devotion to any particular platform
or language; we'd love to see variations on this.
Our current effort, which I think will bear fruit in a month or two, will be to have a web
browser view: if you're using NodeXL you export, and the image goes on the website, the
dataset goes there, and your explanation goes there. And then people can download and
browse it. So that will get us started with a web-facing version.
Right now it's on the CodePlex website; there's a good discussion group; I must give high,
high congrats, thanks to Tony Capone who's our lead developer and he's done a remarkable
job and he also answers the questions quickly and very effectively. People really are very
happy about becoming part of that user group.
There's frustrations with it, there's all kinds of things we could talk about that we
wish we had done or so on, and I can recommend other tools like Gephi; we're very good friends
with them. If you're looking at network drawing tools they do a lot of nice things that we
don't do, and we do a lot of nice things they don't do. There's a lot of room for possibilities
here. We keep exploring on the research side of what other ways we can better extract meaningful
insights from these network datasets.
Thank you. Other questions here or elsewhere?
[pause]
Yes, okay, alright. Well Dan, thank you for sitting through the whole time; I see you
out there too. Hope it was a good review.
[pause]
Alright.
>>Dan: It was excellent.
>>Ben Shneiderman: [laughs] Thanks. See you again and thank you all here. I appreciated
my visit today and look forward to talking more. Thank you, bye.
[applause]