The Little Engine(s) That Could: Scaling Online Social Networks


Uploaded by GoogleTechTalks on 16.03.2011

Transcript:
>>
Hello, and welcome to this TechTalk on Scaling Online Social Networks. It is my pleasure
to introduce Josep Pujol from Telefonica Research in Barcelona.
>> PUJOL: Thank you, Marjan. Thank you very much for attending the talk. I'm going to talk about a project and a paper that is going to be presented at SIGCOMM 2010 this year. What we have done is a new system of partitioning and replication to scale social networks. You probably already ate, but I usually give this talk while people are eating; that's why I have the menu. We'll first go to the appetizer, which is the problem, then the main course, which is our contribution, and the dessert, which is the evaluation. Scalability, as you probably know much better than myself, is a very difficult thing, especially for social networks; in general it is a complex field. Scaling up to serve hundreds of millions of users is extremely difficult.
However, our emphasis is on what happens when you scale systems that are at an early stage. What we want to do is ease what people call the designer's conundrum. When you are a small company with a product, say a small social network like Twitter, you have two choices. You can devote all the resources you have to making the system scalable up to millions of users, with all the programming that implies, spending a lot of resources you might not actually have, and possibly losing the time-to-market opportunity. Or, what usually happens, you start small and just hope that you will be able to scale up as you grow. That can result in Twitter's success, or in Friendster's demise. Regarding scalability: at least at the level of a startup, infrastructure and network scalability are pretty much solved by the Cloud. What is not solved is the application scalability.
Of the typical application layers (I'm going extremely fast here because I'm pretty sure all of you are more proficient than myself in this area), you have the presentation layer, the application logic, and so on. Scaling these in the Cloud is extremely easy, because those layers are stateless. Since they are stateless, you can just add more machines, run another instance, and keep increasing the number of machines and the computing power as the user base keeps growing. However, scaling the data layer is not so straightforward, because the data is not independent: it is dependent. You cannot just add more machines and hope that everything will be fine, because you have to do correlated joins and all kinds of operations that might affect more than one server.
So what is the typical solution to scale? Well, the first one is scaling up via full replication: you keep a perfect copy of all the data on N servers, and then you only have to coordinate consistency across those servers. This is the first approach people usually take, because it lets you load-balance across servers. However, eventually you hit the physical limits of the machine, so you cannot keep going indefinitely. The other approach, once you are done with full replication, is horizontal partitioning, or sharding: you split the data into shards, horizontal partitions of the data, so that each piece is independent and can be placed on a different server. That is also fairly easy to do; there are off-the-shelf solutions like MySQL Cluster for it. However, there is an important assumption here: that the data can be split, that the data can be made disjoint. And that is exactly where the problem for social networks arises. If you are, say, Amazon (well, not Google), you can say: all the users whose names start with A, or whose hash value is X, go to this machine, and so on; if you need more machines, you just make the partitions a bit smaller and add more machines. There is no problem, because the users do not talk to each other. The users are not correlated; you can make disjoint partitions.
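A minimal sketch of what such disjoint, hash-based sharding looks like; the names here (N_SHARDS, shard_for) are hypothetical:

```python
# Hash-based sharding, assuming independent users: the shard is a pure
# function of the user id, so every query for a user touches one server.
import hashlib

N_SHARDS = 16  # hypothetical cluster size

def shard_for(user_id: str) -> int:
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

# All of "alice"'s data lives on one server; as long as queries never join
# across users, making the partitions smaller and adding machines is the
# whole scaling story.
print(shard_for("alice"))
```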
But in social networks, disjoint partitions do not exist, because the main operation of a social network is basically a join on your friends. As you probably all know, there is the cherry-picking problem: if you try to pick one cherry and they are linked, you will get more than one. If you break them apart (because cherries can be broken apart), you get just one; but that is not the case for social networks, because you always need to fetch data from your neighbors. Here is the typical example, the Zachary Karate Club social network. It is very small, but you get the gist: even if you partition the data, by colors here, many of the users, when they go to their neighbors, have to go to a different partition, which means a different server. And from graph theory it is trivial to show that if the graph is a single connected component, it cannot happen that every user and all of their neighbors are in the same partition; by definition, that only happens when you have one partition. So where is the problem? The problem is that you cannot make a clean partition, because there are always these neighbors, these edges that go outside. So how do social networks deal with this problem?
Well, different companies use different custom solutions. One custom solution, which we know from Microsoft Live, is that they have to rely on a traditional database for their operations, and they resolve joins across multiple images of the database. What they do is run a caching system, with a caching server at the level of the rack: they do a random partition, and whenever they have to fetch data from a remote server, they go instead to the cache that is on the same rack, to minimize network latency. This is an okay solution, because it works, but obviously only companies like Microsoft or Google can implement these kinds of custom solutions. This is not something that works out of the box for people who are just starting to build their systems.
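As a rough sketch of that kind of custom setup (the classes and names here are hypothetical stand-ins, not anyone's actual code): random partitioning plus a rack-local read-through cache.

```python
import hashlib

def stable_shard(user_id: str, n: int) -> int:
    # deterministic hash, so the demo below is reproducible
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % n

class RackCache:
    """Random partition plus a rack-level read-through cache."""
    def __init__(self, shards):
        self.shards = shards   # shard id -> dict standing in for a remote store
        self.cache = {}        # rack-local cache

    def get(self, user_id):
        if user_id in self.cache:            # rack-local hit: no cross-rack hop
            return self.cache[user_id]
        shard = stable_shard(user_id, len(self.shards))
        value = self.shards[shard].get(user_id)   # remote fetch
        self.cache[user_id] = value          # next reader on this rack hits cache
        return value

shards = [dict(), dict()]
shards[stable_shard("u1", 2)]["u1"] = "profile-1"
print(RackCache(shards).get("u1"))   # remote fetch, then cached on the rack
```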
Because the typical relational database management systems can no longer scale for social networks, or are not very appropriate for them, there is a tendency to use the Key-Value stores you probably know: this whole NoSQL movement. There are dozens of them, like MongoDB, Cassandra, CouchDB, Redis, HBase; at some point they will run out of cool names for these kinds of systems. And Key-Value stores are very nice for social networks, because they atomize the data. They do an implicit de-normalization, so the data is already laid out the way the application reads it, and that is efficient for the social network precisely because it is de-normalized. However, there is a tradeoff. First, you lose the query language of databases, SQL. That means losing not only the expressiveness of the query language, but also additional things like programmatic queries and 15 years of query-planner optimization. Another problem Key-Value stores have is that you lose the abstraction over the data operations: the application has to be aware of how you do joins, how you do selects, how you do filters, how you do range queries, and all these kinds of things. And, as we will see later on, Key-Value stores can suffer under high traffic: applications with a lot of traffic, like Facebook or Twitter, eventually hit network-bound limits, and probably the same goes for Google.
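To make the lost abstraction concrete, here is a sketch of what the "join on your friends" becomes on a Key-Value store; plain dicts stand in for the store, and all names are illustrative:

```python
import heapq

friends = {"alice": ["bob", "carol"]}              # user -> friend ids
posts = {"bob": [(3, "hi"), (1, "first")],         # user -> list of (ts, text)
         "carol": [(2, "hello")]}

def timeline(user, limit=20):
    ids = friends.get(user, [])                    # one get
    batches = [posts.get(f, []) for f in ids]      # one multiget, possibly remote
    # the merge/sort/limit that SQL would have done for us:
    return heapq.nlargest(limit, (p for b in batches for p in b))

print(timeline("alice"))   # [(3, 'hi'), (2, 'hello'), (1, 'first')]
```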
And now, the main course: our contribution. The contribution is extremely simple, and I would like to keep it at the conceptual level. What we want to do is maintain the data-locality semantics. Everything is perfect in a world where data is local to a server, because then all operations can be resolved by the current systems, whether MySQL or a Key-Value store; it doesn't have to be one or the other.
So we have this little sketch of a social network here, with 10 nodes, and let's say we want to split it across two servers. You could first start with full replication, which means you end up with all 10 nodes on each server. So the cost in memory (and when I say memory here, I don't mean RAM proper; it's a conceptual definition, basically how many users you have to host on each server) is 10 users on each. Read traffic is zero, because all the data is local, so you never have to fetch data from outside to resolve a query; and write traffic is 10, because you have to maintain consistency across the replicas. Eventually this does not scale, because you cannot fit all the users on a single server; otherwise you could have gone with only one server in the first place. Then the typical Key-Value store does a random partition based on a hash function. What you end up with is a very good memory cost, because each server holds only as many nodes as it needs: a perfect N divided by M. You have no write traffic, assuming you don't keep copies for redundancy, but you have a lot of read traffic. Why? Because for any node that you query, it will probably have to fetch data from multiple servers. For Key-Value stores this is transparent, but it has the problems we already mentioned; and for databases you cannot do it out of the box, you would have to do it on the client side, which defeats the purpose of the easy, transparent scaling that we want to achieve.
One solution you might think of, and some people actually proposed it a long time ago, is to stop doing replication at the table or database level, as in full replication, and do replication at the row level instead, where in this case the row is the user. So: do a random partition, and then replicate the users who are outside your server into your server, so that all data is local. But this has the downside, which you can already see here, that you create a lot of replicas, and eventually that can degenerate into the full replication scenario.
What we propose, as you are probably already guessing, is to do a social-based partition and then the replication. In the example I'm giving here, the social-based partition splits the social network through here, and then (I don't know what happened with the pointer) you would have to replicate four nodes to the first server and three nodes to the second server. Then there is the guarantee, by construction, that all queries can be resolved locally, and you have a very small number of replicas.
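The bookkeeping behind that example can be sketched in a few lines; the graph and the split below are made up, but the rule is the one just described: every server must also host a replica of any foreign neighbor of its local masters.

```python
from collections import defaultdict

edges = [("a","b"), ("b","c"), ("a","c"),     # community on server 0
         ("d","e"), ("e","f"), ("d","f"),     # community on server 1
         ("c","d")]                           # one bridge edge
partition = {"a":0, "b":0, "c":0, "d":1, "e":1, "f":1}

def replicas_needed(edges, partition):
    need = defaultdict(set)               # node -> foreign servers needing a copy
    for u, v in edges:
        if partition[u] != partition[v]:
            need[u].add(partition[v])     # v's server needs a copy of u
            need[v].add(partition[u])
    return sum(len(s) for s in need.values())

print(replicas_needed(edges, partition))  # 2: copy "c" to server 1, "d" to server 0
```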
So far, so good, you might say. But now you might think I'm kidding you, because that was a very convenient example: on paper everything works, but what happens on a real social network? Can this actually work? So we went ahead and tried it.
We got real social network data from Twitter, Orkut and Facebook. Let me focus on the Twitter data: 2.4 million users, 48 million edges and 12 million tweets over a period of 15 days, which we collected ourselves. This is a very good representation of what Twitter was at the end of 2008: already popular, but not as popular as it is today. Why am I saying this? Because we captured about 50% of the tweets that happened in those 15 days, which means we probably have about half of the network. So it is a good representation of what Twitter was at that time, and that Twitter had already had to undergo several costly re-architectures to deal with the traffic. Then we went and tried partitioning algorithms.
Graph partitioning is not a new field; there is a lot of work in the area. We compare random partitioning, as in a Key-Value store; then one example of spectral clustering, METIS, which is very well known (if you work in graph partitioning, it's the ABC; you know it very well); and then a modularity optimization algorithm. Why modularity optimization? Because, by the way, I come from physics, and there we don't use spectral partitioning; we use a different strategy, which is optimizing for a metric called modularity. It has several advantages over spectral clustering, although one can be reduced to the other; but anyway, that's not the point. So we picked one of the modularity optimization algorithms, state of the art, and implemented some hacks to get equal-sized partitions, because community detection does not have to yield equal-sized partitions, right? And finally we have our own algorithm; let's call it SPAR online. And now let me go to the results.
The question is: if you want to guarantee local data semantics, so that all queries can be resolved locally, how many extra replicas do you have to generate on a real social network? These are the three plots here, one each for Twitter, Orkut and Facebook, for different numbers of partitions, from 4 to 512, against the replication overhead, which again is how many extra replicas you have to generate. Let's focus on Twitter, in the case where I want 16 partitions because I have 16 servers. What's the replication overhead? For random it is 3.9, for METIS it is 2.2, and for ours it is 1.3. That means that if we replicate every user 1.3 times on average, we can guarantee, for these datasets, that all queries can be resolved locally. And that is a surprisingly low number. Another nice property, as you can see (the number of servers is on a log scale), is that the replication overhead keeps growing as you add more servers in all cases, but it grows sub-linearly, so it holds up at scale.
You might also have noticed this K=0 here. Why? Because so far these are only the replicas we have to generate to maintain data locality, but in any real system there are also the replicas you must keep for redundancy, because you don't want a single copy of the data; you want at least two copies, or K copies. And since we are replicating anyway, can we combine the two, the replicas for data locality and the replicas for redundancy? Here are the results. The replication overhead for K=2, meaning you want at least two replicas for each user, is 2.44 for ours. That is much better than it sounds, because you were already paying for two replicas anyway; on top of that redundancy, data locality comes almost for free.
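How the two kinds of replicas combine can be seen with a toy computation (the per-user locality needs below are made up): a copy kept for redundancy can double as a copy kept for locality, so the totals do not simply add.

```python
K = 2                                   # minimum replicas per user, for redundancy
locality_needs = [0, 1, 1, 3, 5]        # hypothetical replicas needed for locality

total = sum(max(K, need) for need in locality_needs)
overhead = total / len(locality_needs)  # replicas per user, cf. the 2.44 figure
print(overhead)                         # 2.8 in this toy
```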
Now, you might notice that SPAR's numbers are a little better than MO+, MO+ is a little better than METIS, and all of them are better than random, as you would expect. Some people, especially those working on partitioning and community detection, have a problem here. Some say our algorithm is not much better than METIS or MO+, since the improvement is around 20% or 30%, depending on the case. And other people, on the contrary, say they cannot understand why our algorithm performs so much better than METIS and MO+. There is an answer for both questions. First, why does our algorithm not perform much, much better than theirs? Because our algorithm is online, incremental. Why? Because social networks are highly dynamic. There are constantly new users coming in and going out, there are always edges being created and removed, and on top of that there are all the events of the system dynamics: machines failing, machines being added, and so on. The algorithm has to be able to react to all of those events. What we don't want is to run an offline partitioning algorithm, because then we would have a synchronization problem. We would also have a stability problem, because METIS and MO+ are very sensitive to initial conditions. For example, if you change one percent of the link structure of the network and run the same algorithm again, there is no guarantee that the resulting placement will be almost the same as before. It can actually be very different, half the nodes elsewhere. So a small change in the number of links can force you to move a lot of nodes to keep the data-locality constraint. Basically, our algorithm is online for all these reasons, which are intrinsic to online social networks; anything else is very problematic. And second: if our algorithm is online, how come it performs so well? Very simple: because we are optimizing for the right problem, and the other algorithms are not. That is a little bit surprising.
So let's see the example here. You have the social network on the left. Typical graph partitioning mechanisms try to minimize the cut edges, the inter-partition edges, and the partition they come up with is this one, because it leaves only three inter-partition edges. But it requires five nodes, the nodes E to I, to be replicated: those two have to be replicated here, and these three have to be replicated there. That is not what we want. What we want is to minimize the number of replicas we have to make, and for that, this other partition is better: even though it has four inter-partition edges, you only end up replicating four nodes.
What is happening is very simple. Graph partitioning algorithms, in a way, try to maximize the number of friends you have in your own community. Say you have 100 friends: the partitioning puts 90% of your friends on your server, but then there are five friends on that server, two on that one, and one, one, one elsewhere, and that spread creates replicas on five other servers. What we actually want is a placement that says: I don't have as many friends on my own machine, maybe 80% here, 10% there, and 10% there, but my friends are concentrated on fewer servers. That is much better than the original partition, because what costs replicas is the number of distinct servers your neighbors are spread across.
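The divergence between the two objectives is easy to reproduce on a made-up graph (balance is ignored in this toy): the split with the fewest cut edges is not the one with the fewest replicas.

```python
from itertools import combinations
from collections import defaultdict

edges = [("A","B"), ("A","C"), ("B","C"), ("C","D"), ("D","E"), ("B","E")]
edges += list(combinations("FGHIJ", 2))               # community two: a 5-clique
edges += [("C","F"), ("D","G"), ("E","H")]            # three bridge edges

def cut_edges(edges, part):
    return sum(part[u] != part[v] for u, v in edges)

def replicas(edges, part):
    need = defaultdict(set)                           # node -> foreign servers
    for u, v in edges:
        if part[u] != part[v]:
            need[u].add(part[v]); need[v].add(part[u])
    return sum(len(s) for s in need.values())

p_min_cut = {n: int(n in "FGHIJ") for n in "ABCDEFGHIJ"}   # split at the bridges
p_min_rep = {n: int(n in "IJ") for n in "ABCDEFGHIJ"}      # peel off I and J

for name, p in [("min-cut split", p_min_cut), ("alternative", p_min_rep)]:
    print(name, "cut =", cut_edges(edges, p), "replicas =", replicas(edges, p))
# min-cut split: cut = 3, replicas = 6
# alternative:   cut = 6, replicas = 5
```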
Now let me describe SPAR online very quickly. With SPAR online, we have to solve the MIN REPLICA problem, and MIN REPLICA, as you might expect, is NP-hard. So we use a heuristic: greedy optimization with local information only, plus a load-balancing constraint based on back pressure. The algorithm reacts to the six events we mentioned: adding and removing nodes, edges, or machines. Let me go into a bit of detail on what happens when an edge is created.
Imagine the situation here. (Don't worry too much; this example is also in the paper and it's extremely easy.) We start from a situation where data-locality semantics are maintained, and a new edge is created that goes across two servers. What do we have to do? First of all, we fetch only the local information: the neighbors of node one and node six, the two nodes affected by the edge creation event. Then we count what-ifs. There are three possible scenarios, as you might anticipate. Move node one to the server where node six is: that generates three additional replicas, and we can calculate it with local information only. Move node six to the server where node one is: that actually removes one of the replicas. Or the status quo: keep things as they are and just generate the replicas that are necessary, which here means two additional replicas. So which one do we go for? In this case, the status quo. Why? Moving node six would be the better solution, because it minimizes the replicas, but it does not satisfy the load-balancing constraint: a node is only allowed to move to a server that contains fewer masters than its own. This constraint exists to avoid the degeneration that would otherwise happen, because the best configuration replica-wise is to have everything on one server. To avoid that, we have the constraint.
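A minimal sketch of that edge-creation step, with two simplifications loudly flagged: replicas are recounted globally instead of via the local deltas the real algorithm uses, and "load" here is just the master count per server.

```python
from collections import defaultdict

def total_replicas(masters, adj):
    need = defaultdict(set)
    for u, srv in masters.items():
        for v in adj[u]:
            if masters[v] != srv:
                need[v].add(srv)          # u's server needs a copy of v
    return sum(len(s) for s in need.values())

def on_new_edge(u, v, masters, adj):
    adj[u].add(v); adj[v].add(u)
    load = defaultdict(int)
    for srv in masters.values():
        load[srv] += 1
    candidates = [dict(masters)]                       # 1) status quo
    if load[masters[v]] < load[masters[u]]:            # back-pressure constraint:
        candidates.append({**masters, u: masters[v]})  # 2) move u, only towards a
    if load[masters[u]] < load[masters[v]]:            #    less loaded server
        candidates.append({**masters, v: masters[u]})  # 3) move v, same constraint
    masters.update(min(candidates, key=lambda m: total_replicas(m, adj)))

adj = defaultdict(set, {1: {2}, 2: {1}, 6: {7}, 7: {6}})
masters = {1: 0, 2: 0, 6: 1, 7: 1}
on_new_edge(1, 6, masters, adj)                # equally loaded: status quo wins
print(masters, total_replicas(masters, adj))   # masters unchanged, 2 replicas
```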
And by the way, this simple approach uses only local information, just the neighborhoods of the two nodes affected, and makes only local decisions: the only nodes that can move are one of those two, plus their neighbors, obviously, if they need to be replicated. With only this, we achieve the results shown before. Which means the data-locality numbers we gave could probably be improved by going a little deeper with the algorithm; however, we value its simplicity, so we keep it at this level.
An important question now: are the partitions load balanced? Yes, they are. The coefficient of variation of masters per server (the standard deviation relative to the mean) is extremely low, so application reads and writes, which happen on the master replica, are well balanced. For writes it is not quite as well balanced. Why? Because we treat users as identical units; we do not take into account the traffic each user generates. We could incorporate that, but we chose not to, for simplicity. Even so, the coefficient of variation is extremely low. And you can see something important in this plot: in real social networks there is some correlation between the heavy users and how many replicas they generate, but the correlation is not as high as you might expect. So the load balancing from this simple approach works pretty well.
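For reference, the statistic is simply the standard deviation relative to the mean; on made-up per-server master counts:

```python
import statistics

masters_per_server = [1000, 990, 1012, 998]   # hypothetical 4-server split
cov = statistics.pstdev(masters_per_server) / statistics.mean(masters_per_server)
print(f"{cov:.4f}")   # close to 0 means well balanced
```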
Another thing to take into account: what is the distribution of the replicas? I told you the average for Twitter was 2.44, right? But what does the distribution look like, in terms of how many machines a user has to be replicated on? It is skewed towards the low values: about 75% of the users end up with three replicas, which means the two slaves we needed anyway for redundancy, plus the master. So for 75% of the people we add nothing. For the rest, we add more replicas to maintain data locality, and you end up with 90% of the users having fewer than seven replicas; only 139 users out of 2.4 million end up replicated everywhere. That means a few people could have trouble getting a consistent state, but it is very few people, and you can handle it with eventual consistency: we only implement eventual consistency, and you converge to the end result pretty quickly.
Now some people might say: you are moving nodes, and we don't like moving nodes, because moving nodes has a cost, right? Is that the hidden cost of our algorithm? In a way it is, because obviously we do have to move them, but it is not really that important. What you see here on the X axis is edge creation events, which is basically a timeline: you keep adding edges as the network grows. On the Y axis is the ratio of actions taken when an edge is created. What you see is that the majority of the time, about 60% once the system reaches a steady state, you don't have to do anything at all; the algorithm doesn't move anything. Why? Because the edge that is created either already has its two nodes on the same server or, if they are on different servers, each one already has a replica in the other's server. That happens a lot. Why? Because we are building a social partitioning as the network grows, so when you keep adding edges, those edges will most likely fall into the right place: people do not add edges at random, they add edges with social semantics. So about 60% of the time you end up doing nothing, which is very good, and about 30% of the time you actually have to do some movements. Okay, so what is the magnitude of those movements?
This is the plot here: a CDF, which I think is the best way to see how many nodes are affected by those movement events. About 90% of the time you only have to move two nodes, where moving a node means moving all the data relevant to that user, right? And then obviously there is some non-negligible probability of a big event, but the big events are pretty well bounded: the biggest event we saw moved 130 nodes in a single shot, which is pretty low.
More on movements: as a system, we obviously have to deal with system dynamics, how you add machines and how you remove machines. This is at a very high level; we can go over it later if you want. Adding servers is no problem: you can add servers either opportunistically, letting new users fill them up, or you can move a fraction of every server's users to the new server to restore load balancing from scratch. Removing servers, however, is not so clean; that is the weak point of this system. Why is it not so clean? Because we have a structure, and we are maintaining that structure to preserve data locality. When you kill one server and want to recreate it, fine: recreating the server only means moving as many nodes as it hosted. But say you don't want the server recreated, you just want to get rid of it. That means all the nodes that lived on that server have to move elsewhere, and their neighbors' replicas too, and all of that happens at a single point in time, where you cannot amortize the optimization. For example, in one experiment we had 32 servers and killed one. In theory that should affect only 3.1% of the users; we ended up having to move about 20% of the nodes, more than six times that.
So removing a server is not a good idea in our system. You might think that doesn't matter, because you never have to remove servers, which is probably not true. The problem is that a permanent failure is also, in effect, removing a server. When a machine fails, there are two cases. If the failure is permanent, you are losing the server, and you have to be able to reproduce the state with one server less; then you are in that bad scenario of removing a server. For this we are thinking of a different strategy based on backups, which we can do because our system is stable: the partitions do not change much. We have some numbers that seem to indicate this is quite feasible. But what happens with transient failures? By transient failures I mean failures that last less than some X minutes, where you might not want to recreate the server or decommission it; you just want to wait until the thing comes back online. Is the system affected by this? It is, in a way, but it is very resilient. Why?
Because we have this community structure. If one server fails, you can promote one of the slaves of each master that failed to be the master. Now, there is no guarantee that the newly promoted master will have all its data local, obviously, because we are not enforcing that. But because the partitioning follows the social structure, most of the neighbors that have to be there to maintain data locality will already be there. They had to be there to begin with: users sit in these little balls, neighborhoods that overlap, and the intersections of those balls are probably on the same servers. I don't know if I'm making myself clear, but with a little example it is easy to see. Suppose we have the Twitter data on 16 servers and one machine fails, and we do the promotion to master using a water-filling strategy, where each user picks the best candidate slave, and so on, to maintain the load-balancing constraint. We end up with 65% of the users having more than 90% of their neighbors on their best slave candidate. That means that for the time span of that failure, most users would see only degraded performance: they might miss some tweets, or perhaps some message that is being sent to them during that particular window, but most users would not experience much deviation. So this comes out as graceful degradation; it's not a perfect solution, but it is a nice byproduct of the design. For transient failures, which happen a lot, people might not even be aware of them.
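A sketch of that promotion step for a single user (data and names are illustrative, and the water-filling load constraint is omitted): promote the slave whose server already holds the largest fraction of the user's neighbors.

```python
def best_slave(user, slave_servers, masters, replicas, adj):
    def coverage(srv):    # fraction of user's neighbors already local on srv
        present = sum(1 for n in adj[user]
                      if masters[n] == srv or srv in replicas.get(n, ()))
        return present / max(1, len(adj[user]))
    return max(slave_servers, key=coverage)

adj = {"u": ["a", "b", "c"]}
masters = {"a": 1, "b": 1, "c": 2}
replicas = {"c": {1}}                  # c also has a copy on server 1
# u's master died with server 0; its slaves live on servers 1 and 2:
print(best_slave("u", [1, 2], masters, replicas, adj))   # 1 (full coverage)
```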
Now, the high-level architecture. We have the server with the front end and the application logic; we don't touch anything there. And we have the middleware, which interfaces with the data store you want to use, right? The data store can be MySQL or Cassandra or whatever; you only have to implement the middleware for that particular data store, and we have done it for MySQL and Cassandra. The application logic talks to the middleware as if it were talking to the data store: in the case of Cassandra we implemented its API, and in the case of MySQL we did a driver-like layer. So the application talks to the data store through our middleware, and what the middleware does is go to the directory service. The directory service exists because we don't have a hash function here; we need a lookup table to know that if you are user X, your data is on machine A. The directory service answers that and directs you to the right data store, and that particular operation, by construction, will have all the data local. The data stores in the cluster are independent of each other; they are not aware of each other's existence, and they can resolve all queries locally. Now, what happens if a particular operation is a write or an update? The replication manager inspects it: if it is a read, it does nothing, but if it is a write or an update, it catches it, goes to the directory service again, finds out where the replicas of that user are, and replays the operation on the data stores holding the slave replicas. This is a very cheap way to re-implement the typical replication that MySQL offers.
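A sketch of that read/write path (dicts stand in for the data stores and the directory service; the real middleware speaks the MySQL or Cassandra interfaces instead):

```python
class Middleware:
    def __init__(self, n_stores):
        self.stores = [dict() for _ in range(n_stores)]  # stand-in data stores
        self.master_of = {}                              # directory: user -> store
        self.slaves_of = {}                              # directory: user -> [stores]

    def read(self, user, key):
        store = self.master_of[user]                 # one directory lookup
        return self.stores[store].get((user, key))   # resolved locally, no joins out

    def write(self, user, key, value):
        store = self.master_of[user]
        self.stores[store][(user, key)] = value
        for s in self.slaves_of.get(user, []):       # replication manager: replay
            self.stores[s][(user, key)] = value      # the write on slave replicas

mw = Middleware(n_stores=3)
mw.master_of["alice"], mw.slaves_of["alice"] = 0, [1, 2]
mw.write("alice", "bio", "hi")
print(mw.read("alice", "bio"))   # "hi", served from alice's master store
```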
And now the dessert, which is actually the part I like most. Okay, we have the system and the numbers look good; let's see if we actually deliver what we set out to do. We take a Twitter clone; I don't know if you know Laconica, or status.net, an open-source Twitter. You know the standard complaint, and I think I can say it here too: building the functionality of Twitter or Facebook for 10 users takes 15 days; if you want a million users, that's a different story, right? So we take this open-source Twitter clone, which relies on a centralized architecture: a PHP front end, the logic in PHP too, talking to either MySQL or Postgres. And we feed it the Twitter data we have, which is a good representation of what Twitter was at the end of 2008. Then we have our little engines: 16 commodity desktops, because we are low budget, as a startup would be, each with two gigs of RAM, connected by a Gigabit Ethernet switch. And with that, our system, SPAR, sits between the Twitter clone and MySQL, and also Cassandra. Okay.
And here are the results. Once the data is loaded and we have roughly what Twitter was at the end of 2008, we apply application-level read requests. A read request here is not just a get from the Key-Value store: it is a call to the local directory service to find out which machine hosts that Cassandra node, then an additional call to get the list of tweets, and one multi-get call to fetch the content of the last 20 tweets. So it is a fairly heavy operation. To avoid caching artifacts, we query each user only once per session. What we have here are the response-time percentiles under increasing request rates, for SPAR, our system, and for vanilla Cassandra, which uses random partitioning. The typical result: to stay under 100 milliseconds at the 99th percentile, we can serve about 800 requests per second, while Cassandra out of the box can only serve 200. Is it because of network bandwidth? No, network bandwidth is not the problem in this case: we have a gigabit network, and Cassandra's usage keeps growing but tops out around 17 megabits, which is very low. It means Cassandra will eventually hit a problem as you increase the number of servers, but it is not the problem here. And our solution has virtually no network traffic. So that's the result for Cassandra; now let's go to MySQL.
In MySQL, the experiment is a little different: we compare MySQL with full replication against MySQL with SPAR. This comparison is a little unfair, because in one case we are partitioning the data and in the other we are not; with full replication, every one of the 16 machines holds the full dataset. But it is fair from the application perspective, because from the application's point of view, both full replication and our system behave like a centralized system: the application has the illusion that all data is local, so it can issue SQL queries without worrying about anything. In that respect, this is the fair comparison to make. What we get, at the 99th percentile under 150 milliseconds (100 didn't give significant results for MySQL), is a jump from about 50 requests per second with MySQL to 2,500, which is a big improvement. And please, the disclaimer: do not try to compare MySQL with Cassandra from these results; I am not comparing them here at all, they are different scenarios.
So why do we get such a big improvement? Let's go first with MySQL. We are dealing with a table with a billion rows. Because these systems use de-normalization, you have as many entries in the table as tweets multiplied by the average degree: you don't want to do a join to fetch the last 20 tweets of your friends, you want it materialized. You want an inbox, a list with the last 20 tweets ready to fetch. So this table grows very fast, and obviously when you are at a billion rows you start hitting all sorts of problems.
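A sketch of that materialized inbox, the fan-out-on-write pattern (names and data are illustrative):

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol"]}      # who receives alice's tweets
inbox = defaultdict(list)                    # user -> [(ts, author, text)]

def post(author, ts, text):
    for f in followers.get(author, []):      # fan-out on write: the table grows
        inbox[f].append((ts, author, text))  # as tweets x average degree

def home_timeline(user, limit=20):
    return sorted(inbox[user], reverse=True)[:limit]   # no join at read time

post("alice", 1, "first!")
post("alice", 2, "second")
print(home_timeline("bob"))   # [(2, 'alice', 'second'), (1, 'alice', 'first!')]
```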
But then you might think: one billion divided by 16, multiplied by 2.4, our replication overhead, still gives you a big table, roughly 100 million rows. Why is there no problem there? Because in doing data locality we are also, implicitly, doing data correlation: the table has on the order of a billion rows in total, but related users always go to the same places. Because of that, you increase the cache hit ratio a lot, and that is where the improvement comes from: those 15 years of SQL optimization kick in and you get a tremendous boost of performance. For Cassandra, the explanation is different.
It is also affected by the memory hit ratio, but there are other factors; mainly, that we reduce the network latency. The problem with these Key-Value stores, at least at these small scales, is not the bandwidth you have but the latency you get. Even though Cassandra's latencies, when you test it in isolation, are under one millisecond, when you push it to multiple thousands of requests per second they keep creeping up: you are pushing the server pretty hard, the server is keeping state because it is handling synchronized state, and every bit of extra state makes it a little slower. Those little latencies add up into an ever-increasing latency that affects the performance of that particular server. And there is a second effect: since you are fanning the request out to multiple servers, you cannot assemble the results until all the servers that hold the information have answered (or one, or several, depending on the consistency level). So any little spike of load, the CPU spikes I mentioned before due to that state, affects the overall latency of the whole query, because you are delayed by the worst-performing server in the cluster. Each effect on its own is not a lot, but it adds up, and in the end you get the four-times improvement. From the measurements we did, the only I/O bottlenecks we saw were network and CPU. And with the coffee come the questions. So that's pretty much it.
So, no questions? No comments?
>> When you compare the results between, in particular, Twitter and Facebook, there was a very significant difference between those two.
>> PUJOL: Yes.
>> Can you comment on the differences in those networks?
>> PUJOL: Yes. The question is that there is a difference between the replication overhead you see on Twitter and on Facebook, and I would add Orkut to that as well. Yes, it's true: as you see on the slides, the replication overhead for Twitter is lower than for Facebook and Orkut. There are multiple reasons. One is that Twitter is a directed network, while Orkut and Facebook are undirected. When the network is undirected, the replication overhead roughly has to double by default, because every edge needs data in both directions. On top of that, Facebook and Orkut are denser than Twitter, and the denser the network, the more replication overhead you end up with; Orkut is much denser than Facebook, so it has an even higher replication overhead. Now, you might think: that just means your Twitter network was not big enough, and that's why you get these extremely low numbers. We have a slide here that addresses exactly this. It is not in the paper because we had just finished the experiment: the people at KAIST were very kind and gave us their Twitter dataset, with 41 million users and around 1.4 billion edges. The replication overhead we get for Twitter and 16 servers, which before was 2.44, is now 2.55, so it does scale. But yes, it's true: if you have undirected networks that are very dense, you will have a higher replication overhead. And if you have no community structure, you will have a high replication overhead. For example, for Myspace, where people connect more or less at random, in a way where the notion of friendship is not very clear, you don't have this community structure, so the replication you would have to do would be closer to the random case. Yes?
>> [INDISTINCT]...
>> PUJOL: Yes. >> [INDISTINCT].
>> PUJOL: Yes.
>> However, I think in order to do all this, you have to have an initial partition to start with, right? Maybe I missed that. So for [INDISTINCT] planning, is it METIS that you use as the initial partition?
>> PUJOL: Okay, so the question is about the online algorithm: when an edge is created you decide which of the three options to take, but you have to have an initial partition. Do we use METIS for this? No, we don't use METIS. That is one of the cases I didn't explain for lack of time: when a new node is created, it is assigned to the machine that is least loaded. So that is the strategy, and then we use exactly the same heuristic from the very first edge. We do not rely on any initial partition of any kind.
>> So how do you know the order? Because if you take a snapshot of the [INDISTINCT] table, you already have a...?
>> PUJOL: Okay, I get it. I see. How do we know...
>> [INDISTINCT] the record.
>> PUJOL: Yes: how do we know the order in which the edges were created?
>> Right.
>> PUJOL: Okay. We do not know, for Twitter or for Orkut. What we do is take the edges, apply a random permutation, and suppose the edges were created in that permuted order. For Facebook, however, we do have the timestamps of the edge creation events. And these two plots show Facebook with the permuted order and Facebook with the actual order: the replication overhead is very much the same. Note also that the permuted order has an error bar, which is very small; it means that different permutations of the edges do not give different replication overheads. So it is very consistent.
The thing the order really affects is the number of movements. In our case, if you have the actual timestamps, you do fewer movements; by doing a random permutation we end up doing more movements, because we don't take advantage of the temporal locality of edge creation. What usually happens with edge creation is that people go through a burst: you add 10 friends one day, five the next, then three, one, two, and eventually you are adding one every 15 days. If you process those events close together, you end up doing fewer movements. The effect on the replication overhead is marginal, though; it's negligible. Yes, sir.
>> Along those [INDISTINCT], how long does the algorithm take to converge [INDISTINCT]?
>> PUJOL: How long does the algorithm take to converge? The algorithm converges immediately, because it only makes a local decision. Every time you add a new edge, it says: this number of nodes has to move to these particular servers. Once the movement is done, it is stable; there is no other change.
>> But doesn't that trigger moves of other nodes in the surrounding servers [INDISTINCT]?
>> PUJOL: No. The only nodes affected, the only nodes that can move, are the two nodes that form the edge. What can happen is that other nodes have to create an additional replica, in case they were not already there; but their master will not move. Their master stays on the same server. Our algorithm is deliberately an extremely simple heuristic, because we wanted to avoid large optimizations that could result in cascades; we obviously don't want that kind of thing.
>> So, what happens in a situation with, let's
say if a cluster uses [INDISTINCT] and you have a huge [INDISTINCT]...
>> PUJOL: Yes. >> ...what do they--are they able to distribute
it [INDISTINCT] can I just essentially pile up on the server I have using this [INDISTINCT]?
>> PUJOL: Well, what happens if you already have a closed, tight community on one server and new users keep arriving around it? Well, new users go to the least loaded server, so there is no guarantee they will land there. Eventually, you might think that, because they add links to that closed community, they will have a tendency to migrate there. Yes, but there is the load-balancing condition: you can only move as long as the server you are moving to has fewer masters than the server you are leaving, to avoid the degenerate effect. So in this case you are left on your own: the new users will not be able to jump into that particular ball of users, but the new ones coming in, who also form a ball, will be placed together. So, yes.
>> I'm not sure I can phrase this exactly, but what was sort of your utilization per server? Because you have to reserve space for replicas to be made on all servers, right?
>> PUJOL: Yes. Well, what do you mean by space: storage?
>> Yes. So, in other words, if you say a server is full, right...
>> PUJOL: Yes.
>> ...in terms of the load balance, [INDISTINCT] says you can't create new nodes there on purpose.
>> PUJOL: Yes.
>> But you could activate, you know, the server [INDISTINCT] because they are [INDISTINCT].
>> PUJOL: Absolutely. Absolutely.
>> So what is, sort of, your average or maximum utilization? Does that make sense?
>> PUJOL: Yes. So basically the question
is: okay, you maintain the load balancing for the masters, but is there some kind of load balancing for the replicas, since the replicas need to be generated to maintain local consistency? There are two answers to this question. We have an algorithm that does both things at the same time, load balancing for both conditions, but then the numbers are worse, so we didn't present it. You are really spot on about a particularity of the system: some servers will end up with many more replicas than others. There is no way to guarantee, unless we enforce it, that all servers have equal storage utilization. That's right. But for us that is not a problem, in the sense that we assume storage is very cheap, so we prefer not to check on that and achieve a very low replication overhead instead; we are more concerned about the delay that can come from eventual consistency than about the size of the storage. And let me go a little further. I said storage is cheap, which is probably true, but you might think this could affect the performance of the operations. For Key-Value stores it does not, because Key-Value stores work mostly in memory, and the replicas do not all need to be there: only a fraction of a user's information needs to be in memory, the fraction of the replica that is relevant to another user. The bulk of the information will never be loaded. For a database it's a different problem, because it increases the table size; so that's a good question. And the other thing: the imbalance can be up to 50%. Between the smallest server and the biggest server there can be up to 50% more storage space used, assuming all users have the same profile information, that is, that all users are the same. We also do not check whether one user takes 10 megabytes while another takes far less; we treat all users equally, which is not true, but that could be added to the algorithm without much problem.
>> So when you say the version of the algorithm that cared about this space was worse: how much worse is the problem?
>> PUJOL: For the particular example I kept referring to, the 2.44 on Twitter, I think it was around 3.5, but that's from recall. In this case, again, the algorithm is already simple as it is, but that version was a very tentative approach, because what I was doing was back pressure at two different levels. That is a very crude way to attack the problem, because you are optimizing for contradictory things; there is probably a better approach that would keep it lower. But out of the box it was 3.5 or 3.9, I don't know, three point something big. Yes?
>> If you could add extra metadata to the
edges, what data would help you partition better [INDISTINCT]...
>> PUJOL: If...
>> ...it depends on the time and the [INDISTINCT] of the data sets, and obviously you didn't have control over those data sets, but if you were planning your data set yourself.
>> PUJOL: Okay. If you had extra information or metadata on the edges, how could we use it to...?
>> What has it done, basically [INDISTINCT]?
>> PUJOL: What would the actual information be, and what would the benefit be? He mentioned time, which minimizes the number of movements, as we said before. Another kind of information that would be very relevant is read patterns. Right now I'm making the implicit assumption that everything you publish is propagated and read, but that's not true: many of the things people post are never actually read. If you had read patterns for particular people, you could incorporate that, because some links carry much more traffic than others. You could even say: I don't mind fetching data from a different place for this link, because it is read very rarely, while I prefer to keep this other link local, because it is read a lot of times. That would actually be our dream data: proper read patterns. But we do not have them, and there is no way to get them from the public APIs and so on. Any other information... I don't know; right now I'm so focused on read patterns that I cannot think of anything else. So, any further questions? No? Well then, thank you very much for your time.