An Internet for Bulk Content Delivery, not Communication


Uploaded by GoogleTechTalks on 23.03.2011

Transcript:
>>
RODRIGUEZ: Now, the talk that I'm going to give today is trying to think a bit about
the issues that we're facing in today's Internet to transfer large amounts of data. I'm talking
about data that is bulk, delay-tolerant, and I'm going to look at the impact that it's having
on the network, and I'm going to try to discuss some of the possible solutions to cope with
these problems. Now, before I get started, I would like to kind of go a bit over the
area--the different waves that the networks have been going through over the years. Now,
if you look at the beginning of the networking, it was from the '30s to the '60s, and in there,
the focus was wires and it was communication between people. It was voice communication
among people. In the second phase, after the '60s, the Internet was created, and
the Internet was built on top of the telephony network. It was using the
same wires in a different way, and the focus was for computers to talk to each other. And
then in the '90s, that's when the Web started. And then the Web used the Internet as a platform
for people to access content. And it happened to be that the Web became kind of the most
successful application that was running on top of the Internet. Now, if you think how
the Internet was designed, it was designed as a very, very thin layer. It was designed
on purpose as a thin layer to allow for a lot of innovation to happen on the top layers
and a lot of innovation to happen at the bottom layers. But this thin layer, it was designed
so that any application could run on top, and, you know, there would be all these innovations
happening on top of the Internet. However, the fact that it was not optimized for any
particular application meant that for some areas, it was not truly doing its best.
And particularly for content delivery, I think the Internet at times has been struggling,
and I'm going to give you some examples. So, you know, after the Web started, if we
start thinking what happened soon after the [INDISTINCT] created in Geneva, we started
seeing that much of today's Internet infrastructure to handle content has been an afterthought.
And, well, a lot of the time, we've been playing catch-up. I'm sure you are all aware
about all the infrastructure that goes behind the curtains in the Internet to make sure
that we are able to scale the delivery of content, and we have content distribution
networks and large datacenters with thousands of machines around the world. However, it
has not always been very easy. In this slide I have two snapshots of the CNN page. The
one on the left shows a regular CNN page on any given day, with lots of multimedia content;
the imagery is very rich. And then on the right is the same page during the September 11
attacks. As you can see, as people were rushing into the Web in the same way that they were
rushing into the TV to see what was happening, the Web just couldn’t cope with the load.
So the only solution that CNN had at that time was to strip off most of the multimedia
content in their pages and just provide very simple text-based
information so they could actually scale the distribution of this content. So we
started realizing that the Internet was not really very well-suited to be a broadcast
medium like the television or a radio was. And more recently, we've been seeing more
struggles with very large content. So if I were to ask you whether the Internet is the
preferred medium for distributing bulk (delay-tolerant) digital content, probably the answer
is: not beyond a certain size. If you think about it, a lot of movies, home videos,
data backups, and scientific data today are not going over the Internet. They are still going
over either dedicated, very expensive private networks or parcel delivery systems
like the postal service. So the best example is Netflix. You know, more than 10 million
users in the U.S., 1.5 million DVDs per day, so that's roughly 2.5 petabytes a day. If
you compare it with some recent figures that Cisco released, all U.S. peer-to-peer traffic
is roughly 14 petabytes a day, which means that Netflix is carrying a significant fraction
of the traffic, of the video traffic that is being distributed in the U.S. on any given
day. But that's not the only example. Probably you know that a lot of times, when you want
to do replication of data across data centers, what you do is you take
the machine, you install the software, you put all the data on it, and then you put it in FedEx
and you ship it to the location where you want to have the servers running. Even
CDNs, whose business is to distribute content, when they have massive
amounts of data to distribute, like for instance the log files that they need to look into
and study, they actually distribute it using a postal mail service. Another example:
CERN in Geneva every day exchanges roughly 20 terabytes of scientific
data with different universities and research centers around the world. And to do this
they actually built a private network. And if you look at the public internet--the internet,
while it is dealing with bulk data transfers, for instance the peer-to-peer traffic that
we are seeing today, we're seeing a lot of signs that it is sometimes struggling
to cope with that amount of load. It seems as if the current bulk data demand is higher
than what the internet can handle. And the solutions out there are not the most interesting
ones. You know, we get solutions where you can play with the pricing schemes and you
can change to very sophisticated congestion-based pricing schemes, and for the users [INDISTINCT]
and the ISPs it is hard to track how to do this. Then another solution is, you
know, you block these transfers, and it's a complete fiasco for the users and for
the ISPs. And then the third option is to have some daytime volume caps. However, this
is also very hard for the users because they need to keep track of how much volume they
have consumed, and in the long run it doesn't work. My own experience; I don't know if
you've ever tried to move 1 Terabyte of data from one side of the world to another side
of the world. It's actually a nightmare. I had this experience when I was at Microsoft.
We had 1 terabyte of logs that we wanted to ship from Redmond to Cambridge. And what
was happening is that when there was congestion in Cambridge, there was no congestion in Redmond,
and the other way around. And the bill was so huge that in the end we ended up recording
everything onto a hard drive, getting on a plane, and bringing it with us. And there
are a couple of reasons for this. There's a report by Tom Leighton showing
that the effect of distance is becoming more and more pronounced every day in the
internet. This is data on how long it takes to download
a DVD file as you go further away in the network, and you can see that the time
to download the file can actually change from 12 minutes to several hours depending on how
far you go into the network. And another thing that we're seeing is that not only do we
see more bottlenecks the further we travel into the network, but also these bottlenecks
are time dependent. And I'm going to show you some more data later and I'm going to
dive a bit more into it, but you will see how at different times of the day the congestion
levels are completely different and how this changes in different parts
of the world. So, what is the real problem? Well, the real problem is that the internet
is not very well suited for bulk delivery. It was very well designed for communications, but
it was designed with short and instantaneous
delivery in mind. And what we're seeing with these large files is bulk and delay-tolerant
traffic. So the result is that if you try to rush these large data files
over the network, it's either very expensive or impossible. You either get charged a lot
of money because of volume-based charging, or you start experiencing congestion
in different parts of the network. Part of the reason is that eventually not all
bits cost the same. And actually not all bits are equally important. And today the
network just treats all bits, regardless of whether they are interactive or delay-
tolerant, indiscriminately. So, let's step back for a second and try to think about other
industries and how they solve this problem. And if you look into the physical world, there
are actually industries like FedEx that have been looking at this problem for a large number
of years. So, what if we start thinking about the internet not as a telecommunication
network but as a cargo distribution network? So you could think that in the same way that
you go to the FedEx webpage and you say, "I want to take this parcel and I want to ship
it from point A to point B, and I want it to arrive by this time," it's going to charge
me this much money. If I want it to arrive by this other time, it's going to cost
this much more. A similar thing could actually be done in the internet with different amounts
of data that you want to transmit from one place to another. So, how does it work
with FedEx? Well, FedEx has all these local offices, all these local branches where
you take your parcels and you give them to them, and that's it. From there you forget
about it. And then they take it, and they have a delivery network made out of planes
and trucks where they actually try to optimize the delivery of the information
from one part of the world to another, to make sure that they can take
the cheapest flights or that, when they arrive on the ground, they can take
the shortest route to the destination. And then they have these
warehouses where they store a lot of the content in big facilities, to make
sure that they can take the cheapest flights overnight or that they can take the best route.
And they store all the content there and they don't route it immediately. Now if you look
at today's internet, it's very good at routing. You know, it's very good at connecting point
A with point B. However, what if we were to think about the internet as a postal service?
What would it require? Well, we would have the same internet that we have today routing
from point A to point B. But then we would need some kind of post offices, internet post
offices, right? These would be boxes that are close to the user, so that a user
who has a large video file just uploads it very fast to these boxes
and then forgets about it. And then you also have transit storage warehouses, which
are big facilities that you use to schedule the information from point A to point B; you
store it at different places to make sure that you minimize the cost of delivering from
one place in the network to another. And then you use some
intelligence in the middle to be able to sort all these things out. Now there is a lot of
debate currently on new internet designs and whether we should really redesign the internet
from scratch or not. So there's a question of, "Well, should this be a revolution or an
evolution?" And I think it can be implemented in both ways. What would be a revolution?
The revolution would be to think about the internet as a FedEx for bits. The routers
would need to include terabytes of storage. There would be new routing algorithms
that not only route across space but also route across time. And, you know,
current algorithms do route across time, but in the time span of milliseconds or seconds,
for congestion control. We never think about routing in the time span of
hours or even days. And if we were to think about how to implement this as an evolution,
you would probably think about building it as an overlay, and there would be a lot
of things to learn from peer-to-peer networks. And you could actually think about how to
take some of the existing CDN infrastructures or large data centers and recast them with
some transportation logistics to provide some of the facilities that I have discussed before.
So, if you're going to go and try to design something like this, I think there are still
a lot of questions to be answered. You know, how many storage warehouses do you put, and
where? How many post offices? How big? How do you do routing and scheduling across time
and space? Where do you store things, and for how long? How do you do reliability, fairness,
et cetera? And then you can try to design such a system with two goals in mind.
You can either decrease your transportation cost, under models where you are getting
charged by the peak, or you can try to increase the data rate of the transfer under
some flat-rate scheme, where what you're facing is congestion and there
is no elasticity in how much the bandwidth can grow, because you're paying a
flat rate. Now I'm going to focus on the scenario where you have all these data centers around
the world, and each of them is subject to some charging scheme, based usually
on the 95th percentile, and what you want to do is replicate information across
all these data centers with the minimum cost. I'm not sure
how many of you are aware of how ISPs charge for traffic, but it's usually
based on what you call the 95th percentile. The 95th percentile basically says that you take
the month and you divide it into 5-minute slots. You measure how much traffic goes
in each of these 5-minute slots, and then you remove the top five percent, the largest
peaks; what you're left with is the 95th percentile, and that's how much you get charged for. So it's sort
of a metric of the congestion that is created in a network at any given point in time.
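To make that concrete, here is a minimal sketch, not from the talk, of how a 95th-percentile bill is typically computed from 5-minute samples; the function name and the Mbps units are just for illustration.

```python
# Minimal sketch (not from the talk): 95th-percentile billing from 5-minute samples.
import math

def percentile_95(samples_mbps):
    """samples_mbps: traffic rate measured in each 5-minute slot of the month."""
    ordered = sorted(samples_mbps)
    # Discard the top 5% of slots (the largest peaks); the highest remaining
    # sample is the 95th-percentile value that the ISP bills on.
    index = math.ceil(0.95 * len(ordered)) - 1
    return ordered[index]

# A month has roughly 30 * 24 * 12 = 8640 five-minute slots. Traffic confined to
# the discarded top 5% of slots does not raise the bill; traffic that lifts the
# other slots does.
```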
And the reason for this is that a lot of the networks and a lot of the ISPs
dimension their equipment and their networks based on peak-hour load, and they manage
their traffic accordingly. So let me give you an example to illustrate what happens
if you don't do something smart with these large bulk data transfers. This is the typical
load that you see in most equipment today, in routers or in servers: you get
these diurnal and nocturnal patterns. Imagine that this is a link that is experiencing
this traffic, and now what you have is this cargo data that you want to transmit.
It could be large data backups or it could be some replication information that you want
to move from one place to another. If you just send it like that on top
of this link, what's going to happen is that the 95th percentile, which is defined by this
red line, is going to be pushed all the way up. You see, all the load gets pushed all
the way up, and in a similar way the 95th percentile gets pushed all the way up, and therefore
your charges increase a lot. A smarter way of doing it is water filling. Well, you
know, we have the same scenario as before. You have these peaks and valleys, and then you
have this cargo data that you want to transmit. Rather than doing it the way we described
before, you can try to fill it into these valleys, using the resources
that are available during off-peak times.
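A minimal sketch of that water-filling idea, assuming a per-slot load forecast and a fixed billing threshold (my simplification, not the talk's algorithm):

```python
# Water-filling sketch: place bulk "cargo" bytes into the valleys of a link's
# forecast load so the billing threshold is not pushed up.

def water_fill(forecast_mbps, threshold_mbps, cargo_mb, slot_minutes=5):
    """Return how much cargo (MB) to send in each 5-minute slot."""
    plan = []
    remaining = cargo_mb
    for load in forecast_mbps:
        headroom_mbps = max(0.0, threshold_mbps - load)        # free space in the valley
        slot_capacity_mb = headroom_mbps * slot_minutes * 60 / 8
        send = min(remaining, slot_capacity_mb)
        plan.append(send)
        remaining -= send
    if remaining > 0:
        raise ValueError("cargo does not fit under the threshold; relax the deadline or raise the threshold")
    return plan
```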
Now if you think about this cargo data, in a certain sense it can be seen as the fat
of the internet, right? And we really need
to move this fat. We need to make the internet slimmer. We need to remove this
fat and create more spare capacity so we can actually pump a lot more data at a cheaper cost.
the data and wait for the valleys of the network and then just push them into valleys." I think
it's not enough. No, that--that is not enough. You just push data into the valleys. You could
end up in a situation like this. The problem is that the valleys and the peaks don't overlap
at different places in the world. When it is night time in the Europe, it is day time
in the U.S. and vice versa. And even in different networks if they're carrying--carrying different
types of traffic they peak at different times. For instance, a residential network peaks
at the certain time and an enterprise network peaks at a completely different time. So the
peaks and the valleys may not overlap. So if you just follow the alleys of one of the
networks you may hit the peak of the other one and vice versa. And this is a simple example
that exemplarises that information. This is at the load at different links in routers
in Latin America, in Europe and in China. If you're trying to send data in the linking
Latin America when it is night time, what happens is that in Europe it's already 1:00
PM and in China its 8:00 PM. So what happens is, if we're trying to rush this data end
to end, you're going to incur a lot of churches in China or vice versa. So what you really
need is this transit storage warehouses. So you do in-finance matching between these different
waves that have this maximum and some minimums. And you need to make sure that you do the
proper scheduling like FedEx does with their planes under parcel system. So let's--let's
see what this would take. Imagine that you want to minimize cost, so you have these post
offices all around the world and you have these internet warehouses where you can store
information; these could be your data centers. What you need to know is: you have some
data that you want to replicate, that you want to distribute, and you want it to arrive by
a certain deadline, because of consistency reasons or because of some SLA agreements.
You also need to know the load, when the different links are peaking and at
what times, so you are able to predict the consumption patterns of these
links or of these servers. And you need to have some kind of rough idea of the cost,
what the pricing is for each of these links, which is usually a concave function
because economies of scale are helping you, and the more you consume, the more the expenses
usually flatten out. And then what you can do is some dynamic programming. I'm not going
to go into the details of the actual algorithm of how you get this, but the end result is
that you're able to schedule your transfers very efficiently and get some very interesting
cost reductions in how you transfer data from one point of the world to another.
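The talk doesn't spell out the algorithm, so as a much-simplified stand-in for it (a greedy sketch over assumed per-slot headroom, not the actual dynamic program), store-and-forward scheduling through one warehouse could look roughly like this:

```python
# Highly simplified store-and-forward sketch: move `volume_mb` from a source to a
# destination through one intermediate warehouse, using only the headroom under
# each link's billing threshold in every 5-minute slot, and never shipping out of
# the warehouse data that has not yet arrived.

def schedule_via_warehouse(up_headroom_mb, down_headroom_mb, volume_mb):
    """up/down_headroom_mb: free capacity (MB) per slot on each leg, same length."""
    slots = len(up_headroom_mb)
    in_plan, out_plan = [0.0] * slots, [0.0] * slots
    to_upload, stored, delivered = volume_mb, 0.0, 0.0
    for t in range(slots):
        # Fill the first leg's valley as early as possible.
        up = min(to_upload, up_headroom_mb[t])
        in_plan[t] = up
        to_upload -= up
        stored += up
        # Drain the warehouse into the second leg's valley, but only what has arrived.
        down = min(stored, down_headroom_mb[t])
        out_plan[t] = down
        stored -= down
        delivered += down
    if delivered < volume_mb:
        raise ValueError("deadline too tight for the available off-peak capacity")
    return in_plan, out_plan
```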
So let me show you some actual results. We took the load from a large wholesale ISP with roughly
400 interconnection points, peering with roughly another 140 ISPs, over a three-month
period, and we took their pricing at each of the interconnection points. We assumed
that we were placing one of these internet post offices at each of the PoPs, and that
there was a number of transit warehouses at this wholesale ISP that would allow
us to route information from one place to another efficiently. So the example was: we
wanted to transfer 27 terabytes from one point of the world to another,
basically between any two PoPs of this large ISP, and we had a deadline
of two days. And we wanted to compare two transfer policies; the first one was the
end-to-end transfer, and the second one was the Internet Postal Service, where you
were using these post offices and these transit warehouses. And we wanted to compare
the difference in money between using one policy and the other.
terabytes from--this is from Latin America to Europe within two days and the cost of
an end to end transfer would be roughly $150, 000 and with the internet postal service the
sketch decreased by about a factor of 20. And the reason is because you're really trying
to use these peaks and valleys of the network and the store information in those places
as much as you can in the similar way that FedEx was doing to minimize the cost of the
transfer. Now I just gave you one example; if you look at all the sources in [INDISTINCT]
pairs in this network and this is the amount of data that could be transferred with--at
the Internet Postal Service and this is with end to end at the same cost. So if I am just--I
just have this much money, how much data I can transfer with one mechanism versus the
other mechanism? And it turns out to be that all the points that are below the line, it
means that with the internet postal service you can always transfer a lot more data than
with the end to end transfer. And there are certain points that actually this difference
becomes quite significant. So if I zoom in on these points and look at what these
points are, they are actually pairs of PoPs that have some time difference
on the world map. So for instance, information that needs to be transferred from the US into
Europe or into Asia and so forth. But the interesting thing is that even if you want
to transfer information across nodes that are in the same time zone, you still get benefits,
and the reason is that it's not only the time zone that defines when
the peaks and the valleys occur; it's also the utilization patterns of those
networks. As I was saying before, you may have networks with completely different utilization
patterns: one network is serving some type of users or some type of content, another
network a completely different set of people, and the consumption patterns and the peaks
and valleys could be completely different. So what about the real FedEx? You know, we
did this exercise: what would it take to ship the 27 terabytes from the previous
example, and we would like to transfer them from Argentina to Spain. So we actually
did the numbers and we said, okay, "How many disks do we need to fit 27 terabytes?" And
it's roughly 30 disks at one terabyte each, more or less, and that is roughly 38 kilos.
Then we went to the FedEx website and we said, "Okay, we want to ship these
from Argentina to Spain, it's going to take us two days, how much would you charge us?"
It turns out to be roughly about $600. But that's only one shipment. Over
the month you would have to do 15 shipments if you want to have a continuous
stream of information once every two days, and so over a month you would actually get a
$9,000 charge for these 27 terabytes from Argentina to Spain.
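A quick back-of-the-envelope check of those FedEx figures, using only the numbers quoted in the talk:

```python
# Back-of-the-envelope check of the FedEx numbers from the talk (2008 prices;
# the per-shipment cost and disk size are the figures quoted, not mine).
disks = 30                 # ~27 TB on 1 TB disks, call it roughly 30, about 38 kg
cost_per_shipment = 600    # USD, Argentina -> Spain, two-day delivery
shipments_per_month = 15   # one shipment every two days
print(shipments_per_month * cost_per_shipment)   # => 9000 USD per month
```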
Now, compare that to the results
that I described before: it was $144,000 with the end-to-end transmission, it's $9,000 with
FedEx, and it's $7,000 with the internet postal service. So what this means is that the internet postal
service is not always a clear win, but [INDISTINCT] if you do things smart,
if you do things right, you can actually get a very convenient solution, an
online solution, that provides you with costs similar to the ones you are
getting with FedEx but with a lot more convenience. And I believe there is a spectrum of data volumes
and time deadlines and configurations where FedEx can do better or worse than the
internet postal service, and you just have to look at these parameters and decide;
there is no clear winner. Depending on the situation, one or the other will do
better. And then the final thing that we did is we said, "Okay, rather than just taking
one of these source-destination pairs," which was Argentina to Spain, "let's try all the
source-destination pairs and see when it is that FedEx is better than the internet postal
service or not." And it happened that this is the distribution of the cost for the internet
postal service and this is the cost for FedEx. So it happens that if we draw a line
here, about 70% of the source-destination pairs in the network that we were studying
actually have a cheaper cost than FedEx, and only 30% of the source-destination pairs
incurred more cost than shipping the disks with FedEx. So, you know, reducing cost
to push more data is great. I think it's a very useful thing to do. But what about
the scenario where we pay just a flat rate, and then what happens is
that we just experience different congestion levels at different times of the day? This
is what happens many times with residential users: they
pay a flat rate, and then the congestion in the network changes over the day depending
on the usage that is going over that network. So imagine that you are a user and you have
a large amount of information that you recorded with your digital camera, and you want to
transfer it from here to your family somewhere on the other side of the world. You have
a high-speed fiber connection and you want to do this point-to-point transfer the fastest
way possible. But the problem is, as we saw before, you get these bottlenecks in the network
that increase as you go deeper into the network, and then you get these congestion levels that
change over time, and that change is different in different places in the world. So how can
you schedule this transfer such that you maximize the speed of the delivery of this information
from you to your family on the other side of the world? So do such situations exist in reality? If
I try to do a transfer of data from here to Europe, will I see fluctuations in the speed,
and does that speed fluctuate over time? And is it different from the fluctuation
that I will see if I transfer to another continent in the other direction
of the world? So we did an experiment where you take a sender in the US, in Berkeley,
and you try to send information to Barcelona, and then you have another intermediate
node in Seattle, and you try to do transfers over 24 hours; you
do an FTP transfer and then you just measure the throughput that you're getting. You do it
from Berkeley to Seattle and from Seattle to Barcelona, and basically you see these
large fluctuations over the period of the day, where you get the same peaks and valleys
as the ones that I was describing before in the core routers of the Internet. You're actually
seeing the same thing in the access links of the Internet. So let me show
you something for a second. Let me step out of the presentation for a second, and I'm
going to show you some data that confirms these peaks and valleys in the access networks
of the users. This is a tool that we built that is able to track the experience
that BitTorrent users across the world are seeing in their access networks,
through any ISP, in any country in the world. So basically we're able to know
what type of experience a user using BitTorrent is seeing in any ISP in the world.
Now, the color basically tells you more or less the average download speed that they
are seeing so you can zoom into different places of the world and you can go into countries.
You can zoom into the U.K. and then try to see what type of experience the users are
seeing there. Let me jump in there. And for instance, for the U.K., you get information
like this. This is the average speed that users are seeing in the U.K. when they are
downloading a torrent, and this is the peak experience that they're seeing. And this is
over a period of one day. Most of the time, for a lot of ISPs, you see a very flat
experience. So over the course of the day nothing changes. But then you see
other ISPs like this one, you know, nothing changes over the course of the day. But then
all of a sudden, look at this, you get things like this: during the
peak times of the day, the performance of the BitTorrent download actually decreases
to make room for other interactive traffic. And this is the case for one ISP in Europe,
but this is the case for another ISP in another part of the world, in America. And you
can see that there is similar variation in the available rate for this application. However,
the times of the day at which this happens are completely the opposite. So when
it is a peak on one side, it is a valley on the other, and vice versa. So the bottom line is
that we're seeing more and more of these fluctuations in the available speed and the congestion
across the world, and these fluctuations change over time, and we really need some smart mechanism
like the one I described before to be able to move your data across these different
peaks and valleys in the world. So, what we did is we took all the data that we had
collected from the tool that I described before, which basically tells you the available capacity
that a BitTorrent user is seeing when downloading this bulk data across different
ISPs in the world. And then we calculated what the increase in
speed would be if they were to use some kind of Internet postal service like the one I
described before to transfer their data across all these different ISPs; we found
roughly 60 of them that were doing active traffic management. And you can see again
all these dots: if they are on the upper side of the plot, it means that the speed can actually
increase versus just using regular BitTorrent. And we got factors of speed improvement of
up to five times.
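As a toy illustration of where such a gain can come from (my own numbers and function names, not the talk's evaluation), compare an end-to-end transfer, which is limited in every slot by the slower hop, with a relay that stores data at an intermediate node so each hop can run at its own off-peak speed:

```python
# End-to-end vs. store-and-forward through a relay, with a toy diurnal pattern:
# the uplink is fast at night, the downlink fast during the day.

def end_to_end_volume(up_mbps, down_mbps, slot_s=300):
    # Each slot is limited by the slower of the two hops.
    return sum(min(u, d) for u, d in zip(up_mbps, down_mbps)) * slot_s / 8  # MB

def store_and_forward_volume(up_mbps, down_mbps, slot_s=300):
    stored = delivered = 0.0
    for u, d in zip(up_mbps, down_mbps):
        stored += u * slot_s / 8                 # first hop fills the relay's storage
        out = min(stored, d * slot_s / 8)        # second hop drains what is available
        stored -= out
        delivered += out
    return delivered

up   = [10, 10, 10, 1, 1, 1]   # Mbps per slot
down = [1, 1, 1, 10, 10, 10]
# The relay delivers several times more than the end-to-end transfer here.
print(end_to_end_volume(up, down), store_and_forward_volume(up, down))
```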
So, I'm going to conclude there. I've tried to discuss that content distribution is
something that was not inherent to the Internet design.
I think, over the years, we've seen the internet struggle to cope with large,
massive content distribution, and often it has pushed the Internet to its limits. And
the reason is, you know, communications and bulk delivery are completely different beasts,
and in that sense the internet was not fully optimized for it. And I really think we
need to look again at how we can enhance the current internet infrastructure
with more storage and temporal scheduling so we can increase the capacity or decrease
the cost. And finally I think, you know, I described two worlds: the online world
and how to schedule data in there, and the physical world, the FedEx one. But I think
there are a lot of opportunities to mix these two worlds and make a
system that could actually combine both of them. It doesn't
have to be all the way physical or all the way online; it could be part of the
way physical and part of the way online, and so forth. And I think this is a space that
is still open for design and exploration. So, if you want to get more information, there
are a couple of papers online. You can go to our website and you can read more about
it. And with that said, I'm going to stop there and open it up for questions. Yes, so the
question is whether we've looked at the interplanetary network. No, we haven't, but the problem
there is different in that you get very large round-trip times. So the problem many times
is that interactivity is very hard because of the large delays. But once you cope
with that large delay, then you need to make sure that you maximize the available
capacity. And once the available capacity, the available throughput, is maximized,
then you should be okay. But I can see how, if you want to transfer data from
one planet to another planet, then you could actually use intermediate planets as storage
nodes in the same way that FedEx uses these warehouses to store the content, or even
use them as storage warehouses. Because, depending on the rotation of the earth,
you can have some satellites or some planets that don't move
with respect to the earth, and they could store information there such that when the earth
rotates they could actually deliver it again. So I think that's a good idea. All right,
so the question is, we've seen similar problems since the beginning of the internet with things
like email. And then there was a second part of the question about deployability:
what would be some concerns about whether the whole world has to adopt this before
it's actually useful? So you bring up a good point. Things like email work in
store-and-forward. The only thing is that most of the time it was not optimized
for bulk delivery, and it was not optimized for cost or delay. It was more for connectivity
purposes: because I'm not connected online, they just store it there. So the variables
you optimize for are different. And then you need a larger number of servers across
the world to be able to optimize things properly. So it is similar, but at the same time it needs
to look at a wider space to make sure that it optimizes the solution. And then,
the second part of the question regarding deployability: I think I would see this
like a large CDN, or a company with a large number of data centers, or a large Telco with
global presence, just providing this facility of large data file transfers. And I think
you can do an incremental deployment of that. You know, if one of the CDNs
starts deploying a service like this tomorrow, I can see how enterprises could actually contract
this service from the CDN in the same way that they contract hosting. They could just
contract a service to replicate data across the world, and then the next thing would be
to provide it to the end users so they could actually do their point-to-point transfers
more efficiently. And I think this is probably something that will become more and more
important. For probably the first 10 to 15 years of content delivery, we've always
focused on the very popular part of the Zipf distribution, where the very popular
content accounts for most of the requests. But we really haven't looked much at the
very long tail. How do we optimize things there? As more and more content is produced
that is user-generated, that content is consumed by one user: one
user produces it and then the family consumes it. And that content is really in the long
tail, and it's very hard to use the traditional content distribution infrastructures to host
that content efficiently. So you would probably need to start thinking about some of these
other elements, like the ones I described today, to make sure that you do the distribution
of this long tail of information very efficiently. Right. So the question is, "If you don't
have enough servers in some places of the world, then it will be hard to provide the
service, so why not reuse some of the infrastructure that already exists, like for instance e-mail,
which already has a large number of servers with storage?" Sure. Why not? I mean,
not just e-mail, I can think of many other services that also have a
large number of servers and storage capabilities. I think if you look at any CDN, they have
on the order of tens of thousands of servers with storage around the world. But you could
even think about it in a peer-to-peer way. You know, more and more set-top boxes and home
gateways are becoming capable of doing the same things that a PC is doing, and they have
a lot of storage, so you could actually take it to the next level and think
of the home as an extension of the network, and then start thinking about how you could use
that infrastructure to deliver services like the one I described now. Right. So the
question is, for the BitTorrent data that we showed, is
there a way of gathering it without having to send a large volume of information? So
let me go over the details of how we gather that. Actually, you don't have to send a
lot of information. Basically what we did is we took a BitTorrent client and we modified
it. Usually a BitTorrent client opens 70 connections to talk to 70
peers and then downloads from four in parallel, and it keeps changing which four out of
the 70 to find the fastest ones. So, what we did is we took a client, we modified
it, and rather than opening 70 connections, it is able to open 100,000 connections. Okay?
So not only are we able to talk to 70 peers, we are able to talk to 100,000 in parallel.
And then once every five minutes we switch: we terminate those connections and we move
to another 100,000 users. Okay? So over the course of the day, we are able to talk to several
million peers. Now, we don't download any content. We don't download anything. The
only thing that we do is we listen to the messages that the peers are exchanging among
themselves every time they download a block of a file. In BitTorrent, the way
it works in the peer-to-peer network is that every time I download a block of the file, I need to
tell my peers so they can download it from me. So that's the
only thing that we do. We have this client, we connect to 100,000 peers, and then we start
listening to the exchanged messages. So we don't participate in the data exchange.
We just look at the metadata exchange messages, and we do that once every five minutes over
a large number of peers, and then over the course of several months we are able
to map the world.
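As a rough illustration of that passive measurement idea (the piece size and the message handling here are assumptions, not the modified client from the talk), a listener can estimate a peer's download rate just from the "have" announcements it hears:

```python
# Rough illustration (not the talk's modified client): estimate a peer's download
# rate passively from BitTorrent "have" announcements, without downloading any data.
from collections import defaultdict

PIECE_SIZE_BYTES = 256 * 1024          # assumed piece size for the illustration

class PassiveObserver:
    def __init__(self):
        self.have_counts = defaultdict(int)   # peer -> pieces announced this window

    def on_have_message(self, peer_id):
        # Called whenever a connected peer announces it completed another piece.
        self.have_counts[peer_id] += 1

    def estimated_rates_kbps(self, window_seconds=300):
        # Pieces announced in the window * piece size approximates bytes downloaded.
        return {
            peer: count * PIECE_SIZE_BYTES * 8 / 1000 / window_seconds
            for peer, count in self.have_counts.items()
        }
```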
"What are the assumptions that we use to compare the cost of FedEx versus the cost of transferring
the data over the network?" The cost of transferring the data over the network are real cost at
the links based on the real 95 charges. We took the--the actual charge, I think from
last year, 2008. So the question is whether we're looking at the ISP rates or the link
rate. It was the wholesale prices of the links. Right. The question is whether we looked at
any wireless provider. The answer is no, but I'm expecting that the
results would look similar, even more amplified, because in wireless the resources are even
more limited. So the constraints in terms of utilization and cost of upgrading and deploying
will be tighter. Sure, probably the peaks and valleys will vary. Still, a
lot of people are using 3G today as a home broadband connection and, you know, as a mobile broadband
connection. So, you're right. Maybe they use it over different periods of time, even
when they are at work or when they are traveling. So rather than
having one peak, there will be multiple peaks during the day; maybe when there is
a break during lunch, because you have your laptop with you with a 3G card, and so forth.
So the balance will change, and you'll probably need to adapt the algorithms to make sure
that you adapt to those patterns. Right. So the question is, "If you were to implement
this, do you need some kind of entity that has a holistic view of the available capacities
in different places in the world in the same way that FedEx has that information?" The
answer is you need to have local information. But you need to have all the local information.
But that's something that most of the networks that have presence in different points, they
have. I'm guessing that when Google puts a server somewhere in a data center
and purchases some bandwidth from some ISP, they know the charges. They know the utilization
of that link. They know when it is going to peak. The thing that you need
to do is predict the future. So it's not so much a problem
of knowing the cost and knowing the utilization of the links where you have presence, because
that you can measure. It's about predicting the future, because what you're doing
is scheduling over time. You may have a data transfer that is going to
last over two days; you actually need to predict how the load is going to look over those two
days so you can actually schedule your information across the world. And for that, you need to
have some good prediction mechanisms. Our experience is that a lot of these links, unless
there is some black swan effect where all of a sudden something drastically changes, are
fairly predictable. The same patterns repeat over different days, with some variations over
weekends and so forth. So the answer is yes, you need to have information
about the links where you operate these nodes, and yes, you need to be able to predict the
future, but I think both are doable.
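A minimal sketch of such a predictor, assuming the diurnal pattern simply repeats (my illustration, not the speaker's model): forecast each 5-minute slot of tomorrow as the average of the same slot over previous days.

```python
# Minimal load forecast: tomorrow's per-slot load as the per-slot average of
# previous days, relying on the diurnal pattern repeating day after day.

def forecast_next_day(history_days):
    """history_days: list of days, each a list of 288 per-slot load samples (Mbps)."""
    slots = len(history_days[0])
    return [
        sum(day[slot] for day in history_days) / len(history_days)
        for slot in range(slots)
    ]
```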
question is, "If you start looking at other applications that are not delay-tolerant,
that are more interactive, like for instance, video streaming, video conferencing is there
enough bandwidth out there to support those applications?" Yes. I think a lot of the times
the problem has been on the access link and the uplink capacity. So I don’t think its
being so much a problem inside the network as--in the core of the network as a problem
in the access networks. And it's a technology problem. And a lot of them will go away with
fiber and with more symmetric links. I think exactly the point of this talk was trying
to say that you need to make sure that if you have two types of traffic, one that is
interactive like for instance, video streaming or a voice conversation and you have something
else that is a lot more delay-tolerant like, data backups or replication or large movie
transfers. You need to treat them differently. And you can actually have a win-win situation
if you treat them, not only do you treat them different but you can also optimize how you
deliver each of them. And then, you will be able to make a lot more capacity for both
of them. So, the point of this Internet postal service was to treat this large bulk delayed
transfers in a smart way so you can actually make more room for the more transit delay-sensitive
application like the video streaming one. Right. So the question is what type of traffic
is amenable to this infrastructure that I described, and whether we've looked
at what type of traffic in today's Internet could fit this profile. So I think, as you
said, I would think of a large amount of the peer-to-peer traffic
that is happening out there, which is a significant fraction of today's Internet, as delay-tolerant
to a certain extent. You know, people leave their movies to download, and most often
they download while they're away, or overnight, so there is a certain delay tolerance
in there. Right. So the question is, if we start caching illegal content, won't we run
into trouble? You know, caching has had its ups and downs for a long time. At
one point there was an attempt in the EU to ban caching altogether, because a
cache hosting some content could be harmful to another provider. And at a certain point
people were saying that caches were not respecting the time-to-live (TTL) of the CNN page so
they could actually absorb a lot more of the load. So, I think you're right. In certain
situations, you will probably need to look at what type of content it is and be able to
maintain the rules by which the content provider wants that content to be treated. Some
of them will want to mark it as non-cacheable, and some of it is illegal content and you're
not going to be able to store it. But I'm seeing more and more of the content that is
being consumed by users going through more legal ways of distribution, more legal frameworks,
like for instance, if you look at something like Spotify today. I don't know if you use
Spotify. It's an online system where they provide you with a very large fraction of the music
content that is out there, and they're using a peer-to-peer system. Peer-to-peer by
itself is just another medium of distribution, but they framed it in a way where they are
using it with agreements with the [INDISTINCT] and the providers; they said, "Okay, let's
use peer-to-peer to reduce the cost." So, bottom line, I don't
think peer-to-peer in itself can be labeled as legal
or illegal. It's just a different transport medium. You can actually use it as a separate
way of transferring your content. So, what I was saying before is that the peer-to-peer
traffic, what I mean is that there are a lot of movies and there is
a lot of content that today is being transferred over peer-to-peer, but tomorrow it could be
transferred over a CDN or it could be transferred over a network like Skype. I
understand your question now. So the question is: peer-to-peer networks are very good at
handling the very popular content, because the more people request the content, the
more nodes there are to serve it, and therefore, you know, the capacity
of the network naturally responds. But now you're telling us this is more about user-generated
content, or a user that wants to send something from Argentina to Spain, and how are you going
to handle that? So, that's true. That's exactly what I was saying before. I think, over the
course of the last 10 years, we've been able to design very good content
distribution systems to deal with very popular content, like CDNs and caching. And I think
in the next wave, we are seeing a lot of content that is user-generated and that
is non-cacheable, right? And for that, traditional systems like peer-to-peer or like CDNs are
going to do a poor job, because you are always running into the problem of
the hit ratio: you start caching something and nobody else wants to see it. So that's why the
infrastructure that I was presenting today is more amenable to transferring information
from one part of the world to another even if only those two entities
are the ones that want to see it. Now, the part where I was talking about peer-to-peer is that you
need to use this intermediate storage to be able to do the matching between the different
peaks and valleys at different times of the world, and the storage nodes, rather than
sitting in the network, could be sitting in the users' computers. And that's the
peer-to-peer angle, not so much the traditional swarming effect that the more people come,
the better it works. It's more like using some nodes as helpers to store
this information and to bridge the difference between the valleys at different times of
the world. You know, if you look at the distribution, we did a study of the YouTube
traffic a couple of years ago, and it was more of a 60/40 rule. So you get something
like: the long tail is roughly 30% to 40% of the traffic. So we are very good
at optimizing the 60%, but with the other 30% or 40% we have a hard time. So, I truly
believe that if you put in a system that starts understanding the differences between the two end
points in terms of peaks and valleys, and you use something like that, you'll be able to
optimize for that 40%. Yes. Thank you.