New Developments in Link Emulation and Packet Scheduling in FreeBSD, Linux, and Windows

Uploaded by GoogleTechTalks on 06.04.2010

Permit me to welcome Luigi Rizzo here from the University of Pisa. We've worked together
on the FreeBSD project for over 10 years. And Luigi's contributions are many in that
area, involving networking, ipfw, dummynet, all kinds of contributions. Some of which
he'll tell you about today. And without any long introduction, I'd like to just hand it
over to Luigi, and I look forward to hearing what he's got to talk to us about.
>> RIZZO: Thanks, Mario. So, I'm going to present a couple of networking projects that
we have been working on at the University of Pisa in Italy. The first one is something
you might have already heard about: the dummynet emulator, which is a package that has
been widely used in the past and on which we have done a lot of enhancements recently.
And the second one is related; it's an algorithm for fast packet scheduling, which is of
some importance now that networks have become extremely fast and computers have become
extremely fast as well, so some of the solutions used to provide service guarantees are
not completely usable anymore. So now, to move to the first part of the talk: why do we
need emulation? Why do we want emulation? Emulation is a standard tool in protocol and
application testing, because it makes your life easy when you want to set up an
experiment and when you want to reproduce the results of your experiment, and also
because it can give much more realistic results than pure simulation, as done with ns-2,
ns-3 and other simulators, which only model the network part of the system. Usually,
when you run an experiment on an application, the system behaves as the combination of
its components: the application itself, the operating system kernel, the device drivers,
et cetera. So you really want to account for all these specific effects when you run
your test. There are many existing options for emulation; one of them is dummynet, and
there are other Linux or user-space solutions like NISTnet, the combination of tc and
netem, and another tool called netpath, which is based on Click. Click is the modular
router project, which mostly runs in the Linux kernel, though there is also a user-space
port. As I said, dummynet is pretty old. I started developing it in '97 as a side
project when I was doing some research on TCP congestion control. It was included at
that time in FreeBSD, and then it was inherited by OS X. In recent years, as part of a
European project, Onelab, we did some extensions to the emulator. The main goal of that
project was to extend the ability to emulate the behavior of wireless networks, but I
took the chance to also make the tool more usable and port it to other operating
systems. Like many other similar systems, dummynet intercepts packets at various points
of the protocol stack; you can pick the one you want, whether you're working at layer
two or layer three or at the bridging level and so on. It passes packets through a
classifier that decides what to do with traffic, and then puts the traffic you selected
into an object called a pipe, which models the behavior of a communication link. The
packets come out of the pipe after some delay (possibly they do not come out at all,
because they are dropped by the queue) and are either reinjected into the protocol stack
or sent back into the classifier, so you can build complex topologies just by looping
through the emulator multiple times. Eventually, you get out of the emulator and you get
your traffic as if it had traversed a network with the features that you configured.
The user interface is really simple. Basically, you have just one command that you use
to set up all the things that you need: configure the classifier, configure the pipes,
read back the statistics, and so on. Here you see an example where we define a couple of
pipes, number one and number two, which model the two directions of a communication
link; they don't need to have the same characteristics. For instance, in this one, one
direction is a 256 kilobit per second link with a little bit of delay, and the other one
is a 4 megabit link.
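Spelled out in ipfw syntax, such an asymmetric pair of pipes and the rules feeding them
might look like the following sketch (the rates, delay, and addresses are illustrative,
not taken from the talk):

```shell
# one pipe per direction; they need not be symmetric
ipfw pipe 1 config bw 256Kbit/s delay 20ms    # upstream
ipfw pipe 2 config bw 4Mbit/s   delay 20ms    # downstream

# classifier rules sending each direction into its pipe
ipfw add pipe 1 ip from 192.168.0.0/24 to any out
ipfw add pipe 2 ip from any to 192.168.0.0/24 in
```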
So this could be a model of a DSL line. Then you configure the classifier so that
certain traffic is sent to pipe number one and some other traffic is sent to pipe number
two. The flow of packets is like in this picture: packets coming in from the network go
through the classifier, the ones selected for pipe one go into pipe one, and when they
get out they are delivered to your application; the same happens in the other direction,
where packets targeted for the output interface are passed through the classifier, sent
to the appropriate pipe, and then go out to their proper destination. The main
applications for this type of emulator are at least the three that we show here. One
thing you can have is just a link emulator for protocol and application testing, and
this is sometimes used just inside your computer (your workstation or laptop, et
cetera), possibly even on the loopback interface, to test how your system behaves if,
instead of a 10 gigabit or 1 gigabit link, you have a DSL line or some form of delay in
your communication. Or you can very easily build transparent bridges with a couple of
network interfaces, so that all the traffic that goes through your bridge is subject to
emulation according to the rules that you configure in the classifier.
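On FreeBSD, a transparent bridge of this kind can be sketched roughly as follows; the
interface names are illustrative and the exact sysctl names vary across releases, so
treat this as an outline rather than a recipe:

```shell
# bridge two physical interfaces so traffic flows through the box
ifconfig bridge0 create
ifconfig bridge0 addm em0 addm em1 up

# let the bridged (layer-2) traffic be seen by ipfw/dummynet
sysctl net.link.bridge.ipfw=1

# now classifier rules apply to bridged packets, for example:
ipfw add pipe 1 ip from any to any layer2
```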
And so you can have one box that implements the emulation in your network and get the
desired result on the entire subnetwork that you connect through this bridge. Another
application of dummynet, and of emulators in general, is traffic shaping. After I
developed this tool, many people started using it not for running experiments but to
sell bandwidth to customers according to how much they pay. If you're not an ISP, you
might still want to reserve bandwidth for certain applications and limit bandwidth for
other applications, so they don't eat the entire link that connects your network to the
external world. And going down this route, if you want to use the emulator in a large
testbed or at an ISP, you need certain features that are not needed in a stand-alone
setting: you need to scale to thousands of emulated links, and you need extra features
to do the classification and the multiplexing of traffic very quickly. You cannot afford
costs that scale linearly with the number of emulated links, because that would be a
bottleneck.
The project I mentioned before was done in the context of extending the features of
Planetlab. Planetlab is an initiative for developing a global testbed, sponsored
initially by Intel, Berkeley and [INDISTINCT], and there are many organizations involved
in this project. Each of them contributes a few nodes to the system and gets accounts on
the host systems, where they can run fully distributed applications. Planetlab is
supposed to use the actual network, as it is, as a playground for your experiments, but
sometimes you want more reproducible results; also, the connectivity that Planetlab
nodes have is very good, much better than what you would have, for instance, as a home
customer. So you really want to squeeze the bandwidth or apply some modification to the
features of the link, and that's why we developed a version of dummynet that runs within
Planetlab nodes. And in order to make life easier for researchers, we simplified the
user interface even more. I showed before that to configure emulation you need to set up
a few things; not many, but you need to set up the pipes, the classifier, et cetera. In
Planetlab we have a much simpler user interface, where you just issue one command, which
is called "netconfig." In one line you set, at the same time, the bandwidth of the
inbound and outbound links, the type of traffic you want to act on, and whether you
expect to run a client application on your node (an application that opens a socket and
establishes connections to the outside world) or a server application, one that accepts
incoming connections. The picture just shows the architecture, how the emulated link and
the configuration work on a Planetlab node. With this system, each Planetlab user can
define independent emulated links and reconfigure them on the fly while running an
experiment, because this is a command that you can run from a shell or invoke from
within your application, et cetera. So you have a tool that is very flexible, easy to
use, and easy to apply to an existing application. So let's see a little bit of the
internals of dummynet: the objects that make up the emulator and the features they
provide. The basic object is called a pipe, and it models only the basic features of the
link. This was in fact the main strategy in our design of dummynet: keep it as simple as
possible, and only model things that in some way match physical reality.
We didn't want to implement models for delays, losses, jitter, and so on because, first
of all, we are lazy. Second, we could develop a model, but it's not clear that the model
would cover all possible situations, so we would need a huge number of different
probabilistic models of these features. And the other thing is that, in many cases,
features like losses, reordering, and so on are not a feature of the link: they are a
feature of the traffic that goes into the link and causes these effects. So our idea was
to just emulate the links, the pipes, and the queues, and then let the user generate
these effects by using or misusing the links in whatever way the application likes. So,
coming back to the link: we can configure the size of the queue, either in bytes or in
packets; we can configure the bandwidth of the pipe; and we can configure an additional
delay, which models the physical distance between the two endpoints of the link being
modeled. We have a little bit of configuration on the queue management policy; at the
time this tool was developed, things like RED, GRED and other variations of active queue
management algorithms were very popular. Now I think they're not very important, given
the advances in packet scheduling. We avoid, as I said, non-deterministic behavior, for
the reasons given before. We do have some features that are nondeterministic, because
they are too useful to be left out. For instance, we have a simple random packet drop
feature, where we can program that a small percentage of packets is dropped. Also, we
have the random rule match option in the classifier, so the outcome of a classification
is not deterministic but only matches with a certain probability. We use these to run
simple experiments on how an application would respond, for instance, to persistent
congestion, or to model some kind of packet reordering, just by sending packets through
these probabilistic match features into different pipes with different delays. You don't
have to use them if you don't like the models, and you're free to generate traffic that
causes the effects you want. So, the classifier is just used to send the traffic to one
of the many pipes that you can define, and we use ipfw, which is the standard firewall
in FreeBSD. It has a large number of options, and I've done a lot of work in the past to
make it more efficient and more flexible.
And also to add some features that make it suitable for, and scalable to, use with
dummynet, which means that we want to be able to handle thousands of pipes and thousands
of rules. One thing we added to the classifier, for instance, was the ability to have
multiple passes through the classifier once a packet comes out of a pipe. What if you
want to do more complex things than just sending traffic to a very simple pipe? Then we
need to split the pipe into its components. Pipes are made of at least three things. One
is queues, or per-flow queues if you want to do some work related to packet scheduling.
Then there is a scheduler, which in the simplest incarnation is just a FIFO scheduler.
And then we have a link with a certain bandwidth and delay, though we could think of
adding more features to this link model. The reason for the split is that in this way we
can configure independently all the features of each of the components, and compose all
these pieces together in ways that are not possible with a monolithic pipe. So the first
thing we did was to split the queue from the rest of the pipe, and we created an
abstraction called the flowset, which models per-flow queues. It is basically a data
structure which we configure by defining a flow mask, which is something we use to
partition traffic into multiple flows according to some combination of the protocol, the
IP source and destination addresses, and the source and destination ports; then a
scheduler identifier, which means that we can attach packets coming from different
flowsets to the same scheduler.
And then, we can run a scheduling algorithm on those queues. And then, there are scheduling
parameters attached to these flowsets like ways or priorities or shares of the link that
should be given to each of the flow. Configuration of the flowset is--again, very, very simple--very
simple once you understand the model. And here we have an example where we configure
basically two flowsets. One is--the keyword is called queue for background and compatibility
reasons. I did not come up with a better word at the time I implemented the feature. And
so we can configure--we here configure one flowset, which is number one. Which is attached
to scheduler number five and it has a weight of 10. No mask here, which means that all
the packets that are sent to this flowset will be put in the same queue. And this queue
is attached to a scheduler and will be presented to the scheduler with the parameter, the weight
with the value of 10. Typically, the weight defines the--how much--what share of the link
the flow will get. The other flowset, number two, is configure as attached to the same
scheduler. It has a mask which has--which takes a few bits of destination IP and the
weight is one, which means that when a packet reach this flowset the partition in multiple
queues according to the final eight bits of the destination IP, like in this example.
And they are sent to the same scheduler as before, but with a weight of one. So when all
these queues reach the scheduler, presumably the queue coming from flowset one will get
a bigger share of the bandwidth than the others. Again, we need a couple of rules,
because so far we have just defined the objects; we need classifier rules to send
packets to these flowsets. And here are the rules, simple as before, except that instead
of the pipe keyword we use the queue keyword. We select traffic coming from my PC and
send it to flowset number one, with the bigger weight, and traffic coming from the other
machines on my subnet is sent to flowset two, which generates per-IP queues with
different and smaller weights. Regarding links, the basic features of links are
bandwidth and delay.
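Put together, the flowset example just described might look like this in ipfw syntax
(the addresses, the link rate, and the exact spelling of the scheduler line are
illustrative):

```shell
# flowset 1: a single queue, weight 10, attached to scheduler 5
ipfw queue 1 config sched 5 weight 10

# flowset 2: weight 1, one queue per value of the last 8 bits
# of the destination address
ipfw queue 2 config sched 5 weight 1 mask dst-ip 0x000000ff

# the scheduler itself, running WF2Q+ on a 4 Mbit/s link
ipfw sched 5 config type wf2q+ bw 4Mbit/s

# classifier rules feeding the flowsets
ipfw add queue 1 ip from 10.0.0.1 to any        # my PC, bigger weight
ipfw add queue 2 ip from 10.0.0.0/24 to any     # the rest of my subnet
```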
We have a uniform random loss feature, which I mentioned before. Reordering is not a
feature of the link, but it can be implemented through probabilistic packet matching,
using a configuration like this: I can tell the classifier to send packets to pipe one
with probability 0.3, and send packets to pipe two with 100% probability, subject of
course to matching the other parameters. And as you see, the pipes can have different
delays; so when traffic goes through these rules, part of it goes to one pipe, part of
it goes to the other, and the result is that packets are reordered. About MAC layer
overheads: we have many wireless links now, and those are not as simple as the model we
have in dummynet, where each bit takes a constant amount of time; there is significant
protocol layer overhead such as preambles, contention, et cetera. We have two mechanisms
to implement this. One is using a scheduler, because the MAC itself is in fact a
scheduler. The other is the feature that I'm going to show now, which is called
profiles. With profiles, what we do is model the extra air-time for packet transmission,
and we use an empirical distribution of the extra air-time, which we inject into the
emulator. Then, at runtime, for each packet the emulator computes the actual
transmission time of the packet and adds a random value taken from this distribution.
Depending on how we set these tables, these empirical functions, we can model simple
things like just the framing of the packets, or we can model more complex things,
although not in a very precise way.
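The per-packet computation just described can be sketched in a few lines of awk. The
three-point empirical table below is made up for illustration, not taken from a real
802.11 profile, and the emulator would draw `u` at random for every packet rather than
fixing it:

```shell
awk -v size=1500 -v bw=1000000 'BEGIN {
    ser = size * 8 / bw            # serialization time, seconds
    u = 0.7                        # per-packet random draw in the emulator
    # made-up empirical table: cumulative probability -> extra air-time
    if      (u <= 0.5) extra = 0.000192
    else if (u <= 0.9) extra = 0.000354
    else               extra = 0.001500
    printf "%.6f\n", ser + extra   # total emulated transmission time
}'
```

With `u` fixed at 0.7 this prints `0.012354`: 12 milliseconds of serialization for a
1500-byte packet at 1 Mbit/s, plus the sampled extra air-time.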
More complex things could be contention on the channel, or even link-layer
retransmissions, or eventual losses after a number of retransmissions. In terms of
schedulers, the original dummynet only had a FIFO scheduler; then we added the ability
to run a specific scheduling algorithm, WF2Q+, which is a proportional share scheduler
with very good scalability and service guarantee features. Then, because we needed to
make a better emulation of the MAC layer, and because we are also doing some research on
packet scheduling, we added the ability to have configurable schedulers in the system.
So now we have defined an API with which we can load different scheduling algorithms
into the kernel at runtime, and then we can configure the pipes to use one or the other
type of scheduler. At the moment we have this list of schedulers, and others are coming:
we have FIFO, deficit round robin, priority, and so on. Schedulers differ in the type of
guarantees they provide and also in their runtime complexity. So, in some situations you
want a simple solution; in other situations you want a more complex solution and you're
willing to pay the extra computation time. Also, as I've said, the MAC layer is itself a
scheduler, and so we are developing (it's almost finished) an 802.11b scheduler, and
then others will come after this first prototype. Schedulers have a mask too, same as
flowsets. So whenever a flowset sends packets to a scheduler, the scheduler does an
additional partitioning of your traffic and can create on the fly multiple instances of
the scheduler, and then of the pipe attached to the scheduler. This means that, for
instance, an ISP can create independent schedulers for thousands of clients just by
defining one rule like the one at the bottom, which defines the type of scheduler and
the mask; and magically, well, automatically, each customer will get its own scheduler
and its own pipe, which does not conflict with the traffic of the other customers. The
fact that we have a well-defined and very simple scheduler API, which takes care of
almost everything you need in passing traffic through the kernel (assembling packets
into a queue, doing the classification, et cetera) makes it very simple to create or
test new scheduling algorithms. You don't have to worry about classification, getting
the traffic, locking, building a queue, allocating memory, and so on, and the schedulers
themselves become very, very simple, as you can see from the line counts of the various
implementations that we have in the kernel: all of them are below 1,000 lines of code,
and at least half of the lines are comments, copyrights and so on. The overall structure
of this revised version of dummynet, and how it links together pipes, flowsets, queues
and so on, is this: we have flowset objects, which dynamically create queues according
to the mask given in the flowset, and scheduler objects, which dynamically create
scheduler instances, again according to the mask given to the scheduler. The packets
from multiple flowsets can then go to one of the instances of the scheduler, created
dynamically as in this configuration. Testing the code is very important, especially
with kernel code, where it's not completely trivial to reproduce the operating
conditions. And some of the scheduling algorithms we are dealing with are a bit tricky,
because there are corner cases that must be tested; in some cases, if you make a mistake
in certain computations, the theoretical guarantees of the algorithm do not hold
anymore. So we really need to be sure, before putting a scheduler in the kernel, that
the implementation is correct. And so we built some support to run the schedulers in
user space, using a system like this, where we use the same exact code that we run in
the kernel, but linked to a user-space application which is driven by a packet
generator. The packet generator can create multiple flows with the features that we like
in terms of different addresses, packet lengths, weights, and so on. And there is a
controller which drives the packet generators and pulls traffic from the scheduler. The
role of the controller is to give us a way to control the operating point of the
scheduler: whether we want the scheduler to run in a near-empty situation, or with the
queues almost full, et cetera.
So we can try to exercise all the code paths within the scheduler. The other nice thing
about this setup is that we can run performance tests of the scheduling algorithms
themselves, without all the extra overhead of packet generation, or packet reception
from the interface, going through the routing layer, locking, and so on. So we can
really compare schedulers' performance in this way, which is something that would be
very difficult by running an experiment in the kernel with the entire packet processing
chain. What you see here is an example of how you can run an experiment by specifying
the algorithm type, the low and high thresholds on the number of packets in the
scheduler, and the composition of the flowsets that you want to generate; you can
generate many flowsets, specifying the weight and the number of flows belonging to each
of them. So, when you use an emulator you always wonder how accurate it is, how well it
reproduces the operation of the actual link. There are at least three main factors that
influence the accuracy of emulation. Most emulators run off a timer in the kernel, and
the granularity of the timer is generally on the order of one millisecond, but you can
push the timer to a very high frequency, and so to a very fine resolution. We have tried
running the FreeBSD kernel with HZ set up to 50,000, and it does work; there is quite a
bit of overhead in the system in doing that, but it works, and you see in the
experiments that the accuracy of the system improves a lot. There is another factor that
influences accuracy, which is the interference of competing traffic.
Eventually, the emulator receives traffic from a physical link and sends traffic to a
physical link. So there is no point in an emulator which is accurate to the microsecond
when you could have two pipes expecting to send a packet at the same time on the output
link: the packets will compete for the physical bandwidth on the link, and so one of
them might be delayed by as much as one maximum packet size, or more if you have
multiple pipes competing for the same link. At gigabit speeds, this value is on the
order of 12 microseconds, and 120 microseconds or more if, as is very common, you are
running a link at 100 megabits per second. So it doesn't make a lot of sense to bring
the timer accuracy down into the 20 microsecond range when you have another factor which
is much worse.
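The numbers above are just the serialization time of a maximum-size (1500-byte) Ethernet
frame at each rate, which we can check with a line of shell arithmetic:

```shell
# serialization time, in microseconds, of a 1500-byte packet
# at 1 Gbit/s and at 100 Mbit/s
for bw in 1000000000 100000000; do
    echo $(( 1500 * 8 * 1000000 / bw ))
done
```

This prints 12 and then 120, so a single competing maximum-size frame already costs more
than the timer accuracy being discussed.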
And the third factor is operating system interference. Most of the systems where
dummynet runs are not real-time systems, and many kernel activities can lock the CPU for
a period of time. If you run measurements while there are very CPU-intensive kernel
activities (heavy disk, heavy network load, et cetera) and you run a simple experiment
like pinging a remote node subject to that load, you will see that the ping response
times vary a lot: they are typically on the order of 50 to 70 microseconds, but very
easily they go up into the hundreds of microseconds range. And so this is just another
factor that limits the accuracy of the emulator. So, overall, if you run the emulator in
carefully controlled conditions, making sure that you don't put too much load on the
system, that you raise the resolution of the timer, et cetera, it's really feasible to
have hundred-microsecond accuracy on modern hardware, which is a good result in
practice. When you run an experiment in user space, even the process scheduler itself
will give you a much bigger jitter on your application. In terms of performance, we have
a detailed analysis in an upcoming issue of Computer Communication Review. Basically,
the main factor limiting performance is the per-packet processing time, and this comes
from three components. One is the classifier cost. Another one is the scheduling cost,
because when you run multiple pipes at the same time you have to schedule the work in
the emulator itself. And then there are emulation costs: how much work you need to do to
move packets through the emulator, from queues to the scheduler to the delay lines and
so on. We have made a detailed analysis of these three components. The classifier cost
has a constant part and one which is proportional to the number of rules, and on
entry-level PC hardware we get times on the order of 400 to 1,000 nanoseconds with up to
20 classifier rules, which is a decent size for the rule set of an emulation system. You
shouldn't need more than that, because the number of rules is typically logarithmic in
the number of flows that you need to handle.
Scheduling cost varies a lot, but all the algorithms that we have run in time between
O(1) and O(log n) in the number of flows. The emulation cost (moving packets through the
emulator and so on) is logarithmic in the number of pipes, and we have measured times of
up to 1.5 microseconds with 1,000 independent pipes. So, overall, we can estimate
between two and three microseconds per packet on entry-level PC hardware, so the
emulator is presumably fast enough for most applications one would like to try. Porting:
as I said, we have recently ported the code to Linux and Windows, so now all major
operating systems have the ability to run emulation natively in the kernel, and you can
test your own application without needing an external box. When we started doing this
work, the code was only for FreeBSD and OS X, and our decision was to make as few
changes as possible to the source code, except for headers. So we basically built
compatibility layers replicating the content of the FreeBSD headers, and then we built a
glue layer which implemented the FreeBSD APIs on top of the existing APIs of the other
operating systems. Most of the differences are in internal packet representation,
locking, packet filtering hooks, et cetera, but most systems have hooks and subsystems
with similar features, so most of the trouble was finding out the pairs of subsystems
that match across the different operating systems. In particular, for packet
representation, most of the time we can avoid copying the entire packet.
The FreeBSD representation uses mbufs, and as in many other representations, mbufs are
just a container for metadata plus a pointer to the actual packet data. When we get
packets on Linux, they are represented by skbuffs, and when we get packets on Windows,
they are represented by NDIS packets. Our strategy was the same in all cases: whenever
we get a packet from a foreign operating system (foreign to the dummynet source code),
we create a pseudo-mbuf descriptor, initialized with the metadata from the original
representation and with a pointer to the original data. All the processing then goes on
on this pseudo-mbuf, and when the packet is released back to the original kernel, we
destroy this external data structure. A similar approach is used for the other APIs,
like locking and the packet filter hooks. Perhaps the most challenging thing is that in
some systems, like Linux, the APIs change very frequently, and given that our module is
not part of the kernel but an external module, we need the same set of sources to
support old versions of Linux from 2.4 to all variations of 2.6; the code then becomes
really spaghetti, to adapt to the various versions of the same API. You can get more
information at this address. In terms of availability: dummynet has been in FreeBSD
since '98, and in OS X probably since 2006; we completed the Linux and OpenWRT versions
in 2009, and the Windows version was completed a couple of months ago. And these are the
students who worked on various parts of the project in addition to myself. And now we
move to the second topic, which is O(1) packet scheduling at high data rates. The first
question is: why do we care about packet scheduling? A few years ago, people might have
thought, okay, we can solve most scheduling problems by just over-provisioning our
links. The fact is that links have become fast, but CPUs have become very fast too, so
it's very easy for a process on a workstation, or a workstation on a network, to
completely saturate one of the links, and then overprovisioning is not feasible. The
goal of scheduling is to arbitrate access to common resources, and to give each of the
customers of these resources certain service guarantees and guaranteed isolation in the
use of these resources. So again, overprovisioning is, in my opinion, not feasible
anymore in many cases, and that's why we need real schedulers. And links are very fast
too, so the scheduling algorithm must keep up with the speed of the traffic; it must be
very fast itself. Now, in terms of setting up the problem: what are our service
guarantees? What do we care about? Well, very often (at least, this is the model that we
follow) we consider an ideal system, called a fluid system, where the communication
link, or the scheduled resource, is infinitely divisible among customers. Each customer
gets a fraction of the available resource which is proportional to a parameter called
the weight, so the formal definition is that a customer should receive a fraction phi_i
divided by the sum of the weights of all the customers that are active on the link at
that time; these are called the backlogged clients. In the fluid system, because the
link is infinitely divisible, the system serves all flows simultaneously.
And so you have perfect sharing according to these weight parameters. In the real
system, which we call a packet system, this perfect sharing is not possible because, at
a minimum, you have to serve one packet at a time. So you really need to define a good
sequence, a good order of serving customers, that approaches as much as possible this
perfect sharing among flows. And the packet system has constraints: in addition to
serving one packet at a time, one problem is that, often, it is not preemptable; also,
it must decide online. It cannot foresee the future, and so it cannot avoid sending a
packet because another one will be coming which has a higher weight, or which would end
its transmission earlier in the fluid system. What we normally do, in terms of defining
service guarantees, is compare the behavior of the fluid system and the behavior of the
packet system.
And the way we do this comparison is to compute the difference in the amount of service--this
is an equation. Equations always look good on paper. Now, what's the
meaning of this equation? It tells us that our performance metric is the maximum
value, over all customers k and over all possible time intervals delta t, of the difference
between the actual amount of service received by a customer, which is W_k(delta t),
and the theoretical amount of service that that customer would receive in the fluid
system, which is phi_k times W(delta t), where W is the total amount of work done in the system. In
the best possible packet system, and there exists such a system, which is called WF²Q,
this deviation in terms of service is equal to one maximum segment size, which
means that in the worst case a customer will never be behind in terms of service by more than
one packet, or ahead in terms of service by more than one packet. Being idle with respect to
the fluid system is not a good thing because it leads to burstiness, and applications generally
do not like burstiness, just as humans don't like burstiness. I drink one bottle
of water a day, but I don't like to drink 30 bottles of water at the beginning of the
month, and nothing for the rest of the month. So now, this index captures very well this
idea of fairness in the distribution of traffic. And we have a well-defined formula
to compute the amount of service received in the fluid system. So that's a good reference.
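The fluid-system reference just described can be sketched in a few lines; this is my own illustration, not code from the talk, and all names and units here are made up for the example:

```python
# Hypothetical sketch of the fluid-system reference model: each backlogged
# customer k with weight phi_k receives the fraction phi_k / sum(phi_j over
# backlogged customers j) of the link capacity, at every instant.

def fluid_shares(weights, backlogged, capacity):
    """Instantaneous service rate of each backlogged customer."""
    total = sum(weights[k] for k in backlogged)
    return {k: capacity * weights[k] / total for k in backlogged}

# flow "b" (weight 2) gets twice the rate of flow "a"; idle "c" gets nothing
rates = fluid_shares({"a": 1, "b": 2, "c": 1}, {"a", "b"}, capacity=1000)
```

Note that only backlogged flows share the link: an idle flow neither receives service nor affects the shares of the others.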
Also, we have an actual packet system which can give this optimal performance index which
is one MSS. And so we also have a reference for what is feasible in practice. And so,
we can design scheduling algorithms and see how well they behave compared to this
optimal system. In fact, one would wonder, why do we need to design scheduling algorithms
when we have this optimal system? And the answer is easy: for this optimal system,
there is a theoretical result that says that it has a runtime complexity of at least log
N in the number of flows, and there is a result by a student of mine in 2004 who came up with
an actual algorithm which matches this lower bound. So log N might be expensive when the
number of flows becomes very high, or the packet rate becomes very high.
And so, we need to break this barrier. And breaking this barrier implies relaxed
guarantees, because if the theorem is correct, and the theorem is correct, we
cannot have a faster algorithm which gives this optimal performance index. Now,
we can relax the guarantees in various ways, and depending on
how far we are from this optimal value, we have better or worse schedulers.
The state of the art includes the categories listed here. We have priority-based schedulers,
which are really fast but give no guarantees to anyone except the king--the flow with the highest
priority. We have round robin or deficit round robin schedulers, or variations of them, which
have O(1) runtime but poor guarantees. Basically, the deviation
between the ideal service and the actual service can be as
large as N times the optimal, which is a bad thing because, again, we can have
large bursts. There is a family of schedulers which are called timestamp-based
schedulers, which internally model the behavior of the fluid system and try to track it. And
WF2Q, the optimal scheduler, is one member of this family, which gives the optimal service
guarantees in log N time, and there are approximated variants of these schedulers, which have
O(1) runtime, and their guarantees are reasonable: just a constant factor larger than the
optimal guarantees. But so far, these schedulers had a runtime several times slower
than round robin, which is the best proportional share scheduler
that we have around. Our result is a new algorithm called QFQ, which gives the same
guarantees as WF2Q+, a variant of timestamp-based scheduler, within more or less one maximum packet
size, and it's truly constant time. The other algorithms that I mentioned before actually depend
on some other dimensions, which are not the number of flows but some other parameters
of the algorithm. So, in the end their code needs to iterate over some data structure
to perform operation. Our algorithm has no loops inside, just because it is based on
some data structures which enforce some ordering properties that are involved in the scheduling
decisions. And the instructions that we use are very simple: basically AND/OR masking
and find-first-set. There is only one multiplication in the enqueue or dequeue operation. And in terms
of runtime, again, on my desktop machine, which is an entry-level workstation, an
enqueue/dequeue pair, which are the basic operations for pushing a packet to and pulling
a packet from a scheduler, runs in 110 nanoseconds per packet, which compares to
55 nanoseconds for deficit round robin, and over 400 nanoseconds for KPS, which is,
I think, the fastest competitor in terms of O(1) scheduling algorithms. I think the
result is important because it makes fair queuing feasible in software at gigabit per
second rates or feasible also in--on inexpensive hardware like switches. An overview of the
algorithm; I will not enter into the details because some of them are very technical and
boring, but the idea is that it tries to track the behavior of a fluid system. And
for each flow and for each packet, it tries to compute when the packet would start service
and finish service in the fluid system and this is something that all timestamp-based
schedulers do. So each packet is tagged with a virtual start time, virtual finish time.
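As a hedged sketch of this common timestamping step (my own illustration, not the dummynet or QFQ source code), the virtual start and finish times in WF2Q+-style schedulers are typically computed like this:

```python
# Illustrative sketch of virtual start/finish timestamping in the style of
# WF2Q+ (not the actual QFQ/dummynet code). V is the virtual system time,
# which tracks the evolution of time in the emulated fluid system.

class Flow:
    def __init__(self, weight):
        self.weight = weight
        self.S = 0.0  # virtual start time of the head packet
        self.F = 0.0  # virtual finish time of the head packet

def timestamp(flow, pkt_len, V):
    """Tag the head packet of a newly backlogged flow."""
    flow.S = max(V, flow.F)                  # cannot start before previous finish
    flow.F = flow.S + pkt_len / flow.weight  # service time scales with 1/weight

f = Flow(weight=2)
timestamp(f, pkt_len=1000, V=0.0)    # S = 0, F = 500
timestamp(f, pkt_len=1000, V=100.0)  # S = 500 (not 100): back-to-back packets
```

The `max(V, flow.F)` term is what keeps a flow from accumulating credit while idle: a packet can never be stamped as starting before the previous one finished in the fluid system.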
And the system also tracks a virtual system time, which tracks the evolution of
time in the fluid system. Now, the idea is that we would like to schedule packets
in the same order as in the fluid system. So, we would send packets ordered by finish
time--by virtual finish time. The problem is that sometimes, when we have to make a scheduling
decision, the next packet that the fluid system would serve has
not arrived yet, so we cannot really make the same decision. We have a [INDISTINCT]
in this process. And the other thing is that in order to avoid too much burstiness, we should
avoid serving packets that have not yet started service in the fluid system. So, basically,
our scheduling decision is mostly based on sorting, but it's a constrained form of sorting
because it also has to account for these virtual start times. And the sorting step is what
implies the log N lower bound on the complexity of the algorithm. And as in
many cases with sorting, you can try to reduce the complexity by sorting on a bounded universe.
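The bounded-universe idea can be sketched like this; the constants are made up for the example and are not QFQ's actual parameters:

```python
# Sketch of bucket-sorting approximated finish times: rounding timestamps onto
# a grid leaves only a bounded number of distinct values, so insertion and
# selection take constant time. SLOT and NBUCKETS are illustrative constants.

SLOT = 64        # rounding granularity applied to virtual finish times
NBUCKETS = 32    # size of the bounded universe of distinct values

buckets = [[] for _ in range(NBUCKETS)]

def bucket_of(finish_time):
    """Map an (approximated) finish time to its bucket index."""
    return (int(finish_time) // SLOT) % NBUCKETS

def insert(flow_id, finish_time):
    """Constant-time insertion: no comparisons, no tree rebalancing."""
    buckets[bucket_of(finish_time)].append(flow_id)

insert("a", 130)  # 130 // 64 == 2, so flow "a" lands in bucket 2
insert("b", 10)   # flow "b" lands in bucket 0
```

The price of the rounding is exactly the extra deviation the talk warns about next: two finish times that differ by less than one slot become indistinguishable to the scheduler.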
So you basically approximate the values so that you have only a finite number of distinct
values; that's why you can use some kind of bucket sort algorithm, which is a constant-time
sorting algorithm. But when you do this approximation, you have to be very careful not to introduce
additional deviation in the behavior of the scheduler from the ideal scheduler. So quickly
glancing at the data structure of QFQ, the yellow boxes are flows here. And the idea
is that when we need to enqueue a packet, the packet belongs to a flow. We partition
the flows into a number of groups, and each group has a different rounding of the virtual
timestamps. The rounding is proportional, in a way, to this parameter: the maximum
packet length divided by the weight of the flow, so the heavier the flow, the finer
the rounding that you get. Anyway, the partition is done
in such a way that flows with similar features end up in the same group.
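A hedged sketch of that static flow-to-group mapping (my illustration; QFQ's actual formula and constants differ): the group index can be derived from the magnitude of the ratio between maximum packet length and weight, so flows with similar ratios share a group, and each group can round its timestamps with a granularity tied to its index.

```python
import math

# Illustrative flow-to-group mapping (not QFQ's exact formula): the group
# index grows with log2(max_pkt_len / weight), so group i can round its
# timestamps with granularity proportional to 2**i. Heavier flows have a
# smaller ratio and therefore land in a lower group with finer rounding.

def group_index(max_pkt_len, weight):
    return max(0, int(math.floor(math.log2(max_pkt_len / weight))))

# similar length/weight ratios -> same group
assert group_index(1500, 1) == group_index(1400, 1)
# doubling the weight moves the flow to a lower (finer-rounding) group
assert group_index(1500, 2) < group_index(1500, 1)
```

The point of the logarithmic spacing is that a small, fixed number of groups covers the whole practical range of packet sizes and weights, which is what lets the selection step stay constant time.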
Within each group, by virtue of the features of the algorithm, we can only have
a finite number of different finish time values and also start time values. So basically,
we have a finite number of slots where a flow can end up within each group. So within each
group we can sort by using a bucket sort algorithm. And for selection purposes,
what other algorithms which use a similar principle do is scan
the list of groups looking for the flows with the minimum finish time that are so-called
eligible. So we look for the flows that have already started service in the fluid system.
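A sketch of the selection idea (illustrative, not the in-kernel code): a flow is eligible once its virtual start time is not ahead of the virtual system time, and if the groups holding eligible flows are kept in a bitmap ordered by finish time, the group to serve first is one find-first-set away.

```python
# Illustrative eligibility test and bitmap-based selection (not actual QFQ code).

def eligible(virtual_start, V):
    """A flow is eligible once it has started service in the fluid system."""
    return virtual_start <= V

def find_first_set(bitmap):
    """Index of the lowest set bit, or -1 if the bitmap is empty.

    In C this is a single machine instruction (ffs/bsf); in Python the
    classic bit trick x & -x isolates the lowest set bit.
    """
    if bitmap == 0:
        return -1
    return (bitmap & -bitmap).bit_length() - 1

# groups 3 and 5 hold eligible backlogged flows; group 3 is served first
assert find_first_set((1 << 3) | (1 << 5)) == 3
```

Serving only eligible flows is what prevents the burstiness mentioned earlier: a packet is never sent before it would have begun service in the fluid system.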
In our algorithm, we organize the groups in a number of sets, four different sets, which
are represented by these bitmaps, so that the ordering of indexes
of the groups reflects the ordering of finish times in the groups themselves. So
when we need to find the group with the smallest finish time, we only actually need to find
the bit in the set which has the smallest index, and this is easily found with a single
machine instruction, which is find-first-set. So the tricky part of the algorithm is
in managing the sets so we don't need to, you know, iterate over the groups to manage
the sets, and organizing them in a way that preserves this ordering, without entering into many
details. But, I mean, once you follow all the proofs, the algorithm
becomes very, very simple. When you enqueue a packet--and we'll see
it here--you just need to find the group it belongs to, which
is basically a static decision; insert, using bucket sort, the packet and the flow
within one of the buckets in this group; and then update these four bitmaps that represent
the sets that we care about. One of these bitmaps is the one from which we do the selection
of the next candidate for transmission. In the dequeue process, we
just look at this particular bitmap, where we find the first bit set, which is the group
which contains the flow that should be served first. Then we go to this data structure, which
has a finite number of buckets; again using find-first-set, we find the
first bucket which has flows in it, and we pick the first flow in the queue, and that's
all. After doing that, we need to put the flow back into the proper position,
usually within the group, and do some management of these four bitmaps, which is easily done
using a few logic operations, basically masking, to move a number of groups from one bitmap
to the other. So, now, we move to the performance evaluation of our algorithm
in a paper that's on the webpage that I will mention in a moment. There are proofs
that show that the service guarantees for our system are expressed by this formula, and
the end result of this formula, once you put in the constants, et cetera, is that the WFI--the
performance index--is within five maximum segment sizes, compared to one for the optimal
system. There is an equivalent in terms of delay which is expressed by this formula,
but we don't care much about it. In terms of actual running time, how fast
is QFQ compared to the other systems? For this, we used the same scheme
that I showed before when talking about dummynet. We implemented an actual QFQ scheduler,
and we have actual implementations of many other schedulers, including
competing algorithms. And we tested with the system mentioned before, with a variety of
configurations in terms of number of flows and packet sizes, and scheduler status in
terms of occupation. And the performance data are here. This graph shows the scalability
of the system. What we compare here is the baseline, which is this curve called NONE,
where we run the same system as before without any scheduler.
So it's just the overhead of generating the traffic and driving the controller; and, more or less
independently of the number of flows, the system with no
scheduler consumes this amount of time in our test, around 60 nanoseconds or so. When
we add a simple FIFO scheduler, even just doing the enqueue and the dequeue takes a little
bit of time. And we have the second curve, deficit round robin, which is, as I've said,
one of the fastest proportional share schedulers that we have; it's this curve over here. And then
QFQ is just above here, so about twice as slow as deficit round robin. The
next competitor is KPS; it's up here in the 400 to 500 nanosecond range for a pair. And
just for comparison, this is WF2Q+, which has a logarithmic runtime, and in
effect we can see this from the graph, because the X-axis is logarithmic in the number of
flows. From this experiment we see a few things which can easily be explained. As you see,
most of the algorithms are flat up to 256 or perhaps even 1K flows. When we go up, we
start seeing caching effects, because even though these algorithms have O(1) runtime,
the amount of memory they touch is proportional to the number of flows.
So, when we start running out of cache, the runtime increases, and the runtime
increases for all the algorithms. QFQ has a peculiar behavior with only one flow. One
flow in the scheduler is almost the worst possible case for this kind of scheduler, because
every time you take the flow out of the data structure and put it back
in the same place, after making a lot of useless checks to see if there is some other flow
ready for service. So, we could have optimized the code to look good at this point, but then
we would have had worse performance at the other points. And in the end, with
a scheduler with only one flow, you don't care too much about the runtime. So, these
are real numbers from a real implementation, and some of these phenomena you wouldn't see
from simulation or from purely theoretical analysis. The previous graph showed
the performance only with flows of the same type, but we can run the same experiment with
a mix of flows of different features in terms of weights and, of course, we have different
results. This table shows the results of the experiments with different combinations
of flows and different algorithms. As we see, the numbers vary; the standard deviations are
really small in our experiments, just because we removed most of the sources of noise from
the experiments. There's nothing particular to say other than here we can also see the
growth of the runtime of WF2Q+, which is logarithmic. With a small number of
flows even this algorithm is acceptable; as the number of flows grows, we probably really
want one of the fast algorithms. In conclusion: we have a timestamp-based scheduler which
has near-optimal service guarantees and a true O(1) runtime which is comparable to
round robin. And this is actual code, not just a theoretical result; it's already
available as part of dummynet, so you have it on many platforms now. And the code is
available at this link. This is joint work, by the way, with Fabio Chiccone, a student
of mine; and Paolo Valente, a former student of mine. In addition
to the dummynet implementation, we are planning to put this scheduler also within a Click
module, just because Click is widely used as a solution for building all sorts of networking
devices, routers, switches, emulation systems, and so on. As future work, we look at doing
more detailed performance analysis to see how we can optimize this system even more.
Especially, we think of--we are actually running the scheduler on OpenWrt
platforms, which are, you know, cheap access points, systems which have only a 200
MHz CPU, not much memory, not a lot of memory bandwidth. We would like to see how fast the
system works there, and how we can optimize it for this platform. Also, we want to see
how much we can gain from a hardware implementation, because there are many parts of this algorithm
that can actually run in parallel. And so, if we can isolate these parts and build
a hardware implementation, perhaps on NetFPGA or in the firmware of some network card, that
would be an interesting test to make. So, that covers the talk. You have a
link to my dummynet page and the QFQ one; for everything else, there's Google. That's all. Questions
are welcome. Thanks. >> [INDISTINCT] starts with the question and
answer. I know you [INDISTINCT] >> RIZZO: So, [INDISTINCT] was some of [INDISTINCT]
within the FreeBSD Project last year. He applied at my department, not as, you know,
one of the standard organizations. And unfortunately, the application was not accepted. Frankly,
I don't know the reason, but we can discuss this offline. But no, I have been trying to
encourage my students to apply to Summer of Code in the past, and actually five of
them have worked on several FreeBSD projects, some of them related, some of them not
related. And actually, now that you ask, there is some other activity we have been doing
also within the [INDISTINCT] context of some of this scheduling. I couldn't cover it in
this talk, which was already too dense, but I'm happy to discuss it offline if people
are interested. Yup?
>> RIZZO: Yeah, okay. In the theoretical--so, okay, sure. The question is
whether we measured the actual deviation of the scheduler from the optimal deviation.
The answer is we did, and there are some experiments where we showed the average deviation. I
don't remember, at least. I'm sure we have the data; I just don't remember
the result of what actual deviation we measured in our experiments. We have to think
that in many cases we care about the worst cases, because you never know how an attacker could drive
your network and try to force you to run in one of those worst cases. So, it would
be great if we were able to replicate those worst cases with a particular traffic
pattern. And actually this is one of the goals of having an environment for testing that
we could run, using a LAN with a controller that drives the scheduler. We haven't worked
on actually trying to figure out what the worst-case patterns for the schedulers are.
Okay. Yup? >> Is it possible for you to compare briefly
ipfw and dummynet associated with [INDISTINCT] and so on [INDISTINCT].
>> RIZZO: Oh, yes, they're classified, yes. In Linux or the emulation solutions?
>> [INDISTINCT] the functionality. So what--how--why would I want to use [INDISTINCT] what's already
in Linux? >> RIZZO: Okay. So, the question was how do
ipfw and dummynet compare with the other solutions that exist in Linux? So, we need
to go back to the first--yeah, to this slide. So, NISTnet is not actively developed, I think,
and none of them is a standard part of Linux except perhaps tc. NISTnet
is not actively developed as much, including in terms of features. tc is mostly, as
I understand, a traffic shaper. It doesn't have support, for instance, to emulate
delays. And so, you need an external package to do all the delay emulation. And at least
the feedback that I got from other people who have tried different
solutions is that ipfw and dummynet are a lot easier to use, because it's
really as simple as issuing a few commands. In all the other
cases you have to familiarize yourself with the structure of tc, how the classifier works, and how to
work the additional emulation package, you know, on top of it. The other solution is the netpath
project, which builds an emulator using a Click configuration. Click is a system where you
can basically assemble modules in a graph, connect them, and then--it has a very efficient
device driver and it has a very efficient packet processing path. And so in terms of
comparison with dummynet, the netpath approach is flexible, because you can, in principle,
either build or use one of the existing Click modules to do all sorts of emulation that you
like. You can do some kind of scheduling, but there isn't actually a good range of schedulers
in Click. The performance of the netpath solution is better; I think, given the same
algorithm, netpath goes twice as fast as dummynet in terms of peak performance. But this is
only if you use it as a dedicated system, because most of the overhead in dummynet doesn't
come from the processing of ipfw and dummynet; it comes from the network stack. So by the
time you use netpath on a standard Linux box you have all this overhead back in, and basically
the performance is the same. So, I mean, I'm clearly biased in saying that dummynet and
ipfw are a lot easier to use, but this is the message that I got from a lot of people
using these other solutions as well. Okay. Thank you all for coming.