iproute2+tc notes
The iproute2+tc package allows access to the variety of neat new networking features in the 2.2 kernels. Policy routing, NAT, QoS, advanced tunnels, RSVP and Differentiated services, are just a few of the buzzword capabilities unleashed by the ip and tc programs.
Contents
Introduction
This is the second version of this document; the first began as a simple command reference for my own use and evolved into a slightly more complex document, including some usage notes and a bunch of links to documents. That version became dated rather rapidly, and so I've redone it from fresh sources. Hopefully the pace of development has slowed somewhat (at least on the user interface side of things), and this version will remain useful a bit longer.
Suggestions, additions, sources, comments, etc to dragon@snafu.freedom.org. Specifically more detailed explanations of the commands listed in the Command Syntax section, for inclusion, would be extremely welcome.
All of the information pulled from linux/ files comes from linux-2.2.14-pre9. The iproute2/ files come from the iproute2-2.2.4-now-ss991023.tar.gz distribution.
Distribution
From the Documentation/Configure.help file of the linux distribution; "QoS and/or fair queueing" item:
To administer these schedulers, you'll need the user-level utilities
from the package iproute2+tc at ftp://ftp.inr.ac.ru/ip-routing/
That directory is mirrored at:
- ftp://linux.wauug.org/pub/net
- ftp://ftp.nc.ras.ru/pub/mirrors/ftp.inr.ac.ru/ip-routing/
- ftp://ftp.gts.cz/MIRRORS/ftp.inr.ac.ru/
- ftp://ftp.funet.fi/pub/mirrors/ftp.inr.ac.ru/ip-routing/ (STM1 to USA)
- ftp://sunsite.icm.edu.pl/pub/Linux/linux-iproute/
-
ftp://ftp.sunet.se/pub/Linux/ip-routing/
It's probably worth picking up the latest version; if your distribution includes these tools at all they're likely to be a revision or two behind.
Building
Unpack the distribution tarball. Edit the Makefile; you need to tell it where to find the kernel include files with the KERNEL_INCLUDES variable. You might also need to tell it about libraries or something; I didn't have to. A simple "make all" compiles the programs ip/ip, ip/rtmon, and tc/tc. I didn't get but a couple of warning messages during the compile. Copy those programs to /sbin, and you're ready to play.
Keep the unpacked distribution, it's got several examples and utility scripts which may well be useful, especially if you're going to be using the tc program to do QoS stuff. In that case, you may also need to edit a kernel header file. In the tc/ directory of the distribution there's a file named "README.last", that explains the possible values and meanings of PSCHED_CLOCK_SOURCE in the file linux/include/net/pkt_sched.h. The default "PSCHED_JIFFIES" there is probably what you want anyway.
Kernel Configuration
Most of the capabilities of the kernel accessed by the ip and tc programs probably aren't included in your distribution's stock kernel configuration. If you can't get a feature to work, make sure that all of the required kernel configuration values are set properly. I've tried to note what kernel configuration options are required or related in the Command Syntax section. I'm often guessing, so don't be suprised if I've missed something (please do tell me about it, though).
Documentation
If you can't find the answers to your questions in this document or in the documentation listed below, you might try searching through the linux-net mailing list and linux-kernel mailing list archives and/or asking the habitants of those lists.
from the Distribution
In the iproute2/ directory you'll find the README file, which contains build instructions; the RELNOTES file, and the README.iproute2+tc file, which discusses the tc Traffic Control tools. The README.decnet file contains some points relevant to DECnet. There's a README.last file in the iproute2/tc/ directory that discusses the clock source for the QoS code; it's a bit dated but it explains what the options are.
In the iproute2/doc/ directory are several documents; the IP Command Reference (HTML, TeX, and Postscript); Tunnels over IP in Linux-2.2 (HTML, TeX, and Postscript); and IPv6 Flow Labels in Linux-2.2 (HTML, TeX, and Postscript).
from the Kernel
The kernel's Documentation/networking/ has a couple of files; the routing.txt file, and the policy-routing.txt file. These are slim and outdated, but they introduce a couple of important concepts. I've got a partial kernel Configure.help file, mostly networking options, that should be a little easier to read than via make config.
from everywhere else
Timur A. Bolokhov has started an "Advanced routing mini-HOWTO" (HTML,TeX and Postscript), which ain't bad at all. The date on those files from the original site (ftp://post.tepkom.ru/pub/vol2/Linux/docs/) was Sep 29 1998; they might be a little dated, but not so much so as the kernel docs.
Saravanan Radhakrishan has put together a fine collection of links and documentation at the IITC QoS site, dealing with the implementation details of the QoS system and how to attack it from C via netlink. I haven't read it all yet but it should be helpful. Specifically see the Linux-Qos-HOWTO (local copy) and the Netlink HOWTO(local copy).
The README.iproute2+tc file mentions "Terminology and advices about setting CBQ parameters may be found in Sally Floyd papers." Some of those can be found at her site. She 's also got pages listing resources for CBQ, and RED. Programmers looking to interface with the kernel side of the new networking capabilities may find some joy in Andi Kleen's new network man pages.
Juergen Jaeschke pointed me to the Differentiated Services on Linux Internet Draft, which contains some good discussion of the iproute tc traffic control tools.
Werner Almesberger has posted the slides he used for his presentations at the Singapore Linux conference '99 and the Linux Expo '99; these are PostScript files. I've got copies of both, Linux Network Traffic Control Implementation Overview slides from Linux Expo, and the QoS Networking with Linux slides from the Singapore conference. He also has an excellent Linux Traffic Control - Implementation Overview paper on his site (which I've copied).
I've collected some mail list messages that discuss the usage of these tools. One of the mail list messages responding to a request for information on the traffic control toys said "see Cisco Docs"; the IOS v12 Configuration Guides and References and most specifically the Quality of Service Solutions Configuration Guide are likely to be helpful.
Open Resource has an article on linux 2.2, which includes a some discussion of the new networking.
Command Syntax
ip command
From file iproute2/ip/ip.c
Usage: ip [ OPTIONS ] OBJECT { COMMAND | help }
where OBJECT := { link | addr | route | rule | neigh | tunnel |
maddr | mroute | monitor }
OPTIONS := { -V[ersion] | -s[tatistics] | -r[esolve] |
-f[amily] { inet | inet6 | dnet | link } | -o[neline] }See the IP Command Reference.
If it's called as "ipaddr", "iproute", "iprule", "ipneigh", "iplink", "iptunnel", or "ipmonitor" it runs the appropriate subcommand. So you can create a link to "iproute", and use:
# iproute (whatever)
instead of:
# ip route (whatever)
ip link
From file iproute2/ip/iplink.c
Usage: ip link set DEVICE { up | down | arp { on | off } |
dynamic { on | off } |
multicast { on | off } | txqueuelen PACKETS |
name NEWNAME |
address LLADDR | broadcast LLADDR |
mtu MTU }
ip link show [ DEVICE ]See the IP Command Reference.
ip address
From file iproute2/ip/ipaddress.c
Usage: ip addr {add|del} IFADDR dev STRING
ip addr {show|flush} [ dev STRING ] [ scope SCOPE-ID ]
[ to PREFIX ] [ FLAG-LIST ] [ label PATTERN ]
IFADDR := PREFIX | ADDR peer PREFIX
[ broadcast ADDR ] [ anycast ADDR ]
[ label STRING ] [ scope SCOPE-ID ]
SCOPE-ID := [ host | link | global | NUMBER ]
FLAG-LIST := [ FLAG-LIST ] FLAG
FLAG := [ permanent | dynamic | secondary | primary |
tentative | deprecated ]See the IP Command Reference.
ip neighbor
From file iproute2/ip/ipneigh.c
Usage: ip neigh { add | del | change | replace } { ADDR [ lladdr LLADDR ]
[ nud { permanent | noarp | stale | reachable } ]
| proxy ADDR } [ dev DEV ]
ip neigh {show|flush} [ to PREFIX ] [ dev DEV ] [ nud STATE ]See the IP Command Reference.
ip rule
Kernel Config: IP: advanced router, IP: policy routing, IP: use TOS value as routing key, IP: use FWMARK value as routing key, IP: fast network address translation
From file iproute2/ip/iprule.c
Usage: ip rule [ list | add | del ] SELECTOR ACTION
SELECTOR := [ from PREFIX ] [ to PREFIX ] [ tos TOS ] [ fwmark FWMARK ]
[ dev STRING ] [ pref NUMBER ]
ACTION := [ table TABLE_ID ] [ nat ADDRESS ]
[ prohibit | reject | unreachable ]
[ realms [SRCREALM/]DSTREALM ]
TABLE_ID := [ local | main | default | NUMBER ]See the IP Command Reference.
ip route
Kernel Config: IP: advanced router, IP: policy routing, IP: equal cost multipath, IP: large routing tables, IP: fast network address translation
From file iproute2/ip/iproute.c
Usage: ip route { list | flush } SELECTOR
ip route get ADDRESS [ from ADDRESS iif STRING ]
[ oif STRING ] [ tos TOS ]
ip route { add | del | replace | change | append | replace | monitor } ROUTE
SELECTOR := [ root PREFIX ] [ match PREFIX ] [ exact PREFIX ]
[ table TABLE_ID ] [ proto RTPROTO ]
[ type TYPE ] [ scope SCOPE ]
ROUTE := NODE_SPEC [ INFO_SPEC ]
NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ]
[ table TABLE_ID ] [ proto RTPROTO ]
[ scope SCOPE ] [ metric METRIC ]
INFO_SPEC := NH OPTIONS FLAGS [ nexthop NH ]...
NH := [ via ADDRESS ] [ dev STRING ] [ weight NUMBER ] NHFLAGS
OPTIONS := FLAGS [ mtu NUMBER ] [ advmss NUMBER ]
[ rtt NUMBER ] [ rttvar NUMBER ]
[ window NUMBER] [ cwnd NUMBER ] [ ssthresh REALM ]
[ realms REALM ]
TYPE := [ unicast | local | broadcast | multicast | throw |
unreachable | prohibit | blackhole | nat ]
TABLE_ID := [ local | main | default | all | NUMBER ]
SCOPE := [ host | link | global | NUMBER ]
FLAGS := [ equalize ]
NHFLAGS := [ onlink | pervasive ]
RTPROTO := [ kernel | boot | static | NUMBER ]See the IP Command Reference.
ip tunnel
Kernel Config: IP: tunneling, IP: GRE tunnels over IP, IP: broadcast GRE over IP
From file iproute2/ip/iptunnel.c
Usage: ip tunnel { add | change | del | show } [ NAME ]
[ mode { ipip | gre | sit } ] [ remote ADDR ] [ local ADDR ]
[ [i|o]seq ] [ [i|o]key KEY ] [ [i|o]csum ]
[ ttl TTL ] [ tos TOS ] [ [no]pmtudisc ] [ dev PHYS_DEV ] Where: NAME := STRING
ADDR := { IP_ADDRESS | any }
TOS := { NUMBER | inherit }
TTL := { 1..255 | inherit }
KEY := { DOTTED_QUAD | NUMBER }See the IP Command Reference, also see Tunnels over IP in Linux-2.2 and my (old) tunnel notes.
ip maddress
From file iproute2/ip/ipmaddr.c
Usage: ip maddr [ add | del ] MULTIADDR dev STRING
ip maddr show [ dev STRING ]See the IP Command Reference.
ip mroute
From file iproute2/ip/ipmroute.c
Usage: ip mroute show [ PREFIX ] [ from PREFIX ] [ iif DEVICE ]
See the IP Command Reference.
ip monitor
Kernel Config: IP: verbose route monitoring
From file iproute2/ip/ipmonitor.c
Usage: ip monitor [ all | LISTofOBJECTS ]
See the IP Command Reference.
rtmon command
Kernel Config: IP: verbose route monitoring
From file iproute2/ip/rtmon.c
Usage: rtmon file FILE [ all | LISTofOBJECTS]
LISTofOBJECTS := [ link ] [ address ] [ route ]
tc command
Kernel Config: QoS support, QoS and/or fair queueing
From file iproute2/tc/tc.c
Usage: tc [ OPTIONS ] OBJECT { COMMAND | help }
where OBJECT := { qdisc | class | filter }
OPTIONS := { -s[tatistics] | -d[etails] | -r[aw] }Where it's expecting a number for BPS; it understands some suffixes: kbps (*1024), mbps (*1024*1024), kbit (*1024/8), and mbit (*1024*1024/8). If I'm reading the code correctly; "BPS" means Bytes Per Second; if you give a number without a suffix it assumes you want BITS per second (it divides the number you give it by 8). It also understands bps as a suffix.
Where it's expecting a time value, it seems it understands suffixes of s, sec, and secs for seconds, ms, msec, and msecs for milliseconds, and us, usec, and usecs for microseconds.
Where it wants a size parameter, it assumes non-suffixed numbers to be specified in bytes. It also understands suffixes of k and kb to mean kilobytes (*1024), m and mb to mean megabytes (*1024*1024), kbit to mean kilobit (*1024/8), and mbit to mean megabits (*1024*1024/8).
1Mbit == 128Kbps
qdisc
Kernel Config: QoS support, QoS and/or fair queueing
From file iproute2/tc/tc_qdisc.c
Usage: tc qdisc [ add | del | replace | change | get ] dev STRING
[ handle QHANDLE ] [ root | ingress | parent CLASSID ]
[ estimator INTERVAL TIME_CONSTANT ]
[ [ QDISC_KIND ] [ help | OPTIONS ] ] tc qdisc show [ dev STRING ] [ingress]
Where:
QDISC_KIND := { [p|b]fifo | tbf | prio | cbq | red | etc. }
OPTIONS := ... try tc qdisc add <desired QDISC_KIND> helpFrom file linux/net/sched/sch_api.c
Short review.
------------- This file consists of two interrelated parts: 1. queueing disciplines manager frontend.
2. traffic classes manager frontend. Generally, queueing discipline ("qdisc") is a black box,
which is able to enqueue packets and to dequeue them (when
device is ready to send something) in order and at times
determined by algorithm hidden in it. qdisc's are divided to two categories:
- "queues", which have no internal structure visible from outside.
- "schedulers", which split all the packets to "traffic classes",
using "packet classifiers" (look at cls_api.c) In turn, classes may have child qdiscs (as rule, queues)
attached to them etc. etc. etc. The goal of the routines in this file is to translate
information supplied by user in the form of handles
to more intelligible for kernel form, to make some sanity
checks and part of work, which is common to all qdiscs
and to provide rtnetlink notifications. All real intelligent work is done inside qdisc modules. Every discipline has two major routines: enqueue and dequeue. ---dequeue dequeue usually returns a skb to send. It is allowed to return NULL,
but it does not mean that queue is empty, it just means that
discipline does not want to send anything this time.
Queue is really empty if q->q.qlen == 0.
For complicated disciplines with multiple queues q->q is not
real packet queue, but however q->q.qlen must be valid. ---enqueue enqueue returns number of enqueued packets i.e. this number is 1,
if packet was enqueued successfully and <1 if something (not
necessary THIS packet) was dropped. Auxiliary routines: ---requeue requeues once dequeued packet. It is used for non-standard or
just buggy devices, which can defer output even if dev->tbusy=0. ---reset returns qdisc to initial state: purge all buffers, clear all
timers, counters (except for statistics) etc. ---init initializes newly created qdisc. ---destroy destroys resources allocated by init and during lifetime of qdisc. ---change changes qdisc parameters.
qdisc [p|b]fifo
From file iproute2/tc/q_fifo.c
Usage: ... [p|b]fifo [ limit NUMBER ]
From file linux/net/sched/sch_fifo.c
/* 1 band FIFO pseudo-"scheduler" */
qdisc tbf
Kernel Config: QoS support, QoS and/or fair queueing, TBF queue, Rate estimator
From file iproute2/tc/
Usage: ... tbf limit BYTES burst BYTES[/BYTES] rate KBPS [ mtu BYTES[/BYTES] ]
[ peakrate KBPS ] [ latency TIME ]From file linux/net/sched/sch_tbf.c
/* Simple Token Bucket Filter.
======================================= SOURCE.
------- None. Description.
------------ A data flow obeys TBF with rate R and depth B, if for any
time interval t_i...t_f the number of transmitted bits
does not exceed B + R*(t_f-t_i). Packetized version of this definition:
The sequence of packets of sizes s_i served at moments t_i
obeys TBF, if for any i<=k: s_i+....+s_k <= B + R*(t_k - t_i) Algorithm.
---------- Let N(t_i) be B/R initially and N(t) grow continuously with time as: N(t+delta) = min{B/R, N(t) + delta} If the first packet in queue has length S, it may be
transmited only at the time t_* when S/R <= N(t_*),
and in this case N(t) jumps: N(t_* + 0) = N(t_* - 0) - S/R. Actually, QoS requires two TBF to be applied to a data stream.
One of them controls steady state burst size, another
one with rate P (peak rate) and depth M (equal to link MTU)
limits bursts at a smaller time scale. It is easy to see that P>R, and B>M. If P is infinity, this double
TBF is equivalent to a single one. When TBF works in reshaping mode, latency is estimated as: lat = max ((L-B)/R, (L-M)/P) NOTES.
------ If TBF throttles, it starts a watchdog timer, which will wake it up
when it is ready to transmit.
Note that the minimal timer resolution is 1/HZ.
If no new packets arrive during this period,
or if the device is not awaken by EOI for some previous packet,
TBF can stop its activity for 1/HZ. This means, that with depth B, the maximal rate is R_crit = B*HZ F.e. for 10Mbit ethernet and HZ=100 the minimal allowed B is ~10Kbytes. Note that the peak rate TBF is much more tough: with MTU 1500
P_crit = 150Kbytes/sec. So, if you need greater peak
rates, use alpha with HZ=1000 :-)
*/
qdisc cbq
Kernel Config: QoS support, QoS and/or fair queueing, CBQ packet scheduler, Rate estimator
From file iproute2/tc/q_cbq.c
Usage: ... cbq bandwidth BPS avpkt BYTES [ mpu BYTES ]
[ cell BYTES ] [ ewma LOG ]From file linux/net/sched/sch_cbq.c
/* Class-Based Queueing (CBQ) algorithm.
======================================= Sources: [1] Sally Floyd and Van Jacobson, "Link-sharing and Resource
Management Models for Packet Networks",
IEEE/ACM Transactions on Networking, Vol.3, No.4, 1995 [2] Sally Floyd, "Notes on CBQ and Guaranted Service", 1995 [3] Sally Floyd, "Notes on Class-Based Queueing: Setting
Parameters", 1996 [4] Sally Floyd and Michael Speer, "Experimental Results
for Class-Based Queueing", 1998, not published. ----------------------------------------------------------------------- Algorithm skeleton was taken from NS simulator cbq.cc.
If someone wants to check this code against the LBL version,
he should take into account that ONLY the skeleton was borrowed,
the implementation is different. Particularly: --- The WRR algorithm is different. Our version looks more
reasonable (I hope) and works when quanta are allowed to be
less than MTU, which is always the case when real time classes
have small rates. Note, that the statement of [3] is
incomplete, delay may actually be estimated even if class
per-round allotment is less than MTU. Namely, if per-round
allotment is W*r_i, and r_1+...+r_k = r < 1 delay_i <= ([MTU/(W*r_i)]*W*r + W*r + k*MTU)/B In the worst case we have IntServ estimate with D = W*r+k*MTU
and C = MTU*r. The proof (if correct at all) is trivial. --- It seems that cbq-2.0 is not very accurate. At least, I cannot
interpret some places, which look like wrong translations
from NS. Anyone is advised to find these differences
and explain to me, why I am wrong 8). --- Linux has no EOI event, so that we cannot estimate true class
idle time. Workaround is to consider the next dequeue event
as sign that previous packet is finished. This is wrong because of
internal device queueing, but on a permanently loaded link it is true.
Moreover, combined with clock integrator, this scheme looks
very close to an ideal solution. */tristate 'CBQ packet scheduler' CONFIG_NET_SCH_CBQ
qdisc red
Kernel Config: QoS support, QoS and/or fair queueing, RED queue
From file iproute2/tc/q_red.c
Usage: ... red limit BYTES min BYTES max BYTES avpkt BYTES burst PACKETS
probability PROBABILITY bandwidth KBPSFrom file linux/net/sched/sch_red.c
/* Random Early Detection (RED) algorithm.
======================================= Source: Sally Floyd and Van Jacobson, "Random Early Detection Gateways
for Congestion Avoidance", 1993, IEEE/ACM Transactions on Networking. This file codes a "divisionless" version of RED algorithm
as written down in Fig.17 of the paper. Short description.
------------------ When a new packet arrives we calculate the average queue length: avg = (1-W)*avg + W*current_queue_len, W is the filter time constant (choosen as 2^(-Wlog)), it controls
the inertia of the algorithm. To allow larger bursts, W should be
decreased. if (avg > th_max) -> packet marked (dropped).
if (avg < th_min) -> packet passes.
if (th_min < avg < th_max) we calculate probability: Pb = max_P * (avg - th_min)/(th_max-th_min) and mark (drop) packet with this probability.
Pb changes from 0 (at avg==th_min) to max_P (avg==th_max).
max_P should be small (not 1), usually 0.01..0.02 is good value. max_P is chosen as a number, so that max_P/(th_max-th_min)
is a negative power of two in order arithmetics to contain
only shifts. Parameters, settable by user:
----------------------------- limit - bytes (must be > qth_max + burst) Hard limit on queue length, should be chosen >qth_max
to allow packet bursts. This parameter does not
affect the algorithms behaviour and can be chosen
arbitrarily high (well, less than ram size)
Really, this limit will never be reached
if RED works correctly. qth_min - bytes (should be < qth_max/2)
qth_max - bytes (should be at least 2*qth_min and less limit)
Wlog - bits (<32) log(1/W).
Plog - bits (<32) Plog is related to max_P by formula: max_P = (qth_max-qth_min)/2^Plog; F.e. if qth_max=128K and qth_min=32K, then Plog=22
corresponds to max_P=0.02 Scell_log
Stab Lookup table for log((1-W)^(t/t_ave). NOTES: Upper bound on W.
----------------- If you want to allow bursts of L packets of size S,
you should choose W: L + 1 - th_min/S < (1-(1-W)^L)/W th_min/S = 32 th_min/S = 4 log(W) L
-1 33
-2 35
-3 39
-4 46
-5 57
-6 75
-7 101
-8 135
-9 190
etc.
*/
qdisc sfq
From file iproute2/tc/q_sfq.c
Usage: ... sfq [ perturb SECS ] [ quantum BYTES ]
Kernel Config: QoS support, QoS and/or fair queueing, SFQ queue
From file linux/net/sched/
/* Stochastic Fairness Queuing algorithm.
======================================= Source:
Paul E. McKenney "Stochastic Fairness Queuing",
IEEE INFOCOMM'90 Proceedings, San Francisco, 1990. Paul E. McKenney "Stochastic Fairness Queuing",
"Interworking: Research and Experience", v.2, 1991, p.113-131. See also:
M. Shreedhar and George Varghese "Efficient Fair
Queuing using Deficit Round Robin", Proc. SIGCOMM 95. This is not the thing that is usually called (W)FQ nowadays.
It does not use any timestamp mechanism, but instead
processes queues in round-robin order. ADVANTAGE: - It is very cheap. Both CPU and memory requirements are minimal. DRAWBACKS: - "Stochastic" -> It is not 100% fair.
When hash collisions occur, several flows are considered as one. - "Round-robin" -> It introduces larger delays than virtual clock
based schemes, and should not be used for isolating interactive
traffic from non-interactive. It means, that this scheduler
should be used as leaf of CBQ or P3, which put interactive traffic
to higher priority band. We still need true WFQ for top level CSZ, but using WFQ
for the best effort traffic is absolutely pointless:
SFQ is superior for this purpose. IMPLEMENTATION:
This implementation limits maximal queue length to 128;
maximal mtu to 2^15-1; number of hash buckets to 1024.
The only goal of this restrictions was that all data
fit into one 4K page :-). Struct sfq_sched_data is
organized in anti-cache manner: all the data for a bucket
are scattered over different locations. This is not good,
but it allowed me to put it into 4K. It is easy to increase these values, but not in flight. */
qdisc prio
Kernel Config: QoS support, QoS and/or fair queueing, The simplest PRIO pseudo scheduler
From file iproute2/tc/q_prio.c
Usage: ... prio bands NUMBER priomap P1 P2...
From file linux/net/sched/sch_prio.c
* net/sched/sch_prio.c Simple 3-band priority "scheduler".
qdisc csz
Kernel Config: QoS support, QoS and/or fair queueing, CSZ packet scheduler
From file iproute2/tc/q_csz.c
NOT IMPLEMENTED
From file linux/net/sched/sch_csz.c
/* Clark-Shenker-Zhang algorithm.
======================================= SOURCE. David D. Clark, Scott Shenker and Lixia Zhang
"Supporting Real-Time Applications in an Integrated Services Packet
Network: Architecture and Mechanism". CBQ presents a flexible universal algorithm for packet scheduling,
but it has pretty poor delay characteristics.
Round-robin scheduling and link-sharing goals
apparently contradict minimization of network delay and jitter.
Moreover, correct handling of predictive flows seems to be
impossible in CBQ. CSZ presents a more precise but less flexible and less efficient
approach. As I understand it, the main idea is to create
WFQ flows for each guaranteed service and to allocate
the rest of bandwith to dummy flow-0. Flow-0 comprises
the predictive services and the best effort traffic;
it is handled by a priority scheduler with the highest
priority band allocated for predictive services, and the rest ---
to the best effort packets. Note that in CSZ flows are NOT limited to their bandwidth. It
is supposed that the flow passed admission control at the edge
of the QoS network and it doesn't need further shaping. Any
attempt to improve the flow or to shape it to a token bucket
at intermediate hops will introduce undesired delays and raise
jitter. At the moment CSZ is the only scheduler that provides
true guaranteed service. Another schemes (including CBQ)
do not provide guaranteed delay and randomize jitter.
There is a proof (Sally Floyd), that delay
can be estimated by a IntServ compliant formula.
This result is true formally, but it is wrong in principle.
It takes into account only round-robin delays,
ignoring delays introduced by link sharing i.e. overlimiting.
Note that temporary overlimits are inevitable because
real links are not ideal, and the real algorithm must take this
into account. ALGORITHM. --- Notations. $B$ is link bandwidth (bits/sec). $I$ is set of all flows, including flow $0$.
Every flow $a \in I$ has associated bandwidth slice $r_a < 1$ and
$\sum_{a \in I} r_a = 1$. --- Flow model. Let $m_a$ is the number of backlogged bits in flow $a$.
The flow is {\em active}, if $m_a > 0$.
This number is a discontinuous function of time;
when a packet $i$ arrives:
\[
m_a(t_i+0) - m_a(t_i-0) = L^i,
\]
where $L^i$ is the length of the arrived packet.
The flow queue is drained continuously until $m_a == 0$:
\[
{d m_a \over dt} = - { B r_a \over \sum_{b \in A} r_b}.
\]
I.e. flow rates are their allocated rates proportionally
scaled to take all available link bandwidth. Apparently,
it is not the only possible policy. F.e. CBQ classes
without borrowing would be modelled by:
\[
{d m_a \over dt} = - B r_a .
\]
More complicated hierarchical bandwidth allocation
policies are possible, but unfortunately, the basic
flow equations have a simple solution only for proportional
scaling. --- Departure times. We calculate the time until the last bit of packet is sent:
\[
E_a^i(t) = { m_a(t_i) - \delta_a(t) \over r_a },
\]
where $\delta_a(t)$ is number of bits drained since $t_i$.
We have to evaluate $E_a^i$ for all queued packets,
then find the packet with minimal $E_a^i$ and send it. This sounds good, but direct implementation of the algorithm
is absolutely infeasible. Luckily, if flow rates
are scaled proportionally, the equations have a simple solution. The differential equation for $E_a^i$ is
\[
{d E_a^i (t) \over dt } = - { d \delta_a(t) \over dt} { 1 \over r_a} =
{ B \over \sum_{b \in A} r_b}
\]
with initial condition
\[
E_a^i (t_i) = { m_a(t_i) \over r_a } .
\] Let's introduce an auxiliary function $R(t)$: --- Round number. Consider the following model: we rotate over active flows,
sending $r_a B$ bits from every flow, so that we send
$B \sum_{a \in A} r_a$ bits per round, that takes
$\sum_{a \in A} r_a$ seconds. Hence, $R(t)$ (round number) is a monotonically increasing
linear function of time when $A$ is not changed
\[
{ d R(t) \over dt } = { 1 \over \sum_{a \in A} r_a }
\]
and it is continuous when $A$ changes. The central observation is that the quantity
$F_a^i = R(t) + E_a^i(t)/B$ does not depend on time at all!
$R(t)$ does not depend on flow, so that $F_a^i$ can be
calculated only once on packet arrival, and we need not
recalculate $E$ numbers and resorting queues.
The number $F_a^i$ is called finish number of the packet.
It is just the value of $R(t)$ when the last bit of packet
is sent out. Maximal finish number on flow is called finish number of flow
and minimal one is "start number of flow".
Apparently, flow is active if and only if $F_a \leq R$. When a packet of length $L_i$ bit arrives to flow $a$ at time $t_i$,
we calculate $F_a^i$ as: If flow was inactive ($F_a < R$):
$F_a^i = R(t) + {L_i \over B r_a}$
otherwise
$F_a^i = F_a + {L_i \over B r_a}$ These equations complete the algorithm specification. It looks pretty hairy, but there is a simple
procedure for solving these equations.
See procedure csz_update(), that is a generalization of
the algorithm from S. Keshav's thesis Chapter 3
"Efficient Implementation of Fair Queeing". NOTES. * We implement only the simplest variant of CSZ,
when flow-0 is a explicit 4band priority fifo.
This is bad, but we need a "peek" operation in addition
to "dequeue" to implement complete CSZ.
I do not want to do that, unless it is absolutely
necessary. * A primitive support for token bucket filtering
presents itself too. It directly contradicts CSZ, but
even though the Internet is on the globe ... :-)
"the edges of the network" really exist. BUGS. * Fixed point arithmetic is overcomplicated, suboptimal and even
wrong. Check it later. */ /* This number is arbitrary */
class
Kernel Config: QoS support, QoS and/or fair queueing, Packet classifier API
From file iproute2/tc/tc_class.c
Usage: tc class [ add | del | change | get ] dev STRING
[ classid CLASSID ] [ root | parent CLASSID ]
[ [ QDISC_KIND ] [ help | OPTIONS ] ] tc class show [ dev STRING ] [ root | parent CLASSID ]
Where:
QDISC_KIND := { prio | cbq | etc. }
OPTIONS := ... try tc class add <desired QDISC_KIND> help
class cbq
From file iproute2/tc/q_cbq.c
Usage: ... cbq bandwidth BPS rate BPS maxburst PKTS [ avpkt BYTES ]
[ minburst PKTS ] [ bounded ] [ isolated ]
[ allot BYTES ] [ mpu BYTES ] [ weight RATE ]
[ prio NUMBER ] [ cell BYTES ] [ ewma LOG ]
[ estimator INTERVAL TIME_CONSTANT ]
[ split CLASSID ] [ defmap MASK/CHANGE ]
estimator
Kernel Config: QoS support, QoS and/or fair queueing, Rate estimator
From file iproute2/tc/m_estimator.c
Usage: ... estimator INTERVAL TIME-CONST
INTERVAL is interval between measurements
TIME-CONST is averaging time constant
Example: ... est 1sec 8sec\From file linux/net/sched/estimator.c
This code is NOT intended to be used for statistics collection,
its purpose is to provide a base for statistical multiplexing
for controlled load service.
If you need only statistics, run a user level daemon which
periodically reads byte counters. Unfortunately, rate estimation is not a very easy task.
F.e. I did not find a simple way to estimate the current peak rate
and even failed to formulate the problem 8)8) So I preferred not to built an estimator into the scheduler,
but run this task separately.
Ideally, it should be kernel thread(s), but for now it runs
from timers, which puts apparent top bounds on the number of rated
flows, has minimal overhead on small, but is enough
to handle controlled load service, sets of aggregates. We measure rate over A=(1<<interval) seconds and evaluate EWMA: avrate = avrate*(1-W) + rate*W where W is chosen as negative power of 2: W = 2^(-ewma_log) The resulting time constant is: T = A/(-ln(1-W)) NOTES. * The stored value for avbps is scaled by 2^5, so that maximal
rate is ~1Gbit, avpps is scaled by 2^10. * Minimal interval is HZ/4=250msec (it is the greatest common divisor
for HZ=100 and HZ=1024 8)), maximal interval
is (HZ/4)*2^EST_MAX_INTERVAL = 8sec. Shorter intervals
are too expensive, longer ones can be implemented
at user level painlessly.
filter
Kernel Config: QoS support, QoS and/or fair queueing, Packet classifier API
From file iproute2/tc/tc_filter.c
Usage: tc filter [ add | del | change | get ] dev STRING
[ pref PRIO ] [ protocol PROTO ]
[ estimator INTERVAL TIME_CONSTANT ]
[ root | classid CLASSID ] [ handle FILTERID ]
[ [ FILTER_TYPE ] [ help | OPTIONS ] ] tc filter show [ dev STRING ] [ root | parent CLASSID ]
Where:
FILTER_TYPE := { rsvp | u32 | fw | route | etc. }
FILTERID := ... format depends on classifier, see there
OPTIONS := ... try tc filter add <desired FILTER_KIND> help
filter rsvp
From file iproute2/tc/f_rsvp.c
Usage: ... rsvp ipproto PROTOCOL session DST[/PORT | GPI ]
[ sender SRC[/PORT | GPI ]
[ classid CLASSID ] [ police POLICE_SPEC ]
[ tunnelid ID ] [ tunnel ID skip NUMBER ]
Where: GPI := { flowlabel NUMBER | spi/ah SPI | spi/esp SPI |
u{8|16|32} NUMBER mask MASK at OFFSET}
POLICE_SPEC := ... look at TBF
FILTERID := X:YFrom file linux/net/sched/cls_rsvp.h
/*
Comparing to general packet classification problem,
RSVP needs only sevaral relatively simple rules: * (dst, protocol) are always specified,
so that we are able to hash them.
* src may be exact, or may be wildcard, so that
we can keep a hash table plus one wildcard entry.
* source port (or flow label) is important only if src is given. IMPLEMENTATION. We use a two level hash table: The top level is keyed by
destination address and protocol ID, every bucket contains a list
of "rsvp sessions", identified by destination address, protocol and
DPI(="Destination Port ID"): triple (key, mask, offset). Every bucket has a smaller hash table keyed by source address
(cf. RSVP flowspec) and one wildcard entry for wildcard reservations.
Every bucket is again a list of "RSVP flows", selected by
source address and SPI(="Source Port ID" here rather than
"security parameter index"): triple (key, mask, offset). NOTE 1. All the packets with IPv6 extension headers (but AH and ESP)
and all fragmented packets go to the best-effort traffic class. NOTE 2. Two "port id"'s seems to be redundant, rfc2207 requires
only one "Generalized Port Identifier". So that for classic
ah, esp (and udp,tcp) both *pi should coincide or one of them
should be wildcard. At first sight, this redundancy is just a waste of CPU
resources. But DPI and SPI add the possibility to assign different
priorities to GPIs. Look also at note 4 about tunnels below. NOTE 3. One complication is the case of tunneled packets.
We implement it as following: if the first lookup
matches a special session with "tunnelhdr" value not zero,
flowid doesn't contain the true flow ID, but the tunnel ID (1...255).
In this case, we pull tunnelhdr bytes and restart lookup
with tunnel ID added to the list of keys. Simple and stupid 8)8)
It's enough for PIMREG and IPIP. NOTE 4. Two GPIs make it possible to parse even GRE packets.
F.e. DPI can select ETH_P_IP (and necessary flags to make
tunnelhdr correct) in GRE protocol field and SPI matches
GRE key. Is it not nice? 8)8) Well, as result, despite its simplicity, we get a pretty
powerful classification engine. */
filter u32
From file iproute2/tc/f_u32.c
Usage: ... u32 [ match SELECTOR ... ] [ link HTID ] [ classid CLASSID ]
[ police POLICE_SPEC ] [ offset OFFSET_SPEC ]
[ ht HTID ] [ hashkey HASHKEY_SPEC ]
[ sample SAMPLE ]
or u32 divisor DIVISOR Where: SELECTOR := SAMPLE SAMPLE ...
SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} } SAMPLE_ARGS
FILTERID := X:Y:ZFrom file linux/net/sched/cls_u32.c
* The filters are packed to hash tables of key nodes
* with a set of 32bit key/mask pairs at every node.
* Nodes reference next level hash tables etc.
*
* This scheme is the best universal classifier I managed to
* invent; it is not super-fast, but it is not slow (provided you
* program it correctly), and general enough. And its relative
* speed grows as the number of rules becomes larger.
*
* It seems that it represents the best middle point between
* speed and manageability both by human and by machine.
*
* It is especially useful for link sharing combined with QoS;
* pure RSVP doesn't need such a general approach and can use
* much simpler (and faster) schemes, sort of cls_rsvp.c.Faster than ipchains, according to Werner Almesburger (as reported in Rusty's Diary, May 22 1999).
filter fw
From file iproute2/tc/f_fw.c
Usage: ... fw [ classid CLASSID ] [ police POLICE_SPEC ]
POLICE_SPEC := ... look at TBF
CLASSID := X:YFrom file linux/net/sched/cls_fw.c
* net/sched/cls_fw.c Classifier mapping ipchains' fwmark to traffic class.
filter route
From file iproute2/tc/f_route.c
Usage: ... route [ from REALM | fromif TAG ] [ to REALM ]
[ flowid CLASSID ] [ police POLICE_SPEC ]
POLICE_SPEC := ... look at TBF
CLASSID := X:Y\nFrom file linux/net/sched/cls_route.c
1. For now we assume that route tags < 256.
It allows to use direct table lookups, instead of hash tables.
2. For now we assume that "from TAG" and "fromdev DEV" statements
are mutually exclusive.
3. "to TAG from ANY" has higher priority, than "to ANY from XXX"
police
From file iproute2/tc/m_police.c
Usage: ... police rate BPS burst BYTES[/BYTES] [ mtu BYTES[/BYTES] ]
[ peakrate BPS ] [ avrate BPS ]
[ ACTION ]
Where: ACTION := reclassify | drop | continue
Copyright © 1999 Mark Lamb. Redistribution via any media permitted as long as this notice remains intact. http://snafu.freedom.org/