iproute2+tc notes

时间:2021-11-17 11:37:28

iproute2+tc notes

The iproute2+tc package allows access to the variety of neat new networking features in the 2.2 kernels. Policy routing, NAT, QoS, advanced tunnels, RSVP and Differentiated services, are just a few of the buzzword capabilities unleashed by the ip and tc programs.


Contents

  1. Introduction
  2. Distribution
    1. Building
    2. Kernel Configuration
  3. Documentation
    1. from the Distribution
    2. from the Kernel
    3. from everywhere else
  4. Command Syntax
    1. ip command
      1. ip link
      2. ip address
      3. ip neighbor
      4. ip rule
      5. ip route
      6. ip tunnel
      7. ip maddress
      8. ip mroute
      9. ip monitor
    2. rtmon command
    3. tc command
      1. qdisc
        1. qdisc [p|b]fifo
        2. qdisc tbf
        3. qdisc cbq
        4. qdisc red
        5. qdisc sfq
        6. qdisc prio
        7. qdisc csz
      2. class
        1. class cbq
      3. estimator
      4. filter
        1. filter rsvp
        2. filter u32
        3. filter fw
        4. filter route
      5. police

Introduction

This is the second version of this document; the first began as a simple command reference for my own use and evolved into a slightly more complex document, including some usage notes and a bunch of links to documents. That version became dated rather rapidly, and so I've redone it from fresh sources. Hopefully the pace of development has slowed somewhat (at least on the user interface side of things), and this version will remain useful a bit longer.

Suggestions, additions, sources, comments, etc to dragon@snafu.freedom.org. Specifically more detailed explanations of the commands listed in the Command Syntax section, for inclusion, would be extremely welcome.

All of the information pulled from linux/ files comes from linux-2.2.14-pre9. The iproute2/ files come from the iproute2-2.2.4-now-ss991023.tar.gz distribution.


Distribution

From the Documentation/Configure.help file of the linux distribution; "QoS and/or fair queueing" item:

  To administer these schedulers, you'll need the user-level utilities
from the package iproute2+tc at ftp://ftp.inr.ac.ru/ip-routing/

That directory is mirrored at:

  • ftp://linux.wauug.org/pub/net
  • ftp://ftp.nc.ras.ru/pub/mirrors/ftp.inr.ac.ru/ip-routing/
  • ftp://ftp.gts.cz/MIRRORS/ftp.inr.ac.ru/
  • ftp://ftp.funet.fi/pub/mirrors/ftp.inr.ac.ru/ip-routing/ (STM1 to USA)
  • ftp://sunsite.icm.edu.pl/pub/Linux/linux-iproute/
  • ftp://ftp.sunet.se/pub/Linux/ip-routing/

    It's probably worth picking up the latest version; if your distribution includes these tools at all they're likely to be a revision or two behind.

    Building

    Unpack the distribution tarball. Edit the Makefile; you need to tell it where to find the kernel include files with the KERNEL_INCLUDES variable. You might also need to tell it about libraries or something; I didn't have to. A simple "make all" compiles the programs ip/ipip/rtmon, and tc/tc. I didn't get but a couple of warning messages during the compile. Copy those programs to /sbin, and you're ready to play.

    Keep the unpacked distribution, it's got several examples and utility scripts which may well be useful, especially if you're going to be using the tc program to do QoS stuff. In that case, you may also need to edit a kernel header file. In the tc/ directory of the distribution there's a file named "README.last", that explains the possible values and meanings of PSCHED_CLOCK_SOURCE in the file linux/include/net/pkt_sched.h. The default "PSCHED_JIFFIES" there is probably what you want anyway.

    Kernel Configuration

    Most of the capabilities of the kernel accessed by the ip and tc programs probably aren't included in your distribution's stock kernel configuration. If you can't get a feature to work, make sure that all of the required kernel configuration values are set properly. I've tried to note what kernel configuration options are required or related in the Command Syntax section. I'm often guessing, so don't be suprised if I've missed something (please do tell me about it, though).


    Documentation

    If you can't find the answers to your questions in this document or in the documentation listed below, you might try searching through the linux-net mailing list and linux-kernel mailing list archives and/or asking the habitants of those lists.

    from the Distribution

    In the iproute2/ directory you'll find the README file, which contains build instructions; the RELNOTES file, and the README.iproute2+tc file, which discusses the tc Traffic Control tools. The README.decnet file contains some points relevant to DECnet. There's a README.last file in the iproute2/tc/ directory that discusses the clock source for the QoS code; it's a bit dated but it explains what the options are.

    In the iproute2/doc/ directory are several documents; the IP Command Reference (HTMLTeX, and Postscript); Tunnels over IP in Linux-2.2 (HTMLTeX, and Postscript); and IPv6 Flow Labels in Linux-2.2 (HTMLTeX, and Postscript).

    from the Kernel

    The kernel's Documentation/networking/ has a couple of files; the routing.txt file, and the policy-routing.txt file. These are slim and outdated, but they introduce a couple of important concepts. I've got a partial kernel Configure.help file, mostly networking options, that should be a little easier to read than via make config.

    from everywhere else

    Timur A. Bolokhov has started an "Advanced routing mini-HOWTO" (HTML,TeX and Postscript), which ain't bad at all. The date on those files from the original site (ftp://post.tepkom.ru/pub/vol2/Linux/docs/) was Sep 29 1998; they might be a little dated, but not so much so as the kernel docs.

    Saravanan Radhakrishan has put together a fine collection of links and documentation at the IITC QoS site, dealing with the implementation details of the QoS system and how to attack it from C via netlink. I haven't read it all yet but it should be helpful. Specifically see the Linux-Qos-HOWTO (local copy) and the Netlink HOWTO(local copy).

    The README.iproute2+tc file mentions "Terminology and advices about setting CBQ parameters may be found in Sally Floyd papers." Some of those can be found at her site. She 's also got pages listing resources for CBQ, and RED. Programmers looking to interface with the kernel side of the new networking capabilities may find some joy in Andi Kleen's new network man pages.

    Juergen Jaeschke pointed me to the Differentiated Services on Linux Internet Draft, which contains some good discussion of the iproute tc traffic control tools.

    Werner Almesberger has posted the slides he used for his presentations at the Singapore Linux conference '99 and the Linux Expo '99; these are PostScript files. I've got copies of both, Linux Network Traffic Control Implementation Overview slides from Linux Expo, and the QoS Networking with Linux slides from the Singapore conference. He also has an excellent Linux Traffic Control - Implementation Overview paper on his site (which I've copied).

    I've collected some mail list messages that discuss the usage of these tools. One of the mail list messages responding to a request for information on the traffic control toys said "see Cisco Docs"; the IOS v12 Configuration Guides and References and most specifically the Quality of Service Solutions Configuration Guide are likely to be helpful.

    Open Resource has an article on linux 2.2, which includes a some discussion of the new networking.


    Command Syntax


    ip command

    From file iproute2/ip/ip.c

    Usage: ip [ OPTIONS ] OBJECT { COMMAND | help }
    where OBJECT := { link | addr | route | rule | neigh | tunnel |
    maddr | mroute | monitor }
    OPTIONS := { -V[ersion] | -s[tatistics] | -r[esolve] |
    -f[amily] { inet | inet6 | dnet | link } | -o[neline] }

    See the IP Command Reference.

    If it's called as "ipaddr", "iproute", "iprule", "ipneigh", "iplink", "iptunnel", or "ipmonitor" it runs the appropriate subcommand. So you can create a link to "iproute", and use:

    # iproute (whatever)
    

    instead of:

    # ip route (whatever)
    

    ip link

    From file iproute2/ip/iplink.c

    Usage: ip link set DEVICE { up | down | arp { on | off } |
    dynamic { on | off } |
    multicast { on | off } | txqueuelen PACKETS |
    name NEWNAME |
    address LLADDR | broadcast LLADDR |
    mtu MTU }
    ip link show [ DEVICE ]

    See the IP Command Reference.


    ip address

    From file iproute2/ip/ipaddress.c

    Usage: ip addr {add|del} IFADDR dev STRING
    ip addr {show|flush} [ dev STRING ] [ scope SCOPE-ID ]
    [ to PREFIX ] [ FLAG-LIST ] [ label PATTERN ]
    IFADDR := PREFIX | ADDR peer PREFIX
    [ broadcast ADDR ] [ anycast ADDR ]
    [ label STRING ] [ scope SCOPE-ID ]
    SCOPE-ID := [ host | link | global | NUMBER ]
    FLAG-LIST := [ FLAG-LIST ] FLAG
    FLAG := [ permanent | dynamic | secondary | primary |
    tentative | deprecated ]

    See the IP Command Reference.


    ip neighbor

    From file iproute2/ip/ipneigh.c

    Usage: ip neigh { add | del | change | replace } { ADDR [ lladdr LLADDR ]
    [ nud { permanent | noarp | stale | reachable } ]
    | proxy ADDR } [ dev DEV ]
    ip neigh {show|flush} [ to PREFIX ] [ dev DEV ] [ nud STATE ]

    See the IP Command Reference.


    ip rule

    Kernel Config: IP: advanced routerIP: policy routingIP: use TOS value as routing keyIP: use FWMARK value as routing keyIP: fast network address translation

    From file iproute2/ip/iprule.c

    Usage: ip rule [ list | add | del ] SELECTOR ACTION
    SELECTOR := [ from PREFIX ] [ to PREFIX ] [ tos TOS ] [ fwmark FWMARK ]
    [ dev STRING ] [ pref NUMBER ]
    ACTION := [ table TABLE_ID ] [ nat ADDRESS ]
    [ prohibit | reject | unreachable ]
    [ realms [SRCREALM/]DSTREALM ]
    TABLE_ID := [ local | main | default | NUMBER ]

    See the IP Command Reference.


    ip route

    Kernel Config: IP: advanced routerIP: policy routingIP: equal cost multipathIP: large routing tablesIP: fast network address translation

    From file iproute2/ip/iproute.c

    Usage: ip route { list | flush } SELECTOR
    ip route get ADDRESS [ from ADDRESS iif STRING ]
    [ oif STRING ] [ tos TOS ]
    ip route { add | del | replace | change | append | replace | monitor } ROUTE
    SELECTOR := [ root PREFIX ] [ match PREFIX ] [ exact PREFIX ]
    [ table TABLE_ID ] [ proto RTPROTO ]
    [ type TYPE ] [ scope SCOPE ]
    ROUTE := NODE_SPEC [ INFO_SPEC ]
    NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ]
    [ table TABLE_ID ] [ proto RTPROTO ]
    [ scope SCOPE ] [ metric METRIC ]
    INFO_SPEC := NH OPTIONS FLAGS [ nexthop NH ]...
    NH := [ via ADDRESS ] [ dev STRING ] [ weight NUMBER ] NHFLAGS
    OPTIONS := FLAGS [ mtu NUMBER ] [ advmss NUMBER ]
    [ rtt NUMBER ] [ rttvar NUMBER ]
    [ window NUMBER] [ cwnd NUMBER ] [ ssthresh REALM ]
    [ realms REALM ]
    TYPE := [ unicast | local | broadcast | multicast | throw |
    unreachable | prohibit | blackhole | nat ]
    TABLE_ID := [ local | main | default | all | NUMBER ]
    SCOPE := [ host | link | global | NUMBER ]
    FLAGS := [ equalize ]
    NHFLAGS := [ onlink | pervasive ]
    RTPROTO := [ kernel | boot | static | NUMBER ]

    See the IP Command Reference.


    ip tunnel

    Kernel Config: IP: tunnelingIP: GRE tunnels over IPIP: broadcast GRE over IP

    From file iproute2/ip/iptunnel.c

    Usage: ip tunnel { add | change | del | show } [ NAME ]
    [ mode { ipip | gre | sit } ] [ remote ADDR ] [ local ADDR ]
    [ [i|o]seq ] [ [i|o]key KEY ] [ [i|o]csum ]
    [ ttl TTL ] [ tos TOS ] [ [no]pmtudisc ] [ dev PHYS_DEV ] Where: NAME := STRING
    ADDR := { IP_ADDRESS | any }
    TOS := { NUMBER | inherit }
    TTL := { 1..255 | inherit }
    KEY := { DOTTED_QUAD | NUMBER }

    See the IP Command Reference, also see Tunnels over IP in Linux-2.2 and my (old) tunnel notes.


    ip maddress

    From file iproute2/ip/ipmaddr.c

    Usage: ip maddr [ add | del ] MULTIADDR dev STRING
    ip maddr show [ dev STRING ]

    See the IP Command Reference.


    ip mroute

    From file iproute2/ip/ipmroute.c

    Usage: ip mroute show [ PREFIX ] [ from PREFIX ] [ iif DEVICE ]
    

    See the IP Command Reference.


    ip monitor

    Kernel Config: IP: verbose route monitoring

    From file iproute2/ip/ipmonitor.c

    Usage: ip monitor [ all | LISTofOBJECTS ]
    

    See the IP Command Reference.


    rtmon command

    Kernel Config: IP: verbose route monitoring

    From file iproute2/ip/rtmon.c

    Usage: rtmon file FILE [ all | LISTofOBJECTS]
    LISTofOBJECTS := [ link ] [ address ] [ route ]

    tc command

    Kernel Config: QoS supportQoS and/or fair queueing

    From file iproute2/tc/tc.c

    Usage: tc [ OPTIONS ] OBJECT { COMMAND | help }
    where OBJECT := { qdisc | class | filter }
    OPTIONS := { -s[tatistics] | -d[etails] | -r[aw] }

    Where it's expecting a number for BPS; it understands some suffixes: kbps (*1024), mbps (*1024*1024), kbit (*1024/8), and mbit (*1024*1024/8). If I'm reading the code correctly; "BPS" means Bytes Per Second; if you give a number without a suffix it assumes you want BITS per second (it divides the number you give it by 8). It also understands bps as a suffix.

    Where it's expecting a time value, it seems it understands suffixes of ssec, and secs for seconds, msmsec, and msecs for milliseconds, and ususec, and usecs for microseconds.

    Where it wants a size parameter, it assumes non-suffixed numbers to be specified in bytes. It also understands suffixes of k and kb to mean kilobytes (*1024), m and mb to mean megabytes (*1024*1024), kbit to mean kilobit (*1024/8), and mbit to mean megabits (*1024*1024/8).

    1Mbit == 128Kbps


    qdisc

    Kernel Config: QoS supportQoS and/or fair queueing

    From file iproute2/tc/tc_qdisc.c

    Usage: tc qdisc [ add | del | replace | change | get ] dev STRING
    [ handle QHANDLE ] [ root | ingress | parent CLASSID ]
    [ estimator INTERVAL TIME_CONSTANT ]
    [ [ QDISC_KIND ] [ help | OPTIONS ] ] tc qdisc show [ dev STRING ] [ingress]
    Where:
    QDISC_KIND := { [p|b]fifo | tbf | prio | cbq | red | etc. }
    OPTIONS := ... try tc qdisc add <desired QDISC_KIND> help

    From file linux/net/sched/sch_api.c

       Short review.
    ------------- This file consists of two interrelated parts: 1. queueing disciplines manager frontend.
    2. traffic classes manager frontend. Generally, queueing discipline ("qdisc") is a black box,
    which is able to enqueue packets and to dequeue them (when
    device is ready to send something) in order and at times
    determined by algorithm hidden in it. qdisc's are divided to two categories:
    - "queues", which have no internal structure visible from outside.
    - "schedulers", which split all the packets to "traffic classes",
    using "packet classifiers" (look at cls_api.c) In turn, classes may have child qdiscs (as rule, queues)
    attached to them etc. etc. etc. The goal of the routines in this file is to translate
    information supplied by user in the form of handles
    to more intelligible for kernel form, to make some sanity
    checks and part of work, which is common to all qdiscs
    and to provide rtnetlink notifications. All real intelligent work is done inside qdisc modules. Every discipline has two major routines: enqueue and dequeue. ---dequeue dequeue usually returns a skb to send. It is allowed to return NULL,
    but it does not mean that queue is empty, it just means that
    discipline does not want to send anything this time.
    Queue is really empty if q->q.qlen == 0.
    For complicated disciplines with multiple queues q->q is not
    real packet queue, but however q->q.qlen must be valid. ---enqueue enqueue returns number of enqueued packets i.e. this number is 1,
    if packet was enqueued successfully and <1 if something (not
    necessary THIS packet) was dropped. Auxiliary routines: ---requeue requeues once dequeued packet. It is used for non-standard or
    just buggy devices, which can defer output even if dev->tbusy=0. ---reset returns qdisc to initial state: purge all buffers, clear all
    timers, counters (except for statistics) etc. ---init initializes newly created qdisc. ---destroy destroys resources allocated by init and during lifetime of qdisc. ---change changes qdisc parameters.

    qdisc [p|b]fifo

    From file iproute2/tc/q_fifo.c

    Usage: ... [p|b]fifo [ limit NUMBER ]
    

    From file linux/net/sched/sch_fifo.c

    /* 1 band FIFO pseudo-"scheduler" */
    

    qdisc tbf

    Kernel Config: QoS supportQoS and/or fair queueingTBF queueRate estimator

    From file iproute2/tc/

    Usage: ... tbf limit BYTES burst BYTES[/BYTES] rate KBPS [ mtu BYTES[/BYTES] ]
    [ peakrate KBPS ] [ latency TIME ]

    From file linux/net/sched/sch_tbf.c

    /*	Simple Token Bucket Filter.
    ======================================= SOURCE.
    ------- None. Description.
    ------------ A data flow obeys TBF with rate R and depth B, if for any
    time interval t_i...t_f the number of transmitted bits
    does not exceed B + R*(t_f-t_i). Packetized version of this definition:
    The sequence of packets of sizes s_i served at moments t_i
    obeys TBF, if for any i<=k: s_i+....+s_k <= B + R*(t_k - t_i) Algorithm.
    ---------- Let N(t_i) be B/R initially and N(t) grow continuously with time as: N(t+delta) = min{B/R, N(t) + delta} If the first packet in queue has length S, it may be
    transmited only at the time t_* when S/R <= N(t_*),
    and in this case N(t) jumps: N(t_* + 0) = N(t_* - 0) - S/R. Actually, QoS requires two TBF to be applied to a data stream.
    One of them controls steady state burst size, another
    one with rate P (peak rate) and depth M (equal to link MTU)
    limits bursts at a smaller time scale. It is easy to see that P>R, and B>M. If P is infinity, this double
    TBF is equivalent to a single one. When TBF works in reshaping mode, latency is estimated as: lat = max ((L-B)/R, (L-M)/P) NOTES.
    ------ If TBF throttles, it starts a watchdog timer, which will wake it up
    when it is ready to transmit.
    Note that the minimal timer resolution is 1/HZ.
    If no new packets arrive during this period,
    or if the device is not awaken by EOI for some previous packet,
    TBF can stop its activity for 1/HZ. This means, that with depth B, the maximal rate is R_crit = B*HZ F.e. for 10Mbit ethernet and HZ=100 the minimal allowed B is ~10Kbytes. Note that the peak rate TBF is much more tough: with MTU 1500
    P_crit = 150Kbytes/sec. So, if you need greater peak
    rates, use alpha with HZ=1000 :-)
    */

    qdisc cbq

    Kernel Config: QoS supportQoS and/or fair queueingCBQ packet schedulerRate estimator

    From file iproute2/tc/q_cbq.c

    Usage: ... cbq bandwidth BPS avpkt BYTES [ mpu BYTES ]
    [ cell BYTES ] [ ewma LOG ]

    From file linux/net/sched/sch_cbq.c

    /*	Class-Based Queueing (CBQ) algorithm.
    ======================================= Sources: [1] Sally Floyd and Van Jacobson, "Link-sharing and Resource
    Management Models for Packet Networks",
    IEEE/ACM Transactions on Networking, Vol.3, No.4, 1995 [2] Sally Floyd, "Notes on CBQ and Guaranted Service", 1995 [3] Sally Floyd, "Notes on Class-Based Queueing: Setting
    Parameters", 1996 [4] Sally Floyd and Michael Speer, "Experimental Results
    for Class-Based Queueing", 1998, not published. ----------------------------------------------------------------------- Algorithm skeleton was taken from NS simulator cbq.cc.
    If someone wants to check this code against the LBL version,
    he should take into account that ONLY the skeleton was borrowed,
    the implementation is different. Particularly: --- The WRR algorithm is different. Our version looks more
    reasonable (I hope) and works when quanta are allowed to be
    less than MTU, which is always the case when real time classes
    have small rates. Note, that the statement of [3] is
    incomplete, delay may actually be estimated even if class
    per-round allotment is less than MTU. Namely, if per-round
    allotment is W*r_i, and r_1+...+r_k = r < 1 delay_i <= ([MTU/(W*r_i)]*W*r + W*r + k*MTU)/B In the worst case we have IntServ estimate with D = W*r+k*MTU
    and C = MTU*r. The proof (if correct at all) is trivial. --- It seems that cbq-2.0 is not very accurate. At least, I cannot
    interpret some places, which look like wrong translations
    from NS. Anyone is advised to find these differences
    and explain to me, why I am wrong 8). --- Linux has no EOI event, so that we cannot estimate true class
    idle time. Workaround is to consider the next dequeue event
    as sign that previous packet is finished. This is wrong because of
    internal device queueing, but on a permanently loaded link it is true.
    Moreover, combined with clock integrator, this scheme looks
    very close to an ideal solution. */

    tristate 'CBQ packet scheduler' CONFIG_NET_SCH_CBQ


    qdisc red

    Kernel Config: QoS supportQoS and/or fair queueingRED queue

    From file iproute2/tc/q_red.c

    Usage: ... red limit BYTES min BYTES max BYTES avpkt BYTES burst PACKETS
    probability PROBABILITY bandwidth KBPS

    From file linux/net/sched/sch_red.c

    /*	Random Early Detection (RED) algorithm.
    ======================================= Source: Sally Floyd and Van Jacobson, "Random Early Detection Gateways
    for Congestion Avoidance", 1993, IEEE/ACM Transactions on Networking. This file codes a "divisionless" version of RED algorithm
    as written down in Fig.17 of the paper. Short description.
    ------------------ When a new packet arrives we calculate the average queue length: avg = (1-W)*avg + W*current_queue_len, W is the filter time constant (choosen as 2^(-Wlog)), it controls
    the inertia of the algorithm. To allow larger bursts, W should be
    decreased. if (avg > th_max) -> packet marked (dropped).
    if (avg < th_min) -> packet passes.
    if (th_min < avg < th_max) we calculate probability: Pb = max_P * (avg - th_min)/(th_max-th_min) and mark (drop) packet with this probability.
    Pb changes from 0 (at avg==th_min) to max_P (avg==th_max).
    max_P should be small (not 1), usually 0.01..0.02 is good value. max_P is chosen as a number, so that max_P/(th_max-th_min)
    is a negative power of two in order arithmetics to contain
    only shifts. Parameters, settable by user:
    ----------------------------- limit - bytes (must be > qth_max + burst) Hard limit on queue length, should be chosen >qth_max
    to allow packet bursts. This parameter does not
    affect the algorithms behaviour and can be chosen
    arbitrarily high (well, less than ram size)
    Really, this limit will never be reached
    if RED works correctly. qth_min - bytes (should be < qth_max/2)
    qth_max - bytes (should be at least 2*qth_min and less limit)
    Wlog - bits (<32) log(1/W).
    Plog - bits (<32) Plog is related to max_P by formula: max_P = (qth_max-qth_min)/2^Plog; F.e. if qth_max=128K and qth_min=32K, then Plog=22
    corresponds to max_P=0.02 Scell_log
    Stab Lookup table for log((1-W)^(t/t_ave). NOTES: Upper bound on W.
    ----------------- If you want to allow bursts of L packets of size S,
    you should choose W: L + 1 - th_min/S < (1-(1-W)^L)/W th_min/S = 32 th_min/S = 4 log(W) L
    -1 33
    -2 35
    -3 39
    -4 46
    -5 57
    -6 75
    -7 101
    -8 135
    -9 190
    etc.
    */

    qdisc sfq

    From file iproute2/tc/q_sfq.c

    Usage: ... sfq [ perturb SECS ] [ quantum BYTES ]
    

    Kernel Config: QoS supportQoS and/or fair queueingSFQ queue

    From file linux/net/sched/

    /*	Stochastic Fairness Queuing algorithm.
    ======================================= Source:
    Paul E. McKenney "Stochastic Fairness Queuing",
    IEEE INFOCOMM'90 Proceedings, San Francisco, 1990. Paul E. McKenney "Stochastic Fairness Queuing",
    "Interworking: Research and Experience", v.2, 1991, p.113-131. See also:
    M. Shreedhar and George Varghese "Efficient Fair
    Queuing using Deficit Round Robin", Proc. SIGCOMM 95. This is not the thing that is usually called (W)FQ nowadays.
    It does not use any timestamp mechanism, but instead
    processes queues in round-robin order. ADVANTAGE: - It is very cheap. Both CPU and memory requirements are minimal. DRAWBACKS: - "Stochastic" -> It is not 100% fair.
    When hash collisions occur, several flows are considered as one. - "Round-robin" -> It introduces larger delays than virtual clock
    based schemes, and should not be used for isolating interactive
    traffic from non-interactive. It means, that this scheduler
    should be used as leaf of CBQ or P3, which put interactive traffic
    to higher priority band. We still need true WFQ for top level CSZ, but using WFQ
    for the best effort traffic is absolutely pointless:
    SFQ is superior for this purpose. IMPLEMENTATION:
    This implementation limits maximal queue length to 128;
    maximal mtu to 2^15-1; number of hash buckets to 1024.
    The only goal of this restrictions was that all data
    fit into one 4K page :-). Struct sfq_sched_data is
    organized in anti-cache manner: all the data for a bucket
    are scattered over different locations. This is not good,
    but it allowed me to put it into 4K. It is easy to increase these values, but not in flight. */

    qdisc prio

    Kernel Config: QoS supportQoS and/or fair queueingThe simplest PRIO pseudo scheduler

    From file iproute2/tc/q_prio.c

    Usage: ... prio bands NUMBER priomap P1 P2...
    

    From file linux/net/sched/sch_prio.c

     * net/sched/sch_prio.c	Simple 3-band priority "scheduler".
    

    qdisc csz

    Kernel Config: QoS supportQoS and/or fair queueingCSZ packet scheduler

    From file iproute2/tc/q_csz.c

    NOT IMPLEMENTED
    

    From file linux/net/sched/sch_csz.c

    /*	Clark-Shenker-Zhang algorithm.
    ======================================= SOURCE. David D. Clark, Scott Shenker and Lixia Zhang
    "Supporting Real-Time Applications in an Integrated Services Packet
    Network: Architecture and Mechanism". CBQ presents a flexible universal algorithm for packet scheduling,
    but it has pretty poor delay characteristics.
    Round-robin scheduling and link-sharing goals
    apparently contradict minimization of network delay and jitter.
    Moreover, correct handling of predictive flows seems to be
    impossible in CBQ. CSZ presents a more precise but less flexible and less efficient
    approach. As I understand it, the main idea is to create
    WFQ flows for each guaranteed service and to allocate
    the rest of bandwith to dummy flow-0. Flow-0 comprises
    the predictive services and the best effort traffic;
    it is handled by a priority scheduler with the highest
    priority band allocated for predictive services, and the rest ---
    to the best effort packets. Note that in CSZ flows are NOT limited to their bandwidth. It
    is supposed that the flow passed admission control at the edge
    of the QoS network and it doesn't need further shaping. Any
    attempt to improve the flow or to shape it to a token bucket
    at intermediate hops will introduce undesired delays and raise
    jitter. At the moment CSZ is the only scheduler that provides
    true guaranteed service. Another schemes (including CBQ)
    do not provide guaranteed delay and randomize jitter.
    There is a proof (Sally Floyd), that delay
    can be estimated by a IntServ compliant formula.
    This result is true formally, but it is wrong in principle.
    It takes into account only round-robin delays,
    ignoring delays introduced by link sharing i.e. overlimiting.
    Note that temporary overlimits are inevitable because
    real links are not ideal, and the real algorithm must take this
    into account. ALGORITHM. --- Notations. $B$ is link bandwidth (bits/sec). $I$ is set of all flows, including flow $0$.
    Every flow $a \in I$ has associated bandwidth slice $r_a < 1$ and
    $\sum_{a \in I} r_a = 1$. --- Flow model. Let $m_a$ is the number of backlogged bits in flow $a$.
    The flow is {\em active}, if $m_a > 0$.
    This number is a discontinuous function of time;
    when a packet $i$ arrives:
    \[
    m_a(t_i+0) - m_a(t_i-0) = L^i,
    \]
    where $L^i$ is the length of the arrived packet.
    The flow queue is drained continuously until $m_a == 0$:
    \[
    {d m_a \over dt} = - { B r_a \over \sum_{b \in A} r_b}.
    \]
    I.e. flow rates are their allocated rates proportionally
    scaled to take all available link bandwidth. Apparently,
    it is not the only possible policy. F.e. CBQ classes
    without borrowing would be modelled by:
    \[
    {d m_a \over dt} = - B r_a .
    \]
    More complicated hierarchical bandwidth allocation
    policies are possible, but unfortunately, the basic
    flow equations have a simple solution only for proportional
    scaling. --- Departure times. We calculate the time until the last bit of packet is sent:
    \[
    E_a^i(t) = { m_a(t_i) - \delta_a(t) \over r_a },
    \]
    where $\delta_a(t)$ is number of bits drained since $t_i$.
    We have to evaluate $E_a^i$ for all queued packets,
    then find the packet with minimal $E_a^i$ and send it. This sounds good, but direct implementation of the algorithm
    is absolutely infeasible. Luckily, if flow rates
    are scaled proportionally, the equations have a simple solution. The differential equation for $E_a^i$ is
    \[
    {d E_a^i (t) \over dt } = - { d \delta_a(t) \over dt} { 1 \over r_a} =
    { B \over \sum_{b \in A} r_b}
    \]
    with initial condition
    \[
    E_a^i (t_i) = { m_a(t_i) \over r_a } .
    \] Let's introduce an auxiliary function $R(t)$: --- Round number. Consider the following model: we rotate over active flows,
    sending $r_a B$ bits from every flow, so that we send
    $B \sum_{a \in A} r_a$ bits per round, that takes
    $\sum_{a \in A} r_a$ seconds. Hence, $R(t)$ (round number) is a monotonically increasing
    linear function of time when $A$ is not changed
    \[
    { d R(t) \over dt } = { 1 \over \sum_{a \in A} r_a }
    \]
    and it is continuous when $A$ changes. The central observation is that the quantity
    $F_a^i = R(t) + E_a^i(t)/B$ does not depend on time at all!
    $R(t)$ does not depend on flow, so that $F_a^i$ can be
    calculated only once on packet arrival, and we need not
    recalculate $E$ numbers and resorting queues.
    The number $F_a^i$ is called finish number of the packet.
    It is just the value of $R(t)$ when the last bit of packet
    is sent out. Maximal finish number on flow is called finish number of flow
    and minimal one is "start number of flow".
    Apparently, flow is active if and only if $F_a \leq R$. When a packet of length $L_i$ bit arrives to flow $a$ at time $t_i$,
    we calculate $F_a^i$ as: If flow was inactive ($F_a < R$):
    $F_a^i = R(t) + {L_i \over B r_a}$
    otherwise
    $F_a^i = F_a + {L_i \over B r_a}$ These equations complete the algorithm specification. It looks pretty hairy, but there is a simple
    procedure for solving these equations.
    See procedure csz_update(), that is a generalization of
    the algorithm from S. Keshav's thesis Chapter 3
    "Efficient Implementation of Fair Queeing". NOTES. * We implement only the simplest variant of CSZ,
    when flow-0 is a explicit 4band priority fifo.
    This is bad, but we need a "peek" operation in addition
    to "dequeue" to implement complete CSZ.
    I do not want to do that, unless it is absolutely
    necessary. * A primitive support for token bucket filtering
    presents itself too. It directly contradicts CSZ, but
    even though the Internet is on the globe ... :-)
    "the edges of the network" really exist. BUGS. * Fixed point arithmetic is overcomplicated, suboptimal and even
    wrong. Check it later. */ /* This number is arbitrary */

    class

    Kernel Config: QoS supportQoS and/or fair queueingPacket classifier API

    From file iproute2/tc/tc_class.c

    Usage: tc class [ add | del | change | get ] dev STRING
    [ classid CLASSID ] [ root | parent CLASSID ]
    [ [ QDISC_KIND ] [ help | OPTIONS ] ] tc class show [ dev STRING ] [ root | parent CLASSID ]
    Where:
    QDISC_KIND := { prio | cbq | etc. }
    OPTIONS := ... try tc class add <desired QDISC_KIND> help

    class cbq

    From file iproute2/tc/q_cbq.c

    Usage: ... cbq bandwidth BPS rate BPS maxburst PKTS [ avpkt BYTES ]
    [ minburst PKTS ] [ bounded ] [ isolated ]
    [ allot BYTES ] [ mpu BYTES ] [ weight RATE ]
    [ prio NUMBER ] [ cell BYTES ] [ ewma LOG ]
    [ estimator INTERVAL TIME_CONSTANT ]
    [ split CLASSID ] [ defmap MASK/CHANGE ]

    estimator

    Kernel Config: QoS supportQoS and/or fair queueingRate estimator

    From file iproute2/tc/m_estimator.c

    Usage: ... estimator INTERVAL TIME-CONST
    INTERVAL is interval between measurements
    TIME-CONST is averaging time constant
    Example: ... est 1sec 8sec\

    From file linux/net/sched/estimator.c

       This code is NOT intended to be used for statistics collection,
    its purpose is to provide a base for statistical multiplexing
    for controlled load service.
    If you need only statistics, run a user level daemon which
    periodically reads byte counters. Unfortunately, rate estimation is not a very easy task.
    F.e. I did not find a simple way to estimate the current peak rate
    and even failed to formulate the problem 8)8) So I preferred not to built an estimator into the scheduler,
    but run this task separately.
    Ideally, it should be kernel thread(s), but for now it runs
    from timers, which puts apparent top bounds on the number of rated
    flows, has minimal overhead on small, but is enough
    to handle controlled load service, sets of aggregates. We measure rate over A=(1<<interval) seconds and evaluate EWMA: avrate = avrate*(1-W) + rate*W where W is chosen as negative power of 2: W = 2^(-ewma_log) The resulting time constant is: T = A/(-ln(1-W)) NOTES. * The stored value for avbps is scaled by 2^5, so that maximal
    rate is ~1Gbit, avpps is scaled by 2^10. * Minimal interval is HZ/4=250msec (it is the greatest common divisor
    for HZ=100 and HZ=1024 8)), maximal interval
    is (HZ/4)*2^EST_MAX_INTERVAL = 8sec. Shorter intervals
    are too expensive, longer ones can be implemented
    at user level painlessly.

    filter

    Kernel Config: QoS supportQoS and/or fair queueingPacket classifier API

    From file iproute2/tc/tc_filter.c

    Usage: tc filter [ add | del | change | get ] dev STRING
    [ pref PRIO ] [ protocol PROTO ]
    [ estimator INTERVAL TIME_CONSTANT ]
    [ root | classid CLASSID ] [ handle FILTERID ]
    [ [ FILTER_TYPE ] [ help | OPTIONS ] ] tc filter show [ dev STRING ] [ root | parent CLASSID ]
    Where:
    FILTER_TYPE := { rsvp | u32 | fw | route | etc. }
    FILTERID := ... format depends on classifier, see there
    OPTIONS := ... try tc filter add <desired FILTER_KIND> help

    filter rsvp

    From file iproute2/tc/f_rsvp.c

    Usage: ... rsvp ipproto PROTOCOL session DST[/PORT | GPI ]
    [ sender SRC[/PORT | GPI ]
    [ classid CLASSID ] [ police POLICE_SPEC ]
    [ tunnelid ID ] [ tunnel ID skip NUMBER ]
    Where: GPI := { flowlabel NUMBER | spi/ah SPI | spi/esp SPI |
    u{8|16|32} NUMBER mask MASK at OFFSET}
    POLICE_SPEC := ... look at TBF
    FILTERID := X:Y

    From file linux/net/sched/cls_rsvp.h

    /*
    Comparing to general packet classification problem,
    RSVP needs only sevaral relatively simple rules: * (dst, protocol) are always specified,
    so that we are able to hash them.
    * src may be exact, or may be wildcard, so that
    we can keep a hash table plus one wildcard entry.
    * source port (or flow label) is important only if src is given. IMPLEMENTATION. We use a two level hash table: The top level is keyed by
    destination address and protocol ID, every bucket contains a list
    of "rsvp sessions", identified by destination address, protocol and
    DPI(="Destination Port ID"): triple (key, mask, offset). Every bucket has a smaller hash table keyed by source address
    (cf. RSVP flowspec) and one wildcard entry for wildcard reservations.
    Every bucket is again a list of "RSVP flows", selected by
    source address and SPI(="Source Port ID" here rather than
    "security parameter index"): triple (key, mask, offset). NOTE 1. All the packets with IPv6 extension headers (but AH and ESP)
    and all fragmented packets go to the best-effort traffic class. NOTE 2. Two "port id"'s seems to be redundant, rfc2207 requires
    only one "Generalized Port Identifier". So that for classic
    ah, esp (and udp,tcp) both *pi should coincide or one of them
    should be wildcard. At first sight, this redundancy is just a waste of CPU
    resources. But DPI and SPI add the possibility to assign different
    priorities to GPIs. Look also at note 4 about tunnels below. NOTE 3. One complication is the case of tunneled packets.
    We implement it as following: if the first lookup
    matches a special session with "tunnelhdr" value not zero,
    flowid doesn't contain the true flow ID, but the tunnel ID (1...255).
    In this case, we pull tunnelhdr bytes and restart lookup
    with tunnel ID added to the list of keys. Simple and stupid 8)8)
    It's enough for PIMREG and IPIP. NOTE 4. Two GPIs make it possible to parse even GRE packets.
    F.e. DPI can select ETH_P_IP (and necessary flags to make
    tunnelhdr correct) in GRE protocol field and SPI matches
    GRE key. Is it not nice? 8)8) Well, as result, despite its simplicity, we get a pretty
    powerful classification engine. */

    filter u32

    From file iproute2/tc/f_u32.c

    Usage: ... u32 [ match SELECTOR ... ] [ link HTID ] [ classid CLASSID ]
    [ police POLICE_SPEC ] [ offset OFFSET_SPEC ]
    [ ht HTID ] [ hashkey HASHKEY_SPEC ]
    [ sample SAMPLE ]
    or u32 divisor DIVISOR Where: SELECTOR := SAMPLE SAMPLE ...
    SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} } SAMPLE_ARGS
    FILTERID := X:Y:Z

    From file linux/net/sched/cls_u32.c

     *	The filters are packed to hash tables of key nodes
    * with a set of 32bit key/mask pairs at every node.
    * Nodes reference next level hash tables etc.
    *
    * This scheme is the best universal classifier I managed to
    * invent; it is not super-fast, but it is not slow (provided you
    * program it correctly), and general enough. And its relative
    * speed grows as the number of rules becomes larger.
    *
    * It seems that it represents the best middle point between
    * speed and manageability both by human and by machine.
    *
    * It is especially useful for link sharing combined with QoS;
    * pure RSVP doesn't need such a general approach and can use
    * much simpler (and faster) schemes, sort of cls_rsvp.c.

    Faster than ipchains, according to Werner Almesburger (as reported in Rusty's Diary, May 22 1999).


    filter fw

    From file iproute2/tc/f_fw.c

    Usage: ... fw [ classid CLASSID ] [ police POLICE_SPEC ]
    POLICE_SPEC := ... look at TBF
    CLASSID := X:Y

    From file linux/net/sched/cls_fw.c

     * net/sched/cls_fw.c	Classifier mapping ipchains' fwmark to traffic class.
    

    filter route

    From file iproute2/tc/f_route.c

    Usage: ... route [ from REALM | fromif TAG ] [ to REALM ]
    [ flowid CLASSID ] [ police POLICE_SPEC ]
    POLICE_SPEC := ... look at TBF
    CLASSID := X:Y\n

    From file linux/net/sched/cls_route.c

       1. For now we assume that route tags < 256.
    It allows to use direct table lookups, instead of hash tables.
    2. For now we assume that "from TAG" and "fromdev DEV" statements
    are mutually exclusive.
    3. "to TAG from ANY" has higher priority, than "to ANY from XXX"

    police

    From file iproute2/tc/m_police.c

    Usage: ... police rate BPS burst BYTES[/BYTES] [ mtu BYTES[/BYTES] ]
    [ peakrate BPS ] [ avrate BPS ]
    [ ACTION ]
    Where: ACTION := reclassify | drop | continue

    Copyright © 1999 Mark Lamb. Redistribution via any media permitted as long as this notice remains intact. http://snafu.freedom.org/