Linux Networking Subsystem: The Network Protocol Interface Layer (Part 1)

Date: 2023-03-09 17:04:26

An introduction to the network protocol interface layer used by Linux network device drivers.


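* When an upper layer such as ARP or IP needs to send a packet, it calls the `dev_queue_xmit()` function of the network protocol interface layer
and passes that function a pointer to a `struct sk_buff`.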

The prototype of dev_queue_xmit() is:
`int dev_queue_xmit(struct sk_buff *skb);`

* Likewise, received packets reach the upper layers by passing a pointer to a `struct sk_buff` to the `netif_rx()` function.

The prototype of netif_rx() is:
`int netif_rx(struct sk_buff *skb);`
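
To make the role of these two entry points concrete, here is a minimal userspace sketch. Everything in it is hypothetical (`toy_skb`, `toy_dev`, `toy_dev_queue_xmit`, `toy_netif_rx`, `toy_loopback_xmit`): it only mimics the calling shape of `dev_queue_xmit()` and `netif_rx()` and leaves out the kernel's queueing, locking, and softirq handling.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dev;

/* Hypothetical stand-in for struct sk_buff: a payload plus its device. */
struct toy_skb {
	struct toy_dev *dev;
	unsigned char data[64];
	size_t len;
};

/* Hypothetical stand-in for struct net_device with one transmit hook. */
struct toy_dev {
	int (*start_xmit)(struct toy_skb *skb, struct toy_dev *dev);
	unsigned long tx_packets;
};

/* One-slot "backlog" filled by the toy receive path. */
static struct toy_skb *toy_backlog;

/* Receive entry point: the driver hands a received buffer upward. */
static int toy_netif_rx(struct toy_skb *skb)
{
	if (toy_backlog)
		return 1;	/* backlog full: report a drop */
	toy_backlog = skb;
	return 0;		/* accepted */
}

/* Transmit entry point: upper layers hand a buffer to the device. */
static int toy_dev_queue_xmit(struct toy_skb *skb)
{
	if (!skb || !skb->dev || !skb->dev->start_xmit)
		return -1;
	return skb->dev->start_xmit(skb, skb->dev);
}

/* Loopback-style driver: whatever is transmitted is received again. */
static int toy_loopback_xmit(struct toy_skb *skb, struct toy_dev *dev)
{
	dev->tx_packets++;
	return toy_netif_rx(skb);
}
```

The protocol code only ever sees the two entry points, while the driver only supplies the transmit hook, which is the decoupling the interface layer is meant to provide.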

* The sk_buff structure

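The sk_buff structure is very important. It is defined in `include/linux/skbuff.h`, and its name stands for "socket buffer".
It carries data between the layers of the Linux networking subsystem and is the backbone of data transfer there. When sending data, the kernel's networking code must build an `sk_buff` that contains the data to transmit,
then hand that `sk_buff` to the layer below. Each layer adds its own protocol header to the sk_buff until it reaches the network device for transmission.
  • The structure's kernel-doc comment and definition follow.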
/**
* struct sk_buff - socket buffer
* @next: Next buffer in list
* @prev: Previous buffer in list
* @tstamp: Time we arrived/left
* @rbnode: RB tree node, alternative to next/prev for netem/tcp
* @sk: Socket we are owned by
* @dev: Device we arrived on/are leaving by
* @cb: Control buffer. Free for use by every layer. Put private vars here
* @_skb_refdst: destination entry (with norefcount bit)
* @sp: the security path, used for xfrm
* @len: Length of actual data
* @data_len: Data length
* @mac_len: Length of link layer header
* @hdr_len: writable header length of cloned skb
* @csum: Checksum (must include start/offset pair)
* @csum_start: Offset from skb->head where checksumming should start
* @csum_offset: Offset from csum_start where checksum should be stored
* @priority: Packet queueing priority
* @ignore_df: allow local fragmentation
* @cloned: Head may be cloned (check refcnt to be sure)
* @ip_summed: Driver fed us an IP checksum
* @nohdr: Payload reference only, must not modify header
* @nfctinfo: Relationship of this skb to the connection
* @pkt_type: Packet class
* @fclone: skbuff clone status
* @ipvs_property: skbuff is owned by ipvs
* @peeked: this packet has been seen already, so stats have been
* done for it, don't do them again
* @nf_trace: netfilter packet trace flag
* @protocol: Packet protocol from driver
* @destructor: Destruct function
* @nfct: Associated connection, if any
* @nf_bridge: Saved data about a bridged frame - see br_netfilter.c
* @skb_iif: ifindex of device we arrived on
* @tc_index: Traffic control index
* @tc_verd: traffic control verdict
* @hash: the packet hash
* @queue_mapping: Queue mapping for multiqueue devices
* @xmit_more: More SKBs are pending for this queue
* @ndisc_nodetype: router type (from link layer)
* @ooo_okay: allow the mapping of a socket to a queue to be changed
* @l4_hash: indicate hash is a canonical 4-tuple hash over transport
* ports.
* @sw_hash: indicates hash was computed in software stack
* @wifi_acked_valid: wifi_acked was set
* @wifi_acked: whether frame was acked on wifi or not
* @no_fcs: Request NIC to treat last 4 bytes as Ethernet FCS
* @napi_id: id of the NAPI struct this skb came from
* @secmark: security marking
* @offload_fwd_mark: fwding offload mark
* @mark: Generic packet mark
* @vlan_proto: vlan encapsulation protocol
* @vlan_tci: vlan tag control information
* @inner_protocol: Protocol (encapsulation)
* @inner_transport_header: Inner transport layer header (encapsulation)
* @inner_network_header: Network layer header (encapsulation)
* @inner_mac_header: Link layer header (encapsulation)
* @transport_header: Transport layer header
* @network_header: Network layer header
* @mac_header: Link layer header
* @tail: Tail pointer
* @end: End pointer
* @head: Head of buffer
* @data: Data head pointer
* @truesize: Buffer size
* @users: User count - see {datagram,tcp}.c
*/
/* Socket buffer */
struct sk_buff {
    union {
        struct {
            /* These two members must be first. */
            struct sk_buff *next;   // next buffer in the list
            struct sk_buff *prev;   // previous buffer in the list

            union {
                ktime_t tstamp;     // time we arrived/left
                struct skb_mstamp skb_mstamp;
            };
        };
        struct rb_node rbnode; /* used in netem & tcp stack */
    };
    struct sock *sk;            // socket we are owned by
    struct net_device *dev;     // device we arrived on/are leaving by

    /*
     * This is the control buffer. It is free to use for every
     * layer. Please put your private variables there. If you
     * want to keep them across layers you have to do a skb_clone()
     * first. This is owned by whoever has the skb queued ATM.
     */
    char cb[48] __aligned(8);

    unsigned long _skb_refdst;
    void (*destructor)(struct sk_buff *skb);    // destruct function
    struct sec_path *sp;
    struct nf_conntrack *nfct;          // associated connection, if any
    struct nf_bridge_info *nf_bridge;   // saved data about a bridged frame
    unsigned int len,       // length of actual data
                 data_len;  // data length
    __u16 mac_len,          // length of link layer header
          hdr_len;          // writable header length of cloned skb
    /* Following fields are _not_ copied in __copy_skb_header()
     * Note that queue_mapping is here mostly to fill a hole.
     */
    kmemcheck_bitfield_begin(flags1);
    __u16 queue_mapping;    // queue mapping for multiqueue devices
    __u8 cloned:1,          // head may be cloned (check refcnt to be sure)
         nohdr:1,           // payload reference only, must not modify header
         fclone:2,          // skbuff clone status
         peeked:1,          // packet has been seen already, so stats were done for it; don't do them again
         xmit_more:1;       // more SKBs are pending for this queue
    /* one bit hole */
    kmemcheck_bitfield_end(flags1);

    /* fields enclosed in headers_start/headers_end are copied
     * using a single memcpy() in __copy_skb_header()
     */
    /* private: */
    __u32 headers_start[0];
    /* public: */

/* if you move pkt_type around you also must adapt those constants */
#ifdef __BIG_ENDIAN_BITFIELD
#define PKT_TYPE_MAX (7 << 5)
#else
#define PKT_TYPE_MAX 7
#endif
#define PKT_TYPE_OFFSET() offsetof(struct sk_buff, __pkt_type_offset)

    __u8 __pkt_type_offset[0];
    __u8 pkt_type:3;            // packet class
    __u8 pfmemalloc:1;
    __u8 ignore_df:1;           // allow local fragmentation
    __u8 nfctinfo:3;            // relationship of this skb to the connection

    __u8 nf_trace:1;            // netfilter packet trace flag
    __u8 ip_summed:2;           // driver fed us an IP checksum
    __u8 ooo_okay:1;            // allow the mapping of a socket to a queue to be changed
    __u8 l4_hash:1;             // hash is a canonical 4-tuple hash over transport ports
    __u8 sw_hash:1;             // hash was computed in the software stack
    __u8 wifi_acked_valid:1;    // wifi_acked was set
    __u8 wifi_acked:1;          // whether the frame was acked on wifi or not

    __u8 no_fcs:1;              // request NIC to treat last 4 bytes as Ethernet FCS
    /* Indicates the inner headers are valid in the skbuff. */
    __u8 encapsulation:1;
    __u8 encap_hdr_csum:1;
    __u8 csum_valid:1;
    __u8 csum_complete_sw:1;
    __u8 csum_level:2;
    __u8 csum_bad:1;

#ifdef CONFIG_IPV6_NDISC_NODETYPE
    __u8 ndisc_nodetype:2;      // router type (from link layer)
#endif
    __u8 ipvs_property:1;       // skbuff is owned by ipvs
    __u8 inner_protocol_type:1;
    __u8 remcsum_offload:1;
    /* 3 or 5 bit hole */

#ifdef CONFIG_NET_SCHED
    __u16 tc_index; /* traffic control index */
    __u16 tc_verd;  /* traffic control verdict */
#endif

    union {
        __wsum csum;                // checksum (must include start/offset pair)
        struct {
            __u16 csum_start;       // offset from skb->head where checksumming should start
            __u16 csum_offset;      // offset from csum_start where the checksum should be stored
        };
    };
    __u32 priority;                 // packet queueing priority
    int skb_iif;                    // ifindex of the device we arrived on
    __u32 hash;                     // the packet hash
    __be16 vlan_proto;              // VLAN (virtual local area network) encapsulation protocol
    __u16 vlan_tci;                 // VLAN tag control information
#if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)
    union {
        unsigned int napi_id;       // id of the NAPI struct this skb came from
        unsigned int sender_cpu;
    };
#endif
    union {
        __u32 secmark;              // security marking
        __u32 offload_fwd_mark;     // forwarding offload mark
    };

    union {
        __u32 mark;                 // generic packet mark
        __u32 reserved_tailroom;
    };

    union {
        __be16 inner_protocol;      // protocol (encapsulation)
        __u8 inner_ipproto;
    };

    __u16 inner_transport_header;   // inner transport layer header (encapsulation)
    __u16 inner_network_header;
    __u16 inner_mac_header;         // inner link layer header (encapsulation)

    __be16 protocol;                // packet protocol from driver
    __u16 transport_header;
    __u16 network_header;           // network layer header
    __u16 mac_header;               // link layer header

    /* private: */
    __u32 headers_end[0];
    /* public: */

    /* These elements must be at the end, see alloc_skb() for details. */
    sk_buff_data_t tail;    // tail pointer
    sk_buff_data_t end;     // end pointer
    unsigned char *head,    // head of buffer
                  *data;    // data head pointer
    unsigned int truesize;  // buffer size
    atomic_t users;         // user count
};
  • The sk_buff structure above is annotated with brief comments.
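
The four pointers at the end of the structure (head, data, tail, end) are the ones drivers deal with most. The following userspace sketch models only those pointers. The type `toy_skb` and the helpers `toy_alloc`, `toy_free`, `toy_headroom`, `toy_tailroom`, and `toy_reserve` are hypothetical names that imitate the idea behind the kernel's `skb_headroom()`, `skb_tailroom()`, and `skb_reserve()`; it assumes plain pointers, whereas the kernel may store tail and end as offsets.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace model of the four buffer pointers in sk_buff. */
struct toy_skb {
	unsigned char *head;	/* start of the allocated buffer */
	unsigned char *data;	/* start of the valid data */
	unsigned char *tail;	/* end of the valid data */
	unsigned char *end;	/* end of the allocated buffer */
	unsigned int len;	/* bytes of valid data: tail - data */
};

/* Allocate a buffer of `size` bytes; initially it holds no data. */
static struct toy_skb *toy_alloc(unsigned int size)
{
	struct toy_skb *skb = malloc(sizeof(*skb));

	if (!skb)
		return NULL;
	skb->head = malloc(size);
	if (!skb->head) {
		free(skb);
		return NULL;
	}
	skb->data = skb->tail = skb->head;
	skb->end = skb->head + size;
	skb->len = 0;
	return skb;
}

static void toy_free(struct toy_skb *skb)
{
	if (skb) {
		free(skb->head);
		free(skb);
	}
}

/* Free space in front of the data, where protocol headers are prepended. */
static unsigned int toy_headroom(const struct toy_skb *skb)
{
	return (unsigned int)(skb->data - skb->head);
}

/* Free space behind the data, where payload is appended. */
static unsigned int toy_tailroom(const struct toy_skb *skb)
{
	return (unsigned int)(skb->end - skb->tail);
}

/* Reserve headroom on an empty buffer by moving data and tail forward. */
static void toy_reserve(struct toy_skb *skb, unsigned int len)
{
	assert(skb->len == 0 && len <= toy_tailroom(skb));
	skb->data += len;
	skb->tail += len;
}
```

Reserving headroom before any data is written is what lets each lower layer prepend its header later without copying the payload.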


    • Allocation

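      The Linux kernel provides the following functions for allocating socket buffers:
static inline struct sk_buff *alloc_skb(unsigned int size,
                                        gfp_t priority)
{
    return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
}
   alloc_skb() allocates a socket buffer (the sk_buff itself) and a data buffer. The size parameter is the size of the data buffer,
which is usually aligned to L1_CACHE_BYTES bytes; the priority parameter is the memory allocation priority.
static inline struct sk_buff *dev_alloc_skb(unsigned int length)
{
    return netdev_alloc_skb(NULL, length);
}
    dev_alloc_skb() allocates the skb with GFP_ATOMIC priority, because it is often called from a device driver's receive interrupt handler, where sleeping is not allowed.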
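
The cache-line alignment of the data buffer size mentioned above can be sketched as a round-up to a power-of-two boundary. This is a userspace illustration only: `TOY_CACHE_BYTES`, `TOY_ALIGN`, and `toy_data_align` are hypothetical names, and the 64-byte line size is an assumption, since the real value depends on the architecture.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache line size for illustration; the real value is per-architecture. */
#define TOY_CACHE_BYTES 64u

/* Round x up to the next multiple of a, where a is a power of two. */
#define TOY_ALIGN(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/* Size the data buffer would occupy after cache-line alignment. */
static size_t toy_data_align(size_t size)
{
	return TOY_ALIGN(size, TOY_CACHE_BYTES);
}
```

Under these assumptions a request for 1500 bytes is rounded up to 1536, so the buffer actually allocated can be larger than the size passed in.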

* Freeing

The Linux kernel provides the following functions for freeing socket buffers:
// kfree_skb() is used inside the Linux kernel itself
void kfree_skb(struct sk_buff *skb);
// Drivers should preferably use dev_kfree_skb()
#define dev_kfree_skb(a) consume_skb(a)
// Free a dropped buffer from interrupt context
static inline void dev_kfree_skb_irq(struct sk_buff *skb)
{
    __dev_kfree_skb_irq(skb, SKB_REASON_DROPPED);
}
// Release a successfully processed (consumed) buffer from interrupt context
static inline void dev_consume_skb_irq(struct sk_buff *skb)
{
    __dev_kfree_skb_irq(skb, SKB_REASON_CONSUMED);
}
// Free a dropped buffer from either interrupt or non-interrupt context
static inline void dev_kfree_skb_any(struct sk_buff *skb)
{
    __dev_kfree_skb_any(skb, SKB_REASON_DROPPED);
}
// Release a consumed buffer from either interrupt or non-interrupt context
static inline void dev_consume_skb_any(struct sk_buff *skb)
{
    __dev_kfree_skb_any(skb, SKB_REASON_CONSUMED);
}
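
The difference between the "free" and "consume" variants is easier to see in a small model. The sketch below is a hypothetical userspace illustration (`toy_skb`, `toy_get`, `toy_kfree`, `toy_consume`, and the counters are invented names). It assumes the buffer is released only when its last user lets go, and that the two variants differ only in whether the release is accounted as a drop. It uses a plain int where the kernel uses an atomic counter.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: a buffer with a user count, like sk_buff's `users`. */
struct toy_skb {
	int users;
};

/* Counters standing in for drop/consume accounting. */
static unsigned int toy_dropped, toy_consumed, toy_freed;

static struct toy_skb *toy_alloc(void)
{
	struct toy_skb *skb = malloc(sizeof(*skb));

	if (skb)
		skb->users = 1;	/* the allocator holds the first reference */
	return skb;
}

/* Take an extra reference on the buffer. */
static struct toy_skb *toy_get(struct toy_skb *skb)
{
	skb->users++;
	return skb;
}

/* Drop one reference; free the buffer only when the last user lets go. */
static int toy_put_ref(struct toy_skb *skb)
{
	if (!skb || --skb->users > 0)
		return 0;
	free(skb);
	toy_freed++;
	return 1;
}

/* Release a buffer that was discarded, for example on an error path. */
static void toy_kfree(struct toy_skb *skb)
{
	if (toy_put_ref(skb))
		toy_dropped++;
}

/* Release a buffer that was processed successfully. */
static void toy_consume(struct toy_skb *skb)
{
	if (toy_put_ref(skb))
		toy_consumed++;
}
```

Keeping the two paths separate means drop statistics reflect only packets that were really lost, not every buffer that reached the end of its life.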

* Modifying

In the Linux kernel, the following functions add data to an sk_buff or remove data from it:
/* Add data to an sk_buff */
unsigned char *pskb_put(struct sk_buff *skb, struct sk_buff *tail, int len);
unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
/* Append len bytes at the tail; returns the old tail, where the new data starts */
static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
{
    unsigned char *tmp = skb_tail_pointer(skb);
    skb->tail += len;
    skb->len += len;
    return tmp;
}

unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
/* Prepend len bytes in front of the data; returns the new start of the data */
static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
{
    skb->data -= len;
    skb->len += len;
    return skb->data;
}

unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
/* Remove len bytes from the front of the data; returns the new start of the data */
static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
{
    skb->len -= len;
    BUG_ON(skb->len < skb->data_len);
    return skb->data += len;
}
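
As a worked illustration of the three operations, the following userspace sketch builds a packet from the payload outward and then strips the headers again, the way a receive path would. The type `toy_skb` and the functions `toy_init`, `toy_put`, `toy_push`, and `toy_pull` are hypothetical stand-ins that follow the same pointer arithmetic as `__skb_put()`, `__skb_push()`, and `__skb_pull()`; a fixed array replaces the kernel's allocation, and `assert()` stands in for the kernel's bounds checks.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical userspace model; a fixed array replaces the kernel's allocation. */
struct toy_skb {
	unsigned char buf[128];
	unsigned char *head, *data, *tail, *end;
	unsigned int len;
};

/* Start with an empty buffer and `headroom` bytes reserved for headers. */
static void toy_init(struct toy_skb *skb, unsigned int headroom)
{
	assert(headroom <= sizeof(skb->buf));
	skb->head = skb->buf;
	skb->end = skb->buf + sizeof(skb->buf);
	skb->data = skb->tail = skb->head + headroom;
	skb->len = 0;
}

/* Append len bytes at the tail; returns where the new bytes start. */
static unsigned char *toy_put(struct toy_skb *skb, unsigned int len)
{
	unsigned char *old_tail = skb->tail;

	assert(len <= (unsigned int)(skb->end - skb->tail));
	skb->tail += len;
	skb->len += len;
	return old_tail;
}

/* Prepend len bytes in the headroom; returns the new start of data. */
static unsigned char *toy_push(struct toy_skb *skb, unsigned int len)
{
	assert(len <= (unsigned int)(skb->data - skb->head));
	skb->data -= len;
	skb->len += len;
	return skb->data;
}

/* Strip len bytes from the front; returns the new start of data. */
static unsigned char *toy_pull(struct toy_skb *skb, unsigned int len)
{
	assert(len <= skb->len);
	skb->len -= len;
	skb->data += len;
	return skb->data;
}
```

A sender would call `toy_put()` once for the payload and `toy_push()` once per header on the way down, while a receiver calls `toy_pull()` once per header on the way up. None of these calls copies the payload; they only move pointers.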