Understanding Linux Network Internals reading notes (8): device registration and initialization

Date: 2023-03-09 07:12:43

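When devices are registered

A network device is registered in the following situations:

  • Loading a NIC device driver
    • If the NIC device driver is built into the kernel, the device is initialized at boot time.
    • If the driver is loaded as a module, the device is initialized at run time.
  • Inserting a hot-pluggable network device
    When the user inserts a hot-pluggable NIC, the kernel notifies its driver, and the driver then registers the device.

Loading a PCI driver causes the pci_driver->probe function to run. This function is supplied by the driver and is responsible for registering the device.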

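When devices are unregistered

  • Unloading a NIC device driver
    Unloading a PCI device driver causes the pci_driver->remove function to run, usually named xxx_remove_one. This function is responsible for unregistering the device.
  • Removing a hot-pluggable device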

Allocating the net_device structure

/**
 * alloc_netdev_mq - allocate network device
 * @sizeof_priv: size of private data to allocate space for
 * @name: device name format string
 * @setup: callback to initialize device
 * @queue_count: the number of subqueues to allocate
 *
 * Allocates a struct net_device with private data area for driver use
 * and performs basic initialization. Also allocates subquue structs
 * for each queue on the device at the end of the netdevice.
 */
struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
		void (*setup)(struct net_device *), unsigned int queue_count)
{
	struct netdev_queue *tx;
	struct net_device *dev;
	size_t alloc_size;
	struct net_device *p;
#ifdef CONFIG_RPS
	struct netdev_rx_queue *rx;
	int i;
#endif

	BUG_ON(strlen(name) >= sizeof(dev->name));

	alloc_size = sizeof(struct net_device);
	if (sizeof_priv) {
		/* ensure 32-byte alignment of private area */
		alloc_size = ALIGN(alloc_size, NETDEV_ALIGN);
		alloc_size += sizeof_priv;
	}
	/* ensure 32-byte alignment of whole construct */
	alloc_size += NETDEV_ALIGN - 1;

	p = kzalloc(alloc_size, GFP_KERNEL);
	if (!p) {
		printk(KERN_ERR "alloc_netdev: Unable to allocate device.\n");
		return NULL;
	}

	tx = kcalloc(queue_count, sizeof(struct netdev_queue), GFP_KERNEL);
	if (!tx) {
		printk(KERN_ERR "alloc_netdev: Unable to allocate "
		       "tx qdiscs.\n");
		goto free_p;
	}
#ifdef CONFIG_RPS
	rx = kcalloc(queue_count, sizeof(struct netdev_rx_queue), GFP_KERNEL);
	if (!rx) {
		printk(KERN_ERR "alloc_netdev: Unable to allocate "
		       "rx queues.\n");
		goto free_tx;
	}

	atomic_set(&rx->count, queue_count);

	/*
	 * Set a pointer to first element in the array which holds the
	 * reference count.
	 */
	for (i = 0; i < queue_count; i++)
		rx[i].first = rx;
#endif

	dev = PTR_ALIGN(p, NETDEV_ALIGN);
	dev->padded = (char *)dev - (char *)p;

	if (dev_addr_init(dev))
		goto free_rx;

	dev_mc_init(dev);
	dev_uc_init(dev);

	dev_net_set(dev, &init_net);

	dev->_tx = tx;
	dev->num_tx_queues = queue_count;
	dev->real_num_tx_queues = queue_count;

#ifdef CONFIG_RPS
	dev->_rx = rx;
	dev->num_rx_queues = queue_count;
#endif

	dev->gso_max_size = GSO_MAX_SIZE;

	netdev_init_queues(dev);

	INIT_LIST_HEAD(&dev->ethtool_ntuple_list.list);
	dev->ethtool_ntuple_list.count = 0;
	INIT_LIST_HEAD(&dev->napi_list);
	INIT_LIST_HEAD(&dev->unreg_list);
	INIT_LIST_HEAD(&dev->link_watch_list);
	dev->priv_flags = IFF_XMIT_DST_RELEASE;
	setup(dev);
	strcpy(dev->name, name);
	return dev;

free_rx:
#ifdef CONFIG_RPS
	kfree(rx);
free_tx:
#endif
	kfree(tx);
free_p:
	kfree(p);
	return NULL;
}
EXPORT_SYMBOL(alloc_netdev_mq);

[Note] net/core/dev.c
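
The size and alignment arithmetic in alloc_netdev_mq is easy to misread, so here is a minimal user-space sketch of it. It assumes NETDEV_ALIGN is 32, as the comments in the listing suggest; the helper names align_up, netdev_alloc_size, and netdev_padding are made up for illustration and are not kernel functions.

```c
#include <stddef.h>
#include <stdint.h>

#define NETDEV_ALIGN 32 /* assumed value, matching the "32-byte" comments */

/* Round x up to the next multiple of a (a must be a power of two). */
size_t align_up(size_t x, size_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Mirror the size computation: the structure itself, padding so that the
 * private area starts on a 32-byte boundary, the private area, and
 * NETDEV_ALIGN - 1 spare bytes so the start address can be aligned too.
 */
size_t netdev_alloc_size(size_t struct_size, size_t sizeof_priv)
{
	size_t alloc_size = struct_size;

	if (sizeof_priv) {
		alloc_size = align_up(alloc_size, NETDEV_ALIGN);
		alloc_size += sizeof_priv;
	}
	return alloc_size + NETDEV_ALIGN - 1;
}

/* Distance from the raw allocation p to the aligned device (dev->padded). */
size_t netdev_padding(uintptr_t p)
{
	uintptr_t aligned = (p + NETDEV_ALIGN - 1) & ~(uintptr_t)(NETDEV_ALIGN - 1);

	return (size_t)(aligned - p);
}
```

For a 100-byte structure with a 40-byte private area, the sketch asks for 128 + 40 + 31 bytes. The spare 31 bytes are what allows PTR_ALIGN to move the start of the device forward without running past the end of the allocation.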

Skeleton of NIC registration and unregistration

Registration

xxx_probe/module_init
    dev=alloc_etherdev(sizeof(driver_private_structure))
        alloc_netdev(sizeof(private), "eth%d", ether_setup)
            dev=kmalloc(sizeof(net_device)+sizeof(private)+padding)
            ether_setup(dev)
            strcpy(dev->name, "eth%d")
            return (dev)
    ......
    netdev_boot_setup_check(dev)
    ......
    register_netdev(dev)
        register_netdevice(dev)

Unregistration

xxx_remove_one/module_exit
    unregister_netdev(dev)
        unregister_netdevice(dev)
    ......
    free_netdev(dev)

Device initialization

Table 8-2. net_device function pointers initialized by xxx_setup and xxx_probe

Initializer                       Function pointer names
xxx_setup                         change_mtu, set_mac_address, rebuild_header, hard_header_cache,
                                  header_cache_update, hard_header_parse
Device driver's probe function    open, stop, hard_start_xmit, tx_timeout, watchdog_timeo, get_stats,
                                  get_wireless_stats, set_multicast_list, do_ioctl, init, uninit,
                                  poll, ethtool_ops

Table 8-3. net_device fields initialized by xxx_setup and xxx_probe

Initializer                       Field names
xxx_setup                         type, hard_header_len, mtu, addr_len, tx_queue_len, broadcast, flags
Device driver's probe function    base_addr, irq, if_port, priv, features

Device type initialization: the xxx_setup functions

const struct header_ops eth_header_ops ____cacheline_aligned = {
	.create		= eth_header,
	.parse		= eth_header_parse,
	.rebuild	= eth_rebuild_header,
	.cache		= eth_header_cache,
	.cache_update	= eth_header_cache_update,
};

/**
 * ether_setup - setup Ethernet network device
 * @dev: network device
 * Fill in the fields of the device structure with Ethernet-generic values.
 */
void ether_setup(struct net_device *dev)
{
	dev->header_ops		= &eth_header_ops;
	dev->type		= ARPHRD_ETHER;
	dev->hard_header_len	= ETH_HLEN;
	dev->mtu		= ETH_DATA_LEN;
	dev->addr_len		= ETH_ALEN;
	dev->tx_queue_len	= 1000;	/* Ethernet wants good queues */
	dev->flags		= IFF_BROADCAST|IFF_MULTICAST;

	memset(dev->broadcast, 0xFF, ETH_ALEN);

}
EXPORT_SYMBOL(ether_setup);

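Organization of net_device structures

Each net_device structure is inserted into one global list and two hash tables. These different structures let the kernel browse or look up the net_device database as needed.

  • dev_base
  • dev_name_head
    A hash table indexed by device name.
  • dev_index_head
    A hash table indexed by the device ID, dev->ifindex.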
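
To make the three structures concrete, here is a minimal user-space sketch in which one device is linked into a global list and two hash tables at the same time. Everything in it is hypothetical (struct toy_netdev, the toy_* functions, the table size, and the string hash); the kernel uses its own list and hash primitives.

```c
#include <stddef.h>
#include <string.h>

#define TOY_HASHENTRIES 256 /* table size chosen for this sketch */

struct toy_netdev {
	char name[16];
	int ifindex;
	struct toy_netdev *next_all;   /* global list (the dev_base role) */
	struct toy_netdev *next_name;  /* chain in the name hash table */
	struct toy_netdev *next_index; /* chain in the ifindex hash table */
};

struct toy_netdev *toy_dev_base;
struct toy_netdev *toy_name_head[TOY_HASHENTRIES];
struct toy_netdev *toy_index_head[TOY_HASHENTRIES];

/* Simple string hash, used here only to pick a bucket. */
unsigned int toy_name_hash(const char *name)
{
	unsigned int h = 0;

	while (*name)
		h = h * 31 + (unsigned char)*name++;
	return h & (TOY_HASHENTRIES - 1);
}

/* Link the device into the global list and both hash tables. */
void toy_list_netdevice(struct toy_netdev *dev)
{
	unsigned int n = toy_name_hash(dev->name);
	unsigned int i = (unsigned int)dev->ifindex & (TOY_HASHENTRIES - 1);

	dev->next_all = toy_dev_base;
	toy_dev_base = dev;
	dev->next_name = toy_name_head[n];
	toy_name_head[n] = dev;
	dev->next_index = toy_index_head[i];
	toy_index_head[i] = dev;
}

/* Look up by name: hash to a bucket, then compare names in the chain. */
struct toy_netdev *toy_get_by_name(const char *name)
{
	struct toy_netdev *d;

	for (d = toy_name_head[toy_name_hash(name)]; d; d = d->next_name)
		if (strcmp(d->name, name) == 0)
			return d;
	return NULL;
}

/* Look up by ifindex: same idea with the second table. */
struct toy_netdev *toy_get_by_index(int ifindex)
{
	struct toy_netdev *d;
	unsigned int i = (unsigned int)ifindex & (TOY_HASHENTRIES - 1);

	for (d = toy_index_head[i]; d; d = d->next_index)
		if (d->ifindex == ifindex)
			return d;
	return NULL;
}
```

The point of keeping three links per device is that each kind of query (walk everything, find by name, find by index) touches only the structure suited to it.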

Lookups

  • Lookup by name: dev_get_by_name()

    /**
     * dev_get_by_name - find a device by its name
     * @net: the applicable net namespace
     * @name: name to find
     *
     * Find an interface by name. This can be called from any
     * context and does its own locking. The returned handle has
     * the usage count incremented and the caller must use dev_put() to
     * release it when it is no longer needed. %NULL is returned if no
     * matching device is found.
     */

    struct net_device *dev_get_by_name(struct net *net, const char *name)
    {
    	struct net_device *dev;

    	rcu_read_lock();
    	dev = dev_get_by_name_rcu(net, name);
    	if (dev)
    		dev_hold(dev);
    	rcu_read_unlock();
    	return dev;
    }
    EXPORT_SYMBOL(dev_get_by_name);
  • Lookup by index: dev_get_by_index()

    /**
     * dev_get_by_index - find a device by its ifindex
     * @net: the applicable net namespace
     * @ifindex: index of device
     *
     * Search for an interface by index. Returns NULL if the device
     * is not found or a pointer to the device. The device returned has
     * had a reference added and the pointer is safe until the user calls
     * dev_put to indicate they have finished with it.
     */
    struct net_device *dev_get_by_index(struct net *net, int ifindex)
    {
    	struct net_device *dev;

    	rcu_read_lock();
    	dev = dev_get_by_index_rcu(net, ifindex);
    	if (dev)
    		dev_hold(dev);
    	rcu_read_unlock();
    	return dev;
    }
    EXPORT_SYMBOL(dev_get_by_index);
  • Lookup by MAC address: dev_getbyhwaddr()

    /**
     * dev_getbyhwaddr - find a device by its hardware address
     * @net: the applicable net namespace
     * @type: media type of device
     * @ha: hardware address
     *
     * Search for an interface by MAC address. Returns NULL if the device
     * is not found or a pointer to the device. The caller must hold the
     * rtnl semaphore. The returned device has not had its ref count increased
     * and the caller must therefore be careful about locking
     *
     * BUGS:
     * If the API was consistent this would be __dev_get_by_hwaddr
     */

    struct net_device *dev_getbyhwaddr(struct net *net, unsigned short type, char *ha)
    {
    	struct net_device *dev;

    	ASSERT_RTNL();

    	for_each_netdev(net, dev)
    		if (dev->type == type &&
    		    !memcmp(dev->dev_addr, ha, dev->addr_len))
    			return dev;

    	return NULL;
    }
    EXPORT_SYMBOL(dev_getbyhwaddr);
  • Lookup by device flags: dev_get_by_flags()

    /**
     * dev_get_by_flags - find any device with given flags
     * @net: the applicable net namespace
     * @if_flags: IFF_* values
     * @mask: bitmask of bits in if_flags to check
     *
     * Search for any interface with the given flags. Returns NULL if a device
     * is not found or a pointer to the device. The device returned has
     * had a reference added and the pointer is safe until the user calls
     * dev_put to indicate they have finished with it.
     */
    struct net_device *dev_get_by_flags(struct net *net, unsigned short if_flags,
    				    unsigned short mask)
    {
    	struct net_device *dev, *ret;

    	ret = NULL;
    	rcu_read_lock();
    	for_each_netdev_rcu(net, dev) {
    		if (((dev->flags ^ if_flags) & mask) == 0) {
    			dev_hold(dev);
    			ret = dev;
    			break;
    		}
    	}
    	rcu_read_unlock();
    	return ret;
    }
    EXPORT_SYMBOL(dev_get_by_flags);
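
The test `((dev->flags ^ if_flags) & mask) == 0` in dev_get_by_flags is compact, so a small stand-alone sketch may help. XOR leaves a 1 in every bit where the device and the wanted value differ, and the mask keeps only the bits the caller cares about. The function name flags_match is made up for this example; the IFF_* values are redefined locally so the snippet builds without kernel headers.

```c
#include <stdbool.h>

/* Local copies of three IFF_* bit values from <linux/if.h>. */
#define IFF_UP        0x1
#define IFF_BROADCAST 0x2
#define IFF_LOOPBACK  0x8

/* True when dev_flags agrees with if_flags on every bit set in mask. */
bool flags_match(unsigned short dev_flags, unsigned short if_flags,
		 unsigned short mask)
{
	return ((dev_flags ^ if_flags) & mask) == 0;
}
```

For instance, `flags_match(dev_flags, IFF_UP, IFF_UP | IFF_LOOPBACK)` asks for a device that is up and is not a loopback device, while ignoring all other bits such as IFF_BROADCAST.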

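Device status

  • flags
    A bit field that stores various flags, most of which represent device capabilities.
  • reg_state
    The registration state of the device.
  • state
    The device state as it relates to its queuing discipline.
    • __LINK_STATE_START
      The device is up. This flag can be checked with netif_running.
    • __LINK_STATE_PRESENT
      The device is present. This flag can be checked with netif_device_present.
    • __LINK_STATE_NOCARRIER
      There is no carrier. This flag can be checked with netif_carrier_ok.
    • __LINK_STATE_LINKWATCH_EVENT
      The link state of the device has changed.
    • __LINK_STATE_XOFF
    • __LINK_STATE_SCHED
    • __LINK_STATE_RX_SCHED
      These three flags are used by the code that manages the device's ingress and egress traffic.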

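Registration state

  • NETREG_UNINITIALIZED
    Defined as 0. When the net_device structure is allocated and its contents are zeroed, this is the value that the 0 in dev->reg_state represents.
  • NETREG_REGISTERING
    The net_device structure has been added to the required structures, but the kernel still has to add an entry to the /sys filesystem.
  • NETREG_REGISTERED
    The device has been fully registered.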
  • NETREG_UNREGISTERING
    The net_device structure has been removed from the structures it was added to.
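  • NETREG_UNREGISTERED
    The device has been fully unregistered (including removal of its entries in /sys), but the net_device structure has not been freed yet.
  • NETREG_RELEASED
    All references to the net_device structure have been released.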
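
As a rough illustration, the states above can be read as a one-way sequence. The sketch below assumes a device moves through them strictly in the order listed, which is a simplification made for this example; the enum and the toy_reg_transition helper are hypothetical and not part of the kernel.

```c
#include <stdbool.h>

/* Hypothetical mirror of the states, in the order they are listed. */
enum toy_reg_state {
	TOY_UNINITIALIZED = 0,
	TOY_REGISTERING,
	TOY_REGISTERED,
	TOY_UNREGISTERING,
	TOY_UNREGISTERED,
	TOY_RELEASED,
};

/* Accept a transition only if it moves to the next state in the list. */
bool toy_reg_transition(enum toy_reg_state *state, enum toy_reg_state next)
{
	if (*state == TOY_RELEASED || (int)next != (int)*state + 1)
		return false;
	*state = next;
	return true;
}
```

Because a zeroed structure already reads as the first state, no explicit initialization of the state field is needed after allocation.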

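Registering and unregistering devices

Network devices are registered with and unregistered from the kernel through register_netdev and unregister_netdev.

Device registration status notification

  • netdev_chain
    Kernel components can register with this notification chain.
  • Netlink's RTMGRP_LINK multicast group
    User-space applications can register with netlink's RTMGRP_LINK multicast group.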

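The netdev_chain notification chain

Kernel components register with and unregister from this chain through register_netdevice_notifier and unregister_netdevice_notifier, respectively.
All NETDEV_XXX events reported through netdev_chain are listed in include/linux/notifier.h. The events covered in this chapter, and the conditions that trigger them, are:

  • NETDEV_UP
  • NETDEV_GOING_DOWN
  • NETDEV_DOWN
    NETDEV_UP is sent to report that a device has been enabled; it is generated by dev_open.
    NETDEV_GOING_DOWN is sent when a device is about to be disabled, and NETDEV_DOWN is sent
    once it has been disabled. Both are generated by dev_close.
  • NETDEV_REGISTER
    The device has been registered. This event is generated by register_netdevice.
  • NETDEV_UNREGISTER
    The device has been unregistered. This event is generated by unregister_netdevice.
  • NETDEV_REBOOT
    The device has restarted because of a hardware failure. Currently unused.
  • NETDEV_CHANGEADDR
    The hardware address of the device (or the associated broadcast address) has changed.
  • NETDEV_CHANGENAME
    The device has changed its name.
  • NETDEV_CHANGE
    The state or configuration of the device has changed. This event is used in all cases not covered by NETDEV_CHANGEADDR and NETDEV_CHANGENAME.

[Note] When a component registers with the chain, register_netdevice_notifier also replays (to the new registrant only) all past NETDEV_REGISTER and
NETDEV_UP notifications for the devices that are currently registered. This gives the new registrant a clear picture of the state of the registered devices.
Quite a few kernel components register with netdev_chain. Some of them are:

  • Routing
    The routing subsystem uses these notifications to add or remove all routing entries associated with the device.
  • Firewall
    For example, if the firewall has buffered packets from a device, it must either drop them according to its policies or take some other action.
  • Protocol code (ARP, IP, and so on)
    For example, when you change the MAC address of a local device, the ARP table must be updated accordingly.
  • Virtual devices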
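
The replay behaviour described in the note is easier to see in a small model. The following user-space sketch implements a toy notification chain in which a late subscriber is told about devices that already exist. All names (toy_dev, toy_notifier, the toy_* functions) and the event values are invented for this example; it models the idea, not the kernel's notifier implementation.

```c
#include <stddef.h>

enum { TOY_NETDEV_UP = 1, TOY_NETDEV_REGISTER = 2 }; /* arbitrary values */

#define TOY_MAX_DEVS 8

struct toy_dev {
	const char *name;
	int up; /* nonzero if the device is currently enabled */
};

struct toy_notifier {
	void (*call)(struct toy_notifier *nb, int event, struct toy_dev *dev);
	struct toy_notifier *next;
	int seen_register; /* counters so the effect can be observed */
	int seen_up;
};

static struct toy_notifier *toy_chain;
static struct toy_dev *toy_devs[TOY_MAX_DEVS];
static int toy_ndevs;

/* Deliver one event to every subscriber on the chain. */
void toy_call_notifiers(int event, struct toy_dev *dev)
{
	struct toy_notifier *nb;

	for (nb = toy_chain; nb; nb = nb->next)
		nb->call(nb, event, dev);
}

/* Subscribe, then replay REGISTER (and UP) for devices that already exist. */
void toy_register_notifier(struct toy_notifier *nb)
{
	int i;

	nb->next = toy_chain;
	toy_chain = nb;
	for (i = 0; i < toy_ndevs; i++) {
		nb->call(nb, TOY_NETDEV_REGISTER, toy_devs[i]);
		if (toy_devs[i]->up)
			nb->call(nb, TOY_NETDEV_UP, toy_devs[i]);
	}
}

/* Add a device and tell the current subscribers about it. */
int toy_register_dev(struct toy_dev *dev)
{
	if (toy_ndevs == TOY_MAX_DEVS)
		return -1;
	toy_devs[toy_ndevs++] = dev;
	toy_call_notifiers(TOY_NETDEV_REGISTER, dev);
	return 0;
}

/* Example callback: count the events it receives. */
void toy_count_events(struct toy_notifier *nb, int event, struct toy_dev *dev)
{
	(void)dev;
	if (event == TOY_NETDEV_REGISTER)
		nb->seen_register++;
	else if (event == TOY_NETDEV_UP)
		nb->seen_up++;
}
```

Note that the replay calls only the new subscriber's callback, not the whole chain, so existing subscribers do not see duplicate events.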

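RTnetlink link notifications

When the state or configuration of a device changes, rtmsg_ifinfo is used to send a notification to the link multicast group RTMGRP_LINK.

Device registration

Registering a device involves more than inserting the net_device structure into the global list and the hash tables. It also initializes some parameters and generates a broadcast notification to inform other kernel components of the registration.
register_netdev() calls register_netdevice().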

The register_netdevice function

/**
 * register_netdevice - register a network device
 * @dev: device to register
 *
 * Take a completed network device structure and add it to the kernel
 * interfaces. A %NETDEV_REGISTER message is sent to the netdev notifier
 * chain. 0 is returned on success. A negative errno code is returned
 * on a failure to set up the device, or if the name is a duplicate.
 *
 * Callers must hold the rtnl semaphore. You may want
 * register_netdev() instead of this.
 *
 * BUGS:
 * The locking appears insufficient to guarantee two parallel registers
 * will not get the same name.
 */

int register_netdevice(struct net_device *dev)
{
	int ret;
	struct net *net = dev_net(dev);

	BUG_ON(dev_boot_phase);
	ASSERT_RTNL();

	might_sleep();

	/* When net_device's are persistent, this will be fatal. */
	BUG_ON(dev->reg_state != NETREG_UNINITIALIZED);
	BUG_ON(!net);

	spin_lock_init(&dev->addr_list_lock);
	netdev_set_addr_lockdep_class(dev);
	netdev_init_queue_locks(dev);

	dev->iflink = -1;

#ifdef CONFIG_RPS
	if (!dev->num_rx_queues) {
		/*
		 * Allocate a single RX queue if driver never called
		 * alloc_netdev_mq
		 */

		dev->_rx = kzalloc(sizeof(struct netdev_rx_queue), GFP_KERNEL);
		if (!dev->_rx) {
			ret = -ENOMEM;
			goto out;
		}

		dev->_rx->first = dev->_rx;
		atomic_set(&dev->_rx->count, 1);
		dev->num_rx_queues = 1;
	}
#endif
	/* Init, if this function is available */
	if (dev->netdev_ops->ndo_init) {
		ret = dev->netdev_ops->ndo_init(dev);
		if (ret) {
			if (ret > 0)
				ret = -EIO;
			goto out;
		}
	}

	ret = dev_get_valid_name(dev, dev->name, 0);
	if (ret)
		goto err_uninit;

	dev->ifindex = dev_new_index(net);
	if (dev->iflink == -1)
		dev->iflink = dev->ifindex;

	/* Fix illegal checksum combinations */
	if ((dev->features & NETIF_F_HW_CSUM) &&
	    (dev->features & (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM))) {
		printk(KERN_NOTICE "%s: mixed HW and IP checksum settings.\n",
		       dev->name);
		dev->features &= ~(NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);
	}

	if ((dev->features & NETIF_F_NO_CSUM) &&
	    (dev->features & (NETIF_F_HW_CSUM|NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM))) {
		printk(KERN_NOTICE "%s: mixed no checksumming and other settings.\n",
		       dev->name);
		dev->features &= ~(NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM|NETIF_F_HW_CSUM);
	}

	dev->features = netdev_fix_features(dev->features, dev->name);

	/* Enable software GSO if SG is supported. */
	if (dev->features & NETIF_F_SG)
		dev->features |= NETIF_F_GSO;

	ret = call_netdevice_notifiers(NETDEV_POST_INIT, dev);
	ret = notifier_to_errno(ret);
	if (ret)
		goto err_uninit;

	ret = netdev_register_kobject(dev);
	if (ret)
		goto err_uninit;
	dev->reg_state = NETREG_REGISTERED;

	/*
	 * Default initial state at registry is that the
	 * device is present.
	 */

	set_bit(__LINK_STATE_PRESENT, &dev->state);

	dev_init_scheduler(dev);
	dev_hold(dev);
	list_netdevice(dev);

	/* Notify protocols, that a new device appeared. */
	ret = call_netdevice_notifiers(NETDEV_REGISTER, dev);
	ret = notifier_to_errno(ret);
	if (ret) {
		rollback_registered(dev);
		dev->reg_state = NETREG_UNREGISTERED;
	}
	/*
	 * Prevent userspace races by waiting until the network
	 * device is fully setup before sending notifications.
	 */
	if (!dev->rtnl_link_ops ||
	    dev->rtnl_link_state == RTNL_LINK_INITIALIZED)
		rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U);

out:
	return ret;

err_uninit:
	if (dev->netdev_ops->ndo_uninit)
		dev->netdev_ops->ndo_uninit(dev);
	goto out;
}
EXPORT_SYMBOL(register_netdevice);

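Device unregistration

[Note] Whenever devices depend on one another, unregistering one of them forces the unregistration of all (or some) of the others. Virtual devices are an example.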

The unregister_netdevice function

static inline void unregister_netdevice(struct net_device *dev)
{
	unregister_netdevice_queue(dev, NULL);
}

/**
 * unregister_netdev - remove device from the kernel
 * @dev: device
 *
 * This function shuts down a device interface and removes it
 * from the kernel tables.
 *
 * This is just a wrapper for unregister_netdevice that takes
 * the rtnl semaphore. In general you want to use this and not
 * unregister_netdevice.
 */
void unregister_netdev(struct net_device *dev)
{
	rtnl_lock();
	unregister_netdevice(dev);
	rtnl_unlock();
}
EXPORT_SYMBOL(unregister_netdev);

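Reference counts

A net_device structure cannot be freed until all references to it have been released. The reference count is kept in dev->refcnt, which is updated each time a reference is added with dev_hold or removed with dev_put.

The netdev_wait_allrefs function

netdev_wait_allrefs consists of a loop that ends only when dev->refcnt drops to zero. Every second it sends a NETDEV_UNREGISTER notification, and every 10 seconds it prints
a warning on the console. The rest of the time it sleeps. The function does not give up until all references to the input net_device structure have been released.
Two common cases require more than one notification to be sent:

  • A bug
    For example, a piece of code holds a reference to the net_device structure but never releases it, either because it did not register with the netdev_chain notification chain or because it does not handle the notification correctly.
  • A pending timer
    For example, suppose the function that runs when a timer expires needs to access data that includes a reference to the net_device structure. In that case you have to wait until the timer expires and its handler, hopefully, releases the reference.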
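
The interplay between hold/put and the waiting loop can be sketched in user space as follows. This is only a model under stated assumptions: struct ref_dev and the ref_* functions are hypothetical, C11 atomics stand in for the kernel's counter, and the on_tick callback stands in for "sleep, then rebroadcast NETDEV_UNREGISTER". Unlike the real function, the sketch lets the callback abort the wait so that it can terminate in a test.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct ref_dev {
	atomic_int refcnt; /* plays the role of dev->refcnt */
};

/* Take a reference (the dev_hold role). */
void ref_dev_hold(struct ref_dev *dev)
{
	atomic_fetch_add(&dev->refcnt, 1);
}

/* Drop a reference (the dev_put role). */
void ref_dev_put(struct ref_dev *dev)
{
	atomic_fetch_sub(&dev->refcnt, 1);
}

/*
 * Loop until the reference count reaches zero. Each iteration calls
 * on_tick, which represents one round of notifying reference holders.
 * Returns the number of rounds needed, or -1 if on_tick asked to stop.
 */
int ref_wait_allrefs(struct ref_dev *dev, bool (*on_tick)(struct ref_dev *))
{
	int ticks = 0;

	while (atomic_load(&dev->refcnt) != 0) {
		if (!on_tick(dev))
			return -1;
		ticks++;
	}
	return ticks;
}
```

A holder that reacts to the notification by calling ref_dev_put lets the loop finish; a holder that never reacts is the "bug" case above, where the loop would spin indefinitely.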

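Enabling and disabling a network device

Once registered, a device is usable, but it cannot transmit or receive traffic until a user or user application explicitly enables it. Requests to enable a device are handled by dev_open.

  • Enabling a device involves the following tasks:
    • Call dev->open if it is defined. Not all device drivers provide this initialization function.
    • Set the __LINK_STATE_START flag in dev->state to mark the device as up and running.
    • Set the IFF_UP flag in dev->flags to mark the device as up.
    • Call dev_activate to initialize the egress queuing discipline used by Traffic Control, and start the watchdog timer. If Traffic Control is not configured, the default FIFO (first in, first out) queue is assigned.
  • Disabling a device involves the following tasks:
    • Send a NETDEV_GOING_DOWN notification to the netdev_chain notification chain to tell interested kernel components that the device is about to be disabled.
    • Call dev_deactivate to disable the egress queuing discipline, so the device can no longer be used for transmission, and stop the watchdog timer because it is no longer needed.
    • Clear the __LINK_STATE_START flag in dev->state to mark the device as down.
    • If a polling action is scheduled to read ingress packets, wait for it to complete.
    • Call dev->stop if it is defined.
    • Clear the IFF_UP flag in dev->flags to mark the device as down.
    • Send a NETDEV_DOWN notification to the netdev_chain notification chain to tell interested kernel components that the device is now disabled.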
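
The two task lists above can be condensed into a small user-space model. It follows the order of the steps as listed here, leaves out the notifications and the wait for pending polls, and uses invented names throughout (struct updown_dev, toy_dev_open, toy_dev_close, and the TOY_* bit values), so treat it as a sketch of the sequence rather than the kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_IFF_UP           0x1u  /* stands in for IFF_UP */
#define TOY_LINK_STATE_START 0x1ul /* bit chosen for this sketch */

struct updown_dev {
	unsigned int flags;  /* plays the role of dev->flags */
	unsigned long state; /* plays the role of dev->state */
	int (*open)(struct updown_dev *dev); /* optional driver hook */
	int (*stop)(struct updown_dev *dev); /* optional driver hook */
	bool qdisc_active;   /* set by the dev_activate step */
};

int toy_dev_open(struct updown_dev *dev)
{
	if (dev->flags & TOY_IFF_UP)
		return 0; /* already enabled */
	if (dev->open) {
		int ret = dev->open(dev);

		if (ret)
			return ret; /* the driver refused; nothing was changed */
	}
	dev->state |= TOY_LINK_STATE_START;
	dev->flags |= TOY_IFF_UP;
	dev->qdisc_active = true; /* the dev_activate step */
	return 0;
}

int toy_dev_close(struct updown_dev *dev)
{
	if (!(dev->flags & TOY_IFF_UP))
		return 0; /* already disabled */
	dev->qdisc_active = false; /* the dev_deactivate step */
	dev->state &= ~TOY_LINK_STATE_START;
	if (dev->stop)
		dev->stop(dev);
	dev->flags &= ~TOY_IFF_UP;
	return 0;
}
```

Both hooks are optional, which matches the remark that not every driver provides them; a device without hooks still ends up with the expected flag and state bits.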

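Interaction with power management

When the kernel supports power management, NIC device drivers can be notified whenever the system enters suspend mode or resumes.
When the system enters suspend mode, the suspend function supplied by the device driver is executed so that the driver can act accordingly.
Power management state changes do not affect the registration state dev->reg_state, but the device state dev->state must change.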

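Link state change detection

When a NIC device driver detects the presence or absence of a carrier or signal, either because the NIC notified it or because the driver explicitly checked by reading a configuration register on the NIC,
it can notify the kernel with netif_carrier_on and netif_carrier_off, respectively.
Common situations that lead to a link state change:

  • A cable is plugged into the NIC or unplugged from it.
  • The device at the other end of the cable is powered off or disabled.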
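
A small model of the carrier flag may clarify how the on/off calls relate to __LINK_STATE_NOCARRIER. The sketch assumes, as I understand the kernel helpers to behave, that a call has an effect only when it actually changes the flag, so repeated reports of the same state are ignored. The names (struct toy_link, the toy_carrier_* functions, TOY_NOCARRIER) and the event counter are invented for this example.

```c
#include <stdbool.h>

#define TOY_NOCARRIER 0x4ul /* bit chosen for this sketch */

struct toy_link {
	unsigned long state;  /* plays the role of dev->state */
	int linkwatch_events; /* counts how many real changes were reported */
};

/* The carrier is present when the "no carrier" bit is clear. */
bool toy_carrier_ok(const struct toy_link *l)
{
	return !(l->state & TOY_NOCARRIER);
}

/* Report that the carrier is present; act only on an actual change. */
void toy_carrier_on(struct toy_link *l)
{
	if (l->state & TOY_NOCARRIER) {
		l->state &= ~TOY_NOCARRIER;
		l->linkwatch_events++;
	}
}

/* Report that the carrier is gone; act only on an actual change. */
void toy_carrier_off(struct toy_link *l)
{
	if (!(l->state & TOY_NOCARRIER)) {
		l->state |= TOY_NOCARRIER;
		l->linkwatch_events++;
	}
}
```

Because the flag records the absence of a carrier, the check reads inverted: toy_carrier_ok returns true exactly when the bit is not set.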

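Configuring device-related data from user space

  • ifconfig and mii-tool, from the net-tools package.
  • ethtool, from the ethtool package.
  • ip link, from the IPROUTE2 package.

Tuning via the /proc filesystem

There are no files in /proc that can be used to tune device registration and unregistration.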