How to detect an invalid fd/handle

Time: 2021-07-18 05:26:03

I have a server application that handles network clients with asynchronous I/O. Client connections are accepted and then added to a descriptor set that can be monitored with poll/epoll/select/etc. I'm using the Apache APR library call apr_pollset_poll() to check for descriptors that can be read from or written to. Internally this uses epoll/poll/select/etc., depending on the platform.

The problem is that one of the socket descriptors somehow becomes corrupt, and apr_pollset_poll() returns error 10038, which is WSAENOTSOCK: "An operation was attempted on something that is not a socket." Unfortunately, this causes my application to stop working entirely, when all I want is to kick that particular client connection. If I could somehow ignore this socket or remove it from the descriptor set, the application could continue to function and properly read/write the other sockets. I know I should find the root cause of the corruption, but I need a failsafe workaround.

Once the descriptors are added to the pollset, they are handled by the OS/kernel, and I see no way of retrieving them so that I can iterate over them. Maintaining them in my own list as well would probably create other problems further down the line, because on socket close I would need to clean them up somehow, which happens automatically for the in-kernel pollset.

Any suggestions?

2 answers

#1


2  

It sounds dire, but it is an emergency situation when it occurs. So, I suggest going through all the descriptors in your working pollset and trying an operation on each one that will trigger that error if the descriptor is bogus. For example, you could create a new, temporary pollset and try a non-blocking, zero-timeout poll operation to see whether you get the error.
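The one-at-a-time check can be sketched as follows. This is only an illustration under stated assumptions: it uses the POSIX poll() call directly rather than an APR pollset, it assumes the application keeps its raw descriptors in a plain int array, and the helper names fd_is_valid and drop_invalid_fds are hypothetical. On Windows, where the question's WSAENOTSOCK error comes from, the equivalent probe would have to go through Winsock instead.

```c
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of APR): probe one descriptor with a
 * zero-timeout poll(). The kernel sets POLLNVAL in revents when the
 * descriptor is not open. */
static bool fd_is_valid(int fd)
{
    struct pollfd p = { .fd = fd, .events = 0, .revents = 0 };
    int rc;

    if (fd < 0)
        return false;              /* poll() silently ignores negative fds */

    do {
        rc = poll(&p, 1, 0);       /* timeout 0: never blocks */
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return false;              /* assumption: treat other errors as invalid */

    return (p.revents & POLLNVAL) == 0;
}

/* Hypothetical helper: compact fds[] in place, keeping only descriptors
 * that pass the probe. Returns the number of descriptors kept. */
static size_t drop_invalid_fds(int *fds, size_t n)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (fd_is_valid(fds[i]))
            fds[kept++] = fds[i];
    }
    return kept;
}
```

After such a pass, the surviving descriptors could be used to rebuild the working pollset from scratch.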

If you've got more than, say, a dozen descriptors in your pollset, you might consider a binary search instead of a one-at-a-time approach. Put half your descriptors into the temporary pollset and do the operation. If it fails, you know there is a bogus descriptor in the half you tried, so divide that half in two and try again. If it does not fail, you can presume the bogus descriptor is in the other half; you can either validate that the other half fails, or assume it will, and then split that half in two and try again. Keep going until you've isolated the one failing descriptor. Clearly, if you have several bogus descriptors rather than just one, you may have to repeat the process a few times.
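A minimal sketch of that bisection, assuming a POSIX system and using select() directly instead of an APR pollset, since select() only reports EBADF for the set as a whole and so actually needs the search. The helper names range_has_bad_fd and find_bad_fd are hypothetical, and every descriptor is assumed to satisfy 0 <= fd < FD_SETSIZE.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: true if a zero-timeout select() over fds[lo..hi)
 * fails with EBADF, i.e. at least one descriptor in the range is not open.
 * Assumption: every fd is in the range 0 <= fd < FD_SETSIZE. */
static bool range_has_bad_fd(const int *fds, size_t lo, size_t hi)
{
    fd_set rset;
    struct timeval tv = { 0, 0 };   /* zero timeout: never blocks */
    int maxfd = -1;

    FD_ZERO(&rset);
    for (size_t i = lo; i < hi; i++) {
        FD_SET(fds[i], &rset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }
    return select(maxfd + 1, &rset, NULL, NULL, &tv) < 0 && errno == EBADF;
}

/* Hypothetical helper: index of one bad descriptor in fds[0..n),
 * or -1 if the whole set passes. Call again after removing the result
 * to find further bad descriptors. */
static long find_bad_fd(const int *fds, size_t n)
{
    size_t lo = 0, hi = n;

    if (n == 0 || !range_has_bad_fd(fds, 0, n))
        return -1;

    /* Invariant: fds[lo..hi) contains at least one bad descriptor. */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (range_has_bad_fd(fds, lo, mid))
            hi = mid;               /* a bad one is in the lower half */
        else
            lo = mid;               /* so it must be in the upper half */
    }
    return (long)lo;
}
```

Each call isolates one bad descriptor in roughly log2(n) select() calls. With poll() the search is unnecessary, because POLLNVAL is reported per descriptor.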

With the one descriptor isolated, you can decide what you need to do about it and how. And if/when the problem recurs, you can repeat the isolation process. Clearly, you wouldn't try this unless you detected the problem in the first place. But when things are going wrong, you need to isolate the problem, and this would (should) achieve that.

#2


0  

It turned out that I was calling close() on a socket descriptor that was being polled in another thread, and the select()-based pollset implementation does not like this. On the other hand, it would be possible to modify the APR library code to return the offending descriptor when select() detects an invalid socket, or even to remove it automatically.
