Supervisor exception during log rollover causes the app server to freeze?

Time: 2022-12-12 22:00:26

I am running a Flask app with Gunicorn on an EC2 server, and I use supervisord to monitor and restart the app server. Yesterday the server stopped responding to HTTP requests. We checked the status with supervisorctl and it showed the process as running. We then looked at the supervisor log and saw the following error:

CRIT uncaptured python exception, closing channel <POutputDispatcher at 34738328
for <Subprocess at 34314576 with name flask in state RUNNING> (stdout)>
(<type 'exceptions.OSError'>:[Errno 2] No such file or directory

[/usr/local/lib/python2.7/dist-packages/supervisor/supervisord.py|runforever|233] 
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|handle_read_event|231] 
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|record_output|165] 
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|_log|141]
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|info|273] 
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|log|291] 
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|emit|186]
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|doRollover|220])

Restarting supervisord fixed the issue for us. Below are the relevant parts of our supervisor config:

[supervisord]
childlogdir = /var/log/supervisord/
logfile = /var/log/supervisord/supervisord.log
logfile_maxbytes = 50MB
logfile_backups = 10
loglevel = info
pidfile = /var/log/supervisord/supervisord.pid
umask = 022
nodaemon = false
nocleanup = false

[program:flask]
directory=%(here)s
environment=PATH="/home/ubuntu/.virtualenvs/flask/bin"
command=newrelic-admin run-program gunicorn app:app -c gunicorn_conf.py
autostart=true
autorestart=true
redirect_stderr=true

What's strange is that we have two servers running behind an ELB, and both hit the same issue within 10 minutes of each other. My guess is that the logs on both reached the size limit at around the same time (which is plausible, since they see roughly the same amount of traffic) and the rollover failed. Any ideas as to why that could have happened?

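One quick way to test that theory (not something from the original post, and assuming the paths from the config above) is to compare the sizes and modification times of the rotated logs under /var/log/supervisord/ on both machines; if the .1 backups were written within minutes of each other, the simultaneous-rollover explanation holds up:

# Rough sanity check, assuming the log directory from the supervisord config
# above; run it on each server and compare the timestamps of the *.1 backups.
import glob
import os
import time

LOG_DIR = "/var/log/supervisord"  # logfile/childlogdir location from the config

for path in sorted(glob.glob(os.path.join(LOG_DIR, "*"))):
    st = os.stat(path)
    mtime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime))
    print("%-60s %8.1f MB  modified %s" % (path, st.st_size / 1e6, mtime))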

1 Answer

#1

AFAIK supervisor uses its own logging implementation, not the one in the Python stdlib - although the class and method names are pretty similar.

There is a potential race condition when deleting files during rollover - you will need to check the source code of your specific supervisor version and compare that with the latest supervisor version, if different. Here is an excerpt from the supervisor code on my system (in the doRollover() method):

try:
    os.remove(dfn)
except OSError, why:
    # catch race condition (already deleted)
    if why[0] != errno.ENOENT:
        raise

If your rollover code doesn't do this, you might need to upgrade your supervisor version.

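For context, a maxbytes/backups rollover generally follows the pattern sketched below (a generic illustration modeled on the usual rotating-file scheme, not a copy of any particular supervisor release; the function name is mine). The traceback above ends inside doRollover(), and both os.remove() and os.rename() raise OSError [Errno 2] (ENOENT) if the file they were about to touch has already disappeared, for example because logrotate or some other cleanup job removed it between the existence check and the call:

import errno
import os

def do_rollover(filename, backups):
    # Shift the numbered backups up by one (log.2 -> log.3, log.1 -> log.2, ...)
    # and then move the live file to log.1. Any of these remove/rename calls can
    # race with an external process deleting files and fail with ENOENT.
    # A real implementation would also close and reopen the stream around the
    # final rename.
    for i in range(backups - 1, 0, -1):
        src = "%s.%d" % (filename, i)
        dst = "%s.%d" % (filename, i + 1)
        if os.path.exists(src):
            if os.path.exists(dst):
                try:
                    os.remove(dst)        # guarded, as in the excerpt above
                except OSError as why:
                    if why.errno != errno.ENOENT:
                        raise
            os.rename(src, dst)           # unguarded: ENOENT here propagates
    os.rename(filename, filename + ".1")  # unguarded: ENOENT here propagates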

Update: If the error happens on the rename, then it might be a race condition which hasn't yet been caught. Consider asking on the supervisor mailing list.

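For illustration only, the analogous guard around the rename would look something like the helper below (a hypothetical sketch, not an official supervisor patch; whether silently skipping a vanished file is acceptable behaviour is exactly the kind of question for the mailing list):

import errno
import os

def rename_ignoring_missing(src, dst):
    # Hypothetical helper: tolerate the source having already been removed or
    # renamed by someone else, mirroring the ENOENT guard around os.remove()
    # in the excerpt above.
    try:
        os.rename(src, dst)
    except OSError as why:
        if why.errno != errno.ENOENT:
            raise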
