Original article; please credit the source when reposting: 服务器非业余研究 http://blog.csdn.net/erlib, by Sunface

Supervisors and the Semantics of start_link

In complex production systems, most faults and errors are transient, and retrying an operation is a good way to do things. Jim Gray’s paper [7] quotes Mean Times Between Failures (MTBF) of systems handling transient bugs being better by a factor of 4 when doing this.
Still, supervisors aren’t just about restarting.

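In a complex production system, most faults and errors are transient, so retrying the operation is a sound approach. Jim Gray's paper reports that systems which handle transient bugs this way improve their Mean Time Between Failures (MTBF) by roughly a factor of 4. Retrying is how supervisors handle failure, but supervisors do more than simply restart processes.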

One very important part of Erlang supervisors and their supervision trees is that their start phases are synchronous. Each OTP process has the potential to prevent its siblings and cousins from booting. If the process dies, it’s retried again, and again, until it works, or fails too often.

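Erlang supervisors and their supervision trees have one very important property: the start phase (init) is synchronous, so every OTP process can block its siblings from starting. If a process dies, it is restarted again and again until it either starts correctly or fails too many times, at which point the restarts stop.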
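
As a minimal sketch of where "fails too often" is configured, the supervisor below allows at most 3 restarts within 10 seconds before giving up. The module name conn_sup, the child module db_client, and the chosen limits are assumptions made for illustration only.

```erlang
-module(conn_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    %% Returns only once init/1 of every child has returned, so a slow
    %% or failing child init delays or aborts the whole startup.
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% At most 3 restarts in 10 seconds; beyond that the supervisor
    %% terminates itself and the failure moves up the tree.
    SupFlags = #{strategy => one_for_one, intensity => 3, period => 10},
    %% db_client is a hypothetical worker module.
    Children = [#{id => db_client,
                  start => {db_client, start_link, [[]]},
                  restart => permanent}],
    {ok, {SupFlags, Children}}.
```

Note that the restart itself happens immediately: intensity and period only bound how many restarts are tolerated, they do not add a delay between them.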

That’s where people make a very common mistake. There isn’t a backoff or cooldown period before a supervisor restarts a crashed child. When a network-based application tries to set up a connection during its initialization phase and the remote service is down, the application fails to boot after too many fruitless restarts. Then the system may shut down.

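This is where people commonly go wrong: a supervisor restarts a child with no cooldown period. If a network-based application tries to establish a connection during initialization while the remote service is down, the application fails to start, gets restarted over and over, and can end up taking the whole system down.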

Many Erlang developers end up arguing in favor of a supervisor that has a cooldown period. I strongly oppose the sentiment for one simple reason: it’s all about the guarantees.

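Many Erlang developers argue for adding a cooldown period to supervisors. I strongly oppose this for one very simple reason: it's all about the guarantees (the guarantee of a stable, correct initialization).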

[7] http://mononcqc.tumblr.com/post/35165909365/why-do-computers-stop


It’s About the Guarantees

On Guarantees

Restarting a process is about bringing it back to a stable, known state. From there, things can be retried. When the initialization isn’t stable, supervision is worth very little.

An initialized process should be stable no matter what happens. That way, when its siblings and cousins get started later on, they can be booted fully knowing that the rest of the system that came up before them is healthy.

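The goal of restarting a process is to return it to a known, stable state. If the initialization itself is unstable and fails, the supervisor's restart strategy is of no use. Initialization should be stable under all circumstances, so that when sibling processes start later, everything that started before them is known to be healthy.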

If you don’t provide that stable state, or if you were to start the entire system asynchronously, you would get very little benefit from this structure that a try ... catch in a loop wouldn’t provide.

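If you cannot ensure that processes start in a stable, correct state, or if you start the whole system asynchronously, the supervision tree gives you almost nothing that a try ... catch in a loop would not.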

Supervised processes provide guarantees in their initialization phase, not a best effort.
This means that when you’re writing a client for a database or service, you shouldn’t need a connection to be established as part of the initialization phase unless you’re ready to say it will always be available no matter what happens.

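Initialization must provide a guarantee, not a best effort. This means that when you write a client for a database or a service, you should not establish the connection during initialization unless you are prepared to claim it will be available no matter what happens.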

You could force a connection during initialization if you know the database is on the same host and should be booted before your Erlang system, for example. Then a restart should work.
In case of something incomprehensible and unexpected that breaks these guarantees, the node will end up crashing, which is desirable: a pre-condition to starting your system hasn’t been met.
It’s a system-wide assertion that failed.

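If you are sure the database runs on the same host and is started before the Erlang system, you can force the connection during initialization; the initialization is then guaranteed.
When something unexpected breaks that guarantee, the node crashes, which is exactly what we want: a precondition for starting your system has not been met. Crashing here surfaces the problem early; it is a system-wide assertion that failed.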

If, on the other hand, your database is on a remote host, you should expect the connection to fail. It’s just a reality of distributed systems that things go down ⁸.
In this case, the only guarantee you can make in the client process is that your client will be able to handle requests, but not that it will communicate to the database. It could return {error, not_connected} on all calls during a net split, for example.

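On the other hand, if your database is on a remote host, you must be prepared for the connection to fail, because this happens regularly in real distributed systems ⁸. In that case the only guarantee the client process can make is that it is able to handle requests, not that it can communicate with the database. For example, during a network partition it can return {error, not_connected} on every call.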
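
A minimal sketch of that weaker guarantee: the gen_server below always starts and always answers, replying {error, not_connected} while it has no socket. The module name nc_client, the query/2 API, and the send_query/2 placeholder are hypothetical and not part of the original text.

```erlang
-module(nc_client).
-behaviour(gen_server).
-export([start_link/1, query/2]).
-export([init/1, handle_call/3, handle_cast/2]).

-record(state, {sock, opts}).

start_link(Opts) -> gen_server:start_link(?MODULE, Opts, []).

query(Pid, Q) -> gen_server:call(Pid, {query, Q}).

%% init/1 never touches the network, so it cannot fail because of it.
init(Opts) ->
    {ok, #state{sock = undefined, opts = Opts}}.

%% No socket yet: answer the caller instead of crashing.
handle_call({query, _Q}, _From, S = #state{sock = undefined}) ->
    {reply, {error, not_connected}, S};
handle_call({query, Q}, _From, S = #state{sock = Sock}) ->
    {reply, send_query(Sock, Q), S}.

handle_cast(_Msg, S) -> {noreply, S}.

%% Hypothetical placeholder for the real wire protocol.
send_query(_Sock, _Q) -> {error, not_implemented}.
```

The client process promises only that it is there to take requests; whether a request succeeds is reported to the caller on every call.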

The reconnection to the database can then be done using whatever cooldown or backoff strategy you believe is optimal, without impacting the stability of the system. It can be attempted in the initialization phase as an optimization, but the process should be able to reconnect later on if anything ever disconnects.
If you expect failure to happen on an external service, do not make its presence a guarantee of your system. We’re dealing with the real world here, and failure of external dependencies is always an option.

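Reconnecting to the database can then use whatever cooldown or backoff strategy you consider best, without affecting the stability of the system. A connection can be attempted during initialization as an optimization, but the process must be able to reconnect later whenever the connection is lost.
If you expect an external service to fail, do not make its presence one of your system's guarantees. We are dealing with the real world, where failure of external dependencies is always a possibility.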
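
One possible backoff strategy, sketched under assumptions: the delay doubles after each failed attempt, capped at 30 seconds, and resets once a connection succeeds. The module name backoff_client, the delay bounds, and the connect/1 helper (plain gen_tcp with host and port options) are illustrative choices, not something the original prescribes.

```erlang
-module(backoff_client).
-behaviour(gen_server).
-export([start_link/1, next_delay/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(MIN_DELAY, 100).    %% milliseconds
-define(MAX_DELAY, 30000).  %% milliseconds

-record(state, {sock, opts, delay = ?MIN_DELAY}).

start_link(Opts) -> gen_server:start_link(?MODULE, Opts, []).

init(Opts) ->
    %% Ask ourselves to connect, but return right away.
    self() ! reconnect,
    {ok, #state{sock = undefined, opts = Opts}}.

handle_call(_Req, _From, S = #state{sock = undefined}) ->
    {reply, {error, not_connected}, S};
handle_call(_Req, _From, S) ->
    {reply, ok, S}.

handle_cast(_Msg, S) -> {noreply, S}.

handle_info(reconnect, S = #state{sock = undefined, opts = Opts, delay = D}) ->
    case connect(Opts) of
        {ok, Sock} ->
            {noreply, S#state{sock = Sock, delay = ?MIN_DELAY}};
        {error, _Reason} ->
            %% Retry after D ms, and wait twice as long next time.
            erlang:send_after(D, self(), reconnect),
            {noreply, S#state{delay = next_delay(D)}}
    end;
handle_info(_Other, S) ->
    {noreply, S}.

%% Exponential backoff with an upper bound.
next_delay(D) -> min(D * 2, ?MAX_DELAY).

%% Assumed transport: a TCP connection with a 1 second connect timeout.
connect(Opts) ->
    Host = proplists:get_value(host, Opts, "localhost"),
    Port = proplists:get_value(port, Opts, 5432),
    gen_tcp:connect(Host, Port, [binary, {active, false}], 1000).
```

The process is blocked for up to the connect timeout during each attempt; a real client might also add jitter to the delay or detect disconnects and send itself reconnect again.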

[8] Or latency shoots up enough that it is impossible to tell the difference from failure.

[Note 8]: Or network latency becomes so high that it is indistinguishable from failure.

Side Effects

Of course, the libraries and processes that call such a client will then error out if they don’t expect to work without a database.

Of course, if the libraries and processes that call such a client assume the database is available when it is not, they will fail.

That’s an entirely different issue in a different problem space, one that depends on your business rules and what you can or can’t do to a client, but one that is possible to work around.

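That is an entirely different issue in a different problem space: it depends on your business rules and on what you can or cannot do to a client, but it can be worked around.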

For example, consider a client for a service that stores operational metrics — the code that calls that client could very well ignore the errors without adverse effects to the system as a whole.
The difference in both initialization and supervision approaches is that the client’s callers make the decision about how much failure they can tolerate, not the client itself.

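For example, consider a client for a service that stores operational metrics: the code calling that client can most likely ignore its errors, since they do not harm the system as a whole.
The difference between the two initialization and supervision approaches is that the callers of the client, not the client itself, decide how much failure they can tolerate.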
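
A sketch of a caller making that decision: the function below does its real work and treats a failed metrics report as harmless. The module order_handler, the process/1 stub, and passing the metrics client as a fun are all hypothetical, chosen only to keep the example self-contained.

```erlang
-module(order_handler).
-export([handle_order/2]).

%% MetricsFun stands in for a call to a metrics client; it is expected
%% to return ok or {error, Reason}.
handle_order(Order, MetricsFun) ->
    Result = process(Order),
    case MetricsFun({order_processed, Order}) of
        ok -> ok;
        %% The caller decides that losing a metric is tolerable.
        {error, _Reason} -> ok
    end,
    Result.

%% Placeholder for the actual business logic.
process(Order) -> {ok, Order}.
```

A caller that cannot work without the service would instead match on ok and crash, or return the error to its own caller; the client offers the same interface in both cases.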

That’s a very important distinction when it comes to designing fault-tolerant systems.
Yes, supervisors are about restarts, but they should be about restarts to a stable known state.

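This distinction matters a great deal when designing fault-tolerant systems. In short, supervisors do restart processes, but a process should be restarted to a stable, known state.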

Example: Initializing without guaranteeing connections

Example: initialization without a guaranteed connection

The following code attempts to guarantee a connection as part of the process’ state:

The code below assumes the connection always succeeds during initialization, although it may in fact fail:

----------------------------------------------------------------------------------
init(Args) ->
    Opts = parse_args(Args),
    {ok, Port} = connect(Opts),
    {ok, #state{sock=Port, opts=Opts}}.

[...]

handle_info(reconnect, S = #state{sock=undefined, opts=Opts}) ->
    %% try reconnecting in a loop
    case connect(Opts) of
        {ok, New} -> {noreply, S#state{sock=New}};
        _ -> self() ! reconnect, {noreply, S}
    end;
----------------------------------------------------------------------------------
Instead, consider rewriting it as:

重写如下：

----------------------------------------------------------------------------------
init(Args) ->
    Opts = parse_args(Args),
    %% you could try connecting here anyway, for a best
    %% effort thing, but be ready to not have a connection.
    self() ! reconnect,
    {ok, #state{sock=undefined, opts=Opts}}.

[...]

handle_info(reconnect, S = #state{sock=undefined, opts=Opts}) ->
    %% try reconnecting in a loop
    case connect(Opts) of
        {ok, New} -> {noreply, S#state{sock=New}};
        _ -> self() ! reconnect, {noreply, S}
    end;
----------------------------------------------------------------------------------
You now allow initializations with fewer guarantees: they went from "the connection is available" to "the connection manager is available".

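After the rewrite, initialization always succeeds, because no database connection is made during init; all connection attempts happen afterwards, in handle_info.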

In a nutshell

Summary

Production systems I have worked with have been a mix of both approaches.
Things like configuration files, access to the file system (say for logging purposes), local resources that can be depended on (opening UDP ports for logs), restoring a stable state from disk or network, and so on, are things I’ll put into requirements of a supervisor and may decide to synchronously load no matter how long it takes (some applications may just end up having over 10 minute boot times in rare cases, but that’s okay because we’re possibly syncing gigabytes that we need to work with as a base state if we don’t want to serve incorrect information.)

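The production systems I have written have all mixed the two approaches (doing everything during initialization, and initializations that always succeed).
Configuration files, access to the file system (for logging), local resources (opening UDP ports for logs), restoring a stable state from disk or network, and the like are things I make requirements of a supervisor, and I may load them synchronously however long it takes. In rare cases some applications take more than 10 minutes to boot, but that is fine: they may be syncing gigabytes of data needed as a base state, and we do not want to serve incorrect information.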

On the other hand, code that depends on non-local databases and external services will adopt partial startups with quicker supervision tree booting because if the failure is expected to happen often during regular operations, then there’s no difference between now and later.

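Code that depends on non-local databases and external services, on the other hand, uses partial startups so the supervision tree boots faster (no reconnection attempts that may fail during initialization): if failures are expected to happen often during normal operation, it makes no difference whether they happen now or later.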

You have to handle it the same, and for these parts of the system, far less strict guarantees are often the better solution.

Either way the failure has to be handled the same, so for these parts of the system far less strict guarantees are often the better solution.

Application Strategies

No matter what, a sequence of failures is not a death sentence for the node. Once a system has been divided into various OTP applications, it becomes possible to choose which applications are vital or not to the node.
Each OTP application can be started in 3 ways: temporary, transient, permanent, either by doing it manually in application:start(Name, Type) , or in the config file for your release:

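In any case, a series of failures is not fatal to the node. Once the system has been split into several OTP applications, you can choose which applications are vital to the node and which are not. Each OTP application can be started in one of 3 ways (temporary, transient, permanent), either manually with application:start(Name, Type) or through the config file for your release: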

• permanent: if the app terminates, the entire system is taken down, excluding manual termination of the app with application:stop/1.
• transient: if the app terminates for reason normal, that’s ok. Any other reason for termination shuts down the entire system.
• temporary: the application is allowed to stop for any reason. It will be reported, but nothing bad will happen.

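- permanent: if the app terminates, the whole system is stopped, except when the app is stopped manually with application:stop/1.
- transient: if the app terminates with reason normal, the system is unaffected. Termination for any other reason stops the whole system.
- temporary: the app may stop for any reason. The stop is reported, but nothing bad happens.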
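
The manual route can be sketched as below, using application:ensure_all_started/2, which takes the same Type argument as application:start/2 and also starts dependencies that are not running yet. The module name boot_sketch and its two helper functions are assumptions for illustration.

```erlang
-module(boot_sketch).
-export([start_optional/1, start_vital/1]).

%% temporary: if App crashes later, it is reported and the node keeps running.
start_optional(App) ->
    application:ensure_all_started(App, temporary).

%% permanent: if App terminates for any reason other than a manual
%% application:stop/1, the whole node is taken down.
start_vital(App) ->
    application:ensure_all_started(App, permanent).
```

The same Type is applied to any dependencies started along the way, so a vital application with optional dependencies may be better served by starting those dependencies first with their own type.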

It is also possible to start an application as an included application, which starts it under your own OTP supervisor with its own strategy to restart it.

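An application B can also be started as an included application of an application A, so that B runs under one of A's supervisors and is restarted according to that supervisor's strategy.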


[Erlang in Anger] (2.2) Supervisors and the Semantics of start_link