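The TaskManager and JobManager went down and I don't know why.
The log is as follows (拒绝连接 in the log is the localized "Connection refused" message):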
2021-04-04 10:03:15,058 INFO - NettyConfig [server address: /192.168.11.132, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 1 (manual), number of client threads: 1 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2021-04-04 10:03:15,297 INFO - Temporary file directory '/tmp': total 26 GB, usable 22 GB (84.62% usable)
2021-04-04 10:03:16,123 INFO - Allocated 102 MB for network buffer pool (number of memory segments: 3278, bytes per segment: 32768).
2021-04-04 10:03:16,197 INFO - Starting the network environment and its components.
2021-04-04 10:03:16,252 INFO - Successful initialization (took 52 ms).
2021-04-04 10:03:16,309 INFO - Successful initialization (took 56 ms). Listening on SocketAddress /192.168.11.132:37718.
2021-04-04 10:03:16,310 INFO - Limiting managed memory to 0.7 of the currently free heap space (641 MB), memory will be allocated lazily.
2021-04-04 10:03:16,314 INFO - I/O manager uses directory /tmp/flink-io-5cb46d08-d7bd-41bb-91d0-e67a2ca8ab47 for spill files.
2021-04-04 10:03:16,409 INFO - Messages have a max timeout of 10000 ms
2021-04-04 10:03:16,421 INFO - Starting RPC endpoint for at akka://flink/user/taskmanager_0 .
2021-04-04 10:03:16,438 INFO - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2021-04-04 10:03:16,439 INFO - Start job leader service.
2021-04-04 10:03:16,441 INFO - User file cache uses directory /tmp/flink-dist-cache-9bd42cb9-9f68-419a-9381-95693ff61ac5
2021-04-04 10:03:16,452 INFO - Connecting to ResourceManager ://flink@localhost:46715/user/resourcemanager(97844b5c0749ea747b4749fffa964081).
2021-04-04 10:03:16,570 WARN - Remote connection to [null] failed with : 拒绝连接: localhost/127.0.0.1:46715
2021-04-04 10:03:16,577 WARN - Association with remote system [://flink@localhost:46715] has failed, address is now gated for [50] ms. Reason: [Association failed with [://flink@localhost:46715]] Caused by: [拒绝连接: localhost/127.0.0.1:46715]
2021-04-04 10:03:16,583 INFO - Could not resolve ResourceManager address ://flink@localhost:46715/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address ://flink@localhost:46715/user/resourcemanager..
2021-04-04 10:03:26,617 WARN - Remote connection to [null] failed with : 拒绝连接: localhost/127.0.0.1:46715
2021-04-04 10:03:26,623 WARN
......
2021-04-04 10:08:07,454 WARN - Remote connection to [null] failed with : 拒绝连接: localhost/127.0.0.1:46715
2021-04-04 10:08:07,455 WARN - Association with remote system [://flink@localhost:46715] has failed, address is now gated for [50] ms. Reason: [Association failed with [://flink@localhost:46715]] Caused by: [拒绝连接: localhost/127.0.0.1:46715]
2021-04-04 10:08:07,456 INFO - Could not resolve ResourceManager address ://flink@localhost:46715/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address ://flink@localhost:46715/user/resourcemanager..
2021-04-04 10:08:16,468 ERROR - Fatal error occurred in TaskExecutor ://flink@192.168.11.132:45382/user/taskmanager_0.
: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
at (:1034)
at $startRegistrationTimeout$3(:1020)
at (:392)
at (:185)
at (:147)
at $$anonfun$receive$(:165)
at (:502)
at $(:500)
at (:95)
at (:526)
at (:495)
at (:257)
at (:224)
at (:234)
at (:289)
at $(:1056)
at (:1692)
at (:157)
2021-04-04 10:08:16,472 ERROR - Fatal error occurred while executing the TaskManager. Shutting it down...
: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
at (:1034)
at $startRegistrationTimeout$3(:1020)
at (:392)
at (:185)
at (:147)
at $$anonfun$receive$(:165)
at (:502)
at $(:500)
at (:95)
at (:526)
at (:495)
at (:257)
at (:224)
at (:234)
at (:289)
at $(:1056)
at (:1692)
at (:157)
2021-04-04 10:08:16,478 INFO - Stopping TaskExecutor ://flink@192.168.11.132:45382/user/taskmanager_0.
2021-04-04 10:08:16,478 INFO - Stop job leader service.
2021-04-04 10:08:16,507 INFO - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2021-04-04 10:08:16,507 INFO - Shutting down TaskExecutorLocalStateStoresManager.
2021-04-04 10:08:16,514 INFO - I/O manager removed spill file directory /tmp/flink-io-5cb46d08-d7bd-41bb-91d0-e67a2ca8ab47
2021-04-04 10:08:16,514 INFO - Shutting down the network environment and its components.
2021-04-04 10:08:16,515 INFO - Successful shutdown (took 0 ms).
2021-04-04 10:08:16,518 INFO - Successful shutdown (took 1 ms).
2021-04-04 10:08:16,532 INFO - Stop job leader service.
2021-04-04 10:08:16,532 INFO - removed file cache directory /tmp/flink-dist-cache-9bd42cb9-9f68-419a-9381-95693ff61ac5
2021-04-04 10:08:16,539 INFO - Stopped TaskExecutor ://flink@192.168.11.132:45382/user/taskmanager_0.
2021-04-04 10:08:16,540 INFO - Shutting down BLOB cache
2021-04-04 10:08:16,540 INFO - Shutting down BLOB cache
2021-04-04 10:08:16,553 INFO - backgroundOperationsLoop exiting
2021-04-04 10:08:16,565 INFO - Session: 0x10000007e9d0008 closed
2021-04-04 10:08:16,565 INFO - Stopping Akka RPC service.
2021-04-04 10:08:16,583 INFO $RemotingTerminator - Shutting down remote daemon.
2021-04-04 10:08:16,594 INFO - EventThread shut down for session: 0x10000007e9d0008
2021-04-04 10:08:16,597 INFO $RemotingTerminator - Shutting down remote daemon.
2021-04-04 10:08:16,601 INFO $RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2021-04-04 10:08:16,611 INFO $RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2021-04-04 10:08:16,640 INFO $RemotingTerminator - Remoting shut down.
2021-04-04 10:08:16,641 INFO $RemotingTerminator - Remoting shut down.
2021-04-04 10:08:16,661 INFO - Stopped Akka RPC service.
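The timestamps in the log match the 300000 ms registration limit named in the fatal error: the first "Connecting to ResourceManager" line is logged at 10:03:16,452 and the fatal error at 10:08:16,468. A quick way to check that arithmetic, as a small sketch (the two timestamps are copied from the log; the variable names are only illustrative):

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S,%f"  # timestamp layout used in the log

# Timestamps copied from the log lines above.
first_attempt = datetime.strptime("2021-04-04 10:03:16,452", FMT)
fatal_error = datetime.strptime("2021-04-04 10:08:16,468", FMT)

# Integer number of milliseconds between the two log lines.
elapsed_ms = (fatal_error - first_attempt) // timedelta(milliseconds=1)
print(elapsed_ms)  # 300016
```

So the TaskManager kept retrying against localhost:46715 every 10 seconds for the whole registration window and then terminated itself, which is what the stack trace reports.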
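Cause: the ZooKeeper configuration was wrong. The corrected value is
: node1:2181,node2:2181,node3:2181
I also changed the permissions of the jars in lib to 755, and after that everything worked.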
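Before restarting Flink, it can help to confirm that every address in the quorum string is actually reachable from the TaskManager host. The following is a minimal sketch, not part of Flink: `parse_quorum` and `is_reachable` are hypothetical helper names, and the quorum string is the corrected value from above.

```python
import socket


def parse_quorum(quorum):
    """Split 'host:port,host:port' into a list of (host, port) tuples."""
    pairs = []
    for entry in quorum.split(","):
        host, _, port = entry.strip().rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


if __name__ == "__main__":
    for host, port in parse_quorum("node1:2181,node2:2181,node3:2181"):
        state = "reachable" if is_reachable(host, port) else "NOT reachable"
        print(host, port, state)
```

A successful TCP connection only shows that something is listening on the port; it does not prove the ZooKeeper ensemble is healthy.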
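Also, I found that rebooting the VMs directly, or starting the TaskManagers on all three machines at the same time, can also cause the error above. My guess is that it happens when several TaskManagers start almost simultaneously.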
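If that guess is right, one workaround would be to stagger the starts. The sketch below only prints a start plan and runs nothing. It assumes the TaskManagers live on node1 to node3 (the host names from the ZooKeeper setting), that they are started over ssh with Flink's standalone `bin/taskmanager.sh start` script, and that a 15-second gap is enough; all three are assumptions, and `staggered_start_plan` is a hypothetical helper.

```python
def staggered_start_plan(hosts, step_seconds=15):
    """Return (host, delay_seconds) pairs so that no two hosts start together."""
    return [(host, index * step_seconds) for index, host in enumerate(hosts)]


# Assumed host names; replace with the machines that run a TaskManager.
plan = staggered_start_plan(["node1", "node2", "node3"])

for host, delay in plan:
    # Printed as shell lines for review; the ssh invocation is an assumption.
    print(f"sleep {delay}; ssh {host} bin/taskmanager.sh start")
```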