pgpool-II in master/slave mode: what is the easiest way to trigger failover?

Date: 2022-12-23 21:07:53

So I am testing some toy PostgreSQL infrastructure on local virtual machines to determine how pgpool behaves on failover. I've configured a rudimentary setup with two database machines (192.168.0.2 and 192.168.0.3) and a pgpool machine (192.168.0.4). 192.168.0.3 has been set up as a slave to 192.168.0.2 using streaming replication. pgpool-II has been configured with the following:


listen_addresses = '*'
backend_hostname0 = '192.168.0.2'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/var/lib/postgresql/9.4/main/'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = '192.168.0.3'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/var/lib/postgresql/9.4/main/'
backend_flag1 = 'ALLOW_TO_FAILOVER'
enable_pool_hba = on
replication_mode = false
master_slave_mode = on
master_slave_sub_mode = 'stream'
fail_over_on_backend_error = true
failover_command = '/root/pgpool_failover_stream.sh %d %H /tmp/postgresql.trigger.5432'
load_balance_mode = false
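For context, pgpool expands the placeholders in failover_command before invoking the script: %d is the id of the failed node and %H is the hostname of the new master, with the trigger-file path passed through literally. A minimal sketch of what a script like pgpool_failover_stream.sh typically does is below; the function name is hypothetical, and it assumes node 0 is the primary and that passwordless ssh as postgres is set up. In the real script it would be invoked as promote_standby "$@".

```shell
# Hypothetical sketch of a streaming-replication failover script.
# pgpool passes: $1 = %d (failed node id), $2 = %H (new master host),
# $3 = the trigger file path given in failover_command.
promote_standby() {
    failed_node=$1
    new_master=$2
    trigger_file=$3

    if [ "$failed_node" != "0" ]; then
        # A standby died; the primary is still up, so no promotion is needed.
        echo "node $failed_node is a standby; nothing to do"
        return 0
    fi

    # The primary failed: create the trigger file on the surviving node so
    # the standby exits recovery and promotes itself.
    ssh postgres@"$new_master" "touch '$trigger_file'"
}
```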

I've confirmed this all works. That is, replication is working when I make changes on the master database, and I can connect to the master, the slave, and pgpool-II with a sample application and get the results I expect.


Now, I've started a long-running application connected to pgpool, then attempted to cause failover by SSHing into the master database server and forcibly stopping the postgres service (service postgresql stop as root). My application keeps executing queries correctly, but no failover occurs (the failover script is never run). I've even tested connecting directly to the master database, and when I stop the postgres service there, the application does crash.


Am I doing something wrong? Have I not configured my pgpool correctly? Or is there a better way to trigger failover?
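One note on "a better way to trigger failover": service postgresql stop performs an orderly shutdown, while pgpool typically notices a failure either through a periodic health check or when an established backend connection errors out, so simulating a hard crash is a more realistic test. A hedged sketch is below; the data directory path is taken from the config above, and the destructive commands are stored in variables and only printed here rather than executed.

```shell
# Simulating a hard crash on the master (Debian-style 9.4 layout assumed).
# The commands are only echoed in this sketch; run them on 192.168.0.2.
PGDATA=/var/lib/postgresql/9.4/main

# Immediate-mode stop: aborts all backends with no clean shutdown,
# which is much closer to a real crash than 'service postgresql stop'.
CRASH_CMD="pg_ctl -D $PGDATA stop -m immediate"

# Or kill the postmaster outright (its pid is the first line of postmaster.pid).
KILL_CMD="kill -9 \$(head -1 $PGDATA/postmaster.pid)"

echo "$CRASH_CMD"
echo "$KILL_CMD"
```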


EDIT: As requested, here is the portion of the log where the first error occurs:


...
2016-03-15 18:47:15: pid 1232: DEBUG:  initializing backend status
2016-03-15 18:47:15: pid 1231: DEBUG:  initializing backend status
2016-03-15 18:47:15: pid 1230: DEBUG:  initializing backend status
2016-03-15 18:47:15: pid 1209: ERROR:  failed to authenticate
2016-03-15 18:47:15: pid 1209: DETAIL:  invalid authentication message response type, Expecting 'R' and received 'E'
2016-03-15 18:47:15: pid 1209: LOG:  find_primary_node: checking backend no 1
2016-03-15 18:47:15: pid 1209: ERROR:  failed to authenticate
2016-03-15 18:47:15: pid 1209: DETAIL:  invalid authentication message response type, Expecting 'R' and received 'E'
2016-03-15 18:47:15: pid 1209: DEBUG:  find_primary_node: no primary node found
...

Strangely, I can still connect to the pgpool and perform queries, so clearly I don't understand something there.


Edit 2: These are the errors I get after running service postgresql stop on the master. I show everything up to the start of pgpool's own shutdown.


...
2016-03-16 17:24:57: pid 1012: DEBUG:  session context: clearing doing extended query messaging. DONE
2016-03-16 17:24:57: pid 1012: DEBUG:  session context: setting doing extended query messaging. DONE
2016-03-16 17:24:57: pid 1012: DEBUG:  session context: setting query in progress. DONE
2016-03-16 17:24:57: pid 1012: DEBUG:  reading backend data packet kind
2016-03-16 17:24:57: pid 1012: DETAIL:  backend:0 of 2 kind = 'E'
2016-03-16 17:24:57: pid 1012: DEBUG:  processing backend response
2016-03-16 17:24:57: pid 1012: DETAIL:  received kind 'E'(45) from backend
2016-03-16 17:24:57: pid 1012: ERROR:  unable to forward message to frontend
2016-03-16 17:24:57: pid 1012: DETAIL:  FATAL error occured on backend
2016-03-16 17:24:57: pid 1012: DEBUG:  session context: setting query in progress. DONE
2016-03-16 17:24:57: pid 1012: DEBUG:  decide where to send the queries
2016-03-16 17:24:57: pid 1012: DETAIL:  destination = 3 for query= "DISCARD ALL"
2016-03-16 17:24:57: pid 1012: DEBUG:  waiting for query response
2016-03-16 17:24:57: pid 1012: DETAIL:  waiting for backend:0 to complete the query
2016-03-16 17:24:57: pid 1012: FATAL:  unable to read data from DB node 0
2016-03-16 17:24:57: pid 1012: DETAIL:  EOF encountered with backend
2016-03-16 17:24:57: pid 998: DEBUG:  reaper handler
2016-03-16 17:24:57: pid 998: LOG:  child process with pid: 1012 exits with status 256
2016-03-16 17:24:57: pid 998: LOG:  fork a new child process with pid: 1033
2016-03-16 17:24:57: pid 998: DEBUG:  reaper handler: exiting normally
2016-03-16 17:24:57: pid 1033: DEBUG:  initializing backend status
2016-03-16 17:25:02: pid 1031: DEBUG:  PCP child receives shutdown request signal 2
2016-03-16 17:25:02: pid 1029: LOG:  child process received shutdown request signal 2
...

Note that my sample application did in fact die when the master was shutdown.


EDIT 3: Errors I am getting in the new log after properly setting sr_check_period, sr_check_user, and sr_check_password; all the previous errors are now gone:


2016-03-31 17:45:00: pid 18363: DEBUG:  detect error: kind: 1
2016-03-31 17:45:00: pid 18363: DEBUG:  reading backend data packet kind
2016-03-31 17:45:00: pid 18363: DETAIL:  backend:0 of 2 kind = '1'
...
2016-03-31 17:45:00: pid 18363: DEBUG:  detect error: kind: S

1 Answer

#1



There can be multiple reasons for the failover script not getting executed. The first step would be to set log_destination to syslog and enable debug mode (debug_level = 1).
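For reference, the relevant pgpool.conf fragment might look like the following; the syslog facility and ident are assumptions, so match them to your syslog configuration:

```
log_destination = 'syslog'
syslog_facility = 'LOCAL0'
syslog_ident = 'pgpool'
debug_level = 1
```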


I have seen scenarios where the failover script fails to receive the values for the %d and %H placeholders, which means the script cannot ssh to the slave and touch the trigger file.


I can give more details if you post the corresponding log file.


Based on the new logs, I can see an ERROR: failed to authenticate. Can you check whether the following pgpool parameters have been configured correctly?


health_check_user
health_check_password
recovery_user
recovery_password
wd_lifecheck_user
wd_lifecheck_password
sr_check_user
sr_check_password
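If all of these checks connect to the backends as the same database user, the fragment can be kept consistent, for example as below; the user name and password are placeholders:

```
sr_check_user = 'postgres'
sr_check_password = 'yourpassword'
health_check_user = 'postgres'
health_check_password = 'yourpassword'
recovery_user = 'postgres'
recovery_password = 'yourpassword'
wd_lifecheck_user = 'postgres'
wd_lifecheck_password = 'yourpassword'
```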


I hope you have followed the step of altering the postgres user password:


alter user postgres password 'yourpassword'

and ensure you give the same password in all cases.


From the logs, it looks like an authentication issue. Can you tell me the version of pgpool you are using?


These are the configurations we are using for a setup with 3 machines (1 master, 1 slave, and 1 for pgpool); I have modified them to match your IP addresses:


 listen_addresses = '*'
  port = 5433
  socket_dir = '/var/run/postgresql'
  pcp_port = 9898
  pcp_socket_dir = '/var/run/postgresql'

  backend_hostname0 = '192.168.0.2'
  backend_port0 = 5432
  backend_weight0 = 1
  backend_data_directory0 = '/var/lib/postgresql/9.4/main'
  backend_flag0 = 'ALLOW_TO_FAILOVER'

  backend_hostname1 = '192.168.0.3'
  backend_port1 = 5432
  backend_weight1 = 1
  backend_data_directory1 = '/var/lib/postgresql/9.4/main'
  backend_flag1 = 'ALLOW_TO_FAILOVER'

  enable_pool_hba = on
  pool_passwd = ''
  authentication_timeout = 60
  ssl = off
  num_init_children = 4
  max_pool = 2
  child_life_time = 300 
  child_max_connections = 0
  connection_life_time = 0
  client_idle_limit = 0
  log_destination = 'stderr,syslog'
  print_timestamp = on
  log_connections = on
  log_hostname = on
  log_statement = on
  log_per_node_statement = on
  log_standby_delay = 'none'
  syslog_facility = 'LOCAL0'
  syslog_ident = 'pgpool'
  debug_level = 1
  pid_file_name = '/var/run/postgresql/pgpool.pid'
  logdir = '/var/log/postgresql'
  connection_cache = on
  reset_query_list = 'ABORT; DISCARD ALL'

  replication_mode = off
  replicate_select = off
  insert_lock = on
  lobj_lock_table = ''
  replication_stop_on_mismatch = off
  failover_if_affected_tuples_mismatch = off

  load_balance_mode = off
  ignore_leading_white_space = on
  white_function_list = ''
  black_function_list = 'nextval,setval'

  master_slave_mode = on
  master_slave_sub_mode = 'stream'
  sr_check_period = 10
  sr_check_user = 'postgres'
  sr_check_password = 'postgres123'
  delay_threshold = 0
  follow_master_command = ''
  parallel_mode = off
  pgpool2_hostname = 'pgmaster'

  system_db_hostname  = 'localhost'
  system_db_port = 5432
  system_db_dbname = 'pgpool'
  system_db_schema = 'pgpool_catalog'
  system_db_user = 'pgpool'
  system_db_password = ''

  health_check_period = 5
  health_check_timeout = 20
  health_check_user = 'postgres'
  health_check_password = 'postgres123'
  health_check_max_retries = 2
  health_check_retry_delay = 1

  failover_command = '/usr/sbin/failover_modified.sh %d "%H" %P /var/lib/postgresql/9.4/main/pgsql.trigger.5432'
  failback_command = ''
  fail_over_on_backend_error = on
  search_primary_node_timeout = 10

  recovery_user = 'postgres'
  recovery_password = 'postgres123'
  recovery_1st_stage_command = ''
  recovery_2nd_stage_command = ''
  recovery_timeout = 90
  client_idle_limit_in_recovery = 0

  use_watchdog = off
  trusted_servers = ''
  ping_path = '/bin'
  wd_hostname = ''
  wd_port = 9000
  wd_authkey = ''
  delegate_IP = ''
  ifconfig_path = '/sbin'
  if_up_cmd = 'ifconfig eth0:0 inet $_IP_$ netmask 255.255.255.0'
  if_down_cmd = 'ifconfig eth0:0 down'
  arping_path = '/usr/sbin'  
  arping_cmd = 'arping -U $_IP_$ -w 1'

  clear_memqcache_on_escalation = on
  wd_escalation_command = ''

  wd_lifecheck_method = 'heartbeat'
  wd_interval = 10
  wd_heartbeat_port = 9694
  wd_heartbeat_keepalive = 2
  wd_heartbeat_deadtime = 30
  heartbeat_destination0 = '192.168.0.2'
  heartbeat_destination_port0 = 9694
  heartbeat_device0 = ''

  heartbeat_destination1 = '192.168.0.3'
  wd_life_point = 3
  wd_lifecheck_query = 'SELECT 1'
  wd_lifecheck_dbname = 'postgres'
  wd_lifecheck_user = 'postgres'
  wd_lifecheck_password = 'postgres123'

  relcache_expire = 0
  relcache_size = 256
  check_temp_table = on

  memory_cache_enabled = off
  memqcache_method = 'shmem'
  memqcache_memcached_host = 'localhost'
  memqcache_memcached_port = 11211
  memqcache_total_size = 67108864
  memqcache_max_num_cache = 1000000
  memqcache_expire = 0
  memqcache_auto_cache_invalidation = on
  memqcache_maxcache = 409600
  memqcache_cache_block_size = 1048576
  memqcache_oiddir = '/var/log/pgpool/oiddir'
  white_memqcache_table_list = ''
  black_memqcache_table_list = ''

Also, I hope you have modified pool_hba.conf to enable access to the master and slave.

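As a sketch, a pool_hba.conf entry allowing the application subnet might look like the line below. The subnet and auth method are assumptions; note that with md5 authentication pgpool also needs matching entries in the pool_passwd file.

```
# TYPE  DATABASE  USER  CIDR-ADDRESS     METHOD
host    all       all   192.168.0.0/24   md5
```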
