Take the incident where PDSP node3 went down as an example:
==========================START=====================================
The complete solution for a server down or an instance crash
For example:
An alert comes in like the one below:
Then we need to check the server status and send mail to the Linux team (Lst-Techops.DLRS@nike.com) so they can help check or start the server (in most cases the server will come back up automatically).
Also send mail to the APP team to inform them of the server status (when the server is up and the service is back online, inform them again and have them check the application status).
(The application DL is listed here for reference.)
Below are some solutions and troubleshooting steps on the database side:
Try to log in to ora-plus-p-3.va2.b2c.nike.com.
The connection fails, so the node may be down.
Then log in to another cluster node, e.g. ora-plus-p-1.va2.b2c.nike.com, and check the database status:
oracle@ora-plus-p-1:PDSP1:/u01/home/oracle $ srvctl status database -d PDSP
Instance PDSP1 is running on node ora-plus-p-1
Instance PDSP2 is running on node ora-plus-p-2
Instance PDSP3 is not running on node ora-plus-p-3
Instance PDSP4 is running on node ora-plus-p-4
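The "is not running" lines can be picked out automatically instead of eyeballing the output. A minimal sketch (the `down_instances` helper name is ours, not a standard tool):

```shell
# Hypothetical helper: read "srvctl status database -d <db>" output on
# stdin and print only the instance names that are not running.
down_instances() {
  awk '/is not running/ {print $2}'
}

# Usage:
#   srvctl status database -d PDSP | down_instances
```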
After some time, or with the Linux team's help, the host comes back up.
Normally the CRS resources and database resources auto-start when the server starts.
Commands to check CRS resource and database status:
sudo /u01/root/11.2.0.4/grid/bin/crsctl status res -t (execute this using your own nikeid)
srvctl status database -d db_name
Make sure all necessary resources are online.
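To avoid reading the whole `crsctl status res -t` table by hand, a quick grep can surface anything OFFLINE. A sketch; the helper name is ours:

```shell
# Hypothetical helper: print resource-state lines containing OFFLINE from
# "crsctl status res -t" output fed on stdin.
offline_resources() {
  grep 'OFFLINE' || true   # "|| true" so an all-online cluster is not an error
}

# Usage (with your own nikeid, as above):
#   sudo /u01/root/11.2.0.4/grid/bin/crsctl status res -t | offline_resources
```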
If CRS is not started:
sudo /u01/root/11.2.0.4/grid/bin/crsctl enable crs (so CRS will auto-start when the node reboots)
sudo /u01/root/11.2.0.4/grid/bin/crsctl start crs
Then check whether any services need to be relocated.
For all Nike databases, the services are recorded on this shared drive:
\\NKE-WIN-NAS-P21nike.com\DCIT_DBA\Dataguard\
And for the PDSP services, the directory is
\\NKE-WIN-NAS-P21nike.com\DCIT_DBA\Dataguard\PDSP
First check which node each service is running on:
srvctl status service -d db_name
Then relocate the services to the right node.
In this case, we run the following:
srvctl relocate service -d PDSP -s PDSPBATCH -i PDSP4 -t PDSP3
srvctl relocate service -d PDSP -s PDSPMISCL -i PDSP4 -t PDSP3
srvctl relocate service -d PDSP -s PDSPNODE3 -i PDSP2 -t PDSP3
srvctl relocate service -d PDSP -s PDSPSOCIAL -i PDSP4 -t PDSP3
srvctl relocate service -d PDSP -s SLOGICSVC -i PDSP4 -t PDSP3
Finally, check that each service is on the proper node.
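This placement check can be scripted too. A sketch, assuming the 11.2-style output `Service <name> is running on instance(s) <list>` (the helper name is ours):

```shell
# Hypothetical helper: succeed only if the named service is reported on the
# expected instance. Feed "srvctl status service -d <db>" output on stdin.
service_on() {
  svc="$1"; inst="$2"
  grep "Service $svc " | grep -qw "$inst"
}

# Usage:
#   srvctl status service -d PDSP | service_on PDSPBATCH PDSP3 && echo "PDSPBATCH OK"
```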
Check Streams status:
select apply_name, status from dba_apply;
select capture_name, status from dba_capture;
If they are not ENABLED, start them as below, logged in as the strmadmin user:
exec DBMS_CAPTURE_ADM.START_CAPTURE(capture_name => 'Capture_name');
exec DBMS_APPLY_ADM.START_APPLY(apply_name => 'APPLY_name');
Then check the capture and apply status again and make sure they are started and working.
Run the query below and make sure CAPTURE_TIME keeps changing; that means the capture process is working well.
SELECT
c.CAPTURE_NAME,
to_char(CAPTURE_TIME, 'dd-mon-yy hh24:mi:ss') CAPTURE_TIME,
c.capture_message_number CAPTURE_MSG,
c.STATE,
c.TOTAL_MESSAGES_CAPTURED TOT_MESG_CAPTURE,
c.TOTAL_MESSAGES_ENQUEUED TOT_MESG_ENQUEUE,
SUBSTR(s.PROGRAM,INSTR(s.PROGRAM,'(')+1,4) PROCESS_NAME,
c.SID,
c.inst_id,
s.event
FROM GV$STREAMS_CAPTURE c, GV$SESSION s
WHERE c.SID = s.SID
and c.inst_id=s.inst_id
AND c.SERIAL# = s.SERIAL# order by c.CAPTURE_NAME;
Run the query below and make sure APPLY_CREATE_TIME keeps changing; that means the apply process is working well.
select ac.apply_name, ac.state,
to_char(applied_message_create_time, 'dd-mon-yyyy hh24:mi:ss') APPLY_CREATE_TIME,
round((sysdate-applied_message_create_time)*86400) "LATENCY_IN_SEC"
from dba_apply_progress ap,GV$STREAMS_APPLY_COORDINATOR ac
where ac.apply_name=ap.apply_name
order by apply_name;
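The LATENCY_IN_SEC column from the query above can feed a simple threshold check. A sketch (the helper name and the 300-second threshold are ours), expecting `apply_name latency` pairs, e.g. spooled from sqlplus:

```shell
# Hypothetical helper: print apply processes whose latency (column 2, in
# seconds) exceeds the threshold, and exit nonzero if any do.
check_apply_latency() {
  awk -v max="$1" '$2+0 > max {print $1 " is lagging by " $2 "s"; bad=1}
                   END {exit bad}'
}

# Usage (tune the threshold to your own SLA):
#   check_apply_latency 300 < apply_latency.txt
```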
Check agent status:
oracle@ora-plus-p-3:PDSP3:/u01/home/oracle $ cd /u01/app/oracle/agent12c/agent_inst/bin/
oracle@ora-plus-p-3:PDSP3:/u01/app/oracle/agent12c/agent_inst/bin $ ./emctl status agent
Oracle Enterprise Manager Cloud Control 12c Release 4
Copyright (c) 1996, 2014 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
Agent is Not Running
oracle@ora-plus-p-3:PDSP3:/u01/app/oracle/agent12c/agent_inst/bin $ ./emctl start agent
Oracle Enterprise Manager Cloud Control 12c Release 4
Copyright (c) 1996, 2014 Oracle Corporation. All rights reserved.
Starting agent .................. started.
oracle@ora-plus-p-3:PDSP3:/u01/app/oracle/agent12c/agent_inst/bin $ ./emctl status agent
Oracle Enterprise Manager Cloud Control 12c Release 4
…………………..
---------------------------------------------------------------
Agent is Running and Ready
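The check-then-start step above is easy to wrap. A sketch; the status strings match the emctl output shown, and the helper name is ours:

```shell
# Hypothetical helper: succeed (exit 0) if "emctl status agent" output on
# stdin says the agent is down and therefore needs "emctl start agent".
agent_needs_start() {
  ! grep -q 'Agent is Running'
}

# Usage:
#   cd /u01/app/oracle/agent12c/agent_inst/bin
#   ./emctl status agent | agent_needs_start && ./emctl start agent
```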
Check whether this is a GoldenGate node. Unfortunately, node3 is a GoldenGate node, so we need to
start the mgr/extract/pump processes that are ABENDED:
ggsci
start mgr
start xxxx
If they start successfully and are working well (the RBAs are moving), then we are lucky.
But in some cases they may fail to start, or hang after starting. We can refer to the document below
(this doc records solutions to many GoldenGate issues; it will be shared in another blog post).
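"RBA is moving" can be verified by taking two readings of `info <group>` in ggsci a few minutes apart and comparing them. A sketch; the RBA line format follows the usual GoldenGate `info` output, and the helper names are ours:

```shell
# Hypothetical helpers: pull the first RBA value out of ggsci "info" output
# on stdin, and compare two readings taken some minutes apart.
get_rba() {
  sed -n 's/.*RBA \([0-9][0-9]*\).*/\1/p' | head -1
}
rba_moving() {   # args: earlier_rba later_rba
  [ "$2" -gt "$1" ]
}

# Usage:
#   r1=$(echo "info r1_cp" | ggsci | get_rba); sleep 300
#   r2=$(echo "info r1_cp" | ggsci | get_rba)
#   rba_moving "$r1" "$r2" && echo "r1_cp RBA is moving"
```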
In this case, after starting r1_cp, r2_cp and r3_cp, the RBAs were not moving and the send status command timed out,
so they were probably hung.
We tried killing and restarting them, but that did not help either.
Then look at the database side:
SELECT s.sid, s.serial#, s.inst_id, s.sql_id, last_call_et "Run_in_sec", s.osuser "OS_user", s.machine, a.sql_text,
s.module, s.event, s.blocking_session
FROM gv$session s, gv$sqlarea a
WHERE s.sql_id = a.sql_id(+) and s.inst_id = a.inst_id(+) and s.status='ACTIVE' and s.username='GGADMIN'
and s.type='USER' order by last_call_et desc;
From the SQL result, we can see a lot of locks blocking the GoldenGate processes,
so we know the hung GoldenGate processes are caused by these blocking sessions.
After some time the locks still existed, so we sent mail to the APP team to ask whether these sessions could be killed.
With their permission, we killed the sessions and restarted the r1-r3 processes, and GoldenGate worked well again.
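Once the application team approves, the kill statements can be generated from the sid/serial#/inst_id columns of the query above. A sketch (the helper name is ours; the `@inst_id` form kills sessions across RAC instances on 11g and later):

```shell
# Hypothetical helper: turn "sid serial# inst_id" triples on stdin into
# ALTER SYSTEM KILL SESSION statements. Run the output in sqlplus only
# AFTER the application team has approved, as described above.
kill_statements() {
  awk '{printf "alter system kill session \047%s,%s,@%s\047 immediate;\n", $1, $2, $3}'
}

# Usage:
#   kill_statements < blockers.txt
```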
Meanwhile, we need to find out why this node rebooted.
We can usually find useful information on the surviving nodes.
In this case node3 rebooted, so I searched for information on node1 like below:
oracle@ora-plus-p-1:PDSP1:/u01/home/oracle $ cd /u01/app/11.2.0.4/grid/log/ora-plus-p-1/
oracle@ora-plus-p-1:PDSP1:/u01/app/11.2.0.4/grid/log/ora-plus-p-1 $ less alertora-plus-p-1.log
2016-10-18 09:46:01.809:
[cssd(2206)]CRS-1612:Network communication with node ora-plus-p-3 (3) missing for 50% of timeout interval. Removal of this node from cluster in 14.610 seconds
2016-10-18 09:46:09.858:
[cssd(2206)]CRS-1611:Network communication with node ora-plus-p-3 (3) missing for 75% of timeout interval. Removal of this node from cluster in 6.560 seconds
2016-10-18 09:46:13.860:
[cssd(2206)]CRS-1610:Network communication with node ora-plus-p-3 (3) missing for 90% of timeout interval. Removal of this node from cluster in 2.560 seconds
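The CRS-161x eviction-countdown messages can be pulled out of a surviving node's alert log in one pass. A sketch (the helper name is ours):

```shell
# Hypothetical helper: print CRS-16xx node-eviction messages mentioning the
# given node, from clusterware alert-log text fed on stdin.
eviction_messages() {
  grep 'CRS-16' | grep "$1"
}

# Usage:
#   eviction_messages ora-plus-p-3 < /u01/app/11.2.0.4/grid/log/ora-plus-p-1/alertora-plus-p-1.log
```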
We can also check node3's OSWatcher netstat files to confirm whether there were any network errors:
cd /cust/app/oracle/OSW/oswbb/archive/oswnetstat/
cat ora-plus-p-3.va2.b2c.nike.com_netstat_16.11.28.1400.dat|grep -in "receive errors"
cat ora-plus-p-3.va2.b2c.nike.com_netstat_16.11.28.1400.dat|grep -in timeout
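Since the "packet receive errors" counter in each snapshot is cumulative, the last snapshot in a window carries the running total. A sketch that extracts it (the helper name is ours):

```shell
# Hypothetical helper: print the last "packet receive errors" counter seen
# in the OSWatcher netstat snapshots fed on stdin (the counter is
# cumulative, so the last snapshot holds the running total).
last_recv_errors() {
  awk '/packet receive errors/ {last = $1} END {print last + 0}'
}

# Usage:
#   cat ora-plus-p-3.va2.b2c.nike.com_netstat_16.11.28.*.dat | last_recv_errors
```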
The commands above return many packet receive errors and timeout errors:
312009 packet receive errors
RcvbufErrors: 818
SndbufErrors: 6294
312009 packet receive errors
RcvbufErrors: 818
SndbufErrors: 6294
…
So as the next step, we need to work with the Linux team and the network team on this.
For other node-eviction cases, we can use the same method to troubleshoot.
======================ENDED==================================================