登录数据库hang住

时间:2021-06-18 21:46:05
登录测试数据库,发现普通用户登录就hang住没反应
[oracle@rac2 ~]$ sqlplus scott/tiger
SQL*Plus: Release 11.2.0.1.0 Production on Wed Jan 15 19:22:25 2014
Copyright (c) 1982, 2009, Oracle.  All rights reserved.
。。。。
hang在这里没有反应

用sys登录没有异常
查看hang住的session 的event
SQL> select sid,username,event,p1,p2,p3 from v$session where status='ACTIVE' and username is not null;
       SID USERNAME                  EVENT                                                           P1         P2
---------- ------------------------- ------------------------------------------------------- ---------- ----------
        P3
----------
        36 SYS                       SQL*Net message to client                               1650815232          1
         0
        59 SCOTT                     write complete waits                                             5        176
         0
        76 SYS                       Streams AQ: waiting for messages in the queue                12884 1388097736

         1


一直都是write complete waits 等待
这个等待一般是由于IO性能引起的,或者dbwr进程工作量大导致,跟我的情况明显不同
测试库上没有跑任何作业或者其他的操作
操作系统的负载也很低,几乎没有
这个等待的P1 ,P2分别是file#,block#
SQL> select name,parameter1,parameter2,parameter3,wait_class from v$event_name where name='write complete waits';
NAME PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS
--------------------------------------------------------------------------------
write complete waits file# block#

查看等待的对象是什么
SQL> select owner,segment_name, segment_type
  2   from dba_extents
  3   where file_id = &file_id
  4    and &block_id between block_id and block_id + blocks - 1;
Enter value for file_id: 5
old   3:  where file_id = &file_id
new   3:  where file_id = 5
Enter value for block_id: 176
old   4:   and &block_id between block_id and block_id + blocks - 1
new   4:   and 176 between block_id and block_id + blocks - 1
OWNER                                                        SEGMENT_NAME         SEGMENT_TYPE
------------------------------------------------------------ -------------------- ------------------------------------
SYS                                                          _SYSSMU14_277467141$ TYPE2 UNDO

同时发现此时很多操作无法完成
如flush buffer_cahce,switch logfile等

查看alter日志发现:
Fri Jan 10 20:07:17 2014
Suspending MMON action 'undo usage' for 82800 seconds
Fri Jan 10 22:00:01 2014
Setting Resource Manager plan SCHEDULER[0x3007]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Fri Jan 10 22:00:01 2014
Starting background process VKRM
Fri Jan 10 22:00:01 2014
VKRM started with pid=38, OS id=13518
Fri Jan 10 22:34:33 2014
DW00 started with pid=51, OS id=14090, wid=2, job SCOTT.SYS_EXPORT_SCHEMA_01
Fri Jan 10 22:47:38 2014
Suspending MMON slave action kewrmrfsa_ for 82800 seconds
Fri Jan 10 22:49:52 2014
DW00 started with pid=54, OS id=14344, wid=2, job SCOTT.SYS_EXPORT_SCHEMA_01
Fri Jan 10 23:15:20 2014
Suspending MMON slave action kewrmafsa_ for 82800 seconds
Sat Jan 11 19:29:00 2014
Suspending MMON action 'undo usage' for 82800 seconds
Sat Jan 11 23:05:39 2014
Suspending MMON slave action kewrmafsa_ for 82800 seconds
Sun Jan 12 00:35:17 2014
Suspending MMON slave action kewrmrfsa_ for 82800 seconds
Sun Jan 12 18:43:51 2014
Suspending MMON action 'undo usage' for 82800 seconds
Sun Jan 12 23:11:10 2014
Suspending MMON slave action kewrmafsa_ for 82800 seconds
Mon Jan 13 00:35:04 2014
Suspending MMON slave action kewrmrfsa_ for 82800 seconds
Mon Jan 13 18:00:27 2014
Suspending MMON action 'undo usage' for 82800 seconds
Mon Jan 13 23:13:24 2014
Suspending MMON slave action kewrmafsa_ for 82800 seconds

发现2个值得注意的地方
1,Setting Resource Manager plan SCHEDULER[0x3007]:DEFAULT_MAINTENANCE_PLAN via scheduler window
2,Suspending MMON slave action kewrmrfsa_ for 82800 seconds
Suspending MMON action 'undo usage' for 82800 seconds

关于1,有文章说可能是Resource Manager plan引起的bug,
resouce manager的bug:
Oracle Support - March 31, 2012 10:35:20 AM GMT-04:00 [ODM Action Plan] 
The hang analyze shows several chains where the blocking session is waiting on resmgr:cpu quantum.
Please disable resource manager and check it the issue still occurs. 
这里跟我的情况不太一样

2,MMON进程异常,MMON进程跟awr有关,可以kill,这里是测试环境,所以我直接kill了这个进程
[oracle@rac2 trace]$ ps aux | grep mmon
grid      4414  0.0  0.6 419584 13084 ?        Ss   Jan10   0:00 asm_mmon_+ASM2
grid      4749  0.0  1.4 306724 30028 ?        Sl   Jan10   1:09 /opt/app/grid/product/11.2.0/jdk/jre//bin/java -Doracle.supercluster.cluster.server=eonsd -Djava.net.preferIPv4Stack=true -Djava.util.logging.config.file=/opt/app/grid/product/11.2.0/srvm/admin/logging.properties -classpath /opt/app/grid/product/11.2.0/jdk/jre//lib/rt.jar:/opt/app/grid/product/11.2.0/jlib/srvm.jar:/opt/app/grid/product/11.2.0/jlib/srvmhas.jar:/opt/app/grid/product/11.2.0/jlib/supercluster.jar:/opt/app/grid/product/11.2.0/jlib/supercluster-common.jar:/opt/app/grid/product/11.2.0/ons/lib/ons.jar oracle.supercluster.impl.cluster.EONSServerImpl
oracle    5054  0.0  4.8 992696 100980 ?       Ss   Jan10   0:06 ora_mmon_rac2
oracle   26959  0.0  0.0   3920   664 pts/5    R+   19:21   0:00 grep mmon

[oracle@rac2 trace]$ kill -9 5054

[oracle@rac2 trace]$ ps aux | grep mmon
grid      4414  0.0  0.6 419584 13084 ?        Ss   Jan10   0:00 asm_mmon_+ASM2
grid      4749  0.0  1.4 306724 30028 ?        Sl   Jan10   1:09 /opt/app/grid/product/11.2.0/jdk/jre//bin/java -Doracle.supercluster.cluster.server=eonsd -Djava.net.preferIPv4Stack=true -Djava.util.logging.config.file=/opt/app/grid/product/11.2.0/srvm/admin/logging.properties -classpath /opt/app/grid/product/11.2.0/jdk/jre//lib/rt.jar:/opt/app/grid/product/11.2.0/jlib/srvm.jar:/opt/app/grid/product/11.2.0/jlib/srvmhas.jar:/opt/app/grid/product/11.2.0/jlib/supercluster.jar:/opt/app/grid/product/11.2.0/jlib/supercluster-common.jar:/opt/app/grid/product/11.2.0/ons/lib/ons.jar oracle.supercluster.impl.cluster.EONSServerImpl
oracle   26961  0.0  0.0   3920   664 pts/5    R+   19:21   0:00 grep mmon

alter日志:
Wed Jan 15 19:23:45 2014
Restarting dead background process MMON
Wed Jan 15 19:23:45 2014
MMON started with pid=24, OS id=27085
Wed Jan 15 19:26:33 2014

但是kill后发现还是无法正常登录系统

检查是否有其他异常session的阻塞
SQL> select BLOCKING_INSTANCE,BLOCKING_SESSION from v$session where sid=59;
BLOCKING_INSTANCE BLOCKING_SESSION
----------------- ----------------
                2               18

SQL> select sid,username,event,BLOCKING_INSTANCE,BLOCKING_SESSION from v$session where sid=18;
no rows selected

session 59 被session 18阻塞,但是没有session 18的信息

决定重启一下rac2的实例
[grid@rac2 ~]$ srvctl stop instance -d rac -n rac2
这里等待了几分钟,发现没有任何反应

查看alter日志,只有一条信息:
Wed Jan 15 19:26:33 2014
PMON failed to delete process, see PMON trace file

检查pmon tracefile信息:
*** 2014-01-15 19:27:43.564
found process 0x52bcce2c pid=34 serial=6 ospid = 10671 dead

*** 2014-01-15 19:27:57.503
found process 0x52bcce2c pid=34 serial=6 ospid = 10671 dead

*** 2014-01-15 19:28:11.474
found process 0x52bcce2c pid=34 serial=6 ospid = 10671 dead

*** 2014-01-15 19:28:25.360
found process 0x52bcce2c pid=34 serial=6 ospid = 10671 dead

*** 2014-01-15 19:28:39.244
found process 0x52bcce2c pid=34 serial=6 ospid = 10671 dead

*** 2014-01-15 19:28:53.137
found process 0x52bcce2c pid=34 serial=6 ospid = 10671 dead
PMON waiting for 1000 csecs

*** 2014-01-15 19:29:06.990
found process 0x52bcce2c pid=34 serial=6 ospid = 10671 dead

*** 2014-01-15 19:29:06.990
deleting process 0x52bcce2c pid=34 serial=6 priority=0
need redo log switch, current log full
need redo log switch, current log full
need redo log switch, current log full
need redo log switch, current log full
need redo log switch, current log full
need redo log switch, current log full
deletion of process 52bcce2c pid=34 seq=6 prog=FALSE unsuccessful

从pmon信息可以看出来,need redo log switch, current log full
current redo log 需要切换,但是无法完成

查看归档日志信息:
SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            /home/oracle/archivelog
Oldest online log sequence     794
Next log sequence to archive   794
Current log sequence           796

SQL> select dest_name,status,error,target,process from v$archive_dest where dest_name=1;
DEST_NAME                      STATUS             ERROR
------------------------------ ------------------ --------------------
TARGET         PROCESS
-------------- --------------------
LOG_ARCHIVE_DEST_1             ERROR              ORA-19502: write
                                                  error on file "",
                                                  block number
                                                  (block size=)
PRIMARY        ARCH

SQL> ! df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              23G   16G  5.5G  75% /
/dev/sda1              99M   12M   83M  12% /boot
tmpfs                1014M  614M  400M  61% /dev/shm

发现归档路径有错误,ORA-19502
才查看磁盘,磁盘还有5个多G空间,完全没有问题,后来测试了磁盘的读写权限,也没有问题

这就比较奇怪
重新指定一个归档路径
SQL> alter system set log_archive_dest_1='location=/home/oracle/';

System altered

SQL> alter system switch logfile;

System altered.

发现可以正常切换日志了

此时普通用户也可以正常连接

[oracle@rac2 archivelog]$ sqlplus scott/tiger

SQL*Plus: Release 11.2.0.1.0 Production on Wed Jan 15 19:42:18 201

Copyright (c) 1982, 2009, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

SQL>