
Help request: HACMP fails to start on the second node


Environment: two p550 servers in an HACMP active/standby cluster, using EMC cx-40f Fibre Channel storage.

1. Before the OS upgrade, HA started and failed over normally.

2. The OS was upgraded from 5305 to 5312. oslevel reports 5300-10, mainly because one fileset was not brought up to level:

jtfmistest:/>#oslevel -rl 5300-12
Fileset                                 Actual Level       Recommended ML
-----------------------------------------------------------------------------
ifor_ls.html.en_US.base.cli             5.3.7.0            5.3.8.0


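3. HACMP was upgraded from 5400 to 5410 with the latest fix packages.

After the upgrade, HACMP fails to start on whichever node is started second (this can be node A or node B; only the start order matters). Running clstart shows the daemons starting normally, but the resources are not brought online and nothing is written to hacmp.out.

4. Rolling back all HACMP fix packages made no difference.

5. An attempt to remove the HACMP filesets failed, because the EMC PowerPath software depends on 4 of the HACMP 5.4 base filesets. The only option was to install HACMP 5410 again over the existing installation and then apply the latest fixes. As expected, the fault remained.

Note: rolling back the OS level has not been tested yet, because the OS patch upgrade was the main goal of this work. I hope to find the actual cause.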


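6. PowerPath version: 5.0.0 build 161

7. The EMC HA Custom Disk Methods are not configured, and no HA events are configured (neither was configured before the upgrade, and the cluster ran normally). The official guidance is that this configuration is not needed from HA 5.4 onward.

I have since tested with the above configured as well, and the fault remains.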


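Note: another cluster in our environment shows a similar problem. Its second node also fails to start, but it does start after the clstrmgrES process is restarted. In this environment a restart does not help. That cluster also uses EMC storage.

Could this problem be related to EMC PowerPath?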
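The `oslevel -rl 5300-12` listing in step 2 can be checked mechanically when it is longer than one row. The following is a minimal sketch, assuming the three-column layout shown in the post (fileset, actual level, recommended level); the function names and the sample constant are illustrative, not part of any AIX tool. It also shows why levels should be compared as integer tuples rather than as strings.

```python
import re

LEVEL = r"\d+(?:\.\d+){3}"
# One data row: fileset name, actual level, recommended level.
ROW = re.compile(rf"^\s*(\S+)\s+({LEVEL})\s+({LEVEL})\s*$")


def as_tuple(level):
    """Turn '5.4.1.12' into (5, 4, 1, 12) so levels compare numerically."""
    return tuple(int(part) for part in level.split("."))


def downlevel_filesets(text):
    """Return (fileset, actual, recommended) for rows where actual < recommended."""
    found = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m and as_tuple(m.group(2)) < as_tuple(m.group(3)):
            found.append((m.group(1), m.group(2), m.group(3)))
    return found


# Sample copied from the post (hypothetical variable name).
OSLEVEL_SAMPLE = """\
Fileset                                 Actual Level       Recommended ML
-----------------------------------------------------------------------------
ifor_ls.html.en_US.base.cli             5.3.7.0            5.3.8.0
"""

if __name__ == "__main__":
    for name, actual, wanted in downlevel_filesets(OSLEVEL_SAMPLE):
        print(f"{name}: {actual} is below {wanted}")
```

Comparing the raw strings would rank 5.4.1.7 above 5.4.1.12, which is why the sketch converts each level to a tuple first.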

[Attached image: 未命名.jpg]

clstat output on the second node:

Group Name     Group State     Application state     Node
--------------------------------------------------------------------------------
prod_res       OFFLINE   (poster's note: it is actually started; after restarting the clstrmgrES process the state displays correctly)
prodapp        OFFLINE
test_res       OFFLINE

Command: failed        stdout: yes           stderr: no

Before command completion, additional instructions may appear below.

[TOP]
cldump: Waiting for the Cluster SMUX peer (clstrmgrES)
to stabilize.............
(Poster's note: node 1 displays this normally; if HA is started on node 2 first, node 2 also displays it normally.)

Failed retrieving cluster information.

There are a number of possible causes:
clinfoES or snmpd subsystems are not active.
snmp is unresponsive.
snmp is not configured correctly.
Cluster services are not active on any nodes.

Refer to the HACMP Administration Guide for more information.

Basic cluster configuration (unchanged):

Cluster Name: jtfmiscl
Cluster Connection Authentication Mode: Standard
Cluster Message Authentication Mode: None
Cluster Message Encryption: None
Use Persistent Labels for Communication: No
There are 2 node(s) and 1 network(s) defined

NODE jtfmis:
    Network net_ether_01
        srv2    10.150.4.243
        srv1    10.150.4.242
        boot1   30.0.0.1
        stdby1  50.0.0.1

NODE jtfmistest:
    Network net_ether_01
        srv2    10.150.4.243
        srv1    10.150.4.242
        stdby2  50.0.0.2
        boot2   30.0.0.2

Resource Group prod_res
    Startup Policy        Online On Home Node Only
    Fallover Policy       Fallover To Next Priority Node In The List
    Fallback Policy       Never Fallback
    Participating Nodes   jtfmis jtfmistest
    Service IP Label      srv1

Resource Group test_res
    Startup Policy        Online On Home Node Only
    Fallover Policy       Fallover To Next Priority Node In The List
    Fallback Policy       Never Fallback
    Participating Nodes   jtfmistest jtfmis
    Service IP Label      srv2

Total Heartbeats Missed:        0
Cluster Topology Start Time:    03/26/2012 12:10:37

Cluster heartbeat status:

jtfmistest:/>#lssrc -ls topsvcs
Subsystem         Group            PID          Status
 topsvcs          topsvcs          1630276      active
Network Name   Indx Defd  Mbrs  St   Adapter ID      Group ID
net_ether_01_0 [ 0] 2     2     S    30.0.0.2        30.0.0.2
net_ether_01_0 [ 0] en0              0x476fec3f      0x476fec40
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent    : 99381 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 118263 ICMP 0 Dropped: 0
NIM's PID: 2424882
net_ether_01_1 [ 1] 2     2     S    50.0.0.2        50.0.0.2
net_ether_01_1 [ 1] en1              0x476fec41      0x476fec42
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent    : 99386 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 118265 ICMP 0 Dropped: 0
NIM's PID: 1753114
  2 locally connected Clients with PIDs:
haemd(1179800) hagsd(2031782)
  Fast Failure Detection available but off.
  Dead Man Switch Enabled:
     reset interval = 1 seconds
     trip  interval = 20 seconds
  Client Heartbeating Disabled.
  Configuration Instance = 44
  Daemon employs no security
  Segments pinned: Text Data.
  Text segment size: 900 KB. Static data segment size: 1493 KB.
  Dynamic data segment size: 5377. Number of outstanding malloc: 139
  User time 8 sec. System time 5 sec.
  Number of page faults: 1. Process swapped out 0 times.
  Number of nodes up: 2. Number of nodes down: 0.
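The heartbeat output above can be summarized with a small script when comparing several nodes or snapshots. This is a minimal sketch that assumes the network lines have the layout shown in the post (name, index, Defd, Mbrs, St, adapter ID, group ID), and it reads St = S as "stable", which is my understanding of the Topology Services state column rather than something stated in the thread. The function names and the sample constant are illustrative.

```python
import re

# Matches the first line of each network entry, for example:
#   net_ether_01_0 [ 0] 2 2 S 30.0.0.2 30.0.0.2
# The second line of each entry (interface name and hex IDs) does not match.
NETWORK_LINE = re.compile(
    r"^\s*(?P<name>\S+)\s+\[\s*(?P<index>\d+)\]\s+"
    r"(?P<defined>\d+)\s+(?P<members>\d+)\s+(?P<state>[A-Z])\s+"
    r"(?P<adapter>\S+)\s+(?P<group>\S+)\s*$"
)


def parse_networks(text):
    """Extract one dict per heartbeat ring from `lssrc -ls topsvcs` text."""
    rows = []
    for line in text.splitlines():
        m = NETWORK_LINE.match(line)
        if m:
            row = m.groupdict()
            row["defined"] = int(row["defined"])
            row["members"] = int(row["members"])
            rows.append(row)
    return rows


def suspicious(rows):
    """Rings where not every defined adapter is a member, or state is not 'S'."""
    return [r for r in rows if r["members"] != r["defined"] or r["state"] != "S"]


# Sample copied from the post (hypothetical variable name).
TOPSVCS_SAMPLE = """\
Network Name   Indx Defd  Mbrs  St   Adapter ID      Group ID
net_ether_01_0 [ 0] 2     2     S    30.0.0.2        30.0.0.2
net_ether_01_0 [ 0] en0              0x476fec3f      0x476fec40
net_ether_01_1 [ 1] 2     2     S    50.0.0.2        50.0.0.2
net_ether_01_1 [ 1] en1              0x476fec41      0x476fec42
"""

if __name__ == "__main__":
    rings = parse_networks(TOPSVCS_SAMPLE)
    print(len(rings), "rings parsed,", len(suspicious(rings)), "suspicious")
```

For the output in this post, both rings show 2 defined and 2 member adapters with no missed heartbeats, so under these assumptions heartbeating itself looks healthy and the fault would have to lie elsewhere.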

HA fileset list:

Fileset                                 Level      State       Description
----------------------------------------------------------------------------
Path: /usr/lib/objrepos
  cluster.adt.es.client.include         5.4.1.0    COMMITTED   ES Client Include Files
  cluster.adt.es.client.samples.clinfo  5.4.1.0    COMMITTED   ES Client CLINFO Samples
  cluster.adt.es.client.samples.clstat  5.4.1.0    COMMITTED   ES Client Clstat Samples
  cluster.adt.es.client.samples.libcl   5.4.1.0    COMMITTED   ES Client LIBCL Samples
  cluster.adt.es.java.demo.monitor      5.4.1.0    COMMITTED   ES Web Based Monitor Demo
  cluster.doc.en_US.es.html             5.4.1.0    COMMITTED   HAES Web-based HTML Documentation - U.S. English
  cluster.doc.en_US.es.pdf              5.4.1.0    COMMITTED   HAES PDF Documentation - U.S. English
  cluster.es.cfs.rte                    5.4.1.0    COMMITTED   ES Cluster File System Support
  cluster.es.client.lib                 5.4.1.7    APPLIED     ES Client Libraries
  cluster.es.client.rte                 5.4.1.11   APPLIED     ES Client Runtime
  cluster.es.client.utils               5.4.1.10   APPLIED     ES Client Utilities
  cluster.es.client.wsm                 5.4.1.0    COMMITTED   Web based Smit
  cluster.es.cspoc.cmds                 5.4.1.12   APPLIED     ES CSPOC Commands
  cluster.es.cspoc.dsh                  5.4.1.0    APPLIED     ES CSPOC dsh
  cluster.es.cspoc.rte                  5.4.1.7    APPLIED     ES CSPOC Runtime Commands
  cluster.es.server.cfgast              5.4.1.0    COMMITTED   ES Two-Node Configuration Assistant
  cluster.es.server.diag                5.4.1.12   APPLIED     ES Server Diags
  cluster.es.server.events              5.4.1.12   APPLIED     ES Server Events
  cluster.es.server.rte                 5.4.1.12   APPLIED     ES Base Server Runtime
  cluster.es.server.simulator           5.4.1.0    COMMITTED   ES Cluster Simulator
  cluster.es.server.testtool            5.4.1.0    COMMITTED   ES Cluster Test Tool
  cluster.es.server.utils               5.4.1.12   APPLIED     ES Server Utilities
  cluster.license                       5.4.1.0    COMMITTED   HACMP Electronic License

Path: /etc/objrepos
  cluster.es.client.lib                 5.4.1.7    APPLIED     ES Client Libraries
  cluster.es.client.rte                 5.4.1.11   APPLIED     ES Client Runtime
  cluster.es.cspoc.rte                  5.4.0.0    COMMITTED   ES CSPOC Runtime Commands
  cluster.es.server.diag                5.4.0.0    COMMITTED   ES Server Diags
  cluster.es.server.events              5.4.0.0    COMMITTED   ES Server Events
  cluster.es.server.rte                 5.4.1.12   APPLIED     ES Base Server Runtime
  cluster.es.server.simulator           5.4.1.0    COMMITTED   ES Cluster Simulator
  cluster.es.server.utils               5.4.1.12   APPLIED     ES Server Utilities

Path: /usr/share/lib/objrepos
  cluster.man.en_US.es.data             5.4.1.0    COMMITTED   ES Man Pages - U.S. English

Note: PowerPath depends on the 4 HA filesets at level 5.4.0.0 in the table above.
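One detail in the fileset list is easy to miss by eye: several filesets are at a different level under /etc/objrepos (the root part) than under /usr/lib/objrepos (the usr part). The following is a minimal sketch that parses a listing in the layout shown above and reports such differences. It assumes the listing keeps the "Path:" header lines and tolerates long fileset names wrapping onto their own line; the function names and the sample constant are illustrative, and the sample is an excerpt of the post's list.

```python
import re

LEVEL = r"\d+(?:\.\d+){3}"
FULL_ROW = re.compile(rf"^\s*(\S+)\s+({LEVEL})\s+([A-Z]+)\b")   # name level state
LEVEL_ROW = re.compile(rf"^\s*({LEVEL})\s+([A-Z]+)\b")          # wrapped: level state
NAME_ONLY = re.compile(r"^\s*([A-Za-z][\w.]*\.[\w.]+)\s*$")     # wrapped: name alone


def parse_lslpp(text):
    """Return {path: {fileset: level}} from a fileset listing."""
    levels, path, pending = {}, None, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.strip().startswith("Path:"):
            path = line.split(":", 1)[1].strip()
            levels.setdefault(path, {})
            pending = None
            continue
        if path is None:
            continue
        m = FULL_ROW.match(line)
        if m:
            levels[path][m.group(1)] = m.group(2)
            pending = None
            continue
        m = LEVEL_ROW.match(line)
        if m and pending:
            levels[path][pending] = m.group(1)
            pending = None
            continue
        m = NAME_ONLY.match(line)
        pending = m.group(1) if m else None
    return levels


def usr_root_mismatches(levels, usr="/usr/lib/objrepos", root="/etc/objrepos"):
    """Filesets present in both parts whose levels differ: {name: (usr, root)}."""
    usr_part, root_part = levels.get(usr, {}), levels.get(root, {})
    return {
        name: (usr_part[name], root_part[name])
        for name in sorted(usr_part.keys() & root_part.keys())
        if usr_part[name] != root_part[name]
    }


# Excerpt of the list in the post (hypothetical variable name).
LSLPP_SAMPLE = """\
Path: /usr/lib/objrepos
  cluster.es.cspoc.rte       5.4.1.7   APPLIED    ES CSPOC Runtime Commands
  cluster.es.server.diag     5.4.1.12  APPLIED    ES Server Diags
  cluster.es.server.events   5.4.1.12  APPLIED    ES Server Events
  cluster.es.server.rte      5.4.1.12  APPLIED    ES Base Server Runtime
  cluster.es.server.simulator
                             5.4.1.0   COMMITTED  ES Cluster Simulator
Path: /etc/objrepos
  cluster.es.cspoc.rte       5.4.0.0   COMMITTED  ES CSPOC Runtime Commands
  cluster.es.server.diag     5.4.0.0   COMMITTED  ES Server Diags
  cluster.es.server.events   5.4.0.0   COMMITTED  ES Server Events
  cluster.es.server.rte      5.4.1.12  APPLIED    ES Base Server Runtime
"""

if __name__ == "__main__":
    for name, (usr, root) in usr_root_mismatches(parse_lslpp(LSLPP_SAMPLE)).items():
        print(f"{name}: usr part {usr}, root part {root}")
```

In the list as posted, cluster.es.cspoc.rte, cluster.es.server.diag and cluster.es.server.events differ between the two parts. Whether that is related to the startup failure is not established in this thread; on AIX, `lppchk -v` is the standard check for consistency between the root and usr parts and would be a reasonable next step.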

                             


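Reply from davy163pp (systems operations engineer, 天工), 2015-05-27, in reply to #23 lazyman:

Yes, I think so too. Too much network and system security hardening has been applied to our systems, and there is no way to undo it.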