Author: passpark, 2022-05-15 15:34
Systems Engineer, Inspur Power Commercial Systems Co., Ltd.

Analyzing an ITW Fault on a Brocade Switch


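ITW: Invalid Transmission Words

In general, invalid transmission words (ITWs) are transmission errors on a Fibre Channel link. A transmission word is a 40-bit unit made up of four 10-bit transmission characters, which follows from the 8b/10b encoding that Fibre Channel uses at speeds up to 8 Gbps (16 Gbps and faster links use 64b/66b encoding instead). Unplugging or reseating a fiber cable can cause the ITW counter to rise. If the counter keeps climbing on a link that is in normal use and nobody has touched the cabling, however, the link itself is probably faulty and needs to be investigated.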
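To make the 40-bit structure concrete, here is a minimal Python sketch that splits a transmission word into its four 10-bit characters and applies a simple bit-balance test. It relies on one property of 8b/10b: every valid 10-bit code contains four, five, or six one-bits. This is a necessary condition only, not a full decoder, and the function names are illustrative assumptions rather than anything taken from Fabric OS.

```python
def split_transmission_word(word: int) -> list:
    """Split a 40-bit transmission word into four 10-bit characters, MSB first."""
    if not 0 <= word < (1 << 40):
        raise ValueError("a transmission word must fit in 40 bits")
    return [(word >> shift) & 0x3FF for shift in (30, 20, 10, 0)]


def plausible_character(char: int) -> bool:
    """Necessary (not sufficient) 8b/10b test: 4, 5, or 6 one-bits."""
    return bin(char).count("1") in (4, 5, 6)


def looks_invalid(word: int) -> bool:
    """Flag the word if any of its four characters fails the balance test."""
    return not all(plausible_character(c) for c in split_transmission_word(word))


if __name__ == "__main__":
    balanced = 0b1010101010  # five one-bits, passes the balance test
    good_word = sum(balanced << shift for shift in (30, 20, 10, 0))
    bad_word = good_word | 0x3FF  # force the last character to all ones
    print(looks_invalid(good_word))  # False
    print(looks_invalid(bad_word))   # True
```

A real receiver also checks running disparity and the full code table, so a word that passes this test can still be invalid.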

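MAPS (Monitoring and Alerting Policy Suite) is a health monitor for storage area networks (SANs), supported in Fabric OS 7.2.0 and later. It provides health monitoring and proactive alerting, which helps administrators spot likely faults early and act on them.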

Here is a real case:

GL8510_001:FID128:admin> switchshow -slot 3 |grep 22

34 3 2 312200 id N16 Online FC F-Port 50:06:0e:80:07:2f:35:40

166 3 22 31a640 id N16 Online FC F-Port 10:00:00:10:9b:58:af:b5

GL8510_001:FID128:admin> nodefind 31a640

Local:

Type Pid COS PortName NodeName SCR

N 31a640; 3;10:00:00:10:9b:58:af:b5;20:00:00:10:9b:58:af:b5; 0x00000003

SCR: Fabric-Detected Nx-Port-Detected

Fabric Port Name: 20:a6:88:94:71:43:b6:47

Permanent Port Name: 10:00:00:10:9b:58:af:b5

Device type: Physical Unknown(initiator/target)

Port Index: 166

Share Area: Yes

Redirect: No

Partial: No

LSAN: No

Slow Drain Device: No

Device link speed: 16G

Connected through AG: No

Real device behind AG: No

FCoE: No

Aliases: EQUHST00005723_H1

The suspect is the port with PID 31a640 (slot 3, port 22). The commands above list its detailed configuration.
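A PID such as 31a640 is a 24-bit Fibre Channel address, conventionally read as three bytes: domain, area, and port. The sketch below splits it accordingly. The function name decode_pid is a hypothetical helper, not a Fabric OS command.

```python
def decode_pid(pid_hex: str) -> dict:
    """Split a 24-bit Fibre Channel address into its three bytes."""
    pid = int(pid_hex, 16)
    if not 0 <= pid <= 0xFFFFFF:
        raise ValueError("a Fibre Channel address is 24 bits")
    return {
        "domain": (pid >> 16) & 0xFF,  # switch domain ID
        "area": (pid >> 8) & 0xFF,     # area byte
        "port": pid & 0xFF,            # low byte of the address
    }


if __name__ == "__main__":
    print(decode_pid("31a640"))  # {'domain': 49, 'area': 166, 'port': 64}
```

In this case the area byte (0xa6 = 166) matches the Port Index of 166 reported by nodefind. That correspondence depends on the addressing mode in use, so treat it as an observation about this output rather than a general rule.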

Use mapsdb --show all to check the MAPS dashboard. The output is as follows:

GL8510_001:FID128:admin> mapsdb --show all

1 Dashboard Information:

=======================

DB start time: Thu Aug 1 00:36:48 2019

Active policy: pab_cus_policy

Configured Notifications: RASLOG,SNMP,FENCE,SW_CRITICAL,SW_MARGINAL,SFP_MARGINAL

Fenced Ports : None

Decommissioned Ports : None

Fenced circuits : N/A

Quarantined Ports : None

Top Zoned PIDs : 0x3194c0(45) 0x3184c0(45) 0x31a4c0(38) 0x31b4c0(38) 0x311000(29)

2 Switch Health Report:

=======================

Current Switch Policy Status: HEALTHY

3.1 Summary Report:

===================

Category |Today |Last 7 days |


Port Health |No Errors |Out of operating range |

BE Port Health |No Errors |No Errors |

Extension GE Port Health |No Errors |No Errors |

Fru Health |In operating range |Out of operating range |

Security Violations |No Errors |No Errors |

Fabric State Changes |No Errors |Out of operating range |

Switch Resource |In operating range |In operating range |

Traffic Performance |In operating range |In operating range |

Extension Health |Not applicable |Not applicable |

Fabric Performance Impact|Out of operating range |Out of operating range |

3.2 Rules Affecting Health:

===========================

Category(Violation Count)|RepeatCount|Rule Name |Execution Time |Object |Triggered Value(Units)|


Port Health(9) |3 |defALL_OTHER_F_PORTSITW_40 |04/27/22 22:39:53|F-Port 3/22 |65 ITWs |

| | | |F-Port 3/22 |92 ITWs |

| | | |F-Port 3/22 |42 ITWs |

|4 |defALL_OTHER_F_PORTSITW_21 |04/27/22 22:39:53|F-Port 3/22 |65 ITWs |

| | | |F-Port 3/22 |92 ITWs |

| | | |F-Port 3/22 |32 ITWs |

| | | |F-Port 3/22 |25 ITWs |

|2 |defALL_OTHER_F_PORTSITW_21 |04/26/22 01:39:30|F-Port 3/22 |22 ITWs |

| | | |F-Port 3/22 |25 ITWs |

Fru Health(24) |2 |defALL_PORTSSFP_STATE_OUT |04/28/22 22:26:28|U-Port 3/22 |OUT |

| | | |U-Port 3/22 |OUT |

|2 |defALL_PORTSSFP_STATE_IN |04/28/22 22:27:00|U-Port 3/22 |IN |

| | | |U-Port 3/22 |IN |

|2 |defALL_PORTSSFP_STATE_IN |04/27/22 22:28:43|U-Port 3/22 |IN |

| | | |U-Port 3/22 |IN |

|2 |defALL_PORTSSFP_STATE_OUT |04/27/22 22:28:26|U-Port 3/22 |OUT |

| | | |U-Port 3/22 |OUT |

|8 |defALL_PORTSSFP_STATE_IN |04/22/22 19:43:00|U-Port 12/28 |IN |

| | | |U-Port 12/4 |IN |

| | | |U-Port 11/28 |IN |

| | | |U-Port 11/4 |IN |

| | | |U-Port 4/28 |IN |

Fabric State Changes(1) |1 |defSWITCHFLOGI_6 |04/24/22 20:09:30|Switch |8 Logins |

Fabric Performance Impact|2 |Tempr_ALL_HOST_PORTSTX_95 |04/29/22 05:14:46|F-Port 2/33 |95.72 % |

(106) | | | | | |

| | | |F-Port 2/33 |96.88 % |

|1 |Tempr_ALL_HOST_PORTSTX_95 |04/29/22 04:38:39|F-Port 2/33 |100.00 % |

|6 |defALL_PORTS_IO_LATENCY_CLE|04/29/22 04:32:22|E-Port 8/2 |IO_LATENCY_CLEAR |

| |AR | | | |

| | | |E-Port 8/0 |IO_LATENCY_CLEAR |

| | | |E-Port 5/3 |IO_LATENCY_CLEAR |

| | | |E-Port 5/2 |IO_LATENCY_CLEAR |

| | | |E-Port 5/1 |IO_LATENCY_CLEAR |

|5 |defALL_PORTS_IO_PERF_IMPACT|04/29/22 04:30:22|E-Port 8/2 |IO_PERF_IMPACT |

| |_UNQUAR | | | |

| | | |E-Port 5/2 |IO_PERF_IMPACT |

| | | |E-Port 8/0 |IO_PERF_IMPACT |

| | | |E-Port 5/3 |IO_PERF_IMPACT |

| | | |E-Port 5/1 |IO_PERF_IMPACT |

|1 |Tempr_ALL_HOST_PORTSTX_95 |04/29/22 01:02:52|F-Port 4/10 |95.24 % |

4 History Data:

===============

The output is long. The parts to focus on are:

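1. Configured Notifications: RASLOG,SNMP,FENCE,SW_CRITICAL,SW_MARGINAL,SFP_MARGINAL

This line lists the actions MAPS is configured to take when a rule is violated: logging to RASLOG, sending SNMP traps, fencing the port (FENCE), and marking the switch or an SFP as critical or marginal (SW_CRITICAL, SW_MARGINAL, SFP_MARGINAL). It describes configuration rather than a sequence of state changes, but it does show that SFP problems are among the conditions this policy can flag.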

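2. Summary Report

Over the last 7 days, Port Health, Fru Health, Fabric State Changes, and Fabric Performance Impact were all reported as Out of operating range. Out of operating range means that the monitored values went outside the acceptable range and rule violations were recorded. Taken together, these make a port fault quite likely.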

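3. Rules Affecting Health. This section lists the violations in more detail. Under Port Health, port 3/22 accounts for all of the ITW errors, with the counter values shown in the Triggered Value column. Further down there are SFP IN and OUT state changes on some ports and performance changes during the same period, whose impact is negligible. Overall, we concluded that a faulty SFP on the port caused the ITW errors and the related alerts that accompanied them.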
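The reading of the Rules Affecting Health table can be sketched in Python. The snippet below pulls the ITW samples out of the Port Health rows and groups them by port. The rows are copied from the output above. The regular expression assumes the pipe-separated layout shown in this article, and itw_samples_by_port is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Matches the "Object" and "Triggered Value" columns of an ITW row.
ITW_ROW = re.compile(r"\|\s*([A-Z]-Port \d+/\d+)\s*\|\s*(\d+) ITWs\s*\|")

SAMPLE = """\
Port Health(9) |3 |defALL_OTHER_F_PORTSITW_40 |04/27/22 22:39:53|F-Port 3/22 |65 ITWs |
| | | |F-Port 3/22 |92 ITWs |
| | | |F-Port 3/22 |42 ITWs |
|4 |defALL_OTHER_F_PORTSITW_21 |04/27/22 22:39:53|F-Port 3/22 |65 ITWs |
| | | |F-Port 3/22 |92 ITWs |
| | | |F-Port 3/22 |32 ITWs |
| | | |F-Port 3/22 |25 ITWs |
|2 |defALL_OTHER_F_PORTSITW_21 |04/26/22 01:39:30|F-Port 3/22 |22 ITWs |
| | | |F-Port 3/22 |25 ITWs |
Fru Health(24) |2 |defALL_PORTSSFP_STATE_OUT |04/28/22 22:26:28|U-Port 3/22 |OUT |
"""


def itw_samples_by_port(text: str) -> dict:
    """Collect the ITW trigger values listed for each port, in order."""
    samples = defaultdict(list)
    for line in text.splitlines():
        match = ITW_ROW.search(line)
        if match:  # rows without an ITW value (e.g. SFP state) are skipped
            samples[match.group(1)].append(int(match.group(2)))
    return dict(samples)


if __name__ == "__main__":
    for port, values in itw_samples_by_port(SAMPLE).items():
        print(port, "rows:", len(values), "max:", max(values))
    # F-Port 3/22 rows: 9 max: 92
```

The nine rows match the Port Health(9) violation count. Some values appear under more than one rule (65 and 92 are listed for both the _40 and _21 rules), so the rows should not be summed as if they were independent measurements.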

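Putting this together, the fault was traced to port 3/22. After the SFP in port 3/22 was replaced and the port was kept under observation, the errors did not recur and the problem was resolved.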
