Monday, June 9, 2014

Brocade Super X: 99% CPU Utilization Problem

Antonio, another engineer I work with, and I went out this weekend to troubleshoot something that has been a real problem for some time.  The customer has several racks full of servers with dual FastIrons with 10G uplinks back to the dual core of SuperXs.  It's a nice setup.  However, one switch in particular has not had redundant links due to "the problem" that occurs whenever the second link is in place.  The report given to us is that the network is brought to its knees. (Sounds like a loop, right?)  Here is the topology:

The problem:  the top of rack switch showed input errors incrementing by the thousands per second on the uplink.  This flooded the link, caused CPU utilization to spike to 99% on the core switches only, and brought the network to its knees from a performance perspective.
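A figure like "thousands per second" comes from reading the interface's input error counter twice and dividing the difference by the time between readings.  Here is a minimal Python sketch of that arithmetic, assuming the two readings were copied by hand from the switch.  The function name and the sample numbers are hypothetical and are not values from this site.

```python
def errors_per_second(first_count, second_count, interval_seconds):
    """Return the average input-error rate between two counter readings."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if second_count < first_count:
        # The counter was probably cleared or wrapped between readings.
        raise ValueError("counter went backwards between readings")
    return (second_count - first_count) / interval_seconds


# Hypothetical readings taken 10 seconds apart (not real data from this site).
rate = errors_per_second(120_000, 165_000, 10)
print(f"{rate:.0f} input errors per second")
```

A healthy link should show a rate at or near zero, so any sustained nonzero result is worth chasing.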

The solution:  Antonio found "dual-mode 1" configured on the 7/1 interface on the SuperX, which is used to untag vlan 1 traffic across the uplink to the top of rack switch.  The top of rack switch did not have that configuration on its side (Brocade to Brocade does not need that config).  This mismatch should have caused a "network outage" on that link, meaning it should not have passed any traffic at all, since the command was configured on one side and not the other.  Instead, we found that input errors incremented by the thousands per second across the uplink, causing the CPU on the core switches to spike to 99% utilization.  Once we took the "dual-mode 1" command off of the interface, we no longer experienced the CPU spike.  This is not the normal behavior for a config issue of this sort.
We believe that the core switches have a software bug that is triggered by the "dual-mode 1" configuration when the switch at the other end of the link is not configured for "dual-mode 1".  We believe this because when we reverse the setup, meaning we configure "dual-mode 1" on the top of rack switch and leave it off of the core switches, we do not see any spike in CPU utilization on either side, and we cannot get any traffic across the link, as would be expected.  As proof, we were able to ping when the top of rack side was configured correctly, and not able to ping when it was not.  Given that, we do not expect to be able to resolve the bug without upgrading the core switches to a later firmware version.
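A mismatch like this is easy to miss by eye, so it can help to compare both ends of a link with a script.  The following is a rough Python sketch, assuming the running configs have been saved as text and that dual-mode appears as a line under its "interface ethernet" stanza.  The helper names, the config excerpts, and the top of rack port "1/3/1" are hypothetical; only "dual-mode 1" on 7/1 comes from what we actually found.

```python
import re


def dual_mode_ports(config_text):
    """Map interface name -> dual-mode line found under it in a config excerpt."""
    ports = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        match = re.match(r"interface ethernet (\S+)$", line)
        if match:
            current = match.group(1)
        elif line == "!":
            # "!" closes the current interface stanza.
            current = None
        elif current and line.startswith("dual-mode"):
            ports[current] = line
    return ports


def link_mismatch(config_a, port_a, config_b, port_b):
    """Return True when the two ends of a link disagree about dual-mode."""
    side_a = dual_mode_ports(config_a).get(port_a)
    side_b = dual_mode_ports(config_b).get(port_b)
    return side_a != side_b


# Hypothetical excerpts; the ToR port name is made up for illustration.
CORE = """interface ethernet 7/1
 dual-mode 1
!
"""
TOR = """interface ethernet 1/3/1
!
"""

print(link_mismatch(CORE, "7/1", TOR, "1/3/1"))
```

The check only compares the dual-mode lines themselves.  It says nothing about whether the VLAN tagging on either side is otherwise correct.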

The core switches are currently running version SXL05100c.  We tried to upgrade to version SXR07202k, but the copy failed with the following error:
BR-DC_CORE_1#copy tftp flash SXR07202k.bin primary
Router Code requires correct license PROM to be installed in the system
The code sub type 3 is not correct for the target hardware, abort!
File Type Check Failed

TFTP to Flash Error - code 8

We will need to work with Brocade to find out which version we can upgrade to without this message, or to obtain the necessary license.

Things we have verified are in good shape:
1.  Fiber patch cable on both sides of the link, meaning on the core side and the top of rack side.
2.  GBICs on both sides of the link.
3.  GBIC modules in the core switches.
4.  Fiber cabling from top of rack switch to Core switch.
