VirtualWisdom Health - Loss of Sync and Loss of Signal issues

Loss of Sync and Loss of Signal events almost always occur together in the same time interval, so they are described together in this chapter.

What is a Loss of Sync Event?

A Loss of Sync event occurs when two devices lose the synchronization signal between them for a specific period of time. Fibre Channel clock synchronization between the transmitter and the receiver is achieved by an encoding technique that includes the clocking information in the data stream. If for any reason the data is corrupted (due to Encoding Errors or single-bit errors), clock synchronization may be lost and transmission may be interrupted until the clock synchronization is regained. The Fibre Channel standards dictate that three or more consecutive Encoding Errors will trigger a Loss of Sync condition.
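The three-consecutive-errors rule can be illustrated with a minimal sketch. It models only the rule as stated above, not the full receiver state machine defined by the standards. The function name and the list-of-booleans input are hypothetical and are not part of VirtualWisdom.

```python
def first_loss_of_sync(words_valid, threshold=3):
    """Return the index at which Loss of Sync would be declared, or None.

    words_valid: sequence of booleans, True when a received transmission
    word decoded cleanly, False when it produced an Encoding Error.
    threshold: consecutive Encoding Errors needed (three, per the text).
    """
    consecutive = 0
    for index, valid in enumerate(words_valid):
        if valid:
            consecutive = 0          # a clean word breaks the run of errors
        else:
            consecutive += 1
            if consecutive >= threshold:
                return index         # sync is considered lost here
    return None


# Two errors, a clean word, then three errors in a row.
stream = [True, False, False, True, False, False, False, True]
print(first_loss_of_sync(stream))    # index of the third consecutive error
```

Isolated Encoding Errors separated by clean words never reach the threshold in this model, which is why occasional bit errors do not by themselves produce a Loss of Sync event.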

What is a Loss of Signal Event?

A Loss of Signal event occurs when the amount of light on a receiving port drops below a predetermined threshold, to a point deemed too low for data to be considered valid.
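As a rough illustration, the threshold check amounts to comparing received optical power with the receiver's minimum sensitivity. The sketch below assumes power is measured in milliwatts and the threshold is given in dBm. The function names and the -20 dBm figure in the usage lines are hypothetical and do not come from any particular SFP data sheet.

```python
import math


def mw_to_dbm(power_mw):
    """Convert optical power from milliwatts to dBm (10 * log10 of mW)."""
    return 10.0 * math.log10(power_mw)


def is_loss_of_signal(rx_power_mw, min_sensitivity_dbm):
    """Return True when received light is below the receiver threshold."""
    if rx_power_mw <= 0:
        return True                  # no measurable light at all
    return mw_to_dbm(rx_power_mw) < min_sensitivity_dbm


# Hypothetical threshold of -20 dBm, for illustration only.
print(is_loss_of_signal(0.5, -20.0))     # about -3 dBm: signal is present
print(is_loss_of_signal(0.001, -20.0))   # -30 dBm: below the threshold
```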

Why are Loss of Sync/Signal Events a Problem?

Loss of Sync/Signal events usually indicate a physical issue on a link, or a server reboot or storage port reset. These events, and the related Encoding Errors which may accompany them, are important disruptions to connectivity. They cause performance problems by forcing devices to request and re-send frames which were corrupted. These unnecessary errors on the SAN cost precious switch CPU time, as each error is dealt with individually for no gain in fabric performance or stability.

Loss of Sync/Signal events may also lead to Aborts and multiple attempts by servers to access the same data. On storage ports, Loss of Sync/Signal events can induce long timeouts (30 seconds or more). Every effort should be made to track down these types of errors as they can seriously impact ISLs, highly-utilized links and application performance. Loss of Signal events can lead to a path going down and, if the HBA or storage port cannot fail over for some reason, may lead to application downtime.

Required to identify:

Network Switch Probe (software only).

What are Common Causes of Loss of Sync/Signal Events?

Since three or more Encoding Errors in a row will cause a Loss of Sync event, the same physical problems that cause Encoding Errors can also cause a Loss of Sync:

  • Faulty, dirty, mismatched or disconnected cables
  • Failing or dirty SFP transceivers
  • Failing or dirty patch panels
  • Poor cable management, exceeding minimum bend radius, kinked cables, etc.
  • An SFP losing its light unexpectedly
  • A device being reset or rebooted
  • A device being removed from or added to the loop

A Loss of Signal event is caused by an SFP losing its light unexpectedly, which can happen due to many of the reasons listed just above.

How to Spot a Loss of Sync/Signal Event

The Network Switch Probe keeps track of the number of Loss of Sync/Signal events detected by a switch. Each Loss of Signal event is flagged for the switch by an SFP when its light level drops below the specific SFP’s minimum receiver sensitivity. The events can be viewed and filtered on the VI - Health - Physical Layer report.

Correlating Loss of Sync/Signal Events with Other Events

Three or more consecutive Encoding Errors will cause a Loss of Sync event. Continued Loss of Sync events can also lead to Loss of Signal events and Abort Sequences. Operation of a Fibre Channel port is governed by a port state machine, which defines the operation of the port, including initialization, normal operation and how it responds to various error conditions. The error handling defined in the state machine is directly relevant to four key link metrics recorded by the VirtualWisdom Network Switch Probe: Loss of Sync, Loss of Signal, Link Reset and Link Failure. These metrics are closely interrelated and often occur together, so it is important to understand the relationship between them when interpreting data recorded by VirtualWisdom. Like Link Failure, Loss of Sync and Loss of Signal are port-level statistics, so they are recorded on both channels by the VirtualWisdom Network Switch Probe.

A Link Failure state is entered when either Loss of Sync or Loss of Signal persists for longer than the Receiver Transmitter Time Out Value (R_T_TOV), which is 100 ms. Thus a Link Failure should be considered a more serious condition than Loss of Sync or Loss of Signal, and will always be seen together with these metrics. Strictly, Loss of Signal and Loss of Sync precede the Link Failure. However, because the Network Switch Probe polls switches at intervals of 5 minutes or longer, the combined metrics will be seen in the same time period. A build-up of Loss of Sync events may also occur just prior to a Loss of Signal event.
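The timing relationship can be sketched as follows. This is an illustration under stated assumptions, not VirtualWisdom or switch firmware logic. The 100 ms R_T_TOV and the 5-minute polling interval come from the text, while the function names and return labels are hypothetical.

```python
R_T_TOV_MS = 100          # Receiver Transmitter Time Out Value, per the text
POLL_INTERVAL_S = 300     # 5-minute polling interval, per the text


def classify_outage(duration_ms, r_t_tov_ms=R_T_TOV_MS):
    """Classify a Loss of Sync/Signal condition by how long it persisted."""
    if duration_ms <= 0:
        return "none"
    if duration_ms > r_t_tov_ms:
        return "link_failure"        # persisted longer than R_T_TOV
    return "transient"               # recovered before R_T_TOV expired


def poll_bucket(timestamp_s, interval_s=POLL_INTERVAL_S):
    """Return the index of the polling interval containing a timestamp."""
    return int(timestamp_s // interval_s)


# A Loss of Sync at t = 1000.0 s escalating to Link Failure 250 ms later
# normally lands in the same polling interval.
print(classify_outage(250))
print(poll_bucket(1000.0) == poll_bucket(1000.25))
```

Because 100 ms is tiny compared with a 5-minute interval, the precursor event and the resulting Link Failure are reported together unless the sequence happens to straddle an interval boundary.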

A Link Reset event is triggered on link timeout as well as on completion of link initialization. Thus a Link Reset will occur when recovering from Link Failure. It is important to note that a Link Reset is not always an error condition - a port coming online will always reset as part of the initialization process.

If a server is rebooted, it is likely that a series of Loss of Sync, Loss of Signal and Link Reset events will occur on both HBAs of a single host, within the same summary interval.

The Link Reset event may also be followed by one or more Class 3 Discards as any exchanges in progress are dropped when the link goes down.

Loss of Sync events in the presence of Encoding Errors, Loss of Signal events or Link Reset events may indicate a Flapping HBA. Also known as a Flopping HBA, this is an active HBA port which randomly changes state because it has no SFP attached, or its SFP is uncovered with no cable attached. This can cause millions (or even billions) of Encoding Errors, creating a massive CPU overhead on the SAN switch. Resolving these events and errors proactively avoids many application slowdowns.
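The correlation patterns described in this section can be expressed as a simple heuristic over per-port counters for one summary interval. The sketch below is only an illustration: the PortInterval record, its field names and the one-million Encoding Error cut-off are assumptions, not VirtualWisdom metrics or thresholds.

```python
from dataclasses import dataclass


@dataclass
class PortInterval:
    """Hypothetical per-port counters for a single summary interval."""
    host: str
    port: str
    loss_of_sync: int = 0
    loss_of_signal: int = 0
    link_reset: int = 0
    encoding_errors: int = 0


FLAPPING_ENCODING_ERRORS = 1_000_000   # assumed cut-off for "millions"


def possible_flapping(p):
    """Loss of Sync alongside a very large number of Encoding Errors."""
    return p.loss_of_sync > 0 and p.encoding_errors >= FLAPPING_ENCODING_ERRORS


def possible_reboots(samples):
    """Hosts whose ports (two or more) all show sync loss, signal loss and reset."""
    by_host = {}
    for p in samples:
        by_host.setdefault(p.host, []).append(p)
    return sorted(
        host for host, ports in by_host.items()
        if len(ports) >= 2
        and all(p.loss_of_sync and p.loss_of_signal and p.link_reset for p in ports)
    )


samples = [
    PortInterval("srv1", "hba0", loss_of_sync=2, loss_of_signal=1, link_reset=1),
    PortInterval("srv1", "hba1", loss_of_sync=3, loss_of_signal=1, link_reset=1),
    PortInterval("srv2", "hba0", loss_of_sync=40, encoding_errors=5_000_000),
]
print(possible_reboots(samples))
print([p.host for p in samples if possible_flapping(p)])
```

A match is only a hint about where to look first. The change control log and the physical link still need to be checked to confirm the cause.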

How to Resolve Loss of Sync/Signal Events

Loss of Sync/Signal events often indicate physical link problems with cables, SFPs or patch panels. They may also be caused by devices timing out or being reset, rebooted, added or removed. Such events will occur from time to time with the moving of equipment or configuration changes. In those cases, corresponding change control log entries should always exist. The log may be the best place to begin tracking down the cause of a Loss of Sync/Signal event.

If the log does not indicate any intentional reconfiguration or other manipulation of equipment, cables or SFPs, there could be actual physical problems with the optics. In that case, replacing the cables and SFPs on the link may help.

Another possible source of Loss of Sync/Signal events is a port with nothing connected to it. This could be an active port with no SFP attached or with an uncovered SFP that has no cable attached. It could also be a port whose server has no HBA driver installed (and possibly has no running operating system). Every effort should be made to track down and disable such ports (and cover any uncovered SFPs), in order to eliminate the potential performance impact of Loss of Sync/Signal events (as well as Encoding Errors, Frame Errors and CRC Errors) which may be generated by them. This problem is also known as a Flapping HBA or Flopping HBA.

In many cases, resolving Loss of Sync/Signal events requires examining, testing, cleaning and/or replacing SFPs, cables or patch panels until the issues cease. Once the existing problems have been resolved, it is a good idea to establish alarms for any Loss of Sync/Signal events that occur on any link. Initially the alarms may be limited (using filters) to ISLs, then to storage ports, then to all ports as the overall health of the SAN improves. Creating multiple levels of notification will escalate the worst problems so they can be dealt with quickly.
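The phased alarm rollout and multi-level notification can be sketched as below. This is not VirtualWisdom alarm configuration. The port type labels, phase numbering and the warning and critical thresholds are hypothetical values chosen for illustration.

```python
# Rollout phases from the text: ISLs first, then storage ports, then all ports.
ROLLOUT_PHASES = [
    {"isl"},
    {"isl", "storage"},
    None,                 # None means every port type is in scope
]


def alarm_level(port_type, events, phase, warn_at=1, critical_at=10):
    """Return "warning", "critical" or None for a port in one interval.

    events: Loss of Sync/Signal events counted on the port in the interval.
    phase: index into ROLLOUT_PHASES selecting which ports are monitored.
    warn_at, critical_at: assumed escalation thresholds.
    """
    scope = ROLLOUT_PHASES[phase]
    if scope is not None and port_type not in scope:
        return None                   # port is filtered out in this phase
    if events >= critical_at:
        return "critical"
    if events >= warn_at:
        return "warning"
    return None


print(alarm_level("isl", 1, phase=0))     # ISLs alarm from the first phase
print(alarm_level("host", 12, phase=0))   # host ports are not yet in scope
print(alarm_level("host", 12, phase=2))   # all ports are in scope
```

Starting with a narrow scope keeps the alarm volume manageable while existing problems are cleaned up, and the second threshold lets the worst links escalate ahead of the rest.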