emergency EMS event netapp

Ivan · 07.07.2019

Hello! Our FAS 2650 emailed the following alert:

Code:

spm.mgwd.process.exit: Management Gateway (mgwd) subsystem with ID 1986 exited as a result of signal normal exit (0). The subsystem will attempt to restart.

Description: This message occurs when the Management Gateway (mgwd) subsystem involuntarily exits. The subsystem provides administration services to manage the cluster and current node. This failure can disrupt administrative tasks that are being performed on the current node. The system attempts to recover by restarting the subsystem. While this situation persists, if a Cluster Management Logical Interface (LIF) is hosted on the current node, migrate it to another node in the cluster using another node management LIF or console access to another node. If the subsystem does not recover within the threshold number of retries, then an AutoSupport message is sent.
Action:
In the rare event that the Management Gateway (mgwd) subsystem has terminated unexpectedly, the automatic restart of the process will often stabilize the situation. While this situation persists, if a Cluster Management Logical Interface (LIF) is hosted on the current node, migrate it to another node in the cluster using another node management LIF or console access to another node. In some cases, there is a systemic issue that can be cleared up by rebooting the node when multiple restarts do not stabilize mgwd. This might be done automatically by the Node Watchdog service for persistent issues within a short window of time. In other cases, recent changes in logging, configuration or external management activities can be contributing factors to the unexpected terminations. Consider reverting one or more of these changes. If the issue persists, contact NetApp technical support.

Can anyone suggest what this might be? Everything appears to be working, and there are no other errors.

deadushka · 08.07.2019

Try waiting for a while; it looks like some process glitched.
This message appears when the Management Gateway (mgwd) subsystem exits involuntarily. The subsystem provides administration services for managing the cluster and the current node. The failure can disrupt administrative tasks running on the current node. The system tries to recover by restarting the subsystem. While the situation persists, if a Cluster Management Logical Interface (LIF) is hosted on the current node, migrate it to another node in the cluster using another node management LIF or console access to another node. If the subsystem does not recover within the threshold number of retries, an AutoSupport message is sent.
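
If you want to track how often this happens, the alert line can be split into its parts mechanically. Below is a minimal Python sketch, assuming the message always has the wording quoted in this thread; the pattern and the names `MGWD_EXIT`, `MgwdExit` and `parse_mgwd_exit` are made up for illustration and are not part of any NetApp tool.

```python
import re
from typing import NamedTuple, Optional

# Pattern written against the single alert line quoted in this thread.
# Assumption: other ONTAP releases may word the message differently.
MGWD_EXIT = re.compile(
    r"^(?P<event>[\w.]+): .*?subsystem with ID (?P<pid>\d+) exited "
    r"as a result of signal (?P<reason>.+?) \((?P<code>-?\d+)\)\."
)


class MgwdExit(NamedTuple):
    event: str         # EMS event name, e.g. spm.mgwd.process.exit
    subsystem_id: int  # the "ID" reported in the message
    reason: str        # text before the parenthesised code
    code: int          # number inside the parentheses


def parse_mgwd_exit(line: str) -> Optional[MgwdExit]:
    """Return the parsed fields, or None if the line does not match."""
    m = MGWD_EXIT.match(line.strip())
    if m is None:
        return None
    return MgwdExit(m["event"], int(m["pid"]), m["reason"], int(m["code"]))


# The alert exactly as posted above.
ALERT = (
    "spm.mgwd.process.exit: Management Gateway (mgwd) subsystem with ID 1986 "
    "exited as a result of signal normal exit (0). "
    "The subsystem will attempt to restart."
)

if __name__ == "__main__":
    print(parse_mgwd_exit(ALERT))
```

Lines that do not match return `None`, so the function can be run over a whole mail folder or log export without raising on unrelated entries.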

Ivan · 08.07.2019

The error went away after about a day. I'm still curious what could have caused it, so I'll go look through a more detailed log.

Goblin · 09.07.2019

The cause may be high network load.

Title
[Impact: High] In a FAS2600 or FAS2700 series storage management network, a high network load might cause the expander management daemon in the SP or BMC to hang

Summary
In a FAS2600 or FAS2700 series storage management network, a high network load might cause the expander management daemon in the SP or BMC to hang resulting in the following:

The SP or BMC cannot obtain fan speed readings from the expander; and
The expander cannot obtain CPU/battery sensor readings from the SP or BMC.

This issue is tracked in BUG 1083414.

Impact: High – Node disruption and potential loss of access.

Issue Description
Excessive ingress broadcast traffic, multicast traffic, or in some cases unicast traffic might result in a high network load which can cause the expander management daemon to hang. To mitigate the issue, the ingress rate of broadcast and multicast network traffic has been limited to 1.2Mb on the wrench port via BURT 1217187 (FAS26x0/AFF-A200) and BURT 1226558 (FAS27x0/AFF-A220).
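
To judge whether the management network is noisy in this sense, it helps to see how much of the captured traffic is broadcast or multicast. The sketch below is only an illustration: it classifies destination MAC addresses using the standard Ethernet rules (all-ones address is broadcast, group bit set is multicast). The function names and the sample addresses are made up, and how you obtain the addresses (for example from a packet capture on the switch port) is left open.

```python
from collections import Counter
from typing import Iterable


def classify_dst_mac(mac: str) -> str:
    """Classify a 48-bit destination MAC as broadcast, multicast or unicast."""
    octets = [int(part, 16) for part in mac.replace("-", ":").split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 0xFF for o in octets):
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    if all(o == 0xFF for o in octets):
        return "broadcast"
    # The least significant bit of the first octet is the group (multicast) bit.
    if octets[0] & 0x01:
        return "multicast"
    return "unicast"


def traffic_mix(dst_macs: Iterable[str]) -> Counter:
    """Count frames per class for a sequence of destination MAC addresses."""
    return Counter(classify_dst_mac(m) for m in dst_macs)


if __name__ == "__main__":
    # Made-up sample addresses, one of each class.
    sample = ["ff:ff:ff:ff:ff:ff", "01:00:5e:00:00:fb", "02:00:00:12:34:56"]
    print(traffic_mix(sample))
```

This counts frames, not bytes, so it shows the traffic mix rather than the actual ingress rate.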

Symptom
A high network load might cause the expander management daemon in the SP or BMC to hang resulting in the following:

The SP or BMC cannot obtain fan speed readings from the expander which will cause ONTAP to shut down; and
The expander cannot obtain CPU/battery sensor readings from the SP or BMC, so the expander will drive and maintain a fan at maximum speed.

Workaround
If possible, the wrench port should be put in a management network where it would experience lower traffic volume. Contact NetApp Support for help in implementing this solution.

Solution
Systems that have experienced this issue should upgrade their FW as soon as possible to either:

SP 5.6P1 for FAS26x0 or AFF-A200, or
BMC 11.3P1 for FAS27x0 or AFF-A220.
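
A quick way to check an installed firmware string against the versions listed above is sketched here in Python. Two assumptions are built in and neither is confirmed by the KB text: that version strings follow the `major.minor[Ppatch]` shape seen above, and that any later version also carries the fix. The names `FIXED`, `parse_fw` and `has_fix` are hypothetical.

```python
import re
from typing import Tuple

# Fixed versions as quoted in the KB text above.
FIXED = {"SP": "5.6P1", "BMC": "11.3P1"}


def parse_fw(version: str) -> Tuple[int, int, int]:
    """Parse 'major.minor' or 'major.minorPpatch' into a comparable tuple.

    Assumption: no other version formats need to be handled.
    """
    m = re.fullmatch(r"(\d+)\.(\d+)(?:P(\d+))?", version.strip())
    if m is None:
        raise ValueError(f"unrecognised firmware version: {version!r}")
    major, minor, patch = m.groups()
    return int(major), int(minor), int(patch or 0)


def has_fix(kind: str, installed: str) -> bool:
    """True if the installed version is at or above the fixed one.

    Assumption: later releases keep the fix.
    """
    return parse_fw(installed) >= parse_fw(FIXED[kind])


if __name__ == "__main__":
    print(has_fix("SP", "5.5"), has_fix("BMC", "11.3P1"))
```

Unrecognised strings raise `ValueError` instead of being guessed at, so an unexpected format shows up immediately.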

Ivan · 09.07.2019

Thanks, I'll try moving management to a different subnet.
