To distinguish the watch dog documented here from the one implemented in Python, which has been part of the SiteController for years, the functional scope of both watch dogs is briefly recapped here.
The Python watch dog monitors the SiteController modules that are supposed to run according to the sensor configuration (a.k.a. Site template). It sends a special request telegram to the modules at regular intervals and expects each module to answer with its own internal health state. If an answer times out, the watch dog checks whether that module has a process in the operating system's process table.
If the process is missing, the watch dog declares the module crashed and immediately tries to start the process anew. If, by contrast, the watch dog finds a process, it assumes the process is hanging and reports this to the cloud server. It will not try to heal the situation automatically, because a process sometimes just needs a bit of extra time to finish processing a larger set of data.
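The decision described above can be sketched in a few lines of shell. This is an illustration only and not the Python watch dog's actual code: the function name `classify_module` and its arguments are hypothetical, and `kill -0` is used merely as one way to test whether a process exists (it assumes the caller is root or owns the process).

```shell
#!/bin/sh
# Hypothetical sketch of the per-module decision of the Python watch dog.
# $1: PID of the module process
# $2: "yes" if the module answered the health request in time
classify_module() {
    pid="$1"
    answered="$2"
    if [ "$answered" = "yes" ]; then
        echo "answered"   # the module reported its own health state
    elif kill -0 "$pid" 2>/dev/null; then
        echo "hanging"    # process exists but did not answer: report only
    else
        echo "crashed"    # no process found: restart the module
    fi
}

# Example: the current shell certainly exists, so an unanswered request
# is classified as "hanging" rather than "crashed".
classify_module "$$" no
```

Only the "crashed" case would lead to an automatic restart; "hanging" is merely reported, which matches the reasoning that a busy process may just need more time.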
The watch dog this document is about, however, checks the health of the entire system. It uses a dedicated hardware watch dog timer that needs to be reset (kicked) at regular intervals. If that timer runs out, the system is rebooted immediately.
This can bring a system back up after a complete crash without any operator intervention.
The NIFE103 comes with a chip (the NCT7904D) that, besides monitoring other critical system sensor parameters, provides a hardware based watch dog.
A (very rough) documentation of the NIFE103's watch dog functionality can be found in [1], and the documentation of the specific hardware monitoring chip in [2].
In short, the watch dog needs to be configured and kicked via SMBus ioctls. An attempt to do so directly from a Python script was unsuccessful, so the actual calls had to be implemented and compiled in C.
The binaries provided by this package must run on Nexcom NIFE103 hardware with a built-in NCT7904D chip that answers on the i2c-7 bus at address 0x2d.
Do not attempt to run these binaries on any other hardware, as they may permanently damage your system.
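As a small safeguard for the warning above, a script could refuse to continue when the expected I2C device node is absent. The sketch below is hypothetical: the helper name `i2c_bus_present` is made up, and it assumes the bus is exposed as `/dev/i2c-7`. It only checks for the device node and does not probe the bus, so it cannot confirm that an NCT7904D actually answers at 0x2d.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeeds only if the given path exists
# and is a character device. It does not talk to the chip.
i2c_bus_present() {
    [ -c "$1" ]
}

# Assumption: the i2c-7 bus shows up as /dev/i2c-7 on the target system.
if i2c_bus_present /dev/i2c-7; then
    echo "i2c-7 found"
else
    echo "i2c-7 missing: do not run the NIFE103-wdt binaries here"
fi
```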
The nct7904 kernel module needs to be blacklisted in the modprobe configuration.
i2c_i801 must NOT be blacklisted in the modprobe configuration.
If not already provided by your system's installed OS image, extract the files from the archive into your ${SITECONTROLLER_HOME}/scripts folder. This is usually located at /opt/azeti/SiteController/scripts:
    mkdir -p /opt/azeti/SiteController/scripts
    tar -xvzf NIFE103-control.tar.gz -C /opt/azeti/SiteController/scripts
Make sure the binaries have the correct ownership and permissions:
    chown root:root /opt/azeti/SiteController/scripts/NIFE103-wdt* \
        /opt/azeti/SiteController/scripts/watchdog.sh
    chmod 0700 /opt/azeti/SiteController/scripts/NIFE103-wdt* \
        /opt/azeti/SiteController/scripts/watchdog.sh
Double check nct7904 is blacklisted:
    grep "nct7904" /etc/modprobe.d/blacklist*
    /etc/modprobe.d/blacklist.conf:blacklist nct7904
If it is not blacklisted, add a blacklist nct7904 entry to /etc/modprobe.d/blacklist.conf, or remove the # at the beginning of the line if the entry was merely commented out.
Double check i2c_i801 is NOT blacklisted (note the #):
    grep "i2c_i801" /etc/modprobe.d/blacklist*
    /etc/modprobe.d/blacklist.conf:#blacklist i2c_i801
If it is blacklisted, add a # at the beginning of that line.
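The two blacklist checks can also be combined into one scripted test. The following is a sketch under assumptions: the helper names `is_blacklisted` and `check_watchdog_modules` are hypothetical and not part of the package, and only uncommented lines of the form `blacklist <module>` are treated as active entries.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if MODULE ($1) has an active (uncommented)
# "blacklist MODULE" line in any of the remaining file arguments.
is_blacklisted() {
    module="$1"
    shift
    grep -Eqs "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*(#.*)?\$" "$@"
}

# Checks both requirements: nct7904 blacklisted, i2c_i801 not blacklisted.
# $@: blacklist files to inspect
check_watchdog_modules() {
    is_blacklisted nct7904 "$@" || { echo "nct7904 is not blacklisted"; return 1; }
    if is_blacklisted i2c_i801 "$@"; then
        echo "i2c_i801 must not be blacklisted"
        return 1
    fi
    echo "module blacklist looks correct"
}
```

It could be called as `check_watchdog_modules /etc/modprobe.d/blacklist*`. Unlike the plain `grep` calls above, a commented-out entry such as `#blacklist i2c_i801` does not count as a match here.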
Modify the SiteController.cfg file (a.k.a. Site configuration) so that the following entries are present in section [remote_exec_calls]:
    [remote_exec_calls]
    watchdog_start = /opt/azeti/SiteController/scripts/watchdog.sh start
    watchdog_stop = /opt/azeti/SiteController/scripts/watchdog.sh stop
    watchdog_kick = /opt/azeti/SiteController/scripts/watchdog.sh kick
Upload the watchdog-NIFE103.template.xml file as a new component template to the cloud (if it is not already available there).
Add this component template to your Site template in the usual way.
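Before restarting, it may be worth verifying that the three [remote_exec_calls] entries listed above really ended up in the right section of the configuration file. The sketch below is one possible check; the helper names `has_remote_exec_call` and `missing_watchdog_calls` are hypothetical, the path to SiteController.cfg has to be supplied by the caller, and a plain INI-style `key = value` layout is assumed.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if KEY ($2) is defined inside the
# [remote_exec_calls] section of the configuration file FILE ($1).
has_remote_exec_call() {
    awk -v key="$2" '
        /^\[/ { in_section = ($0 ~ /^\[remote_exec_calls\]/) }  # track the current section
        in_section && ($0 ~ ("^[ \t]*" key "[ \t]*=")) { found = 1 }
        END { exit !found }
    ' "$1"
}

# Prints the name of every watch dog entry missing from FILE ($1).
missing_watchdog_calls() {
    for key in watchdog_start watchdog_stop watchdog_kick; do
        has_remote_exec_call "$1" "$key" || echo "$key"
    done
}
```

An empty output from `missing_watchdog_calls <path to SiteController.cfg>` would mean that all three entries are present.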
| File | Description |
|---|---|
| NIFE103-wdt-init | Initializes the watch dog timer (needs to be run once before using the watch dog). It accepts a single parameter that sets the watch dog timeout in minutes. If omitted, the timeout defaults to 10 minutes. |
| NIFE103-wdt-start | Starts the timer. |
| NIFE103-wdt-stop | Stops the timer. After this command the watch dog no longer guards the system. |
| NIFE103-wdt-reset_timer | Resets the timer and thus 'kicks' the watch dog. This binary also takes the timeout value as a command line parameter. As with NIFE103-wdt-init, the value is specified in minutes and defaults to 10 if the parameter is not provided. |
| README.md | This file you are reading right now. |
| watchdog.sh | Interface shell script to be used with the SiteController. You may want to edit the TIMEOUT_MINUTES variable in this script. |
| watchdog-NIFE103.template.xml | A component template that can be used to run the watch dog timer with the SiteController. If you changed TIMEOUT_MINUTES in watchdog.sh, you may also want to adapt the timer parameter in xpath('/component_template/ac_rules/rule[1]/timers/timer[1]/@delay') of watchdog-NIFE103.template.xml. |
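To illustrate how the files listed above fit together, here is a hypothetical sketch of a wrapper in the spirit of watchdog.sh. The shipped script may look different. The variables `SCRIPT_DIR` and `RUN` and the function name `watchdog` are assumptions, as is the idea that `start` runs the init binary before the start binary. `RUN` defaults to `echo`, so the sketch only prints the commands it would run and never touches the hardware.

```shell
#!/bin/sh
# Hypothetical sketch of a wrapper like watchdog.sh (not the shipped script).
SCRIPT_DIR="${SCRIPT_DIR:-/opt/azeti/SiteController/scripts}"
TIMEOUT_MINUTES="${TIMEOUT_MINUTES:-10}"
# Dry run by default: commands are printed instead of executed.
RUN="${RUN:-echo}"

watchdog() {
    case "$1" in
        start)
            # initialize with the timeout first, then start the timer
            $RUN "$SCRIPT_DIR/NIFE103-wdt-init" "$TIMEOUT_MINUTES" &&
            $RUN "$SCRIPT_DIR/NIFE103-wdt-start"
            ;;
        stop)
            $RUN "$SCRIPT_DIR/NIFE103-wdt-stop"
            ;;
        kick)
            # reset_timer takes the same timeout value as init
            $RUN "$SCRIPT_DIR/NIFE103-wdt-reset_timer" "$TIMEOUT_MINUTES"
            ;;
        *)
            echo "usage: watchdog {start|stop|kick}" >&2
            return 2
            ;;
    esac
}

# Prints the command that a kick would run.
watchdog kick
```

Keeping the timeout in a single variable means the init and reset_timer calls cannot drift apart, which is presumably why the real script exposes TIMEOUT_MINUTES for editing.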
The SiteController is now set up to monitor the system state. All that remains is to restart the SiteController. Once it has restarted, the hardware based watch dog is initialized and started, and the timer in the automation rule set triggers the watchdog_kick action every 120 seconds, which in turn issues the necessary ioctl calls to reset the timer.
Should the watch dog time out after 10 minutes without any such ioctl call, the system is rebooted.
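The relation between the two numbers above (a 10 minute timeout and a kick every 120 seconds) can be checked with a little arithmetic. The sketch below uses the values from the text; the variable name `KICK_INTERVAL_SECONDS` is hypothetical and stands for the timer delay configured in the component template.

```shell
#!/bin/sh
# Values taken from the text: 10 minute timeout, kick every 120 seconds.
TIMEOUT_MINUTES=10
KICK_INTERVAL_SECONDS=120

timeout_seconds=$((TIMEOUT_MINUTES * 60))
# Number of kick intervals that fit into one timeout window.
intervals_per_window=$((timeout_seconds / KICK_INTERVAL_SECONDS))

echo "timeout: ${timeout_seconds}s, kick interval: ${KICK_INTERVAL_SECONDS}s"
echo "kick intervals per timeout window: ${intervals_per_window}"

# A kick interval that is not shorter than the timeout would cause reboots
# even on a healthy system.
if [ "$KICK_INTERVAL_SECONDS" -ge "$timeout_seconds" ]; then
    echo "warning: kick interval must be shorter than the timeout" >&2
fi
```

With the default values, five kick intervals fit into one timeout window, so a single delayed kick does not cause a reboot. If TIMEOUT_MINUTES is lowered, the timer delay in the template should be reduced accordingly.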
Do not use the lm-sensors package to monitor the NIFE103 hardware sensors, because otherwise the kernel module locks access to the SMBus.
watchdogd
[1]: http://files.nexcom.com/Driver/NIFE103/User_Manual_NIFE103_170928.pdf (local copy)
[2]: https://www.nuvoton.com/resource-files/NCT7904D_Datasheet_V1.44.pdf (local copy)