The following guidelines are best practices for common tasks in an azeti SONARPLEX infrastructure. Make small changes, monitor the performance of your device, and adjust step by step. How well an optimization works depends on the details of your configuration. Test changes in a test environment before deploying them to your production system. You can use the free VAA (available on www.azeti.net) for testing.
This guide relates to SONARPLEX Generation 5.
- Agent Password: Use a strong azeti Agent Password, ideally more than 16 characters mixing letters, special characters and numbers (Administration Web Interface > Configuration > Network > Agent Configuration)
- IP Forward: Disable "IP Forwarding" (forwarding of IP packets between the network interfaces)
- SNMP Community: Change the default SNMP community strings, use strong strings instead of "public" (Administration Web Interface > Configuration > Network > SNMP Configuration and Configuration > System > Status Delivery Configuration for Distributed Monitoring)
- HTTPS: Enable HTTPS instead of unencrypted HTTP (Administration Web Interface > Configuration > Network > HTTP Configuration), optionally change the HTTPS port to something other than 443 or 444
- SSL Certificate: Use SSL certificates signed by a trusted CA instead of the default self-signed azeti certificate
- Passwords: Change the Administrator password and use a strong password (Administration Web Interface > Configuration > Accounts > Administrator (admin))
- Accounts: Delete all unnecessary accounts or at least disable the web and admin GUI access for unprivileged users (Administration Web Interface > Configuration > Accounts)
- Updates: Keep your SONARPLEX up to date, check portal.azeti.net regularly or enable Auto Updates if your SONARPLEX has an internet connection
- Timeouts: Set timeouts for service checks if available, and adjust them in small steps to find the appropriate values (they depend on the type and “cost” of the check, network latency and the environment). Ideally checks do not run longer than 3-5 seconds, so try a timeout of 5 seconds
- Check Intervals #1: Set check intervals wisely, don’t check everything every minute if there is no need for this (most performance issues come from unnecessarily short check intervals)
- Check Intervals #2: Slowly changing metrics can be checked with bigger intervals, e.g. HDD usage could be checked every 15-20 minutes instead of every 3 minutes
- Ping vs. ICMP: Use check_icmp instead of check_ping, this has a big impact on the performance
- Check only what you really need to check, and adjust parameters to get the needed information with the least effort. For example, don’t run check_icmp with 10 packets if you just want to know whether a connection is up, 3 packets should be enough
- Processes: Check http://<SONARPLEX-IP>/cgi-bin/ps.cgi to get detailed process and performance metrics, see the listing of every check with its individual execution time and latency to evaluate “evil” and “costly” service checks
- Performance Data: Enable performance data only if needed, processing and storing of the data is costly
- SLA: Enable SLA Processing (Administration Web Interface > Configuration > System > SLA) only if needed, the processing can produce a heavy load
- Logging: Disable the logging facilities if your system runs stable. Logging has an impact on the overall I/O resources, so only enable it if necessary. We have seen I/O boosts of up to 20% with logging disabled, especially on azeti 600M and azeti NG.
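The effect of the check interval recommendations can be estimated with simple arithmetic. The following is a minimal sketch, not an azeti tool: the helper name `executions_per_minute`, the check names and the service counts are hypothetical, and it assumes checks are spread evenly over their interval.

```python
def executions_per_minute(services):
    """Return the average number of check executions per minute.

    `services` maps a (hypothetical) check name to a tuple of
    (number of services, check interval in minutes).
    """
    return sum(count / interval for count, interval in services.values())


# Hypothetical setup: 200 disk usage checks and 100 ICMP checks.
before = {"disk_usage": (200, 3), "icmp": (100, 1)}
# Same setup with the slowly changing disk metric checked every 15 minutes.
after = {"disk_usage": (200, 15), "icmp": (100, 1)}

print(round(executions_per_minute(before), 1))
print(round(executions_per_minute(after), 1))
```

In this made-up example, moving the disk checks from a 3 minute to a 15 minute interval lowers the average from roughly 167 to roughly 113 executions per minute.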
Troubleshooting Distributed Monitoring
The logging capabilities help you to identify issues in any Distributed Monitoring setup. In addition, you can use the checklist below to rule out possible errors.
- Check the logs for information about the delivery and receipt of the status:
- NOC SONARPLEX Log: "Distributed Monitoring (NOC Processor)"
- Satellite SONARPLEX Log: "Distributed Monitoring (Satellite Processor)"
The return codes for the send and receive commands are logged and give hints whether the processing was successful. A return code other than 0 indicates a problem.
- Check the network connectivity
- Make sure both machines can reach each other at least on the azeti agent port (default 4192)
- Check your firewall configuration
- Check your routing if problems persist.
- Use the "Troubleshoot" function in azeti SONARMANAGER to ping and traceroute the devices from each other (right click on a SONARPLEX node in the tree view)
- Monitor the azeti agent availability from the satellites to the NOC SONARPLEX
- Configure the NOC appliance as a new host on your satellite and the other way around.
- Add a service check to verify if they can reach each other through the agent connection (default port 4192), use the check command check_azeti_uptime or check_azeti_agentversion for example
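For a quick manual test of the connectivity item in the checklist above, a small script can try to open a connection to the agent port. This is a sketch under assumptions: it assumes the azeti agent accepts TCP connections on the port (the text only gives the port number 4192), and the function name `agent_port_reachable` is made up for illustration. It does not replace the check_azeti_uptime or check_azeti_agentversion service checks.

```python
import socket


def agent_port_reachable(host, port=4192, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call (192.0.2.10 is a documentation placeholder address):
# agent_port_reachable("192.0.2.10")
```

Run it once on the NOC appliance's network against a satellite and once in the other direction. A `False` result points to a firewall, routing or agent problem, but it does not tell which one.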
SONARPLEX Performance Metrics
Beginning with SONARPLEX OS 3.7.0a, service checks for the performance of the SONARPLEX itself are added by default. You find them at the default host –azeti-A-. These checks help you to identify bottlenecks and to scale up in time.
CPU & MEM
The average CPU and MEM usage over time should range below 75%.
PROCS & LOAD
A high number of concurrent processes implies a configuration issue. Often this is caused by a service check command with a high execution time (5 seconds or more), which forces other processes to wait in the queue. This effect adds up and causes a large number of processes and a high load. The load is the number of processes that are waiting for system resources (I/O). The load should stay below 7 to 10.
Service check latency is the amount of time between the scheduled execution time and the actual execution time of a check. Ideally every service check has a latency of 0 seconds. Keep the latency below 10 seconds, better below 5. High latencies mean that there are too many concurrent service checks. This can be adjusted slightly by decreasing the concurrent service check number (Configuration > System > Load Configuration), but the ideal solution is to scale up with an additional SONARPLEX. Recommendation: < 5 seconds
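As a worked example of the latency definition, the following sketch subtracts the scheduled start of a check from its actual start. The timestamps and the function name `check_latency_seconds` are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime


def check_latency_seconds(scheduled, actual):
    """Latency = actual start time minus scheduled start time, in seconds."""
    return (actual - scheduled).total_seconds()


# Hypothetical timestamps for one service check.
scheduled = datetime(2024, 1, 1, 12, 0, 0)
actual = datetime(2024, 1, 1, 12, 0, 7)

latency = check_latency_seconds(scheduled, actual)
print(latency)      # 7.0
print(latency < 5)  # False, above the recommended 5 seconds
```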
The average service check execution time is an important performance metric because it shows the average cost of your checks. The shorter the execution time, the more service checks can be executed per minute. A high execution time implies slow service check commands, so look at each service to identify the slow and costly checks. Either optimize the service check plug-in or increase the service check interval to lower the overall execution time. Recommendation: < 3-5 seconds
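The recommendations in this section can be collected into one table of limits and compared against measured averages. This is only a sketch: the metric names, the sample values and the function `exceeded_limits` are hypothetical, and where the text gives a range (load 7 to 10, execution time 3-5 seconds) the upper bound is used as the limit.

```python
# Limits taken from the recommendations in this section; for ranges the
# upper bound is used (an assumption made for this sketch).
LIMITS = {
    "cpu_percent": 75.0,
    "mem_percent": 75.0,
    "load": 10.0,
    "latency_seconds": 5.0,
    "execution_seconds": 5.0,
}


def exceeded_limits(metrics):
    """Return the sorted names of metrics at or above their recommended limit."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in LIMITS and value >= LIMITS[name]
    )


# Hypothetical averages read from the performance checks.
sample = {
    "cpu_percent": 42.0,
    "mem_percent": 81.0,
    "load": 3.2,
    "latency_seconds": 6.5,
    "execution_seconds": 1.8,
}
print(exceeded_limits(sample))  # ['latency_seconds', 'mem_percent']
```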
SONARPLEX Virtual Appliance (VAA) Sizing
The appropriate sizing of a VAA depends heavily on the service check commands used, the service check intervals and the “cost” (execution time) of the service checks. Start with a small machine setup and scale it up as the load increases. Keep an eye on the most important performance metrics (service check latency and service check execution time). Below is a table with sizing recommendations depending on the number of services.
|Number of services||CPU||RAM|
|up to 100||Single Core, 500 MHz||512 MB|
|300||Dual Core, 1 GHz||1 GB|
|1000||Dual Core, 2 GHz||2 GB|
|3000||Multi CPU, Multi Core, 2.5 GHz or better||4 GB|
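The table can also be expressed as a small lookup. The sketch below makes two assumptions that the table does not state explicitly: every row is treated as an upper bound ("up to N services"), and no recommendation is returned beyond the last row. The function name `recommend_vaa` is hypothetical.

```python
# (maximum number of services, CPU, RAM) rows from the sizing table.
SIZING = [
    (100, "Single Core, 500 MHz", "512 MB"),
    (300, "Dual Core, 1 GHz", "1 GB"),
    (1000, "Dual Core, 2 GHz", "2 GB"),
    (3000, "Multi CPU, Multi Core, 2.5 GHz or better", "4 GB"),
]


def recommend_vaa(services):
    """Return (cpu, ram) for the smallest table row that covers `services`."""
    for max_services, cpu, ram in SIZING:
        if services <= max_services:
            return cpu, ram
    # Assumption: the table gives no recommendation above its last row.
    return None


print(recommend_vaa(250))  # ('Dual Core, 1 GHz', '1 GB')
```

Treat the result as a starting point only and adjust it based on the measured service check latency and execution time.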