Engine Troubleshooting Guide

Introduction

This guide describes tools for troubleshooting the azeti Engine server stack. It requires a deep understanding of the system and is considered advanced reading.


General overview of how services are configured within an azeti Engine Installation

For a general overview, please read the general VM installation guide:

https://azetinetworks.atlassian.net/wiki/display/AC/azeti+Cloud+-+Install


Identifying Services on a given azeti Engine Installation

To identify the VMs/containers, query the *azeti-cfg* database.
To find out how to reach the *azeti-cfg* database, look at how Tomcat is configured to access it in the {{/etc/default/tomcat8}} file.

JAVA_OPTS="$JAVA_OPTS -Dazeti.ssc.config.jdbc.driverClassName=org.postgresql.Driver -Dazeti.ssc.config.jdbc.url=jdbc:postgresql://127.0.0.1:5432/azeti-cfg?ApplicationName=SSCConfig 
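
As a minimal sketch (not part of the azeti tooling), the connection details can be extracted from the Tomcat defaults file with a few lines of Python. The function name `find_jdbc_url` is hypothetical, and the sample text only repeats the option shown above with a closing quote added; on a real system the content of {{/etc/default/tomcat8}} would be read instead.

```python
import re
from urllib.parse import urlsplit

# Sample content mirroring the option shown above (assumption: the real
# file closes the JAVA_OPTS value with a double quote).
SAMPLE = (
    'JAVA_OPTS="$JAVA_OPTS '
    '-Dazeti.ssc.config.jdbc.driverClassName=org.postgresql.Driver '
    '-Dazeti.ssc.config.jdbc.url=jdbc:postgresql://127.0.0.1:5432/azeti-cfg'
    '?ApplicationName=SSCConfig"'
)


def find_jdbc_url(text):
    """Return host, port and database from the azeti.ssc.config.jdbc.url option."""
    match = re.search(r'-Dazeti\.ssc\.config\.jdbc\.url=jdbc:([^\s"]+)', text)
    if match is None:
        return None
    # What remains after "jdbc:" is an ordinary URL, e.g. postgresql://host:port/db?...
    parts = urlsplit(match.group(1))
    return {
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


print(find_jdbc_url(SAMPLE))
```

The returned host, port and database name are the values to pass to a PostgreSQL client when connecting to the configuration database.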


Then query the azeti_ssc_app_config table to see how the applications access the specific services:

select * from azeti_ssc_app_config;


Identifying Services on a given VM

To identify the services running on a VM, look at how they are monitored, started and stopped in the {{/etc/monit/conf.d/azeti_cloud_monit}} file.

For example, the following entry shows how PostgreSQL is monitored, started and stopped:

# Postgresql 9.4
check process postgresql with pidfile /var/lib/postgresql/9.4/main/postmaster.pid
group database
stop program = "/etc/init.d/postgresql stop"
start program = "/etc/init.d/postgresql restart"
if failed unixsocket /var/run/postgresql/.s.PGSQL.5432 protocol pgsql then start
if failed unixsocket /var/run/postgresql/.s.PGSQL.5432 protocol pgsql then alert
if failed host localhost port 5432 protocol pgsql then start
if failed host localhost port 5432 protocol pgsql then alert
if failed host localhost port 5432 protocol pgsql then exec /etc/monit/slack
if failed host localhost port 5432 protocol pgsql then exec /etc/monit/hipchat
if 5 restarts within 5 cycles then timeout
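
To get a quick list of every monitored service, the monit file can be scanned for its `check process` entries. The following is a rough sketch under stated assumptions: the function name `list_monit_services` is hypothetical, it only understands the layout shown above (a `check process ... with pidfile ...` line followed by quoted `start program` and `stop program` lines), and the sample is a shortened copy of the PostgreSQL entry.

```python
import re

# Shortened copy of the entry above; on a real VM read
# /etc/monit/conf.d/azeti_cloud_monit instead.
SAMPLE = '''
# Postgresql 9.4
check process postgresql with pidfile /var/lib/postgresql/9.4/main/postmaster.pid
group database
stop program = "/etc/init.d/postgresql stop"
start program = "/etc/init.d/postgresql restart"
if 5 restarts within 5 cycles then timeout
'''


def list_monit_services(text):
    """Map each monitored process name to its pidfile, start and stop commands."""
    services = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        check = re.match(r"check process (\S+) with pidfile (\S+)", line)
        if check:
            current = {"pidfile": check.group(2)}
            services[check.group(1)] = current
            continue
        program = re.match(r'(start|stop) program = "([^"]*)"', line)
        if program and current is not None:
            current[program.group(1)] = program.group(2)
    return services


for name, info in list_monit_services(SAMPLE).items():
    print(name, info)
```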



API Test

The objective of the api/test call is to provide an automated way of checking the status of the internal application services. It sends an event message directly from a simulated gateway (a SiteController called CloudCheck with a Sensor called CloudTest), follows this message through all the internals, and asserts that every step has been performed correctly.

  1. Pre-requisites

    To perform this test, an application user with two specific roles is mandatory:

    ADMIN or OPERATOR role (in order to be able to make an API call and subscribe to the application topics)

    plus

    GATEWAY role (in order to send a message as a SiteController called CloudCheck for a Sensor called CloudTest).

  2. Call

    The call is a POST request to https://azeticloudXX.azeti.net/SSCServices/api/test, made with the user described in the previous step.

    curl -sL -u 'test@azeti.org:xxxxxxxx' -X POST https://azeticloudXX.azeti.net/SSCServices/api/test | python -m json.tool
  3. Tests Performed
    There are currently 8 tests performed. Each returns 0 for failure or 1 for success.

    1. Api
      This is the basic test of reaching the SSCServices/api/test API endpoint and successfully authenticating the calling user.

    2. Database
      This is the basic test of making a relational database query for the number of locations and checking for a valid numeric result.

    3. Broker
      This is the basic test of making a Broker API request and checking the Store, Memory and Temp percentages and the Current Connections Count.
      To pass the test, the Store has to be less than 75%, the Memory less than 90%, the Temp less than 75% and the Current Connections Count less than 5000.

    4. BrokerPub
      This is the basic test of publishing a random event message to the Broker from the CloudTest sensor of the CloudCheck gateway.

    5. BrokerSub
      This is the basic test of subscribing to the ACP application topic for the CloudTest sensor of the CloudCheck gateway and checking that the random event message has been successfully received at the ACP application level within 750 milliseconds.

    6. BrokerActivity
      This test checks the Broker queues that the ACP application uses internally and returns an error when one of the queues has more than 200 pending messages.

      In addition, if outbound queues have been enabled but no consumer is dequeuing them (no STOMP client connected), the test deletes them when their Pending size is higher than 100K.

    7. TsInflux08
      This is the basic test of making a query to the InfluxDB version 0.8 time series database to ensure that the random event message has been stored within 1200 milliseconds.

    8. TsInfluxDB
      This is the basic test of making a query to the InfluxDB version 0.13.+ time series database to ensure that the random event message has been stored within 1200 milliseconds.

  4. Additional Results

    Apart from the tests mentioned above, the answer also returns the number of Errors (which must be 0 if all the tests returned 1) with the corresponding ErrorMessages.

    It also returns the number of Warnings (independent of the success of the tests) with the corresponding WarningMessages.

    Finally, it returns the total Time spent on the tests, in milliseconds.

  5. Example of a successful answer without warnings

    "Errors": 0
    "Warnings": 0

    Note: The error counter of 0 corresponds to all 8 current tests being successful and delivering 1 as a result.

    "Api": 1
    "Database": 1
    "Broker": 1
    "BrokerPub": 1
    "BrokerSub": 1
    "TsInflux08": 1
    "TsInfluxDB": 1
    "BrokerActivity": 1

    [
        {
            "Api": 1
        },
        {
            "Database": 1
        },
        {
            "Broker": 1
        },
        {
            "BrokerPub": 1
        },
        {
            "BrokerSub": 1
        },
        {
            "TsInflux08": 1
        },
        {
            "TsInfluxDB": 1
        },
        {
            "BrokerActivity": 1
        },
        {
            "InfoMessages": [
                "Api Test result: user: 580751",
                "Database Test result: locations: 4",
                "Broker Uptime result: 8 hours 6 minutes. CurrentConnections: 150. Store: 10% Memory: 0% Temp: 0%",
                "BrokerPub Test result:  Value: -177231141 Time: 2016-08-24T19:49:30.539Z+0200 Published: true",
                "BrokerSub Test result:  Value: -177231141 Time: 2016-08-24T19:49:30.539Z+0200 Values: true Times: true Points: 1",
                "TsInflux08 Test result: Value: -177231141 Time: 2016-08-24T19:49:30.539Z+0200 Values: true Times: true Points: 1",
                "TsInfluxDB Test result: Value: -177231141 Time: 2016-08-24T19:49:30.539Z+0200 Values: true Times: true Points: 1",
                "BrokerActivity Test result: [With Zero Pending: 26] [Non Existing: org.apache.activemq:type=Broker,brokerName=activemq,destinationType=Queue,destinationName=*_outbound_*]"
            ]
        },
        {
            "Errors": 0
        },
        {
            "ErrorMessages": []
        },
        {
            "Warnings": 0
        },
        {
            "WarningMessages": []
        },
        {
            "Time": 3119
        }
    ]
  6. Example of an answer with ERRORS

    "Errors": 2
    "Warnings": 0

    Note: The error counter of 2 corresponds to 2 tests delivering 0 as a result.

    "TsInflux08": 0
    "BrokerActivity": 0

    Note: The error counter of 2 produced the corresponding 2 ErrorMessages with further details.

    [
        {
            "Api": 1
        },
        {
            "Database": 1
        },
        {
            "Broker": 1
        },
        {
            "BrokerPub": 1
        },
        {
            "BrokerSub": 1
        },
        {
            "TsInflux08": 0
        },
        {
            "TsInfluxDB": 1
        },
        {
            "BrokerActivity": 0
        },
        {
            "InfoMessages": [
                "Api Test result: user: 580751",
                "Database Test result: locations: 4",
                "Broker Uptime result: 1 hour 26 minutes. CurrentConnections: 152. Store: 1% Memory: 0% Temp: 0%",
                "BrokerPub Test result:  Value: -195000897 Time: 2016-08-24T13:09:09.764Z+0200 Published: true",
                "BrokerSub Test result:  Value: -195000897 Time: 2016-08-24T13:09:09.764Z+0200 Values: true Times: true Points: 1",
                "TsInfluxDB Test result: Value: 552711248 Time: 2016-08-24T13:08:33.386Z+0200 Values: false Times: false Points: 1",
                "BrokerActivity Test result: [With Zero Pending: 21] [azeti.ts_hd_topersist 1] [azeti.sscbroker_events 1] [azeti.notification.in 54] [azeti.idbts_values_topersist 1] [Non Existing: org.apache.activemq:type=Broker,brokerName=activemq,destinationType=Queue,destinationName=*_outbound_*]"
            ]
        },
        {
            "Errors": 2
        },
        {
            "ErrorMessages": [
                "TsInflux08 empty result! not yet persisted after waiting for 1200ms? query : select value from \"sp.89fc49f5-0da3-47e9-8a65-4e149a17364d.cutype.3.checkunit.c50fbc07-2efa-421a-8f27-43bd42a3d45e.Value\" where value = -195000897 and time > 1472036949764 limit 5",
                "BrokerActivity Internal Pending Limits Reached for: azeti.sscbroker_hd: 308 Enq/Deq: 1072/1058"
            ]
        },
        {
            "Warnings": 0
        },
        {
            "WarningMessages": []
        },
        {
            "Time": 2946
        }
    ] 
  7. About the InfoMessages
    Some details about the InfoMessages.

    "InfoMessages": [
                "Api Test result: user: 9303",
                "Database Test result: locations: 17",
                "Broker Uptime result: 3 hours 8 minutes. CurrentConnections: 115. Store: 4% Memory: 0% Temp: 0%",
                "BrokerPub Test result:  Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Published: true",
                "BrokerSub Test result:  Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Values: true Times: true Points: 1",
                "TsInflux08 Test result: Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Values: true Times: true Points: 1",
                "TsInfluxDB Test result: Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Values: true Times: true Points: 1",
                "BrokerActivity Test result: [With Zero Pending: 26] [Non Existing: org.apache.activemq:type=Broker,brokerName=activemq,destinationType=Queue,destinationName=*_outbound_*]"
            ]

    Even when a test succeeded (result 1), there are InfoMessages that allow the performed tests to be analysed in detail.

    Note how the InfoMessages deliver further details on how each test performed.

    "Api Test result: user: 9303" 
    Displays that the user 9303 was successfully authenticated.

    "Database Test result: locations: 17" 
    Displays that the database test query returned 17 locations.

    "Broker Uptime result: 3 hours 8 minutes. CurrentConnections: 115. Store: 4% Memory: 0% Temp: 0%" 
    Displays the current Broker values.

    "BrokerPub Test result: Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Published: true" 
    Displays the value and timestamp of the published test event.

    "BrokerSub Test result: Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Values: true Times: true Points: 1" 
    Displays the correctness of the value and timestamp of the application subscribed test event.

    "TsInflux08 Test result: Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Values: true Times: true Points: 1" 
    Displays the correctness of the value and timestamp of the InfluxDB 08 stored test event.

    "TsInfluxDB Test result: Value: -2090789709 Time: 2016-08-15T19:22:04.603Z+0200 Values: true Times: true Points: 1" 
    Displays the correctness of the value and timestamp of the InfluxDB 0.13.+ stored test event.

    "BrokerActivity Test result: [With Zero Pending: 26] [Non Existing: org.apache.activemq:type=Broker,brokerName=activemq,destinationType=Queue,destinationName=*_outbound_*]" 
    Displays the Broker Queues with 0 Pending messages plus the Non Existing Queues to test.

  8. Troubleshooting
    If a test returns 0 (failure), it is important to analyse the ErrorMessages together with the InfoMessages in order to understand what is going on.

    Performing the same call again after a minute helps to determine whether it was a temporary problem (e.g. due to heavy load) or a more serious one.
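
Because the answer is a JSON array of single-key objects, it helps to flatten it before looking for failures. The sketch below is one possible way to do that, not an official client: the names `flatten` and `failed_tests` are hypothetical, the sample is a shortened copy of the error example above with truncated messages, and matching ErrorMessages to a test by their leading test name is an assumption based on the examples in this guide.

```python
import json

# The eight test keys described in "Tests Performed".
TEST_NAMES = ["Api", "Database", "Broker", "BrokerPub", "BrokerSub",
              "BrokerActivity", "TsInflux08", "TsInfluxDB"]

# Shortened copy of the error example (messages truncated for brevity).
SAMPLE = json.loads("""
[
  {"Api": 1}, {"Database": 1}, {"Broker": 1}, {"BrokerPub": 1},
  {"BrokerSub": 1}, {"TsInflux08": 0}, {"TsInfluxDB": 1},
  {"BrokerActivity": 0},
  {"Errors": 2},
  {"ErrorMessages": ["TsInflux08 empty result!",
                     "BrokerActivity Internal Pending Limits Reached"]},
  {"Warnings": 0}, {"WarningMessages": []}, {"Time": 2946}
]
""")


def flatten(answer):
    """Merge the list of single-key objects into one dictionary."""
    merged = {}
    for item in answer:
        merged.update(item)
    return merged


def failed_tests(answer):
    """Return {test name: [matching error messages]} for tests that returned 0."""
    flat = flatten(answer)
    failures = {}
    for name in TEST_NAMES:
        if flat.get(name) == 0:
            # The trailing space keeps "Broker" from matching "BrokerActivity ...".
            failures[name] = [m for m in flat.get("ErrorMessages", [])
                              if m.startswith(name + " ")]
    return failures


for name, messages in failed_tests(SAMPLE).items():
    print(name, messages)
```

Running the same check on a second answer taken a minute later shows whether the same tests are still failing.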

Troubleshooting

Some useful troubleshooting tips for live environments:

  1. Display current activity on PostgreSQL database

    su postgres
    psql -d postgres
    
    select count(*) as total, datname, application_name, waiting, state, query from pg_stat_activity group by datname, application_name, waiting, state, query order by total desc;
  2. InfluxDB Monitoring
    System Monitoring 
    https://github.com/influxdata/influxdb/blob/master/monitor/README.md
    How to use the SHOW STATS command and the _internal database to monitor InfluxDB
    https://www.influxdata.com/how-to-use-the-show-stats-command-and-the-_internal-database-to-monitor-influxdb/
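
The statistics described in the pages linked above can also be requested over InfluxDB's HTTP /query endpoint. The following sketch only builds the request URLs; the helper name `influx_query_url` is hypothetical, and the address http://localhost:8086 without authentication is an assumption that must be adapted to the installation. It applies to the InfluxDB 0.13.+ instance, not to version 0.8.

```python
from urllib.parse import urlencode


def influx_query_url(base_url, query, database=None):
    """Build a URL for InfluxDB's HTTP /query endpoint."""
    params = {"q": query}
    if database:
        params["db"] = database
    return base_url.rstrip("/") + "/query?" + urlencode(params)


# Server-wide statistics.
print(influx_query_url("http://localhost:8086", "SHOW STATS"))
# Measurements recorded in the _internal monitoring database.
print(influx_query_url("http://localhost:8086", "SHOW MEASUREMENTS",
                       database="_internal"))
```

The resulting URLs can be fetched with curl or any HTTP client on the VM that hosts InfluxDB.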
