A statistical host applies statistics to a group of hosts, so you can see what percentage of the group is experiencing the same issue. This is very useful when monitoring fleets of devices, such as the wireless access points in a building or a set of redundant nodes.

I created 10 VMs and added them to Zabbix as hosts named “CentOS – Node #”. To keep the setup simple, they are monitored by ICMP ping only. All hosts are added to the group “cluster1”; hosts must share a group before they can be used with a statistical host. Zabbix allows a host to belong to any number of groups, so it is normal for one host to be in many groups.

10 nodes in the same group “cluster1”

Template Configuration

Two templates need to be created for everything to work. The first defines the monitoring for each host; the second determines the status of the group.

Host Template

Create a template that describes the host to be monitored. For our example, it will be “Template CentOS Node”. The associated group should be “Templates” at a minimum. You can also assign a group here that will be applied to every host linked to the template, which makes it easy to apply statistics to all hosts that share the template.

Add a Linked template to link items and triggers. For this example we will use ICMP Ping.

Add the template and link your hosts with it. You can use the Mass update feature by selecting all of your hosts with the check boxes to the left, then click Mass update at the bottom. From there you can bulk edit the hosts.

For our example, the nodes are all assigned to the newly created host template:

Navigate to the host template by going to Configuration > Templates and selecting it. It should already contain a few items inherited from the linked template. We need to add a few more items to it so that the group template can work.

Start by creating an item named “CentOS node is up”. It maps the ICMP ping result to an item key that will later be used for calculation. For this example, the fields are:

Item | Value
Name | CentOS node is up
Type | Calculated
Key | centosnodesup
Formula | last("icmpping")=1
Update interval | {$NORMAL_UPDATE_INTERVAL}
History storage period | {$HISTORY_STORAGE_PERIOD}
Trend storage period | {$TREND_STORAGE_PERIOD}
Applications | Device Status

Preprocessing settings:

Name | Parameters
In range | 0
*Custom on fail | Set value to 0

*Custom on fail is enabled by the check box on the right of the preprocessing step.

Now create the opposite item, which reports that the node is down. Use the following fields:

Item | Value
Name | CentOS node is down
Type | Dependent item
Key | centosnodesdown
Master item | Template CentOS Node: CentOS node is up
History storage period | {$HISTORY_STORAGE_PERIOD}
Trend storage period | {$TREND_STORAGE_PERIOD}
Applications | Device Status

Preprocessing settings:

Name | Parameters
Regular expression | ^1$
*Custom on fail | Set value to 1

*Custom on fail is enabled by the check box on the right of the preprocessing step.
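
The up/down translation performed by these two items can be sketched in Python. This is only an illustration, not Zabbix code: the function names are hypothetical, and the sketch assumes that the regular-expression step outputs 0 when the pattern matches, which the preprocessing table does not state.

```python
import re


def node_up(icmpping_last: int) -> int:
    """Mimic the calculated item formula last("icmpping")=1."""
    return 1 if icmpping_last == 1 else 0


def node_down(node_up_value: int) -> int:
    """Mimic the dependent item by inverting the master item's value.

    Assumption: the ^1$ step outputs 0 on a match. When the pattern does
    not match, "Custom on fail" sets the value to 1.
    """
    if re.fullmatch(r"1", str(node_up_value)):
        return 0
    return 1


for ping in (1, 0):
    up = node_up(ping)
    print(f"icmpping={ping} -> up={up}, down={node_down(up)}")
```

Each host therefore always reports exactly one of the two flags as 1, which is what lets the group template sum them later.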

One more item needs to be created. It tells our statistics to include the host in the total, which prevents miscalculations if an item returns an unrecognized response. This item is called “one” rather than up or down:

Item | Value
Name | one
Type | Calculated
Key | centosnodeone
Formula | 1
Update interval | {$SLOW_UPDATE_INTERVAL}
History storage period | {$HISTORY_STORAGE_PERIOD}
Trend storage period | {$TREND_STORAGE_PERIOD}
Applications | Constants

No preprocessing steps.

There should now be six items within the template we are creating, though there may be more or fewer depending on which linked templates you used.

We have now finished the template that will be applied to each host. It works like this: the template uses linked items, in this case ICMP ping checks, to monitor the hosts. The additional items we created translate those results into values that the group template we are about to create can aggregate. You are not limited to ICMP ping; any other item, such as an agent check, can be translated the same way. At the end we will verify our information and do an advanced check.

Group Template

Create a template that describes the group of hosts to be monitored. For our example, it will be “Template Group Device CentOS Node”. The associated group should be “Templates” at a minimum.

Items

Open the template that was just created and go to Items.

Click on Create item in the top right corner. I am populating the fields with the following:

Item | Value
Name | CentOS nodes down
Type | Zabbix aggregate
Key | grpsum["{$GROUP_NAME}","centosnodesdown","last"]
Update interval | {$NORMAL_UPDATE_INTERVAL}
History storage period | {$HISTORY_STORAGE_PERIOD}
Trend storage period | {$TREND_STORAGE_PERIOD}
Applications | Stats

Preprocessing settings:

Name | Parameters
In range | 0
*Custom on fail | Set value to 0

*Custom on fail is enabled by the check box on the right of the preprocessing step.

Create the same item, except use up instead of down for all values:

Item | Value
Name | CentOS nodes up
Type | Zabbix aggregate
Key | grpsum["{$GROUP_NAME}","centosnodesup","last"]
Update interval | {$NORMAL_UPDATE_INTERVAL}
History storage period | {$HISTORY_STORAGE_PERIOD}
Trend storage period | {$TREND_STORAGE_PERIOD}
Applications | Stats

Preprocessing settings:

Name | Parameters
In range | 0
*Custom on fail | Set value to 0

*Custom on fail is enabled by the check box on the right of the preprocessing step.

Create another item that uses total instead of up/down:

Item | Value
Name | CentOS nodes total
Type | Zabbix aggregate
Key | grpsum["{$GROUP_NAME}","centosnodeone","last"]
Update interval | {$NORMAL_UPDATE_INTERVAL}
History storage period | {$HISTORY_STORAGE_PERIOD}
Trend storage period | {$TREND_STORAGE_PERIOD}
Applications | Stats

Preprocessing settings:

Name | Parameters
In range | 0
*Custom on fail | Set value to 0

*Custom on fail is enabled by the check box on the right of the preprocessing step.

There should be three items now: up, down, total
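
To make the aggregation concrete, here is a minimal Python sketch of what the three grpsum items compute: the sum of the last value of one item key across every host in a group. The data structure, the make_node helper, and the grpsum function are hypothetical stand-ins for illustration; Zabbix performs this on the server side.

```python
from typing import Dict

# One dict of item values per host, keyed by item key.
HostItems = Dict[str, int]


def make_node(up: int) -> HostItems:
    """Build the three per-host values defined in the host template."""
    return {"centosnodesup": up, "centosnodesdown": 1 - up, "centosnodeone": 1}


def grpsum(groups: Dict[str, Dict[str, HostItems]], group: str, key: str) -> int:
    """Sum the last value of `key` over every host in `group`."""
    hosts = groups.get(group, {})
    return sum(items.get(key, 0) for items in hosts.values())


# Ten nodes in "cluster1"; node 3 is down in this made-up example.
groups = {
    "cluster1": {
        f"CentOS - Node {n}": make_node(0 if n == 3 else 1) for n in range(1, 11)
    }
}

nodes_up = grpsum(groups, "cluster1", "centosnodesup")
nodes_down = grpsum(groups, "cluster1", "centosnodesdown")
nodes_total = grpsum(groups, "cluster1", "centosnodeone")
print(nodes_up, nodes_down, nodes_total)  # prints: 9 1 10
```

Because every host contributes a constant 1 through the “one” item, the total counts each host regardless of whether its status is up or down.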

An item can now be created that calculates the percentage of hosts down from the down and total values. I am going to name the item “CentOS nodes down percent”. Name yours something similar and use the following values:

Item | Value
Name | CentOS nodes down percent
Type | Calculated
Key | centosnodesdownpercent
Formula | last("grpsum[\"{$GROUP_NAME}\",\"centosnodesdown\",\"last\"]")/last("grpsum[\"{$GROUP_NAME}\",\"centosnodeone\",\"last\"]")*100
Type of information | Numeric (float)
Units | %
Update interval | {$NORMAL_UPDATE_INTERVAL}
History storage period | {$HISTORY_STORAGE_PERIOD}
Trend storage period | {$TREND_STORAGE_PERIOD}
Applications | Stats

Preprocessing settings:

Name | Parameters
In range | 0
*Custom on fail | Set value to 0

*Custom on fail is enabled by the check box on the right of the preprocessing step.
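
The arithmetic behind the percent item is simply down divided by total, times 100. A small Python sketch follows; the function name is hypothetical, and returning 0 for an empty group is a choice made for this sketch only, not documented Zabbix behaviour.

```python
def down_percent(nodes_down: float, nodes_total: float) -> float:
    """Mirror the calculated item formula: down / total * 100."""
    if nodes_total <= 0:
        # Assumption for this sketch only: report 0% for an empty group
        # instead of dividing by zero.
        return 0.0
    return nodes_down / nodes_total * 100


for down in (0, 1, 3):
    print(f"{down} of 10 nodes down -> {down_percent(down, 10):.1f}%")
```

With the 10-node example group, each node that goes down adds 10 percentage points.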

That is enough with the Group template for now.

Statistical Host

To test that these items are working, let’s skip triggers for now and create the host that will display the data. Navigate to Configuration > Hosts and click Create host in the top right corner.

I am going to name the host “Statistic CentOS Cluster Status” and set the group to “Statistic + device type”. Name yours something similar to let you know that it is a statistical host.

Under the Templates tab for the host creation, link the newly created template that has the items for collecting host information. Mine was called “Template Group Device CentOS Node”:

Under the Macros tab, create a macro that sets the name of the group to be monitored. The macro used in the examples above is “{$GROUP_NAME}”, so I create it with a value of “cluster1”, the group that the nodes belong to.

Click Add and the host should be created.
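
The reason one group template can serve many clusters is that the host-level macro is substituted into the item key. A rough Python illustration of that substitution is below; expand_macros is a hypothetical helper written for this sketch, and leaving unknown macros untouched is a simplification.

```python
import re


def expand_macros(text: str, macros: dict) -> str:
    """Replace {$NAME} user macros with their values.

    Macros that are not defined are left as they are (sketch behaviour).
    """
    def substitute(match: re.Match) -> str:
        return macros.get(match.group(0), match.group(0))

    return re.sub(r"\{\$[A-Z0-9_.]+\}", substitute, text)


host_macros = {"{$GROUP_NAME}": "cluster1"}
key = 'grpsum["{$GROUP_NAME}","centosnodesdown","last"]'
print(expand_macros(key, host_macros))
# prints: grpsum["cluster1","centosnodesdown","last"]
```

A second statistical host with “{$GROUP_NAME}” set to another group name would reuse the same template and aggregate that group instead.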

Testing

That should be enough to test now, so go to Monitoring > Latest data and select the statistical host that was just created. Within a minute or two some data should populate.

We can test it further by shutting down a host.

It looks like it is working, so let’s add some triggers.

Triggers

The triggers will be within the group template. Some macros need to be defined for the triggers to work, so we will do that first.

Navigate there via Configuration > Templates and select the group template you created. For the example, it is “Template Group Device CentOS Node”

Click on Macros and assign the following:

Macro | Value
{$HOSTDOWN_AVERAGE_THRESHOLD} | 10
{$HOSTDOWN_DISASTER_THRESHOLD} | 30
{$HOSTDOWN_HIGH_THRESHOLD} | 20
{$HOSTDOWN_WARNING_THRESHOLD} | 5
{$HOSTS_TOTAL_THRESHOLD} | 1

Click Update to save the macros. Now that they are set up, we can create the triggers: navigate back to the template, click Triggers, then click Create trigger.

Let’s start from the top and create the Disaster-level trigger. Name it “More than {$HOSTDOWN_DISASTER_THRESHOLD}% CentOS node hosts are down”, so that the trigger name updates with the macro value. Set the severity to Disaster.

Now let’s create the expression. It contains a little extra logic: a minimum host count that must be exceeded before the trigger can fire, no matter what the percentage is. This prevents a small host group from raising a Disaster-level trigger that is most likely less serious than the same trigger on a large host group. For example, if you have 5 devices and 3 of them go down, you will receive a trigger for more than 50% of hosts down. You may want to avoid that when another group has 100 devices in total, where more than 50% going down is much more concerning.

Back to the expression: click Add and select the item “CentOS nodes total”, leave the function as last, and set the Result to “>” “{$HOSTS_TOTAL_THRESHOLD}”:

Type “ and ” after the expression and click Add again. Now select “CentOS nodes down percent”, set Function to avg, set Last of (T) to “{$TRIGGER_COUNT}”, and set Result to “>” “{$HOSTDOWN_DISASTER_THRESHOLD}”:

The result should look like this:

The final expression is:

{Template Group Device CentOS Node:grpsum["{$GROUP_NAME}","centosnodeone","last"].last()}>{$HOSTS_TOTAL_THRESHOLD} and {Template Group Device CentOS Node:centosnodesdownpercent.avg({$TRIGGER_COUNT})}>{$HOSTDOWN_DISASTER_THRESHOLD}
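
To show how the two halves of this expression combine, here is a minimal Python sketch of its logic as written: the last total must exceed the total threshold, and the average of the most recent percent values must exceed the percent threshold. The function and parameter names are hypothetical, and the history lists are made-up sample data.

```python
from statistics import mean
from typing import Sequence


def trigger_fires(
    nodes_total: float,
    percent_history: Sequence[float],
    total_threshold: float,
    percent_threshold: float,
    trigger_count: int,
) -> bool:
    """total.last() > total_threshold and percent.avg(#trigger_count) > percent_threshold."""
    if trigger_count < 1:
        raise ValueError("trigger_count must be at least 1")
    if not percent_history:
        return False  # no data collected yet, nothing to evaluate
    recent = percent_history[-trigger_count:]
    return nodes_total > total_threshold and mean(recent) > percent_threshold


# 40% of nodes just went down: the average of the last 3 values is still low.
print(trigger_fires(10, [0, 0, 40], 1, 30, 3))           # False
# After three consecutive checks at 40%, the average exceeds 30.
print(trigger_fires(10, [0, 0, 40, 40, 40], 1, 30, 3))   # True
```

Averaging over several checks means a single bad reading does not fire the trigger, at the cost of a short delay before a real outage is reported.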

Add this trigger, then create the next tier down, High. Click Create trigger, name it “More than {$HOSTDOWN_HIGH_THRESHOLD}% CentOS node hosts are down”, and set the Severity to High.

For the expression, remember that the first half compares the total number of nodes to the macro that sets the minimum node count required for a trigger to fire: nodes total > threshold. The item is the one named “… nodes total” and the threshold is “{$HOSTS_TOTAL_THRESHOLD}”. The expression builder should look like this:

And the formula should look like this:

{Template Group Device CentOS Node:grpsum["{$GROUP_NAME}","centosnodeone","last"].last()}>{$HOSTS_TOTAL_THRESHOLD}

Now type “ and ” after that expression. This tells Zabbix that both conditions must be met for the trigger to fire. Then click Add again to add the second half of the expression.

This half says that the trigger should fire if the average of the previous checks is higher than the threshold. The item is the percentage of nodes down, so select “CentOS nodes down percent”. The function is avg() with a Last of (T) value of “{$TRIGGER_COUNT}”, so the number of checks is defined by a macro, which makes the configuration much easier to scale. The Result should be set to “>” “{$HOSTDOWN_HIGH_THRESHOLD}”. This second half of the expression should look like this:

Click Insert and the full expression should look like this:

{Template Group Device CentOS Node:grpsum["{$GROUP_NAME}","centosnodeone","last"].last()}>{$HOSTS_TOTAL_THRESHOLD} and {Template Group Device CentOS Node:centosnodesdownpercent.avg({$TRIGGER_COUNT})}>{$HOSTDOWN_HIGH_THRESHOLD}

Before you add the trigger, let’s create a dependency that will prevent several triggers from happening at the same time. If you have 60% of hosts down and your thresholds are 10, 20, 30, and 40% then all of those triggers will go off. We can prevent that from happening by having dependencies, so the result is that only the 40% trigger will go off.

Click on Dependencies, click Add, then select the trigger “More than 30% CentOS node hosts are down”. That is all that is necessary. As you work down the triggers by severity, make sure each one depends on all of the triggers with a higher severity.

Click Add and the trigger should be created.
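
The effect of the dependencies can be sketched in Python: every threshold that is exceeded puts its trigger in a problem state, but a trigger is hidden whenever a higher-severity trigger it depends on is also in a problem state. The names below are hypothetical, the thresholds are the macro values from this example, and the total-hosts condition is left out to keep the sketch short.

```python
from typing import List, Optional

# Percent thresholds from the template macros, lowest severity first.
THRESHOLDS = {"Warning": 5, "Average": 10, "High": 20, "Disaster": 30}
SEVERITY_ORDER = ["Warning", "Average", "High", "Disaster"]


def firing(percent_down: float) -> List[str]:
    """Every severity whose threshold is exceeded, lowest first."""
    return [s for s in SEVERITY_ORDER if percent_down > THRESHOLDS[s]]


def visible_problem(percent_down: float) -> Optional[str]:
    """With each trigger depending on all higher ones, only the top one shows."""
    active = firing(percent_down)
    return active[-1] if active else None


print(firing(60))           # ['Warning', 'Average', 'High', 'Disaster']
print(visible_problem(60))  # Disaster
print(visible_problem(10))  # Warning
```

Note that 10% shows Warning rather than Average because the comparison is strictly greater than the threshold.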

Now two triggers are created, and two more are needed to cover all of the macro thresholds. On your own, create the Average and Warning triggers. They follow the same design as the two already created; just remember to create Average first and to set each trigger’s dependencies to all of the existing higher-severity triggers.

With all of the triggers created, the template trigger page should look similar to this:

Make sure “{$TRIGGER_COUNT}” is defined as a global macro, or none of the triggers that use it will work. You can edit global macros under Administration > General by clicking Macros in the top right corner. Add “{$TRIGGER_COUNT}” and set it to a value such as #3. The “#” prefix is needed so that avg() averages the last 3 values rather than the last 3 seconds.
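
The difference the “#” makes can be illustrated with a small Python parser. This is a sketch based on the older trigger syntax used in this article, where the avg() parameter is either a time period or “#” followed by a number of values; parse_period is a hypothetical helper, not part of Zabbix.

```python
from typing import Tuple

# Multipliers for time suffixes, in seconds.
SUFFIXES = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_period(param: str) -> Tuple[str, int]:
    """Interpret an avg() parameter as a value count or a time period."""
    param = param.strip()
    if param.startswith("#"):
        return ("values", int(param[1:]))
    if param and param[-1] in SUFFIXES:
        return ("seconds", int(param[:-1]) * SUFFIXES[param[-1]])
    return ("seconds", int(param))


print(parse_period("#3"))  # ('values', 3)
print(parse_period("3"))   # ('seconds', 3)
print(parse_period("5m"))  # ('seconds', 300)
```

With a value of 3 and no “#”, the average would cover a 3-second window, which is unlikely to contain enough data for these items.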

Testing – Part 2

Now we can still see that there is one host down:

One node is 10% of the hosts, so with the thresholds we set, a Warning trigger should fire. Let’s see if any problems are showing up yet:

It looks like there are none, but why? We have {$HOSTS_TOTAL_THRESHOLD} set to 1 and there is one host down, but the expression says the value must be greater than the threshold. So let’s set the macro to 0 and see what happens:

It works, so now let’s shut a few more nodes down and see what happens.

Once again, it works exactly as expected. There is only one problem being shown since the others have dependencies set up.
