System Monitoring or How and What to Monitor

System Monitoring or How and What to Monitor

Author:


21.06.2021

System monitoring may take different forms. We use it most often to check the function of a website, application, service, or network. How much time is required to get a server response? How much does it take to send a query to a database? Does the virtual machine operate correctly? How many resources does it use and do we have enough of them?  

The answers to those questions will help to optimize processes and respond to any failure in time or even to counteract it.  

How far does system monitoring reach?  

Optimization is connected with effectiveness which refers not only to machines. It is a bit controversial, though not unusual in the times of more and more frequent remote work, to monitor what happens on the employees’ computers. It is possible to check if the employee works or spends their time on social media. This can be compared to monitoring a housing estate with a camera system.  

This is not just invigilation. It may happen that a user’s inbox receives an e-mail with a suspicious link. Monitoring is able to discover that early enough and then we have time to warn the user not to follow the link. This is, e.g. the way DNS Security operates which we discussed when talking about Cisco Umbrella. 

How do we know what to observe? 

One of the assumptions that come to mind first is the message to end-users, namely “we have monitoring — trust us more,” but it is about something entirely different from the perspective of the person managing a given solution.  

Regardless of the target and configuration, monitoring helps to reconstruct what took place on the virtual machine in the past. Thanks to that, as if following the river to get to the sea, we can find an incident or a series of incidents that will tell us why something does not operate or does not do it in the way we want.  

It is worth using monitoring, e.g. for working on an application. And we use it to get to know it better and to optimize it. In this case, data reading will be faster and clearer thanks to tools generating different types of summary boards, e.g. colorful dashboards.  

We sometimes work on services that must operate 24 hours a day and 365 days a year. We should know when something stops working before our hotline is blocked by users detached from the service. 

To that aim, it is worth setting an e-mail or SMS alert which will notify us of any irregularities in the virtual machine operation. 

Which monitoring tool to choose?  

Popular tools include Zabbix and Nagios which are helpful mostly when monitoring network infrastructure. Every tool is different. It is important to choose the tool most convenient for the team. Let us not choose the most popular one. Test a few and select the one which meets our needs best. 

Cloud or on-premise?  

As in the case of any application, we can choose if we act in the cloud or on-premise. There is no simple answer to which option is better. This is conditional on the needs and on the budget.  

Cloud tools are more expensive. However, they relieve us from the obligation to check if the monitoring is functional and the configuration itself, as the online monitoring is pre-developed to check specific items. It also suggests the pre-defined alerts worth using.  

Cloud monitoring cannot be used if the policy of the company where the tool is to be installed does not allow that. It may happen that the company does not want to share data and does not allow installation of any extra agents or there is no physical possibility to install them.  

If our budget is limited and we want online monitoring, we can try a hybrid solution and use cloud monitoring just for examining the most vulnerable spots.  

Mateusz Buczkowski, Head of R&D, Grandmetric

On the other hand, an advantage of on-premise tools is that those are often Open Source solutions, meaning they can be used in their basic version for free. You will have to pay for any extra options which we often need. Nonetheless, the opportunity to test a tool will help to make a decision on which of them to choose.  

Cloud tools, a short specification: New Relic, DataDog, Elastic Stack  

New Relic 

New Relic is a popular cloud tool for application operation monitoring. It enables to monitor applications in Java, .Net, php, node.js, Ruby, Go.  

DataDog 

Datadog combines metrics, tracks, logs, UX test results, and more in a single panel. Users like it as it combines functionalities of several applications in one (you can use it to monitor infrastructure and application) and is easily integrated, e.g. with Slack.  

ELK Stack 

ELK Stack, popularly termed Elastic, is a combination of three Open Source products: Elasticsearch, Logstash, Kibana. Elasticsearch is a NoSQL database based on the Lucene search engine. Logstash is a tool that accepts input data from various sources, carries out different transformations, and exports data in a form specified by us. Kibana is a visualization layer operating above Elasticsearch. 

To put it simply, Logstash collects data, saves it, and sends it to Elasticserach base, and Kibana is responsible for the result visualization.  

Together, they make one of the most popular “combines” for monitoring in IT.  

I have selected a tool, what to do next? 

Remember that the monitoring operation does not end with the tool installation and alert setting. When the traffic on the monitored machine increases, we must be certain that the tool throughput is sufficient to pick the data which is interesting for us.  

And thus… 

How do we know that the system monitoring is functional?  

This is a very good question as we do not need the monitoring which we do not know NOT to operate. This is why it is good to have a monitoring tool on another server or use cloud monitoring to observe if the monitoring agent is functional.  

Non-standard use of monitoring tools 

Adding a bit of work, you can adapt the monitoring tool to non-standard needs. For example, Elastic Stack can be used, with some minor modifications, to monitor IoT devices.  

We employed it to monitor 500 devices installed in a dormitory.  

We used an extension written by ourselves to Elastic Stack. It gathered data and different metrics from the devices in the dormitory. An important extension component was the data formatting method to ensure that it is understandable for the team.  

Next, we created dashboards for ourselves. This case shows that we can use the tool for a purpose different than the one it was originally developed.  

You may not have to develop a separate extension. There are lots of plug-ins both for Zabbix and for Nagios. The community keeps developing new things all the time. We just want to show you that the possibilities are vast. It is important to specify your target and then you will find the resources or somebody to write the required extension.  

Can monitoring help to counteract failures? 

Monitoring is not just the possibility of responding to errors. System monitoring can be configured to ensure that it alerts before the service status changes from up to down. A simple case is insufficient disk space or excessive RAM consumption. You can set an alert to notify you in advance.  

As we mentioned at the beginning of this article, monitoring can also warn you against some dangerous links before they are clicked.  

Machine Learning and AI in monitoring 

Abnormal traffic may be a harbinger of emerging problems. For example, if we know that the traffic is usually the heaviest in daytime and the data shows that it increases at night, something must be wrong and you should review the diagrams to check the cause of this abnormality.  

In this case, AI is to ensure that the agent learns which incidents are correct and which are abnormal. When we receive an alert of any abnormal traffic, review the logs to find the root cause of the problem as somebody might have tried to attack us.  

Machine Learning and AI will be particularly useful to monitor security and detect burglaries. An ordinary alert will notify us only when the traffic, e.g. on the website exceeds the pre-set limit. We will not get an automatic message when the traffic is close to the alarm threshold. It is different for Machine Learning and AI. The system will detect the anomaly automatically, based on long-term reports and data to which we turn our attention to.  

As you can see, system monitoring is a broad problem so we will probably return to it in other threads. Do not hesitate to ask questions in the comments. This will help us to prepare even better content for our readers. Thank you for your trust and time.  

— 

The article is based on a podcast called “Próba Połączenia.” The IT monitoring was discussed by Marcin Biały and Mateusz Buczkowski. Both of them work for Grandmetric. 

Author

Leave a Reply

Your email address will not be published. Required fields are marked *

Sign up to our newsletter!


 

Newsletter