Database Monitoring: What should be monitored at database servers?

May 24, 2023 | por dbsnoop

It is through consistent database monitoring that the availability and performance of applications are ensured, and business continuity as well.

MySQL, Oracle, IBM DB2, Microsoft SQL Server, PostgreSQL, MariaDB—no matter which database you use, it should be monitored.

Each database (SQL or NoSQL) delivers a certain level of telemetry. The telemetry provided is what allows you to create rules and metrics for monitoring your database.

Database monitoring telemetry can be implicit or explicit. Explicit telemetry is easily available; typically, these are tables and/or state variables accessible through simple commands, procedures, functions, or queries. Implicit telemetry, on the other hand, requires interpretation and is usually obtained by cross-referencing information from various sources.

Each database has common monitoring items, and specific items. The purpose of this article is to enlist the main common items that should be accurately monitored.

After all, what should I monitor in my database? What can each item tell me about?

Metric: CPU Consumption

Monitor constant high processor consumption and loads. Don’t worry about sparks. Any peak of less than 5 minutes can only represent a momentary high consumption. Processor high consumption and/or high processor loads can indicate excessive number of concurrent queries, queries taking excessive resources, queries with bad strategies (query execution plan), and even queries doing “full table scan”.

Metric: Disk I / O

Monitor the number of readings and writes per second, as well as how many bytes are written and read per second. Keeping track of the number of read/write, and the amount of bytes read/written per second help to understand if the database are within the capacity of the disks and/or controllers. Setting database’s I/O needs in contrast to the environment’s capabilities, it helps to evaluate if the environment is running out of limits. Queuing operations (I/O Waits) can tell the environment’s (hardware) capacity exhaustion, or, the same issues pointed at CPU Consumption metrics.

Metric: Ethernet I / O

Monitor the number of bytes sent and received per second, and the number of open ports. Just like the disk controllers and the disks themselves, one must know the input and output capacity of their network controllers. Make sure that the network environment has enough capacity to traffic all bytes sent and received by the database. Exhaustion of doors, although rare, can happen. In addition, many more open ports in the Operation System than in database server can indicate: configuration errors, defective and/or incompatible libraries, and even application errors. Monitor whether your controllers are accepting connection to the database.

Metric: Processes running on the Server

Monitor everything that runs on the database server. Nothing should take more resources that the database itself.

Database servers must run on dedicated hosts (or virtual machines). However, some process or agent may show up consuming some extra resources, either by mistake or by other issues. It is important to monitor alien processes running on the same host machine avoiding them and preventing them for stealing resources.

Metric: SWAP

Monitor the SWAP’s behavior and its consumption.

This is a controversial matter, but we will not address it on this short article. Anyways, it is important to monitor the SWAP’s consumption and even its behavior. SWAP is one of those things that is nice to have, but better if it’s never be used. The SWAP’s consumption can point an inadequate database’s configuration, which one is building pressure on RAM memory, using more memory resources than the existing ones. It can also indicate that complex queries (or poorly written queries) is taking a lot of RAM memory pushing the server to use the SWAP.

Metric: Disk space and its state

Monitor disk usage and its condition (accepting writes).

Everyone knows that because it’s pretty basic, but, because it is simple and obvious, normally, it’s forgotten. Therefore, monitor whether your disks have space for data, metadata, indexes, logs, etc. Database is purely intense disk consumer. At all the time it should have disk space for writes, including several distinguished maintenances (objects moves, create and allocate space for new objects, shrinks, etc)

Problems with disks (medias) itself, controllers, file systems, among others, can put your disks in “read only” condition. Then, in addition to not writing logs, there will be no data writing. Some databases can freeze, paralyzing your applications.

Metric: Poor Queries

Monitor queries that are consuming an elevated level of resources or taking too long to complete.

A good monitoring tool should search for poor queries and catalog them based on intelligent thresholds. A query is considered poor when it consumes a lot of resources (CPU and/or Memory), handling more lines than necessary for a given result set. A poor query can also generate crashes, exhaust resources in such a way as to make the entire database slow, unstable and even bring it down. Monitoring and finding these bad guys (queries), and taking immediate action is the key to keeping your database healthy.

Metric: Locks and Deadlocks

Locks and Deadlocks take place silently, most of the time. That’s the reason they must be considered dangerous. One transaction locks couple others for innocent milliseconds. When these milliseconds occur several times a day one can realize how many processing time is being stealing from applications, people, and businesses. It must be addressed.

Monitoring and understanding the causes of locks and deadlocks is mandatory for avoiding poor performance and faulty applications. Identifying and addressing them will bring better performance on OLTP and OLAP operations, reports and dashboards that handle larger data masses.

Usually locks and deadlocks are caused improper use of indexes, excessive concurrence, transaction misconception, misuse of storage engines, and even poor data modeling.

Metric: Buffers and Caches

Even with the advent of rise of powerful SSD drives designed for databases, RAM memory still being more efficient mean. So, keeping as much data in RAM as possible it will make the database faster. Monitor the use of buffers and caches regarding the number of total pages, pages in use, which data is in-memory, flashes, commits, and, if the data stored there make sense.

A good configuration of buffers and caches means that the most accessed data is always in memory, reducing disk I/O and increasing performance. The goal of Buffers and Caches monitoring is to make available as much relevant data as possible in the memory and the less effort to bring data from disk to the memory.

Metric: Indexes

Indexes are part of applications and databases success. Lack of indexes will make simple queries slower, complex ones will suffer to complete. Poor built indexes will take more resources than necessary to fulfill. Too much indexes will make writes slow. Fewer indexes are bad, in the other hand too much indexes are also bad. A balance must be achieved.

Monitor indexes activities. Identify poor performance indexes, the most used ones and those never being used.

Metric: Connections or Sessions

Monitor the sessions connected to the database, cataloging them as active connections, idle connections, and connections with high resource consumption.

For instance, each session incurs a cost for the database. Therefore, whether there is a large number of active connections or not, it can be a source of problems for the host machine. Excessive idle connections may indicate issues on the application or network side.

These are metrics common to the majority of databases that should be monitored. However, there are at least 50 other common metrics, in addition to those specific to each database manufacturer.

In addition to the metrics, another factor of great importance is the Threshold. Thresholds are operational limits (lower and maximum) that define an imaginary arc of normality. Poor thresholds can cause alerts not to be issued correctly. This subject will be covered in coming article.

Learn more about Flightdeck, our free database monitoring platform. Visit our Senior DBA blog or see his contribution at Oracle Community.

Anything is missing on our monitoring platform? Chat with us and we will be happy to develop it.

Give it a try for 14 days, no burocracy, no credit card

Learn more about Flightdeck!


Leia mais