5 Useful Database Metrics Every SRE Should Track

November 14, 2025 | by dbsnoop


As a Site Reliability Engineer (SRE), your mission is to ensure the stability and performance of complex systems at scale. Your arsenal of tools likely includes robust Grafana dashboards, Prometheus alerts, and detailed logs, all focused on keeping your SLOs (Service Level Objectives) intact.

When it comes to the database—the most critical and often the most opaque layer of your stack—the tendency is to rely on the “black box” metrics provided by your cloud platform: CPUUtilization, FreeableMemory, Read/Write IOPS. These metrics are important, but dangerously insufficient.

They describe the load on the infrastructure, but they reveal nothing about the nature of the work that is causing that load. Relying solely on them is like flying a plane by looking only at the fuel gauge: you know whether you’re burning a lot of fuel, but you have no idea if the engine is operating efficiently or if it’s about to catch fire. For an SRE, who lives by the discipline of reducing mean time to resolution (MTTR), this lack of context is a critical liability.

To evolve from simple symptom monitoring to a real root cause diagnosis, you need to go beyond the obvious. This guide presents five database metrics, focused on the workload, that every SRE should track to get a deep, actionable, and truly reliable view of the health of their data systems.

1. DB Time (or Database Load)

If you could choose only one metric to define the total load on your database, this would be it. DB Time, often expressed as “Average Active Sessions” (DB Time divided by elapsed wall-clock time), is the most accurate measure of your database’s total activity.

  • What is it? DB Time is the accumulated sum of time that all database sessions have spent active (either working on the CPU or waiting for a resource like disk, network, or a lock) during a given interval. If, in a 60-second interval, the total DB Time was 120 seconds, it means that, on average, two sessions were active every second.
  • Why does it matter? This metric is your universal “load indicator.” Your database’s maximum capacity is dictated by the number of CPU cores. A server with 8 vCPUs can, at most, deliver 8 seconds of CPU work per second of real time. If your average active sessions (DB Time per elapsed second) consistently exceed the number of vCPUs, sessions are being queued and waiting for CPU time, resulting in a direct increase in latency for all queries. It captures both productive (CPU) and unproductive (wait) work, providing a holistic view of the load.
  • How it surpasses traditional metrics: CPUUtilization only tells you how busy the processor is. It says nothing about the sessions that are stalled, waiting for I/O or a lock. A system can have low CPU utilization (e.g., 20%) but a very high DB Time, indicating a severe lock contention problem that is paralyzing the application. DB Time captures both scenarios. A platform like dbsnOOp centralizes DB Time, correlating it with the CPU and breaking it down by the queries that contribute to it the most, allowing an SRE to see not only that the load is high, but which query is responsible for that load. A minimal way to sample this by hand is sketched right after this list.
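To make this concrete, here is a minimal PostgreSQL-flavored sketch of taking an active-session sample by hand; the once-per-second sampling cadence and any averaging around it are assumptions left to the reader, and other engines expose equivalent views (for example, MySQL’s information_schema.processlist or SQL Server’s sys.dm_exec_requests).

```sql
-- Count sessions that are currently active (running on CPU or waiting on a
-- resource). Sampled roughly once per second and averaged over an interval,
-- this approximates Average Active Sessions, i.e. DB Time per elapsed second.
SELECT count(*) AS active_sessions
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid();  -- ignore the monitoring session itself
```

If that average consistently sits above the number of vCPUs on the host, sessions are queuing for CPU and latency is already rising.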

2. Wait Events (or Wait Statistics)

If DB Time tells you how much the database is busy, Wait Events tell you what it is busy with. This is the most powerful diagnostic metric for understanding the nature of a performance bottleneck.

  • What are they? When a database session cannot proceed because it needs a resource that is not currently available (a data block from the disk, a lock held by another session, etc.), it enters a wait state and records a “Wait Event.” Modern databases track hundreds of types of wait events.
  • Why do they matter? The aggregated analysis of wait events reveals the personality of your bottleneck.
    • I/O Waits (e.g., IO:DataFileRead in PostgreSQL, PAGEIOLATCH_SH in SQL Server): If these are at the top, your system is “disk-bound.” Queries are demanding more data than can be served from memory (RAM), forcing slow reads from storage. The root cause could be a missing index, inefficient queries, or insufficient memory.
    • Lock Waits (e.g., LCK_M_X in SQL Server, enq: TX - row lock contention in Oracle): If these dominate, your problem is application contention. Multiple transactions are competing for the same rows, causing blocking and stalling processing. The cause could be slow UPDATE statements or a problematic transaction design.
    • CPU Waits (e.g., CPU or SOS_SCHEDULER_YIELD): If the main “wait” is for CPU time, it means your bottleneck is computational. The queries are complex and require a lot of processing. The solution is query optimization or, as a last resort, a hardware upgrade.
  • How it surpasses traditional metrics: An SRE looking at an I/O dashboard might see a ReadIOPS spike but has no idea whether that spike is healthy (many small, efficient reads) or problematic (a single query doing a massive scan). The analysis of Wait Events provides this context. dbsnOOp not only shows the main wait events but also connects them directly to the queries that are causing them, allowing the SRE to see that 90% of the I/O waits are being generated by query X, providing a direct path to resolution. A simple way to take this snapshot yourself is sketched right after this list.
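As a rough, PostgreSQL-flavored sketch, the query below aggregates what the currently active sessions are waiting on at the moment it runs; repeated samples build the wait profile described above. SQL Server keeps cumulative equivalents in sys.dm_os_wait_stats, and Oracle exposes them through Active Session History.

```sql
-- Snapshot of what active sessions are waiting on right now.
-- Repeated over time, the dominant wait_event_type tells you whether the
-- bottleneck is I/O, locking, or something else entirely.
SELECT wait_event_type,
       wait_event,
       count(*) AS sessions_waiting
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY sessions_waiting DESC;
```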

3. Buffer Cache Hit Ratio

This metric measures how efficiently your database is using its most precious and fastest resource: RAM.

  • What is it? The Buffer Cache (or Buffer Pool) is an area of RAM where the database stores data pages that have been recently read from the disk. When a query needs a piece of data, it first looks in the Buffer Cache. If the data is there (a “cache hit”), the operation is extremely fast. If it’s not (a “cache miss”), the database needs to perform a slow physical read from the disk. The Buffer Cache Hit Ratio is the percentage of times the data was found in memory versus the total number of requests.
  • Why does it matter? A high cache hit ratio (usually above 99% for OLTP systems) is a sign of a healthy, performant system. A consistently low or declining hit ratio is a red flag: it means that the “working set” (the set of data your application actively uses) does not fit in the memory allocated to the database. This can be caused by:
    • Insufficient Memory: The instance is simply too small for the workload.
    • Inefficient Queries: Massive Full Table Scans can “pollute” the cache, reading millions of irrelevant rows and forcing out useful, “hot” data, harming the performance of all other queries.
  • How it surpasses traditional metrics: The FreeableMemory metric only tells you how much RAM the operating system thinks is free. It says nothing about the quality of the database’s memory usage. A server can have little free memory (which is fine, since the database should use as much RAM as possible for its cache) and still have a high cache hit ratio; conversely, it can have plenty of free memory and a low hit ratio. The Buffer Cache Hit Ratio is the true measure of memory efficiency. dbsnOOp monitors this metric over time and correlates it with the queries, allowing the SRE to determine whether a drop in the hit ratio was caused by a change in the workload pattern or by the introduction of a new, inefficient query. A one-line way to compute the ratio is sketched right after this list.
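For reference, a minimal PostgreSQL sketch of the calculation is shown below; in MySQL/InnoDB the same ratio is derived from the Innodb_buffer_pool_read_requests and Innodb_buffer_pool_reads status variables, and SQL Server exposes it as the “Buffer cache hit ratio” performance counter.

```sql
-- Cumulative buffer cache hit ratio across all databases on the instance.
-- blks_hit counts reads served from the buffer cache; blks_read counts reads
-- that had to go to the operating system (and possibly the disk).
SELECT round(
         100.0 * sum(blks_hit) / nullif(sum(blks_hit) + sum(blks_read), 0),
         2
       ) AS cache_hit_pct
FROM pg_stat_database;
```

Because these counters accumulate since the last statistics reset, the meaningful signal is the delta between two samples, not the lifetime value.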

4. Index Usage (Scans vs. Seeks) and Unused Indexes

This is not a single metric, but a category of analysis focused on the health of your indexing strategy, the pillar of read performance.

  • What is it? Databases maintain statistics on how indexes are being used. The two most important operations are:
    • Index Seek: A highly efficient operation, where the database uses the index’s tree structure to navigate directly to the exact rows it needs.
    • Index Scan (or Full Index Scan): A less efficient operation, where the database reads the entire index from beginning to end. It’s faster than a Table Scan, but it can still be a sign of a poorly formulated query or an inadequate index.
      On the other hand, the database also tracks which indexes are never used.
  • Why does it matter?
    • Scans vs. Seeks: A high number of Index Scans relative to Seeks on a critical index indicates a problem. The query might be applying a function to the indexed column or using a LIKE predicate with a leading wildcard (e.g., LIKE '%text%'), either of which prevents a seek. Optimizing the query to allow a seek can yield performance gains of orders of magnitude.
    • Unused Indexes: This is a silent but expensive problem. Every index on a table adds overhead to all write operations (INSERT, UPDATE, DELETE). An index that is never used for reads is pure “dead weight”: it offers no performance benefit but slows down all your writes and consumes disk space.
  • How it surpasses traditional metrics: There is no traditional infrastructure metric that provides this view; it is a purely internal analysis of the database. An SRE can go years without realizing that 50% of the indexes on a critical table are useless and are actively harming the performance of INSERTs. Tools like dbsnOOp automate this analysis. They not only identify the queries that are causing inefficient scans but also provide clear reports of unused indexes, allowing for a safe “cleanup” that can dramatically improve write performance. A starting-point query for finding such indexes is sketched right after this list.
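As an illustration, the PostgreSQL sketch below lists never-used indexes ordered by the space they occupy; treat it as a starting point rather than a drop list, since statistics resets, replicas, and indexes that back unique constraints all deserve a second look. In SQL Server, sys.dm_db_index_usage_stats provides the analogous data, and its user_seeks versus user_scans columns also answer the seek-versus-scan question.

```sql
-- Indexes with zero scans since the last statistics reset, largest first.
-- idx_scan counts how many times the index has been used to read data.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```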

5. Transaction Log Generation / Throughput

This metric is often overlooked outside high-transaction environments, but it is a vital indicator of the “rate of change” of your data and a harbinger of problems.

  • What is it? The Transaction Log (or Write-Ahead Log – WAL) is a critical component where the database records all data modifications before they are written to the main data files. The log generation rate measures how many megabytes or gigabytes of log are being written per second or minute.
  • Why does it matter?
    • Indicator of Write Load: It is the most accurate measure of the intensity of your write workload.
    • Detection of Inefficient Operations: A batch UPDATE or DELETE operation that is written inefficiently (e.g., row by row in a loop, instead of a single set-based statement) can cause a storm of log generation. A sudden and unexpected spike in this metric is a clear sign that a new process or an inefficient query has started to modify data in bulk.
    • Impact on Replication and High Availability: In systems with read replicas or high availability configurations, the transaction log is what is sent to the replicas. An excessive log generation rate can saturate the network and cause “replication lag,” where the replicas become dangerously out of date with the primary.
  • How it surpasses traditional metrics: The WriteIOPS metric measures low-level disk operations but does not distinguish log writes from other writes. The log generation rate is specific to data change, providing direct insight into the impact of your write operations. dbsnOOp monitors the log generation rate and, when a spike occurs, can correlate it with the modification queries (INSERT, UPDATE, DELETE) that were active at that moment, pointing the SRE to exactly which process caused the “log storm.” A manual way to measure the rate is sketched right after this list.
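A hedged, PostgreSQL-flavored way to measure the rate by hand is sketched below: capture the current WAL position, wait, and compute the byte difference. The 60-second window and the literal LSN value are placeholders for illustration only; SQL Server exposes comparable signals through the “Log Bytes Flushed/sec” performance counter.

```sql
-- Step 1: record the current write-ahead log (WAL) position.
SELECT pg_current_wal_lsn();
-- Suppose it returns 0/3000060 (placeholder value for this example).

-- Step 2: roughly 60 seconds later, measure how many bytes of WAL were
-- written in the meantime. Replace the literal with the value from step 1.
SELECT pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), '0/3000060'::pg_lsn)
       ) AS wal_generated_in_window;
```

Dividing that figure by the window length gives the generation rate; a sudden jump between samples is the “log storm” described above.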

From SRE to DBRE

The discipline of SRE is built on a foundation of data and precise measurement. By expanding your focus from infrastructure metrics to these five workload-focused metrics, you evolve your analytical capabilities. You cease to be just a Site Reliability Engineer and start becoming a Database Reliability Engineer (DBRE). You stop asking “is the server slow?” and start asking “what is the server waiting for?”

You stop treating the symptoms with more hardware and start curing the disease by optimizing the software. This change in perspective, enabled by an observability platform that provides the necessary context, is what transforms incident management from a firefighting exercise into a discipline of precision engineering.

Want to go beyond the CPU and get a real view of your database’s performance? Schedule a meeting with our specialist or watch a live demo!

To schedule a conversation with one of our specialists, visit our website. If you prefer to see the tool in action, watch a free demo. Stay up to date with our tips and news by following our YouTube channel and our LinkedIn page.

