

The performance degradation of a database is rarely a sudden event. It is a slow and insidious process, a death by a thousand cuts, where small inefficiencies accumulate over time. Queries start to take a few milliseconds longer, table “bloat” grows silently, and CPU utilization rises from 15% to 30%, and then to 50%. None of these isolated changes are large enough to trigger a traditional monitoring alert. The system appears “healthy” until, under the load of a traffic spike or during a critical batch process, it collapses.
The truth is that the system was not healthy; it had been sick for months, but no one was checking the right vital signs. Just monitoring CPU and memory is insufficient. To ensure reliability and efficiency, engineering teams need to adopt a proactive approach: the regular performance audit, or “health check.”
This is not a weeks-long deep dive, but a focused and periodic audit of the pillars that support the database’s health. This checklist presents 12 essential and actionable points for SREs and DBAs to perform an effective health check, identify problems before they become crises, and keep their data systems operating at peak performance.
Section 1: Workload Analysis (What is the Database Doing?)
The starting point of any performance audit is not the infrastructure, but the work it is being forced to perform.
1. Identify the Top 10 Most Expensive Queries
The Pareto Principle applies aggressively to databases: 80% of your resources are likely consumed by less than 20% of your queries. Identifying this small set of offenders is the highest-impact optimization you can make.
- How to Do It: Use the pg_stat_statements extension in PostgreSQL or the equivalent Dynamic Management Views (DMVs) in SQL Server to rank queries by total_time (total_exec_time in PostgreSQL 13+). Ranking by average time alone hides the “fast and frequent” queries, so the metric that matters is the total load (mean_time * calls), which is exactly what the total execution time captures (a query sketch follows this item).
- The Smart Approach: An observability platform like dbsnOOp automates this. Its main dashboard already displays the ranking of queries by “DB Time” (the total load), eliminating the need for manual queries and providing a clear starting point for your investigation.
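For reference, here is a minimal sketch of that ranking in PostgreSQL, assuming the pg_stat_statements extension is installed and you are on version 13 or later (where the column is total_exec_time; older versions use total_time):

```sql
-- Top 10 queries by total execution time (the overall load they place on the server)
SELECT query,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```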
2. Check for Missing Indexes (Analysis of Seq Scans)
Once you have the list of the most expensive queries, check them for the most common root cause of slowness: a full table scan (Seq Scan in PostgreSQL).
- How to Do It: For each query on your Top 10 list, generate its execution plan with EXPLAIN ANALYZE and look for Seq Scan operations on large tables (see the example after this item). If you find one, and the query’s WHERE clause is selective, this is a strong sign of a missing index.
- The Smart Approach: dbsnOOp not only identifies the slow query but also captures and analyzes its execution plan automatically, highlighting the problematic Seq Scans and, in many cases, recommending the exact CREATE INDEX command to solve the problem.
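A minimal example of what to look for, using a hypothetical orders table and customer_id column as placeholders for your own schema:

```sql
-- Hypothetical example: "orders" and "customer_id" are placeholders
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;

-- If the plan shows "Seq Scan on orders" with a large "Rows Removed by Filter",
-- a targeted index is usually the fix (CONCURRENTLY avoids blocking writes):
-- CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
```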
3. Audit for Unused Indexes
The opposite problem is equally harmful: indexes that are never used for reads are “dead weight” that penalizes every INSERT, UPDATE, and DELETE operation.
- How to Do It: Use the system views (pg_stat_user_indexes in PostgreSQL or sys.dm_db_index_usage_stats in SQL Server) to find indexes with zero reads (idx_scan = 0, or user_seeks + user_scans + user_lookups = 0); a sample query follows below. Be sure to collect this data over a full business cycle so you don’t remove a rarely used but critical index.
- The Smart Approach: The historical analysis of dbsnOOp provides a high-confidence report on unused indexes, allowing for a safe cleanup that can dramatically improve write performance.
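A sketch of this check for PostgreSQL. It relies only on the statistics collected since the last reset, which is why observing a full business cycle matters; unique indexes are excluded because they enforce constraints even if they are never read:

```sql
-- Indexes with zero scans since the statistics were last reset (PostgreSQL)
SELECT s.schemaname,
       s.relname       AS table_name,
       s.indexrelname  AS index_name,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique   -- unique indexes enforce constraints, keep them
ORDER BY pg_relation_size(s.indexrelid) DESC;
```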
Section 2: Physical Health Check and Maintenance (Is the Database in Good Shape?)
An optimized workload running on a poorly maintained database will still have poor performance.
4. Measure Table and Index “Bloat” (Especially in PostgreSQL)
Due to PostgreSQL’s MVCC mechanism, UPDATEs and DELETEs leave behind “dead” versions of rows. If the VACUUM process cannot clean them up fast enough, the table and its indexes become “bloated,” consuming more disk space and making scans slower.
- How to Do It: Use the well-known bloat-estimation scripts from the PostgreSQL community to estimate the percentage of “bloat” in your most active tables and indexes (a sketch using the pgstattuple extension follows this item). Bloat above 20-30% is a sign to act. Tools like pg_repack are essential for removing bloat online, without downtime.
- The Smart Approach: dbsnOOp can help identify the symptoms of bloat, such as a gradual degradation in the performance of previously fast queries, even with no changes to the execution plan.
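If an exact measurement is preferable to the community estimation scripts, the pgstattuple contrib extension is one option; note that it reads the entire relation, so run it off-peak. The table name below is a placeholder:

```sql
-- Exact dead-space measurement with pgstattuple (reads the whole table; run off-peak)
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT pg_size_pretty(table_len) AS total_size,
       dead_tuple_percent,
       free_percent
FROM pgstattuple('public.orders');  -- 'public.orders' is a placeholder
```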

5. Check the Health of Autovacuum (PostgreSQL)
Autovacuum is PostgreSQL’s cleanup system and is, perhaps, the most critical component for the long-term health of an instance.
- How to Do It: Check that autovacuum is running and look for tables with a high number of “dead tuples” that are not being cleaned up (see the query below). Monitor the database’s “age” to avoid the dangerous transaction ID wraparound, an event that can force a shutdown for maintenance; SELECT datname, age(datfrozenxid) FROM pg_database; is a crucial query for this.
- The Smart Approach: dbsnOOp can alert on anomalies that are symptoms of an inefficient autovacuum, such as a sudden increase in the dead tuple count.
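A simple sketch for spotting tables where dead tuples are piling up, based on PostgreSQL’s standard statistics views:

```sql
-- Tables with the most dead tuples and when autovacuum last touched them
SELECT relname,
       n_dead_tup,
       n_live_tup,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```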
6. Ensure Optimizer Statistics Are Up to Date
The query optimizer decides the execution plan based on statistics about the data distribution. If these statistics are outdated, it will make bad decisions.
- How to Do It: Check the last_autoanalyze column in the pg_stat_user_tables view (PostgreSQL) to ensure the ANALYZE process is running regularly (an example query follows this item). If you see massive discrepancies between the estimated and actual row counts in an EXPLAIN ANALYZE output, that is a clear sign of stale statistics.
- The Smart Approach: dbsnOOp visualizes this discrepancy between estimated and actual rows, making it obvious when the optimizer is “flying blind.”
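A quick sketch to see which tables have gone the longest without being analyzed, again using pg_stat_user_tables:

```sql
-- Tables never analyzed (or not recently), plus modifications since the last ANALYZE
SELECT relname,
       last_analyze,
       last_autoanalyze,
       n_mod_since_analyze
FROM pg_stat_user_tables
ORDER BY last_autoanalyze NULLS FIRST
LIMIT 20;
```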
Section 3: Resource Configuration (Are the Parameters Correct?)
The best query in the world will perform poorly on a misconfigured database.
7. Audit Critical Memory Settings
Memory allocation is the most important performance configuration.
- How to Do It: For PostgreSQL, review shared_buffers (commonly around 25% of system RAM), work_mem (increase with caution for queries with heavy sorts and joins), and maintenance_work_mem; a query to inspect the current values follows this list. For SQL Server, check the max server memory setting.
- The Smart Approach: A tool like dbsnOOp shows the symptoms of a bad memory configuration, such as a high volume of I/O waits (indicating a small shared_buffers) or Hash Spills in the execution plans (indicating insufficient work_mem).
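To see the values currently in effect in PostgreSQL, a simple query against pg_settings is enough:

```sql
-- Current values of the key memory parameters (PostgreSQL)
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem',
               'maintenance_work_mem', 'effective_cache_size');
```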
8. Analyze the Buffer Cache Hit Ratio
This metric indicates the percentage of data reads that were satisfied from memory (cache) instead of the disk.
- How to Do It: Query your database’s system views (see the sketch below). A hit ratio consistently below 99% for a transactional (OLTP) system is a warning sign.
- What It Means: It’s not a goal to be achieved, but a diagnostic indicator. A low rate means your workload is “I/O bound.” The causes could be insufficient memory or, more likely, inefficient queries that are forcing massive reads from the disk.
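A sketch of the calculation for PostgreSQL, per database, using pg_stat_database:

```sql
-- Buffer cache hit ratio per database; values below ~99% on an OLTP system deserve a look
SELECT datname,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_database
WHERE datname IS NOT NULL
ORDER BY cache_hit_pct;
```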
9. Review the Connection Pooling Strategy
Opening and closing database connections is an expensive operation. An excessive number of active connections can exhaust memory and degrade performance.
- How to Do It: Connection pooling should be done outside the database. Check whether your application uses a robust pooler (such as HikariCP for Java) or, for PostgreSQL, whether an external pooler like PgBouncer is in place to manage a large number of client connections. Then check the number of active and idle connections (an example breakdown query follows below).
- What It Means: Errors like “failed to acquire connection” or a very high number of connections in the idle in transaction state are signs of a pooling problem.
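A sketch for PostgreSQL that breaks down current connections by state; a pile of “idle in transaction” sessions is the classic red flag:

```sql
-- Connection count by state; many "idle in transaction" sessions point to a pooling problem
SELECT state, count(*) AS connections
FROM pg_stat_activity
WHERE datname IS NOT NULL   -- ignore background processes not attached to a database
GROUP BY state
ORDER BY connections DESC;
```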
10. Evaluate Storage Performance and IOPS
The performance of your underlying disk is crucial.
- How to Do It: Use the metrics from your cloud provider (e.g., Read/Write Latency, Disk Queue Depth, Burst Balance in AWS EBS).
- What It Means: Remember that high IOPS utilization is often a symptom of an inefficient read workload (see point #2 above), not the root cause. Before paying for a more expensive storage tier, optimize your queries to reduce the I/O demand.
Section 4: Reliability and Best Practices (Is the System Resilient?)
11. Validate the Backup and Recovery Strategy
An untested backup is just a hope.
- How to Do It: Check that automated backups are enabled, that retention is adequate, and that point-in-time recovery (PITR) is configured. More importantly, schedule regular restore tests to confirm that you can, in fact, restore the data, and measure the actual recovery time against your Recovery Time Objective (RTO).
- What It Means: The ability to recover from a disaster or human error (like a DELETE without a WHERE) is a pillar of reliability.
12. Look for Long-Running Transactions and Idle Connections
A transaction left open can wreak havoc.
- How to Do It: Query pg_stat_activity (PostgreSQL) or sys.dm_exec_sessions (SQL Server) for sessions that are in the idle in transaction state or that have a very old xact_start (a sample query follows below).
- What It Means: A long-open transaction can hold locks, preventing other operations, and, crucially, can prevent VACUUM from cleaning up dead tuples, leading to massive bloat.
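A PostgreSQL sketch that surfaces transactions open for more than five minutes; the threshold is an arbitrary placeholder to tune for your workload:

```sql
-- Sessions whose current transaction has been open for more than 5 minutes
SELECT pid,
       usename,
       state,
       now() - xact_start AS transaction_age,
       left(query, 80)    AS last_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_start;
```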
Towards Proactivity
Performing this 12-point health check regularly transforms performance management from a reactive, crisis-driven discipline into a proactive, continuous improvement process. It provides the data needed to find the “known unknowns”: the problems that are silently growing in your environment.
An observability platform like dbsnOOp does not replace the need for this process, but it drastically accelerates it, automating the data collection, correlating the symptoms with the causes, and allowing your team to spend less time diagnosing and more time optimizing.
Want to automate your database health check and get deep insights in minutes? Schedule a meeting with our specialist or watch a live demo!
Schedule a demo here.
Learn more about dbsnOOp!
Learn about database monitoring with advanced tools here.
Visit our YouTube channel to learn about the platform and watch tutorials.

Recommended Reading
- The report that has already saved millions for companies like yours: This article technically details how workload diagnosis translates into a massive ROI, connecting query optimization to the direct reduction of cloud costs, the decrease in engineering time spent on troubleshooting, and the recovery of revenue lost to latency.
- Why relying only on monitoring is risky without a technical assessment: Explore the critical difference between passive monitoring, which only observes symptoms, and a deep technical assessment, which investigates the root cause of problems. The text addresses the risks of operating with a false sense of security based solely on monitoring dashboards.
- Your database might be sick (and you haven’t even noticed): Discover the signs of chronic and silent problems that don’t trigger obvious alerts but that degrade performance and stability over time. The article focuses on the need for diagnostics that go beyond superficial metrics to find the true health of your data environment.