Locks, Contention, and Performance: A Technical Study on the Recovery of Databases in Critical Situations

November 5, 2025 | by dbsnoop


A recent performance incident in a SQL database brought to light one of the most complex problems in data engineering: the cascading impact caused by prolonged locks. In a 45-minute interval, an entire environment suffered severe slowness, failures in critical APIs, and resource saturation. Detailed analysis revealed a dangerous combination of concurrent queries, poorly tuned indexes, and long transactions—a perfect storm that affected availability, throughput, and stability.

Understanding the Scenario: The Chain Collapse

Log analysis showed a series of critical metrics during the incident:

  • Batches per second: critical
  • Data usage: 74%
  • Long locks: 8%
  • Memory usage: 70%
  • Simultaneous locks: 1,042

This set of signals indicated severe contention. Although the volume of requests did not increase, hundreds of connections were stuck waiting for locked resources to be released. This is the classic lock chain condition: multiple transactions compete for the same set of records, each blocking the progress of the others.
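The later analysis points to SQL Server, where a blocking chain like this can be inspected directly from the dynamic management views. The query below is a minimal sketch of that kind of check, not the exact query used during the incident:

    -- Sessions that are currently blocked, who is blocking them, and what they are running
    SELECT
        r.session_id,
        r.blocking_session_id,
        r.wait_type,
        r.wait_time / 1000.0 AS wait_seconds,
        t.text               AS current_statement
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.blocking_session_id <> 0
    ORDER BY r.wait_time DESC;

Sessions that keep reappearing as blocking_session_id are the head of the chain and the first place to look.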

From the availability traces and execution graphs, it was possible to identify that the stuck connections accumulated mainly between 11:30 a.m. and 12:15 p.m., coinciding with scheduled routines – likely automatic jobs or batch executions. The result was the complete degradation of the order service, affecting both the backend and the integration APIs.

Origin of a Critical Lock

The situational analysis revealed lock spikes at two distinct times: at 11:30 a.m. and at 9:20 p.m. These times suggest that scheduled processes were in direct conflict with write operations in production. Threads 119 and 190 were identified as the main blockers, holding active locks for up to 120 seconds – an extremely long time for any mission-critical system.

These threads were associated with UPDATE and INSERT commands on a central order table, which generated a blocking chain. The problem was amplified by a high number of simultaneous connections originating from the same host, possibly a local application running multiple SQL sessions or an internal job without concurrency control.
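The "many connections from the same host" hypothesis is easy to check with a session count per client, as in this sketch (again assuming SQL Server):

    -- User sessions grouped by client host and program, to spot an application opening too many connections
    SELECT
        s.host_name,
        s.program_name,
        COUNT(*) AS session_count
    FROM sys.dm_exec_sessions AS s
    WHERE s.is_user_process = 1
    GROUP BY s.host_name, s.program_name
    ORDER BY session_count DESC;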

Detailed Performance Analysis

The analysis of the availability trace showed that the problem was concentrated between 11:30 a.m. and 12:15 p.m., coinciding with the execution windows of automatic jobs. These tasks, responsible for data ingestion and normalization, performed heavy operations on the same tables used by real-time APIs.

Threads 119 and 190 were identified as the main blockers, holding active locks for up to 120 seconds, a critical time for any ingestion environment. During this interval, hundreds of connections were stuck waiting for release. The situation worsened because the jobs lacked proper concurrency control, opening multiple simultaneous SQL sessions on the same set of records.

From the analysis, it was possible to map the conflict pattern (reproduced in the sketch after this list):

  1. An ingestion job started a high-volume transaction with multiple UPDATEs.
  2. While the operation was still in progress, new INSERTs and SELECTs were triggered by the pipeline.
  3. The transaction system blocked the resources until the first operation was completed.
  4. Subsequent requests were left pending, creating a lock queue.
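A minimal reproduction of this pattern, using a hypothetical dbo.Orders table and made-up columns in place of the real order table, looks like this:

    -- Session 1: the ingestion job opens a large transaction and holds it
    BEGIN TRANSACTION;
    UPDATE dbo.Orders                        -- hypothetical table and columns
    SET status = 'NORMALIZED'
    WHERE created_at < '2025-11-05';         -- wide range: many row and page locks
    -- ... normalization work continues, no COMMIT yet ...

    -- Session 2: meanwhile, the real-time API touches the same rows
    INSERT INTO dbo.Orders (customer_id, status) VALUES (42, 'NEW');  -- may wait if locks escalated
    SELECT * FROM dbo.Orders WHERE customer_id = 42;                  -- waits for session 1 under READ COMMITTED

    -- Session 1, much later:
    COMMIT;  -- only now are the locks released and the queued requests proceed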

This chain of dependencies caused the database to enter a state of critical contention, characterized by high latency and partial unavailability.

Technical Causes of the Incident

The investigation of the query behavior revealed that:

  • The main table involved had a high volume of write operations.
  • Several UPDATEs had an average duration of 30 seconds and peaks of up to 7 minutes.
  • The INSERTs were executed in large batches, increasing the blocking time per transaction.
  • There was an accumulation of redundant indexes, raising the cost of updates and writes.

This combination created the ideal scenario for contention. When multiple UPDATEs try to alter the same rows or data ranges, SQL Server must preserve transactional consistency and blocks the competing transactions until the first one completes. The impact compounds quickly when there is an excess of simultaneous connections and jobs running in parallel.
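One of the causes listed above, the accumulation of redundant indexes, can be confirmed from the index usage statistics. The sketch below (current database only, threshold purely illustrative) surfaces indexes that are written far more often than they are read:

    -- Non-unique indexes that are mostly maintained and rarely used: candidates for review
    SELECT
        OBJECT_NAME(us.object_id) AS table_name,
        i.name                    AS index_name,
        us.user_updates,
        us.user_seeks + us.user_scans + us.user_lookups AS total_reads
    FROM sys.dm_db_index_usage_stats AS us
    JOIN sys.indexes AS i
        ON i.object_id = us.object_id
       AND i.index_id  = us.index_id
    WHERE us.database_id = DB_ID()
      AND i.is_primary_key = 0
      AND i.is_unique = 0
      AND us.user_updates > 10 * (us.user_seeks + us.user_scans + us.user_lookups)
    ORDER BY us.user_updates DESC;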


The Role of Observability and AI in Detection

The decisive point of the recovery came when dbsnOOp identified the anomalous behavior. Through continuous analysis of metrics and logs, the AI correlated the symptoms (increased locks, a drop in throughput, and memory saturation) with patterns from previous incidents.

The AI tuning engine automatically detected the queries with the longest blocking times and classified the incident as a write contention pattern. From this, it generated a set of immediate recommendations:

  • Temporary suspension of concurrent ingestion jobs.
  • Identification of the blocking threads (119 and 190).
  • Suggested rewrite for UPDATEs with more selective filters.
  • Reduction of INSERT batch sizes to optimize commits.
  • Dynamic adjustment of long transaction timeouts (see the sketch after this list).

Based on these recommendations, the environment was stabilized in minutes. The automated action allowed for the isolation of critical queries, the release of locks, and the restoration of the normal ingestion flow. The business impact, which could have lasted for hours, was quickly contained.

Diagnosis and Analysis Methodology

The analysis was conducted in three phases:

  1. Understanding the scenario: Collection of real-time metrics and comparison with stable periods, highlighting peaks in CPU, memory, and connection usage.
  2. Locating the problem: Sorting locks by duration, identifying blocking threads, and correlating them with specific users and hosts (a query sketch follows this list).
  3. Path to resolution: Defining the critical queries, isolating long transactions, and reviewing the data modeling of the affected table.
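In practice, phase 2 boils down to a query like the following sketch, which sorts blocked requests by wait time and ties each blocker back to a login and host:

    -- Blocked requests ordered by wait time, with the blocker's login and host
    SELECT
        r.session_id          AS blocked_session,
        r.blocking_session_id AS blocking_session,
        r.wait_time / 1000.0  AS wait_seconds,
        bs.login_name         AS blocking_login,
        bs.host_name          AS blocking_host
    FROM sys.dm_exec_requests AS r
    JOIN sys.dm_exec_sessions AS bs
        ON bs.session_id = r.blocking_session_id
    WHERE r.blocking_session_id <> 0
    ORDER BY r.wait_time DESC;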

This methodology revealed the direct impact of heavy UPDATEs on performance. The slowest operations were prioritized for optimization, especially those associated with threads 119 and 190. Additionally, a review of the index model and the division of transactions into smaller batches were recommended to reduce contention.

Mitigation and Recovery Strategies

Recovering from such an incident requires coordinated actions at multiple levels: code, database, and operation. Among the recommended measures are:

  1. Review of Critical UPDATEs: Commands with the longest blocking times should be rewritten or optimized. Strategies include:
    • Avoiding updates on wide ranges of data.
    • Using more selective filters in the WHERE clauses.
    • Dividing operations into smaller batches, reducing transaction time.
  2. Adjustment of Batch INSERTs: Insertion transactions should be broken down. Instead of inserting thousands of records in a single operation, it is recommended to process them in smaller groups (e.g., 500 records per commit), allowing for the gradual release of locks, as sketched after this list.
  3. Review of Indexes and Partitioning: One of the identified bottlenecks was the excess of indexes on the critical table. Although indexes speed up reads, they penalize write operations. Reducing the number of unnecessary indexes and adopting horizontal partitioning (sharding or range partitioning) can balance performance and consistency.
  4. Monitoring of Threads and Jobs: Continuous observation of the responsible threads and scheduled jobs is essential. Monitoring the behavior of threads 119 and 190, in particular, can anticipate similar incidents.
  5. Concurrency Control: Applications and services that execute multiple simultaneous connections should adopt mechanisms for execution queuing, controlled retries, and configurable timeouts to avoid saturating the connection pool.
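Items 1 and 2 come down to the same idea: touch fewer rows per transaction. A common T-SQL pattern for that, sketched here with the same hypothetical dbo.Orders table and an illustrative batch size, is a loop that commits every few hundred rows:

    -- Rewrite a wide UPDATE as a loop of small, quickly committed batches
    DECLARE @batch_size INT = 500;               -- illustrative batch size
    DECLARE @rows INT = 1;

    WHILE @rows > 0
    BEGIN
        BEGIN TRANSACTION;

        UPDATE TOP (@batch_size) dbo.Orders      -- hypothetical table and columns
        SET status = 'NORMALIZED'
        WHERE status = 'PENDING'
          AND created_at < '2025-11-05';         -- selective filter, per item 1

        SET @rows = @@ROWCOUNT;
        COMMIT;                                  -- locks are released after every small batch
    END

The same loop shape applies to batch INSERTs: stage the incoming data and move it in groups of roughly 500 rows per commit, so locks are released gradually instead of being held for the entire load.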

Technical Lessons from the Incident

The case analyzed reinforces a fundamental principle of database engineering: performance depends not only on infrastructure but also on the orchestration of code, transactions, and data modeling. Among the most important lessons learned are:

  • Locks are not errors; they are security mechanisms. The problem arises when the application’s design forces excessive contention.
  • Small adjustments to queries can have an exponential impact on overall performance.
  • Observability automation allows for the detection of patterns before they cause incidents.
  • Continuous learning from operational data helps the system predict and avoid collapse situations.

Best Practices to Avoid New Incidents

  • Periodically review critical transactions, simulating concurrency loads.
  • Avoid simultaneous jobs that manipulate the same tables.
  • Use historical metrics to define safe execution windows.
  • Implement intelligent observability, correlating locks, CPU consumption, and query latency.
  • Adopt proactive alerts, triggering the team before the blocking time exceeds critical limits.

Conclusion

The incident analyzed shows how a small deviation in the transaction structure can trigger a systemic collapse. Excessive contention on a single table was enough to impact multiple services and APIs, increasing response times and degrading the user experience.

The solution involves an integrated technical approach: optimizing queries, rethinking the data model, and adopting intelligent automation for observability and continuous tuning. Active monitoring, combined with automated optimization recommendations, transforms performance management from a reactive to a predictive process.

From this study, it is clear that the resilience of a modern database depends less on hardware and more on the ability to understand and adjust transactional behavior in real time.

Schedule a demo here.

Learn more about dbsnOOp!

Learn about database monitoring with advanced tools here.

Visit our YouTube channel to learn about the platform and watch tutorials.


Recommended Reading

  • The report that has already saved millions for companies like yours: This article technically details how workload diagnosis translates into a massive ROI, connecting query optimization to the direct reduction of cloud costs, the decrease in engineering time spent on troubleshooting, and the recovery of revenue lost to latency.
  • Why relying only on monitoring is risky without a technical assessment: Explore the critical difference between passive monitoring, which only observes symptoms, and a deep technical assessment, which investigates the root cause of problems. The text addresses the risks of operating with a false sense of security based solely on monitoring dashboards.
  • Your database might be sick (and you haven’t even noticed): Discover the signs of chronic and silent problems that don’t trigger obvious alerts but that degrade performance and stability over time. The article focuses on the need for diagnostics that go beyond superficial metrics to find the true health of your data environment.