Status: Planned
Categories: Bug Report
Created by: Guest
Created on: May 26, 2025

Possible "zombie" state for containers connected to rabbitmq

It has been observed that containers may occasionally enter a state where they:

  • Start up

  • Successfully connect to RabbitMQ

  • Receive a prefetch count

  • Then stall without processing any messages
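For reference, the lifecycle above roughly corresponds to a consumer along the lines of the sketch below. This is a minimal, assumed example (Python with pika; the actual service, queue name, prefetch value, and client library are not specified in this report). In the zombie state, everything up to consuming completes successfully, but the delivery callback is never invoked.

```python
import pika

QUEUE_NAME = "work-queue"      # hypothetical queue name
PREFETCH_COUNT = 10            # hypothetical prefetch value


def process(body: bytes) -> None:
    # Placeholder for the real message handler.
    print(f"processing {len(body)} bytes")


def on_message(channel, method, properties, body):
    # In the zombie state this callback never fires, even though the
    # connection and prefetch negotiation below completed successfully.
    process(body)
    channel.basic_ack(delivery_tag=method.delivery_tag)


def main():
    # 1. Start up and connect to RabbitMQ (succeeds per the logs).
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()

    # 2. Apply the prefetch count (also visible in the logs).
    channel.basic_qos(prefetch_count=PREFETCH_COUNT)

    # 3. Consume. Stalled containers stop here: no deliveries are
    #    processed and nothing further is logged until a restart.
    channel.basic_consume(queue=QUEUE_NAME, on_message_callback=on_message)
    channel.start_consuming()


if __name__ == "__main__":
    main()
```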

🔗 View logs in New Relic

In the linked query, you can see that four initialize containers were started as a result of KEDA scaling up in response to the increasing queue size.

However, these containers appear to stall immediately after initialization. They do not log any further activity or process any messages until they are restarted around 02:02. Notably, one of the containers was killed by KEDA once the queue size dropped below the scaling threshold, thanks to the remaining containers successfully processing messages.

Key points from the data:

  • The "Count" column in the query reflects how many log entries were recorded.

  • The stalled containers recorded 17 log entries during their lifecycle.

Previous Hypothesis

We previously suspected that this issue might be caused by RabbitMQ quorum queues not replicating correctly across all nodes. In particular, we theorized that if a leader node shift occurred while the quorum was incomplete, Link might be unable to handle the inconsistency, resulting in stalled behavior.

However, this theory now appears less likely.

Current Observation

During the timeframe of the issue, all RabbitMQ queues were fully replicated across all nodes. This suggests the root cause lies elsewhere.
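For completeness, the replication check behind this observation can be scripted against the RabbitMQ management HTTP API. The sketch below is an assumption about how such a check might look (Python with requests; the host, credentials, and the exact response fields such as `members`, `online`, and `leader` depend on the RabbitMQ version and setup) and is not taken from the report itself.

```python
import requests

# Hypothetical management API endpoint and credentials (%2F = default vhost "/").
MGMT_URL = "http://rabbitmq:15672/api/queues/%2F"
AUTH = ("guest", "guest")


def check_quorum_replication():
    """Report quorum queues whose members are not all online.

    Field names ("type", "members", "online", "leader") match the
    management API of recent RabbitMQ versions, but may differ.
    """
    queues = requests.get(MGMT_URL, auth=AUTH, timeout=10).json()
    for q in queues:
        if q.get("type") != "quorum":
            continue
        members = set(q.get("members", []))
        online = set(q.get("online", []))
        missing = members - online
        status = "fully replicated" if not missing else f"missing {sorted(missing)}"
        print(f"{q['name']}: leader={q.get('leader')} ({status})")


if __name__ == "__main__":
    check_quorum_replication()
```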
