Runbook: Airhost room not found
| Service | Channelku, PMS integration |
|---|---|
| Owner Team Slack Handle | @bnl-team-c |
| Team's Slack Channel | #bnl-pms-jp-notification |
Important Links
This section links to the most relevant resources for an engineer responding to this alert.
How to write this section: Add direct links to the most relevant resources for responding to this alert. Do not link to general homepages like Grafana or a broad log view.
Don'ts and Dos:
- Don't link to the Grafana homepage. Instead, link directly to the relevant dashboard.
- Don't use generic link text like "click here". Instead, describe what the link points to, like "Rollbar: High Volume Exceptions".
- Don't assume the on-call knows how to query logs from scratch. Instead, create and link to a pre-filtered log view (e.g., CloudWatch, Kibana) showing only this service's logs.
- Don't forget about dependencies. Instead, link to dashboards for critical dependencies like the Aurora DB and Redis.
- Don't make the engineer search for the code repo. Instead, link directly to it.
- Don't clutter this section with links to general documentation. Instead, keep it focused only on immediate diagnostic links.
Examples:
| Alert | [Link to the alert; you should always include this one] |
|---|---|
| Logs | [Link to this service/feature's logs; always include this one if applicable] |
| Dashboard | [Link to any relevant dashboards, for example the Redis memory dashboard or Aurora's database load dashboard] |
| Admin Dashboard | [Link to any admin dashboard, for example Sidekiq's admin dashboard] |
1. Triage
This section's goal is to quickly determine if this is an ongoing incident or a false alarm.
How to write: Start with a concise paragraph summarizing the checklist's goal. Follow with a fast, simple checklist an engineer can complete to evaluate if this is an ongoing incident or a false alarm.
Don'ts and Dos:
- Don't ask for complex analysis like "Analyze log patterns for anomalies." Instead, ask a direct, binary question that gives a strong yes/no signal, such as "Is the process running?"
- Don't just describe where to look or click in a UI. Instead, include screenshots that highlight the exact location.
- Don't give general instructions like "Check the health check endpoint." Instead, provide a copy-paste-ready command like `curl http://[service-name].[namespace].svc.cluster.local:8080/health_check`.
- Don't assume the engineer knows a command's normal output. Instead, provide the expected output (e.g., `{"status": "ok"}`).
- Don't assume the on-call knows what "normal" looks like. Instead, specify thresholds (e.g., "Is CPU usage > 80%?", "Is p99 latency above 500ms on the dashboard?").
- Don't assume the engineer knows about recent changes. Instead, direct them to a source of truth: "Check the [#deploys-slack-channel] for any production deploys in the last 30 minutes."
- Don't assume the engineer knows how to navigate the observability tool. Instead, provide deep links to the exact view needed, bypassing navigation or filtering steps.
- Don't provide commands with multiple variables to fill in. Instead, hardcode as much as possible (like namespace, labels) and only use placeholders for dynamic values, like `[pod-name]`.
- Don't include remediation steps in this section. Instead, keep this strictly for validation and assessment.
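The "copy-paste-ready command with expected output" guidance above can be sketched as a small script. The service URL is an assumption for illustration; the response is simulated so the sketch is self-contained.

```shell
# Hypothetical health check for this service (URL is an assumption).
# In production, replace the simulated response with the real call:
#   response=$(curl -s --max-time 5 http://channelku.pms.svc.cluster.local:8080/health_check)
response='{"status": "ok"}'

# Expected healthy output is {"status": "ok"}; anything else warrants escalation.
if printf '%s' "$response" | grep -q '"status": *"ok"'; then
  echo "health check passed"
else
  echo "health check FAILED: $response"
fi
```

Hardcoding the host and port (rather than leaving them as placeholders) is exactly what the don'ts above ask for.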
Example:
Quickly confirm if Redis is actively out of memory and impacting the application, or if this was a transient spike that has resolved.
1. Check Redis' live memory usage on the dashboard:
   - Open the Redis Grafana Dashboard: [Link to the specific Grafana Redis Dashboard]
   - Look at the "Database Memory Usage Percentage" panel. Is the value currently above the 90% alert threshold?
2. Check for key evictions:
   - On the same Grafana dashboard, press `Cmd+F` on macOS or `Ctrl+F` on Windows/Linux and search "Evicted Keys" to find the panel.
   - Is there a sharp, vertical spike in the last 15 minutes? A flat line or a count of zero is normal; a spike means Redis is actively deleting keys to make space.
   - (Example of what a problem looks like:) [Optional: Screenshot of a Grafana panel showing a spike in evicted keys]
3. Check Rollbar for application-level errors:
   - Open the Rollbar Items page for the application: [Link to Rollbar project, pre-filtered for 'redis']
   - Look for a new, high-occurrence "Item" that started in the last 15 minutes with an error class like `Redis::CommandError` and a message like `OOM command not allowed`.
4. Check Sidekiq for direct impact on background jobs:
   - Open the Sidekiq Web UI: [Link to Sidekiq UI]
   - Click the "Retries" tab. Are many jobs failing with a Redis-related error message (e.g., `OOM command not allowed` or `READONLY You can't write against a read only replica`)?
5. Check the underlying AWS ElastiCache node status:
   - Open the AWS Console for the ElastiCache Cluster: [Link to the specific Redis cluster in the AWS Console]
   - Verify the cluster's status is green and `available`. (This rules out an AWS-level issue.)
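When the dashboard is unavailable, the same memory check can be done from a console. This is a hedged sketch: the `INFO memory` sample is hardcoded so the snippet is self-contained, and the host placeholder is an assumption.

```shell
# Sketch: compute memory usage percentage from `redis-cli INFO memory` output.
# In production, replace the sample with the real call:
#   info=$(redis-cli -h [redis-host] INFO memory)
info=$'used_memory:955000000\nmaxmemory:1000000000'

# Extract the two fields the alert threshold is based on.
used=$(printf '%s\n' "$info" | awk -F: '/^used_memory:/ {print $2}')
max=$(printf '%s\n' "$info" | awk -F: '/^maxmemory:/ {print $2}')
pct=$(( used * 100 / max ))

# Compare against the 90% alert threshold from step 1.
echo "memory usage: ${pct}%"
```

With the sample values this reports 95%, i.e. above the 90% threshold, so triage would continue to the next checks.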
2. Decision Point
This section helps the engineer interpret the triage results and decide: is this a true incident or a false alarm? A false alarm means the alert triggered, but the system is not experiencing an incident.
How to write:
Write this as a simple, logical IF... THEN... statement that is impossible to misinterpret.
- Structure: Use a simple conditional. The `IF` condition must reference the Triage checklist outcomes. The `THEN` action must be a direct link to the next appropriate runbook section.
- Formatting: Use bold text, headers, and emojis (like 🛑 or ➡️) to make the section visually distinct and easy to scan. Use Markdown anchor links (`[Link Text](#-section-header)`) to make navigation instant.
- Tone: Be imperative and absolute. Avoid words like "maybe" or "consider." Make the decision for the engineer based on the gathered facts.
Don'ts and Dos:
- Don't write "If you think there's a problem..." Instead, write "IF the error rate is above 1% on the dashboard..."
- Don't make the engineer scroll. Instead, provide a direct Markdown anchor link to the next step (e.g., `[Go to True Incident](#4-true-incident)`).
- Don't mix the decision with the action. Instead, separate them: "IF X is true, THEN GO TO Y".
- Don't just write text. Instead, use formatting (bold, headers, links) to make the paths clear.
- Don't include diagnostic or action steps here. Instead, keep this section for direction only.
- Don't use "maybe," "possibly," or "probably." Instead, be definitive.
- Don't assume the conditions are obvious. Instead, explicitly reference the triage checks.
Example:
- IF all triage checks passed (the error rate on Grafana is back to normal, Rollbar shows no new errors, and all pods are `Running`)...
  - ➡️ Go to: False Alarm
- IF any triage check failed (the error rate is still high, there's a new error spike in Rollbar, or pods are crashing)...
  - ➡️ Go to: True Incident
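The "impossible to misinterpret" bar above is easiest to hit when the decision could literally be written as code. This sketch encodes the Redis example's conditions; the 90% threshold and the two inputs are assumptions taken from that example.

```shell
# Sketch of the decision point as a pure function: triage facts in, section out.
decide() {
  local usage_pct="$1" new_rollbar_errors="$2"
  if [ "$usage_pct" -gt 90 ] || [ "$new_rollbar_errors" -gt 0 ]; then
    echo "TRUE_INCIDENT"   # go to section 4
  else
    echo "FALSE_ALARM"     # go to section 3
  fi
}

decide 88 0   # usage back under threshold, no new errors
decide 91 0   # usage still above threshold
```

If your IF/THEN prose cannot be translated into a function like this, it is probably too vague for an on-call engineer at 3 a.m.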
3. False Alarm
A false alarm is not a "close and ignore" event. A noisy alert can mask a deteriorating situation, as a triggered monitor will not re-fire if conditions worsen. This section must provide a clear protocol to safely handle the alert. The on-call engineer needs instructions to address the immediate noise, such as:
- Keeping the monitor triggered while actively watching.
- Changing the threshold so the monitor recovers and can re-trigger.
- Muting the monitor temporarily to prevent fatigue if it is flapping.
This section must also define how to create a follow-up task to investigate the root cause of the noise and prevent a future incident.
How to write:
This section is a guide, not a rigid script, as every alert is unique. Your instructions must answer the core question: What is the immediate action plan for the triggered monitor?
If the correct action is not obvious, you may choose one of these three options:
- Keep the monitor triggered and actively monitor the situation in case it deteriorates further.
- Change the monitor's threshold to allow the monitor to recover so it can re-trigger if the condition worsens.
- Mute the monitor temporarily to prevent alert noise if the monitor is flapping between triggered and recovered states.
For the chosen option:
- List the exact, step-by-step instructions. Include direct links to the alerting tool, copy-pasteable commands, and the specific buttons to click.
- Provide a pre-written Slack message for the on-call engineer to post, informing the team of the situation and the action they took.
- Detail how to create a permanent follow-up task, such as a GitHub Issue or Jira ticket. Include a template with pre-filled context to make this as easy as possible.
Don'ts and Dos:
- Don't leave the on-call to guess why an alert is noisy. Instead, (if known) add a note explaining common causes for this specific false positive (e.g., "This can trigger during nightly data warehouse syncs").
- Don't just write "Silence the alert." Instead, provide a direct link to the alert and specify a safe duration, like `Silence for 1 hour`.
- Don't write vague instructions like "keep an eye on the dashboard." Instead, provide a specific, numbered list of actions, such as "1. Mute the alert for 2 hours. 2. Post a status update in Slack."
- Don't suggest the engineer make changes silently. Instead, include a final step instructing them to communicate the action and reason in the team's Slack channel.
- Don't just say "tell the team." Instead, provide a copy-pasteable Slack message template that clearly communicates the alert's status and the action taken.
- Don't write "Create a ticket to investigate." Instead, provide a ticket template with pre-filled context (alert name, time, links) and tag the appropriate owner team (`@platform-team`).
Example:
The Redis memory alert triggered at 91% but has since dropped to 88%. While there is no active incident, memory usage is critically high and poses an immediate risk of a service-degrading OOM event. The following protocol provides breathing room and initiates a proper fix.
1. Immediately upscale the Redis ElastiCache instance to the next size. This provides critical memory headroom and prevents a potential incident.
   - Navigate to the AWS ElastiCache Console: [Link to the specific Redis cluster]
   - Click Modify.
   - In the Node type dropdown, select the next larger instance size (e.g., from `cache.m5.large` to `cache.m5.xlarge`).
   - Choose Apply immediately and confirm the modification.
2. Inform the team of the action taken. Post the following message in the `#product-team-badass` Slack channel: "Heads up: The Redis memory alert triggered at 91%. To prevent an imminent OOM incident, I have upscaled the ElastiCache instance to the next size. This is a temporary fix; I am creating a ticket to investigate the root cause."
3. Create a Jira ticket to investigate the source of the memory increase. This is not a monitoring issue; it is a potential application-level problem.
   - Create the ticket here: [Link to create a Jira ticket]
   - Use the title: `RCA: High Redis Memory Usage Leading to Proactive Upscale - [Date]`
   - In the ticket description, link to the relevant Grafana dashboard showing the memory trend.
   - Paste the link to the new ticket into the Slack thread from the previous step.
4. Start a conversation with the responsible team to resolve the issue. In the same Slack thread, tag the owning team's handle (`@team-a-devs`) to hand off the investigation. Use a message like this: "@team-a-devs, can you please take ownership of the linked Jira ticket? The upscaling provides a temporary buffer, but we need your team to investigate the application logic and identify the root cause of the increased memory consumption."
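One way to make the "ticket template with pre-filled context" step frictionless is a tiny script the on-call can paste into a terminal. Everything here is a placeholder or assumption; fill the values from the actual alert before filing.

```shell
# Sketch: generate a pre-filled follow-up ticket body for the on-call to paste.
# All values are placeholders/assumptions; substitute the real alert details.
alert_name="Redis Memory Usage > 90%"
triggered_at="[timestamp from the alert]"
dashboard="[Link to the specific Grafana Redis Dashboard]"

ticket_body=$(cat <<EOF
Title: RCA: High Redis Memory Usage Leading to Proactive Upscale - $(date +%F)
Alert: ${alert_name}
Triggered at: ${triggered_at}
Dashboard: ${dashboard}
Action taken: upscaled the ElastiCache instance one size as a temporary buffer.
EOF
)
printf '%s\n' "$ticket_body"
```

The point of the script is not automation; it is that the on-call never has to compose ticket context from memory.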
4. True Incident
When a true incident occurs, the first priority is to recover the system and restore service. Once stable, secondary priorities are assessing the blast radius, communicating the impact, and performing cleanup (e.g., re-running failed jobs, remediating data).
4.1. Recover the System
Find the root cause and apply a fix to restore service.
How to write:
Structure this as a list of potential root causes, ordered from most likely/easiest to investigate to least likely/hardest.
For each potential cause, provide three distinct subsections:
- Diagnostic Steps: A checklist to confirm if this is the root cause. This must include specific commands, queries, or deep links to observability tools.
- Remediation Plan: A step-by-step guide to fix the issue. This should be a direct, imperative list of actions.
- Verification: A final checklist to confirm that the remediation was successful and the system is stable.
Don'ts and Dos:
- Don't combine diagnosis and remediation into one list. Instead, have separate "Diagnostic Steps" that link to "Remediation Steps".
- Don't ask for complex analysis like "Analyze log patterns for anomalies." Instead, ask a direct, binary question with a strong yes/no signal, such as "Is the process running?"
- Don't just describe where to look or click in a UI. Instead, include screenshots that highlight the exact location.
- Don't give general instructions like "Check the health check endpoint." Instead, provide a copy-paste-ready command like `curl http://[service-name].[namespace].svc.cluster.local:8080/health_check`.
- Don't assume the engineer knows a command's normal output. Instead, provide the expected output (e.g., `{"status": "ok"}`).
- Don't assume the on-call knows what "normal" looks like. Instead, specify thresholds (e.g., "Is CPU usage > 80%?", "Is p99 latency above 500ms on the dashboard?").
- Don't assume the engineer knows about recent changes. Instead, direct them to a source of truth: "Check the [#deploys-slack-channel] for any production deploys in the last 30 minutes."
- Don't assume the engineer knows how to navigate the required tools. Instead, provide deep links to the exact view needed, bypassing navigation or filtering steps.
- Don't provide commands with multiple variables to fill in. Instead, hardcode as much as possible (like namespace, labels) and only use placeholders for dynamic values, like `[pod-name]`.
Example:
This incident is most often caused by a recent deployment introducing a memory leak or by organic growth finally exceeding instance capacity.
Potential Cause 1: Recent Bad Deployment
A recent code change is inefficiently using Redis, causing a rapid increase in memory usage.
Diagnostic Steps:
- Check the `[#deploys-slack-channel]` for any production deployments to this service in the last 3 hours.
- Open the Redis Grafana Dashboard: [Link to the specific Grafana Redis Dashboard]
- Compare the deployment timestamp from Slack with the "Database Memory Usage Percentage" panel. Does the memory usage begin a sharp, sustained climb immediately after the deployment?

Remediation Plan:
- Initiate a rollback to the previous stable version using the deploy pipeline: [Link to deployment pipeline with rollback button]
- Announce the rollback in the `#product-team-badass` channel with the message: "Incident: Redis OOM. Rolling back the latest deployment for `[service-name]` as it correlates with a sharp increase in memory usage."

Verification:
- Monitor the Redis Grafana Dashboard. Confirm that memory usage is steadily decreasing toward its pre-deployment baseline.
- Check Rollbar for Redis `OOM` errors: [Link to Rollbar filtered for Redis errors]. Confirm the error count drops to zero within 5 minutes of the rollback completing.
Potential Cause 2: Organic Growth Exceeded Capacity
The service has naturally grown to the point where the provisioned Redis instance is no longer large enough.
Diagnostic Steps:
- On the Redis Grafana Dashboard, zoom the time range out to 30 days. Is there a slow, steady, upward trend in memory usage?
- Confirm that no single recent deployment correlates with a sharp, sustained increase in memory, ruling out "Potential Cause 1".
Remediation Plan:
- Upscale the Redis ElastiCache instance to the next available size.
  - Navigate to the AWS ElastiCache Console: [Link to the specific Redis cluster]
  - Click Modify.
  - In the Node type dropdown, select the next larger instance size (e.g., from `cache.m5.large` to `cache.m5.xlarge`).
  - Choose Apply immediately and confirm the modification.
- Post an update in the `#product-team-badass` channel: "Incident: Redis OOM. Memory usage shows slow organic growth over the last 30 days. Proactively upscaling the ElastiCache instance to provide headroom."

Verification:
- On the Redis Grafana Dashboard, confirm the "Database Memory Usage Percentage" has dropped significantly below the `90%` threshold after the upscale operation completes (this can take several minutes).
- Verify in Rollbar that no new `OOM` errors are occurring: [Link to Rollbar filtered for Redis errors].
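For teams that prefer the CLI over the console, the upscale can also be done with the AWS CLI. This is a hedged sketch: the replication group id is an assumption, and the snippet defaults to a dry run that only prints the command.

```shell
# Hedged CLI alternative to the console upscale steps.
# The replication group id is an assumption; verify it (and the target node
# type) before flipping dry_run to false.
dry_run=true
cmd="aws elasticache modify-replication-group \
  --replication-group-id channelku-redis \
  --cache-node-type cache.m5.xlarge \
  --apply-immediately"

if [ "$dry_run" = true ]; then
  echo "DRY RUN: $cmd"
else
  eval "$cmd"
fi
```

Keeping the dry-run default means a copy-paste mistake prints a command instead of resizing production infrastructure.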
4.2. Clean up
Once the system is stable, the next priority is managing the incident's aftermath. This includes assessing the blast radius, communicating the impact, and executing cleanup tasks to resolve all negative side effects.
How to write:
Provide a clear, sequential plan for post-incident actions, structured as a checklist covering three core areas:
- Impact Assessment: Provide exact commands or queries to quantify the impact (e.g., number of failed jobs, affected users, or transactions).
- Communication: Define who needs to be notified, which channels to use, and a template for the message.
- Cleanup Procedure: List the specific, step-by-step commands or actions needed to correct data, re-run jobs, or restore any inconsistent state. For complex procedures, link to a dedicated runbook or script instead of listing all steps here.
Don'ts and Dos:
- Don't put all cleanup steps in this runbook. Instead, link to dedicated runbooks for complex procedures to keep this one focused.
- Don't use vague instructions like "See which jobs failed." Instead, provide a specific, copy-paste-ready command to list them.
- Don't assume the engineer knows who to contact. Instead, specify the exact Slack channel (`#product-updates`) and team handles (`@customer-success-team`) to notify.
- Don't say "let stakeholders know about the issue." Instead, provide a pre-written communication template to ensure the message is clear, concise, and contains all necessary information.
- Don't describe a manual cleanup process in prose. Instead, provide a numbered list of commands or link directly to the cleanup script in the code repository.
- Don't assume the impact is obvious. Instead, provide a direct link to a pre-filtered dashboard or a specific log query to quantify the blast radius.
- Don't perform cleanup actions silently. Instead, include a step instructing the engineer to announce the start and completion of cleanup tasks in the relevant Slack channel.
- Don't just fix the problem. Instead, include instructions to gather data for the post-mortem, like saving a list of affected user IDs to a CSV and attaching it to the incident ticket.
- Don't embed long, complex scripts in the runbook. Instead, link to the script in the Git repository and provide only the command to execute it.
- Don't assume cleanup was successful. Instead, include a final verification step, such as running a query to confirm a retry queue is now empty.
- Don't leave other teams in the dark. Instead, instruct the engineer to check for and communicate any downstream impact to dependent services.
Example:
The Redis OOM event caused numerous background jobs to fail and move to the Sidekiq "Retries" queue. The following steps will quantify the impact, notify stakeholders, and re-process the failed jobs.
1. Assess the impact.
   - Run the following command in a production console to get a precise count of failed jobs related to this incident: `bundle exec rails runner "puts Sidekiq::RetrySet.new.select { |job| job.error_message.include?('OOM command not allowed') }.count"`
   - Record this number in the incident's Jira ticket.
2. Communicate the impact to stakeholders.
   - Post the following message in the `#product-updates` Slack channel: "Post-Incident Update for Redis OOM: The system is now stable. We've identified that approximately `[NUMBER_FROM_STEP_1]` background jobs failed during the incident. We are beginning the process of re-enqueuing them now. No customer data was lost."
3. Re-enqueue all failed jobs.
   - Run the dedicated Rake task designed for this purpose. This task will safely move all jobs from the "Retries" set back into their original queues for processing: `bundle exec rake sidekiq:retry_all`
4. Verify the cleanup is complete.
   - Open the Sidekiq Web UI: [Link to Sidekiq UI]
   - Navigate to the "Retries" tab and confirm that the count is now `0`.
   - Post a final confirmation in the `#product-updates` Slack thread: "Cleanup is complete. All failed jobs have been successfully re-queued."
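The final "don't assume cleanup was successful" rule can be enforced with a scripted check rather than eyeballing the UI. The count here is simulated so the sketch is self-contained; the `rails runner` line shows where the real value would come from.

```shell
# Sketch: scripted verification that the Sidekiq retry queue is empty.
# The count is simulated; in production, obtain the real value with:
#   retries=$(bundle exec rails runner "puts Sidekiq::RetrySet.new.size")
retries=0

if [ "$retries" -eq 0 ]; then
  echo "cleanup verified: retry queue is empty"
else
  echo "cleanup incomplete: ${retries} jobs still in the Retries set"
fi
```

A nonzero result means the re-enqueue step must be repeated (or the remaining jobs investigated individually) before posting the final all-clear in Slack.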