Backend Incident Monitor

A continuous monitoring system that checks backend logs every 60 seconds and detects and analyzes errors. It uses AI to classify incidents and match them against known issues, manages the full incident lifecycle, sends alerts and updates to Slack, Telegram, and Jira, and maintains an internal issue registry for tracking recurring problems.

Sandipan Bhattacharjee (@sandy)

Datadog, Log Parser, Jira, Slack

Scheduled Trigger

Every 60 seconds

Fire monitor loop

Datadog

Pull last 60s of logs

Log Ingestion

Log Parser

Extract error signals from raw logs
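A minimal sketch of what the Log Parser step could look like. The log line shape, the `ErrorSignal` type, and the function names are assumptions made for illustration; the flow itself only says that error signals are extracted from raw logs.

```python
import re
from typing import NamedTuple, Optional


class ErrorSignal(NamedTuple):
    endpoint: str
    error_type: str
    message: str


# Hypothetical line shape: "<timestamp> <LEVEL> <endpoint> <error_type>: <message>"
LINE_RE = re.compile(
    r"^\S+\s+(?P<level>[A-Z]+)\s+(?P<endpoint>\S+)\s+"
    r"(?P<error_type>\w+):\s*(?P<message>.*)$"
)

# Assumption: only these levels count as error signals.
ERROR_LEVELS = {"ERROR", "CRITICAL"}


def parse_line(line: str) -> Optional[ErrorSignal]:
    """Return an ErrorSignal for an error-level line, otherwise None."""
    m = LINE_RE.match(line.strip())
    if m is None or m.group("level") not in ERROR_LEVELS:
        return None
    return ErrorSignal(m.group("endpoint"), m.group("error_type"), m.group("message"))


def extract_signals(lines):
    """Keep only the lines that parse as error signals."""
    return [s for s in map(parse_line, lines) if s is not None]
```

An empty result corresponds to the "All clean" branch; anything else follows "Issues detected".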

Issues detected

All clean

◇

Issues Found

Repeat

One-off

◇

Repeat error?

Jira

Create Ticket

Create silent low-priority ticket for one-off error

Incident open

No incident

◇

Active Incident

Redis

Increment clean_streak + 1

Redis

Increment all_clear_counter + 1

1 hour clean

Still counting

◇

clean_streak == 60?

Slack all-clear

◇

all_clear_counter == 60?

Slack

Hourly Digest

Send all-clear message
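The clean-run branch can be sketched as follows. A plain dict stands in for the Redis keys, and the returned action names are hypothetical. Resetting `all_clear_counter` after the digest is an assumption; the flow does not say when the counter is cleared.

```python
CLEAN_RUNS_PER_HOUR = 60  # the monitor fires every 60 seconds


def on_clean_run(state: dict) -> str:
    """Advance the counters for one run with no issues and name the next action."""
    if state.get("incident_open"):
        # Incident open: count consecutive clean runs toward resolution.
        state["clean_streak"] = state.get("clean_streak", 0) + 1
        if state["clean_streak"] == CLEAN_RUNS_PER_HOUR:
            return "resolve_incident"
        return "still_counting"

    # No incident: count toward the hourly all-clear digest.
    state["all_clear_counter"] = state.get("all_clear_counter", 0) + 1
    if state["all_clear_counter"] == CLEAN_RUNS_PER_HOUR:
        state["all_clear_counter"] = 0  # assumption: start a new hour
        return "send_hourly_digest"
    return "still_counting"
```

With one run per minute, 60 consecutive clean runs correspond to the "1 hour clean" edge.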

Incident already open

New incident

◇

Active incident

Redis

Open incident

Redis

Append to incident_log, update last_issue_at, reset clean_streak
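As an illustration, the state update for an already-open incident might look like this. The dict is a stand-in for the incident state kept in Redis, and the shape of a log entry is an assumption.

```python
import time
from typing import Optional


def record_issue(incident: dict, finding: dict, now: Optional[float] = None) -> dict:
    """Append a finding to the incident and reset the clean streak."""
    now = time.time() if now is None else now
    # Each log entry carries the time it was recorded plus the finding fields.
    incident.setdefault("incident_log", []).append({"at": now, **finding})
    incident["last_issue_at"] = now
    incident["clean_streak"] = 0  # any new issue restarts the one-hour countdown
    return incident
```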

Send update

Too soon

◇

Notification timer elapsed 5 min?

Redis

Registry Lookup

Query registry for matching endpoint and error_type

Known issue

No exact match

◇

Exact match in registry?

10b

Redis

Update last_seen, load jira_ticket_id into incident state

11c

Jira

Add Comment

Post reappearance note to existing ticket

10c

ChatGPT

Registry Match

Compare current findings against last 10 registry entries
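The two-stage registry match could be sketched like this: try an exact match on `endpoint` and `error_type` first, and only on a miss collect the entries to hand to the AI comparison. The registry is modeled as a list of dicts rather than Redis, and reading "last 10" as the 10 most recently seen entries is an assumption. The model call itself is left out.

```python
from typing import Optional, Tuple


def lookup_exact(registry: list, endpoint: str, error_type: str) -> Optional[dict]:
    """Return the registry entry with the same endpoint and error_type, if any."""
    for entry in registry:
        if entry["endpoint"] == endpoint and entry["error_type"] == error_type:
            return entry
    return None


def recent_candidates(registry: list, limit: int = 10) -> list:
    """Return the most recently seen entries, newest first."""
    return sorted(registry, key=lambda e: e["last_seen"], reverse=True)[:limit]


def match_or_candidates(
    registry: list, endpoint: str, error_type: str
) -> Tuple[Optional[dict], list]:
    """Exact hit, or (None, candidates) for the AI comparison step."""
    hit = lookup_exact(registry, endpoint, error_type)
    if hit is not None:
        return hit, []
    return None, recent_candidates(registry)
```

Checking the exact key first keeps the model call off the path for known issues.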

Same root cause

New issue

11d

◇

AI match found?

12b

Redis

Update last_seen, load jira_ticket_id into incident state

13a

Jira

Add Comment

Post AI-identified reappearance note to existing ticket

Critical

Warning

12c

◇

Severity?

13b

Jira

Create Ticket

Create P1 incident ticket

13c

Jira

Create Ticket

Create P2 or P3 incident ticket
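A small sketch of the severity branch. The flow does not state how it chooses between P2 and P3 for warnings, so the choice is taken as a parameter here rather than guessed. The function name and the lowercase severity strings are assumptions.

```python
def ticket_priority(severity: str, warning_priority: str = "P2") -> str:
    """Map an incident severity to a Jira priority label."""
    if severity == "critical":
        return "P1"
    if severity == "warning":
        # The rule for P2 versus P3 is not given in the flow; the caller decides.
        if warning_priority not in ("P2", "P3"):
            raise ValueError("warning_priority must be 'P2' or 'P3'")
        return warning_priority
    raise ValueError(f"unknown severity: {severity!r}")
```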

Claude

Incident Summary

Write first alert summary

Critical

Warning

◇

Critical or Warning?

16a

Telegram

Critical Alert

Send first alert to on-call

16b

Slack

Critical Alert

Send first alert to on-call

16c

Slack

Warning Alert

Post first alert to #backend-alerts

Claude

Incident Update

Write ongoing update using full incident history

Critical

Warning

10a

◇

Critical or Warning?

11a

Telegram

Ongoing Alert

Send 5-minute update to on-call

11b

Slack

Ongoing Alert

Post 5-minute update to #backend-alerts

12a

Redis

Set notification_timer = now
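The five-minute update throttle can be sketched as below, with a dict standing in for the Redis key. Treating a missing `notification_timer` as "elapsed" is an assumption, and the function names are hypothetical.

```python
UPDATE_INTERVAL_SECONDS = 5 * 60


def should_send_update(state: dict, now: float) -> bool:
    """True when at least five minutes have passed since the last notification."""
    last = state.get("notification_timer")
    # Assumption: with no timer recorded yet, an update is allowed.
    return last is None or now - last >= UPDATE_INTERVAL_SECONDS


def mark_update_sent(state: dict, now: float) -> None:
    """Restart the timer after an update goes out."""
    state["notification_timer"] = now
```

A `False` result corresponds to the "Too soon" edge: the finding is still appended to the incident log, but no message is sent on that run.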

Claude

Resolution Report

Write full incident resolution report

Jira

Resolve Ticket

Add resolution report and set status to Resolved

Telegram

Resolved

Send resolution message to on-call

Slack

Resolved

Post resolved message to #backend-alerts

10d

Redis

Clear incident state

11e

Manual

Review resolution report, link related tickets, decide if a fix is needed

Daily Cleanup

Once per day

Redis

Staleness Check

Check if last_seen > 3 days ago

Yes

◇

Stale?

Redis

Delete registry entry
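The daily cleanup loop amounts to the following sketch. The registry is modeled as a dict of entries keyed by a hypothetical string id, with `last_seen` stored as a timezone-aware `datetime`; both are assumptions about the Redis layout.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=3)


def purge_stale(registry: dict, now: datetime) -> list:
    """Delete entries last seen more than three days ago; return the removed keys."""
    stale = [
        key for key, entry in registry.items()
        if now - entry["last_seen"] > STALE_AFTER
    ]
    # Delete after the scan so the dict is not mutated while iterating.
    for key in stale:
        del registry[key]
    return stale
```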

Recipe

47 nodes · 1 phase

Node 1: Scheduled Trigger (Every 60 seconds)

Log Collection

Node 2: Datadog, Pull last 60s of logs
Node 3: Log Parser
Node 4: ◇ Issues Found
Node 5a: ◇ Repeat error?
Node 6b: Jira, Create Ticket
Node 5b: ◇ Active Incident
Node 6c: Redis
Node 6d: Redis
Node 7c: ◇ clean_streak == 60?
Node 7d: ◇ all_clear_counter == 60?
Node 8d: Slack, Hourly Digest
Node 6a: ◇ Active incident
Node 7b: Redis
Node 7a: Redis
Node 8a: ◇ Notification timer elapsed 5 min?
Node 8b: Redis, Registry Lookup
Node 9b: ◇ Exact match in registry?
Node 10b: Redis
Node 11c: Jira, Add Comment
Node 10c: ChatGPT, Registry Match
Node 11d: ◇ AI match found?
Node 12b: Redis
Node 13a: Jira, Add Comment
Node 12c: ◇ Severity?
Node 13b: Jira, Create Ticket
Node 13c: Jira, Create Ticket
Node 14: Claude, Incident Summary
Node 15: ◇ Critical or Warning?
Node 16a: Telegram, Critical Alert
Node 16b: Slack, Critical Alert
Node 16c: Slack, Warning Alert
Node 9a: Claude, Incident Update
Node 10a: ◇ Critical or Warning?
Node 11a: Telegram, Ongoing Alert
Node 11b: Slack, Ongoing Alert
Node 12a: Redis
Node 8c: Claude, Resolution Report
Node 9c: Jira, Resolve Ticket
Node 9d: Telegram, Resolved
Node 9e: Slack, Resolved
Node 10d: Redis
Node 11e: Manual

Node 1: Daily Cleanup (Once per day)

For Each Registry Entry

Loop 2: Redis, Staleness Check
Loop 3: ◇ Stale?
Loop 4: Redis
