A continuous monitoring system that checks backend logs every 60 seconds, detects and analyzes errors, uses AI to classify incidents and match them against known issues, manages the full incident lifecycle, sends alerts and updates to Slack, Telegram, and Jira, and maintains an internal issue registry for tracking recurring problems.
Every 60 seconds
Fire monitor loop
Pull last 60s of logs
Log Ingestion
Extract error signals from raw logs
Create Ticket
Create silent low-priority ticket for one-off error
Increment clean_streak by 1
Increment all_clear_counter by 1
Hourly Digest
Send all-clear message
Open incident
Append to incident_log, update last_issue_at, reset clean_streak
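The counter updates in the steps above can be sketched as a single per-tick state transition. This is a minimal illustration, not the system's actual code: the MonitorState class, the record_tick function, and the incident_open flag are hypothetical names, and the sketch assumes that clean_streak counts clean checks while an incident is open and that all_clear_counter counts clean checks when no incident is open. The source does not state which branch increments which counter.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class MonitorState:
    # Hypothetical container for the fields named in the flow.
    incident_open: bool = False
    clean_streak: int = 0
    all_clear_counter: int = 0
    last_issue_at: Optional[datetime] = None
    incident_log: List[dict] = field(default_factory=list)


def record_tick(state: MonitorState, findings: List[str], now: datetime) -> MonitorState:
    """Apply one 60-second check to the state.

    `findings` is assumed to hold only incident-worthy errors; one-off
    errors are assumed to have been ticketed silently upstream.
    """
    if findings:
        # Open (or continue) an incident: log it, stamp it, reset the streak.
        state.incident_open = True
        state.incident_log.append({"at": now, "findings": list(findings)})
        state.last_issue_at = now
        state.clean_streak = 0
    elif state.incident_open:
        # Clean check during an incident (assumed branch for clean_streak).
        state.clean_streak += 1
    else:
        # Clean check with no incident (assumed branch for all_clear_counter).
        state.all_clear_counter += 1
    return state
```

The thresholds that turn a clean streak into a resolution, or the all-clear counter into an hourly digest, are not given in the source and are left out here.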
Registry Lookup
Query registry for matching endpoint and error_type
Update last_seen, load jira_ticket_id into incident state
Add Comment
Post reappearance note to existing ticket
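The registry lookup above amounts to an exact match on two fields followed by a timestamp update. The sketch below shows one possible shape for it, assuming the registry is an in-memory list; the RegistryEntry class and lookup_registry function are hypothetical names, and only the field names endpoint, error_type, last_seen, and jira_ticket_id come from the flow itself.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class RegistryEntry:
    # Field names follow the flow; the class itself is an assumption.
    endpoint: str
    error_type: str
    jira_ticket_id: str
    last_seen: datetime


def lookup_registry(
    registry: List[RegistryEntry],
    endpoint: str,
    error_type: str,
    now: datetime,
) -> Optional[RegistryEntry]:
    """Return the first entry matching endpoint and error_type, or None.

    On a match, last_seen is refreshed so the entry is not treated as stale.
    """
    for entry in registry:
        if entry.endpoint == endpoint and entry.error_type == error_type:
            entry.last_seen = now
            return entry
    return None
```

On a hit, the caller would copy entry.jira_ticket_id into the incident state and post the reappearance comment to that ticket; on a miss, the flow falls through to the AI-based registry match.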
Registry Match
Compare current findings against last 10 registry entries
Update last_seen, load jira_ticket_id into incident state
Add Comment
Post AI-identified reappearance note to existing ticket
Create Ticket
Create P1 incident ticket
Create Ticket
Create P2 or P3 incident ticket
Incident Summary
Write first alert summary
Critical Alert
Send first alert to on-call
Critical Alert
Send first alert to on-call
Warning Alert
Post first alert to #backend-alerts
Incident Update
Write ongoing update using full incident history
Ongoing Alert
Send 5-minute update to on-call
Ongoing Alert
Post 5-minute update to #backend-alerts
Set notification_timer = now
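The 5-minute update cadence above can be expressed as a simple elapsed-time check against notification_timer. This is a sketch under assumptions: the update_due function and UPDATE_INTERVAL constant are hypothetical names, and the flow does not say whether the comparison is inclusive at exactly five minutes (it is treated as inclusive here).

```python
from datetime import datetime, timedelta

# Interval between ongoing-incident updates, taken from the "5-minute update" step.
UPDATE_INTERVAL = timedelta(minutes=5)


def update_due(
    notification_timer: datetime,
    now: datetime,
    interval: timedelta = UPDATE_INTERVAL,
) -> bool:
    """Return True when at least `interval` has passed since the last update."""
    return now - notification_timer >= interval
```

When update_due returns True, the loop would write the incident update, send it to both destinations, and then set notification_timer = now so the next update is measured from this one.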
Resolution Report
Write full incident resolution report
Resolve Ticket
Add resolution report and set status to Resolved
Resolved
Send resolution message to on-call
Resolved
Post resolved message to #backend-alerts
Clear incident state
Review resolution report, link related tickets, decide if a fix is needed
Once per day
Staleness Check
Check if last_seen is more than 3 days ago
Delete registry entry
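The daily staleness check above can be sketched as a filter over the registry. The prune_stale function and the dictionary layout are assumptions made for illustration; only the last_seen field and the 3-day limit come from the flow. An entry seen exactly 3 days ago is kept here, since the check reads as strictly more than 3 days.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Maximum age of a registry entry, taken from the staleness check.
MAX_AGE = timedelta(days=3)


def prune_stale(
    registry: List[Dict],
    now: datetime,
    max_age: timedelta = MAX_AGE,
) -> List[Dict]:
    """Return only the entries whose last_seen is within max_age of now."""
    return [entry for entry in registry if now - entry["last_seen"] <= max_age]
```

Returning a new list rather than deleting in place keeps the daily job easy to test; a database-backed registry would issue an equivalent delete on the same condition.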