We have a network of production monitoring tools at patientslikeme.com, where monit, NewRelic, and Pingdom feed alerts through PagerDuty to produce e-mail, SMS, and Pager alerts for production issues. PagerDuty has a ticketing system to assign a given problem to a single person. It’s awesome.
Life Before PagerDuty
Whenever a background worker was automatically restarted, we deployed a fix, or any minor system event occurred, a handful of e-mails would go out to our whole Ops team, and most of us would get an SMS message for each one. We mostly ignored all of this noise. As a result, when a genuine emergency occurred, we often didn’t react immediately. And because we were all getting alerted, often 2-3 of us would respond at once in a piling-on effect. This sucked.
Principles of Proper Ops Monitoring
- People Only Get Alerts for Serious Issues Requiring Human Intervention
- Only One Person Alerted at a Time
- Serious Issues Should Wake You Up at 4AM
PagerDuty is the heart of the monitoring system at PatientsLikeMe. We’ve configured monit, NewRelic, and Pingdom to fire e-mail notifications to PagerDuty. PagerDuty collects all of the notifications, applies filtering rules, and opens Incident Tickets for anything needing attention.
- Filter Out Noise from Signal (monit instance alerts, etc.)
- Single Person Assigned to Each Issue at a Time, w/ Escalation
- Wired up E-Mail, SMS, and old-school Pagers to get your attention (even at 4am)
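To make the filtering step concrete, here is a minimal Python sketch of the idea: every notification is checked against a list of noise patterns, and only the ones that survive open an incident. This is not PagerDuty's rule engine or our actual configuration; the `Notification` type, the `NOISE_PATTERNS` list, and the subject strings are all hypothetical and only illustrate the principle.

```python
import re
from dataclasses import dataclass


@dataclass
class Notification:
    source: str   # which tool sent it, e.g. "monit", "newrelic", "pingdom"
    subject: str  # e-mail subject line of the alert


# Hypothetical noise patterns. Real rules would be tuned to the
# subjects the monitoring tools actually send.
NOISE_PATTERNS = [
    re.compile(r"monit instance", re.IGNORECASE),
    re.compile(r"action done", re.IGNORECASE),
]


def needs_incident(note: Notification) -> bool:
    """Return True when no noise pattern matches the subject."""
    return not any(p.search(note.subject) for p in NOISE_PATTERNS)


def triage(notes):
    """Keep only the notifications that should open an incident ticket."""
    return [n for n in notes if needs_incident(n)]
```

Anything that `triage` drops never pages a human, which is the whole point: the fewer alerts that get through, the more seriously each one is taken.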
Single Person Responds to Each Issue
When an incident is opened, it moves through a linear escalation hierarchy. I’m the first one notified (via SMS on my personal iPhone and on my company pager). If I don’t respond within 10 minutes, the ticket escalates to my boss (Director of Engineering). If he doesn’t respond within 20 minutes, an emergency backup is alerted (technical co-founder). This gives us great coverage and prevents double-responding to issues.
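The escalation chain above can be sketched in a few lines of Python. This is an illustration, not PagerDuty's implementation: the `Level` type, the `POLICY` list, and the `notified_so_far` function are hypothetical names, and the sketch assumes each timeout is counted from the moment that level is notified (so the backup is alerted 30 minutes after the incident opens).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Level:
    responder: str
    # Minutes this responder has to acknowledge before escalation.
    # None marks the last level: nobody to escalate to after it.
    timeout_minutes: Optional[int]


# Hypothetical policy mirroring the chain described above.
POLICY = [
    Level("ops engineer", 10),
    Level("director of engineering", 20),
    Level("technical co-founder", None),
]


def notified_so_far(minutes_unacknowledged: float, policy=POLICY) -> List[str]:
    """List everyone alerted for an incident left unacknowledged this long."""
    notified = []
    level_starts_at = 0  # minutes after opening when this level is alerted
    for level in policy:
        if minutes_unacknowledged < level_starts_at:
            break
        notified.append(level.responder)
        if level.timeout_minutes is None:
            break
        level_starts_at += level.timeout_minutes
    return notified
```

At any moment exactly one person is the newest name on that list, and acknowledging the ticket stops the chain, which is what keeps two people from chasing the same problem.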
How Do We Use Monit?
Monit is on the front lines of our operations monitoring. Every production server runs a monit daemon that watches key processes (Passenger, Resque workers, Sphinx, etc.) and system resources (CPU utilization, free memory, etc.). Monit can diagnose failures and restart processes, as well as send e-mail alerts through PagerDuty. Unfortunately, it’s very verbose about what generates e-mail notices, so we filter many of them out in PagerDuty.
How Do We Use NewRelic?
NewRelic is the key to performance and site behavior monitoring. This is the first place I look to see how well patientslikeme.com is doing. We have custom alerts for error-rate thresholds, traffic patterns (high and low), and other issues; they are sent by e-mail through PagerDuty and posted to our Campfire developer chat room.
How Do We Use Pingdom?
Pingdom is a separate, last line of defense for critical errors. It is a simple service that measures application response time and tracks aggregate uptime. Pingdom checks are entirely external to the application and represent the worst-case scenario alarm (the production application is down for an unknown reason). Additionally, Pingdom provides nice long-term aggregate tracking of response times.
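For a sense of what an external check does, here is a rough Python sketch that requests a URL from outside the application, records whether it answered, and times the response. It is not how Pingdom works internally; the `check` and `uptime_percent` functions and the `opener` parameter are hypothetical names invented for this example, and the only library call used is the standard `urllib.request.urlopen`.

```python
import time
import urllib.request


def check(url, timeout=10.0, opener=urllib.request.urlopen):
    """Probe a URL once and return (is_up, seconds_taken).

    `opener` is injectable so the probe can be exercised without
    touching the network; it defaults to urllib's urlopen.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            up = 200 <= response.status < 400
    except OSError:
        # Covers URLError, HTTPError, timeouts, and refused connections.
        up = False
    return up, time.monotonic() - start


def uptime_percent(results):
    """Aggregate a list of (is_up, seconds) probes into an uptime percentage."""
    if not results:
        return 0.0
    return 100.0 * sum(1 for up, _ in results if up) / len(results)
```

Because the probe knows nothing about the application's internals, it still fires when everything inside the box is too broken to report on itself.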