Archive for the ‘Operations’ Category

Monitoring Redis Replication in Nagios

12/13/2012 Comments off

I’ve put together the following Nagios plugin for monitoring Redis slave server replication to ensure replication is successful and the lag is within a reasonable time limit:

#!/usr/bin/env ruby
require 'optparse'
options  = {}
required = [:warning, :critical, :host]

parser   = do |opts|
  opts.banner = "Usage: check_redis_replication [options]"
  opts.on("-h", "--host redishost", "The hostname of the redis slave") do |h|
    options[:host] = h
  opts.on("-w", "--warning percentage", "Warning threshold") do |w|
    options[:warning] = w
  opts.on("-c", "--critical critical", "Critical threshold") do |c|
    options[:critical] = c
abort parser.to_s if !required.all? { |k| options.has_key?(k) }

master_last_io_seconds_ago = `redis-cli info | grep master_last_io_seconds_ago`.split(':').last.to_i rescue -1

status = :ok
if master_last_io_seconds_ago < 0 || master_last_io_seconds_ago >= options[:critical].to_i
  status = :critical
elsif master_last_io_seconds_ago >= options[:warning].to_i
  status = :warning

status_detail = master_last_io_seconds_ago == -1 ? 'ERROR' : "#{ master_last_io_seconds_ago.to_s }s"
puts "#{status.to_s.upcase} - replication lag: #{status_detail}"

if status == :critical
elsif status == :warning

This can be wired up in Nagios with an NRPE remote execution check:

define service {
    name                            redis_replication
    register                        1
    check_command                   check_nrpe!check_redis_replication!$HOSTNAME$ 100 250
    service_description             Redis Replication
    hostgroup_name                  redis_slave
Categories: Operations

MongoDB Indexing, count(), and unique validations

11/10/2012 Comments off

Slow Queries in MongoDB

I rebuilt the database tier powering App Cloud earlier this week and uncovered some performance problems caused by slow queries. As usual, two or three were caused by missing indexes and were easily fixed by adding index coverage. MongoDB has decent index functionality for most use cases.

Investigating Slow count() queries

Unfortunately, I noticed a large variety of slow queries issuing complex count() queries like:

  count: "users", 
  query: { 
    email: "", 
    _id: { $ne: ObjectId('509e83e132a5752f5f000001') }
  fields: null 

Investigating our users collection, I saw a proper index on _id and email. Unfortunately, MongoDB can’t use indexes properly for count() operations. That’s a serious drawback, but not one I can change.

Where were these odd looking queries coming from? Why would we be looking for a user with a given email but NOT a given id?

The uniqueness validation on the email key of the User document and many other models was the culprit. Whenever a User is created/updated, ActiveModel is verifying there are no other Users with the given email:

class User
  include MongoMapper::Document

  key :email, String, unique: true

Use the Source!

Why is a unique validation triggering this type of count() query? Within Rails 3.x, this functionality is handled by the UniquenessValidator#validate_each implementation, which checks for records using the model’s exists?() query:


The exists?() method is a convention in both ActiveRecord and MongoMapper, checking for any records within the given scope. MongoMapper delegates it’s querying capability to the Plucky gem, where we can find the exists?() implementation using count():

  def exists?(query_options={})

Root Cause and a Patch to work-around MongoMapper/Plucky

In SQL, using count() is a nice way to check for the existence of records. Unfortunately, since MongoDB won’t use indices properly for count(), this incurs a big performance hit on large collections.

I added a MongoMapper patch to work-around the issue. We can patch the exists?() method to use find_one() without any fields instead of the expensive count() path:

module MongoMapper
  module Plugins
    module Querying
      module ClassMethods
        # Performance Hack: count() operations can't use indexes properly.
        # Use find() instead of count() for faster queries via indexes.
        def exists?(query_options={})

Resque Queue Priority

02/17/2012 Comments off

Resque Queue Priority

Queue Priority

Resque allows each worker process to work a prioritized list of work queues. When jobs are added, they end up in one particular queue. Each worker scans the queues in that priority order to determine the next job to process. This allows you to ensure that higher priority work is processed before lower priority work.

TL;DR: Resque workers process work in the priority order specified.

Categories: Debugging, Operations, Rails Tags:

How Monitors Ops w/ PagerDuty

04/26/2011 1 comment

PagerDuty Dispatch

Summary (TL;DR)
We have a network of production monitoring tools at, where monit, NewRelic, and Pingdom feed alerts through PagerDuty to produce e-mail, SMS, and Pager alerts for production issues. PagerDuty has a ticketing system to assign a given problem to a single person. It’s awesome.

Life Before PagerDuty
Whenever a background worker was automatically restarted, we deployed a fix, or any minor system event occurred a handful of e-mails would be generated to our whole Ops team and most of them would get SMS messages for each. We mostly ignored all of this noise. When a genuine emergency occurred, we often didn’t react immediately. Because we were all getting alerted, often 2-3 of us would respond in a piling-on effect. This sucks.

Principles of Proper Ops Monitoring

  1. People only get alerts for serious issues requiring human intervention
  2. Only One Person Alerted at a Time
  3. Serious Issues Should Wake You Up at 4AM

Read more…

File Handle Leaks in Hudson

12/17/2010 Comments off

Hudson is Awesome
We recently switched from cruisecontrol.rb to Hudson and have been much happier. It’s more reliable and we get much better resource management using build queues.

Hudson Failure
However, this week Hudson has stopped responding several times with the following error:

Dec 17, 2010 12:41:29 PM hudson.triggers.SCMTrigger$Runner runPolling
SEVERE: Failed to record SCM polling
hudson.plugins.git.GitException: Error retrieving tag names
        at hudson.plugins.git.GitAPI.getTagNames(
   ... snip ...
Caused by: Cannot run program "git" (in directory "/home/cruise/.hudson/server/jobs/plm-website-master/workspace"): error=24, Too many open files
        at java.lang.ProcessBuilder.start(
   ... snip ...
Caused by: error=24, Too many open files
        at java.lang.UNIXProcess.<init>(
   ... snip ...

Proximate Cause
Hudson is bound by the default process file limit, of 1024 on this linux box. When it hits the limit, these type of failures occur preventing forking off child processes. Something was leaking file handles. Using lsof showed file handles allocated up the the limit, 99% of which were pipes.

Root Cause
Whenever Hudson forks off a child process, it monitors the stdout/stderr streams even after they finish executing. So if you spawn off a daemonized process and don’t close out your output streams, you will leak 2-3 file handles on every execution. We have a simple ruby script that spawns off to report build status in our Campfire chatroom.

Being a naive script, we weren’t properly closing out our file handles.

We switched the Post-build script execution from:

ruby campfire.rb &


ruby campfire.rb &

Where this is


eval exec {0..2}\>/dev/null
eval exec {3..255}\>\&- # Close ALL file descriptors


Reference Links

UPDATE: Looks like this is mostly caused by a bug in the Git plugin for Hudson (Fixed in 1.390).

Categories: Debugging, Operations Tags:

Adding Bundler to Passenger Hosted Apps

12/15/2010 Comments off

We upgraded one of our applications to manage dependencies with bundler at PatientsLikeMe. When we deployed the new version of the application on a Passenger app server, we saw errors loading our bundle:

rubygems/dependency.rb:52:in `initialize': Valid types are [:development, :runtime], not nil (ArgumentError)

Some quick googling show this is a problem with rubygems on the system, with the fix being to upgrade rubygems as follows:

$> sudo gem update --system
$> gem -v 

We restarted the application via:

touch tmp/restart.txt

However, we still experienced the same bundler problem – as if the wrong gem system were being used in an RVM or multi-ruby environment.

This is an easy gotcha, as tmp/restart.txt only re-loads the application via the Passenger Spawn process, it doesn’t reload Passenger or the configuration.

When you’re changing system gems or other configuration loaded by Passenger, you need to restart the entire Apache stack hosting Passenger:

sudo /usr/sbin/apachectl restart

This resolved the problem.

Categories: Debugging, Operations, Rails

Rails Tests Run in 2/3 Time w/ GC Tuning

12/10/2010 Comments off

Run Your Unit Tests in 2/3 the Time
Tweaking the Ruby Enterprise Edition (REE) garbage collection (GC) parameters, I was able to run my unit tests in 2/3 the normal time. Total test time w/ Ruby 1.8.7 down from 20mins to approx 6mins on tuned REE 1.8.7.

This data was measured on the PatientsLikeMe Rails codebase, a very mature and large Rails app. The hardware is a MacBook Pro w/ Rails 2.3.5 on OSX 10.6.4. Your mileage may vary.

Background: Garbage Collection & Tuning
Ruby is a dynamic language with GC managing dynamic memory allocation. Most Ruby programmers have the benefit of ignoring the garbage collector during development, but tuning the GC parameters can have dramatic benefits in production and running your tests locally. Using REE allows the tuning of many GC parameters.

37Signals Production Settings

# NOTE: These only take effect when running Ruby Enterprise Edition

export RUBY_HEAP_MIN_SLOTS=600000
export RUBY_GC_MALLOC_LIMIT=59000000
export RUBY_HEAP_FREE_MIN=100000

Measured Performance

# Before (REE, no GC settings)
$> ruby -v
ruby 1.8.7 (2010-04-19 patchlevel 253) [i686-darwin10.4.0], MBARI 0x6770, Ruby Enterprise Edition 2010.02
$> rake test:units
Finished in 666.310269 seconds.

3883 tests, 11523 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
# After (REE, w/ 37Signals GC tuning)
$> ruby -v
ruby 1.8.7 (2010-04-19 patchlevel 253) [i686-darwin10.4.0], MBARI 0x6770, Ruby Enterprise Edition 2010.02
$> env | grep RUBY
$> rake test:units
Finished in 411.319884 seconds.

3883 tests, 11523 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications

Why? What Do These Settings Mean?

  • RUBY_HEAP_MIN_SLOTS – Number of slots in Ruby heap, directly controls initial heap size in your VM. Should be large enough to hold entire Rails environment. This is 6x the default heap size.
  • RUBY_GC_MALLOC_LIMIT – Wait until # of malloc() calls to trigger GC, this is much longer wait than Ruby default period for less collections. The value above collects every 59mil malloc()s
  • RUBY_HEAP_FREE_MIN – Minimum free heap post-collection, if not met will allocate a whole new heap. We’ve set it here to 17% the size of the heap. Default is 25% of heap.

Adventures in DHCP

12/08/2010 Comments off

Internet Down
Working for a startup, office IT crises can be as important as production operations.

We all arrived in the office this morning to find out there was no internet access. Further debugging showed that all of the DHCP lease requests were failing. However, the internet connection was working if you manually configured your IP/DNS info.

We had installed a new DHCP/DNS server last Friday night, so we immediately suspected some failure there. Thankfully, we had the old server as a hot backup. Surprisingly, the hot backup server didn’t work either.

We fired up a network sniffer and saw that between all of the clients in the office network and the DHCP server, someone else was intercepting the DHCP discovery requests and responding negatively to every attempt before the real server could.

DHCP is a broadcast protocol, so it’s easy for a misconfigured machine or router to hijack valid network requests. All of the evidence pointed to a similar thing happening here.

At the same time, we had just opened up a new room in our offices and had wired drops put in. There was a new patch panel with wiring for several drops in the room. I was immediately suspicious of a new router/device in this area due to the timing. However, we didn’t find any new devices.

Mystery Solved
It turns out that someone had wired the new patch panel up to an existing Time-capsule/Airport hub, but plugged it into the Uplink port instead of a hub port. This caused the Airport to configure itself as a router, turn on DHCP forwarding to the dead network (the new room), and intercept all DHCP requests from the entire office.

Categories: Debugging, Operations

REE Cuts Rails Test Time in Half

12/07/2010 Comments off

Ruby Enterprise Edition (REE)
I spent the night after work switching our build/stage server to Ruby Enterprise Edition. I switched both our Hudson based builds and our Passenger staging servers.

REE is well known for it’s superior garbage collection and memory management, but I was shocked to see how much faster it executed in Ruby CPU-bound contexts. We saw about a 55% drop in runtime, taking our average build times from 55min to 30min.

Build/Test Times Cut Almost in Half
Hudson Screen Shot

Performance Drill-down

  • Unit Tests: From 1036s to 579s (to run 3880 tests)
  • Functional Tests: From 844s to 448s (to run 860 tests)
  • Cucumber Tests: From 498s to 255s (to run 1078 steps)

The nginx/REE/Passenger stack is known as the best of breed production Rails stack, but I can’t believe how much of a benefit we’ve gotten from introducing the same components into our build and staging systems.

This effort was initially a functional testing pass to verify our system performed correctly under REE, I never expected to achieve such massive performance gains on it’s own merits.

Tips/Tricks & Gotchas

  • RVM is the best way to test/incrementally introduce a new ruby interpreter
  • If you’re using bundler w/ file-system bundles (via –path) you need to completely rebuild them when you switch Ruby interpreters
  • If you have a previous Passenger Apache module installer, you need to rebuild/reinstall the REE based Apache module

Original Ruby Version
ruby 1.8.7 (2009-06-12 patchlevel 174) [x86_64-linux]
REE Version
ruby 1.8.7 (2010-04-19 patchlevel 253) [x86_64-linux], MBARI 0x6770, Ruby Enterprise Edition 2010.02

Cache Segmentation for Rails Apps

11/23/2010 3 comments

Problem: Cache Correctness vs. Cache Persistence
In many Rails apps, there are two caching requirements: Cache as much static data as possible, Flush any cached content that varies with each release (cached models, for example).  Many simple deployment strategies flush all cache data to prevent serialization conflicts with models, but this can cause an expensive penalty in cache misses until your cache is warmed up again.

At PatientsLikeMe, we faced both of these problems.  We have a high volume of action-cached data with a long shelf-life that’s difficult to pre-cache.  We also have a number of volatile model based caches that need to be flushed every time we deploy.

Solution: Segment Stable vs. Volatile Caches
I built this simple segmented cache strategy to meet our needs: a stable memcache namespace for any data that should persist between deploys, and a volatile cache namespace that’s swapped on every deploy.

Rails Cache Configuration Example

revision_file = Rails.root.join('REVISION')
if File.exist?(revision_file)
  revision = /[a-f,0-9]{6}$/
config.cache_store = :mem_cache_store, memcache_host, { :namespace => "volatile-#{ revision ? revision[0] : '0' }" }
config.action_controller.cache_store = :mem_cache_store, memcache_host, { :namespace => 'stable' }

EDIT: Thanks to Jeremy for the refactoring for this sample.

You’ll notice this sets one Cache object for the default Rails cache (used by models and most things) and a custom, stable Cache object for anything in a Controller (Action and Fragment Caches).

Categories: Caching, Operations