Archive

Archive for the ‘Debugging’ Category

Resque Queue Priority

02/17/2012 Comments off

Resque Queue Priority

 
 
Queue Priority

Resque allows each worker process to work a prioritized list of work queues. When jobs are added, they end up in one particular queue. Each worker scans the queues in that priority order to determine the next job to process. This allows you to ensure that higher priority work is processed before lower priority work.

TL;DR: Resque workers process work in the priority order specified.

Categories: Debugging, Operations, Rails Tags:

File Handle Leaks in Hudson

12/17/2010 Comments off

Hudson is Awesome
We recently switched from cruisecontrol.rb to Hudson and have been much happier. It’s more reliable and we get much better resource management using build queues.

Hudson Failure
However, this week Hudson has stopped responding several times with the following error:

Dec 17, 2010 12:41:29 PM hudson.triggers.SCMTrigger$Runner runPolling
SEVERE: Failed to record SCM polling
hudson.plugins.git.GitException: Error retrieving tag names
        at hudson.plugins.git.GitAPI.getTagNames(GitAPI.java:650)
   ... snip ...
Caused by: java.io.IOException: Cannot run program "git" (in directory "/home/cruise/.hudson/server/jobs/plm-website-master/workspace"): java.io.IOException: error=24, Too many open files
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
   ... snip ...
Caused by: java.io.IOException: java.io.IOException: error=24, Too many open files
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
   ... snip ...

Proximate Cause
Hudson is bound by the default process file limit, of 1024 on this linux box. When it hits the limit, these type of failures occur preventing forking off child processes. Something was leaking file handles. Using lsof showed file handles allocated up the the limit, 99% of which were pipes.

Root Cause
Whenever Hudson forks off a child process, it monitors the stdout/stderr streams even after they finish executing. So if you spawn off a daemonized process and don’t close out your output streams, you will leak 2-3 file handles on every execution. We have a simple ruby script that spawns off to report build status in our Campfire chatroom.

Being a naive script, we weren’t properly closing out our file handles.

We switched the Post-build script execution from:

ruby campfire.rb &

To:

ruby scrub_fds.sh campfire.rb &

Where this is scrub_fds.sh:

#!/bin/sh

eval exec {0..2}\>/dev/null
eval exec {3..255}\>\&- # Close ALL file descriptors

$*

Reference Links

UPDATE: Looks like this is mostly caused by a bug in the Git plugin for Hudson (Fixed in 1.390).

Categories: Debugging, Operations Tags:

Adding Bundler to Passenger Hosted Apps

12/15/2010 Comments off

We upgraded one of our applications to manage dependencies with bundler at PatientsLikeMe. When we deployed the new version of the application on a Passenger app server, we saw errors loading our bundle:

rubygems/dependency.rb:52:in `initialize': Valid types are [:development, :runtime], not nil (ArgumentError)

Some quick googling show this is a problem with rubygems on the system, with the fix being to upgrade rubygems as follows:

$> sudo gem update --system
$> gem -v 
1.3.7

We restarted the application via:

touch tmp/restart.txt

However, we still experienced the same bundler problem – as if the wrong gem system were being used in an RVM or multi-ruby environment.

This is an easy gotcha, as tmp/restart.txt only re-loads the application via the Passenger Spawn process, it doesn’t reload Passenger or the configuration.

When you’re changing system gems or other configuration loaded by Passenger, you need to restart the entire Apache stack hosting Passenger:

sudo /usr/sbin/apachectl restart

This resolved the problem.

Categories: Debugging, Operations, Rails

Strange Rails Timestamp Errors

12/11/2010 1 comment

Strange Timestamp Errors
Occasionally (once or twice a week), we see strange failures in our Rails tests, with errors like the following:

<Wed Dec 08 23:47:09 -0600 2010> expected but was
<Wed Dec 08 23:47:09 -0600 2010>.

… generated from a test like the following:


    t = 2.days.ago
    Timecop.freeze(t) do
      saved_search = Factory(:user).saved_searches.create!(:name => 'My Search', :params => {'foo' => '1'}.to_json)
    end

    Timecop.freeze(1.day.ago) do
      get :show, :id => saved_search.id
      saved_search.reload
      assert_equal t, saved_search.last_used_at, "should not be updated"
    end

Off By Less Than 1s
If you’re familiar with Ruby Time objects, you can see that the time value is different, but at a resolution well below the seconds printed in the messages used. How can this be? We’re using Timecop to freeze all of the Ruby/Rails Time interfaces at a fixed point for this test.

We’re creating a record with a known Timestamp, reloading it from the database, and seeing that Timestamp off, but just slightly (by less than a second).

Background Info: Timecop
Timecop is one of my favorite Rails plugins, it allows you to pass a block of code that executes in a different Time, by stubbing out all the Ruby/Rails Time APIs to return the time you’ve chosen. You can see it used here to “travel” in time back to 2 days ago, then 1 day ago, and execute different parts of the test. This “freezes” time and makes any real difference in Timestamps impossible. So how is the Timestamp for this record changing? And in such a strange way?

Mystery Revealed: ActiveRecord DB Time Parsing/Rounding Error
It turns out there’s a bug in Rails 2.3/3.0 in Timestamp database value parsing, causing rounding errors for some edge case values. There’s a pending patch on it’s way into Rails, but I’ve manually patched our version of Rails to prevent this in the meantime. This means that when a test runs and happens to hit one of these category of Timestamp values (based on Time.now) this will blow up, since the Timestamp written to the DB will not be the one read back (off by less than a second). A truly maddening intermittent bug.

Much respect to Jacob Lauemøller who discovered/fixed this Rails bug. Here’s hoping this patch makes it in to the next 2.3.x release.

We’ve been suffering these erroneous failures for years and never knowing why. Glad to finally knock this off the list.

Categories: Debugging, Rails, Testing

Adventures in DHCP

12/08/2010 Comments off

Internet Down
Working for a startup, office IT crises can be as important as production operations.

We all arrived in the office this morning to find out there was no internet access. Further debugging showed that all of the DHCP lease requests were failing. However, the internet connection was working if you manually configured your IP/DNS info.

Background
We had installed a new DHCP/DNS server last Friday night, so we immediately suspected some failure there. Thankfully, we had the old server as a hot backup. Surprisingly, the hot backup server didn’t work either.

We fired up a network sniffer and saw that between all of the clients in the office network and the DHCP server, someone else was intercepting the DHCP discovery requests and responding negatively to every attempt before the real server could.

DHCP is a broadcast protocol, so it’s easy for a misconfigured machine or router to hijack valid network requests. All of the evidence pointed to a similar thing happening here.

At the same time, we had just opened up a new room in our offices and had wired drops put in. There was a new patch panel with wiring for several drops in the room. I was immediately suspicious of a new router/device in this area due to the timing. However, we didn’t find any new devices.

Mystery Solved
It turns out that someone had wired the new patch panel up to an existing Time-capsule/Airport hub, but plugged it into the Uplink port instead of a hub port. This caused the Airport to configure itself as a router, turn on DHCP forwarding to the dead network (the new room), and intercept all DHCP requests from the entire office.

Categories: Debugging, Operations

Asset Bundling vs. Internet Explorer CSS Limits

11/30/2010 Comments off

Weird CSS Limits in IE
Internet Explorer 6 & 7 have several surprising hard limits for CSS stylesheets you won’t find in any other browser:

  • No more than 4096 CSS rules/file
  • No more than 32 referenced CSS files.

Once you cross either of these thresholds, CSS styles will be quietly dropped without applying.

Asset Bundling
At PatientsLikeMe, we’ve decreased page load time by bundling common CSS/JS files together and serving them over a Content Delivery Network. We also make heavy use of SASS mixins for style encapsulation and server-side includes.

Asset Bundling + CSS Size Limits = Calamity
When we bundled two of our largest CSS files together we saw some strange IE-only styling errors. One of our front-end developers wasted at least 24 hours trying to isolate which CSS rule or expression was causing the failure, what a nightmare.

Add this to the long list of arcane and maddening Internet Explorer only gotchas to watch out for.

Follow

Get every new post delivered to your Inbox.