Strange Rails Timestamp Errors

12/11/2010 1 comment

Strange Timestamp Errors
Occasionally (once or twice a week), we see strange failures in our Rails tests, with errors like the following:

<Wed Dec 08 23:47:09 -0600 2010> expected but was
<Wed Dec 08 23:47:09 -0600 2010>.

… generated from a test like the following:

    t = 2.days.ago
    Timecop.freeze(t) do
      saved_search = Factory(:user).saved_searches.create!(:name => 'My Search', :params => {'foo' => '1'}.to_json)

    Timecop.freeze( do
      get :show, :id =>
      assert_equal t, saved_search.last_used_at, "should not be updated"

Off By Less Than 1s
If you’re familiar with Ruby Time objects, you can guess what’s going on: the two time values are different, but they differ at a sub-second resolution, well below the whole seconds printed in the failure message. How can this be? We’re using Timecop to freeze all of the Ruby/Rails Time interfaces at a fixed point for this test.

We’re creating a record with a known Timestamp, reloading it from the database, and seeing that Timestamp off, but just slightly (by less than a second).
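To make the symptom concrete, here is a minimal sketch in plain Ruby. The epoch seconds and microsecond values are made up for illustration and are not taken from our test run; the point is only that two Time objects can print identically and still compare as unequal.

```ruby
# Two Times within the same second, differing only in their microseconds.
# The epoch and microsecond values are arbitrary illustration values.
written   = Time.at(1_291_873_629, 123_456)
read_back = Time.at(1_291_873_629, 123_455)

puts written.to_s == read_back.to_s   # => true  (printed text is identical)
puts written == read_back             # => false (sub-second part differs)
puts written.usec - read_back.usec    # => 1
```

This is why the assert_equal failure message shows what looks like the same value twice.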

Background Info: Timecop
Timecop is one of my favorite Rails plugins. It allows you to pass a block of code that executes at a different time, by stubbing out all the Ruby/Rails Time APIs to return the time you’ve chosen. You can see it used here to “travel” back in time to 2 days ago, then 1 day ago, and execute different parts of the test. This “freezes” time and should make any real difference in timestamps impossible. So how is the timestamp for this record changing? And in such a strange way?
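For readers who haven’t used it, the idea of “stubbing out the Time APIs” can be sketched in a few lines of plain Ruby. FrozenClock below is a hypothetical stand-in written for this post; it is not Timecop’s real implementation and only covers Time.now.

```ruby
# Hypothetical stand-in for the idea behind Timecop.freeze.
# NOT Timecop's real code; it only stubs Time.now, and only inside the block.
module FrozenClock
  def self.at(frozen)
    Time.singleton_class.class_eval do
      alias_method :__real_now, :now       # keep the real clock around
      define_method(:now) { frozen }       # every call now returns `frozen`
    end
    yield
  ensure
    Time.singleton_class.class_eval do
      alias_method :now, :__real_now       # put the real clock back
      remove_method :__real_now
    end
  end
end

t = Time.now - 2 * 24 * 60 * 60            # two days ago

FrozenClock.at(t) do
  puts Time.now == t                       # => true, time stands still
end
puts Time.now == t                         # => false, real clock restored
```

With the clock pinned like this, nothing in the Ruby process should be able to produce a different timestamp, which is what made the failures so confusing.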

Mystery Revealed: ActiveRecord DB Time Parsing/Rounding Error
It turns out there’s a bug in Rails 2.3/3.0 in timestamp database value parsing, causing rounding errors for some edge case values. There’s a pending patch on its way into Rails, but I’ve manually patched our version of Rails to prevent this in the meantime. This means that when a test runs and happens to hit a timestamp value in this category, it will blow up, since the timestamp written to the DB will not be the one read back (it will be off by less than a second). A truly maddening intermittent bug.
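Here is a rough sketch of how this class of error can show up. It assumes the fractional seconds are parsed from a decimal string into a Float and then scaled to microseconds; that is my illustration of the failure mode, not a copy of the Rails source or of the patch.

```ruby
# Round-trip every possible microsecond value through a decimal string and a
# Float, then convert back two ways: by truncating and by rounding.
# Assumption: this mimics the general shape of the parsing, not Rails itself.
truncated_wrong = 0
rounded_wrong   = 0

(0...1_000_000).each do |usec|
  scaled = format('0.%06d', usec).to_f * 1_000_000  # e.g. "0.000123" -> Float
  truncated_wrong += 1 if scaled.to_i  != usec      # truncation loses some values
  rounded_wrong   += 1 if scaled.round != usec      # rounding recovers them all
end

puts "changed when truncated: #{truncated_wrong} of 1,000,000"
puts "changed when rounded:   #{rounded_wrong} of 1,000,000"
```

Only some values are affected, which fits the pattern of a test that fails once or twice a week rather than on every run.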

Much respect to Jacob Lauemøller who discovered/fixed this Rails bug. Here’s hoping this patch makes it into the next 2.3.x release.

We’ve been suffering these erroneous failures for years without ever knowing why. Glad to finally knock this off the list.

Categories: Debugging, Rails, Testing

Rails Tests Run in 2/3 Time w/ GC Tuning

12/10/2010 Comments off

Run Your Unit Tests in 2/3 the Time
By tweaking the Ruby Enterprise Edition (REE) garbage collection (GC) parameters, I was able to run my unit tests in about 2/3 the normal time. Total unit test time is down from 20mins w/ Ruby 1.8.7 to approx 7mins on tuned REE 1.8.7.

This data was measured on the PatientsLikeMe Rails codebase, a very mature and large Rails app. The hardware is a MacBook Pro w/ Rails 2.3.5 on OSX 10.6.4. Your mileage may vary.

Background: Garbage Collection & Tuning
Ruby is a dynamic language with GC managing dynamic memory allocation. Most Ruby programmers have the luxury of ignoring the garbage collector during development, but tuning the GC parameters can have dramatic benefits in production and when running your tests locally. Using REE allows the tuning of many GC parameters.

37Signals Production Settings

# NOTE: These only take effect when running Ruby Enterprise Edition

export RUBY_HEAP_MIN_SLOTS=600000
export RUBY_GC_MALLOC_LIMIT=59000000
export RUBY_HEAP_FREE_MIN=100000

Measured Performance

# Before (REE, no GC settings)
$> ruby -v
ruby 1.8.7 (2010-04-19 patchlevel 253) [i686-darwin10.4.0], MBARI 0x6770, Ruby Enterprise Edition 2010.02
$> rake test:units
Finished in 666.310269 seconds.

3883 tests, 11523 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications

# After (REE, w/ 37Signals GC tuning)
$> ruby -v
ruby 1.8.7 (2010-04-19 patchlevel 253) [i686-darwin10.4.0], MBARI 0x6770, Ruby Enterprise Edition 2010.02
$> env | grep RUBY
$> rake test:units
Finished in 411.319884 seconds.

3883 tests, 11523 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
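As a quick sanity check on the “2/3 time” claim, the ratio can be worked out from the two “Finished in” lines above. This is just arithmetic on the numbers already shown.

```ruby
# Wall-clock times copied from the two "Finished in" lines above.
before_seconds = 666.310269   # REE, no GC settings
after_seconds  = 411.319884   # REE, w/ 37Signals GC tuning

ratio = after_seconds / before_seconds

puts format('tuned run takes %.0f%% of the untuned time', ratio * 100)
puts format('saved %.0f seconds per run', before_seconds - after_seconds)
```

That works out to roughly 62% of the original run time, a little better than 2/3.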

Why? What Do These Settings Mean?

  • RUBY_HEAP_MIN_SLOTS – Initial number of slots in the Ruby heap; directly controls the initial heap size of your VM. It should be large enough to hold the entire Rails environment. This is 60x the default heap size (10,000 slots).
  • RUBY_GC_MALLOC_LIMIT – Amount of memory, in bytes, that can be malloc()ed before a GC is triggered. This is a much longer wait than the Ruby default, so collections run less often. The value above triggers a collection after roughly every 59MB of allocations.
  • RUBY_HEAP_FREE_MIN – Minimum number of free heap slots after a collection; if it isn’t met, Ruby allocates a whole new heap. We’ve set it here to 17% of the size of the heap. Default is 25% of heap.
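The 17% figure comes straight from the two slot settings. A small sketch of that arithmetic, using the values from the settings block above:

```ruby
# Values copied from the 37Signals settings block above.
settings = {
  'RUBY_HEAP_MIN_SLOTS' => 600_000,
  'RUBY_HEAP_FREE_MIN'  => 100_000
}

# Share of the initial heap that must be free after a collection.
free_pct = 100.0 * settings['RUBY_HEAP_FREE_MIN'] / settings['RUBY_HEAP_MIN_SLOTS']

puts format('RUBY_HEAP_FREE_MIN is %.1f%% of RUBY_HEAP_MIN_SLOTS', free_pct)
```

If you change one of the two values, it’s worth re-checking this ratio so the free-slot floor stays in proportion to the heap.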

Adventures in DHCP

12/08/2010 Comments off

Internet Down
Working for a startup, office IT crises can be as important as production operations.

We all arrived in the office this morning to find out there was no internet access. Further debugging showed that all of the DHCP lease requests were failing. However, the internet connection was working if you manually configured your IP/DNS info.

We had installed a new DHCP/DNS server last Friday night, so we immediately suspected some failure there. Thankfully, we had the old server as a hot backup. Surprisingly, the hot backup server didn’t work either.

We fired up a network sniffer and saw that between all of the clients in the office network and the DHCP server, someone else was intercepting the DHCP discovery requests and responding negatively to every attempt before the real server could.

DHCP is a broadcast protocol, so it’s easy for a misconfigured machine or router to hijack valid network requests. All of the evidence pointed to a similar thing happening here.

At the same time, we had just opened up a new room in our offices and had wired drops put in. There was a new patch panel with wiring for several drops in the room. I was immediately suspicious of a new router/device in this area due to the timing. However, we didn’t find any new devices.

Mystery Solved
It turns out that someone had wired the new patch panel up to an existing Time-capsule/Airport hub, but plugged it into the Uplink port instead of a hub port. This caused the Airport to configure itself as a router, turn on DHCP forwarding to the dead network (the new room), and intercept all DHCP requests from the entire office.

Categories: Debugging, Operations

REE Cuts Rails Test Time in Half

12/07/2010 Comments off

Ruby Enterprise Edition (REE)
I spent the night after work switching our build/stage server to Ruby Enterprise Edition. I switched both our Hudson based builds and our Passenger staging servers.

REE is well known for its superior garbage collection and memory management, but I was shocked to see how much faster it executed in Ruby CPU-bound contexts. We saw about a 45% drop in runtime, taking our average build times from 55min to 30min.

Build/Test Times Cut Almost in Half
Hudson Screen Shot

Performance Drill-down

  • Unit Tests: From 1036s to 579s (to run 3880 tests)
  • Functional Tests: From 844s to 448s (to run 860 tests)
  • Cucumber Tests: From 498s to 255s (to run 1078 steps)
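The per-suite savings can be worked out from the timings in the list above. A short sketch, using only those numbers:

```ruby
# Before/after timings in seconds, copied from the drill-down list above.
timings = {
  'Unit'       => [1036, 579],
  'Functional' => [844, 448],
  'Cucumber'   => [498, 255]
}

# Percentage of run time saved per suite.
reductions = timings.map do |suite, (before, after)|
  [suite, (100.0 * (before - after) / before).round(1)]
end.to_h

reductions.each { |suite, pct| puts "#{suite}: #{pct}% less time" }
```

Each suite lands somewhere between 44% and 49% less run time, which is where “almost in half” comes from.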

The nginx/REE/Passenger stack is known as the best of breed production Rails stack, but I can’t believe how much of a benefit we’ve gotten from introducing the same components into our build and staging systems.

This effort was initially a functional testing pass to verify our system performed correctly under REE. I never expected to achieve such massive performance gains just from switching interpreters.

Tips/Tricks & Gotchas

  • RVM is the best way to test/incrementally introduce a new ruby interpreter
  • If you’re using bundler w/ file-system bundles (via --path) you need to completely rebuild them when you switch Ruby interpreters
  • If you have a previous Passenger Apache module installed, you need to rebuild/reinstall the REE based Apache module

Original Ruby Version
ruby 1.8.7 (2009-06-12 patchlevel 174) [x86_64-linux]
REE Version
ruby 1.8.7 (2010-04-19 patchlevel 253) [x86_64-linux], MBARI 0x6770, Ruby Enterprise Edition 2010.02

MacBook Pro SSD Upgrade

12/05/2010 4 comments

I’ve been suffering with abysmal disk performance on my work laptop for some time, so I decided to pick up an OCZ Vertex2 SSD on sale over Black Friday and 8gigs of RAM. Total cost for both was under $300 (though the same from Apple would have been about $1200).

I wanted to get OSX re-installed, data migrated, and my development environment setup on my own time without wasting productive work time, so I did the following:

  • Installed the SSD in a USB enclosure
  • Installed OSX on the SSD as a secondary drive
  • Migrated data/apps from the old internal drive
  • Booted off of the USB/SSD and finished config
  • Installed the SSD as the primary drive

Everything went smoothly. Here’s the disk portion of a system XBench benchmark before/after:
XBench Disk Performance

Overall, about an 8x performance increase, split between 4x for Sequential Read/Write and 28x for Random Access. The system boots up in about 15 seconds now. Applications load very quickly; Chrome with 10-12 saved tabs opens in about 1 second.

Categories: Uncategorized

Asset Bundling vs. Internet Explorer CSS Limits

11/30/2010 Comments off

Weird CSS Limits in IE
Internet Explorer 6 & 7 have several surprising hard limits for CSS stylesheets you won’t find in any other browser:

  • No more than 4095 CSS rules/file
  • No more than 31 referenced CSS files

Once you cross either of these thresholds, CSS styles will be quietly dropped without applying.
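A rough way to catch this before it ships is to count the selectors in each bundled file. The sketch below is a simple heuristic, not a real CSS parser: count_selectors and RULE_LIMIT are names I made up for illustration, and the threshold is an assumption you should adjust to whatever limit you are guarding against.

```ruby
# Assumed threshold for IE; adjust as needed.
RULE_LIMIT = 4095

# Heuristic selector count: strips comments, then counts the comma-separated
# selectors in front of every "{". At-rules (@media, @font-face, ...) are
# skipped themselves, but rules nested inside @media are still counted.
def count_selectors(css)
  without_comments = css.gsub(%r{/\*.*?\*/}m, '')
  without_comments.scan(/([^{}]+)\{/).flatten.sum do |prelude|
    prelude = prelude.strip
    next 0 if prelude.start_with?('@')
    prelude.split(',').size
  end
end

sample = <<~CSS
  /* header styles */
  h1, h2 { margin: 0 }
  @media print { p { color: black } }
  .nav a:hover { text-decoration: none }
CSS

count = count_selectors(sample)
puts "#{count} selectors (limit #{RULE_LIMIT})"
puts 'over the limit!' if count > RULE_LIMIT
```

Running something like this over each bundle as part of the build would flag the problem when two large files are merged, instead of leaving it to be found by eye in IE.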

Asset Bundling
At PatientsLikeMe, we’ve decreased page load time by bundling common CSS/JS files together and serving them over a Content Delivery Network. We also make heavy use of SASS mixins for style encapsulation and server-side includes.

Asset Bundling + CSS Size Limits = Calamity
When we bundled two of our largest CSS files together we saw some strange IE-only styling errors. One of our front-end developers wasted at least 24 hours trying to isolate which CSS rule or expression was causing the failure. What a nightmare.

Add this to the long list of arcane and maddening Internet Explorer only gotchas to watch out for.

Cache Segmentation for Rails Apps

11/23/2010 3 comments

Problem: Cache Correctness vs. Cache Persistence
In many Rails apps, there are two caching requirements: cache as much static data as possible, and flush any cached content that varies with each release (cached models, for example). Many simple deployment strategies flush all cache data to prevent serialization conflicts with models, but this can cause an expensive penalty in cache misses until your cache is warmed up again.

At PatientsLikeMe, we faced both of these problems.  We have a high volume of action-cached data with a long shelf-life that’s difficult to pre-cache.  We also have a number of volatile model based caches that need to be flushed every time we deploy.

Solution: Segment Stable vs. Volatile Caches
I built this simple segmented cache strategy to meet our needs: a stable memcache namespace for any data that should persist between deploys, and a volatile cache namespace that’s swapped on every deploy.

Rails Cache Configuration Example

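# The REVISION file is expected to end with the deployed commit hash
revision_file = Rails.root.join('REVISION')
if File.exist?(revision_file)
  # Use the last 6 hex characters of the revision as the volatile namespace key
  revision = File.read(revision_file).match(/[a-f0-9]{6}$/)
end
# Default Rails cache (models and most things): new namespace on every deploy
config.cache_store = :mem_cache_store, memcache_host, { :namespace => "volatile-#{ revision ? revision[0] : '0' }" }
# Controller caches (action and fragment): namespace survives deploys
config.action_controller.cache_store = :mem_cache_store, memcache_host, { :namespace => 'stable' }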

EDIT: Thanks to Jeremy for the refactoring for this sample.

You’ll notice this sets one Cache object for the default Rails cache (used by models and most things) and a custom, stable Cache object for anything in a Controller (Action and Fragment Caches).
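The namespace logic can be tried outside of Rails. In this sketch, volatile_namespace is a hypothetical helper written for illustration and the revision string is a made-up example; it assumes the REVISION file ends with a hex commit hash.

```ruby
# Hypothetical helper: derive the volatile cache namespace from the text of a
# REVISION file. Falls back to "volatile-0" when there is no usable revision.
def volatile_namespace(revision_text)
  match = revision_text.to_s.match(/[a-f0-9]{6}$/)
  "volatile-#{match ? match[0] : '0'}"
end

puts volatile_namespace("3f2a9c1d4e5b6a7f\n")  # => volatile-5b6a7f
puts volatile_namespace(nil)                   # => volatile-0
```

Because the namespace changes whenever the revision changes, old volatile entries are simply never read again after a deploy, while everything under the stable namespace stays warm.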

Categories: Caching, Operations