Heroku's Cedar Stack will kill your Resque workers
UPDATE: As of 8/21/12 Resque 1.22, there's an official solution! Skip to the end of this post and check it out.
Heroku is an invaluable resource for quickly deploying apps without having to do the dev-ops heavy lifting, allowing development to move fast. One of Heroku's great features is on-demand scaling of dynos (additional processes). To take advantage of those dynos, there are many job queueing gems available, the most popular being Resque. Heroku's ease of scaling up dynos combined with Resque's ease of using those dynos would seem to make the two a match made in heaven. Unfortunately, that's not the case.
So what's the problem?
The issues that we've encountered lie within Heroku's Cedar Stack, the most popular stack for Rails apps (now Heroku's new default!). Simply put, Heroku will kill your Resque workers.
Ok, that's not entirely true, it just made for a simpler title. What's really happening is that Heroku is telling Resque to terminate, and then Resque kills its workers.
If you're constantly processing jobs, there's a good chance that your Resque workers will get killed in the middle of processing a job. This is really bad. Not only could your job end up partially processed, but information about it and whatever it didn't complete will be completely lost. Have fun with that, right?
There are two key events during which Heroku sends TERM signals to Resque (possibly more that I don't know about):
- Every time you re-deploy, Heroku will send a TERM signal to your Resque workers so that it can restart new workers with the new deployment's code.
- When you scale down your Resque workers, Heroku will send a TERM signal to the excess workers. If you have 10 workers and scale down to 5, 5 of the workers will be shut down.
Here's an example of a Resque job:
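The original snippet isn't preserved here, so here's a minimal stand-in (the class name, queue name, and per-item sleep are invented for illustration):

```ruby
# A hypothetical long-running Resque job. Resque calls self.perform
# with the arguments the job was enqueued with.
class LongJob
  @queue = :long_jobs # Resque reads this class instance variable to pick the queue

  def self.perform(count)
    count.times do |i|
      sleep 0.1 # stand-in for real per-item work
      puts "Processed item #{i + 1} of #{count}"
    end
  end
end
```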
Here's what happens when you scale down a Resque worker that's actively working on a job:
The worker gets killed, only the first 23 items of the job get processed, and the job is discarded. Bad, bad, bad.
Why all the carnage?
It's all about communication, or rather miscommunication. Heroku follows Unix best practices and sends a TERM signal to Resque during a re-deploy or when scaling down dynos, giving Resque workers 10 seconds to shut down. Resque, on the other hand, uses nginx-style signal handling, causing Resque to respond to TERM by force-killing its workers (1).
What Resque really wants is a QUIT signal. If Resque got a QUIT signal instead of TERM, it would allow the worker to finish processing its active job, and then gracefully shutdown. Unfortunately, we have no control over Heroku, and there's no way to make Heroku send a QUIT signal (2). Thus, the unnecessary carnage.
What to do.
Ideally, there would be an official solution from Resque and/or Heroku, but until that happens, you'll need to get by, one way or another.
There are a couple of factors that you'll need to consider:
If your jobs get processed safely under Heroku's 10 second grace period
You can patch Resque to interpret the TERM signal as a QUIT signal. Resque will allow active jobs to complete before shutting down the workers.
Here's what we need to patch:
WARNING: The downside is that you can't easily update Resque to the latest version without having to re-patch it.
If your jobs tend to run over 10 seconds and can't be interrupted partway through
Even if your Resque workers receive the equivalent of a QUIT signal, the workers still won't be able to complete their job in time before they get killed by Heroku. If this is your case, you'll need a mechanism to pause Resque workers, stopping them from picking up new jobs after they complete their active jobs. You then simply wait for all of the active jobs to complete and then re-deploy new code or scale down your workers.
Resque's first step in processing a job is reserving a job from the queue. The following monkey patch prevents a Resque worker from reserving a new job when the resque_paused flag in the Redis store is set to true.
WARNING: Although Resque is a mature gem, there's a chance that the reserve method's implementation could be modified in a new update, rendering your patch invalid.
The following rake tasks make it easy to pause and resume Resque:
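A sketch of those tasks, assuming workers honor a resque_paused flag in Redis. In a Rails app you'd drop the namespace block into a lib/tasks/*.rake file (and add an :environment dependency) instead of the require/extend lines used here to make it standalone:

```ruby
require 'rake'
extend Rake::DSL # make the task/namespace DSL available in this file

namespace :resque do
  desc 'Stop workers from picking up new jobs once their current job finishes'
  task :pause do
    Resque.redis.set('resque_paused', 'true')
    puts 'Resque paused'
  end

  desc 'Let workers pick up new jobs again'
  task :resume do
    Resque.redis.set('resque_paused', 'false')
    puts 'Resque resumed'
  end
end
```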
Here's a demonstration:
First, let's enqueue some jobs:
Then watch the Resque worker stop picking up new jobs after we tell it to pause. Likewise, watch the Resque worker start to pick up new jobs when we tell it to resume.
If your jobs run longer than 10 seconds and can be interrupted halfway through
You can trap the TERM signal, find a stopping point, and then re-enqueue the remaining portion of the job. You'll still need to patch Resque to interpret TERM as QUIT.
Here's an example of a job doing just that:
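A minimal sketch of such a job (class name, queue, and per-item work are invented). Resque.enqueue is only reached when an interrupt actually occurs, so the resque gem is needed in a real worker but not to follow the sketch:

```ruby
# A hypothetical resumable job: it traps TERM, stops at the next safe
# point, and re-enqueues the unprocessed remainder.
class ResumableJob
  @queue = :resumable

  def self.perform(remaining)
    interrupted = false
    trap('TERM') { interrupted = true } # replaces the forked child's handler

    processed = 0
    remaining.times do
      break if interrupted
      sleep 0.1 # stand-in for real per-item work
      processed += 1
    end

    # Push the rest of the work back onto the queue before exiting.
    Resque.enqueue(ResumableJob, remaining - processed) if interrupted
    processed
  end
end
```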
And you'll see here that the job stops once it traps TERM and re-enqueues another job for the unprocessed remainder of items.
If you'd like an example project of the above techniques to try out yourself, check out the following GitHub repos.
Fork of Resque interpreting TERM as QUIT:
Example project of above techniques:
I hope this blog post saves some time and hardships for those devs out there using Resque on Heroku's Cedar Stack.
If you have an opinion on the matter and would like to comment, or if you come across another way to handle this issue, I'd love to hear it.
As you can read in the comment by Abhinav Keswani (wasabhi), as of August 21st, there's an official solution to this problem in Resque's 1.22 release!
In summary, there are two new environment variables that give an opt-in approach to handling the TERM signal differently.
- TERM_CHILD=1 (opt in to new term signal workflow)
- RESQUE_TERM_TIMEOUT=10 (where 10 is the number of seconds to wait for a job to complete before killing it; defaults to 4)
Simply set TERM_CHILD=1 and rescue Resque::TermException from within your self.perform method, and do your thing within the set RESQUE_TERM_TIMEOUT. Easy.
Check out Heroku's blog post for a more detailed explanation and examples.
Big thanks to the teams at Heroku and Resque!