April 26, 2018

Implementing a queue system for PHP

Recently, I was tasked with implementing a queue system for a PHP application. There are many proven alternatives, such as RabbitMQ, ZeroMQ or Gearman. I decided to go with Beanstalkd because it’s extremely easy to install, configure and start using, and because it’s small, fast, mature and used successfully in many production environments. We also didn’t need any of the advanced features the others had to offer. There’s also a mature client for PHP. Even though its development is not very active today, its maturity and ease of use still make it a good choice.

Using Beanstalkd with PHP is straightforward and covered extensively elsewhere. I want to talk about some points that may not be obvious when you are implementing a queue system from scratch. Some of them are Beanstalkd-specific, but others apply to any scenario where you use PHP workers to do background processing.

Using supervisord to supervise workers

In a typical web service, you will put jobs in the queue after certain user actions, for example when a user requests a password reset email. Worker processes, independent from your web server, read these jobs from the queue manager and execute them. Soon you’ll realize that running and maintaining these workers is quite a task in itself. Sometimes they’ll stop running due to an error, sometimes they’ll become unresponsive, and sometimes you’ll need to kill them and start new ones to pick up newly deployed code. You need a ‘supervisor’ to supervise them. That’s where supervisord comes in.

Supervisord is a process control system. You can use Supervisord to control your workers, start/stop all or some of them, and most importantly to make sure a number of workers are always running. Whenever a worker stops working, Supervisord will immediately spawn a new one.

A conditional worker loop

It makes sense to set up your worker so that it always listens for new jobs, but can also stop listening under certain conditions. Keeping a ‘keepRunning’ flag and checking it at the start of every loop iteration will help you achieve that.

class Worker {
    // Declared explicitly; the loop below runs for as long as this stays true.
    private $keepRunning = true;

    public function run() {
        while ($this->keepRunning) {
            // check for new jobs...
        }
    }
}

Making workers stop after a while

Long-running PHP scripts can become bloated after a while. Your own code may be suitable for running for long periods of time, but one of your dependencies may be leaking memory. Database connections might start to time out. To prevent unforeseeable problems, it’s best if your worker processes die after a certain period of time and new processes take their place. The right period depends on your application and what it does, but I think 2 hours is a sane maximum execution time for a worker process.

$this->started = time();

while($this->keepRunning) {
    // check for new jobs...
    // ...
    if(time() - $this->started >= 60 * 60 * 2) {
        $this->keepRunning = false;
    }
}
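One thing to watch: the time check only runs between iterations, so the job lookup must not block forever (Beanstalkd has a reserve-with-timeout command for this). Since memory leaks were one of the reasons for retiring workers, the same check can also look at memory use. Here is a minimal sketch of that idea. WorkerLimits and runWorker are names I made up for illustration, $reserve and $handle stand in for your own queue code, and the 128 MB limit is an arbitrary example value.

```php
<?php

// Hypothetical helper, not part of any library: decides when a worker
// should retire, based on elapsed time and memory use.
final class WorkerLimits
{
    private $startedAt;
    private $maxSeconds;
    private $maxBytes;

    public function __construct(int $startedAt, int $maxSeconds = 7200, int $maxBytes = 134217728)
    {
        $this->startedAt  = $startedAt;
        $this->maxSeconds = $maxSeconds; // 2 hours, as suggested in the text
        $this->maxBytes   = $maxBytes;   // 128 MB, an arbitrary example value
    }

    // Returns true when either limit has been reached.
    public function exceeded(int $now, int $memoryBytes): bool
    {
        return ($now - $this->startedAt) >= $this->maxSeconds
            || $memoryBytes >= $this->maxBytes;
    }
}

// Calls $reserve repeatedly until a limit is hit. $reserve is expected to wait
// only a few seconds and return a job or null; $handle processes one job.
function runWorker(WorkerLimits $limits, callable $reserve, callable $handle): int
{
    $processed = 0;
    while (!$limits->exceeded(time(), memory_get_usage(true))) {
        $job = $reserve();
        if ($job !== null) {
            $handle($job);
            $processed++;
        }
    }
    return $processed;
}
```

When the loop ends, the script simply exits and supervisord starts a fresh process in its place.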

Handling exit signals

You want your workers to exit after certain events. There are mainly two situations where you’ll want to stop your running workers: first, when they reach the maximum running time we just talked about, and second, when your application’s code changes. But you don’t want to kill them right away, because they might be in the middle of processing a job. You want to wait until they finish the job they’re currently processing. You can achieve this by handling the SIGTERM signal, which is very easy in PHP with the pcntl extension. You register to be notified of a signal by providing a handler function. Register a handler for SIGTERM and, in the handler, set $this->keepRunning to false, so the worker leaves the while loop after the current iteration and exits.

pcntl_async_signals(true);

pcntl_signal(SIGTERM, function() {
    $this->keepRunning = false;
});

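The default signal supervisord sends to processes to stop them is SIGTERM, so using it here makes sense. Keep in mind that supervisord only waits for a limited time (the stopwaitsecs setting, 10 seconds by default) before it follows up with SIGKILL, so raise that value if your jobs can take longer. If you want to stop running workers immediately, send SIGKILL yourself. It cannot be caught, so whatever job is in progress is abandoned.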
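To see the pieces working together, here is a small self-contained sketch. It assumes PHP 7.1+ on the command line with the pcntl and posix extensions loaded. SignalAwareWorker is a name I made up, and the posix_kill() call only simulates supervisord sending SIGTERM in the middle of a job.

```php
<?php

// Sketch only: requires the pcntl and posix extensions (CLI).
final class SignalAwareWorker
{
    private $keepRunning = true;

    public function __construct()
    {
        // Dispatch signal handlers automatically, without declare(ticks=1).
        pcntl_async_signals(true);
        pcntl_signal(SIGTERM, function (): void {
            $this->keepRunning = false;
        });
    }

    // $processJob is a hypothetical callable that handles one job.
    public function run(callable $processJob): int
    {
        $processed = 0;
        while ($this->keepRunning) {
            $processJob();  // a SIGTERM arriving in here only flips the flag,
            $processed++;   // so the job in progress still runs to completion
        }
        return $processed;
    }
}

$worker = new SignalAwareWorker();
$log = [];
$count = $worker->run(function () use (&$log) {
    $log[] = 'start';
    posix_kill(posix_getpid(), SIGTERM); // simulate a stop request mid-job
    $log[] = 'end';
});
```

The handler does nothing except change the flag, so the job body is never interrupted halfway. The loop notices the change the next time it checks the condition.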

Handling failed jobs

Jobs can and will fail, so you need a plan for when they do. Ideally, each queue job should be a small piece of code with a single responsibility, so that when it fails you can run it again without worrying about side effects.

One way to retry failed jobs when you’re using Beanstalkd is to check how many times a job has been reserved. When you process a job and it doesn’t produce the result you expect (in other words, it failed), use the release command to release it. Releasing a job makes it ‘ready’ again, so another worker can pick it up and make it ‘reserved’. Every reservation increases that job’s ‘reserves’ count by one. Use the statsJob command to find out how many times the job has been reserved. If that is less than the number of times you want to try this job, release it again, preferably with some delay. Otherwise bury it, so that it is not touched again until it’s ‘kick’ed or deleted.

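if ($job->failed) {
    // 'reserves' counts how many times this job has been reserved so far
    $stats = $pheanstalk->statsJob($pheanstalkJob);
    if ($stats->reserves < $job->maxTryCount) {
        // third argument is the delay in seconds; 60 is just an example value
        $pheanstalk->release($pheanstalkJob, $job->priority, 60);
    } else {
        $pheanstalk->bury($pheanstalkJob); // or delete()
    }
}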
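If you want the delay to grow with every attempt, the decision can be pulled out into a small pure function that is easy to test. This is only a sketch: decideRetry is a hypothetical helper, and the exponential backoff is my own choice here, not something Beanstalkd does for you.

```php
<?php

// Hypothetical helper: decides what to do with a failed job.
// $reserves is the 'reserves' value reported by the stats-job command.
function decideRetry(int $reserves, int $maxTryCount, int $baseDelay = 30): array
{
    if ($reserves >= $maxTryCount) {
        // Out of attempts: park the job until someone kicks or deletes it.
        return ['action' => 'bury', 'delay' => 0];
    }

    // Exponential backoff: the base delay doubles with every reservation.
    $delay = $baseDelay * (2 ** max(0, $reserves - 1));

    return ['action' => 'release', 'delay' => $delay];
}
```

The worker would then pass the returned delay as the third argument to release(), or call bury(), depending on the action.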

Logging

Log anything you might need when investigating a problem. Definitely log when a job starts, the name of that job, whether it succeeded or failed, and how long it took to finish. Also log when a worker starts running, when it stops, how many jobs it processed and how many of them failed.
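As a rough sketch of the per-job part, a small wrapper can time a job and write one line with its name, outcome and duration. runAndLog and the log line format are my own invention here. The logger is any callable that accepts a string, so you can swap in whatever logging library you already use.

```php
<?php

// Hypothetical helper: runs one job and writes a single log line about it.
// $job should return true on success; $logger accepts one string.
function runAndLog(string $jobName, callable $job, callable $logger): bool
{
    $startedAt = microtime(true);
    $error = null;

    try {
        $succeeded = (bool) $job();
    } catch (\Throwable $e) {
        // A thrown exception counts as a failed job.
        $succeeded = false;
        $error = $e->getMessage();
    }

    $durationMs = (int) round((microtime(true) - $startedAt) * 1000);

    $logger(sprintf(
        'job=%s status=%s duration_ms=%d%s',
        $jobName,
        $succeeded ? 'ok' : 'failed',
        $durationMs,
        $error === null ? '' : ' error="' . $error . '"'
    ));

    return $succeeded;
}

// Usage: write to stderr, which supervisord can capture into a log file.
runAndLog('send_password_reset', function () { return true; }, function (string $line) {
    fwrite(STDERR, date('c') . ' ' . $line . PHP_EOL);
});
```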

Getting alerts for critical problems

A few jobs failing because of a network hiccup is not worth an alert, but you should definitely set up alerts for some cases. Use cronjobs or a tool like Monit to be notified when important things go wrong.

Things to be alerted for:

  • Too many failing jobs

Most likely some newly deployed code broke something, or a third-party service you depend on started having problems. Check your logs.

  • Too many jobs in ready status

If a lot of jobs are waiting to be processed, there are three likely causes. Start with your workers’ logs: are they consuming jobs rapidly, or is one job taking too long to finish?

  1. If they are consuming jobs rapidly, your current number of workers simply isn’t enough for the amount of jobs your application produces. You should run more workers.

  2. If they’re stuck on one job, inspect that job. What is taking it so long? Make sure things like network connection timeout values make sense. They shouldn’t be too high.

  3. If you have jobs that naturally take a long time to finish, put them in a separate queue (a ‘tube’ in Beanstalkd terms) and dedicate some workers to it, so the rest of your jobs won’t be delayed by them.

  • No workers are running

That means nobody is processing queues and jobs are piling up! Not good. Needs immediate attention. Find out why supervisord can’t spawn new worker processes.

  • Beanstalkd server is not running or not responsive

This is not very likely, but you should still make sure the Beanstalkd server is running. One possible problem is that you have set a memory limit for it and it has reached that limit.

  • Supervisord server is not running or not responsive

Also not very likely, but it should be checked constantly.
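The queue-related checks above can be sketched as a small function that a cronjob calls with the statistics of a tube. findAlerts and the thresholds are made up for illustration. The array keys are fields from Beanstalkd’s stats-tube response, and fetching those stats with your client is left out on purpose.

```php
<?php

// Hypothetical check, meant to be run from cron. $tubeStats mirrors fields of
// Beanstalkd's stats-tube response; the thresholds are example values.
function findAlerts(array $tubeStats, int $maxReady = 1000, int $maxBuried = 50): array
{
    $alerts = [];

    if ((int) ($tubeStats['current-jobs-ready'] ?? 0) > $maxReady) {
        $alerts[] = 'too many ready jobs';
    }
    if ((int) ($tubeStats['current-jobs-buried'] ?? 0) > $maxBuried) {
        $alerts[] = 'too many buried (failed) jobs';
    }
    if ((int) ($tubeStats['current-watching'] ?? 0) === 0) {
        $alerts[] = 'no workers are watching the tube';
    }

    return $alerts;
}
```

One caveat: every new Beanstalkd connection starts out watching the ‘default’ tube, so if your jobs live there, the monitoring connection itself may be counted in current-watching.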

© Ahmet Kun 2018