Celery/RabbitMQ Notes on Task Timings

celery, django, rabbitmq

The following notes apply when using Celery with the RabbitMQ broker.

Celery Settings

task_acks_late

The Celery setting task_acks_late (by default disabled), if set, will defer message ACK with RabbitMQ until the task completes successfully.

  • If it is enabled, and the Celery task takes too long (see consumer_timeout below), the message will be requeued and redelivered to the next worker. This will cause the message to be processed multiple times.
  • If it is disabled, then Celery will ACK the message to RabbitMQ as soon as the task starts. This tells RabbitMQ that the message was “delivered and processed,” and RabbitMQ will delete this message from the queue. This will cause the message to be lost if the Celery task was interrupted before it finishes.

task_acks_on_failure_or_timeout

Then there is task_acks_on_failure_or_timeout (by default enabled). According to the doc, this will ACK the message even if the task failed or timed out. This may or may not be the correct choice depending on the task.

RabbitMQ Settings

consumer_timeout

The RabbitMQ consumer_timeout configuration (by default 30 minutes) gives a task a maximum timeout before requeuing the message. If a client (a Celery task in this case) does not ACK the message before this timeout expires, the message will be requeued and redelivered.

TTL / message TTL

Furthermore, the TTL / message TTL configuration (by default ??) determines how long a message will stay in RabbitMQ before being discarded. If a message stays in a queue for longer than this time, then it is removed from the queue.

Implications

To increase the chance that tasks are not aborted/lost due to restarts (e.g. deployments):

  • Design each task to finish as soon as possible. Spawning additional smaller tasks if necessary.
  • The entire task is designed to survive a partial completion state and is able to completely restart without ending up in a bad state (e.g. duplicate records created, abandoned/orphaned/partial updates).
  • Enable task_acks_late so that RabbitMQ ACKs are not sent until tasks are finished.
  • Extend the TTL / message TTL for RabbitMQ so that we have enough time to consume messages (in case the Celery task needs to be paused for some time to fix bugs).

There is no support to retry failed tasks. In fact, unless there is a Result Backend set up, there won’t even be a good way to audit which tasks failed (other than application logs).

PyTorch in Docker

docker, programming, Python

I’m getting into the fray that is Generative AI since, according to some, my job as a programmer will soon be taken over by some literal code cranking machine.

There are a few things to set up:

Like most tools and libraries, the instructions assume that I have nothing else happening on my machine, so install just Anaconda globally according to their simplistic assumptions. All the other dependencies like even the Python version on my machine can go to hell.

Dockerizing the environment

Well. Until I have enough $$$ to buy a new machine for each new tool I want to try out, I’ll be using Docker (think goodness I don’t need an actual VM) to isolate an environment to play with.

After some trial and error, this is a template Dockerfile and docker-compose.yml I’m using:

Dockerfile

FROM python:3.11-bookworm
ENV PYTHONUNBUFFERED 1

WORKDIR /code

RUN apt update && apt install -y \
    vim

RUN curl -O https://repo.anaconda.com/miniconda/Miniconda3-py311_24.1.2-0-Linux-x86_64.sh
RUN sh Miniconda3-py311_24.1.2-0-Linux-x86_64.sh -b
ENV PATH="$PATH:/root/miniconda3/bin"

RUN conda install -y pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
RUN conda init

The above environment has:

  • Vim
  • Python 3.11
  • Miniconda3
  • PyTorch with CUDA 12.1

docker-compose.yml

services:
  app:
    build: .
    volumes:
      - ./:/code
    tty: true
    stdin_open: true

The docker-compose.yml is a template one I use for other things. Nothing special here.