Embedding Vector Search w/ Django & PostgreSQL, Part 1

django, docker, postgresql, programming, Python

Sigh. 🙄 Yes, yet another “AI” article in a blog.

The pgvector extension allows PostgreSQL to support vector searches, which is useful for implementing RAG (retrieval-augmented generation) solutions. The linked project page has all the background information on the technology, so this post will not duplicate it.

This post is a runbook on setting up:

  • a Docker environment with PostgreSQL and the pgvector extension.
  • a Django application with the client libraries ready to use the vector extension of the PostgreSQL DB.

Setup (Docker)

The first step is to create a “service” (docker-compose concept) for the DB:

services:
  ...
  postgres:
    # Check for latest version as necessary/desired
    image: pgvector/pgvector:pg17
    environment:
      # remove before pushing up to source control
      - POSTGRES_PASSWORD=mysecretpassword
    ports:
      - "5432:5432"
    expose:
      - "5432"
    volumes:
      # Optional; use a host volume
      - ./data/psql:/var/lib/postgresql/data

Then bring up the service once to initialize the DB:

> docker-compose up postgres
...
postgres-1  | 2025-02-13 03:05:22.288 UTC [1] LOG:  database system is ready to accept connections

The image pgvector/pgvector:pg17 is a Docker image for PostgreSQL 17 with the pgvector extension bundled.

Create a DB (if necessary)

If the application needs its own database, create one:

> docker-compose run --rm postgres psql -h postgres -U postgres 
Password for user postgres: .....
...
> create database mydb;
> \q

The POSTGRES_PASSWORD environment entry can now be removed from docker-compose.yml; it is only needed the first time the data directory is initialized.

Django Connection to PostgreSQL

Add the psycopg dependency so we can use PostgreSQL.

Shell into the Django app container and run:

> pip install psycopg
...
> pip freeze -r requirements.txt > requirements.txt

Tell Django to use the PostgreSQL DB instead of the default SQLite:

DATABASES = {
    # 'default': {
    #     'ENGINE': 'django.db.backends.sqlite3',
    #     'NAME': BASE_DIR / 'db.sqlite3',
    # }
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": "postgres",  # Matches the docker "service" name
        "NAME": "mydb",      # the DB name
        "USER": "postgres",  # the default user
        "PASSWORD": "mysecretpassword",  # the password from POSTGRES_PASSWORD above
    }
}
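
Since the password should not be committed to source control in settings.py any more than in docker-compose.yml, one option (a sketch, not part of the original setup) is to read it from an environment variable set on the Django app's container:

import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": "postgres",
        "NAME": "mydb",
        "USER": "postgres",
        # Hypothetical: supply DJANGO_DB_PASSWORD via the app service's
        # environment (or an env_file) instead of hard-coding it here.
        "PASSWORD": os.environ.get("DJANGO_DB_PASSWORD", ""),
    }
}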

Verify the connection by running Django's initial migrations:

> python manage.py migrate
...
Operations to perform:
  Apply all migrations: admin, auth, contenttypes, sessions
Running migrations:
  Applying contenttypes.0001_initial... OK
  ...
  Applying sessions.0001_initial... OK

Install pgvector-python

Follow the directions here: https://github.com/pgvector/pgvector-python. Shell into the Django app container and run:

> pip install pgvector
...
> pip freeze -r requirements.txt > requirements.txt

The repository also includes examples specifically for Django.
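
A minimal sketch based on those examples (the Item model, the 3-dimension embedding, and the migration file name are made-up names):

# myapp/models.py (hypothetical example)
from django.db import models
from pgvector.django import VectorField


class Item(models.Model):
    # The dimension must match whatever embedding model produces the vectors
    embedding = VectorField(dimensions=3)


# myapp/migrations/0002_enable_pgvector.py (hypothetical example)
from django.db import migrations
from pgvector.django import VectorExtension


class Migration(migrations.Migration):
    dependencies = [("myapp", "0001_initial")]
    operations = [
        VectorExtension(),  # runs CREATE EXTENSION IF NOT EXISTS vector
    ]

Nearest-neighbor queries then use the distance functions from pgvector.django (e.g. L2Distance) in order_by, per the README.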

PyTorch in Docker

docker, programming, Python

I’m getting into the fray that is Generative AI since, according to some, my job as a programmer will soon be taken over by some literal code-cranking machine.

There are a few things to set up. Like most tools and libraries, the instructions assume nothing else is happening on my machine: just install Anaconda globally, per their simplistic assumptions, and all the other dependencies (even the Python version already on my machine) can go to hell.

Dockerizing the environment

Well. Until I have enough $$$ to buy a new machine for each new tool I want to try out, I’ll be using Docker (thank goodness I don’t need an actual VM) to isolate an environment to play with.

After some trial and error, this is a template Dockerfile and docker-compose.yml I’m using:

Dockerfile

FROM python:3.11-bookworm
ENV PYTHONUNBUFFERED 1

WORKDIR /code

RUN apt update && apt install -y \
    vim

RUN curl -O https://repo.anaconda.com/miniconda/Miniconda3-py311_24.1.2-0-Linux-x86_64.sh
RUN sh Miniconda3-py311_24.1.2-0-Linux-x86_64.sh -b
ENV PATH="$PATH:/root/miniconda3/bin"

RUN conda install -y pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
RUN conda init

The above environment has:

  • Vim
  • Python 3.11
  • Miniconda3
  • PyTorch with CUDA 12.1

docker-compose.yml

services:
  app:
    build: .
    volumes:
      - ./:/code
    tty: true
    stdin_open: true

The docker-compose.yml is the same template I use for other projects. Nothing special here.
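
Once the container is up (e.g. docker-compose up -d app, then a shell into it), a quick sanity check of the PyTorch install is to run a short script:

import torch

# Basic sanity check: PyTorch version and whether a CUDA GPU is visible.
# (cuda.is_available() will be False unless the container is run with GPU
# access, e.g. via the NVIDIA container toolkit.)
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))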

Celery with RabbitMQ on Docker

celery, django, docker, programming, Python

Picking up from Django app template w/ Docker, here are the steps to add Celery to the Django app.

Add RabbitMQ as the message queue

Modify docker-compose.yml to include:

services:
  ...
  rabbitmq:
    image: rabbitmq:3-management
    ports:
      - "15672:15672"
    expose:
      - "15672"

I use the “3-management” tag so that it includes a management plugin (accessible at http://localhost:15672/). However, simpler tags (e.g. “3” or “latest“) can be used if the management UI is not needed.

Install celery to the Django project

docker-compose run --rm app /bin/bash
...
pip install celery
pip freeze -r requirements.txt > requirements.txt
exit

Rebuild the image so that it includes the new packages pulled in by Celery.

Add a couple of files to set up Celery in our Django project

The Celery stuff will be added into the myapp Django app.

myapp/celery.py

from celery import Celery


app = Celery(
    'celery_app',
    broker='amqp://rabbitmq',
    # backend='rpc://',

    # This should include modules/files that define tasks. This is a list of strs 
    # to be evaluated later in order to get around circular dependencies, I suspect.
    include=[  
        'myapp.tasks',  # This is our file containing our task
    ]
)

# Optional configuration, see the application user guide.
app.conf.update(result_expires=3600)


if __name__ == '__main__':
    app.start()

myapp/tasks.py

import logging

from myapp.celery import app


logger = logging.getLogger(__name__)


# This is where Celery tasks are defined


@app.task
def add(x: int, y: int) -> int:
    logger.info(f"add({x}, {y}) called.")
    return x + y

Add a Celery service to docker-compose.yml

Modify docker-compose.yml again to add a Celery service. This can be done together with the RabbitMQ service above, but it is shown here separately for readability.

services:
  ...
  rabbitmq:
    ...
  app-celery:
    build: .
    environment:
      - DJANGO_SETTINGS_MODULE=myapp.settings
    command: >
      sh -c "celery -A myapp.celery worker --loglevel=INFO"
    volumes:
      - ./:/code
    depends_on:
      rabbitmq:
        condition: service_started

Things to watch out for

A bunch of things to highlight to show where the connection points are:

  • The broker URL when instantiating the Celery app is amqp://rabbitmq (not amqp://localhost) because of how networking between Docker containers works. The “rabbitmq” here is the name of the service running the RabbitMQ container, so if a different service name is used, the AMQP URL needs to use that name instead.
  • The Celery app parameter (-A myapp.celery) is the path to the myapp/celery.py file where the Celery app (app = Celery('celery_app', ...) ) is created.
  • Speaking of which, when defining the Celery app, its include=[ ... ] should include str values that point to files where Celery tasks are defined.
  • And the task files that define the Celery tasks need to import the Celery app and use its @app.task decorator for the task functions. A quick end-to-end check is shown below.
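
To confirm the plumbing end to end, a task can be queued from a Python shell in the app container (a quick sketch; since the result backend line is commented out above, the return value can’t be fetched here, so look for the result in the app-celery worker’s log instead):

from myapp.tasks import add

# Queues the task onto RabbitMQ; the app-celery worker picks it up.
result = add.delay(4, 6)
print(result.id)  # task ID; the actual sum appears in the worker's log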

Complete docker-compose.yml

The entire file looks like:

services:
  app:
    build: .
    command: >
      sh -c "python manage.py migrate &&
        python manage.py runserver 0.0.0.0:8000"
    ports:
      - "8000:8000"
    expose:
      - "8000"
    volumes:
      - ./:/code
    depends_on:
      rabbitmq:
        condition: service_started
    tty: true
    stdin_open: true
  app-celery:
    build: .
    command: >
      sh -c "celery -A myapp.celery worker --loglevel=INFO"
    volumes:
      - ./:/code
    depends_on:
      rabbitmq:
        condition: service_started
  rabbitmq:
    image: rabbitmq:3-management
    ports:
      - "15672:15672"
    expose:
      - "15672"

Django app template w/ Docker

django, docker, programming, Python

Revisiting https://www.pn.therealvan.com/2021/01/24/postgresql-and-mysql-docker-containers/, this post focuses on a plain Django app with minimal dependencies:

  • no pipenv
  • the default SQLite DB

Bootstrapping

Start with these files:

Dockerfile

FROM python:3
ENV PYTHONUNBUFFERED 1

WORKDIR /code
#COPY requirements.txt /code/

#RUN pip install --upgrade pip && pip install -r requirements.txt

docker-compose.yml

version: '3'
services:
  app:
    build: .
    #command: >
    #  sh -c "python manage.py migrate &&
    #    python manage.py runserver 0.0.0.0:8000"
    ports:
      - "8000:8000"
    expose:
      - "8000"
    volumes:
      - ./:/code
    tty: true
    stdin_open: true

Run these to start up a container:

docker-compose build
docker-compose run --rm app /bin/bash

Initializing a Django project

Run these inside the container:

pip install django
pip freeze > requirements.txt

django-admin startproject myproj .
django-admin startapp myapp

exit

Rebuild and start the app

Now uncomment the lines in Dockerfile and docker-compose.yml.

Rebuild the image:

docker-compose build

Modify myproj/settings.py to add a line to register myapp to Django:

INSTALLED_APPS = [
    ...
    'myapp.apps.MyappConfig',  # Add this line
]

Now start the app again:

docker-compose up app

This should now bring up the app listening at http://localhost:8000/.
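
As an optional smoke test that myapp is really wired in (the view and URL below are hypothetical additions, not part of the template), add a trivial view and route:

# myapp/views.py (hypothetical)
from django.http import HttpResponse


def hello(request):
    return HttpResponse("Hello from myapp")


# myproj/urls.py (add alongside the existing admin entry)
from django.urls import path

from myapp.views import hello

urlpatterns = [
    path("hello/", hello),
]

Visiting http://localhost:8000/hello/ should then return the plain-text response.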

Wagtail in Docker

django, docker, programming, Python, Windows

Over the weekend I found out about a CMS built on top of Django called Wagtail.

The installation instructions promise an easy path. Just run:

pip install wagtail

Since I guard my global pip pretty carefully in order to reduce version collisions and whatnot, my first inclination was to see if I could try this out from within a Docker container.

To start the experiment, I spin up a Python container:

mkdir wagtail
cd wagtail

docker run -it --rm \
 --mount type=bind,source="%CD%",target=/wagtail \
 python:3.7-buster /bin/bash

I use “%CD%” because I’m on Windows. If I were on Linux/Mac, I suppose it would be something like “${PWD}” instead.

So once in, I install and initialize per the instructions in the Wagtail 3.0.1 “Getting started” documentation:

cd /wagtail
pip install wagtail
wagtail start hahaha 

Interestingly, that wagtail start command, in addition to generating a bunch of files typical for a Django app, also created a Dockerfile.

So maybe the installer already has some Docker support in mind?

After reading through it and spending more time than I expected getting it to work, here are the steps to:

  • Get a Wagtail project set up from scratch using Docker without having to install any Python on the host.
  • Use a volume from the host for the app AND the SQLite DB so that they can be conveniently backed up and transferred (e.g. pushed up to some code repository like GitHub or SVN).

Generate a new Wagtail project

Pretty much what I had above:

mkdir wagtail
cd wagtail

docker run -it --rm \
 --mount type=bind,source="%CD%",target=/wagtail \
 python:3.7-buster /bin/bash
cd /wagtail
pip install wagtail
wagtail start hahaha
exit

I’m using “hahaha” as my project name. Substitute as needed. Also: %CD% should be ${PWD} for Linux/Mac.

Build the Image

cd hahaha
docker build -t hahaha . 

Fix up file permissions (Windows only)

Since the app should be run by the user wagtail:wagtail, fix up the permissions on the files.

For some reason this is not done correctly for Windows despite the command in Dockerfile, so a manual step is required:

docker run -it --rm \
 --mount type=bind,source="%CD%",target=/app \
 --user root hahaha /bin/bash

chown -R wagtail:wagtail .
exit

Set Up the App

This is typical Django setup stuff:

docker run -it --rm \
 --mount type=bind,source="%CD%",target=/app \
 hahaha /bin/bash
python manage.py migrate
python manage.py createsuperuser
exit

The above creates the db.sqlite3 file used by the app and sets up an admin user for signing into the app.

Run The App

Finally, to run the app:

docker run -it --rm -p 8000:8000 \
 --mount type=bind,source="%CD%",target=/app hahaha

The -it --rm arguments are optional, but they help in stopping and cleaning up the container during development.

The site can be accessed at, of course, http://localhost:8000/. To manage it, use the superuser created earlier to get into the Admin Interface.

Worker Timeout

And now to actually start playing around with Wagtail….

So far I’m seeing a lot of errors when requesting pages in the Admin site. They look like this:

[2022-08-01 01:48:15 +0000] [10] [CRITICAL] WORKER TIMEOUT (pid:12)
[2022-08-01 01:48:15 +0000] [12] [INFO] Worker exiting (pid: 12)
[2022-08-01 01:48:15 +0000] [13] [INFO] Booting worker with pid: 13
[2022-08-01 01:48:58 +0000] [10] [CRITICAL] WORKER TIMEOUT (pid:13)
[2022-08-01 01:48:58 +0000] [13] [INFO] Worker exiting (pid: 13)
[2022-08-01 01:48:58 +0000] [14] [INFO] Booting worker with pid: 14
[2022-08-01 01:49:56 +0000] [10] [CRITICAL] WORKER TIMEOUT (pid:14)
[2022-08-01 01:49:56 +0000] [14] [INFO] Worker exiting (pid: 14)
[2022-08-01 01:49:56 +0000] [15] [INFO] Booting worker with pid: 15
[2022-08-01 01:50:34 +0000] [10] [CRITICAL] WORKER TIMEOUT (pid:15)
[2022-08-01 01:50:34 +0000] [15] [INFO] Worker exiting (pid: 15)
[2022-08-01 01:50:34 +0000] [16] [INFO] Booting worker with pid: 16

PostgreSQL and MySQL Docker containers

DB, docker, mysql, postgresql

To quickly get an instance of PostgreSQL and MySQL up and running, use the following docker-compose setup:

Create a subdirectory to place the docker-compose.yml and optionally the data files for the DBs:

Windows

cd %USERPROFILE%
mkdir dbs\data\mysql
mkdir dbs\data\psql
cd dbs

Others

cd ~
mkdir -p dbs/data/mysql
mkdir -p dbs/data/psql
cd dbs

Add this docker-compose.yml to start with:

version: '3.1'
services:
  mysql:
    image: mysql
    command: --default-authentication-plugin=mysql_native_password
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: password
    ports:
      - "3306:3306"
    expose:
      - "3306"
    volumes:
      - ./data/mysql:/var/lib/mysql
  postgresql:
    image: postgres
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=password
    ports:
      - "5432:5432"
    expose:
      - "5432"
    volumes:
      - ./data/psql:/var/lib/postgresql/data

To bring them up:

docker-compose up

By default:

  • The PostgreSQL instance uses postgres / password as the admin.
  • The MySQL instance uses root / password for the admin.
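
As a quick connectivity check from the host side (this assumes psycopg2-binary is installed locally; it is not part of the compose setup), something like the following should print the server version for the PostgreSQL instance:

import psycopg2

# Connect to the containerized PostgreSQL using the defaults above.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    user="postgres",
    password="password",
    dbname="postgres",
)
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])
conn.close()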

Django Dev Env w/ Docker Compose

django, docker, postgresql, programming

A few years back I ran into a problem when working with Django on Windows while my colleagues were on Mac OS: a datetime routine (I forget which one) behaved differently between us. Even after syncing the Python and Django versions between us, the discrepancy remained. It turned out to be due to a difference between Python on Windows and Python on Mac OS. We ended up working around it by not using that routine.

Thinking back now, I guess the problem could’ve been avoided if we had used Docker or Vagrant or similar so that we were at least all on the same environment. It’s the type of thing a “real” work environment would’ve required. But since we were working on that project on our own as a hobby, we didn’t think too much about it.

ALSO: Docker Desktop, or even Linux on Windows Home, was not available at the time, so most likely I would’ve had to wrestle with Docker Toolbox and VirtualBox, which still had problems with host volumes.

UPDATE: this post has been updated on 2022-05 based on new learnings.

Setting Up Environment in Docker

If I were to do it now, this is how I would do it:

  • Create a subdirectory for DB data. We were using PostgreSQL, so I would create something like C:\dbdata\ and use a host volume to mount it to the container’s /var/lib/postgresql/data.
  • Use the postgres and python:3 base images from Docker Hub.

Step-by-step, here’s how I would set it up:

Project scaffold

NOTE: the following is using “myproject” as the name of the Django project. Replace it with the name of your Django project as appropriate.

cd dev/projects
mkdir dj
cd dj

Create two starter versions of Dockerfile and docker-compose.yml:

Dockerfile

FROM python:3.7-buster
ENV PYTHONUNBUFFERED 1

WORKDIR /code
#COPY Pipfile Pipfile.lock /code/
#
RUN pip install pipenv
#RUN pipenv install

docker-compose.yml

version: '3'
services:
  app:
    build: .
#    command: >
#      sh -c "pipenv run python manage.py migrate &&
#             pipenv run python manage.py runserver 0.0.0.0:8000"
    ports:
      - "8000:8000"
    expose:
      - "8000"
    volumes:
      - ./:/code
    tty: true
    stdin_open: true

Then build and start up the containers:

docker-compose build
docker-compose run --rm app /bin/bash
pipenv install
pipenv install django 
pipenv install <other stuff as needed>

pipenv run django-admin startproject myproject .
pipenv run django-admin startapp myapp

Now uncomment the lines previously commented in Dockerfile and docker-compose.yml.

PostgreSQL Setup

Modify myproject/settings.py to use PostgreSQL:

...
DATABASES = {
    #'default': {
    #    'ENGINE': 'django.db.backends.sqlite3',
    #    'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
    #}
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'postgres',
        'USER': 'postgres',
        'PASSWORD': 'postgres',
        'HOST': 'db',  # MUST match the service name for the DB
        'PORT': 5432,
    }
}
...

All pipenv-related operations should be done inside the container.

docker-compose run --rm app /bin/bash
pipenv install psycopg2-binary

Modify docker-compose.yml to bring up the DB and app containers:

version: '3'
services:
  # service name must match the HOST in myproject/settings.py's DATABASES
  db:
    image: postgres
    environment:
      # Must match the values in myproject/settings.py's DATABASES
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
      # Put the DB data for myproject under myproject_db 
      # so that I can add more projects later
      - PGDATA=/var/lib/postgresql/data/myproject_db
    ports:
      - "5432:5432"
    expose:
      - "5432"
    volumes:
      # host volume where DB data are actually stored
      - c:/dbdata:/var/lib/postgresql/data
  app:
    build: .
    command: >
      sh -c "pipenv run python manage.py migrate &&
             pipenv run python manage.py runserver 0.0.0.0:8000"
    ports:
      - "8000:8000"
    expose:
      - "8000"
    volumes:
      - ./:/code
    depends_on:
      - db

The above:

  • sets up two “services” (containers): a “db” service for the DB in addition to the “app” service for the app.
  • sets up a host mount (for the “db” service) of c:\dbdata to the container’s /var/lib/postgresql/data where PostgreSQL stores/uses data for the DBs. This will allow the data to persist beyond the container’s life time.
  • sets up the PGDATA environment variable that specifies to PostgreSQL the data subdirectory to be /var/lib/postgresql/data/myproject_db which, because of the mount, will end up as c:\dbdata\myproject_db on my Windows host. This allows c:\dbdata to be used as a parent directory for multiple project DBs.

Bring Up The Environment

Just run:

docker-compose up --build app

The above will:

  • Build the images and start the containers for the db and app services.
  • Initialize a new empty PostgreSQL database.
  • Run the Django migrations to prime the database for Django.
  • Run the app and have it listen on port 8000.

NOTE: there may be a race condition on the first run where the DB is still being built/initialized when the app service starts.

This error happens in that case:

web_1 | psycopg2.OperationalError: could not connect to server: Connection refused
web_1 | Is the server running on host "db" (172.19.0.2) and accepting
web_1 | TCP/IP connections on port 5432?

Just wait until the “db” service finishes initializing, hit CTRL-C, and run the

docker-compose up --build app

command again. It should now work fine.

Optionally, start up the “db” service first in the background, then start up the “app” service:

docker-compose up -d db
docker-compose up app
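
A third option (not from the original post, just a sketch) is a small wait-for-db helper that the app service runs before migrate, retrying until PostgreSQL accepts connections:

# wait_for_db.py (hypothetical helper; run via
# "pipenv run python wait_for_db.py" before "manage.py migrate")
import time

import psycopg2

for attempt in range(30):
    try:
        psycopg2.connect(
            host="db", user="postgres", password="postgres", dbname="postgres"
        ).close()
        print("DB is up")
        break
    except psycopg2.OperationalError:
        print("DB not ready yet, retrying...")
        time.sleep(2)
else:
    raise SystemExit("DB never became available")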