Data Model for Embedding
Typically, a model is used whose records include:
- the text being embedded
- the embedding (vector of floats)
However, RDBMSes have strict and relatively static schemas that don't take kindly to changes. Therefore, to allow some flexibility in supporting different variations, two more fields are added:
- model name — the model used to generate the embedding
- embedding size — the cardinality of the vector
This way we can store vectors from different models and of different sizes without having to change the schema or create a new table altogether.
Of course, the field to store the text should be large enough to accommodate the largest clause anticipated.
The final schema looks like this:
```python
from django.db import models
from pgvector.django import VectorField


class EmbeddingVector(models.Model):
    # Make sure the max_length is large enough for all anticipated values
    the_text = models.CharField(max_length=10000, unique=True)
    model_name = models.CharField(max_length=100)
    # Do not add dimensions=... here (see the note below)
    embedding = VectorField()
    embedding_size = models.IntegerField()

    class Meta:
        # Optional but recommended composite index, since all queries
        # will have these fields as a filter
        indexes = [
            models.Index(fields=["model_name", "embedding_size"]),
        ]
```
The examples on the pgvector-python site have a `dimensions=xxxx` argument added to the `VectorField`. Do not do this unless the dimension size will never change. Otherwise, it'd require an `ALTER TABLE` down the road, and that may even end up locking the table during migration.
The idea is that, when storing an embedding record, you also record the vector's dimension in `embedding_size` and the model used to generate the embedding in `model_name`. This allows the same table to store different embeddings.
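For example, a minimal sketch of saving a record might look like this (`calculate_embedding()` is a placeholder for whatever embedding call you use, and the model name shown is hypothetical):

```python
vector = calculate_embedding(clause_text)  # list of floats

EmbeddingVector.objects.create(
    the_text=clause_text,
    model_name="text-embedding-3-small",  # hypothetical model name
    embedding=vector,
    embedding_size=len(vector),
)
```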
Querying
When querying, include a filter for the `model_name` and `embedding_size` so that we're comparing apples to apples:
```python
from pgvector.django import CosineDistance

# Embedding for the question/clause to search similar vectors for.
# Be sure to use the SAME model/logic used to create the embeddings
# stored in the table.
embedding = calculate_embedding(question)

similar_vectors = EmbeddingVector.objects.filter(
    model_name=xxxx, embedding_size=n
).annotate(
    distance=CosineDistance('embedding', embedding)
).order_by('distance')
```
where `xxxx` and `n` indicate the model and dimension of the vectors to process. If the embedding size is wrong or omitted, you'd get an error whenever a vector of a different dimension is picked up. Unfortunately, if the model name is wrong or omitted, you'd most likely get incorrect results with no warning.
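As a quick usage sketch, each result carries the annotated distance (the limit of 5 here is arbitrary):

```python
# Smaller distance = closer match (see the next section).
for match in similar_vectors[:5]:
    print(f"{match.distance:.4f}  {match.the_text[:80]}")
```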
Vector Distance vs. Confidence
Another thing to note: typically, when querying for cosine-similar vectors, the value associated with each result is in the range `[-1.0, 1.0]`, where the closer to 1.0 the better the match. However, this lookup works by computing the distance between the provided embedding and the records in the table, and for a distance, smaller is better (i.e. the closer to 0.0 the better). So don't sort in descending order. If for some reason the traditional confidence value is needed, use `1.0 - distance`. A warning, though: while playing with this I have seen records whose distance is `< -1.0`, so take that into account.
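For instance, a minimal sketch of recovering a confidence-like score from the query above (assuming the queryset is non-empty):

```python
best = similar_vectors.first()
confidence = 1.0 - best.distance  # closer to 1.0 = better match
```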
Next post: comparing how to do this with Neo4j.