Data Model for Embedding
Typically, a model is used whose records include:
- the text being embedded
- the embedding (vector of floats)
However, RDBMSes have strict and relatively static schemas that don't take kindly to changes. Therefore, to allow some flexibility in supporting different variations, two more fields are added:
- model name — the model used to generate the embedding
- embedding size — the cardinality of the vector
This way we can store vectors from different models and of different sizes without having to change the schema or create a new table altogether.
Of course, the field to store the text should be large enough to accommodate the largest clause anticipated.
The final schema looks like this:
```python
from django.db import models
from pgvector.django import VectorField


class EmbeddingVector(models.Model):
    # Make sure the max_length is large enough for all anticipated values
    the_text = models.CharField(max_length=10000, unique=True)
    model_name = models.CharField(max_length=100)
    # Do not add dimensions=... here (see the note below)
    embedding = VectorField()
    embedding_size = models.IntegerField()

    class Meta:
        # Optional but recommended composite index, since all queries
        # will have these fields as a filter
        indexes = [
            models.Index(fields=["model_name", "embedding_size"]),
        ]
```
The examples on the pgvector-python site have a `dimensions=xxxx` argument added to the `VectorField`. Do not do this unless the dimension size will never change. Otherwise, it'd require an `ALTER TABLE` down the road, and that may even end up locking the table during migration.
The idea is that, when storing an embedding record, you also record the vector's dimension in `embedding_size` and the model used to generate the embedding in `model_name`. This allows the same table to store different embeddings.
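For example, a minimal sketch of saving a record might look like this (`calculate_embedding()` is a placeholder for whatever embedding call you use, and the model name shown is hypothetical):

```python
vector = calculate_embedding(clause_text)  # list of floats

EmbeddingVector.objects.create(
    the_text=clause_text,
    model_name="text-embedding-3-small",  # hypothetical model name
    embedding=vector,
    embedding_size=len(vector),
)
```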
Querying
When querying, include a filter for the `model_name` and `embedding_size` so that we're comparing apples to apples:
```python
from pgvector.django import CosineDistance

# Embedding for the question/clause to search similar vectors for.
# Be sure to use the SAME model/logic used to create the embeddings
# stored in the table.
embedding = calculate_embedding(question)

similar_vectors = EmbeddingVector.objects.filter(
    model_name=xxxx, embedding_size=n
).annotate(
    distance=CosineDistance('embedding', embedding)
).order_by('distance')
```
where `xxxx` and `n` indicate the model and dimension of the vectors to process. If the embedding size is wrong or omitted, you'd get an error whenever a vector of a different dimension is picked up. Unfortunately, if the model name is wrong or omitted, you'd most likely get incorrect results with no warning.
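As a quick usage sketch, each result carries the annotated distance (the limit of 5 here is arbitrary):

```python
# Smaller distance = closer match (see the next section).
for match in similar_vectors[:5]:
    print(f"{match.distance:.4f}  {match.the_text[:80]}")
```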
Vector Distance vs. Confidence
Another thing to note: typically, when querying for cosine-similar vectors, the value associated with each result is in the range `[-1.0, 1.0]`, where the closer to 1.0 the better the match. However, this lookup works by computing the distance between the provided embedding and the records in the table, and for a distance, smaller is better (i.e. the closer to 0.0 the better). So don't sort in descending order. If for some reason the traditional confidence value is needed, use `1.0 - distance`. A warning, though: while playing with this I have seen records whose distance is `< -1.0`, so take that into account.
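For instance, a minimal sketch of recovering a confidence-like score from the query above (assuming the queryset is non-empty):

```python
best = similar_vectors.first()
confidence = 1.0 - best.distance  # closer to 1.0 = better match
```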
Next post: comparing how to do this with Neo4j.