
Build a Semantic Search Engine with Embeddings and Qdrant
A Python tutorial: embed your data, store it in Qdrant, and run semantic search that understands meaning, then add keyword hybrid search and reranking.
Key takeaways
- Semantic search compares meaning instead of characters, so a query like 'how do I cancel my plan' still matches a document about 'terminating a subscription'.
- Read the vector size from the model rather than hard-coding it, since a mismatch between model and collection makes Qdrant reject the upsert.
- Hybrid search fuses dense and sparse results with Reciprocal Rank Fusion, recovering keyword precision for exact terms like part numbers and error codes.
- A cross-encoder reranker reads query and document together for a more accurate final ordering, so retrieve a wide candidate set cheaply then rerank only those.
Keyword search matches strings. If a user types "how do I cancel my plan" and your help docs say "terminating a subscription," classic keyword search returns nothing useful, because none of those words overlap. Semantic search fixes that by comparing meaning instead of characters. The query and the document land near each other in a vector space, so the right document comes back even when the wording is completely different.
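To make the keyword failure concrete, here is a minimal sketch of what a literal matcher has to work with for that pair. The helper names tokens and keyword_overlap are made up for illustration, and real keyword engines add stemming and scoring on top, but the starting point is the same set of shared words.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_overlap(query: str, document: str) -> set[str]:
    """Return the words the query and the document literally share."""
    return tokens(query) & tokens(document)

query = "how do I cancel my plan"
document = "terminating a subscription"

shared = keyword_overlap(query, document)
print(shared)  # an empty set: a literal matcher has nothing to score
```

With no shared tokens there is no signal at all, which is the gap the embedding model fills.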
In this tutorial we build a working semantic search engine in Python. We embed a small dataset, store the vectors in Qdrant, and run meaning-based queries. Then we add hybrid search to bring keyword precision back into the mix, and finish with a cross-encoder reranker that reorders the top hits for quality. By the end you'll have something you can point at your own data.
I'll be honest about where each technique earns its keep and where it doesn't, because semantic search is not free and it's not always the right answer.
What you'll need
- Python 3.10 or newer.
- Docker, to run Qdrant locally. You can also use Qdrant Cloud, but local is simpler to start with.
- About 2 GB of free RAM. The embedding model is small, but it still loads into memory.
Install the Python packages:
pip install qdrant-client sentence-transformers
qdrant-client talks to the database. sentence-transformers gives us the embedding model and, later, the cross-encoder for reranking. No API keys, no cloud account, nothing to sign up for.
Run Qdrant
Qdrant ships as a single Docker image. Start it:
docker run -p 6333:6333 -p 6334:6334 \
    -v "$(pwd)/qdrant_storage:/qdrant/storage" \
    qdrant/qdrant
Port 6333 serves the REST API and the dashboard; 6334 is gRPC. The Python client talks REST by default, and you can pass prefer_grpc=True to QdrantClient if you want gRPC for bulk operations. The volume mount keeps your data on disk so a container restart doesn't wipe everything. Once it's up, open http://localhost:6333/dashboard to poke around the web UI.
Pick a model and create a collection
The single most common mistake with vector databases is mismatching the vector size between the model and the collection. So decide the model first, then build the collection around it.
We'll use all-MiniLM-L6-v2 from sentence-transformers. It produces 384-dimensional vectors, runs fine on CPU, and is good enough for most search tasks. If you'd rather use OpenAI's text-embedding-3-small, that's 1536 dimensions, and you'd set size=1536 below. The number has to match the model exactly or upserts will be rejected.
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance
from sentence_transformers import SentenceTransformer
client = QdrantClient(url="http://localhost:6333")
model = SentenceTransformer("all-MiniLM-L6-v2")
VECTOR_SIZE = model.get_sentence_embedding_dimension() # 384
COLLECTION = "articles"
# Drop any previous copy so the script can be re-run during development.
if client.collection_exists(COLLECTION):
    client.delete_collection(COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
)
I read the dimension straight off the model with get_sentence_embedding_dimension() instead of hard-coding 384. That way, if you swap models later, the collection follows along and you can't drift out of sync.
About distance: cosine is the right default for sentence-transformers models, because the embeddings are meant to be compared by angle, not magnitude. Qdrant also offers Distance.DOT and Distance.EUCLID. If your model emits normalized vectors, dot product and cosine give the same ranking, and dot is slightly cheaper. Cosine is the safe choice when you're not sure.
The delete-then-create step makes the script re-runnable during development: it drops the collection if it exists and makes a fresh one. Older tutorials do this with recreate_collection, which recent client releases deprecate. In production you'd call create_collection once and never blow the collection away by accident.
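For the production case, a create-if-missing guard is one way to avoid dropping data by accident. The sketch below assumes a client that exposes collection_exists and create_collection, as recent qdrant-client releases do. ensure_collection is a hypothetical helper name, not part of the library.

```python
def ensure_collection(client, name, vectors_config):
    """Create the collection only when it is missing.

    Returns True if a collection was created, False if it already existed.
    Existing data is never touched.
    """
    if client.collection_exists(name):
        return False
    client.create_collection(collection_name=name, vectors_config=vectors_config)
    return True

# Usage with the objects defined earlier in this tutorial:
# ensure_collection(
#     client,
#     COLLECTION,
#     VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
# )
```

Note that this guard does not check whether an existing collection has the right vector size, so a model swap still needs a deliberate migration.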
Embed the documents
Here's a tiny dataset standing in for whatever you actually have: support articles, product descriptions, log lines, anything textual.
docs = [
    {"id": 1, "title": "Reset your password", "text": "If you forgot your password, use the reset link on the login screen to set a new one."},
    {"id": 2, "title": "Cancel a subscription", "text": "You can terminate your plan at any time from billing settings. Access continues until the period ends."},
    {"id": 3, "title": "Update payment method", "text": "Add or change the credit card on file under billing. We charge the default card on renewal."},
    {"id": 4, "title": "Export your data", "text": "Download a full archive of your account data as JSON from the privacy section."},
    {"id": 5, "title": "Invite teammates", "text": "Send invitations by email to add members to your workspace. Seats are billed per active user."},
]
texts = [f"{d['title']}. {d['text']}" for d in docs]
vectors = model.encode(texts, normalize_embeddings=True).tolist()
Two things worth flagging. I concatenate the title and body before embedding so the model sees both signals in one vector. And I pass normalize_embeddings=True, which scales every vector to unit length. With normalized vectors, cosine distance behaves predictably and you avoid a class of subtle ranking bugs. model.encode returns a NumPy array, so .tolist() converts it to plain Python lists that the client serializes cleanly.
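To see why unit-length vectors make scoring predictable, here is a small pure-Python sketch with toy 2-D vectors (made up for illustration, not real embeddings). It shows raw dot product favoring the longest vector, while dot product on normalized vectors reproduces the cosine ranking.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

def ranking(score_fn, q, corpus):
    """Return corpus keys ordered from best to worst score."""
    return sorted(corpus, key=lambda name: score_fn(q, corpus[name]), reverse=True)

query_vec = [3.0, 4.0]
toy_docs = {"a": [4.0, 3.0], "b": [10.0, 0.0], "c": [0.5, 2.0]}

raw_dot_rank = ranking(dot, query_vec, toy_docs)    # ['b', 'a', 'c'], length wins
cosine_rank = ranking(cosine, query_vec, toy_docs)  # ['a', 'c', 'b'], angle wins

unit_query = normalize(query_vec)
unit_docs = {name: normalize(vec) for name, vec in toy_docs.items()}
unit_dot_rank = ranking(dot, unit_query, unit_docs)  # ['a', 'c', 'b'], same as cosine

print(raw_dot_rank, cosine_rank, unit_dot_rank)
```

Vector "b" tops the raw dot-product list only because it is long, not because it points the same way as the query.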
Upsert points with payload
A point in Qdrant is an id, a vector, and a payload. The payload is arbitrary JSON you attach to the vector, and it's what you read back at query time, so store the fields you'll want to display or filter on.
from qdrant_client.models import PointStruct
points = [
    PointStruct(
        id=doc["id"],
        vector=vec,
        payload={"title": doc["title"], "text": doc["text"]},
    )
    for doc, vec in zip(docs, vectors)
]
client.upsert(collection_name=COLLECTION, points=points)
upsert inserts new points and overwrites any with a matching id, so re-running the script is harmless. For a real dataset you'd batch this, a few hundred to a couple thousand points per call, rather than loading everything into one giant request.
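Batching can be sketched with a small slicing helper. The names batched and upsert_in_batches are hypothetical, and the default of 500 is just a value inside the range mentioned above, not a tuned recommendation. The only client call assumed is upsert(collection_name=..., points=...), as used in this tutorial.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def upsert_in_batches(client, collection_name, points, batch_size=500):
    """Send points one slice at a time; returns the number of requests made."""
    calls = 0
    for chunk in batched(points, batch_size):
        client.upsert(collection_name=collection_name, points=chunk)
        calls += 1
    return calls

# Usage with the objects defined earlier in this tutorial:
# upsert_in_batches(client, COLLECTION, points, batch_size=500)
```

Because upsert overwrites by id, a failed run can simply be restarted from the beginning.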
Run a semantic query
Now the part that makes this worth the trouble. Embed the query with the same model, then ask Qdrant for the nearest points.
def search(query, limit=3):
    # Embed the query exactly the way the documents were embedded.
    qvec = model.encode(query, normalize_embeddings=True).tolist()
    result = client.query_points(
        collection_name=COLLECTION,
        query=qvec,
        limit=limit,
        with_payload=True,
    )
    return result.points

for hit in search("how do I stop being billed"):
    print(round(hit.score, 3), hit.payload["title"])
The query "how do I stop being billed" shares no words with "Cancel a subscription," yet that article comes back at the top, followed by the payment method one. That's the whole point. The model maps billing, charging, plans, and subscriptions into the same neighborhood, so the match survives a complete change of vocabulary.
Note that query_points is the current API. Older tutorials use client.search(...), which is deprecated and may be absent from newer client releases. query_points returns a response object with a .points list, each carrying a score, an id, and the payload you asked for. Higher score means closer match for cosine.
Hybrid search
Pure semantic search has a weakness: it's fuzzy by design, so it can miss exact terms. Search for a part number, an error code, or a specific name, and the dense vector might rank a vaguely related document above the exact match. Keyword search is the opposite, precise on exact tokens, useless on paraphrase. Hybrid search runs both and fuses the results, so you get recall from the dense side and precision from the sparse side.
Qdrant supports this natively. You store a sparse keyword vector alongside the dense one and let the server fuse them. The cleanest way to produce sparse vectors is BM25 via the fastembed integration; install it with pip install fastembed.
First, recreate the collection so it holds both a dense and a sparse vector per point:
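from qdrant_client.models import SparseVectorParams, Modifier
# Drop any earlier version of the collection so the script stays re-runnable.
if client.collection_exists(COLLECTION):
    client.delete_collection(COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    vectors_config={"dense": VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE)},
    # Modifier.IDF has Qdrant apply the IDF part of BM25 on the server side.
    sparse_vectors_config={"bm25": SparseVectorParams(modifier=Modifier.IDF)},
)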
Named vectors ("dense" and "bm25") let a single point carry more than one representation. Now upsert with both. FastEmbed computes the BM25 sparse vector; we keep using sentence-transformers for the dense one.
from fastembed import SparseTextEmbedding
from qdrant_client.models import SparseVector
bm25 = SparseTextEmbedding("Qdrant/bm25")
dense_vectors = model.encode(texts, normalize_embeddings=True).tolist()
sparse_vectors = list(bm25.embed(texts))
points = []
for doc, dvec, svec in zip(docs, dense_vectors, sparse_vectors):
    points.append(
        PointStruct(
            id=doc["id"],
            vector={
                "dense": dvec,
                "bm25": SparseVector(indices=svec.indices.tolist(), values=svec.values.tolist()),
            },
            payload={"title": doc["title"], "text": doc["text"]},
        )
    )
client.upsert(collection_name=COLLECTION, points=points)
To query, fire a prefetch against each vector type and fuse them. Reciprocal Rank Fusion (RRF) combines the two ranked lists without needing the scores to be on the same scale, which matters because cosine scores and BM25 scores are not comparable numbers.
from qdrant_client.models import Prefetch, FusionQuery, Fusion, SparseVector

def hybrid_search(query, limit=3):
    dvec = model.encode(query, normalize_embeddings=True).tolist()
    # query_embed is fastembed's query-side encoder for sparse models.
    svec = next(iter(bm25.query_embed(query)))
    result = client.query_points(
        collection_name=COLLECTION,
        prefetch=[
            # Each prefetch fetches its own candidate list before fusion.
            Prefetch(query=dvec, using="dense", limit=20),
            Prefetch(
                query=SparseVector(indices=svec.indices.tolist(), values=svec.values.tolist()),
                using="bm25",
                limit=20,
            ),
        ],
        query=FusionQuery(fusion=Fusion.RRF),
        limit=limit,
        with_payload=True,
    )
    return result.points
Each prefetch pulls its own top 20 candidates, then RRF merges them into a final ranking and we keep the top few. Now a query like "reset password" benefits from the keyword hit on the exact words, while "I locked myself out" still works through the dense side. You get both behaviors from one call.
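To show what the fusion step does with two ranked lists, here is a minimal pure-Python sketch of Reciprocal Rank Fusion. It assumes the common formulation where each list contributes 1 / (k + rank) with ranks starting at 1 and k = 60. Qdrant runs fusion on the server, and its exact constant and rank indexing may differ, so treat this as an illustration of the idea rather than a copy of its implementation. The two input rankings are made-up ids.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of ids with Reciprocal Rank Fusion.

    Every list adds 1 / (k + rank) to an id's score, with rank starting at 1.
    Only positions matter, so raw cosine and BM25 scores never get mixed.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id to keep output deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

dense_ranking = [2, 3, 5, 1]   # hypothetical ids from the dense prefetch
sparse_ranking = [1, 2, 4]     # hypothetical ids from the BM25 prefetch

fused = rrf_fuse([dense_ranking, sparse_ranking])
print([doc_id for doc_id, _ in fused])  # [2, 1, 3, 4, 5]
```

Ids 2 and 1 appear in both lists, so they outrank documents that only one retriever found, even though id 1 was last on the dense side.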
Is hybrid always better? Not always. It costs more compute and more storage, and on datasets where queries closely resemble the document wording, plain BM25 can hold its own. Hybrid pays off most when your queries are conversational and your documents are not, which describes most real user-facing search.
Reranking
Fusion gives a decent ordering, but it's still based on cheap similarity. A cross-encoder reads the query and a candidate document together, in one forward pass, and scores how well they match. That joint attention is far more accurate than comparing two separate embeddings, but it's also too slow to run over a whole collection. So the pattern is: retrieve a wide candidate set cheaply, then rerank only those with the cross-encoder.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def search_and_rerank(query, retrieve=20, top_k=3):
    # Stage 1: cheap, wide retrieval.
    candidates = hybrid_search(query, limit=retrieve)
    # Stage 2: score each (query, document) pair jointly.
    pairs = [(query, c.payload["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [(c.payload["title"], float(s)) for c, s in ranked[:top_k]]

for title, score in search_and_rerank("how do I stop being billed"):
    print(round(score, 3), title)
We retrieve 20 candidates, build (query, document) pairs, and let the cross-encoder score each pair. Then we sort by that score and return the top 3. The reranker often promotes a result the first stage ranked third or fourth into the top spot, because it actually reads the two texts together rather than comparing precomputed vectors.
The tradeoff is latency. Scoring 20 pairs adds maybe 50 to 200 milliseconds on CPU, and it scales with how many candidates you rerank. Keep the candidate count modest, 20 to 50 is usually enough, and move the model to a GPU if you need the speed. For high-traffic systems, a hosted reranking API can be worth the cost to keep tail latency down.
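Latency numbers vary a lot by machine, so it is worth timing the rerank step on your own hardware. The helper below is a minimal sketch using only the standard library. median_latency_ms is a hypothetical name, and five repeats is an arbitrary default.

```python
import statistics
import time

def median_latency_ms(fn, *args, repeats=5, **kwargs):
    """Call fn repeatedly and return the median wall-clock time in milliseconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Usage with the functions defined earlier in this tutorial:
# for n in (10, 20, 50):
#     ms = median_latency_ms(search_and_rerank, "how do I stop being billed", retrieve=n)
#     print(n, round(ms, 1))
```

The median is used because the first call often includes one-off warm-up cost that would skew a mean.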
Gotchas
A few things that will bite you, most of which I've been bitten by.
Vector size mismatch. If the collection expects 384 dimensions and you send 1536, Qdrant rejects the upsert with a dimension error. Read the size from the model rather than hard-coding it, and never change embedding models without recreating the collection. Vectors from different models are not comparable, even at the same dimension.
Normalization. Decide once and stick with it. If you embed documents with normalize_embeddings=True, normalize queries the same way. Mixing normalized and raw vectors quietly skews your scores, and there's no error to tell you, just worse results.
Chunking. Embedding a 5,000-word page into a single vector blurs everything together, and the match gets mushy. Split long documents into chunks of a few hundred tokens with a little overlap, embed each chunk as its own point, and store the parent document id in the payload so you can group results. Chunk size is often one of the highest-leverage tuning knobs you have, sometimes more so than the choice of model.
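A minimal chunker along those lines might look like the sketch below. It assumes whitespace-separated words are a good enough stand-in for tokens, which is only roughly true for real tokenizers. The function names, the payload keys parent_id and chunk_index, and the defaults of 200 and 20 are all illustrative choices.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into word windows of chunk_size that share `overlap` words."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

def chunk_payloads(parent_id, text, chunk_size=200, overlap=20):
    """Build one payload dict per chunk, keeping a link to the parent document."""
    return [
        {"parent_id": parent_id, "chunk_index": i, "text": chunk}
        for i, chunk in enumerate(chunk_words(text, chunk_size, overlap))
    ]

sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each payload would then be embedded and upserted as its own point, with its own unique id.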
Measuring recall. Don't eyeball it. Build a small set of real queries with the documents that should come back, then measure recall@k and the like. Without a test set you're guessing, and every change you make is a coin flip. Even 30 labeled queries will tell you whether hybrid and reranking actually help on your data, or whether they're just burning compute.
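Recall@k itself is only a few lines. The sketch below uses hand-written labels and rankings purely to show the arithmetic. They are not measurements from this tutorial's pipeline, and in practice the rankings would come from calling your search function on each labeled query.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant id")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)

def mean_recall_at_k(run, labels, k):
    """Average recall@k over every labeled query.

    run:    query -> ranked list of retrieved ids
    labels: query -> list of ids that should come back
    """
    scores = [recall_at_k(run.get(q, []), rel, k) for q, rel in labels.items()]
    return sum(scores) / len(scores)

# Illustrative labels and rankings, written by hand for this example.
labels = {
    "how do I stop being billed": [2],
    "change my card": [3],
    "get a copy of my data": [4],
}
run = {
    "how do I stop being billed": [2, 3, 5],
    "change my card": [5, 3, 2],
    "get a copy of my data": [1, 5, 3],
}

print(round(mean_recall_at_k(run, labels, k=1), 3))  # 0.333
print(round(mean_recall_at_k(run, labels, k=3), 3))  # 0.667
```

Running the same labeled set against plain, hybrid, and reranked search gives you three comparable numbers instead of an impression.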
Wrap-up
You now have the full pipeline: embeddings in Qdrant, semantic queries that match on meaning, hybrid search that recovers keyword precision, and a cross-encoder that sharpens the final ranking. Each layer addresses a specific failure of the one before it.
Start with plain semantic search and only add complexity when your evaluation set shows you need it. Hybrid and reranking are real improvements, but they cost latency and money, and on some datasets the gain is marginal. Measure first, then decide. Point this at your own data, write a handful of test queries, and let the numbers tell you how far to go.