feat: Qdrant skills (#1412)

This commit is contained in:
Anush
2026-04-17 06:24:27 +05:30
committed by GitHub
parent 971139baf2
commit 9637e1ab08
24 changed files with 1234 additions and 0 deletions

View File

@@ -0,0 +1,37 @@
---
name: qdrant-performance-optimization
description: "Different techniques to optimize the performance of Qdrant, including indexing strategies, query optimization, and hardware considerations. Use when you want to improve the speed and efficiency of your Qdrant deployment."
allowed-tools:
- Read
- Grep
- Glob
---
# Qdrant Performance Optimization
Qdrant performance has several distinct aspects; this document serves as a navigation hub for the different areas of performance optimization in Qdrant.
## Search Speed Optimization
There are two different criteria for search speed: latency and throughput.
Latency is the time it takes to get a response for a single query, while throughput is the number of queries that can be processed in a given time frame.
Depending on your use case, you may want to optimize for one or both of these metrics.
More on search speed optimization can be found in the [Search Speed Optimization](search-speed-optimization/SKILL.md) skill.
## Indexing Performance Optimization
Qdrant needs to build a vector index to perform efficient similarity search. The time it takes to build the index can vary depending on the size of your dataset, hardware, and configuration.
More on indexing performance optimization can be found in the [Indexing Performance Optimization](indexing-performance-optimization/SKILL.md) skill.
## Memory Usage Optimization
Vector search can be memory intensive, especially when dealing with large datasets.
Qdrant has a flexible memory management system, which allows you to precisely control which parts of storage are kept in memory and which are stored on disk. This can help you optimize memory usage without sacrificing performance.
More on memory usage optimization can be found in the [Memory Usage Optimization](memory-usage-optimization/SKILL.md) skill.

View File

@@ -0,0 +1,80 @@
---
name: qdrant-indexing-performance-optimization
description: "Diagnoses and fixes slow Qdrant indexing and data ingestion. Use when someone reports 'uploads are slow', 'indexing takes forever', 'optimizer is stuck', 'HNSW build time too long', or 'data uploaded but search is bad'. Also use when optimizer status shows errors, segments won't merge, or indexing threshold questions arise."
---
# What to Do When Qdrant Indexing Is Too Slow
Qdrant does NOT build HNSW indexes immediately. Small segments use brute-force until they exceed `indexing_threshold_kb` (default: 20 MB). Search during this window is slower by design, not a bug.
- Understand the indexing optimizer [Indexing optimizer](https://search.qdrant.tech/md/documentation/operations/optimizer/?s=indexing-optimizer)
## Uploads/Ingestion Too Slow
Use when: upload or upsert API calls are slow.
Identify the bottleneck: client-side (network, batching) vs. server-side (CPU, disk I/O)
For client-side, optimize batching and parallelism:
- Use batch upserts (64-256 points per request) [Points API](https://search.qdrant.tech/md/documentation/manage-data/points/?s=upload-points)
- Use 2-4 parallel upload streams
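The client-side advice above can be sketched in Python. `batched`, `parallel_upload`, and `send_batch` are illustrative helpers, not part of the Qdrant client API; in a real uploader `send_batch` would wrap a `client.upsert` call from `qdrant-client`:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batched(points, batch_size=128):
    """Yield successive batches of at most batch_size points."""
    it = iter(points)
    while batch := list(islice(it, batch_size)):
        yield batch

def parallel_upload(points, send_batch, workers=4):
    """Send batches over a few parallel streams.

    In a real deployment send_batch would wrap client.upsert();
    2-4 workers keep the server busy without overwhelming it.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(send_batch, batched(points)))

# Stand-in for client.upsert: just record what would have been sent.
sent = []
parallel_upload(list(range(1000)), sent.append, workers=2)
```

With 1000 points and a batch size of 128 this issues 8 requests instead of 1000, which is where most of the client-side speedup comes from.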
For server-side, optimize Qdrant configuration and indexing strategy:
- Create more shards (3-12), each shard has an independent update worker [Sharding](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=sharding)
- Create payload indexes before HNSW builds (needed for filterable vector index) [Payload index](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=payload-index)
For the initial bulk load of a large dataset:
- Disable HNSW during bulk load (set `indexing_threshold_kb` very high, restore after) [Collection params](https://search.qdrant.tech/md/documentation/manage-data/collections/?s=update-collection-parameters)
- Setting `m=0` to disable HNSW is legacy, use high `indexing_threshold_kb` instead
Be careful: fast unindexed uploads might temporarily use more RAM and degrade search performance until the optimizer catches up.
See https://search.qdrant.tech/md/documentation/tutorials-develop/bulk-upload/
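The bulk-load sequence above can be sketched against Qdrant's REST API. The collection name and threshold values are illustrative; in the HTTP API the setting appears as `indexing_threshold` (in kilobytes). The first request effectively disables HNSW building before the load; the second restores the default (20000 KB = 20 MB) afterwards:

```http
PATCH /collections/my_collection
{ "optimizers_config": { "indexing_threshold": 100000000 } }

PATCH /collections/my_collection
{ "optimizers_config": { "indexing_threshold": 20000 } }
```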
## Optimizer Stuck or Taking Too Long
Use when: optimizer running for hours, not finishing.
- Check actual progress via optimizations endpoint (v1.17+) [Optimization monitoring](https://search.qdrant.tech/md/documentation/operations/optimizer/?s=optimization-monitoring)
- Large merges and HNSW rebuilds legitimately take hours on big datasets
- Check CPU and disk I/O (HNSW is CPU-bound, merging is I/O-bound, HDD is not viable)
- If `optimizer_status` shows an error, check logs for disk full or corrupted segments
## HNSW Build Time Too High
Use when: HNSW index build dominates total indexing time.
- Reduce `m` (default 16, good for most cases, 32+ rarely needed) [HNSW params](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=vector-index)
- Reduce `ef_construct` (100-200 sufficient) [HNSW config](https://search.qdrant.tech/md/documentation/manage-data/collections/?s=indexing-vectors-in-hnsw)
- Keep `max_indexing_threads` proportional to CPU cores [Configuration](https://search.qdrant.tech/md/documentation/operations/configuration/)
- Use GPU for indexing [GPU indexing](https://search.qdrant.tech/md/documentation/operations/running-with-gpu/)
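The parameters above map onto the collection's HNSW config; a hedged example (collection name and values are illustrative, not tuned recommendations):

```http
PATCH /collections/my_collection
{
  "hnsw_config": {
    "m": 16,
    "ef_construct": 128,
    "max_indexing_threads": 8
  }
}
```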
## HNSW index for multi-tenant collections
If you have a multi-tenant use case where all data is split by some payload field (e.g. `tenant_id`), you can avoid building a global HNSW index and instead rely on `payload_m` to build HNSW index only for subsets of data.
Skipping global HNSW index can significantly reduce indexing time.
See [Multi-tenant collections](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/) for details.
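A minimal sketch of such a tenant-scoped setup (collection name, field name, and sizes are illustrative): setting `m: 0` with a non-zero `payload_m` skips the global graph while still building per-tenant subgraphs for the indexed field.

```http
PUT /collections/my_collection
{
  "vectors": { "size": 768, "distance": "Cosine" },
  "hnsw_config": { "m": 0, "payload_m": 16 }
}

PUT /collections/my_collection/index
{
  "field_name": "tenant_id",
  "field_schema": { "type": "keyword", "is_tenant": true }
}
```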
## Additional Payload Indexes Are Too Slow
Qdrant builds extra HNSW links for all payload indexes to ensure that the quality of filtered vector search does not degrade.
Some payload indexes (e.g. `text` fields with long texts) can have a very high number of unique values per point, which can lead to long HNSW build time.
You can disable building extra HNSW links for specific payload indexes and instead rely on slightly slower query-time strategies like ACORN.
Read more about disabling extra HNSW links in [documentation](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=disable-the-creation-of-extra-edges-for-payload-fields)
Read more about ACORN in [documentation](https://search.qdrant.tech/md/documentation/search/search/?s=acorn-search-algorithm)
## What NOT to Do
- Do not create payload indexes AFTER HNSW is built (breaks filterable vector index)
- Do not use `m=0` for bulk uploads into an existing collection, it might drop the existing HNSW and cause long reindexing
- Do not upload one point at a time (per-request overhead dominates)

View File

@@ -0,0 +1,67 @@
---
name: qdrant-memory-usage-optimization
description: "Diagnoses and reduces Qdrant memory usage. Use when someone reports 'memory too high', 'RAM keeps growing', 'node crashed', 'out of memory', 'memory leak', or asks 'why is memory usage so high?', 'how to reduce RAM?'. Also use when memory doesn't match calculations, quantization didn't help, or nodes crash during recovery."
---
# Understanding memory usage
Qdrant operates with two types of memory:
- Resident memory (aka RSSAnon) - memory used for internal data structures like the ID tracker, plus components that must stay in RAM, such as quantized vectors when `always_ram=true` and payload indexes.
- OS page cache - memory used for caching disk reads, which can be released when needed. Original vectors are normally stored in page cache, so the service won't crash if RAM is full, but performance may degrade.
It is normal for the OS page cache to occupy all available RAM, but if resident memory is above 80% of total RAM, it is a sign of a problem.
## Memory usage monitoring
- Qdrant exposes memory usage through the `/metrics` endpoint. See [Monitoring docs](https://search.qdrant.tech/md/documentation/operations/monitoring/).
<!-- ToDo: Talk about memory usage of each components once API is available -->
## How much memory is needed for Qdrant?
Optimal memory usage depends on the use case.
- For regular search scenarios, general guidelines are provided in the [Capacity planning docs](https://search.qdrant.tech/md/documentation/operations/capacity-planning/).
For a detailed breakdown of memory usage at large scale, see [Large scale memory usage example](https://search.qdrant.tech/md/documentation/tutorials-operations/large-scale-search/?s=memory-usage).
Payload indexes and the HNSW graph require memory in addition to the vectors themselves, so it is important to include them in capacity calculations.
Additionally, Qdrant requires some extra memory for optimizations. During optimization, optimized segments are fully loaded into RAM, so it is important to leave enough headroom.
The larger `max_segment_size` is, the more headroom is needed.
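A back-of-the-envelope sketch of the headroom reasoning above; the sizes, the float32 assumption, and the one-segment headroom rule are illustrative, not official capacity guidance:

```python
def vector_storage_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only; the HNSW graph and payload indexes add more."""
    return num_vectors * dim * bytes_per_value

def ram_with_optimizer_headroom(vector_bytes: int, max_segment_bytes: int) -> int:
    # During optimization an optimized segment is fully loaded into RAM,
    # so reserve at least one max-size segment on top of steady-state usage.
    return vector_bytes + max_segment_bytes

raw = vector_storage_bytes(10_000_000, 768)  # 10M float32 vectors of dim 768
total = ram_with_optimizer_headroom(raw, max_segment_bytes=5 * 1024**3)
```

Under these assumptions, raw vectors alone take roughly 30 GB, and a 5 GB `max_segment_size` means at least 5 GB of extra RAM should stay free for the optimizer.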
### When to put HNSW index on disk
Putting frequently used components (such as HNSW index) on disk might cause significant performance degradation.
There are some scenarios, however, when it can be a good option:
- Deployments with low latency disks - local NVMe or similar.
- Multi-tenant deployments, where only a subset of tenants is frequently accessed, so that only a fraction of data & index is loaded in RAM at a time.
- For deployments with [inline storage](https://search.qdrant.tech/md/documentation/operations/optimize/?s=inline-storage-in-hnsw-index) enabled.
## How to minimize memory footprint
The main idea is to keep on disk those parts of the data that are rarely accessed.
Here are the main techniques to achieve that:
- Use quantization to store only compressed vectors in RAM [Quantization docs](https://search.qdrant.tech/md/documentation/manage-data/quantization/)
- Use float16 or int8 datatypes to reduce memory usage of vectors by 2x or 4x respectively, with some tradeoff in precision. Read more about vector datatypes in [documentation](https://search.qdrant.tech/md/documentation/manage-data/vectors/?s=datatypes)
- Leverage Matryoshka Representation Learning (MRL) to store only small vectors in RAM while keeping large vectors on disk. Examples of how to use MRL with Qdrant Cloud inference: [MRL docs](https://search.qdrant.tech/md/documentation/inference/?s=reduce-vector-dimensionality-with-matryoshka-models)
- For multi-tenant deployments with small tenants, vectors can often be kept on disk with little penalty, because each tenant's data is stored together and can be read with few disk accesses [Multitenancy docs](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/?s=calibrate-performance)
- For deployments with fast local storage and relatively low search throughput requirements, it may be possible to store all components of the vector store on disk. Read more about the performance implications of on-disk storage in [the article](https://qdrant.tech/articles/memory-consumption/)
- For low-RAM environments, consider the `async_scorer` config option, which enables `io_uring`-based parallel disk access and can significantly improve the performance of on-disk storage (only available on Linux with kernel 5.11+). Read more about `async_scorer` in [the article](https://qdrant.tech/articles/io_uring/)
- Consider storing sparse vectors and text payloads on disk, as they are usually more disk-friendly than dense vectors.
- Configure payload indexes to be stored on disk [docs](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=on-disk-payload-index)
- Configure sparse vectors to be stored on disk [docs](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=sparse-vector-index)
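Several of the techniques above can be combined at collection creation time; a hedged sketch (collection name and sizes are illustrative), keeping original vectors on disk in float16 while only int8-quantized vectors stay in RAM:

```http
PUT /collections/my_collection
{
  "vectors": {
    "size": 1024,
    "distance": "Cosine",
    "datatype": "float16",
    "on_disk": true
  },
  "quantization_config": {
    "scalar": { "type": "int8", "always_ram": true }
  }
}
```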

View File

@@ -0,0 +1,77 @@
---
name: qdrant-search-speed-optimization
description: "Diagnoses and fixes slow Qdrant search. Use when someone reports 'search is slow', 'high latency', 'queries take too long', 'low QPS', 'throughput too low', 'filtered search is slow', or 'search was fast but now it's slow'. Also use when search performance degrades after config changes or data growth."
---
# Diagnose a problem
There are multiple possible reasons for search performance degradation. The most common ones are:
* Memory pressure: if the working set exceeds available RAM
* Complex requests (e.g. high `hnsw_ef`, complex filters without payload index)
* Competing background processes (e.g. optimizer still running after bulk upload)
* Problem with the cluster (e.g. network issues, hardware degradation)
## Single Query Too Slow (Latency)
Use when: individual queries take too long regardless of load.
### Diagnostic steps:
- Check if second run of the same request is significantly faster (indicates memory pressure)
- Try the same query with `with_payload: false` and `with_vectors: false` to see if payload retrieval is the bottleneck
- If request uses filters, try to remove them one by one to identify if a specific filter condition is the bottleneck
### Common fixes:
- Tune HNSW parameters: [Fine-tuning search](https://search.qdrant.tech/md/documentation/operations/optimize/?s=fine-tuning-search-parameters)
- Enable in-memory quantization: [Scalar quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/?s=scalar-quantization)
- Reduce Vector Dimensionality with Matryoshka Models: [Matryoshka Models](https://search.qdrant.tech/md/documentation/inference/?s=reduce-vector-dimensionality-with-matryoshka-models)
- Use oversampling + rescore for high-dimensional vectors [Search with quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/?s=searching-with-quantization)
- Enable io_uring for disk-heavy workloads on Linux [io_uring](https://qdrant.tech/articles/io_uring/)
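Several of these knobs are per-request search parameters; a hedged example query (the collection name, vector, and values are illustrative) combining a lower `hnsw_ef` with quantization oversampling and rescoring:

```http
POST /collections/my_collection/points/query
{
  "query": [0.2, 0.1, 0.9, 0.7],
  "limit": 10,
  "params": {
    "hnsw_ef": 64,
    "quantization": { "rescore": true, "oversampling": 2.0 }
  }
}
```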
## Can't Handle Enough QPS (Throughput)
Use when: system can't serve enough queries per second under load.
- Reduce segment count (`default_segment_number` to 2) [Maximizing throughput](https://search.qdrant.tech/md/documentation/operations/optimize/?s=maximizing-throughput)
- Use batch search API instead of single queries [Batch search](https://search.qdrant.tech/md/documentation/search/search/?s=batch-search-api)
- Enable quantization to reduce CPU cost [Scalar quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/?s=scalar-quantization)
- Add replicas to distribute read load [Replication](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=replication)
## Filtered Search Is Slow
Use when: filtered search is significantly slower than unfiltered search. This is the most common complaint after memory pressure.
- Create payload index on the filtered field [Payload index](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=payload-index)
- Use `is_tenant=true` for primary filtering condition: [Tenant index](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=tenant-index)
- Try ACORN algorithm for complex filters: [ACORN](https://search.qdrant.tech/md/documentation/search/search/?s=acorn-search-algorithm)
- Avoid using `nested` filtering conditions as a primary filter. They might force Qdrant to read raw payload values instead of using the index.
- If payload index was added after HNSW build, trigger re-index to create filterable subgraph links
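Creating the missing payload index is a single request (collection name, field name, and schema are illustrative):

```http
PUT /collections/my_collection/index
{ "field_name": "category", "field_schema": "keyword" }
```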
## Optimize search performance with parallel updates
### Diagnostic steps
- Try running the same query with the `indexed_only=true` parameter. If the query is significantly faster, the optimizer is still running and has not yet indexed all segments.
- If CPU or IO usage is high even with no queries, it also indicates that the optimizer is still running.
### Recommended configuration changes
- Reduce `optimizer_cpu_budget` to reserve more CPU for queries
- Use `prevent_unoptimized=true` to prevent searches from scanning segments with a large amount of unindexed data. Instead, once a segment reaches the so-called `indexing_threshold`, all additional points are added in a deferred state.
Learn more [here](https://search.qdrant.tech/md/documentation/search/low-latency-search/?s=query-indexed-data-only)
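The `indexed_only` diagnostic above is a per-request search parameter; a hedged example (collection name and vector are illustrative):

```http
POST /collections/my_collection/points/query
{
  "query": [0.2, 0.1, 0.9, 0.7],
  "limit": 10,
  "params": { "indexed_only": true }
}
```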
## What NOT to Do
- Set `always_ram=false` on quantization (disk thrashing on every search)
- Put HNSW on disk for latency-sensitive production (only for cold storage)
- Increase segment count for throughput (opposite: fewer = better)
- Create payload indexes on every field (wastes memory)
- Blame Qdrant before checking optimizer status