mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-18 06:05:55 +00:00
feat: Qdrant skills (#1412)
51
skills/qdrant-scaling/SKILL.md
Normal file
@@ -0,0 +1,51 @@
---
name: qdrant-scaling
description: "Guides Qdrant scaling decisions. Use when someone asks 'how many nodes do I need', 'data doesn't fit on one node', 'need more throughput', 'cluster is slow', 'too many tenants', 'vertical or horizontal', 'how to shard', or 'need to add capacity'."
allowed-tools:
- Read
- Grep
- Glob
---

# Qdrant Scaling

First determine what you're scaling for:

- data volume
- query throughput (QPS)
- query latency
- query volume

After determining the scaling goal, choose a scaling strategy based on its tradeoffs and assumptions.
Each goal pulls toward a different strategy; scaling for throughput and scaling for latency are opposite tuning directions.
## Scaling Data Volume

This becomes relevant when the volume of the dataset exceeds the capacity of a single node.
Read more about scaling for data volume in [Scaling Data Volume](scaling-data-volume/SKILL.md).

## Scaling for Query Throughput

If your system needs to serve more parallel queries than a single node can handle,
then you need to scale for query throughput.

Read more about scaling for query throughput in [Scaling for Query Throughput](scaling-qps/SKILL.md).

## Scaling for Query Latency

The latency of a single query is determined by the slowest component in the query execution path.
It is sometimes correlated with throughput, but not always, and may require different scaling strategies.

Read more about scaling for query latency in [Scaling for Query Latency](minimize-latency/SKILL.md).

## Scaling for Query Volume

By query volume we mean the number of results that a single query returns.
If the query volume is too high, it can cause performance issues and increase latency.

Tuning for query volume pulls in its own direction and may require special strategies.

Read more about scaling for query volume in [Scaling for Query Volume](scaling-query-volume/SKILL.md).
41
skills/qdrant-scaling/minimize-latency/SKILL.md
Normal file
@@ -0,0 +1,41 @@
---
name: qdrant-minimize-latency
description: "Guides Qdrant query latency optimization. Use when someone asks 'search is slow', 'how to reduce latency', 'p99 is too high', 'tail latency', 'single query too slow', 'how to make search faster', or 'latency spikes'."
---

# Scaling for Query Latency

The latency of a single query is determined by the slowest component in the query execution path. It is sometimes correlated with throughput, but not always — throughput and latency are opposite tuning directions.

Low-latency optimization aims to saturate all available resources for a single query, while throughput optimization aims to minimize per-query resource usage so that more queries can run in parallel.

## Performance Tuning for Lower Latency

- Increase segment count to match CPU cores (`default_segment_number: 16`) [Minimizing latency](https://search.qdrant.tech/md/documentation/operations/optimize/?s=minimizing-latency)
- Keep quantized vectors and HNSW in RAM (`always_ram=true`)
- Reduce `hnsw_ef` at query time (trade recall for speed) [Search params](https://search.qdrant.tech/md/documentation/operations/optimize/?s=fine-tuning-search-parameters)
- Use local NVMe, avoid network-attached storage
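
A minimal sketch of these settings with the Python `qdrant-client`; the collection name, vector size, and placeholder query vector are illustrative assumptions, not values from this skill:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Latency-oriented collection: many small segments, quantized vectors pinned in RAM.
client.create_collection(
    collection_name="docs",  # illustrative name
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    optimizers_config=models.OptimizersConfigDiff(default_segment_number=16),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,  # keep quantized vectors in RAM
        )
    ),
)

# At query time, lower hnsw_ef to trade a little recall for speed.
hits = client.query_points(
    collection_name="docs",
    query=[0.1] * 768,  # placeholder query vector
    limit=10,
    search_params=models.SearchParams(hnsw_ef=64),
)
```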

## Memory Pressure and Latency

RAM is the most critical resource for latency. If the working set exceeds available RAM, OS cache eviction causes severe, sustained latency degradation.

- Scale RAM vertically first. Critical if the working set exceeds 80% of RAM.
- Use quantization: scalar (4x reduction) or binary (32x reduction) [Quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/)
- Move payload indexes to disk if filtering is infrequent [On-disk payload index](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=on-disk-payload-index)
- Set `optimizer_cpu_budget` to limit background optimization CPUs
- Schedule indexing: set a high `indexing_threshold` during peak hours
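
If filtering is infrequent, the payload index itself can be moved out of RAM. A minimal sketch with the Python client (collection and field names are illustrative):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Serve a rarely-used keyword index from disk instead of RAM.
client.create_payload_index(
    collection_name="docs",   # illustrative name
    field_name="category",    # illustrative, infrequently filtered field
    field_schema=models.KeywordIndexParams(
        type="keyword",
        on_disk=True,
    ),
)
```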

## Vertical Scaling for Latency

More RAM and a faster CPU directly reduce latency. See [Vertical Scaling](../scaling-data-volume/vertical-scaling/SKILL.md) for node sizing guidelines.

## What NOT to Do

- Do not expect to optimize latency and throughput simultaneously on the same node
- Do not use few large segments for latency-sensitive workloads (each segment takes longer to search)
- Do not run at >90% RAM (cache eviction causes severe latency degradation that can last days)
- Do not ignore optimizer status during performance debugging
- Do not scale down RAM without load testing (cache eviction causes days-long latency incidents)
49
skills/qdrant-scaling/scaling-data-volume/SKILL.md
Normal file
@@ -0,0 +1,49 @@
---
name: qdrant-scaling-data-volume
description: "Guides Qdrant data volume scaling decisions. Use when someone asks 'data doesn't fit on one node', 'too much data', 'need more storage', 'vertical or horizontal scaling', 'tenant scaling', 'time window rotation', or 'data growth exceeds capacity'."
allowed-tools:
- Read
- Grep
- Glob
---

# Scaling Data Volume

This document covers data volume scaling scenarios,
where the total size of the dataset exceeds the capacity of a single node.

## Tenant Scaling

If the use case is multi-tenant, meaning that each user only has access to a subset of the data,
and we never need to query across all the data, then we can use multi-tenancy patterns to scale.

The recommended approach combines payload partitioning, per-tenant indexes, and tiered multitenancy.

Learn more in [Tenant Scaling](tenant-scaling/SKILL.md)
## Sliding Time Window

Some use-cases are based on a sliding time window, where only the most recent data is relevant.
For example, an index of social media posts where only the last 6 months of data require fast search.

Learn more in [Sliding Time Window](sliding-time-window/SKILL.md)

## Global Search

Most general use-cases require global search across all data.
In these situations, we might need to fall back to vertical scaling,
and then to horizontal scaling when we reach the limits of vertical scaling.

### Vertical Scaling

When data doesn't fit in a single node, the first approach is to scale the node itself — more RAM, better disk, quantization, mmap.
Exhaust vertical options before going horizontal, as horizontal scaling adds permanent operational complexity.

Learn more in [Vertical Scaling](vertical-scaling/SKILL.md)

### Horizontal Scaling

When a single node can't hold the data even with quantization and mmap, distribute data across multiple nodes via sharding.

Learn more in [Horizontal Scaling](horizontal-scaling/SKILL.md)
47
skills/qdrant-scaling/scaling-data-volume/horizontal-scaling/SKILL.md
Normal file
@@ -0,0 +1,47 @@
---
name: qdrant-horizontal-scaling
description: "Diagnoses and guides Qdrant horizontal scaling decisions. Use when someone asks 'vertical or horizontal?', 'how many nodes?', 'how many shards?', 'how to add nodes', 'resharding', 'data doesn't fit', or 'need more capacity'. Also use when data growth outpaces current deployment."
---

# What to Do When Qdrant Needs More Capacity

Vertical first: simpler operations, no network overhead, good up to ~100M vectors per node depending on dimensions and quantization. Horizontal when: data exceeds single-node capacity, you need fault tolerance, you need to isolate tenants, or you are IOPS-bound (more nodes = more independent IOPS).

## Most basic distributed configuration

- 3 nodes, 3 shards with `replication_factor: 2` for zero-downtime scaling

A minimum of 3 nodes is important for consensus and fault tolerance. With 3 nodes, you can lose 1 node without downtime. With 2 nodes, losing 1 node causes downtime for collection operations.
Replication factor of 2 means each shard has 1 replica, so you have 2 copies of the data. This allows for zero-downtime scaling and maintenance. With `replication_factor: 1`, zero downtime is not guaranteed even for point-level operations, and cluster maintenance requires downtime.
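
A minimal sketch of creating a collection with this layout using the Python client (collection name, node URL, and vector size are illustrative):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://node-1:6333")  # any node in the cluster

# 3 shards spread across 3 nodes, each shard replicated once (2 copies total).
client.create_collection(
    collection_name="docs",  # illustrative name
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    shard_number=3,
    replication_factor=2,
)
```

With this layout, each of the 3 shards has a replica on a different node, so any single node can be lost or restarted without downtime.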

## Choosing number of shards

Shards are the unit of data distribution.
More shards allow more nodes and better distribution, but add overhead. Fewer shards reduce overhead but limit horizontal scaling.

For a cluster of 3-6 nodes the recommended shard count is 6-12.
This allows for 2-4 shards per node, which balances distribution and overhead.
## Changing number of shards

Use when: the shard count isn't evenly divisible by the node count, causing uneven distribution, or when you need to rebalance.

Resharding is expensive and time-consuming; use it as a last resort when regular data distribution is not possible.
Resharding is designed to be transparent to user operations: updates and searches should still work during resharding with a small performance impact.

But the resharding operation itself is time-consuming and requires moving large amounts of data between nodes.

- Available in Qdrant Cloud [Resharding](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=resharding)
- Resharding is not available for self-hosted deployments.

Better alternatives: over-provision shards initially, or spin up a new cluster with the correct config and migrate the data.

## What NOT to Do

- Do not jump to horizontal before exhausting vertical (adds complexity for no gain)
- Do not set a `shard_number` that isn't a multiple of the node count (uneven distribution)
- Do not use `replication_factor: 1` in production if you need fault tolerance
- Do not add nodes without rebalancing shards (use the shard move API to redistribute)
- Do not scale down RAM without load testing (cache eviction causes days-long latency incidents)
- Do not hit the collection limit by using one collection per tenant (use payload partitioning)
68
skills/qdrant-scaling/scaling-data-volume/sliding-time-window/SKILL.md
Normal file
@@ -0,0 +1,68 @@
---
name: qdrant-sliding-time-window
description: "Guides sliding time window scaling in Qdrant. Use when someone asks 'only recent data matters', 'how to expire old vectors', 'time-based data rotation', 'delete old data efficiently', 'social media feed search', 'news search', 'log search with retention', or 'how to keep only last N months of data'."
---

# Scaling with a Sliding Time Window

Use when only recent data needs fast search -- social media posts, news articles, support tickets, logs, job listings. Old data either becomes irrelevant or can tolerate slower access.

Three strategies: **shard rotation** (recommended), **collection rotation** (when per-period config differs), and **filter-and-delete** (simplest, for continuous cleanup).

## Shard Rotation (Recommended)

Use when: data has natural time boundaries (daily, weekly, monthly). Preferred because queries span all time periods in one request without application-level fan-out. [User-defined sharding](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=user-defined-sharding)

1. Create a collection with user-defined sharding enabled
2. Create one shard key per time period (e.g., `2025-01`, `2025-02`, ..., `2025-06`)
3. Ingest data into the current period's shard key
4. When a new period starts, create a new shard key and redirect writes
5. Delete the oldest shard key outside the retention window

- Deleting a shard key reclaims all resources instantly (no fragmentation, no optimizer overhead)
- Pre-create the next period's shard key before rotation to avoid write disruption
- Use `shard_key_selector` at query time to search only specific periods for efficiency
- Shard keys can be placed on specific nodes for hot/cold tiering
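
A minimal sketch of the rotation cycle with the Python client (collection name, shard key names, and retention length are illustrative):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Collection with user-defined sharding; shard keys are created per period.
client.create_collection(
    collection_name="posts",  # illustrative name
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    sharding_method=models.ShardingMethod.CUSTOM,
)

# One shard key per month; pre-create the next period before rotation.
for period in ["2025-05", "2025-06", "2025-07"]:
    client.create_shard_key(collection_name="posts", shard_key=period)

# Writes go to the current period's shard key.
client.upsert(
    collection_name="posts",
    points=[models.PointStruct(id=1, vector=[0.1] * 768, payload={"text": "..."})],
    shard_key_selector="2025-07",
)

# Queries can target specific periods only.
hits = client.query_points(
    collection_name="posts",
    query=[0.1] * 768,  # placeholder query vector
    limit=10,
    shard_key_selector=["2025-06", "2025-07"],
)

# Drop the period that fell out of the retention window.
client.delete_shard_key(collection_name="posts", shard_key="2025-05")
```

Deleting the shard key drops the whole period at once, which is what makes rotation cheaper than delete-by-filter.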

## Collection Rotation (Alias Swap)

Use when: you need per-period collection configuration (e.g., different quantization or storage settings). [Collection aliases](https://search.qdrant.tech/md/documentation/manage-data/collections/?s=collection-aliases)

1. Create one collection per time period, point a write alias at the newest
2. Query across all active collections in parallel, merge results client-side
3. When a new period starts, create the new collection and swap the write alias [Switch collection](https://search.qdrant.tech/md/documentation/manage-data/collections/?s=switch-collection)
4. Drop the oldest collection outside the window

Trade-off vs shard rotation: allows per-collection config differences, but requires application-level fan-out and more operational overhead.
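
A minimal sketch of the alias swap with the Python client (collection and alias names are illustrative):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# New period starts: create the next collection, then repoint the write alias.
client.create_collection(
    collection_name="posts-2025-07",  # illustrative name
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
)

client.update_collection_aliases(
    change_aliases_operations=[
        models.DeleteAliasOperation(
            delete_alias=models.DeleteAlias(alias_name="posts-write"),
        ),
        models.CreateAliasOperation(
            create_alias=models.CreateAlias(
                collection_name="posts-2025-07",
                alias_name="posts-write",
            ),
        ),
    ],
)

# Drop the collection that fell out of the retention window.
client.delete_collection(collection_name="posts-2025-01")
```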

## Filter-and-Delete

Use when: data arrives continuously without clear time boundaries, or you want the simplest setup.

1. Store a `timestamp` payload on every point, create a payload index on it [Payload index](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=payload-index)
2. Filter to the desired window at query time using a `range` condition [Range filter](https://search.qdrant.tech/md/documentation/search/filtering/?s=range)
3. Periodically delete expired points using delete-by-filter [Delete points](https://search.qdrant.tech/md/documentation/manage-data/points/?s=delete-points)

- Run cleanup during off-peak hours in batches (10k-50k points) to avoid optimizer locks
- Deletes are not free: tombstoned points degrade search until the optimizer compacts segments
- Does not reclaim disk instantly (compaction is asynchronous)
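
A minimal sketch of this pattern with the Python client (collection name, field name, and the 6-month window are illustrative):

```python
import time

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Index the timestamp field so range filters don't trigger full scans.
client.create_payload_index(
    collection_name="posts",  # illustrative name
    field_name="timestamp",
    field_schema=models.PayloadSchemaType.INTEGER,
)

six_months_ago = int(time.time()) - 180 * 24 * 3600

# Query only within the retention window.
hits = client.query_points(
    collection_name="posts",
    query=[0.1] * 768,  # placeholder query vector
    limit=10,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="timestamp", range=models.Range(gte=six_months_ago))]
    ),
)

# Periodic cleanup: delete everything older than the window.
client.delete(
    collection_name="posts",
    points_selector=models.FilterSelector(
        filter=models.Filter(
            must=[models.FieldCondition(key="timestamp", range=models.Range(lt=six_months_ago))]
        )
    ),
)
```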

## Hot/Cold Tiers

Use when: recent data needs fast in-RAM search, and older data should remain searchable at lower performance.

- **Shard rotation:** place the current shard key on fast-storage nodes, move older shard keys to cheaper nodes via shard placement. All queries still go through a single collection.
- **Collection rotation:** keep the current collection in RAM (`always_ram: true`), move older collections to mmap/on-disk vectors. [Quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/)

## What NOT to Do

- Do not use filter-and-delete for high-volume time-series with millions of daily deletes (use rotation instead)
- Do not forget to index the timestamp field (range filters without an index cause full scans)
- Do not use collection rotation when shard rotation would suffice (unnecessary fan-out complexity)
- Do not drop a shard key or collection before verifying its period is fully outside the retention window
- Do not skip pre-creating the next period's shard key or collection (write failures during rotation are hard to recover)
44
skills/qdrant-scaling/scaling-data-volume/tenant-scaling/SKILL.md
Normal file
@@ -0,0 +1,44 @@
---
name: qdrant-tenant-scaling
description: "Guides Qdrant multi-tenant scaling. Use when someone asks 'how to scale tenants', 'one collection per tenant?', 'tenant isolation', 'dedicated shards', or reports tenant performance issues. Also use when multi-tenant workloads outgrow shared infrastructure."
---

# What to Do When Scaling Multi-Tenant Qdrant

Do not create one collection per tenant: it does not scale past a few hundred tenants and wastes resources. One company hit the 1000 collection limit after a year of collection-per-repo and had to migrate to payload partitioning. Use a shared collection with a tenant key.

- Understand multitenancy patterns [Multitenancy](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/)

Here is a short summary of the patterns:

## Number of Tenants is around 10k

Use the default multitenancy strategy via payload filtering.

Read about [Partition by payload](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/?s=partition-by-payload) and [Calibrate performance](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/?s=calibrate-performance) for best practices on indexing and query performance.
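
A minimal sketch of payload partitioning with the Python client (collection name and the `tenant_id` field are illustrative):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Mark the tenant field as a tenant key to optimize data layout for per-tenant reads.
client.create_payload_index(
    collection_name="docs",   # illustrative name
    field_name="tenant_id",   # illustrative field
    field_schema=models.KeywordIndexParams(
        type="keyword",
        is_tenant=True,
    ),
)

# Every query must filter by tenant.
hits = client.query_points(
    collection_name="docs",
    query=[0.1] * 768,  # placeholder query vector
    limit=10,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="tenant_id", match=models.MatchValue(value="tenant-42"))]
    ),
)
```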

## Number of Tenants is around 100k and more

At this scale, the cluster may consist of several peers.
To localize tenant data and improve performance, use [custom sharding](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=user-defined-sharding) to assign tenants to specific shards based on a hash of the tenant ID.
This localizes tenant requests to specific nodes instead of broadcasting them to all nodes, improving performance and reducing load on each node.
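
A sketch of one way to route tenants to shards (the hashing scheme, bucket count, and names are illustrative assumptions; the tenant-to-shard-key mapping lives in your application, and Qdrant only sees the shard key you supply):

```python
import hashlib

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",  # illustrative name
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    sharding_method=models.ShardingMethod.CUSTOM,
)

# A small, fixed set of shard keys; each tenant maps deterministically to one.
NUM_SHARD_KEYS = 8
for i in range(NUM_SHARD_KEYS):
    client.create_shard_key(collection_name="docs", shard_key=f"bucket-{i}")

def shard_key_for(tenant_id: str) -> str:
    bucket = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16) % NUM_SHARD_KEYS
    return f"bucket-{bucket}"

# Reads and writes for a tenant touch only that tenant's shard.
client.upsert(
    collection_name="docs",
    points=[models.PointStruct(id=1, vector=[0.1] * 768, payload={"tenant_id": "tenant-42"})],
    shard_key_selector=shard_key_for("tenant-42"),
)
```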

## If Tenants Are Unevenly Sized

If some tenants are much larger than others, use [tiered multitenancy](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/?s=tiered-multitenancy) to promote large tenants to dedicated shards while keeping small tenants on shared shards. This optimizes resource allocation and performance for tenants of varying sizes.

## Need Strict Tenant Isolation

Use when: legal/compliance requirements demand per-tenant encryption or strict isolation beyond what payload filtering provides.

- Multiple collections may be necessary for per-tenant encryption keys
- Limit collection count and use payload filtering within each collection
- This is the exception, not the default. Only use when compliance requires it.

## What NOT to Do

- Do not create one collection per tenant without compliance justification (does not scale past hundreds)
- Do not skip `is_tenant=true` on the tenant index (kills sequential read performance)
- Do not build a global HNSW index for multi-tenant collections (wasteful, use `payload_m` instead)
69
skills/qdrant-scaling/scaling-data-volume/vertical-scaling/SKILL.md
Normal file
@@ -0,0 +1,69 @@
---
name: qdrant-vertical-scaling
description: "Guides Qdrant vertical scaling decisions. Use when someone asks 'how to scale up a node', 'need more RAM', 'upgrade node size', 'vertical scaling', 'resize cluster', 'scale up vs scale out', or when memory/CPU is insufficient on current nodes. Also use when someone wants to avoid the complexity of horizontal scaling."
---

# What to Do When Qdrant Needs to Scale Vertically

Vertical scaling means increasing CPU, RAM, or disk on existing nodes rather than adding more nodes. This is the recommended first step before considering horizontal scaling. Vertical scaling is simpler, avoids distributed-system complexity, and is reversible.

- Vertical scaling for Qdrant Cloud is done through the [Qdrant Cloud Console](https://cloud.qdrant.io/)
- For self-hosted deployments, resize the underlying VM or container resources

## When to Scale Vertically

Use when: current node resources (RAM, CPU, disk) are insufficient, but the workload doesn't yet require distribution.

- RAM usage approaching 80% of available memory (OS page cache eviction starts, causing severe performance degradation)
- CPU saturation during query serving or indexing
- Disk space running low for on-disk vectors and payloads
- A single node can handle up to ~100M vectors depending on dimensions and quantization
- Non-production workloads that tolerate a single point of failure and don't require high availability

## How to Scale Vertically in Qdrant Cloud

Vertical scaling is managed through the Qdrant Cloud Console.

- Log into the [Qdrant Cloud Console](https://cloud.qdrant.io/) or use the [CLI tool](https://github.com/qdrant/qcloud-cli)
- Select the cluster to resize
- Choose a larger node configuration (more RAM, CPU, or both)
- The upgrade process involves a rolling restart, with no downtime if replication is configured
- Ensure `replication_factor: 2` or higher before resizing to maintain availability during the rolling restart

**Important:** Scaling up is straightforward. Scaling down requires care -- if the working set no longer fits in RAM after downsizing, performance will degrade severely due to cache eviction. Always load test before scaling down.

## RAM Sizing Guidelines

RAM is the most critical resource for Qdrant performance. Use these guidelines to right-size.

- Exact estimation of RAM usage is difficult; use this simple approximate formula: `num_vectors * dimensions * 4 bytes * 1.5` for full-precision vectors in RAM
- With scalar quantization: divide by 4 (INT8 reduces each float32 to 1 byte) [Quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/)
- With binary quantization: divide by 32 [Binary quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/?s=binary-quantization)
- Add overhead for HNSW index (~20-30% of vector data), payload indexes, and WAL
- Reserve 20% headroom for optimizer operations and OS cache
- Monitor actual usage via Grafana/Prometheus before and after resizing [Monitoring](../../../qdrant-monitoring/SKILL.md)
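
A worked instance of the formula above (10M vectors at 768 dimensions; the numbers are illustrative):

```python
num_vectors = 10_000_000
dims = 768

# Full-precision vectors in RAM: num_vectors * dimensions * 4 bytes * 1.5
full_precision = num_vectors * dims * 4 * 1.5   # ~46.1 GB
scalar_quantized = full_precision / 4           # ~11.5 GB (INT8)
binary_quantized = full_precision / 32          # ~1.4 GB

# Add ~25% for the HNSW index, then reserve 20% headroom on top.
recommended_ram = scalar_quantized * 1.25 * 1.2  # ~17.3 GB

for label, size in [
    ("full precision", full_precision),
    ("scalar quantized", scalar_quantized),
    ("binary quantized", binary_quantized),
    ("recommended node RAM (scalar)", recommended_ram),
]:
    print(f"{label}: {size / 1e9:.1f} GB")
```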

## When Vertical Scaling Is No Longer Enough

Recognize these signals that it's time to go horizontal:

- Data volume exceeds what a single node can hold even with quantization and mmap
- IOPS are saturated (more nodes = more independent disk I/O)
- Need fault tolerance (requires replication across nodes)
- Need tenant isolation via dedicated shards
- Single-node CPU is maxed and query latency is unacceptable
- The next vertical step is the largest available node size. Temporary scale-ups are often needed for batch operations or recovery; if you are already on the largest node size, that option is gone.

When you hit these limits, see [Horizontal Scaling](../horizontal-scaling/SKILL.md) for guidance on sharding and node planning.

## What NOT to Do

- Do not scale down RAM without load testing first (cache eviction = severe latency degradation that can last days)
- Do not ignore the 80% RAM threshold (performance cliff, not gradual degradation)
- Do not skip replication before resizing in Cloud (rolling restart without replicas = downtime)
- Do not jump to horizontal scaling before exhausting vertical options (adds permanent operational complexity)
- Do not assume more CPU always helps (IOPS-bound workloads won't improve with more cores)
56
skills/qdrant-scaling/scaling-qps/SKILL.md
Normal file
@@ -0,0 +1,56 @@
---
name: qdrant-scaling-qps
description: "Guides Qdrant query throughput (QPS) scaling. Use when someone asks 'how to increase QPS', 'need more throughput', 'queries per second too low', 'batch search', 'read replicas', or 'how to handle more concurrent queries'."
---

# Scaling for Query Throughput (QPS)

Throughput scaling means handling more parallel queries per second.
This is different from latency - throughput and latency are opposite tuning directions and cannot be optimized simultaneously on the same node.

High throughput favors fewer, larger segments so each query incurs less overhead.

## Performance Tuning for Higher QPS

- Use fewer, larger segments (`default_segment_number: 2`) [Maximizing throughput](https://search.qdrant.tech/md/documentation/operations/optimize/?s=maximizing-throughput)
- Enable quantization with `always_ram=true` to reduce disk IO [Quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/)
- Use the batch search API to amortize overhead [Batch search](https://search.qdrant.tech/md/documentation/search/search/?s=batch-search-api)
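
A minimal sketch of these settings with the Python client (collection name and query vectors are illustrative):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Throughput-oriented collection: few large segments.
client.create_collection(
    collection_name="docs",  # illustrative name
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    optimizers_config=models.OptimizersConfigDiff(default_segment_number=2),
)

# One round-trip for many queries amortizes network and planning overhead.
responses = client.query_batch_points(
    collection_name="docs",
    requests=[
        models.QueryRequest(query=[0.1] * 768, limit=10),  # placeholder vectors
        models.QueryRequest(query=[0.2] * 768, limit=10),
    ],
)
```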

## Minimize Impact of Update Workloads

- Configure update throughput control (v1.17+) to keep heavy update workloads from degrading searches [Low latency search](https://search.qdrant.tech/md/documentation/search/low-latency-search/)
- Set `optimizer_cpu_budget` to limit indexing CPUs (e.g. `2` on an 8-CPU node reserves 6 for queries)
- Configure delayed read fan-out (v1.17+) for tail latency [Delayed fan-outs](https://search.qdrant.tech/md/documentation/search/low-latency-search/?s=use-delayed-fan-outs)

## Horizontal Scaling for Throughput

If a single node is saturated on CPU after applying the tuning above, scale horizontally with read replicas.

- Shard replicas serve queries from replicated shards, distributing read load across nodes
- Each replica adds independent query capacity without re-sharding
- Use `replication_factor: 2+` and route reads to replicas [Distributed deployment](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=replication)

See also [Horizontal Scaling](../scaling-data-volume/horizontal-scaling/SKILL.md) for general horizontal scaling guidance.

## Disk I/O Bottlenecks

If it is not possible to keep all vectors in RAM, disk I/O can become the bottleneck for throughput.
In this case:

- Upgrade to provisioned IOPS or local NVMe first. See the impact of disk performance on vector search in the [Disk performance article](https://qdrant.tech/articles/memory-consumption/)
- Use `io_uring` on Linux (kernel 5.11+) [io_uring article](https://qdrant.tech/articles/io_uring/)
- For quantized vectors, prefer global rescoring over per-segment rescoring to reduce disk reads. Example in the [tutorial](https://search.qdrant.tech/md/documentation/tutorials-operations/large-scale-search/?s=search-query)
- Configure a higher number of search threads to parallelize disk reads. The default is `cpu_count - 1`, which is optimal for RAM-based search but may be too low for disk-based search. See the [configuration reference](https://search.qdrant.tech/md/documentation/operations/configuration/?s=configuration-options)
- If still saturated, scale out horizontally (each node adds independent IOPS)
## What NOT to Do

- Do not expect to optimize throughput and latency simultaneously on the same node
- Do not use many small segments for throughput workloads (increases per-query overhead)
- Do not scale horizontally when IOPS-bound without also upgrading disk tier
- Do not run at >90% RAM (OS cache eviction = severe performance degradation)
23
skills/qdrant-scaling/scaling-query-volume/SKILL.md
Normal file
@@ -0,0 +1,23 @@
---
name: qdrant-scaling-query-volume
description: "Guides Qdrant query volume scaling. Use when someone asks 'query returns too many results', 'scroll performance', 'large limit values', 'paginating search results', 'fetching many vectors', or 'high cardinality results'."
---

# Scaling for Query Volume

Problem: When a query has a large limit (e.g. 1000) and there are multiple shards (e.g. 10), naively each shard must return the full 1000 results — totaling 10,000 scored points transferred and merged. This is wasteful, since data is randomly distributed across auto-shards.

## Core idea

Instead of asking every shard for the full limit, ask each shard for a smaller limit computed via Poisson distribution statistics, then merge. This is safe because auto-sharding guarantees random, independent data distribution.
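
To illustrate the idea only (this is a sketch, not Qdrant's implementation; the function is hypothetical and assumes points fall uniformly at random across shards), a per-shard limit can be derived from a Poisson tail bound using the 1.2x safety factor and 99.9% threshold described under the tradeoff below:

```python
from scipy.stats import poisson

def per_shard_limit(total_limit: int, num_shards: int) -> int:
    """Illustrative: how many points to request from each shard so that,
    with high probability, the merged result still contains the true top-k."""
    expected = total_limit / num_shards        # mean points per shard (uniform split)
    bound = poisson.ppf(0.999, expected)       # 99.9% Poisson tail threshold
    return min(total_limit, int(bound * 1.2))  # 1.2x safety factor, capped at total

# For limit=1000 over 10 shards, each shard is asked for roughly 150-160
# points instead of 1000, cutting inter-shard transfer by more than 6x.
print(per_shard_limit(1000, 10))
```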

## When it activates

- More than 1 shard
- Auto-sharding is in use (all queried shards share the same shard key)
- The request's limit + offset >= SHARD_QUERY_SUBSAMPLING_LIMIT (128)
- The query is not exact

## Key tradeoff

The strategy trades a small probability of slightly incomplete results for a large reduction in inter-shard data transfer, especially for high-limit queries across many shards. The 1.2x safety factor and the 99.9% Poisson threshold keep the error rate very low — comparable to inaccuracies already introduced by approximate vector indices like HNSW.