feat: Qdrant skills (#1412)

This commit is contained in:
Anush
2026-04-17 06:24:27 +05:30
committed by GitHub
parent 971139baf2
commit 9637e1ab08
24 changed files with 1234 additions and 0 deletions


@@ -0,0 +1,49 @@
---
name: qdrant-scaling-data-volume
description: "Guides Qdrant data volume scaling decisions. Use when someone asks 'data doesn't fit on one node', 'too much data', 'need more storage', 'vertical or horizontal scaling', 'tenant scaling', 'time window rotation', or 'data growth exceeds capacity'."
allowed-tools:
- Read
- Grep
- Glob
---
# Scaling Data Volume
This document covers data volume scaling scenarios in which the total size of the dataset exceeds the capacity of a single node.
## Tenant Scaling
If the use case is multi-tenant, meaning that each user only has access to a subset of the data,
and we never need to query across all the data, then we can use multi-tenancy patterns to scale.
The recommended approach combines payload partitioning, per-tenant indexes, and tiered multitenancy.
Learn more: [Tenant Scaling](tenant-scaling/SKILL.md)
## Sliding Time Window
Some use-cases are based on a sliding time window, where only the most recent data is relevant.
For example, an index of social media posts where only the last 6 months of data require fast search.
Learn more: [Sliding Time Window](sliding-time-window/SKILL.md)
## Global Search
Most general use-cases require global search across all data.
In these situations, we might need to fall back to vertical scaling,
and then horizontal scaling when we reach the limits of vertical scaling.
### Vertical Scaling
When data doesn't fit in a single node, the first approach is to scale the node itself — more RAM, better disk, quantization, mmap.
Exhaust vertical options before going horizontal, as horizontal scaling adds permanent operational complexity.
Learn more: [Vertical Scaling](vertical-scaling/SKILL.md)
### Horizontal Scaling
When a single node can't hold the data even with quantization and mmap, distribute data across multiple nodes via sharding.
Learn more: [Horizontal Scaling](horizontal-scaling/SKILL.md)


@@ -0,0 +1,47 @@
---
name: qdrant-horizontal-scaling
description: "Diagnoses and guides Qdrant horizontal scaling decisions. Use when someone asks 'vertical or horizontal?', 'how many nodes?', 'how many shards?', 'how to add nodes', 'resharding', 'data doesn't fit', or 'need more capacity'. Also use when data growth outpaces current deployment."
---
# What to Do When Qdrant Needs More Capacity
Vertical first: simpler operations, no network overhead, good up to ~100M vectors per node depending on dimensions and quantization. Go horizontal when data exceeds single-node capacity, you need fault tolerance, you need to isolate tenants, or you are IOPS-bound (more nodes = more independent IOPS).
## Most basic distributed configuration
- 3 nodes, 3 shards with `replication_factor: 2` for zero-downtime scaling
Minimum of 3 nodes is important for consensus and fault tolerance. With 3 nodes, you can lose 1 node without downtime. With 2 nodes, losing 1 node causes downtime for collection operations.
Replication factor of 2 means each shard has 1 replica, so you have 2 copies of data. This allows for zero-downtime scaling and maintenance. With `replication_factor: 1`, zero-downtime is not guaranteed even for point-level operations, and cluster maintenance requires downtime.
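The baseline above can be sketched as a create-collection request body (a minimal sketch; the vector size and distance are placeholder assumptions, not part of the recommendation):

```python
import json

# Hypothetical config for the basic 3-node setup described above.
create_collection_body = {
    "vectors": {"size": 768, "distance": "Cosine"},  # assumed embedding config
    "shard_number": 3,         # one shard per node in a 3-node cluster
    "replication_factor": 2,   # each shard keeps 1 replica (2 copies of data)
}

# Sent as: PUT /collections/{collection_name}
print(json.dumps(create_collection_body, indent=2))
```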
## Choosing number of shards
Shards are the unit of data distribution.
More shards allow more nodes and better distribution but add overhead; fewer shards reduce overhead but limit horizontal scaling.
For a cluster of 3-6 nodes, the recommended shard count is 6-12.
This allows 2-4 shards per node, which balances distribution and overhead.
## Changing number of shards
Use when: shard count isn't evenly divisible by node count (causing uneven distribution), or you need to rebalance.
Resharding is expensive and time-consuming; use it as a last resort when regular data distribution is not possible.
It is designed to be transparent to user operations: updates and searches keep working during resharding with a small performance impact.
The operation itself, however, is slow and requires moving large amounts of data between nodes.
- Available in Qdrant Cloud [Resharding](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=resharding)
- Resharding is not available for self-hosted deployments.
Better alternatives: over-provision shards initially, or spin up new cluster with correct config and migrate data.
## What NOT to Do
- Do not jump to horizontal before exhausting vertical (adds complexity for no gain)
- Do not set `shard_number` that isn't a multiple of node count (uneven distribution)
- Do not use `replication_factor: 1` in production if you need fault tolerance
- Do not add nodes without rebalancing shards (use shard move API to redistribute)
- Do not scale down RAM without load testing (cache eviction causes days-long latency incidents)
- Do not hit the collection limit by using one collection per tenant (use payload partitioning)


@@ -0,0 +1,68 @@
---
name: qdrant-sliding-time-window
description: "Guides sliding time window scaling in Qdrant. Use when someone asks 'only recent data matters', 'how to expire old vectors', 'time-based data rotation', 'delete old data efficiently', 'social media feed search', 'news search', 'log search with retention', or 'how to keep only last N months of data'."
---
# Scaling with a Sliding Time Window
Use when only recent data needs fast search -- social media posts, news articles, support tickets, logs, job listings. Old data either becomes irrelevant or can tolerate slower access.
Three strategies: **shard rotation** (recommended), **collection rotation** (when per-period config differs), and **filter-and-delete** (simplest, for continuous cleanup).
## Shard Rotation (Recommended)
Use when: data has natural time boundaries (daily, weekly, monthly). Preferred because queries span all time periods in one request without application-level fan-out. [User-defined sharding](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=user-defined-sharding)
1. Create a collection with user-defined sharding enabled
2. Create one shard key per time period (e.g., `2025-01`, `2025-02`, ..., `2025-06`)
3. Ingest data into the current period's shard key
4. When a new period starts, create a new shard key and redirect writes
5. Delete the oldest shard key outside the retention window
- Deleting a shard key reclaims all resources instantly (no fragmentation, no optimizer overhead)
- Pre-create the next period's shard key before rotation to avoid write disruption
- Use `shard_key_selector` at query time to search only specific periods for efficiency
- Shard keys can be placed on specific nodes for hot/cold tiering
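The rotation bookkeeping in steps 2-5 can be sketched as a small helper that computes which monthly shard keys fall inside the retention window (an illustrative sketch; the `YYYY-MM` key naming is an assumption, and the returned keys would map to Qdrant's create/delete shard key operations):

```python
from datetime import date

def rotation_plan(today: date, retention_months: int) -> list[str]:
    """Return the monthly shard keys to keep, newest first.
    Anything outside this list is eligible for deletion."""
    keys = []
    year, month = today.year, today.month
    for _ in range(retention_months):
        keys.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return keys

# Keys to retain for a 6-month window as of mid-June 2025:
print(rotation_plan(date(2025, 6, 15), 6))
# → ['2025-06', '2025-05', '2025-04', '2025-03', '2025-02', '2025-01']
```

Running this once per period tells you which shard key to pre-create next and which one has fallen out of the window and can be dropped.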
## Collection Rotation (Alias Swap)
Use when: you need per-period collection configuration (e.g., different quantization or storage settings). [Collection aliases](https://search.qdrant.tech/md/documentation/manage-data/collections/?s=collection-aliases)
1. Create one collection per time period, point a write alias at the newest
2. Query across all active collections in parallel, merge results client-side
3. When a new period starts, create the new collection and swap the write alias [Switch collection](https://search.qdrant.tech/md/documentation/manage-data/collections/?s=switch-collection)
4. Drop the oldest collection outside the window
Trade-off vs shard rotation: allows per-collection config differences, but requires application-level fan-out and more operational overhead.
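The alias swap in step 3 can be sketched as a request body for Qdrant's alias endpoint (the collection and alias names are placeholder assumptions):

```python
import json

# Hypothetical atomic alias swap: repoint the write alias "posts-write"
# from the previous period's collection to the new one.
alias_swap_body = {
    "actions": [
        {"delete_alias": {"alias_name": "posts-write"}},
        {"create_alias": {"collection_name": "posts-2025-07",
                          "alias_name": "posts-write"}},
    ]
}

# Sent as: POST /collections/aliases
print(json.dumps(alias_swap_body, indent=2))
```

Because both actions go in one request, writers never observe a moment where the alias is missing.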
## Filter-and-Delete
Use when: data arrives continuously without clear time boundaries, or you want the simplest setup.
1. Store a `timestamp` payload on every point, create a payload index on it [Payload index](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=payload-index)
2. Filter to the desired window at query time using `range` condition [Range filter](https://search.qdrant.tech/md/documentation/search/filtering/?s=range)
3. Periodically delete expired points using delete-by-filter [Delete points](https://search.qdrant.tech/md/documentation/manage-data/points/?s=delete-points)
- Run cleanup during off-peak hours in batches (10k-50k points) to avoid optimizer locks
- Deletes are not free: tombstoned points degrade search until optimizer compacts segments
- Does not reclaim disk instantly (compaction is asynchronous)
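The three steps above can be sketched as request bodies (a rough sketch; the `timestamp` field is assumed to hold a Unix epoch in seconds, and the 180-day window is an assumed retention period):

```python
import json
from datetime import datetime, timedelta, timezone

cutoff = int((datetime.now(timezone.utc) - timedelta(days=180)).timestamp())

# 1. Payload index on the timestamp field (PUT /collections/{name}/index)
index_body = {"field_name": "timestamp", "field_schema": "integer"}

# 2. Query-time range filter restricting search to the retention window
window_filter = {"must": [{"key": "timestamp", "range": {"gte": cutoff}}]}

# 3. Delete-by-filter for expired points (POST /collections/{name}/points/delete)
delete_body = {"filter": {"must": [{"key": "timestamp", "range": {"lt": cutoff}}]}}

print(json.dumps(delete_body))
```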
## Hot/Cold Tiers
Use when: recent data needs fast in-RAM search, older data should remain searchable at lower performance.
- **Shard rotation:** place current shard key on fast-storage nodes, move older shard keys to cheaper nodes via shard placement. All queries still go through a single collection.
- **Collection rotation:** keep current collection in RAM (`always_ram: true`), move older collections to mmap/on-disk vectors. [Quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/)
## What NOT to Do
- Do not use filter-and-delete for high-volume time-series with millions of daily deletes (use rotation instead)
- Do not forget to index the timestamp field (range filters without an index cause full scans)
- Do not use collection rotation when shard rotation would suffice (unnecessary fan-out complexity)
- Do not drop a shard key or collection before verifying its period is fully outside the retention window
- Do not skip pre-creating the next period's shard key or collection (write failures during rotation are hard to recover)


@@ -0,0 +1,44 @@
---
name: qdrant-tenant-scaling
description: "Guides Qdrant multi-tenant scaling. Use when someone asks 'how to scale tenants', 'one collection per tenant?', 'tenant isolation', 'dedicated shards', or reports tenant performance issues. Also use when multi-tenant workloads outgrow shared infrastructure."
---
# What to Do When Scaling Multi-Tenant Qdrant
Do not create one collection per tenant: it does not scale past a few hundred tenants and wastes resources. One company hit the 1000-collection limit after a year of collection-per-repo and had to migrate to payload partitioning. Use a shared collection with a tenant key.
- Understand multitenancy patterns [Multitenancy](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/)
Here is a short summary of the patterns:
## Number of Tenants is around 10k
Use the default multitenancy strategy via payload filtering.
Read about [Partition by payload](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/?s=partition-by-payload) and [Calibrate performance](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/?s=calibrate-performance) for best practices on indexing and query performance.
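Payload filtering at query time can be sketched as a request body for the Query API (the `tenant_id` field name and query vector are placeholder assumptions):

```python
import json

# Hypothetical tenant-scoped query: every request carries a mandatory
# tenant filter so results never cross tenant boundaries.
query_body = {
    "query": [0.1, 0.2, 0.3],  # assumed query embedding
    "filter": {"must": [{"key": "tenant_id", "match": {"value": "tenant-42"}}]},
    "limit": 10,
}

# Sent as: POST /collections/{collection_name}/points/query
print(json.dumps(query_body))
```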
## Number of Tenants is around 100k or More
At this scale, the cluster may consist of several peers.
To localize tenant data and improve performance, use [custom sharding](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=user-defined-sharding) to assign tenants to specific shards based on tenant ID hash.
This will localize tenant requests to specific nodes instead of broadcasting them to all nodes, improving performance and reducing load on each node.
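The tenant-to-shard assignment can be sketched with a stable hash (an illustrative sketch; the `shard-N` key naming and the count of 8 shard keys are assumptions):

```python
import hashlib

def tenant_shard_key(tenant_id: str, num_shard_keys: int = 8) -> str:
    """Map a tenant to a deterministic shard key via an MD5 hash.
    MD5 is used for stability across processes, not for security."""
    digest = hashlib.md5(tenant_id.encode()).hexdigest()
    return f"shard-{int(digest, 16) % num_shard_keys}"

# The same tenant always lands on the same shard key, so reads and writes
# can pass it as a shard key selector and touch only that shard's nodes.
print(tenant_shard_key("tenant-42"))
```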
## If tenants are unevenly sized
If some tenants are much larger than others, use [tiered multitenancy](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/?s=tiered-multitenancy) to promote large tenants to dedicated shards while keeping small tenants on shared shards. This optimizes resource allocation and performance for tenants of varying sizes.
## Need Strict Tenant Isolation
Use when: legal/compliance requirements demand per-tenant encryption or strict isolation beyond what payload filtering provides.
- Multiple collections may be necessary for per-tenant encryption keys
- Limit collection count and use payload filtering within each collection
- This is the exception, not the default. Only use when compliance requires it.
## What NOT to Do
- Do not create one collection per tenant without compliance justification (does not scale past hundreds)
- Do not skip `is_tenant=true` on the tenant index (kills sequential read performance)
- Do not build global HNSW for multi-tenant collections (wasteful, use `payload_m` instead)
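The last two bullets can be sketched as config fragments (a rough sketch; the `tenant_id` field name and `payload_m: 16` value are assumptions):

```python
import json

# Tenant keyword index marked is_tenant, which optimizes storage layout
# for sequential per-tenant reads (PUT /collections/{name}/index).
tenant_index_body = {
    "field_name": "tenant_id",
    "field_schema": {"type": "keyword", "is_tenant": True},
}

# Collection-level HNSW config: m=0 disables the global graph while
# payload_m builds per-tenant graph links instead.
hnsw_fragment = {"hnsw_config": {"m": 0, "payload_m": 16}}

print(json.dumps(tenant_index_body), json.dumps(hnsw_fragment))
```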


@@ -0,0 +1,69 @@
---
name: qdrant-vertical-scaling
description: "Guides Qdrant vertical scaling decisions. Use when someone asks 'how to scale up a node', 'need more RAM', 'upgrade node size', 'vertical scaling', 'resize cluster', 'scale up vs scale out', or when memory/CPU is insufficient on current nodes. Also use when someone wants to avoid the complexity of horizontal scaling."
---
# What to Do When Qdrant Needs to Scale Vertically
Vertical scaling means increasing CPU, RAM, or disk on existing nodes rather than adding more nodes. This is the recommended first step before considering horizontal scaling. Vertical scaling is simpler, avoids distributed system complexity, and is reversible.
- Vertical scaling for Qdrant Cloud is done through the [Qdrant Cloud Console](https://cloud.qdrant.io/)
- For self-hosted deployments, resize the underlying VM or container resources
## When to Scale Vertically
Use when: current node resources (RAM, CPU, disk) are insufficient, but the workload doesn't yet require distribution.
- RAM usage approaching 80% of available memory (OS page cache eviction starts, severe performance degradation)
- CPU saturation during query serving or indexing
- Disk space running low for on-disk vectors and payloads
- A single node can handle up to ~100M vectors depending on dimensions and quantization
- For non-production workloads that tolerate a single point of failure and don't require high availability
## How to Scale Vertically in Qdrant Cloud
Vertical scaling is managed through the Qdrant Cloud Console.
- Log into [Qdrant Cloud Console](https://cloud.qdrant.io/) or use [CLI tool](https://github.com/qdrant/qcloud-cli)
- Select the cluster to resize
- Choose a larger node configuration (more RAM, CPU, or both)
- The upgrade process involves a rolling restart with no downtime if replication is configured
- Ensure `replication_factor: 2` or higher before resizing to maintain availability during the rolling restart
**Important:** Scaling up is straightforward. Scaling down requires care -- if the working set no longer fits in RAM after downsizing, performance will degrade severely due to cache eviction. Always load test before scaling down.
## RAM Sizing Guidelines
RAM is the most critical resource for Qdrant performance. Use these guidelines to right-size.
- Exact estimation of RAM usage is difficult; use this simple approximate formula: `num_vectors * dimensions * 4 bytes * 1.5` for full-precision vectors in RAM
- With scalar quantization: divide by 4 (INT8 reduces each float32 to 1 byte) [Quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/)
- With binary quantization: divide by 32 [Binary quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/?s=binary-quantization)
- Add overhead for HNSW index (~20-30% of vector data), payload indexes, and WAL
- Reserve 20% headroom for optimizer operations and OS cache
- Monitor actual usage via Grafana/Prometheus before and after resizing [Monitoring](../../../qdrant-monitoring/SKILL.md)
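The sizing rules above can be sketched as a small estimator (a rough rule of thumb only; it implements exactly the formula and divisors listed, before HNSW, payload-index, and WAL overhead):

```python
def estimate_ram_bytes(num_vectors: int, dims: int, quantization: str = "none") -> float:
    """Approximate RAM for in-memory vectors: num * dims * 4 bytes * 1.5,
    divided by 4 for scalar (INT8) or 32 for binary quantization."""
    base = num_vectors * dims * 4 * 1.5          # float32 vectors + 1.5x margin
    divisor = {"none": 1, "scalar": 4, "binary": 32}[quantization]
    return base / divisor

# 10M 768-dim vectors at full precision:
gib = estimate_ram_bytes(10_000_000, 768) / 2**30
print(f"{gib:.1f} GiB")  # roughly 42.9 GiB before index and OS-cache headroom
```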
## When Vertical Scaling Is No Longer Enough
Recognize these signals that it's time to go horizontal:
- Data volume exceeds what a single node can hold even with quantization and mmap
- IOPS are saturated (more nodes = more independent disk I/O)
- Need fault tolerance (requires replication across nodes)
- Need tenant isolation via dedicated shards
- Single-node CPU is maxed and query latency is unacceptable
- The next vertical scaling step is the largest available node size. Batch operations and recovery sometimes require temporarily scaling up to a larger node size; if you are already on the largest size, that headroom is gone.
When you hit these limits, see [Horizontal Scaling](../horizontal-scaling/SKILL.md) for guidance on sharding and node planning.
## What NOT to Do
- Do not scale down RAM without load testing first (cache eviction = severe latency degradation that can last days)
- Do not ignore the 80% RAM threshold (performance cliff, not gradual degradation)
- Do not skip replication before resizing in Cloud (rolling restart without replicas = downtime)
- Do not jump to horizontal scaling before exhausting vertical options (adds permanent operational complexity)
- Do not assume more CPU always helps (IOPS-bound workloads won't improve with more cores)