mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-18 06:05:55 +00:00
feat: Qdrant skills (#1412)
@@ -250,6 +250,14 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
| [pytest-coverage](../skills/pytest-coverage/SKILL.md) | Run pytest tests with coverage, discover lines missing coverage, and increase coverage to 100%. | None |
| [python-mcp-server-generator](../skills/python-mcp-server-generator/SKILL.md) | Generate a complete MCP server project in Python with tools, resources, and proper configuration | None |
| [python-pypi-package-builder](../skills/python-pypi-package-builder/SKILL.md) | End-to-end skill for building, testing, linting, versioning, and publishing a production-grade Python library to PyPI. Covers all four build backends (setuptools+setuptools_scm, hatchling, flit, poetry), PEP 440 versioning, semantic versioning, dynamic git-tag versioning, OOP/SOLID design, type hints (PEP 484/526/544/561), Trusted Publishing (OIDC), and the full PyPA packaging flow. Use for: creating Python packages, pip-installable SDKs, CLI tools, framework plugins, pyproject.toml setup, py.typed, setuptools_scm, semver, mypy, pre-commit, GitHub Actions CI/CD, or PyPI publishing. | `references/architecture-patterns.md`<br />`references/ci-publishing.md`<br />`references/community-docs.md`<br />`references/library-patterns.md`<br />`references/pyproject-toml.md`<br />`references/release-governance.md`<br />`references/testing-quality.md`<br />`references/tooling-ruff.md`<br />`references/versioning-strategy.md`<br />`scripts/scaffold.py` |
| [qdrant-clients-sdk](../skills/qdrant-clients-sdk/SKILL.md) | Qdrant provides client SDKs for various programming languages, allowing easy integration with Qdrant deployments. | None |
| [qdrant-deployment-options](../skills/qdrant-deployment-options/SKILL.md) | Guides Qdrant deployment selection. Use when someone asks 'how to deploy Qdrant', 'Docker vs Cloud', 'local mode', 'embedded Qdrant', 'Qdrant EDGE', 'which deployment option', 'self-hosted vs cloud', or 'need lowest latency deployment'. Also use when choosing between deployment types for a new project. | None |
| [qdrant-model-migration](../skills/qdrant-model-migration/SKILL.md) | Guides embedding model migration in Qdrant without downtime. Use when someone asks 'how to switch embedding models', 'how to migrate vectors', 'how to update to a new model', 'zero-downtime model change', 'how to re-embed my data', or 'can I use two models at once'. Also use when upgrading model dimensions, switching providers, or A/B testing models. | None |
| [qdrant-monitoring](../skills/qdrant-monitoring/SKILL.md) | Guides Qdrant monitoring and observability setup. Use when someone asks 'how to monitor Qdrant', 'what metrics to track', 'is Qdrant healthy', 'optimizer stuck', 'why is memory growing', 'requests are slow', or needs to set up Prometheus, Grafana, or health checks. Also use when debugging production issues that require metric analysis. | `debugging`<br />`setup` |
| [qdrant-performance-optimization](../skills/qdrant-performance-optimization/SKILL.md) | Different techniques to optimize the performance of Qdrant, including indexing strategies, query optimization, and hardware considerations. Use when you want to improve the speed and efficiency of your Qdrant deployment. | `indexing-performance-optimization`<br />`memory-usage-optimization`<br />`search-speed-optimization` |
| [qdrant-scaling](../skills/qdrant-scaling/SKILL.md) | Guides Qdrant scaling decisions. Use when someone asks 'how many nodes do I need', 'data doesn't fit on one node', 'need more throughput', 'cluster is slow', 'too many tenants', 'vertical or horizontal', 'how to shard', or 'need to add capacity'. | `minimize-latency`<br />`scaling-data-volume`<br />`scaling-qps`<br />`scaling-query-volume` |
| [qdrant-search-quality](../skills/qdrant-search-quality/SKILL.md) | Diagnoses and improves Qdrant search relevance. Use when someone reports 'search results are bad', 'wrong results', 'low precision', 'low recall', 'irrelevant matches', 'missing expected results', or asks 'how to improve search quality?', 'which embedding model?', 'should I use hybrid search?', 'should I use reranking?'. Also use when search quality degrades after quantization, model change, or data growth. | `diagnosis`<br />`search-strategies` |
| [qdrant-version-upgrade](../skills/qdrant-version-upgrade/SKILL.md) | Guidance on how to upgrade your Qdrant version without interrupting the availability of your application and ensuring data integrity. | None |
| [quality-playbook](../skills/quality-playbook/SKILL.md) | Explore any codebase from scratch and generate six quality artifacts: a quality constitution (QUALITY.md), spec-traced functional tests, a code review protocol with regression test generation, an integration testing protocol, a multi-model spec audit (Council of Three), and an AI bootstrap file (AGENTS.md). Includes state machine completeness analysis and missing safeguard detection. Works with any language (Python, Java, Scala, TypeScript, Go, Rust, etc.). Use this skill whenever the user asks to set up a quality playbook, generate functional tests from specifications, create a quality constitution, build testing protocols, audit code against specs, or establish a repeatable quality system for a project. Also trigger when the user mentions 'quality playbook', 'spec audit', 'Council of Three', 'fitness-to-purpose', 'coverage theater', or wants to go beyond basic test generation to build a full quality system grounded in their actual codebase. | `LICENSE.txt`<br />`references/constitution.md`<br />`references/defensive_patterns.md`<br />`references/functional_tests.md`<br />`references/review_protocols.md`<br />`references/schema_mapping.md`<br />`references/spec_audit.md`<br />`references/verification.md` |
| [quasi-coder](../skills/quasi-coder/SKILL.md) | Expert 10x engineer skill for interpreting and implementing code from shorthand, quasi-code, and natural language descriptions. Use when collaborators provide incomplete code snippets, pseudo-code, or descriptions with potential typos or incorrect terminology. Excels at translating non-technical or semi-technical descriptions into production-quality code. | None |
| [react-audit-grep-patterns](../skills/react-audit-grep-patterns/SKILL.md) | Provides the complete, verified grep scan command library for auditing React codebases before a React 18.3.1 or React 19 upgrade. Use this skill whenever running a migration audit - for both the react18-auditor and react19-auditor agents. Contains every grep pattern needed to find deprecated APIs, removed APIs, unsafe lifecycle methods, batching vulnerabilities, test file issues, dependency conflicts, and React 19 specific removals. Always use this skill when writing audit scan commands - do not rely on memory for grep syntax, especially for the multi-line async setState patterns which require context flags. | `references/dep-scans.md`<br />`references/react18-scans.md`<br />`references/react19-scans.md`<br />`references/test-scans.md` |
74
skills/qdrant-clients-sdk/SKILL.md
Normal file
@@ -0,0 +1,74 @@
---
name: qdrant-clients-sdk
description: "Qdrant provides client SDKs for various programming languages, allowing easy integration with Qdrant deployments."
allowed-tools:
- Read
- Grep
- Glob
- Bash
---

# Qdrant Clients SDK

Qdrant has the following officially supported client SDKs:

- Python — [qdrant-client](https://github.com/qdrant/qdrant-client) · Installation: `pip install qdrant-client[fastembed]`
- JavaScript / TypeScript — [qdrant-js](https://github.com/qdrant/qdrant-js) · Installation: `npm install @qdrant/js-client-rest`
- Rust — [rust-client](https://github.com/qdrant/rust-client) · Installation: `cargo add qdrant-client`
- Go — [go-client](https://github.com/qdrant/go-client) · Installation: `go get github.com/qdrant/go-client`
- .NET — [qdrant-dotnet](https://github.com/qdrant/qdrant-dotnet) · Installation: `dotnet add package Qdrant.Client`
- Java — [java-client](https://github.com/qdrant/java-client) · Available on Maven Central: https://central.sonatype.com/artifact/io.qdrant/client

## API Reference

All interaction with Qdrant can happen through the REST API or the gRPC API. We recommend the REST API if you are using Qdrant for the first time or working on a prototype.

* REST API - [OpenAPI Reference](https://api.qdrant.tech/api-reference) - [GitHub](https://github.com/qdrant/qdrant/blob/master/docs/redoc/master/openapi.json)
* gRPC API - [gRPC protobuf definitions](https://github.com/qdrant/qdrant/tree/master/lib/api/src/grpc/proto)

## Code examples

To obtain code examples for a specific client and use case, send a search request to the library of curated code snippets for the Qdrant clients:

```bash
curl -X GET "https://snippets.qdrant.tech/search?language=python&query=how+to+upload+points"
```

Available languages: `python`, `typescript`, `rust`, `java`, `go`, `csharp`

Response example:
```markdown
## Snippet 1

*qdrant-client* (vlatest) — https://search.qdrant.tech/md/documentation/manage-data/points/

Uploads multiple vector-embedded points to a Qdrant collection using the Python qdrant_client (PointStruct) with id, payload (e.g., color), and a 3D-like vector for similarity search. It supports parallel uploads (parallel=4) and a retry policy (max_retries=3) for robust indexing. The operation is idempotent: re-uploading with the same id overwrites existing points; if ids aren't provided, Qdrant auto-generates UUIDs.

client.upload_points(
    collection_name="{collection_name}",
    points=[
        models.PointStruct(
            id=1,
            payload={"color": "red"},
            vector=[0.9, 0.1, 0.1],
        ),
        models.PointStruct(
            id=2,
            payload={"color": "green"},
            vector=[0.1, 0.9, 0.1],
        ),
    ],
    parallel=4,
    max_retries=3,
)
```

The default response format is markdown; if you need the snippet output in JSON, add `&format=json` to the query string.
53
skills/qdrant-deployment-options/SKILL.md
Normal file
@@ -0,0 +1,53 @@
---
name: qdrant-deployment-options
description: "Guides Qdrant deployment selection. Use when someone asks 'how to deploy Qdrant', 'Docker vs Cloud', 'local mode', 'embedded Qdrant', 'Qdrant EDGE', 'which deployment option', 'self-hosted vs cloud', or 'need lowest latency deployment'. Also use when choosing between deployment types for a new project."
---

# Which Qdrant Deployment Do I Need?

Start with what you need: managed ops or full control? Network latency acceptable or not? Production or prototyping? The answer narrows to one of four options.

## Getting Started or Prototyping

Use when: building a prototype, running tests, CI/CD pipelines, or learning Qdrant.

- Use local mode (Python only): zero-dependency, in-memory or disk-persisted, no server needed [Local mode](https://search.qdrant.tech/md/documentation/quickstart/)
- Local mode data format is NOT compatible with the server. Do not use for production or benchmarking.
- For a real server locally, use Docker [Quick start](https://search.qdrant.tech/md/documentation/quickstart/?s=download-and-run)
## Going to Production (Self-Hosted)

Use when: you need full control over infrastructure, data residency, or custom configuration.

- Docker is the default deployment. Full Qdrant Open Source feature set, minimal setup. [Quick start](https://search.qdrant.tech/md/documentation/quickstart/?s=download-and-run)
- You own operations: upgrades, backups, scaling, monitoring
- Must set up distributed mode manually for multi-node clusters [Distributed deployment](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/)
- Consider Hybrid Cloud if you want Qdrant Cloud management on your infrastructure [Hybrid Cloud](https://search.qdrant.tech/md/documentation/hybrid-cloud/)
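A minimal Docker Compose sketch for the default self-hosted deployment; the service name and host volume path are illustrative, and the ports follow the standard image layout (6333 REST, 6334 gRPC, data under `/qdrant/storage`):

```yaml
# Minimal self-hosted sketch; pin a specific image tag for production.
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"   # REST API
      - "6334:6334"   # gRPC API
    volumes:
      - ./qdrant_storage:/qdrant/storage   # persist data across restarts
```

Without the volume mount, all data is lost when the container is removed.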
## Going to Production (Zero-Ops)

Use when: you want managed infrastructure with zero-downtime updates, automatic backups, and resharding without operating clusters yourself.

- Qdrant Cloud handles upgrades, scaling, backups, and monitoring [Qdrant Cloud](https://search.qdrant.tech/md/documentation/cloud-quickstart/)
- Supports multi-version upgrades automatically
- Provides features not available in self-hosted: `/sys_metrics`, managed resharding, pre-configured alerts
## Need Lowest Possible Latency

Use when: a network round-trip to a server is unacceptable. Edge devices, in-process search, or latency-critical applications.

- Qdrant EDGE: in-process bindings to Qdrant shard-level functions, no network overhead [Qdrant EDGE](https://search.qdrant.tech/md/documentation/edge/edge-quickstart/)
- Same data format as the server. Can sync with the server via shard snapshots.
- Single-node feature set only. No distributed mode.

## What NOT to Do

- Use local mode for production or benchmarking (not optimized, incompatible data format)
- Self-host without a monitoring and backup strategy (you will lose data or miss outages)
- Choose EDGE when you need distributed search (single-node only)
- Pick Hybrid Cloud unless you have data residency requirements (unnecessary Kubernetes complexity when Qdrant Cloud works)
85
skills/qdrant-model-migration/SKILL.md
Normal file
@@ -0,0 +1,85 @@
---
name: qdrant-model-migration
description: "Guides embedding model migration in Qdrant without downtime. Use when someone asks 'how to switch embedding models', 'how to migrate vectors', 'how to update to a new model', 'zero-downtime model change', 'how to re-embed my data', or 'can I use two models at once'. Also use when upgrading model dimensions, switching providers, or A/B testing models."
---

# What to Do When Changing Embedding Models

Vectors from different models are incompatible. You cannot mix old and new embeddings in the same vector space. You also cannot add new named vector fields to an existing collection. All named vectors must be defined at collection creation time. Both migration strategies below require creating a new collection.

- Understand collection aliases before choosing a strategy [Collection aliases](https://search.qdrant.tech/md/documentation/manage-data/collections/?s=collection-aliases)
## Can I Avoid Re-embedding?

Use when: looking for shortcuts before committing to a full migration.

You MUST re-embed if: changing model provider (OpenAI to Cohere), changing architecture (CLIP to BGE), incompatible dimension counts across different models, or adding sparse vectors to a dense-only collection.

You CAN avoid re-embedding if: using Matryoshka models (use the `dimensions` parameter to output lower-dimensional embeddings, learn a linear transformation from sample data, some recall loss, good for 100M+ datasets), or changing quantization (binary to scalar): Qdrant re-quantizes automatically. [Quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/)
## Need Zero Downtime (Alias Swap)

Use when: production must stay available. Recommended for model replacement at scale.

- Create a new collection with the new model's dimensions and distance metric
- Re-embed all data into the new collection in the background
- Point your application at a collection alias instead of a direct collection name
- Atomically swap the alias to the new collection [Switch collection](https://search.qdrant.tech/md/documentation/manage-data/collections/?s=switch-collection)
- Verify search quality, then delete the old collection

Careful: the alias swap only redirects queries. Payloads must be re-uploaded separately.
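The swap step above is a single atomic request to the REST alias endpoint (`POST /collections/aliases`). A stdlib sketch that builds the request body; the collection and alias names are illustrative:

```python
# Build the atomic alias-swap request body: drop the old binding and create the
# new one in one request, so queries never see a missing alias.
# Collection and alias names are illustrative.
import json

alias_swap = {
    "actions": [
        {"delete_alias": {"alias_name": "prod-embeddings"}},
        {"create_alias": {"collection_name": "docs_v2", "alias_name": "prod-embeddings"}},
    ]
}

# Send with e.g.:
#   curl -X POST http://localhost:6333/collections/aliases \
#     -H 'Content-Type: application/json' -d "$BODY"
body = json.dumps(alias_swap)
print(body)
```

Because both actions land in one request, there is no window where `prod-embeddings` resolves to nothing.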
## Need Both Models Live (Side-by-Side)

Use when: A/B testing models, multi-modal (dense + sparse), or evaluating a new model before committing.

You cannot add a named vector to an existing collection. Create a new collection with both vector fields defined upfront:

- Create a new collection with the old and new named vectors both defined [Collection with multiple vectors](https://search.qdrant.tech/md/documentation/manage-data/collections/?s=collection-with-multiple-vectors)
- Migrate data from the old collection, preserving existing vectors in the old named field
- Backfill new model embeddings incrementally using `UpdateVectors` [Update vectors](https://search.qdrant.tech/md/documentation/manage-data/points/?s=update-vectors)
- Compare quality by querying with `using: "old_model"` vs `using: "new_model"`
- Swap the alias to the new collection once satisfied

Co-locating large multi-vectors (especially ColBERT) with dense vectors degrades ALL queries, even those using only the dense field. At millions of points, users report 13s latency dropping to 2s after removing ColBERT. Put large vectors on disk during a side-by-side migration.

If you anticipate future model migrations, define both vector fields upfront at collection creation.
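Since named vectors must be declared at creation time, the collection config for the side-by-side setup looks like the sketch below (the body of a `PUT /collections/{collection_name}` request); field names and sizes are illustrative:

```python
# Collection config declaring both named vectors upfront.
# "old_model"/"new_model" names and the 384/768 sizes are illustrative.
collection_config = {
    "vectors": {
        "old_model": {"size": 384, "distance": "Cosine"},
        # Keep the larger new vectors on disk during migration to avoid
        # degrading queries that only use the dense field.
        "new_model": {"size": 768, "distance": "Cosine", "on_disk": True},
    }
}
```

Queries then select a field with `using: "old_model"` or `using: "new_model"` without touching the config again.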
## Dense to Hybrid Search Migration

Use when: adding sparse/BM25 vectors to an existing dense-only collection. The most common migration pattern.

You cannot add sparse vectors to an existing dense-only collection. You must recreate it:

- Create a new collection with both dense and sparse vector configs defined
- Re-embed all data with both dense and sparse models
- Migrate payloads, swap the alias

Sparse vectors at chunk level have different TF-IDF characteristics than at document level. Test retrieval quality after migration, especially for non-English text without stop-word removal.
## Re-embedding Is Too Slow

Use when: the dataset is large and re-embedding is the bottleneck.

- Use `update_mode: insert` (v1.17+) for safe idempotent migration [Update mode](https://search.qdrant.tech/md/documentation/manage-data/points/?s=update-mode)
- Scroll the old collection with `with_vectors=False`, re-embed in batches, upsert into the new collection
- Upload in parallel batches (64-256 points per request, 2-4 parallel streams) [Bulk upload](https://search.qdrant.tech/md/documentation/tutorials-develop/bulk-upload/)
- Disable HNSW during bulk load (set `indexing_threshold_kb` very high, restore it after)
- For Qdrant Cloud inference, switching models is a config change, not a pipeline change [Inference docs](https://search.qdrant.tech/md/documentation/inference/)

For 400GB+ datasets, expect days. For small datasets (<25MB), re-indexing from source is faster than using the migration tool.
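The scroll, re-embed, upsert loop reduces to a batching skeleton like the one below; `embed_batch` and `upsert_batch` are hypothetical placeholders for your model call and client write:

```python
# Batching skeleton for the scroll -> re-embed -> upsert loop.
# embed_batch() and upsert_batch() are hypothetical placeholders for the
# embedding model call and the Qdrant client write.
from itertools import islice
from typing import Callable, Iterable, Iterator

def batched(items: Iterable, size: int = 128) -> Iterator[list]:
    """Yield successive fixed-size batches (64-256 per request is the sweet spot)."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def migrate(points: Iterable[dict],
            embed_batch: Callable[[list], list],
            upsert_batch: Callable[[list, list], None],
            size: int = 128) -> int:
    """Re-embed scrolled points batch by batch; returns the number migrated."""
    total = 0
    for batch in batched(points, size):
        texts = [p["payload"]["text"] for p in batch]
        vectors = embed_batch(texts)   # re-embed with the new model
        upsert_batch(batch, vectors)   # write into the new collection
        total += len(batch)
    return total
```

Run 2-4 of these loops in parallel over disjoint scroll ranges to saturate upload throughput.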
## What NOT to Do

- Assume you can add named vectors to an existing collection (they must be defined at creation time)
- Delete the old collection before verifying the new one
- Forget to update the query embedding model in your application code
- Skip payload migration when using an alias swap (aliases redirect queries, they do not copy data)
- Keep ColBERT vectors co-located with dense vectors during a long migration (the I/O cost degrades all queries)
- Migrate to hybrid search without testing BM25 quality at chunk level
24
skills/qdrant-monitoring/SKILL.md
Normal file
@@ -0,0 +1,24 @@
---
name: qdrant-monitoring
description: "Guides Qdrant monitoring and observability setup. Use when someone asks 'how to monitor Qdrant', 'what metrics to track', 'is Qdrant healthy', 'optimizer stuck', 'why is memory growing', 'requests are slow', or needs to set up Prometheus, Grafana, or health checks. Also use when debugging production issues that require metric analysis."
allowed-tools:
- Read
- Grep
- Glob
---

# Qdrant Monitoring

Monitoring lets you track the performance and health of your deployment and identify issues before they become outages. First determine whether you need to set up monitoring or diagnose an active issue.

- Understand the available metrics [Monitoring docs](https://search.qdrant.tech/md/documentation/operations/monitoring/)

## Monitoring Setup

Prometheus scraping, health probes, Hybrid Cloud specifics, alerting, and log centralization. [Monitoring Setup](setup/SKILL.md)

## Debugging with Metrics

Optimizer stuck, memory growth, slow requests. Using metrics to diagnose active production issues. [Debugging with Metrics](debugging/SKILL.md)
52
skills/qdrant-monitoring/debugging/SKILL.md
Normal file
@@ -0,0 +1,52 @@
---
name: qdrant-monitoring-debugging
description: "Diagnoses Qdrant production issues using metrics and observability tools. Use when someone reports 'optimizer stuck', 'indexing too slow', 'memory too high', 'OOM crash', 'queries are slow', 'latency spike', or 'search was fast now it's slow'. Also use when performance degrades without obvious config changes."
---

# How to Debug Qdrant with Metrics

First check the optimizer status. Most production issues trace back to active optimizations competing for resources. If the optimizer is clean, check memory, then request metrics.
## Optimizer Stuck or Too Slow

Use when: the optimizer has been running for hours, is not finishing, or is showing errors.

- Use the `/collections/{collection_name}/optimizations` endpoint (v1.17+) to check status [Optimization monitoring](https://search.qdrant.tech/md/documentation/operations/optimizer/?s=optimization-monitoring)
- Query with optional detail flags: `?with=queued,completed,idle_segments`
- Returns: queued optimizations count, active optimizer type, involved segments, progress tracking
- The Web UI has an Optimizations tab with a timeline view and per-task duration metrics [Web UI](https://search.qdrant.tech/md/documentation/operations/optimizer/?s=web-ui)
- If `optimizer_status` shows an error in the collection info, check the logs for a full disk or corrupted segments
- Large merges and HNSW rebuilds legitimately take hours on big datasets. Check progress before assuming the optimizer is stuck.
## Memory Seems Too High

Use when: memory exceeds expectations, a node crashes with OOM, or memory keeps growing.

- Process memory metrics are available via `/metrics` (RSS, allocated bytes, page faults)
- Qdrant uses two types of RAM: resident memory (data structures, quantized vectors) and OS page cache (cached disk reads). Page cache filling available RAM is normal. [Memory article](https://qdrant.tech/articles/memory-consumption/)
- If resident memory (RssAnon) exceeds 80% of total RAM, investigate
- Check `/telemetry` for a per-collection breakdown of point counts and vector configurations
- Estimate expected memory: `num_vectors * dimensions * 4 bytes * 1.5` for vectors, plus payload and index overhead [Capacity planning](https://search.qdrant.tech/md/documentation/operations/capacity-planning/)
- Common causes of unexpected growth: quantized vectors with `always_ram=true`, too many payload indexes, a large `max_segment_size` during optimization
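The rule of thumb above works out as a one-line calculation; the 1M-point, 768-dimension example is illustrative:

```python
# Back-of-envelope RAM estimate for raw float32 vectors using the rule of thumb
# num_vectors * dimensions * 4 bytes * 1.5. Payload and index overhead is extra.
def estimate_vector_ram_gib(num_vectors: int, dimensions: int) -> float:
    bytes_needed = num_vectors * dimensions * 4 * 1.5
    return bytes_needed / (1024 ** 3)

# e.g. 1M vectors of 768 dimensions:
print(round(estimate_vector_ram_gib(1_000_000, 768), 2))  # ~4.29 GiB
```

If resident memory is far above this estimate, look at the common growth causes listed above rather than assuming a leak.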
## Queries Are Slow

Use when: queries are slower than expected and you need to identify the cause.

- Track `rest_responses_avg_duration_seconds` and `rest_responses_max_duration_seconds` per endpoint
- Use the histogram metric `rest_responses_duration_seconds` (v1.8+) for percentile analysis in Grafana
- Equivalent gRPC metrics use the `grpc_responses_` prefix
- Check the optimizer status first. Active optimizations compete for CPU and I/O, degrading search latency.
- Check the segment count via collection info. Too many unmerged segments after a bulk upload cause slower search.
- Compare filtered vs unfiltered query times. A large gap means a missing payload index. [Payload index](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=payload-index)
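For percentile analysis, the histogram metric feeds a standard `histogram_quantile` query; a sketch assuming the default metric prefix and an `endpoint` label on the REST metrics:

```promql
# p95 REST latency over the last 5 minutes, per endpoint
histogram_quantile(0.95, sum(rate(rest_responses_duration_seconds_bucket[5m])) by (le, endpoint))
```

Plot this next to the optimizer metrics to see whether latency spikes line up with optimization runs.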
## What NOT to Do

- Ignore optimizer status when debugging slow queries (the most common root cause)
- Assume a memory leak when the page cache fills RAM (normal OS behavior)
- Make config changes while the optimizer is running (causes cascading re-optimizations)
- Blame Qdrant before checking whether a bulk upload just finished (unmerged segments)
61
skills/qdrant-monitoring/setup/SKILL.md
Normal file
@@ -0,0 +1,61 @@
---
name: qdrant-monitoring-setup
description: "Guides Qdrant monitoring setup including Prometheus scraping, health probes, Hybrid Cloud metrics, alerting, and log centralization. Use when someone asks 'how to set up monitoring', 'Prometheus config', 'Grafana dashboard', 'health check endpoints', 'how to scrape Hybrid Cloud', 'what alerts to set', 'how to centralize logs', or 'audit logging'."
---

# How to Set Up Qdrant Monitoring

Get Prometheus scraping working first, then health probes, then alerting. Do not go to production without monitoring in place.
## Prometheus Metrics
|
||||||
|
|
||||||
|
Use when: setting up metric collection for the first time or adding a new deployment.
|
||||||
|
|
||||||
|
- Node metrics at `/metrics` endpoint [Monitoring docs](https://search.qdrant.tech/md/documentation/operations/monitoring/)
|
||||||
|
- Cluster metrics at `/sys_metrics` (Qdrant Cloud only)
|
||||||
|
- Prefix customization via `service.metrics_prefix` config or `QDRANT__SERVICE__METRICS_PREFIX` env var
|
||||||
|
- Example self-hosted setup with Prometheus + Grafana [prometheus-monitoring repo](https://github.com/qdrant/prometheus-monitoring)
|
||||||
|
|
||||||
|
|
||||||
|
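A minimal Prometheus scrape job for the node metrics above might look like this. It is a sketch, not a prescribed config; the target host and port are placeholders for your deployment:

```yaml
# Minimal Prometheus scrape job for a self-hosted Qdrant node.
# "qdrant-node-1:6333" is a placeholder for your node's HTTP address.
scrape_configs:
  - job_name: "qdrant"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["qdrant-node-1:6333"]
```
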
## Hybrid Cloud Scraping

Use when: running Qdrant Hybrid Cloud and need cluster-level visibility.

Do not just scrape Qdrant nodes. In Hybrid Cloud, you manage the Kubernetes data plane. You must also scrape the cluster-exporter and operator pods for full cluster visibility and operator state.

- Hybrid Cloud Prometheus setup tutorial: [Hybrid Cloud Prometheus](https://search.qdrant.tech/md/documentation/tutorials-and-examples/hybrid-cloud-prometheus/)
- Official Grafana dashboards: [Grafana dashboard repo](https://github.com/qdrant/qdrant-cloud-grafana-dashboard)

## Liveness and Readiness Probes

Use when: configuring Kubernetes health checks.

- Use `/healthz`, `/livez`, and `/readyz` for basic status, liveness, and readiness [Kubernetes health endpoints](https://search.qdrant.tech/md/documentation/operations/monitoring/?s=kubernetes-health-endpoints)

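These endpoints map naturally onto Kubernetes probes. A sketch, assuming the default HTTP port 6333; the timing values are illustrative, not prescriptive:

```yaml
# Sketch of container probes against Qdrant's health endpoints.
livenessProbe:
  httpGet:
    path: /livez
    port: 6333
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 6333
  periodSeconds: 5
startupProbe:
  httpGet:
    path: /healthz
    port: 6333
  failureThreshold: 30
  periodSeconds: 5
```
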
## Alerting

Use when: setting up alerts for production or Hybrid Cloud deployments.

- Hybrid Cloud provides ~11 pre-configured Prometheus alerts out of the box [Cloud cluster monitoring](https://search.qdrant.tech/md/documentation/cloud/cluster-monitoring/)
- Use AlertmanagerConfig to route alerts to Slack, PagerDuty, or other targets based on labels
- At minimum, alert on: optimizer errors, node not ready, replication factor below target, and disk usage >80%

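As a sketch of the disk-usage alert above, assuming node_exporter metrics are also scraped. The metric names and the mountpoint label come from node_exporter, not Qdrant, and are illustrative:

```yaml
# Example Prometheus rule for the >80% disk-usage alert above.
# Adjust the mountpoint label to match your storage volume.
groups:
  - name: qdrant-minimum-alerts
    rules:
      - alert: QdrantDiskUsageHigh
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/qdrant/storage"}
             / node_filesystem_size_bytes{mountpoint="/qdrant/storage"}) > 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant storage volume above 80% usage"
```
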
## Log Centralization and Audit Logging

Use when: enterprise compliance requires centralized logs or audit trails.

- Enable the JSON log format for structured analysis: set `logger.format` to `json` in the config [Configuration](https://search.qdrant.tech/md/documentation/operations/configuration/)
- Use FluentD/OpenSearch for log aggregation
- Audit logs (v1.17+) write to the local filesystem (`/qdrant/storage/audit/`), not stdout. Mount a Persistent Volume and deploy a sidecar container to tail these files to stdout so DaemonSets can pick them up. [Audit logging](https://search.qdrant.tech/md/documentation/operations/security/?s=audit-logging)

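A minimal config fragment for the JSON log format described above (a sketch; see the Configuration docs for the full set of logger options):

```yaml
# Qdrant config fragment: switch logs to structured JSON output.
logger:
  format: json
```
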
## What NOT to Do

- Scrape `/sys_metrics` on self-hosted deployments (only available on Qdrant Cloud)
- Scrape only Qdrant nodes in Hybrid Cloud (you will miss cluster-exporter and operator metrics)
- Skip monitoring setup before going to production (you will regret it)
- Alert on page cache memory usage (it is supposed to fill available RAM; normal OS behavior)

37
skills/qdrant-performance-optimization/SKILL.md
Normal file
@@ -0,0 +1,37 @@
---
name: qdrant-performance-optimization
description: "Different techniques to optimize the performance of Qdrant, including indexing strategies, query optimization, and hardware considerations. Use when you want to improve the speed and efficiency of your Qdrant deployment."
allowed-tools:
- Read
- Grep
- Glob
---

# Qdrant Performance Optimization

There are several aspects to Qdrant performance. This document serves as a navigation hub for the different areas of performance optimization in Qdrant.

## Search Speed Optimization

There are two different criteria for search speed: latency and throughput.
Latency is the time it takes to get a response for a single query, while throughput is the number of queries that can be processed in a given time frame.
Depending on your use case, you may want to optimize for one or both of these metrics.

More on search speed optimization can be found in the [Search Speed Optimization](search-speed-optimization/SKILL.md) skill.

## Indexing Performance Optimization

Qdrant needs to build a vector index to perform efficient similarity search. The time it takes to build the index can vary depending on the size of your dataset, hardware, and configuration.

More on indexing performance optimization can be found in the [Indexing Performance Optimization](indexing-performance-optimization/SKILL.md) skill.

## Memory Usage Optimization

Vector search can be memory intensive, especially when dealing with large datasets.
Qdrant has a flexible memory management system, which allows you to precisely control which parts of storage are kept in memory and which are stored on disk. This can help you optimize memory usage without sacrificing performance.

More on memory usage optimization can be found in the [Memory Usage Optimization](memory-usage-optimization/SKILL.md) skill.

@@ -0,0 +1,80 @@
---
name: qdrant-indexing-performance-optimization
description: "Diagnoses and fixes slow Qdrant indexing and data ingestion. Use when someone reports 'uploads are slow', 'indexing takes forever', 'optimizer is stuck', 'HNSW build time too long', or 'data uploaded but search is bad'. Also use when optimizer status shows errors, segments won't merge, or indexing threshold questions arise."
---

# What to Do When Qdrant Indexing Is Too Slow

Qdrant does NOT build HNSW indexes immediately. Small segments use brute-force search until they exceed `indexing_threshold_kb` (default: 20 MB). Slower search during this window is by design, not a bug.

- Understand the indexing optimizer: [Indexing optimizer](https://search.qdrant.tech/md/documentation/operations/optimizer/?s=indexing-optimizer)

## Uploads/Ingestion Too Slow

Use when: upload or upsert API calls are slow.
First identify the bottleneck: client-side (network, batching) vs server-side (CPU, disk I/O).

For client-side, optimize batching and parallelism:

- Use batch upserts (64-256 points per request) [Points API](https://search.qdrant.tech/md/documentation/manage-data/points/?s=upload-points)
- Use 2-4 parallel upload streams

For server-side, optimize Qdrant configuration and indexing strategy:

- Create more shards (3-12); each shard has an independent update worker [Sharding](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=sharding)
- Create payload indexes before HNSW builds (needed for the filterable vector index) [Payload index](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=payload-index)

Suitable for the initial bulk load of large datasets:

- Disable HNSW during bulk load (set `indexing_threshold_kb` very high, restore it after) [Collection params](https://search.qdrant.tech/md/documentation/manage-data/collections/?s=update-collection-parameters)
- Setting `m=0` to disable HNSW is legacy; use a high `indexing_threshold_kb` instead

Be careful: a fast unindexed upload might temporarily use more RAM and degrade search performance until the optimizer catches up.

See the [bulk upload tutorial](https://search.qdrant.tech/md/documentation/tutorials-develop/bulk-upload/).

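The client-side batching advice above can be sketched as follows. This is a simplified illustration: the `Point` shape and `batched` helper are hypothetical, and the actual upsert call depends on your client library (e.g. qdrant-client's `client.upsert`), ideally driven by 2-4 parallel workers:

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# Simplified point shape for illustration: (id, vector).
Point = Tuple[int, List[float]]

def batched(points: Iterable[Point], batch_size: int = 128) -> Iterator[List[Point]]:
    """Yield fixed-size batches so each upsert request carries 64-256 points."""
    it = iter(points)
    while batch := list(islice(it, batch_size)):
        yield batch

# Each yielded batch would be sent as one upsert request instead of
# one request per point, where per-request overhead dominates.
points = [(i, [0.0, 0.0]) for i in range(300)]
sizes = [len(b) for b in batched(points, batch_size=128)]
print(sizes)  # [128, 128, 44]
```
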
## Optimizer Stuck or Taking Too Long

Use when: the optimizer has been running for hours without finishing.

- Check actual progress via the optimizations endpoint (v1.17+) [Optimization monitoring](https://search.qdrant.tech/md/documentation/operations/optimizer/?s=optimization-monitoring)
- Large merges and HNSW rebuilds legitimately take hours on big datasets
- Check CPU and disk I/O (HNSW is CPU-bound, merging is I/O-bound, and an HDD is not viable)
- If `optimizer_status` shows an error, check the logs for a full disk or corrupted segments

## HNSW Build Time Too High

Use when: HNSW index build dominates total indexing time.

- Reduce `m` (default 16 is good for most cases; 32+ is rarely needed) [HNSW params](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=vector-index)
- Reduce `ef_construct` (100-200 is sufficient) [HNSW config](https://search.qdrant.tech/md/documentation/manage-data/collections/?s=indexing-vectors-in-hnsw)
- Keep `max_indexing_threads` proportional to CPU cores [Configuration](https://search.qdrant.tech/md/documentation/operations/configuration/)
- Use a GPU for indexing [GPU indexing](https://search.qdrant.tech/md/documentation/operations/running-with-gpu/)

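As an illustration, `m` and `ef_construct` can be changed per collection via the update-collection endpoint; a sketch of such a request body follows. The values are examples within the ranges suggested above, not recommendations for every workload:

```python
import json

# Illustrative body for updating a collection's HNSW config to lower
# build cost (sent to the update-collection endpoint).
hnsw_update = {
    "hnsw_config": {
        "m": 16,            # default; 32+ is rarely needed
        "ef_construct": 128 # 100-200 is usually sufficient
    }
}
body = json.dumps(hnsw_update)
print(body)
```
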
## HNSW Index for Multi-Tenant Collections

If you have a multi-tenant use case where all data is split by some payload field (e.g. `tenant_id`), you can avoid building a global HNSW index and instead rely on `payload_m` to build the HNSW index only for subsets of the data.
Skipping the global HNSW index can significantly reduce indexing time.

See [Multi-tenant collections](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/) for details.

## Additional Payload Indexes Are Too Slow

Qdrant builds extra HNSW links for all payload indexes to ensure that the quality of filtered vector search does not degrade.
Some payload indexes (e.g. `text` fields with long texts) can have a very high number of unique values per point, which can lead to long HNSW build times.

You can disable building extra HNSW links for specific payload indexes and instead rely on slightly slower query-time strategies like ACORN.

Read more about disabling extra HNSW links in the [documentation](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=disable-the-creation-of-extra-edges-for-payload-fields).

Read more about ACORN in the [documentation](https://search.qdrant.tech/md/documentation/search/search/?s=acorn-search-algorithm).

## What NOT to Do

- Do not create payload indexes AFTER HNSW is built (breaks the filterable vector index)
- Do not use `m=0` for bulk uploads into an existing collection; it might drop the existing HNSW index and cause long reindexing
- Do not upload one point at a time (per-request overhead dominates)

@@ -0,0 +1,67 @@
---
name: qdrant-memory-usage-optimization
description: "Diagnoses and reduces Qdrant memory usage. Use when someone reports 'memory too high', 'RAM keeps growing', 'node crashed', 'out of memory', 'memory leak', or asks 'why is memory usage so high?', 'how to reduce RAM?'. Also use when memory doesn't match calculations, quantization didn't help, or nodes crash during recovery."
---

# Understanding memory usage

Qdrant operates with two types of memory:

- Resident memory (aka `RssAnon`) - memory used for internal data structures like the ID tracker, plus components that must stay in RAM, such as quantized vectors when `always_ram=true` and payload indexes.
- OS page cache - memory used for caching disk reads, which can be released when needed. Original vectors are normally stored in page cache, so the service won't crash if RAM is full, but performance may degrade.

It is normal for the OS page cache to occupy all available RAM, but if resident memory is above 80% of total RAM, it is a sign of a problem.

## Memory usage monitoring

- Qdrant exposes memory usage through the `/metrics` endpoint. See [Monitoring docs](https://search.qdrant.tech/md/documentation/operations/monitoring/).

<!-- ToDo: Talk about memory usage of each component once the API is available -->

## How much memory is needed for Qdrant?

Optimal memory usage depends on the use case.

- For regular search scenarios, general guidelines are provided in the [Capacity planning docs](https://search.qdrant.tech/md/documentation/operations/capacity-planning/).

For a detailed breakdown of memory usage at large scale, see the [Large scale memory usage example](https://search.qdrant.tech/md/documentation/tutorials-operations/large-scale-search/?s=memory-usage).

Payload indexes and the HNSW graph also require memory, along with the vectors themselves, so it is important to include them in calculations.

Additionally, Qdrant requires some extra memory for optimizations. During optimization, optimized segments are fully loaded into RAM, so it is important to leave enough headroom.
The larger `max_segment_size` is, the more headroom is needed.

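A back-of-the-envelope sketch of the raw vector component of such a calculation. It deliberately ignores the HNSW graph, payload indexes, and optimizer headroom discussed above, so treat the result as a lower bound:

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Rough size of raw vector data (float32 by default: 4 bytes per value)."""
    return num_vectors * dim * bytes_per_value

# Example: 1M vectors of dimension 768.
n, dim = 1_000_000, 768
float32_bytes = raw_vector_bytes(n, dim)       # baseline
float16_bytes = raw_vector_bytes(n, dim, 2)    # 2x smaller datatype
int8_bytes = raw_vector_bytes(n, dim, 1)       # 4x smaller (int8 / scalar quantization)
print(float32_bytes // 2**20, float16_bytes // 2**20, int8_bytes // 2**20)  # 2929 1464 732 (MiB)
```
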
### When to put HNSW index on disk

Putting frequently used components (such as the HNSW index) on disk might cause significant performance degradation.
There are some scenarios, however, when it can be a good option:

- Deployments with low-latency disks - local NVMe or similar.
- Multi-tenant deployments, where only a subset of tenants is frequently accessed, so that only a fraction of the data & index is loaded in RAM at a time.
- Deployments with [inline storage](https://search.qdrant.tech/md/documentation/operations/optimize/?s=inline-storage-in-hnsw-index) enabled.

|
## How to minimize memory footprint
|
||||||
|
|
||||||
|
The main challenge is to put on disk those parts of data, which are rarely accessed.
|
||||||
|
Here are the main techniques to achieve that:
|
||||||
|
|
||||||
|
- Use quantization to store only compressed vectors in RAM [Quantization docs](https://search.qdrant.tech/md/documentation/manage-data/quantization/)
|
||||||
|
|
||||||
|
- Use float16 or int8 datatypes to reduce memory usage of vectors by 2x or 4x respectively, with some tradeoff in precision. Read more about vector datatypes in [documentation](https://search.qdrant.tech/md/documentation/manage-data/vectors/?s=datatypes)
|
||||||
|
|
||||||
|
- Leverage Matryoshka Representation Learning (MRL) to store only small vectors in RAM while keeping large vectors on disk. Examples of how to use MRL with Qdrant Cloud inference: [MRL docs](https://search.qdrant.tech/md/documentation/inference/?s=reduce-vector-dimensionality-with-matryoshka-models)
|
||||||
|
|
||||||
|
- For multi-tenant deployments with small tenants, vectors might be stored on disk because the same tenant's data is stored together [Multitenancy docs](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/?s=calibrate-performance)
|
||||||
|
|
||||||
|
- For deployments with fast local storage and relatively low requirements for search throughput, it may be possible to store all components of vector store on disk. Read more about the performance implications of on-disk storage in [the article](https://qdrant.tech/articles/memory-consumption/)
|
||||||
|
|
||||||
|
- For low RAM environments, consider `async_scorer` config, which enables support of `io_uring` for parallel disk access, which can significantly improve performance of on-disk storage. Read more about `async_scorer` in [the article](https://qdrant.tech/articles/io_uring/) (only available on Linux with kernel 5.11+)
|
||||||
|
|
||||||
|
- Consider storing Sparse Vectors and text payload on disk, as they are usually more disk-friendly than dense vectors.
|
||||||
|
- Configure payload indexes to be stored on disk [docs](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=on-disk-payload-index)
|
||||||
|
- Configure sparse vectors to be stored on disk [docs](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=sparse-vector-index)
|
||||||
|
|
||||||
@@ -0,0 +1,77 @@
---
name: qdrant-search-speed-optimization
description: "Diagnoses and fixes slow Qdrant search. Use when someone reports 'search is slow', 'high latency', 'queries take too long', 'low QPS', 'throughput too low', 'filtered search is slow', or 'search was fast but now it's slow'. Also use when search performance degrades after config changes or data growth."
---

# Diagnose the problem

There are multiple possible reasons for search performance degradation. The most common ones are:

* Memory pressure: the working set exceeds available RAM
* Complex requests (e.g. high `hnsw_ef`, complex filters without a payload index)
* Competing background processes (e.g. the optimizer still running after a bulk upload)
* Problems with the cluster (e.g. network issues, hardware degradation)

## Single Query Too Slow (Latency)

Use when: individual queries take too long regardless of load.

### Diagnostic steps:

- Check whether a second run of the same request is significantly faster (indicates memory pressure)
- Try the same query with `with_payload: false` and `with_vectors: false` to see if payload retrieval is the bottleneck
- If the request uses filters, remove them one by one to identify whether a specific filter condition is the bottleneck

### Common fixes:

- Tune HNSW parameters: [Fine-tuning search](https://search.qdrant.tech/md/documentation/operations/optimize/?s=fine-tuning-search-parameters)
- Enable in-memory quantization: [Scalar quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/?s=scalar-quantization)
- Reduce vector dimensionality with Matryoshka models: [Matryoshka Models](https://search.qdrant.tech/md/documentation/inference/?s=reduce-vector-dimensionality-with-matryoshka-models)
- Use oversampling + rescore for high-dimensional vectors [Search with quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/?s=searching-with-quantization)
- Enable io_uring for disk-heavy workloads on Linux [io_uring](https://qdrant.tech/articles/io_uring/)

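Several of the fixes above are query-time parameters. A sketch of a query request body combining them; the query vector and values are placeholders, and the right trade-offs depend on your recall requirements:

```python
# Illustrative body for a points query combining lower hnsw_ef with
# quantization oversampling + rescore (see the linked search-parameter docs).
query_body = {
    "query": [0.1, 0.2, 0.3],      # placeholder query vector
    "limit": 10,
    "params": {
        "hnsw_ef": 64,             # trade a little recall for lower latency
        "quantization": {
            "rescore": True,       # re-score candidates with original vectors
            "oversampling": 2.0,   # fetch 2x candidates before rescoring
        },
    },
}
print(query_body["params"]["hnsw_ef"])
```
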
## Can't Handle Enough QPS (Throughput)

Use when: the system can't serve enough queries per second under load.

- Reduce segment count (`default_segment_number` to 2) [Maximizing throughput](https://search.qdrant.tech/md/documentation/operations/optimize/?s=maximizing-throughput)
- Use the batch search API instead of single queries [Batch search](https://search.qdrant.tech/md/documentation/search/search/?s=batch-search-api)
- Enable quantization to reduce CPU cost [Scalar quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/?s=scalar-quantization)
- Add replicas to distribute read load [Replication](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=replication)

## Filtered Search Is Slow

Use when: filtered search is significantly slower than unfiltered. This is the most common complaint after memory pressure.

- Create a payload index on the filtered field [Payload index](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=payload-index)
- Use `is_tenant=true` for the primary filtering condition: [Tenant index](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=tenant-index)
- Try the ACORN algorithm for complex filters: [ACORN](https://search.qdrant.tech/md/documentation/search/search/?s=acorn-search-algorithm)
- Avoid using `nested` filtering conditions as a primary filter; they might force Qdrant to read raw payload values instead of using the index.
- If the payload index was added after the HNSW build, trigger a re-index to create filterable subgraph links

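A sketch of a payload index creation request combining the first two points above. The field name is hypothetical; `is_tenant` is only meaningful for the field used as the primary tenant filter:

```python
# Illustrative body for creating a keyword payload index on the filtered
# field; is_tenant marks it as the primary tenant filtering condition.
index_body = {
    "field_name": "tenant_id",   # hypothetical filtered field
    "field_schema": {
        "type": "keyword",
        "is_tenant": True,
    },
}
print(index_body["field_name"])
```
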
## Optimize search performance with parallel updates

### Diagnostic steps

- Run the same query with the `indexed_only=true` parameter; if it is significantly faster, the optimizer is still running and has not yet indexed all segments.
- If CPU or I/O usage is high even with no queries, it also indicates that the optimizer is still running.

### Recommended configuration changes

- Reduce `optimizer_cpu_budget` to reserve more CPU for queries
- Use `prevent_unoptimized=true` to prevent creating segments with a large amount of unindexed data. Instead, once a segment reaches the so-called `indexing_threshold`, additional points are added in a 'deferred' state.

Learn more [here](https://search.qdrant.tech/md/documentation/search/low-latency-search/?s=query-indexed-data-only)

## What NOT to Do

- Set `always_ram=false` on quantization (disk thrashing on every search)
- Put HNSW on disk for latency-sensitive production (only for cold storage)
- Increase segment count for throughput (the opposite: fewer = better)
- Create payload indexes on every field (wastes memory)
- Blame Qdrant before checking optimizer status

51
skills/qdrant-scaling/SKILL.md
Normal file
@@ -0,0 +1,51 @@
---
name: qdrant-scaling
description: "Guides Qdrant scaling decisions. Use when someone asks 'how many nodes do I need', 'data doesn't fit on one node', 'need more throughput', 'cluster is slow', 'too many tenants', 'vertical or horizontal', 'how to shard', or 'need to add capacity'."
allowed-tools:
- Read
- Grep
- Glob
---

# Qdrant Scaling

First determine what you're scaling for:

- data volume
- query throughput (QPS)
- query latency
- query volume

After determining the scaling goal, we can choose a scaling strategy based on tradeoffs and assumptions.
Each goal pulls toward different strategies. Scaling for throughput and scaling for latency are opposite tuning directions.

## Scaling Data Volume

This becomes relevant when the volume of the dataset exceeds the capacity of a single node.
Read more about scaling for data volume in [Scaling Data Volume](scaling-data-volume/SKILL.md)

## Scaling for Query Throughput

If your system needs to handle more parallel queries than a single node can serve,
then you need to scale for query throughput.

Read more about scaling for query throughput in [Scaling for Query Throughput](scaling-qps/SKILL.md)

## Scaling for Query Latency

The latency of a single query is determined by the slowest component in the query execution path.
It is sometimes correlated with throughput, but not always, and it might require different strategies for scaling.

Read more about scaling for query latency in [Scaling for Query Latency](minimize-latency/SKILL.md)

## Scaling for Query Volume

By query volume we mean the number of results that a single query returns.
If the query volume is too high, it can cause performance issues and increase latency.

Tuning for query volume might require special strategies.

Read more about scaling for query volume in [Scaling for Query Volume](scaling-query-volume/SKILL.md)

41
skills/qdrant-scaling/minimize-latency/SKILL.md
Normal file
@@ -0,0 +1,41 @@
---
name: qdrant-minimize-latency
description: "Guides Qdrant query latency optimization. Use when someone asks 'search is slow', 'how to reduce latency', 'p99 is too high', 'tail latency', 'single query too slow', 'how to make search faster', or 'latency spikes'."
---

# Scaling for Query Latency

The latency of a single query is determined by the slowest component in the query execution path. It is sometimes correlated with throughput, but not always: throughput and latency are opposite tuning directions.

Low-latency optimization aims at maximum resource saturation for a single query, while throughput optimization aims at minimizing per-query resource usage to allow more parallel queries.

## Performance Tuning for Lower Latency

- Increase segment count to match CPU cores (`default_segment_number: 16`) [Minimizing latency](https://search.qdrant.tech/md/documentation/operations/optimize/?s=minimizing-latency)
- Keep quantized vectors and HNSW in RAM (`always_ram=true`)
- Reduce `hnsw_ef` at query time (trade recall for speed) [Search params](https://search.qdrant.tech/md/documentation/operations/optimize/?s=fine-tuning-search-parameters)
- Use local NVMe; avoid network-attached storage

## Memory Pressure and Latency

RAM is the most critical resource for latency. If the working set exceeds available RAM, OS cache eviction causes severe, sustained latency degradation.

- Scale RAM vertically first. Critical if the working set exceeds 80% of RAM.
- Use quantization: scalar (4x reduction) or binary (16x reduction) [Quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/)
- Move payload indexes to disk if filtering is infrequent [On-disk payload index](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=on-disk-payload-index)
- Set `optimizer_cpu_budget` to limit background optimization CPUs
- Schedule indexing: set a high `indexing_threshold` during peak hours

## Vertical Scaling for Latency

More RAM and a faster CPU directly reduce latency. See [Vertical Scaling](../scaling-data-volume/vertical-scaling/SKILL.md) for node sizing guidelines.

## What NOT to Do

- Do not expect to optimize latency and throughput simultaneously on the same node
- Do not use few large segments for latency-sensitive workloads (each segment takes longer to search)
- Do not run at >90% RAM (cache eviction causes severe latency degradation that can last days)
- Do not ignore optimizer status during performance debugging
- Do not scale down RAM without load testing (cache eviction causes days-long latency incidents)

49
skills/qdrant-scaling/scaling-data-volume/SKILL.md
Normal file
@@ -0,0 +1,49 @@
---
name: qdrant-scaling-data-volume
description: "Guides Qdrant data volume scaling decisions. Use when someone asks 'data doesn't fit on one node', 'too much data', 'need more storage', 'vertical or horizontal scaling', 'tenant scaling', 'time window rotation', or 'data growth exceeds capacity'."
allowed-tools:
- Read
- Grep
- Glob
---

# Scaling Data Volume

This document covers data volume scaling scenarios,
where the total size of the dataset exceeds the capacity of a single node.

## Tenant Scaling

If the use case is multi-tenant, meaning that each user only has access to a subset of the data,
and we never need to query across all the data, then we can use multi-tenancy patterns to scale.

The recommended way is to use multi-tenant workloads with payload partitioning, per-tenant indexes, and tiered multitenancy.

Learn more: [Tenant Scaling](tenant-scaling/SKILL.md)

## Sliding Time Window

Some use cases are based on a sliding time window, where only the most recent data is relevant.
For example, an index for social media posts, where only the last 6 months of data require fast search.

Learn more: [Sliding Time Window](sliding-time-window/SKILL.md)

## Global Search

Most general use cases require global search across all data.
In these situations, we might need to fall back to vertical scaling,
and then horizontal scaling when we reach the limits of vertical scaling.

### Vertical Scaling

When data doesn't fit in a single node, the first approach is to scale the node itself: more RAM, a better disk, quantization, mmap.
Exhaust vertical options before going horizontal, as horizontal scaling adds permanent operational complexity.

Learn more: [Vertical Scaling](vertical-scaling/SKILL.md)

### Horizontal Scaling

When a single node can't hold the data even with quantization and mmap, distribute data across multiple nodes via sharding.

Learn more: [Horizontal Scaling](horizontal-scaling/SKILL.md)

@@ -0,0 +1,47 @@
|
|||||||
|
---
|
||||||
|
name: qdrant-horizontal-scaling
|
||||||
|
description: "Diagnoses and guides Qdrant horizontal scaling decisions. Use when someone asks 'vertical or horizontal?', 'how many nodes?', 'how many shards?', 'how to add nodes', 'resharding', 'data doesn't fit', or 'need more capacity'. Also use when data growth outpaces current deployment."
|
||||||
|
---
|
||||||
|
|
||||||
|
# What to Do When Qdrant Needs More Capacity
|
||||||
|
|
||||||
|
Vertical first: simpler operations, no network overhead, good up to ~100M vectors per node depending on dimensions and quantization. Horizontal when: data exceeds single node capacity, need fault tolerance, need to isolate tenants, or IOPS-bound (more nodes = more independent IOPS).
|
||||||
|
|
||||||
|
## Most basic distributed configuration
|
||||||
|
|
||||||
|
- 3 nodes, 3 shards with `replication_factor: 2` for zero-downtime scaling
|
||||||
|
|
||||||
|
Minimum of 3 nodes is important for consensus and fault tolerance. With 3 nodes, you can lose 1 node without downtime. With 2 nodes, losing 1 node causes downtime for collection operations.
|
||||||
|
Replication factor of 2 means each shard has 1 replica, so you have 2 copies of data. This allows for zero-downtime scaling and maintenance. With `replication_factor: 1`, zero-downtime is not guaranteed even for point-level operations, and cluster maintenance requires downtime.
## Choosing number of shards

Shards are the unit of data distribution.
More shards allow more nodes and better distribution, but add overhead. Fewer shards reduce overhead but limit horizontal scaling.

For a cluster of 3-6 nodes the recommended shard count is 6-12.
This allows 2-4 shards per node, which balances distribution and overhead.
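The recommendations above can be sketched with the Python client; the collection name and vector size are placeholders:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# 6 shards on a 3-node cluster = 2 shards per node, each shard
# with one replica for zero-downtime scaling and maintenance.
client.create_collection(
    collection_name="documents",  # placeholder name
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    shard_number=6,
    replication_factor=2,
)
```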
## Changing number of shards

Use when: shard count isn't evenly divisible by node count, causing uneven distribution, or you need to rebalance.

Resharding is expensive and time-consuming; use it as a last resort when regular data distribution is not possible.

Resharding is designed to be transparent for user operations: updates and searches should still work during resharding with a small performance impact.
But the resharding operation itself is time-consuming and requires moving large amounts of data between nodes.

- Available in Qdrant Cloud: [Resharding](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=resharding)
- Not available for self-hosted deployments

Better alternatives: over-provision shards initially, or spin up a new cluster with the correct config and migrate the data.

## What NOT to Do

- Do not jump to horizontal before exhausting vertical (adds complexity for no gain)
- Do not set a `shard_number` that isn't a multiple of the node count (uneven distribution)
- Do not use `replication_factor: 1` in production if you need fault tolerance
- Do not add nodes without rebalancing shards (use the shard move API to redistribute)
- Do not scale down RAM without load testing (cache eviction causes days-long latency incidents)
- Do not hit the collection limit by using one collection per tenant (use payload partitioning)
@@ -0,0 +1,68 @@

---
name: qdrant-sliding-time-window
description: "Guides sliding time window scaling in Qdrant. Use when someone asks 'only recent data matters', 'how to expire old vectors', 'time-based data rotation', 'delete old data efficiently', 'social media feed search', 'news search', 'log search with retention', or 'how to keep only last N months of data'."
---

# Scaling with a Sliding Time Window

Use when only recent data needs fast search: social media posts, news articles, support tickets, logs, job listings. Old data either becomes irrelevant or can tolerate slower access.

Three strategies: **shard rotation** (recommended), **collection rotation** (when per-period config differs), and **filter-and-delete** (simplest, for continuous cleanup).

## Shard Rotation (Recommended)

Use when: data has natural time boundaries (daily, weekly, monthly). Preferred because queries span all time periods in one request without application-level fan-out. [User-defined sharding](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=user-defined-sharding)

1. Create a collection with user-defined sharding enabled
2. Create one shard key per time period (e.g., `2025-01`, `2025-02`, ..., `2025-06`)
3. Ingest data into the current period's shard key
4. When a new period starts, create a new shard key and redirect writes
5. Delete the oldest shard key outside the retention window

- Deleting a shard key reclaims all resources instantly (no fragmentation, no optimizer overhead)
- Pre-create the next period's shard key before rotation to avoid write disruption
- Use `shard_key_selector` at query time to search only specific periods for efficiency
- Shard keys can be placed on specific nodes for hot/cold tiering
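The rotation steps above can be sketched with the Python client; the collection name, vector size, and period keys are illustrative:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# 1. Collection with user-defined sharding
client.create_collection(
    collection_name="posts",  # placeholder name
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    sharding_method=models.ShardingMethod.CUSTOM,
)

# 2 & 4. One shard key per period; pre-create the next period before rotation
client.create_shard_key("posts", "2025-06")
client.create_shard_key("posts", "2025-07")

# 3. Writes go to the current period's shard key
client.upsert(
    collection_name="posts",
    points=[models.PointStruct(id=1, vector=[0.0] * 768)],
    shard_key_selector="2025-06",
)

# 5. Dropping the oldest key reclaims its resources instantly
client.delete_shard_key("posts", "2025-01")
```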
## Collection Rotation (Alias Swap)

Use when: you need per-period collection configuration (e.g., different quantization or storage settings). [Collection aliases](https://search.qdrant.tech/md/documentation/manage-data/collections/?s=collection-aliases)

1. Create one collection per time period and point a write alias at the newest
2. Query across all active collections in parallel, merging results client-side
3. When a new period starts, create the new collection and swap the write alias [Switch collection](https://search.qdrant.tech/md/documentation/manage-data/collections/?s=switch-collection)
4. Drop the oldest collection outside the window

Trade-off vs shard rotation: allows per-collection config differences, but requires application-level fan-out and more operational overhead.
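The alias swap can be sketched as follows; the collection and alias names are placeholders:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Atomically repoint the write alias at the new period's collection.
client.update_collection_aliases(
    change_aliases_operations=[
        models.DeleteAliasOperation(
            delete_alias=models.DeleteAlias(alias_name="posts-write")
        ),
        models.CreateAliasOperation(
            create_alias=models.CreateAlias(
                collection_name="posts-2025-07",  # new period's collection
                alias_name="posts-write",
            )
        ),
    ]
)

# Drop the period that fell out of the retention window.
client.delete_collection("posts-2025-01")
```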
## Filter-and-Delete

Use when: data arrives continuously without clear time boundaries, or you want the simplest setup.

1. Store a `timestamp` payload on every point and create a payload index on it [Payload index](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=payload-index)
2. Filter to the desired window at query time using a `range` condition [Range filter](https://search.qdrant.tech/md/documentation/search/filtering/?s=range)
3. Periodically delete expired points using delete-by-filter [Delete points](https://search.qdrant.tech/md/documentation/manage-data/points/?s=delete-points)

- Run cleanup during off-peak hours in batches (10k-50k points) to avoid optimizer locks
- Deletes are not free: tombstoned points degrade search until the optimizer compacts segments
- Disk is not reclaimed instantly (compaction is asynchronous)
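The three steps above can be sketched like this; the collection name, vector size, and 90-day retention are illustrative, and the timestamp is stored as epoch seconds:

```python
import time

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# 1. Index the timestamp field so range filters don't cause full scans
client.create_payload_index(
    collection_name="logs",  # placeholder name
    field_name="timestamp",
    field_schema=models.PayloadSchemaType.INTEGER,  # epoch seconds
)

cutoff = int(time.time()) - 90 * 24 * 3600  # 90-day retention window

# 2. Query only within the retention window
client.query_points(
    collection_name="logs",
    query=[0.0] * 768,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="timestamp", range=models.Range(gte=cutoff))]
    ),
    limit=10,
)

# 3. Periodically delete everything older than the window
client.delete(
    collection_name="logs",
    points_selector=models.FilterSelector(
        filter=models.Filter(
            must=[models.FieldCondition(key="timestamp", range=models.Range(lt=cutoff))]
        )
    ),
)
```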
## Hot/Cold Tiers

Use when: recent data needs fast in-RAM search and older data should remain searchable at lower performance.

- **Shard rotation:** place the current shard key on fast-storage nodes and move older shard keys to cheaper nodes via shard placement. All queries still go through a single collection.
- **Collection rotation:** keep the current collection in RAM (`always_ram: true`) and move older collections to mmap/on-disk vectors. [Quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/)

## What NOT to Do

- Do not use filter-and-delete for high-volume time series with millions of daily deletes (use rotation instead)
- Do not forget to index the timestamp field (range filters without an index cause full scans)
- Do not use collection rotation when shard rotation would suffice (unnecessary fan-out complexity)
- Do not drop a shard key or collection before verifying its period is fully outside the retention window
- Do not skip pre-creating the next period's shard key or collection (write failures during rotation are hard to recover)
@@ -0,0 +1,44 @@

---
name: qdrant-tenant-scaling
description: "Guides Qdrant multi-tenant scaling. Use when someone asks 'how to scale tenants', 'one collection per tenant?', 'tenant isolation', 'dedicated shards', or reports tenant performance issues. Also use when multi-tenant workloads outgrow shared infrastructure."
---

# What to Do When Scaling Multi-Tenant Qdrant

Do not create one collection per tenant. It does not scale past a few hundred tenants and wastes resources. One company hit the 1000-collection limit after a year of collection-per-repo and had to migrate to payload partitioning. Use a shared collection with a tenant key.

- Understand multitenancy patterns: [Multitenancy](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/)

Here is a short summary of the patterns:

## Around 10k Tenants

Use the default multitenancy strategy via payload filtering.

Read about [Partition by payload](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/?s=partition-by-payload) and [Calibrate performance](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/?s=calibrate-performance) for best practices on indexing and query performance.
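Payload partitioning can be sketched as follows, assuming the client supports `is_tenant` on keyword indexes (v1.11+); the collection name, tenant field, and tenant value are placeholders:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Mark the tenant field as a tenant index so Qdrant co-locates
# each tenant's data on disk for fast sequential reads.
client.create_payload_index(
    collection_name="docs",  # placeholder name
    field_name="tenant_id",
    field_schema=models.KeywordIndexParams(
        type=models.KeywordIndexType.KEYWORD,
        is_tenant=True,
    ),
)

# Every query filters by the tenant key.
client.query_points(
    collection_name="docs",
    query=[0.0] * 768,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="tenant_id", match=models.MatchValue(value="acme"))]
    ),
    limit=10,
)
```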
## Around 100k Tenants or More

At this scale, the cluster may consist of several peers.
To localize tenant data and improve performance, use [custom sharding](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=user-defined-sharding) to assign tenants to specific shards based on a tenant ID hash.
This localizes tenant requests to specific nodes instead of broadcasting them to all nodes, improving performance and reducing load on each node.

## Unevenly Sized Tenants

If some tenants are much larger than others, use [tiered multitenancy](https://search.qdrant.tech/md/documentation/manage-data/multitenancy/?s=tiered-multitenancy) to promote large tenants to dedicated shards while keeping small tenants on shared shards. This optimizes resource allocation and performance for tenants of varying sizes.

## Need Strict Tenant Isolation

Use when: legal/compliance requirements demand per-tenant encryption or stricter isolation than payload filtering provides.

- Multiple collections may be necessary for per-tenant encryption keys
- Limit the collection count and use payload filtering within each collection
- This is the exception, not the default; only use it when compliance requires it

## What NOT to Do

- Do not create one collection per tenant without compliance justification (does not scale past hundreds)
- Do not skip `is_tenant=true` on the tenant index (kills sequential read performance)
- Do not build a global HNSW graph for multi-tenant collections (wasteful; use `payload_m` instead)
@@ -0,0 +1,69 @@

---
name: qdrant-vertical-scaling
description: "Guides Qdrant vertical scaling decisions. Use when someone asks 'how to scale up a node', 'need more RAM', 'upgrade node size', 'vertical scaling', 'resize cluster', 'scale up vs scale out', or when memory/CPU is insufficient on current nodes. Also use when someone wants to avoid the complexity of horizontal scaling."
---

# What to Do When Qdrant Needs to Scale Vertically

Vertical scaling means increasing CPU, RAM, or disk on existing nodes rather than adding more nodes. This is the recommended first step before considering horizontal scaling. Vertical scaling is simpler, avoids distributed-system complexity, and is reversible.

- Vertical scaling for Qdrant Cloud is done through the [Qdrant Cloud Console](https://cloud.qdrant.io/)
- For self-hosted deployments, resize the underlying VM or container resources

## When to Scale Vertically

Use when: current node resources (RAM, CPU, disk) are insufficient, but the workload doesn't yet require distribution.

- RAM usage approaching 80% of available memory (OS page cache eviction starts, causing severe performance degradation)
- CPU saturation during query serving or indexing
- Disk space running low for on-disk vectors and payloads
- A single node can handle up to ~100M vectors depending on dimensions and quantization
- Non-production workloads that tolerate a single point of failure and don't require high availability
## How to Scale Vertically in Qdrant Cloud

Vertical scaling is managed through the Qdrant Cloud Console.

- Log into the [Qdrant Cloud Console](https://cloud.qdrant.io/) or use the [CLI tool](https://github.com/qdrant/qcloud-cli)
- Select the cluster to resize
- Choose a larger node configuration (more RAM, CPU, or both)
- The upgrade is a rolling restart with no downtime if replication is configured
- Ensure `replication_factor: 2` or higher before resizing to maintain availability during the rolling restart

**Important:** Scaling up is straightforward. Scaling down requires care: if the working set no longer fits in RAM after downsizing, performance will degrade severely due to cache eviction. Always load test before scaling down.

## RAM Sizing Guidelines

RAM is the most critical resource for Qdrant performance. Use these guidelines to right-size.

- Exact estimation of RAM usage is difficult; use this approximate formula: `num_vectors * dimensions * 4 bytes * 1.5` for full-precision vectors in RAM
- With scalar quantization: divide by 4 (INT8 reduces each float32 to 1 byte) [Quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/)
- With binary quantization: divide by 32 [Binary quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/?s=binary-quantization)
- Add overhead for the HNSW index (~20-30% of vector data), payload indexes, and WAL
- Reserve 20% headroom for optimizer operations and OS cache
- Monitor actual usage via Grafana/Prometheus before and after resizing [Monitoring](../../../qdrant-monitoring/SKILL.md)
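The rule-of-thumb formula above can be turned into a small helper; the 1.5 multiplier and the quantization divisors come straight from the guidelines, so treat the result as a rough estimate, not a capacity guarantee:

```python
def estimate_ram_gb(num_vectors: int, dimensions: int, quantization: str = "none") -> float:
    """Approximate RAM for in-memory vectors: raw float32 size * 1.5 overhead,
    divided by 4 for scalar (INT8) or 32 for binary quantization."""
    divisor = {"none": 1, "scalar": 4, "binary": 32}[quantization]
    bytes_needed = num_vectors * dimensions * 4 * 1.5 / divisor
    return bytes_needed / 1024**3

# 10M vectors of 768 dims: ~43 GB full precision, ~10.7 GB with scalar quantization
full = estimate_ram_gb(10_000_000, 768)
scalar = estimate_ram_gb(10_000_000, 768, "scalar")
```

Remember to add the HNSW (~20-30%) and headroom (~20%) overheads from the list above on top of this figure.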
## When Vertical Scaling Is No Longer Enough

Recognize these signals that it's time to go horizontal:

- Data volume exceeds what a single node can hold even with quantization and mmap
- IOPS are saturated (more nodes = more independent disk I/O)
- You need fault tolerance (requires replication across nodes)
- You need tenant isolation via dedicated shards
- Single-node CPU is maxed and query latency is unacceptable
- The next vertical step is the largest available node size. You may need to temporarily scale up to a larger node for batch operations or recovery; if you are already at the largest size, that option is gone.

When you hit these limits, see [Horizontal Scaling](../horizontal-scaling/SKILL.md) for guidance on sharding and node planning.

## What NOT to Do

- Do not scale down RAM without load testing first (cache eviction = severe latency degradation that can last days)
- Do not ignore the 80% RAM threshold (it's a performance cliff, not gradual degradation)
- Do not skip replication before resizing in Cloud (a rolling restart without replicas = downtime)
- Do not jump to horizontal scaling before exhausting vertical options (adds permanent operational complexity)
- Do not assume more CPU always helps (IOPS-bound workloads won't improve with more cores)
56
skills/qdrant-scaling/scaling-qps/SKILL.md
Normal file
@@ -0,0 +1,56 @@

---
name: qdrant-scaling-qps
description: "Guides Qdrant query throughput (QPS) scaling. Use when someone asks 'how to increase QPS', 'need more throughput', 'queries per second too low', 'batch search', 'read replicas', or 'how to handle more concurrent queries'."
---

# Scaling for Query Throughput (QPS)

Throughput scaling means handling more parallel queries per second.
This is different from latency: throughput and latency are opposite tuning directions and cannot be optimized simultaneously on the same node.

High throughput favors fewer, larger segments so each query incurs less overhead.

## Performance Tuning for Higher RPS

- Use fewer, larger segments (`default_segment_number: 2`) [Maximizing throughput](https://search.qdrant.tech/md/documentation/operations/optimize/?s=maximizing-throughput)
- Enable quantization with `always_ram=true` to reduce disk I/O [Quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/)
- Use the batch search API to amortize overhead [Batch search](https://search.qdrant.tech/md/documentation/search/search/?s=batch-search-api)
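The three tuning points above can be sketched with the Python client; the collection name, vector size, and query vectors are illustrative:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Fewer, larger segments + quantized vectors pinned in RAM favor throughput.
client.create_collection(
    collection_name="docs",  # placeholder name
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    optimizers_config=models.OptimizersConfigDiff(default_segment_number=2),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,
        )
    ),
)

# Batch several queries into one request to amortize per-request overhead.
client.query_batch_points(
    collection_name="docs",
    requests=[
        models.QueryRequest(query=[0.0] * 768, limit=10),
        models.QueryRequest(query=[0.1] * 768, limit=10),
    ],
)
```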
## Minimizing the Impact of Update Workloads

- Configure update throughput control (v1.17+) to prevent unoptimized searches from degrading reads [Low latency search](https://search.qdrant.tech/md/documentation/search/low-latency-search/)
- Set `optimizer_cpu_budget` to limit indexing CPUs (e.g. `2` on an 8-CPU node reserves 6 for queries)
- Configure delayed read fan-out (v1.17+) for tail latency [Delayed fan-outs](https://search.qdrant.tech/md/documentation/search/low-latency-search/?s=use-delayed-fan-outs)

## Horizontal Scaling for Throughput

If a single node is still CPU-saturated after applying the tuning above, scale horizontally with read replicas.

- Shard replicas serve queries from replicated shards, distributing read load across nodes
- Each replica adds independent query capacity without re-sharding
- Use `replication_factor: 2+` and route reads to replicas [Distributed deployment](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=replication)

See also [Horizontal Scaling](../scaling-data-volume/horizontal-scaling/SKILL.md) for general horizontal scaling guidance.

## Disk I/O Bottlenecks

If it is not possible to keep all vectors in RAM, disk I/O can become the throughput bottleneck.
In this case:

- Upgrade to provisioned IOPS or local NVMe first. See the impact of disk performance on vector search in the [Disk performance article](https://qdrant.tech/articles/memory-consumption/)
- Use `io_uring` on Linux (kernel 5.11+) [io_uring article](https://qdrant.tech/articles/io_uring/)
- With quantized vectors, prefer global rescoring over per-segment rescoring to reduce disk reads. Example in the [tutorial](https://search.qdrant.tech/md/documentation/tutorials-operations/large-scale-search/?s=search-query)
- Configure a higher number of search threads to parallelize disk reads. The default is `cpu_count - 1`, which is optimal for RAM-based search but may be too low for disk-based search. See the [configuration reference](https://search.qdrant.tech/md/documentation/operations/configuration/?s=configuration-options)
- If still saturated, scale out horizontally (each node adds independent IOPS)

## What NOT to Do

- Do not expect to optimize throughput and latency simultaneously on the same node
- Do not use many small segments for throughput workloads (increases per-query overhead)
- Do not scale horizontally when IOPS-bound without also upgrading the disk tier
- Do not run at >90% RAM (OS cache eviction = severe performance degradation)
23
skills/qdrant-scaling/scaling-query-volume/SKILL.md
Normal file
@@ -0,0 +1,23 @@

---
name: qdrant-scaling-query-volume
description: "Guides Qdrant query volume scaling. Use when someone asks 'query returns too many results', 'scroll performance', 'large limit values', 'paginating search results', 'fetching many vectors', or 'high cardinality results'."
---

# Scaling for Query Volume

Problem: when a query has a large limit (e.g. 1000) and there are multiple shards (e.g. 10), naively each shard must return the full 1000 results, totaling 10,000 scored points transferred and merged. This is wasteful, since data is randomly distributed across auto-shards.

## Core idea

Instead of asking every shard for the full limit, ask each shard for a smaller limit computed via Poisson distribution statistics, then merge. This is safe because auto-sharding guarantees random, independent data distribution.

## When it activates

- More than 1 shard
- Auto-sharding is in use (all queried shards share the same shard key)
- The request's limit + offset >= SHARD_QUERY_SUBSAMPLING_LIMIT (128)
- The query is not exact

## Key tradeoff

The strategy trades a small probability of slightly incomplete results for a large reduction in inter-shard data transfer, especially for high-limit queries across many shards. The 1.2x safety factor and the 99.9% Poisson threshold keep the error rate very low, comparable to inaccuracies already introduced by approximate vector indices like HNSW.
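The per-shard limit computation can be illustrated with a self-contained sketch. The Poisson-quantile approach, the 1.2x safety factor, and the 99.9% threshold follow the description above; the exact formula in Qdrant's implementation may differ:

```python
import math

def poisson_quantile(mean: float, p: float = 0.999) -> int:
    """Smallest k such that P(X <= k) >= p for X ~ Poisson(mean)."""
    log_term = -mean  # log P(X = 0)
    log_cdf = log_term
    k = 0
    # Accumulate the CDF in log space to avoid underflow for large means.
    while math.exp(log_cdf) < p:
        k += 1
        log_term += math.log(mean / k)
        log_cdf = log_cdf + math.log1p(math.exp(log_term - log_cdf))
    return k

def per_shard_limit(limit: int, num_shards: int, safety: float = 1.2) -> int:
    """Subsampled limit to request from each shard instead of the full limit."""
    expected = limit / num_shards          # mean hits per shard if uniform
    quantile = poisson_quantile(expected)  # covers 99.9% of random fluctuations
    return min(limit, math.ceil(quantile * safety))

# 10 shards, limit 1000: each shard is asked for far fewer than 1000 points.
sub = per_shard_limit(1000, 10)
```

With 10 shards and limit 1000, each shard returns on the order of 150-160 points instead of 1000, cutting inter-shard transfer by roughly a factor of six while keeping the chance of a missing result tiny.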
24
skills/qdrant-search-quality/SKILL.md
Normal file
@@ -0,0 +1,24 @@

---
name: qdrant-search-quality
description: "Diagnoses and improves Qdrant search relevance. Use when someone reports 'search results are bad', 'wrong results', 'low precision', 'low recall', 'irrelevant matches', 'missing expected results', or asks 'how to improve search quality?', 'which embedding model?', 'should I use hybrid search?', 'should I use reranking?'. Also use when search quality degrades after quantization, model change, or data growth."
allowed-tools:
- Read
- Grep
- Glob
---

# Qdrant Search Quality

First determine whether the problem is the embedding model, the Qdrant configuration, or the query strategy. Most quality issues come from the model or the data, not from Qdrant itself. If search quality is low, inspect how chunks are being passed to Qdrant before tuning any parameters. Splitting mid-sentence can drop quality by 30-40%.

- Start by testing with exact search to isolate the problem [Search API](https://search.qdrant.tech/md/documentation/search/search/?s=search-api)

## Diagnosis and Tuning

Isolate the source of quality issues, tune HNSW parameters, and choose the right embedding model. [Diagnosis and Tuning](diagnosis/SKILL.md)

## Search Strategies

Hybrid search, reranking, relevance feedback, and exploration APIs for improving result quality. [Search Strategies](search-strategies/SKILL.md)
53
skills/qdrant-search-quality/diagnosis/SKILL.md
Normal file
@@ -0,0 +1,53 @@

---
name: qdrant-search-quality-diagnosis
description: "Diagnoses Qdrant search quality issues. Use when someone reports 'results are bad', 'wrong results', 'not relevant results', 'missing matches', 'recall is low', 'approximate search worse than exact', 'which embedding model', or 'quality dropped after quantization'. Also use when search quality degrades without obvious changes."
---

# How to Diagnose Bad Search Quality

Before tuning, establish baselines. Use exact KNN as ground truth and compare against approximate HNSW. Target >95% recall@K for production.

## Don't Know What's Wrong Yet

Use when: results are irrelevant or missing expected matches and you need to isolate the cause.

- Test with `exact=true` to bypass HNSW approximation [Search API](https://search.qdrant.tech/md/documentation/tutorials-search-engineering/retrieval-quality/?s=standard-mode-vs-exact-search)
- Exact search bad = model or search pipeline problem. Exact good, approximate bad = tune HNSW.
- Check whether quantization degrades quality (compare with and without)
- Check whether filters are too restrictive (then you might need ACORN)
- If you get duplicate results from chunked documents, use the Grouping API to deduplicate [Grouping](https://search.qdrant.tech/md/documentation/search/search/?s=grouping-api)

Payload filtering and sparse vector search are different things. Metadata (dates, categories, tags) goes in payload for filtering. Text content goes in sparse vectors for search.
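The exact-vs-approximate baseline check can be sketched as follows. Recall@K only needs the two result ID lists, which in practice would come from `query_points` calls with `exact=True` and `exact=False`; the sample IDs here are made up:

```python
def recall_at_k(exact_ids: list, approx_ids: list, k: int = 10) -> float:
    """Fraction of the exact (ground truth) top-k that approximate search also returned."""
    truth = set(exact_ids[:k])
    return len(truth & set(approx_ids[:k])) / len(truth)

# IDs as they might come back from an exact search and an HNSW search.
exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
approx = [1, 2, 3, 5, 4, 7, 6, 11, 9, 12]

score = recall_at_k(exact, approx, k=10)  # 8 of 10 ground-truth IDs found -> 0.8
```

Average this over a few hundred representative queries; below ~0.95 on production traffic, tune HNSW as described in the next section.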
## Approximate Search Worse Than Exact

Use when: exact search returns good results but HNSW approximation misses them.

- Increase `hnsw_ef` at query time [Search params](https://search.qdrant.tech/md/documentation/operations/optimize/?s=fine-tuning-search-parameters)
- Increase `ef_construct` (200+ for high quality) [HNSW config](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=vector-index)
- Increase `m` (16 default, 32 for high recall) [HNSW config](https://search.qdrant.tech/md/documentation/manage-data/indexing/?s=vector-index)
- Enable oversampling + rescore with quantization [Search with quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/?s=searching-with-quantization)
- Use ACORN for filtered queries (v1.16+) [ACORN](https://search.qdrant.tech/md/documentation/search/search/?s=acorn-search-algorithm)

Binary quantization requires rescore. Without it, quality loss is severe. Use oversampling (3-5x minimum for binary) to recover recall. Always test quantization impact on your data before production. [Quantization](https://search.qdrant.tech/md/documentation/manage-data/quantization/)
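The query-time knobs above can be sketched like this; the collection name, vector size, and specific values are illustrative starting points, not tuned recommendations:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.query_points(
    collection_name="docs",  # placeholder name
    query=[0.0] * 768,
    limit=10,
    search_params=models.SearchParams(
        hnsw_ef=256,  # widen the HNSW beam; must exceed the requested limit
        quantization=models.QuantizationSearchParams(
            rescore=True,      # re-score candidates with the original vectors
            oversampling=3.0,  # fetch 3x candidates before rescoring
        ),
    ),
)
```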
## Wrong Embedding Model

Use when: exact search also returns bad results.

Test the top 3 MTEB models on 100-1000 sample queries and measure recall@10. Domain-specific models often outperform general models. [Hosted inference](https://search.qdrant.tech/md/documentation/inference/)

## Unoptimized Search Pipeline

Use when: exact search also returns bad results and the model choice is confirmed by the user.

Optimize search according to the advanced search-strategies skill.

## What NOT to Do

- Tune Qdrant before verifying the model is right for the task (most quality issues are model issues)
- Use binary quantization without rescore (severe quality loss)
- Set `hnsw_ef` lower than the number of results requested (guaranteed bad recall)
- Skip payload indexes on filtered fields and then blame quality (HNSW can't traverse filtered-out nodes, and filterable HNSW is built only if payload indexes were set up beforehand)
- Deploy without baseline recall or other search relevance metrics (no way to measure regressions)
- Confuse payload filtering with sparse vector search (different things, different config)
70
skills/qdrant-search-quality/search-strategies/SKILL.md
Normal file
@@ -0,0 +1,70 @@

---
name: qdrant-search-strategies
description: "Guides Qdrant search strategy selection. Use when someone asks 'should I use hybrid search?', 'BM25 or sparse vectors?', 'how to rerank?', 'results are not relevant', 'I don't get needed results from my dataset but they're there', 'retrieval quality is not good enough', 'results too similar', 'need diversity', 'MMR', 'relevance feedback', 'recommendation API', 'discovery API', 'ColBERT reranking', or 'missing keyword matches'"
---

# How to Improve Search Results with Advanced Strategies

These strategies complement basic vector search. Use them after confirming that the embedding model fits the task and the HNSW config is correct. If exact search returns bad results, verify the choice of embedding model (retriever) first.
If the user wants a weaker embedding model because it is small, fast, and cheap, use reranking or relevance feedback to improve search quality.

## Missing Obvious Keyword Matches

Use when: pure vector search misses results that contain obvious keyword matches; domain terminology is not in the embedding training data, exact keyword matching is critical (brand names, SKUs), or acronyms are common. Skip when: queries are purely semantic, all data is in the training set, or the latency budget is very tight.

- Dense + sparse with `prefetch` and fusion [Hybrid search](https://search.qdrant.tech/md/documentation/search/hybrid-queries/?s=hybrid-search)
- Prefer learned sparse ([miniCOIL](https://search.qdrant.tech/md/documentation/fastembed/fastembed-minicoil/), SPLADE, GTE) over raw BM25 when applicable (when the user needs smart keyword matching and the learned sparse models know the domain vocabulary)
- For non-English languages, [configure sparse BM25 parameters accordingly](https://search.qdrant.tech/md/documentation/search/text-search/?s=language-specific-settings)
- RRF: good default, supports weights (v1.17+) [RRF](https://search.qdrant.tech/md/documentation/search/hybrid-queries/?s=reciprocal-rank-fusion-rrf)
- DBSF with asymmetric limits (sparse_limit=250, dense_limit=100) can outperform RRF for technical docs [DBSF](https://search.qdrant.tech/md/documentation/search/hybrid-queries/?s=distribution-based-score-fusion-dbsf)
- Fusion can also be done through reranking
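A minimal hybrid query with `prefetch` and RRF fusion looks like this; the collection name, the `dense`/`sparse` vector names, and the example vectors are illustrative and must match how the collection was created:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Fetch dense and sparse candidates separately, then fuse the ranked lists.
client.query_points(
    collection_name="docs",  # placeholder name
    prefetch=[
        models.Prefetch(query=[0.0] * 768, using="dense", limit=100),
        models.Prefetch(
            query=models.SparseVector(indices=[7, 42], values=[0.8, 0.3]),
            using="sparse",
            limit=100,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=10,
)
```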
|
||||||
|
|
||||||
|
## Right Documents Found But Wrong Order
|
||||||
|
|
||||||
|
Use when: good recall but poor precision (right docs in top-100, not top-10).
|
||||||
|
|
||||||
|
- Cross-encoder rerankers via FastEmbed [Rerankers](https://search.qdrant.tech/md/documentation/fastembed/fastembed-rerankers/)
|
||||||
|
- See how to use [Multistage queries](https://search.qdrant.tech/md/documentation/search/hybrid-queries/?s=multi-stage-queries) in Qdrant
|
||||||
|
- ColBERT and ColPali/ColQwen reranking is especially precise due to late interaction mechanisms, but it is heavy. It is important to configure and store multivectors without building HNSW for them to save resources. See [Multivector representation](https://search.qdrant.tech/md/documentation/tutorials-search-engineering/using-multivector-representations/)
|
||||||
|
|
||||||
|
## Right Documents Not Found But They Are There

Use when: basic retrieval is in place, but the retriever misses relevant items you know exist in the dataset. Works on any embeddable data (text, images, etc.).

A Relevance Feedback (RF) Query uses a feedback model's scores on retrieved results to steer the retriever through the full vector space on subsequent iterations, as if the entire collection were reranked through the retriever. It is complementary to reranking: a reranker sees only a limited subset, while RF applies feedback signals collection-wide. Even 3–5 feedback scores are enough, and multiple iterations can be run.

A feedback model is anything that produces a relevance score per document: a bi-encoder, cross-encoder, late-interaction model, or LLM-as-judge. Because feedback is expressed as a graded relevance score (higher = more relevant), fuzzy scores work, not just binary relevant/irrelevant labels.

Skip when: the retriever already has strong recall, or the retriever and feedback model strongly agree on relevance.

- The RF Query is currently based on a [3-parameter naive formula](https://search.qdrant.tech/md/documentation/search/search-relevance/?s=naive-strategy) with no universal defaults, so it must be tuned per dataset, retriever, and feedback model
- Use [qdrant-relevance-feedback](https://pypi.org/project/qdrant-relevance-feedback/) to tune parameters, evaluate impact with the Evaluator, and check retriever–feedback agreement. See its README for setup instructions. No GPUs are needed, and the framework also provides predefined retriever and feedback model options
- Check the configuration of the [Relevance Feedback Query API](https://search.qdrant.tech/md/documentation/search/search-relevance/?s=relevance-feedback)
- Use this end-to-end text retrieval example, with parameter tuning and evals, to understand how to use the API and run the `qdrant-relevance-feedback` framework: [RF tutorial](https://search.qdrant.tech/md/documentation/tutorials-search-engineering/using-relevance-feedback/)

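Qdrant's RF formula is its own tunable 3-parameter strategy (see the docs linked above), but the general mechanism it builds on resembles classic Rocchio-style feedback, which can be sketched in a few lines: graded feedback scores split retrieved documents into relevant and non-relevant sets, and the query vector is moved toward the former and away from the latter. The function, parameter names, and threshold below are illustrative, not Qdrant's actual formula.

```python
import numpy as np

def rocchio_update(query, scored_docs, alpha=1.0, beta=0.75, gamma=0.15, threshold=0.5):
    """One feedback iteration: move the query toward documents the feedback
    model scored as relevant and away from the rest."""
    q_new = alpha * np.asarray(query, dtype=float)
    rel = [np.asarray(v, dtype=float) for v, s in scored_docs if s >= threshold]
    non = [np.asarray(v, dtype=float) for v, s in scored_docs if s < threshold]
    if rel:
        q_new = q_new + beta * np.mean(rel, axis=0)
    if non:
        q_new = q_new - gamma * np.mean(non, axis=0)
    return q_new / np.linalg.norm(q_new)

# Feedback scores can come from any model (cross-encoder, LLM-as-judge, ...);
# they are graded, not binary.
query = np.array([1.0, 0.0])
feedback = [([0.8, 0.6], 0.9), ([0.0, 1.0], 0.1)]
new_query = rocchio_update(query, feedback)
```

The updated query lands closer to the highly scored document, so a fresh search with it can surface similar items the original query missed.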
## Results Too Similar

Use when: top results are redundant, near-duplicates, or lack diversity. Common in dense content domains (academic papers, product catalogs).

- Use MMR (v1.15+) as a query parameter with `diversity` to balance relevance and diversity [MMR](https://search.qdrant.tech/md/documentation/search/search-relevance/?s=maximal-marginal-relevance-mmr)
- Start with `diversity=0.5`; lower it for more precision, raise it for more exploration
- MMR is slower than standard search, so only use it when redundancy is an actual problem

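To make the `diversity` trade-off concrete, here is a plain-Python sketch of the greedy MMR selection that the query parameter controls conceptually; this is an illustration of the algorithm, not Qdrant's implementation.

```python
import numpy as np

def mmr_select(query, candidates, k=5, diversity=0.5):
    """Greedy MMR: repeatedly pick the candidate with the best trade-off
    between relevance to the query and novelty w.r.t. already-picked results.
    diversity=0.0 -> plain relevance ranking; diversity=1.0 -> pure diversity."""
    q = np.asarray(query, dtype=float)
    cands = [np.asarray(c, dtype=float) for c in candidates]
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    relevance = [cos(q, c) for c in cands]
    selected, remaining = [], list(range(len(cands)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            redundancy = max((cos(cands[i], cands[j]) for j in selected), default=0.0)
            return (1 - diversity) * relevance[i] - diversity * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected  # candidate indices in selection order
```

With `diversity=0.0` the second pick is the near-duplicate of the top hit; raising `diversity` makes the distinct-but-less-relevant candidate win instead.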
## Know What Good Results Could Look Like But Can't Get Them

Use when: you can provide positive and negative example points to steer the search closer to the positives and further from the negatives.

- Recommendation API: positive/negative examples to recommend fitting vectors [Recommendation API](https://search.qdrant.tech/md/documentation/search/explore/?s=recommendation-api)
- Best score strategy: better for diverse examples; supports negative-only [Best score](https://search.qdrant.tech/md/documentation/search/explore/?s=best-score-strategy)
- Discovery API: context pairs (positive/negative) to constrain search regions without a request target [Discovery](https://search.qdrant.tech/md/documentation/search/explore/?s=discovery-api)

## Have Business Logic Behind Relevance

Use when: results should be additionally ranked according to business logic based on your data, such as recency or distance.

Check how to set this up in the [Score Boosting docs](https://search.qdrant.tech/md/documentation/search/search-relevance/?s=score-boosting)

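Qdrant evaluates such boosting formulas server-side (see the docs above), but the idea itself is simple enough to sketch client-side: combine the similarity score with a payload-derived term, here an exponentially decaying recency bonus. The helper and its parameters are hypothetical, chosen only to illustrate the shape of the logic.

```python
import math

def boost_by_recency(hits, weight=0.3, tau_days=30.0):
    """Rerank search hits by similarity score plus an exponentially
    decaying recency bonus (fresher documents get a larger boost)."""
    def boosted(hit):
        return hit["score"] + weight * math.exp(-hit["age_days"] / tau_days)
    return sorted(hits, key=boosted, reverse=True)

hits = [
    {"id": 1, "score": 0.80, "age_days": 365},  # slightly more similar, but old
    {"id": 2, "score": 0.75, "age_days": 1},    # fresh
]
reranked = boost_by_recency(hits)
```

The fresh document overtakes the slightly more similar but year-old one, which is exactly the behavior a server-side score-boosting formula encodes.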
## What NOT to Do

- Use hybrid search before verifying pure vector quality (it adds complexity and may mask model issues)
- Use BM25 on non-English text without correctly configuring language-specific stop-word removal (severely degraded results)
- Skip evaluation when adding relevance feedback (check on real queries that it actually helps)

skills/qdrant-version-upgrade/SKILL.md (new file, 21 lines)
@@ -0,0 +1,21 @@
---
name: qdrant-version-upgrade
description: "Guidance on how to upgrade your Qdrant version without interrupting the availability of your application, while ensuring data integrity."
---

# Qdrant Version Upgrade

Qdrant provides the following version-compatibility guarantees:

- Major and minor versions of the Qdrant server and the SDK are expected to match. For example, Qdrant 1.17.x is compatible with SDK 1.17.x.
- Qdrant is tested for backward compatibility between minor versions. For example, Qdrant 1.17.x should be compatible with SDK 1.16.x. Qdrant server 1.16.x is also expected to be compatible with SDK 1.17.x, but only for the subset of features that were available in 1.16.x.
- When migrating to the next minor version, it is recommended to first upgrade the SDK to the next minor version and then upgrade the Qdrant server.
- Storage compatibility is only guaranteed across one minor version. For example, data stored with Qdrant 1.16.x is expected to be compatible with Qdrant 1.17.x. If you need to migrate across more than one minor version, you must upgrade step by step, one minor version at a time. For example, to migrate from 1.15.x to 1.17.x, first upgrade to 1.16.x and then to 1.17.x. Note: Qdrant Cloud automates this process, so you can upgrade directly from 1.15.x to 1.17.x without intermediate steps.
- A Qdrant cluster with a replication factor of 2 or higher can be upgraded without downtime by performing a rolling upgrade: upgrade one node at a time while the other nodes continue to serve requests, maintaining the availability of your application throughout. More about replication: [Replication factor](https://search.qdrant.tech/md/documentation/operations/distributed_deployment/?s=replication-factor)

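The one-minor-version-at-a-time rule for self-hosted upgrades can be sketched with a small helper that lists the versions to install in order. This function is purely illustrative and not part of any Qdrant tooling (and, as noted above, Qdrant Cloud skips the intermediate steps for you).

```python
def upgrade_path(current: str, target: str) -> list[str]:
    """List the versions to install, in order, honoring the rule that
    storage compatibility only spans one minor version at a time."""
    cur_major, cur_minor, _ = (int(p) for p in current.split("."))
    tgt_major, tgt_minor, _ = (int(p) for p in target.split("."))
    if cur_major != tgt_major:
        raise ValueError("major-version migrations need a dedicated plan")
    if tgt_minor <= cur_minor:
        return []
    # Visit every intermediate minor version (latest patch is fine), then the target.
    return [f"{cur_major}.{m}.x" for m in range(cur_minor + 1, tgt_minor)] + [target]
```

For example, `upgrade_path("1.15.4", "1.17.2")` yields the two-step path through 1.16.x.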
For managing Qdrant version upgrades in Qdrant Cloud, you can use the [qcloud](https://github.com/qdrant/qcloud-cli) CLI tool.