The Billion-Vector Problem: HNSW vs. DiskANN in Azure AI Search


The explosion of AI-powered applications has created a new challenge: how do you efficiently search through billions of vector embeddings without exploding infrastructure costs? In this episode, we explore the “billion-vector problem” and compare two leading vector search algorithms available in Azure AI Search: HNSW (Hierarchical Navigable Small World) and DiskANN. While HNSW has become the industry standard thanks to its fast in-memory performance and high recall, it requires significant RAM as datasets grow. DiskANN, originally developed by Microsoft Research, takes a different approach by leveraging SSD storage to dramatically reduce memory requirements while maintaining excellent search accuracy at massive scale.
We break down how each algorithm works, where they shine, and the trade-offs architects need to consider when designing Retrieval-Augmented Generation (RAG), semantic search, and AI agent solutions. The discussion covers performance, scalability, operational costs, update behavior, and why the right choice often depends on the size and growth trajectory of your vector data. Whether you're building a proof of concept or planning for billion-scale workloads, this episode provides practical guidance for selecting the most effective vector search strategy in Azure AI Search and beyond.
You face the billion-vector problem when you need to search through massive datasets in Azure AI Search. Microsoft’s DiskANN is built for this challenge. When you compare HNSW and DiskANN, you see clear differences. HNSW gives you fast in-memory access for low-latency results but needs a lot of RAM. DiskANN uses SSDs to handle huge datasets with low latency and reduces your memory costs. This combination of graph-based index design and SSD-based architecture changes how you manage enterprise AI workloads.
| Feature | HNSW | DiskANN |
|---|---|---|
| Speed | Optimized for low latency with fast in-memory access. | Utilizes SSDs efficiently to manage large datasets with low latency. |
| Scaling | Scales well for sizable datasets, particularly in RAM. | Scales effectively for massive datasets by leveraging SSDs for storage. |
| Memory | High memory requirements, making it more RAM-dependent. | Optimized for SSDs, which reduces RAM dependency and associated costs. |
Key Takeaways
- The billion-vector problem arises when searching through massive datasets in Azure AI Search.
- HNSW offers fast in-memory access but requires significant RAM, making it suitable for smaller datasets.
- DiskANN uses SSDs to manage large datasets efficiently, reducing memory costs and enabling scalability.
- Choosing the right algorithm impacts speed, accuracy, and resource use, so match it to your specific needs.
- Azure AI Search provides tools for data preprocessing, enhancing vector quality before indexing.
- Both HNSW and DiskANN utilize graph-based indexing, but their designs cater to different dataset sizes and requirements.
- DiskANN's hybrid architecture allows for cost-effective scaling, making it ideal for enterprise workloads.
- Regular monitoring and optimization of your search system are crucial for maintaining high performance.
The Billion-Vector Problem in Azure AI Search
Challenges of Billion-Scale Vector Search
You face the billion-vector problem when you need to index and search through massive datasets in enterprise AI. Handling billions of vectors brings unique technical challenges. You must build efficient indexing systems that can process large-scale data without slowing down. Real-time updates are important because you want your search results to stay current and relevant. Managing high-dimensional embeddings is also essential for accurate search results.
When you work with billion-scale datasets, you encounter several limitations:
- High dimensionality increases computational costs and can slow down search performance.
- The semantic gap between vector representations and real-world meaning can cause inaccuracies.
- Garbage collection becomes difficult, making it hard to remove outdated information from indexes.
- The quality of vectors depends on the embedding model you choose, which affects search accuracy.
- Scalability stretches memory requirements and can increase search times.
- The cold start problem makes it hard to search for new items without good vector representations.
- Interpretability is low, so understanding why a search result appears can be challenging.
These challenges make the billion-vector problem one of the toughest in modern AI.
Why Algorithm Choice Matters
Your choice of algorithm has a direct impact on how you solve the billion-vector problem. If you select an algorithm like HNSW, you get fast access and efficient searching for large datasets. However, you need a lot of RAM, which can increase costs. Other algorithms, such as exhaustive KNN, offer high accuracy but require much more computation, making them better for smaller datasets. The right algorithm helps you balance speed, accuracy, and resource use. You can optimize performance for your specific application by matching the algorithm to your needs. This decision shapes how well you handle massive datasets and how much you spend on hardware.
Azure AI Search Overview
Azure AI Search gives you a powerful platform for tackling the billion-vector problem. You can preprocess your data using advanced cleaning and enrichment tools, which improves the quality of your vectors before indexing. The platform organizes your information into a vector database designed for retrieval tasks and modern search needs. You benefit from robust search and retrieval features, including natural language processing for semantic understanding. Security is strong, with encryption and access controls to protect your data.
Azure AI Search integrates smoothly with other Microsoft Azure services and third-party platforms. You can use hybrid search, which combines keyword and vector search for the best results. Ranking and relevance tuning let you adjust how search results appear, making them more useful for your users. The platform also offers smart search experiences, such as autocomplete and suggested results. Built on Azure’s global infrastructure, Azure AI Search delivers the scale and reliability you need for massive datasets and large-scale data projects.
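Hybrid search has to merge a keyword ranking and a vector ranking into one list. Azure AI Search documents Reciprocal Rank Fusion (RRF) as the method it uses for that merge. The sketch below only illustrates the idea and is not the service's implementation: the function name, the document IDs, and the constant `k=60` are assumptions chosen for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores 1 / (k + rank) in every list where it appears,
    and the scores are summed. k=60 is a commonly used constant (assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a keyword query and a vector query.
keyword_hits = ["doc3", "doc1", "doc7"]
vector_hits = ["doc1", "doc5", "doc3"]

fused = rrf_fuse([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists (here `doc1` and `doc3`) rise above documents that appear in only one, which is why this kind of fusion suits hybrid retrieval.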
Graph-Based Index: HNSW and DiskANN

What Is HNSW
You may have heard of HNSW, which stands for Hierarchical Navigable Small World. This graph-based index algorithm helps you find similar vectors quickly in large datasets. HNSW builds a multi-layer graph structure that connects data points in a way that makes search fast and accurate. Each layer of the graph has a different density, with the top layers having fewer nodes and the lower layers having more. This design lets you start your search at the top and move down, narrowing your results as you go. HNSW is optimized for in-memory vector search, so it keeps the entire graph in RAM. This approach gives you low-latency results, but it also means you need a lot of memory when your dataset grows.
What Is DiskANN
DiskANN is Microsoft’s answer to the challenges of billion-scale vector search. This graph-based index uses a hybrid approach that combines memory and SSD storage. DiskANN lets you store most of your index on SSDs, which are much cheaper than RAM. You keep only the most important navigation data in memory. This design helps you manage massive datasets without high hardware costs. DiskANN uses the Vamana graph algorithm to guide your search through the data. It also uses a special pruning method to keep the graph diverse and efficient. Microsoft’s DiskANN powers Azure AI Search and other services, making it a key innovation for enterprise AI.
| Innovation | Description |
|---|---|
| High-speed ANN search | DiskANN’s architecture allows for efficient ANN search on SSDs, which helps reduce hardware costs. |
| Hybrid memory system | DiskANN uses a combination of DRAM and SSD to manage billions of points on a single machine. |
| Cost-effective storage | DiskANN stores the ANN index on disk, providing a budget-friendly solution for high-dimensional data handling. |
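The pruning method mentioned above is described in the DiskANN paper as RobustPrune: a candidate neighbor is dropped when an already chosen neighbor is sufficiently closer to it than the point being indexed is. The sketch below is a minimal, in-memory illustration under stated assumptions. The function names, the toy 2-D points, `alpha=1.2`, and `max_degree=3` are all hypothetical, and a production build adds much more (candidate generation by graph search, back-edges, disk layout).

```python
import math


def robust_prune(p, candidates, points, alpha=1.2, max_degree=3):
    """Choose up to max_degree diverse out-neighbors for point id p.

    alpha > 1 keeps some longer edges, which helps searches cover
    distance quickly; alpha and max_degree are illustrative values.
    """
    remaining = sorted(
        (c for c in set(candidates) if c != p),
        key=lambda c: math.dist(points[p], points[c]),
    )
    neighbors = []
    while remaining and len(neighbors) < max_degree:
        best = remaining.pop(0)  # closest remaining candidate to p
        neighbors.append(best)
        # Drop candidates that `best` already "covers": they are much
        # closer to `best` than they are to p.
        remaining = [
            c for c in remaining
            if alpha * math.dist(points[best], points[c])
            > math.dist(points[p], points[c])
        ]
    return neighbors


# Toy data: point 2 sits right next to point 1, so it is redundant.
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.1, 0.0), 3: (0.0, 2.0), 4: (-3.0, 0.0)}
pruned = robust_prune(0, [1, 2, 3, 4], points)
print(pruned)
```

In this toy run the near-duplicate neighbor is discarded while neighbors in other directions survive, which is the diversity property the text refers to.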
How HNSW Works
Graph Construction
HNSW builds its graph-based index by creating a hierarchy of search graphs. You start with an empty structure. The first vector you add becomes the only node in the top layer. For each new vector, HNSW draws a maximum layer at random from an exponentially decaying distribution, so most vectors live only in the bottom layer and very few reach the top. The new vector connects to its closest neighbors in each layer up to its assigned maximum. The top layer has the fewest nodes, which makes it a good starting point for search. Lower layers have more nodes, allowing for detailed searches. HNSW can update its graph without a full rebuild, so you can add new data as your needs grow.
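The random layer assignment can be sketched in a few lines. The HNSW paper draws the level as floor(-ln(U) * mL) with U uniform on (0, 1] and suggests mL = 1 / ln(M), where M is the number of links per node. The function name, `M = 16`, and the seed below are assumptions for illustration.

```python
import math
import random
from collections import Counter


def random_level(m, rng):
    """Draw the top layer for a new vector.

    With m_l = 1 / ln(m), the chance of reaching layer 1 or higher is 1/m,
    layer 2 or higher is 1/m**2, and so on.
    """
    m_l = 1.0 / math.log(m)
    u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)


rng = random.Random(42)  # fixed seed so the run is repeatable
levels = [random_level(16, rng) for _ in range(100_000)]
counts = Counter(levels)
share_at_base_only = counts[0] / len(levels)
print(sorted(counts.items()), round(share_at_base_only, 3))
```

With M = 16, roughly 15 out of 16 vectors stay in the base layer only, which is what keeps the upper layers sparse enough to serve as a fast entry path.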
Search Process
When you search with HNSW, you begin at the top layer of the graph. You use long-range links to move quickly toward the area where your target vector might be. As you move down each layer, the connections become denser. This helps you refine your search and get closer to the best match. At the lowest layer, HNSW uses a beam search to find the nearest neighbors. This process gives you fast and accurate results, especially when your entire graph fits in memory.
HNSW’s design principles:
- Optimized for in-memory operations, delivering low latency and high recall.
- Uses a multi-layer graph structure based on small-world networks.
- Employs a pruning strategy to remove redundant edges and encourage diversity.
- Search process involves descending layers, starting with sparse long-range links and ending with beam search at the base layer.
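The layered descent described above can be sketched on a toy graph. This is a simplified illustration, not a full HNSW implementation: the two hand-built layers, the 2-D points, and the `ef` beam width are assumptions, and index construction is skipped entirely.

```python
import heapq
import math


def search_layer(graph, points, query, entry, ef):
    """Best-first (beam) search in one layer; returns up to ef (dist, id) pairs."""
    d0 = math.dist(points[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap (negated): worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break  # nothing left that can improve the beam
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(points[nb], query)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # evict the current worst
    return sorted((-nd, n) for nd, n in results)


def hnsw_search(layers, points, query, entry, ef, k):
    """Greedy descent (ef=1) through upper layers, beam search at the base."""
    for graph in layers[:-1]:
        entry = search_layer(graph, points, query, entry, ef=1)[0][1]
    return [n for _, n in search_layer(layers[-1], points, query, entry, ef)[:k]]


# Toy index: ten points on a line, a sparse top layer and a dense base layer.
points = {i: (float(i), 0.0) for i in range(10)}
top = {0: [3], 3: [0, 6], 6: [3, 9], 9: [6]}
base = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

nearest = hnsw_search([top, base], points, (6.4, 0.0), entry=0, ef=3, k=2)
print(nearest)
```

The top layer gets the search from node 0 to node 6 in two hops, and the base layer then only explores the immediate neighborhood. A larger `ef` widens the beam, trading latency for recall.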
You can see that both HNSW and DiskANN use a graph-based index, but their designs fit different needs. HNSW works best when you can keep everything in memory. DiskANN lets you scale to billions of vectors by using SSDs, making it ideal for enterprise workloads in Azure AI Search.
How DiskANN Works
Hybrid Memory and SSD Architecture
You encounter a unique approach when you use DiskANN for billion-vector search. DiskANN combines memory and SSD storage to create a hybrid architecture. This design lets you store most of your vector index on SSDs, which cost less than RAM. You keep only the essential navigation data in memory. This method allows you to scale your search to billions of vectors without needing expensive hardware.
DiskANN uses two main techniques to optimize storage and speed:
- Vamana graph: You use this structure for efficient navigation through the index.
- Product quantization (PQ): You compress vectors in memory, which reduces RAM usage.
You benefit from a two-phase query execution process. In the first phase, DiskANN uses PQ-compressed vectors in RAM to quickly identify a candidate set. You do not need to read from disk during this step. In the second phase, DiskANN fetches full-precision vectors from SSD for the candidate set. You then compute exact distances to find the best matches. This process gives you high recall and low latency while keeping memory requirements low.
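The two-phase idea can be sketched in plain Python. Several simplifications are assumed and should not be read as how DiskANN is implemented: phase 1 scans every compressed code instead of walking the Vamana graph, the codebooks are sampled from the data rather than trained with k-means, and the `ssd` dictionary merely stands in for disk reads. All names and sizes are hypothetical.

```python
import random


def split(vec, n_sub):
    """Cut a vector into n_sub equal-length subvectors."""
    step = len(vec) // n_sub
    return [vec[i * step:(i + 1) * step] for i in range(n_sub)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebooks(vectors, n_sub, n_centroids, rng):
    """Simplification: sample centroids per subspace (real PQ uses k-means)."""
    return [
        rng.sample([split(v, n_sub)[s] for v in vectors], n_centroids)
        for s in range(n_sub)
    ]


def encode(vec, books):
    """Replace each subvector with the index of its nearest centroid."""
    code = []
    for sub, book in zip(split(vec, len(books)), books):
        code.append(min(range(len(book)), key=lambda c: sq_dist(sub, book[c])))
    return tuple(code)


def two_phase_search(query, codes, books, ssd, n_candidates, k):
    # Phase 1 (RAM only): approximate distances from a per-query lookup table.
    table = [
        [sq_dist(q_sub, centroid) for centroid in book]
        for q_sub, book in zip(split(query, len(books)), books)
    ]

    def approx(i):
        return sum(table[s][c] for s, c in enumerate(codes[i]))

    candidates = sorted(range(len(codes)), key=approx)[:n_candidates]
    # Phase 2 ("SSD"): fetch full-precision vectors and rerank exactly.
    return sorted(candidates, key=lambda i: sq_dist(query, ssd[i]))[:k]


rng = random.Random(0)
vectors = [tuple(rng.random() for _ in range(8)) for _ in range(500)]
books = train_codebooks(vectors, n_sub=4, n_centroids=16, rng=rng)
codes = [encode(v, books) for v in vectors]   # 4 small integers per vector
ssd = dict(enumerate(vectors))                # stand-in for on-disk storage

query = vectors[7]  # query with a stored vector so the top hit is known
top_hits = two_phase_search(query, codes, books, ssd, n_candidates=50, k=5)
print(top_hits)
```

Only the short codes and the lookup table are touched in phase 1; full vectors are read for just 50 of the 500 items. That ratio is what keeps RAM use and disk traffic low at much larger scales.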
Tip: DiskANN’s hybrid memory and SSD architecture enables web-scale search on commodity hardware. You can store vector indexes on SSDs instead of RAM, which makes large-scale applications more affordable.
Search Process
You start a search in DiskANN by navigating the Vamana graph using compressed vectors in memory. This step helps you find promising candidates quickly. You avoid disk reads at this stage, which speeds up the process. Once you have a candidate set, DiskANN moves to the next phase.
You fetch full-precision vectors from SSD for each candidate. You calculate exact distances between your query and these candidates. This two-phase process ensures that you get accurate results without slowing down your search. You achieve high recall because DiskANN checks the best candidates in detail. You also keep latency low because most of the work happens in memory.
DiskANN’s search process works well for massive datasets. You can handle billions of vectors without needing huge amounts of RAM. You get fast, reliable results even as your data grows. This makes DiskANN a strong choice for enterprise AI workloads in Azure AI Search.
| Step | What You Do | Benefit |
|---|---|---|
| Phase 1: Navigation | Use PQ-compressed vectors in RAM to find candidates | Fast search, no disk reads |
| Phase 2: Refinement | Fetch full-precision vectors from SSD and compute distances | High recall, low latency |
You see that DiskANN’s hybrid architecture and search process help you solve the billion-vector problem efficiently. You can scale your search, reduce costs, and maintain performance as your data expands.
Strengths and Weaknesses
HNSW Pros and Cons
You often choose HNSW for vector retrieval because it delivers state-of-the-art search speed and high recall. When you tune HNSW properly, you get fast results and accurate matches. You can handle high-dimensional vector data well, which is important for modern AI workloads. You also benefit from incremental additions, so you can update your index without rebuilding everything.
Here is a quick overview of HNSW’s strengths and weaknesses:
| Strengths | Weaknesses |
|---|---|
| State-of-the-art search speed and high recall when properly tuned. | High memory consumption due to storing graph links for each vector. |
| Handles high-dimensional data well. | Lengthy index build time with high-quality construction parameters. |
| Supports incremental additions efficiently. | Complex deletion of elements may degrade performance over time. |
HNSW index construction scales well because build time grows roughly as O(N log N). Parallel construction lets you use multi-core processors, which speeds up the process. Memory usage scales linearly with dataset size, so you can predict storage needs. However, you must keep the entire graph in RAM, which limits scalability for billion-vector datasets.
DiskANN Pros and Cons
You turn to DiskANN when you need to scale vector retrieval beyond what RAM can handle. DiskANN optimizes for SSD usage, which reduces hardware costs and lets you manage massive datasets. You get high accuracy and low latency by combining in-memory searches with batched SSD reads. DiskANN supports real-time updates and hybrid search, so you can filter and retrieve vectors efficiently.
Consider these points about DiskANN:
- Cost-effective scalability by using SSDs instead of RAM.
- High accuracy and low latency at scale through in-memory searches and SSD reads.
- Real-time updates and hybrid vector-plus-filtered retrieval.
- Slow build times because the Vamana graph construction is computationally heavy.
- Performance depends on SSD throughput and latency.
- Updates are more expensive compared to in-memory structures like HNSW.
- Implementation and tuning require careful optimization.
You benefit from DiskANN’s ability to handle quantized vectors, which compresses data and keeps search memory-efficient. You can use DiskANN for billion-vector workloads without worrying about RAM limits.
Memory and Storage Efficiency
You must consider memory and storage efficiency when choosing between HNSW and DiskANN for vector retrieval. HNSW works well for datasets that fit within your server’s RAM. You get fast access and high performance, but you cannot scale beyond your memory limits. DiskANN lets you extend your dataset size to disk capacity, trading some speed for storage. You store most of your index on SSDs, which makes DiskANN ideal for massive vector datasets.
Here is a comparison:
| Feature | HNSW | DiskANN |
|---|---|---|
| Memory Efficiency | Highly efficient for datasets that fit within the server’s RAM. | Designed for datasets too large to fit into RAM. |
| Storage Efficiency | Limited by the amount of available RAM. | Extends dataset size to disk capacity, trading some speed for storage. |
| Performance Optimization | Leverages fast memory access for speed. | Minimizes performance penalties of slower disk storage. |
| Dataset Suitability | Suitable for datasets that fit in RAM. | Suitable for massive datasets beyond RAM capacity. |
You use quantized vectors in both HNSW and DiskANN to reduce storage needs and improve retrieval speed. DiskANN’s hybrid architecture lets you scale your vector search while keeping costs low. You achieve memory-efficient search for large-scale retrieval tasks, especially when you work with quantized vectors and SSD storage.
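A back-of-the-envelope calculation shows why RAM becomes the limiting factor. The sketch below rests on rough assumptions, not measurements: float32 vectors, about 2 × M four-byte links per node in the HNSW base layer (upper layers ignored), and 32 bytes of compressed code per vector held in RAM for the SSD-based design (cached graph nodes ignored). The function names and the example figures of one billion 768-dimensional vectors are hypothetical.

```python
def hnsw_ram_bytes(n, dims, m=16, bytes_per_float=4, bytes_per_link=4):
    """Rough RAM for an in-memory graph: full vectors plus base-layer links."""
    vector_bytes = n * dims * bytes_per_float
    link_bytes = n * 2 * m * bytes_per_link  # assumes ~2*m links per node
    return vector_bytes + link_bytes


def compressed_ram_bytes(n, code_bytes_per_vector=32):
    """Rough RAM when only compressed codes stay in memory."""
    return n * code_bytes_per_vector


n, dims = 1_000_000_000, 768
hnsw_gb = hnsw_ram_bytes(n, dims) / 1e9
compressed_gb = compressed_ram_bytes(n) / 1e9
print(f"in-memory graph: ~{hnsw_gb:,.0f} GB RAM")
print(f"compressed codes: ~{compressed_gb:,.0f} GB RAM, full vectors on SSD")
```

Under these assumptions the in-memory graph needs about 3,200 GB of RAM, while the compressed codes need about 32 GB. The full-precision vectors still have to live somewhere, which is the role the SSD plays.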
Vector Search Algorithm Comparison

Performance at Scale
You need to measure how a vector search algorithm performs as your dataset grows. HNSW and DiskANN both aim to deliver high recall and speed, but their strategies differ. HNSW relies on in-memory graphs, which means you get fast nearest neighbor search when your vectors fit in RAM. DiskANN uses a hybrid memory and SSD approach, so you can scale your search performance without worrying about memory limits.
When you run nearest neighbor search on millions of vectors, HNSW gives you high-speed retrieval and low latency. You can tune parameters for recall and speed. DiskANN maintains high recall and speed even as your dataset expands. You store most of your index on SSDs, which lets you handle billions of vectors. You use two-stage search to balance speed and accuracy. The first stage finds candidates in memory, and the second stage refines results using SSDs.
You achieve high recall and speed with both algorithms, but DiskANN lets you scale your search performance beyond the limits of RAM.
Scalability for Billions of Vectors
You face new challenges when your dataset grows from millions to billions of vectors. HNSW works well for up to 50 million vectors. You use HNSW as your default vector search algorithm and tune parameters for recall. When you reach 50 to 500 million vectors, you implement sharding and scalar quantization. You may use two-phase retrieval for efficiency. For datasets over 500 million vectors, you need hierarchical retrieval and multi-tier architectures. You use aggressive quantization for cold data.
Here is a comparison of HNSW scaling strategies:
| Vector Count Range | HNSW Strategy |
|---|---|
| 1-50M | Use HNSW as the default algorithm, ensure vectors fit in RAM, and tune parameters for recall. |
| 50-500M | Implement sharding, enable scalar quantization, and consider two-phase retrieval for efficiency. |
| 500M+ | Use hierarchical retrieval, multi-tier architectures, and aggressive quantization for cold data. |
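Scalar quantization, mentioned in the middle row of the table, stores each float as a single byte. The sketch below is one simple variant under stated assumptions: a single global min/max range and 256 evenly spaced levels. The function names and sample vectors are hypothetical, and production systems often calibrate ranges per dimension or per segment.

```python
def fit_range(vectors):
    """Find the global value range used to map floats onto 0..255."""
    flat = [x for v in vectors for x in v]
    return min(flat), max(flat)


def quantize(vec, lo, hi):
    """Map each float to one byte (4x smaller than float32)."""
    scale = (hi - lo) / 255 or 1.0  # guard against a zero-width range
    return bytes(min(255, max(0, round((x - lo) / scale))) for x in vec)


def dequantize(code, lo, hi):
    """Recover approximate floats from the byte codes."""
    scale = (hi - lo) / 255 or 1.0
    return [lo + b * scale for b in code]


vectors = [[-1.0, 0.0, 0.5], [1.0, 0.25, -0.5]]
lo, hi = fit_range(vectors)
codes = [quantize(v, lo, hi) for v in vectors]
restored = [dequantize(c, lo, hi) for c in codes]
max_error = max(
    abs(a - b) for v, r in zip(vectors, restored) for a, b in zip(v, r)
)
print(codes, round(max_error, 5))
```

The reconstruction error is bounded by half a quantization step, which is why scalar quantization usually costs little recall while cutting vector memory to a quarter.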
DiskANN changes the game for billion-scale vector search. You store most of your index on SSDs, so you do not need to worry about RAM limits. You use product quantization to compress vectors and keep navigation data in memory. You scale your nearest neighbor search to billions of vectors without sacrificing search performance. DiskANN also supports filtered and hybrid retrieval, which helps you manage massive datasets.
DiskANN lets you scale your vector search algorithm for billions of vectors, making it ideal for enterprise workloads.
Query Latency and Throughput
You care about query latency and throughput when you run vector search at scale. HNSW gives you low latency for nearest neighbor search because it keeps the entire graph in memory. You get fast query responses and high throughput for datasets that fit in RAM. When your dataset grows, latency increases as you shard or use multi-tier architectures.
DiskANN uses a two-stage search process. You first search compressed vectors in memory, which gives you fast candidate selection. You then refine your query by fetching full-precision vectors from SSDs. This approach keeps latency low and maintains high throughput, even for billion-scale datasets. You optimize your retrieval by batching SSD reads and using efficient navigation structures.
You achieve high recall and speed with DiskANN, even as your query volume increases. You do not need to worry about memory limits, so you can scale your search performance for enterprise applications.
You get reliable query latency and throughput with DiskANN, making it a strong choice for large-scale nearest neighbor search.
Cost and Resource Impact
You must consider cost and resource impact when you choose a vector search algorithm for Azure AI Search. The way each algorithm uses memory and storage affects your budget and your ability to scale.
DiskANN stands out for billion-vector workloads. You can store most of your index on SSDs, which cost less than RAM. This design lets you scale from thousands to billions of vectors without a huge increase in memory costs. DiskANN works well for production AI workloads because it uses SSDs efficiently and keeps only the most important data in memory.
HNSW works best for medium-sized datasets. You need to keep the entire index in RAM. This requirement limits how much you can scale. If your dataset grows too large, you may face high operational costs. The HNSW index in this comparison is also limited to 2,000 dimensions, a limit that comes from the implementation rather than from the algorithm, and it can restrict your use cases.
Here is a quick comparison:
| Index Type | Suitable for Billion-Vector Workloads | Scale and Memory | Notes |
|---|---|---|---|
| DiskANN | Yes, designed for SSD performance | Scales from thousands to billions of vectors | Recommended for production AI workloads |
| HNSW | No, suitable for medium datasets | Requires index to fit in RAM | Limited to 2,000 dimensions |
Tip: If you want to manage costs and scale your search to billions of vectors, DiskANN gives you a clear advantage. You can use affordable SSDs and keep your memory needs low.
Real-World Use Cases
You can see the value of these algorithms in real-world applications. Many organizations use DiskANN and HNSW for different types of vector search problems.
- You can use a cost-efficient hybrid method for approximate nearest neighbor search. This approach combines SSD storage with in-memory graph structures.
- You can deploy DiskANN for static datasets, such as research corpora, where the data does not change often.
- You can choose DiskANN for cost-sensitive deployments. It works well for billion-scale applications where you need to control expenses.
- You can use DiskANN for queries that can tolerate latencies of up to about 10 milliseconds. This speed is fast enough for many enterprise search tasks.
- You can combine HNSW with disk-backed inverted files (HNSW-IF) for hybrid search. This method achieves high recall and keeps latency low.
Here is a summary of common use cases:
| Use Case | Description |
|---|---|
| Static datasets | Ideal for research corpora where data does not change frequently. |
| Cost-sensitive deployments | Suitable for billion-scale applications where cost is a concern. |
| Latency-tolerant queries | Works well for queries that can tolerate latencies of up to about 10 ms. |
Note: Many organizations achieve 90% recall at 10ms latency using hybrid methods. This performance meets the needs of most enterprise AI search applications.
You can match your algorithm choice to your workload. If you need to scale, manage costs, and keep latency low, DiskANN gives you the flexibility and efficiency you need.
When to Use HNSW or DiskANN
Choosing the right algorithm for your Azure AI Search workload can make a big difference in speed, cost, and accuracy. You need to look at your data, your hardware, and how often your data changes. Let’s break down the main decision criteria so you can pick the best tool for your vector search needs.
Decision Criteria
Dataset Size
You should always start by looking at the size of your dataset. If you work with millions of vectors, HNSW gives you fast and accurate search results. It works best when your data fits in memory. When your dataset grows to billions of vectors, DiskANN becomes the better choice. It uses SSDs to store most of the index, so you do not need to worry about running out of RAM.
| Criteria | HNSW | DiskANN |
|---|---|---|
| Dataset Size | Suited for millions of vectors | Excels for billions of vectors |
You can see that HNSW fits smaller workloads, while DiskANN handles massive datasets without slowing down.
Hardware and Cost
Your hardware and budget also play a big role. HNSW needs a lot of RAM because it keeps the whole graph in memory. This can get expensive as your vector database grows. DiskANN helps you save money by storing most of the index on SSDs, which cost less than RAM. You only need enough memory for the navigation part of the index.
| Criteria | HNSW | DiskANN |
|---|---|---|
| Infrastructure | RAM requirements | Reduces memory costs |
If you want to keep costs low and still search through billions of vectors, DiskANN gives you a clear advantage.
Latency Needs
You should think about how fast you need your search results. HNSW gives you very low latency because it searches in memory. You can get results in just a few milliseconds if your data fits in RAM. DiskANN balances speed and scale. It uses a two-step search process, so you still get fast results even with huge datasets. For most enterprise applications, DiskANN keeps query times under 10 milliseconds.
- HNSW is best for workloads where every millisecond counts.
- DiskANN works well when you need to search large datasets quickly and can accept a small increase in latency.
Update Patterns
How often your data changes affects your choice. HNSW handles incremental additions well, but frequent deletions and replacements can degrade the graph over time, so a heavily changed index may eventually need a rebuild, which takes time. DiskANN is designed to keep accuracy and recall stable as you add or change vectors, although each update costs more than it does in an in-memory structure.
| Feature | HNSW | DiskANN |
|---|---|---|
| Update Behavior | Fast incremental additions; heavy deletions may call for a rebuild | Adapts to changes without a full rebuild, at a higher cost per update |
| Recall | High, but affected by changes | Stable accuracy with frequent mutations |
If your application needs to handle lots of updates on a very large index, DiskANN gives you more flexibility.
Example Scenarios
You can use real-world scenarios to help decide which algorithm fits your needs.
| Metric | HNSW | DiskANN |
|---|---|---|
| Average Recall | 0.6 | 0.7 |
| End-to-End Accuracy | 0.972 | 0.933 |
| Robustness-0.2@10 | 0.998 | 0.984 |
| Robustness-0.9 | 0.84 | 0.81 |
- HNSW gives you high recall and fast query throughput. On datasets like SIFT1M, you can reach a recall@10 of about 95% in just 1-2 milliseconds per query on a CPU.
- DiskANN shows higher average recall in the table above, which means it finds more true matches in large datasets. It keeps accuracy high, even as your data grows.
- HNSW works best for recommendation systems or search engines where you need top accuracy and speed.
- DiskANN fits large-scale search, such as searching billions of product images or documents, where you need to balance speed, cost, and scale.
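Recall figures like the ones above are computed by comparing an approximate index's results with exact brute-force results. The sketch below shows that calculation under stated assumptions: the function names, the toy vectors, and the `approx_ids` list (standing in for whatever your ANN index returned) are all hypothetical.

```python
import math


def brute_force_top_k(query, vectors, k):
    """Exact nearest neighbors by scanning every vector (the ground truth)."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = [(0.1, 0.0), (5.0, 5.0), (0.2, 0.1), (0.3, 0.3), (4.0, 4.0)]
query = (0.0, 0.0)

exact_ids = brute_force_top_k(query, vectors, k=3)
approx_ids = [0, 2, 4]  # hypothetical output of an approximate index
recall = recall_at_k(approx_ids, exact_ids)
print(exact_ids, round(recall, 3))
```

Averaging this value over a representative query set gives you a recall number you can compare across index types and parameter settings on your own data.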
HNSW stands out for workloads that need the fastest possible search on smaller datasets. DiskANN shines when you need to search through massive amounts of data without breaking your budget.
Migration and Hybrid Approaches
You do not have to choose just one algorithm for every situation. Many organizations start with HNSW for smaller datasets. As their vector database grows, they move to DiskANN to handle more data and keep costs down. You can also use a hybrid approach. For example, you might keep your most popular or recent vectors in memory with HNSW, while storing older or less-used vectors on SSDs with DiskANN.
- Start with HNSW for fast prototyping and small workloads.
- Migrate to DiskANN as your dataset grows past memory limits.
- Combine both methods for the best mix of speed and scale.
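A hot/cold hybrid needs one extra step: query both tiers and merge the results by distance. The sketch below illustrates only that merge. The exact scan in `search_tier` stands in for real index lookups, and the tier contents, IDs, and function names are hypothetical.

```python
import heapq
import math


def search_tier(query, tier, k):
    """Stand-in for an ANN index lookup: exact scan over one tier."""
    scored = ((math.dist(query, vec), doc_id) for doc_id, vec in tier.items())
    return heapq.nsmallest(k, scored)


def tiered_search(query, hot, cold, k):
    """Ask each tier for its own top-k, then keep the k best overall."""
    merged = search_tier(query, hot, k) + search_tier(query, cold, k)
    return [doc_id for _, doc_id in heapq.nsmallest(k, merged)]


# Hypothetical tiers: recent vectors in memory, older vectors on SSD.
hot = {"new-1": (0.0, 0.2), "new-2": (2.0, 2.0)}
cold = {"old-1": (0.1, 0.0), "old-2": (0.05, 0.05), "old-3": (9.0, 9.0)}

hits = tiered_search((0.0, 0.0), hot, cold, k=3)
print(hits)
```

Requesting k results from each tier before merging guarantees the overall top-k is not lost when the best matches all sit in one tier. The price is that every query pays the latency of the slower tier.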
Tip: Review your workload regularly. As your data and needs change, you can adjust your approach to get the best performance and value.
By understanding your dataset size, hardware, latency needs, and update patterns, you can make a smart choice between HNSW and DiskANN. This helps you build a vector search system that fits your goals and grows with your business.
Best Practices for Vector Search in Azure
Index Configuration Tips
You can achieve high performance in Azure AI Search by following a few important configuration tips. Start by making sure your index size fits within a single vector search unit. This helps you keep your search fast and reliable. Try to minimize the size of your embeddings. Smaller embeddings improve query times and increase queries per second. Use Approximate Nearest Neighbor queries for efficiency, as they speed up your search without sacrificing much accuracy.
Keep the number of results per query between 10 and 100. This range reduces latency and keeps your search responsive. Avoid using model endpoints that scale to zero in production. Cold starts can slow down your search and frustrate users. Plan for query spikes so your system can handle sudden increases in search traffic. Use service principals with OAuth tokens for secure and efficient authentication. Always use the latest version of the Python SDK to benefit from performance improvements. If you need higher throughput, scale your endpoint or parallelize across multiple endpoints.
Tip: A well-configured index is the foundation of optimized vector search in Azure.
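To make the configuration concrete, here is a sketch of an index definition with a vector field and an HNSW profile, built as a Python dictionary and serialized to JSON. The property names follow the Azure AI Search REST index schema as commonly documented, but treat them as assumptions and verify them against the API version you use. The index name, field names, dimension count, and parameter values are hypothetical. Only the HNSW algorithm kind is shown.

```python
import json

index_definition = {
    "name": "docs-index",  # hypothetical index name
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {
            "name": "contentVector",          # hypothetical vector field
            "type": "Collection(Edm.Single)",
            "searchable": True,
            "dimensions": 1536,               # must match your embedding model
            "vectorSearchProfile": "hnsw-profile",
        },
    ],
    "vectorSearch": {
        "algorithms": [
            {
                "name": "hnsw-config",
                "kind": "hnsw",
                "hnswParameters": {
                    "m": 4,                 # links per node: more = higher recall, more RAM
                    "efConstruction": 400,  # build-time beam width
                    "efSearch": 500,        # query-time beam width
                    "metric": "cosine",
                },
            }
        ],
        "profiles": [{"name": "hnsw-profile", "algorithm": "hnsw-config"}],
    },
}

payload = json.dumps(index_definition, indent=2)
print(payload)
```

The vector field points at a profile, and the profile points at an algorithm configuration, so you can change the tuning parameters in one place. Smaller `dimensions` and `m` values reduce memory per vector, which ties back to the advice on embedding size above.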
Monitoring and Optimization
You need to monitor and optimize your search system to maintain high performance, especially with billion-vector workloads. The table below shows some effective techniques:
| Technique | Description |
|---|---|
| Multi-tier indexes | Use fast indexes for hot data and compress cold data to save space. |
| Fan-out control | Manage shard and cluster interactions to reduce network hops. |
| Quantization | Shrink vector size with methods like IVF-PQ to fit memory limits. |
| Centroid tuning | Adjust the number of centroids to balance recall and overhead. |
| Asymmetric Distance Computation | Speed up search by comparing full-precision queries to compressed vectors. |
| Intentional query routing | Use metadata and clustering to avoid unnecessary broadcasts during search. |
| Minimize network payloads | Transfer only IDs and scores, not full vectors, between nodes. |
| Hierarchical aggregation | Aggregate results locally to reduce coordination and stabilize latency. |
| Memory dominance | Ensure vectors fit in memory for best search performance. |
| Situational GPU use | Use GPUs for indexing if needed, but check if they help your search throughput. |
| Network quality | Monitor network conditions, as they affect search latency. |
Note: Regular monitoring helps you spot issues early and keep your search running smoothly.
Common Pitfalls
You may face some common pitfalls when setting up or scaling your search system in Azure. One mistake is letting your index grow too large for a single search unit. This can slow down your queries and make your search less reliable. Using large embeddings can also hurt performance. Always keep your embeddings as small as possible for your use case.
Another pitfall is ignoring query spikes. If you do not plan for sudden increases in search traffic, your system may fail under pressure. Relying on endpoints that scale to zero can cause cold start delays, which slow down your search. Failing to update your SDK or not using OAuth tokens can lead to security risks and missed performance gains.
Remember: Avoid these pitfalls to keep your search fast, secure, and reliable.
By following these best practices, you can build a robust and efficient search experience in Azure. You will handle large datasets with ease and deliver quick, accurate results to your users.
You solve the billion-vector problem in Azure AI Search by choosing the right algorithm for your needs. DiskANN gives you cost-effective scalability for massive datasets, while HNSW delivers fast results for smaller workloads. You match your algorithm to your workload, scale, and budget. The table below helps you select the best endpoint:
| SKU Type | Use Case Description |
|---|---|
| Standard Endpoints | Best for critical latency needs with indexes under 320M vectors. |
| Storage-Optimized Endpoints | Ideal for 10M+ vectors, tolerating some latency, and requiring cost efficiency. |
- Estimate your storage needs by testing with a few documents.
- Plan for updates and total content volume.
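The two tips above can be sketched as a simple linear extrapolation: index a small sample, measure its size, then scale to your expected volume. The function name, the sample figures, and the `growth_factor` parameter are illustrative assumptions, and real index size may not scale perfectly linearly.

```python
def projected_index_gb(
    sample_docs: int,
    sample_index_bytes: int,
    total_docs: int,
    growth_factor: float = 1.0,
) -> float:
    """Extrapolate index size (decimal GB) from a measured sample.

    growth_factor is a hypothetical multiplier for planned content growth
    and update churn, e.g. 1.5 for 50% headroom.
    """
    bytes_per_doc = sample_index_bytes / sample_docs
    return bytes_per_doc * total_docs * growth_factor / 10**9

# Assumed measurement: 1,000 sample documents produced an 8 MB index.
estimate = projected_index_gb(
    sample_docs=1_000,
    sample_index_bytes=8_000_000,
    total_docs=5_000_000,
    growth_factor=1.5,
)
print(f"Projected index size: {estimate:.1f} GB")
```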
Azure AI Search gives you flexibility and supports Microsoft’s innovations for enterprise AI.
FAQ
What is the main difference between HNSW and DiskANN?
You use HNSW for in-memory vector search. DiskANN lets you store most of your index on SSDs. This makes DiskANN better for handling billions of vectors without high memory costs.
Can you use DiskANN for real-time applications?
Yes. DiskANN delivers low-latency search, often under 10 milliseconds. You can use it for real-time AI search tasks, even with very large datasets.
How do you decide which algorithm to use in Azure AI Search?
You choose HNSW for smaller datasets that fit in RAM. You pick DiskANN for massive datasets that need cost-effective scaling. Consider your data size, hardware, and latency needs.
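As a rough illustration of that decision, the sketch below checks whether an estimated in-memory index fits a RAM budget. The `overhead_factor` of 1.5 for graph structure is an assumption made for this example, not a measured or documented value, and `suggest_algorithm` is a hypothetical helper.

```python
def suggest_algorithm(
    num_vectors: int,
    dimensions: int,
    ram_budget_gb: float,
    overhead_factor: float = 1.5,
) -> str:
    """Suggest 'HNSW' if the estimated in-memory index fits the RAM budget.

    Estimate = float32 vectors (4 bytes per value) times an assumed
    overhead factor for the graph. Otherwise suggest 'DiskANN'.
    """
    estimated_gb = num_vectors * dimensions * 4 * overhead_factor / 10**9
    return "HNSW" if estimated_gb <= ram_budget_gb else "DiskANN"

print(suggest_algorithm(1_000_000, 1536, ram_budget_gb=64))
print(suggest_algorithm(1_000_000_000, 1536, ram_budget_gb=512))
```

A real decision should also weigh latency targets, replica count, and expected growth, so use a check like this only as a first filter.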
Does DiskANN require special hardware?
No. You can run DiskANN on standard servers with SSDs. You do not need expensive, high-memory machines. This makes scaling easier and more affordable.
How does Azure AI Search keep your data secure?
Azure AI Search uses encryption and access controls. You control who can access your data. Microsoft’s security features help you protect sensitive information.
Can you update your vector index without downtime?
You can update both HNSW and DiskANN indexes. DiskANN supports efficient updates for large datasets. You keep your search results fresh without taking your system offline.
What are some best practices for optimizing vector search performance?
Use smaller embeddings, keep your index within a single search unit, and monitor query latency. Always update your SDK and plan for query spikes to maintain high performance.
1
00:00:00,000 --> 00:00:03,400
Most architects default to HNSW because it is the industry standard.
2
00:00:03,400 --> 00:00:05,360
It is what the documentation recommends,
3
00:00:05,360 --> 00:00:06,760
what the tutorials use,
4
00:00:06,760 --> 00:00:09,360
and what every vector database ships by default.
5
00:00:09,360 --> 00:00:11,280
For a long time, following that default was fine,
6
00:00:11,280 --> 00:00:13,920
but at scale, that default becomes a financial liability.
7
00:00:13,920 --> 00:00:16,240
Here is the number that changes the conversation.
8
00:00:16,240 --> 00:00:20,760
One billion embeddings require about six terabytes of RAM when you use HNSW.
9
00:00:20,760 --> 00:00:22,240
We are not talking about storage here.
10
00:00:22,240 --> 00:00:23,600
We are talking about RAM.
11
00:00:23,600 --> 00:00:25,240
That is not a search strategy.
12
00:00:25,240 --> 00:00:27,400
It is a budget crisis waiting to happen.
13
00:00:27,400 --> 00:00:30,880
And most teams do not see it coming until they are already stuck in it.
14
00:00:30,880 --> 00:00:31,960
By the end of this episode,
15
00:00:31,960 --> 00:00:34,680
you will know exactly when HNSW is the right call
16
00:00:34,680 --> 00:00:37,160
and when DiskANN becomes the only rational choice.
17
00:00:37,160 --> 00:00:39,480
You will have a decision framework tied to four variables
18
00:00:39,480 --> 00:00:41,000
that actually drive the outcome
19
00:00:41,000 --> 00:00:43,680
and a cost model that translates algorithm choice
20
00:00:43,680 --> 00:00:45,800
into a real Azure invoice impact.
21
00:00:45,800 --> 00:00:47,600
If you want to stay ahead of architecture decisions
22
00:00:47,600 --> 00:00:49,320
before they become expensive mistakes,
23
00:00:49,320 --> 00:00:50,480
you should subscribe now,
24
00:00:50,480 --> 00:00:53,040
because that is exactly what this channel is for.
25
00:00:53,040 --> 00:00:54,840
What vector search actually does.
26
00:00:54,840 --> 00:00:57,160
Most explanations of vector search start with math
27
00:00:57,160 --> 00:00:59,720
like cosine similarity, high dimensional space,
28
00:00:59,720 --> 00:01:01,160
or Euclidean distance.
29
00:01:01,160 --> 00:01:03,720
Within about 30 seconds, half the audience checks out
30
00:01:03,720 --> 00:01:05,320
because it feels like a graduate seminar
31
00:01:05,320 --> 00:01:07,320
instead of a decision they need to make by Thursday.
32
00:01:07,320 --> 00:01:08,920
So let's start somewhere different.
33
00:01:08,920 --> 00:01:10,240
Let's start with the problem.
34
00:01:10,240 --> 00:01:11,560
Imagine you have a knowledge base
35
00:01:11,560 --> 00:01:13,760
with thousands of documents, including policies,
36
00:01:13,760 --> 00:01:15,800
product specs, and support tickets.
37
00:01:15,800 --> 00:01:18,240
A user types a question asking for the process
38
00:01:18,240 --> 00:01:20,880
to escalate a customer complaint in the EMEA region.
39
00:01:20,880 --> 00:01:23,120
That question does not contain the exact words
40
00:01:23,120 --> 00:01:24,880
found in your escalation policy.
41
00:01:24,880 --> 00:01:26,560
The policy uses the phrase,
42
00:01:26,560 --> 00:01:28,320
"Complaint resolution pathway,"
43
00:01:28,320 --> 00:01:30,640
while the user said, "escalating a complaint."
44
00:01:30,640 --> 00:01:33,560
Traditional keyword search, which we call lexical search,
45
00:01:33,560 --> 00:01:35,320
looks for the specific words.
46
00:01:35,320 --> 00:01:38,400
It finds documents that match escalating and complaint
47
00:01:38,400 --> 00:01:41,040
and returns whatever scores highest by term frequency.
48
00:01:41,040 --> 00:01:42,640
Sometimes it works, but often it fails
49
00:01:42,640 --> 00:01:43,960
because the user and the document
50
00:01:43,960 --> 00:01:46,400
are using different words to describe the same concept.
51
00:01:46,400 --> 00:01:48,200
That is the problem vector search solves.
52
00:01:48,200 --> 00:01:50,040
It does not try to be smarter about keywords.
53
00:01:50,040 --> 00:01:51,480
It abandons keywords entirely.
54
00:01:51,480 --> 00:01:52,880
Here's how it actually works.
55
00:01:52,880 --> 00:01:54,520
You take a document and run it through
56
00:01:54,520 --> 00:01:56,600
an embedding model, which is a neural network
57
00:01:56,600 --> 00:01:59,240
trained specifically to convert meaning into numbers.
58
00:01:59,240 --> 00:02:01,760
Text goes in and a long array of floating point numbers
59
00:02:01,760 --> 00:02:02,480
comes out.
60
00:02:02,480 --> 00:02:06,680
That array might be 768 numbers long or 1536,
61
00:02:06,680 --> 00:02:08,880
or even more depending on the model you use.
62
00:02:08,880 --> 00:02:12,680
Each number represents some aspect of the meaning of that text,
63
00:02:12,680 --> 00:02:15,080
though not in any way humans can directly interpret.
64
00:02:15,080 --> 00:02:16,240
What matters is the pattern
65
00:02:16,240 --> 00:02:17,760
because documents with similar meaning
66
00:02:17,760 --> 00:02:20,440
produce arrays that are numerically close to each other.
67
00:02:20,440 --> 00:02:22,920
That array is what we call a vector or an embedding.
68
00:02:22,920 --> 00:02:24,960
When you index a document for vector search,
69
00:02:24,960 --> 00:02:26,320
you are not storing the words.
70
00:02:26,320 --> 00:02:28,040
You are storing that array, which represents
71
00:02:28,040 --> 00:02:30,120
a point in high-dimensional space.
72
00:02:30,120 --> 00:02:31,840
Similar documents cluster near each other
73
00:02:31,840 --> 00:02:33,960
while dissimilar documents stay far apart.
74
00:02:33,960 --> 00:02:35,280
Now, when a user submits a query,
75
00:02:35,280 --> 00:02:36,640
the same thing happens on the other side.
76
00:02:36,640 --> 00:02:38,680
The query goes through the same embedding model
77
00:02:38,680 --> 00:02:40,200
and produces its own vector.
78
00:02:40,200 --> 00:02:42,320
The search system then finds the document vectors
79
00:02:42,320 --> 00:02:44,200
that are closest to the query vector.
80
00:02:44,200 --> 00:02:45,960
It does not care whether the words matched.
81
00:02:45,960 --> 00:02:48,800
It cares about proximity in that mathematical space.
82
00:02:48,800 --> 00:02:50,600
And proximity in that space correlates
83
00:02:50,600 --> 00:02:52,120
with similarity in meaning.
84
00:02:52,120 --> 00:02:53,520
This is why the system can retrieve
85
00:02:53,520 --> 00:02:55,760
complaint resolution pathway in response
86
00:02:55,760 --> 00:02:57,480
to escalating a complaint.
87
00:02:57,480 --> 00:02:59,280
Since they describe the same concept,
88
00:02:59,280 --> 00:03:00,760
their vectors are close together.
89
00:03:00,760 --> 00:03:02,560
And the search returns the right document,
90
00:03:02,560 --> 00:03:04,400
even though the words do not overlap.
91
00:03:04,400 --> 00:03:07,320
Azure AI Search sits on top of this infrastructure.
92
00:03:07,320 --> 00:03:10,480
It stores those vector arrays alongside the original documents
93
00:03:10,480 --> 00:03:12,760
and builds a navigable index over those arrays
94
00:03:12,760 --> 00:03:15,160
so it can find nearest neighbors quickly.
95
00:03:15,160 --> 00:03:16,920
The embedding model itself runs separately
96
00:03:16,920 --> 00:03:18,800
typically through Azure OpenAI
97
00:03:18,800 --> 00:03:21,400
and the resulting vectors flow into Azure AI Search
98
00:03:21,400 --> 00:03:22,960
for storage and retrieval.
99
00:03:22,960 --> 00:03:24,520
But here's the thing nobody talks about
100
00:03:24,520 --> 00:03:26,120
in the initial architecture meeting.
101
00:03:26,120 --> 00:03:28,920
The index those vectors live in has a physical size.
102
00:03:28,920 --> 00:03:31,760
That size scales with your document count.
103
00:03:31,760 --> 00:03:34,320
It scales with the dimensionality of your embeddings
104
00:03:34,320 --> 00:03:37,120
and it scales with the graph parameters you configure.
105
00:03:37,120 --> 00:03:39,760
Whether that index lives in RAM or on an SSD
106
00:03:39,760 --> 00:03:42,280
is the decision that turns a manageable infrastructure bill
107
00:03:42,280 --> 00:03:44,680
into an unmanageable one at enterprise scale.
108
00:03:44,680 --> 00:03:47,920
For small deployments, that variable barely registers.
109
00:03:47,920 --> 00:03:50,040
If a team is indexing 10,000 documents,
110
00:03:50,040 --> 00:03:51,840
the index fits comfortably in memory,
111
00:03:51,840 --> 00:03:54,440
latency is excellent and the cost is predictable.
112
00:03:54,440 --> 00:03:56,840
But organizations that start with 10,000 documents
113
00:03:56,840 --> 00:03:57,960
rarely stay there.
114
00:03:57,960 --> 00:04:00,800
Knowledge bases grow as new products launch, acquisitions happen,
115
00:04:00,800 --> 00:04:02,680
and support ticket history accumulates.
116
00:04:02,680 --> 00:04:04,920
The index that was fine at a million vectors
117
00:04:04,920 --> 00:04:06,960
starts behaving very differently at 50 million
118
00:04:06,960 --> 00:04:09,360
and it becomes very expensive at 500 million.
119
00:04:09,360 --> 00:04:11,560
The size of the index is the variable nobody talks about
120
00:04:11,560 --> 00:04:12,800
until the invoice arrives.
121
00:04:12,800 --> 00:04:14,840
That is what this episode is actually about.
122
00:04:14,840 --> 00:04:16,720
The approximate nearest neighbor problem.
123
00:04:16,720 --> 00:04:17,840
You have a vector index.
124
00:04:17,840 --> 00:04:20,720
Millions of documents have been converted into floating point arrays
125
00:04:20,720 --> 00:04:22,920
and they're sitting there ready to search.
126
00:04:22,920 --> 00:04:23,800
A query comes in.
127
00:04:23,800 --> 00:04:25,600
It gets turned into its own vector.
128
00:04:25,600 --> 00:04:27,840
Now the system has to find the closest matches.
129
00:04:27,840 --> 00:04:30,120
The mathematically perfect way to do this is simple.
130
00:04:30,120 --> 00:04:33,160
You compare the query vector against every single document
131
00:04:33,160 --> 00:04:36,080
vector in your index, calculate the distance for each one,
132
00:04:36,080 --> 00:04:38,560
sort them and return the top results.
133
00:04:38,560 --> 00:04:41,160
This is called exact nearest neighbor search.
134
00:04:41,160 --> 00:04:43,040
It's perfectly accurate and it will always
135
00:04:43,040 --> 00:04:44,800
give you the true closest matches.
136
00:04:44,800 --> 00:04:47,680
But at any real enterprise scale, it's completely useless.
137
00:04:47,680 --> 00:04:49,000
The reason is simple math.
138
00:04:49,000 --> 00:04:51,200
If you have 100 million documents in your index
139
00:04:51,200 --> 00:04:53,320
and a query arrives, an exact search
140
00:04:53,320 --> 00:04:55,560
requires 100 million distance calculations
141
00:04:55,560 --> 00:04:57,680
before you can return a single result.
142
00:04:57,680 --> 00:05:00,280
At a billion documents, you're looking at a billion calculations.
143
00:05:00,280 --> 00:05:02,560
The mathematics doesn't care about your latency requirements
144
00:05:02,560 --> 00:05:04,080
but your users definitely do.
145
00:05:04,080 --> 00:05:06,440
Exact search at scale produces answers that are technically
146
00:05:06,440 --> 00:05:09,080
correct, but they arrive about three seconds too late
147
00:05:09,080 --> 00:05:10,240
to actually matter.
148
00:05:10,240 --> 00:05:12,960
This creates a fundamental tension that every vector search
149
00:05:12,960 --> 00:05:14,280
system has to solve.
150
00:05:14,280 --> 00:05:15,720
How do you find the closest matches
151
00:05:15,720 --> 00:05:17,840
without checking every possible match?
152
00:05:17,840 --> 00:05:19,760
The answer is approximate nearest neighbor search,
153
00:05:19,760 --> 00:05:21,560
which everyone calls ANN.
154
00:05:21,560 --> 00:05:24,240
ANN algorithms don't promise to find the absolute closest
155
00:05:24,240 --> 00:05:25,320
match every time.
156
00:05:25,320 --> 00:05:27,680
Instead, they promise to find a match that is close enough
157
00:05:27,680 --> 00:05:30,800
and fast enough to be useful in a real production environment.
158
00:05:30,800 --> 00:05:33,520
We measure this trade-off using a metric called recall.
159
00:05:33,520 --> 00:05:36,360
In this context, recall asks a simple question.
160
00:05:36,360 --> 00:05:39,000
Out of all the truly relevant documents in the index,
161
00:05:39,000 --> 00:05:41,800
what percentage did the ANN search actually find?
162
00:05:41,800 --> 00:05:45,560
If your index has 95% recall, it's returning 95%
163
00:05:45,560 --> 00:05:47,360
of what an exact search would have found,
164
00:05:47,360 --> 00:05:49,840
but it's doing it in milliseconds instead of seconds.
165
00:05:49,840 --> 00:05:52,400
For most production RAG systems, 95% recall
166
00:05:52,400 --> 00:05:53,640
is the standard threshold.
167
00:05:53,640 --> 00:05:56,120
That missing 5% is a fair price to pay for the speed
168
00:05:56,120 --> 00:05:57,640
that makes the system usable.
169
00:05:57,640 --> 00:06:00,560
There are two main ways to handle ANN algorithms.
170
00:06:00,560 --> 00:06:02,800
Partition-based methods divide the vector space
171
00:06:02,800 --> 00:06:04,920
into regions, which is a lot like dividing a map
172
00:06:04,920 --> 00:06:06,200
into different counties.
173
00:06:06,200 --> 00:06:08,440
When a query comes in, the system figures out which county
174
00:06:08,440 --> 00:06:10,960
it belongs in and only searches that specific area.
175
00:06:10,960 --> 00:06:13,240
This is fast, but you might miss answers
176
00:06:13,240 --> 00:06:15,880
that are sitting right on the border of two regions.
177
00:06:15,880 --> 00:06:17,720
The other option is graph-based methods.
178
00:06:17,720 --> 00:06:19,640
Instead of drawing borders, these algorithms
179
00:06:19,640 --> 00:06:21,880
build a network of connections between vectors
180
00:06:21,880 --> 00:06:24,400
where each node connects to its nearest neighbors.
181
00:06:24,400 --> 00:06:26,280
Searching means navigating that network.
182
00:06:26,280 --> 00:06:28,560
You start at a random point, follow the connections
183
00:06:28,560 --> 00:06:30,680
toward the query and converge on the answer.
184
00:06:30,680 --> 00:06:32,880
Both HNSW and DiskANN are graph-based.
185
00:06:32,880 --> 00:06:34,320
That shared foundation matters because it
186
00:06:34,320 --> 00:06:36,800
means their quality is actually very similar.
187
00:06:36,800 --> 00:06:40,480
Both can hit that 95% recall mark if you tune them correctly.
188
00:06:40,480 --> 00:06:42,480
The real difference isn't about how well they search.
189
00:06:42,480 --> 00:06:45,040
It's about where the graph they're navigating actually lives.
190
00:06:45,040 --> 00:06:47,160
HNSW keeps its entire graph in RAM.
191
00:06:47,160 --> 00:06:49,440
Every node, every connection, and every vector
192
00:06:49,440 --> 00:06:52,400
stays in memory so the search can happen at memory speeds.
193
00:06:52,400 --> 00:06:53,960
DiskANN does things differently.
194
00:06:53,960 --> 00:06:56,080
It keeps a compressed version of the graph in RAM
195
00:06:56,080 --> 00:06:58,560
for navigation, but it stores the full-precision graph
196
00:06:58,560 --> 00:07:00,320
on an SSD for verification.
197
00:07:00,320 --> 00:07:02,680
The search starts in memory and only jumps to the disk
198
00:07:02,680 --> 00:07:04,840
when it needs the exact vectors to confirm a match.
199
00:07:04,840 --> 00:07:06,960
This choice between RAM and SSD sounds
200
00:07:06,960 --> 00:07:08,400
like a minor technical detail.
201
00:07:08,400 --> 00:07:09,200
It isn't.
202
00:07:09,200 --> 00:07:10,600
This is the decision that determines
203
00:07:10,600 --> 00:07:14,160
if your infrastructure costs $10,000 a month or $100,000
204
00:07:14,160 --> 00:07:16,560
once your data reaches a certain size.
205
00:07:16,560 --> 00:07:18,000
Everything we're discussing today
206
00:07:18,000 --> 00:07:20,040
is a result of that one structural difference.
207
00:07:20,040 --> 00:07:21,880
You have to decide where the graph lives
208
00:07:21,880 --> 00:07:24,800
and what that location costs as your data grows.
209
00:07:24,800 --> 00:07:26,720
HNSW was the first to arrive,
210
00:07:26,720 --> 00:07:29,520
and it became the gold standard for some very good reasons.
211
00:07:29,520 --> 00:07:32,880
Understanding those reasons and the assumption hidden inside them
212
00:07:32,880 --> 00:07:36,720
is the only way to understand why DiskANN was even invented.
213
00:07:36,720 --> 00:07:39,200
HNSW, the gold standard explained.
214
00:07:39,200 --> 00:07:42,480
HNSW stands for Hierarchical Navigable Small World.
215
00:07:42,480 --> 00:07:44,280
Every word in that name is doing heavy lifting,
216
00:07:44,280 --> 00:07:46,680
so we need to unpack it before we look at the cost problems
217
00:07:46,680 --> 00:07:47,520
it creates.
218
00:07:47,520 --> 00:07:49,640
Hierarchical means the index is built in layers,
219
00:07:49,640 --> 00:07:50,840
just like a pyramid.
220
00:07:50,840 --> 00:07:53,520
The bottom layer holds every single vector in your data set,
221
00:07:53,520 --> 00:07:56,360
including every document and every chunk you've ever generated.
222
00:07:56,360 --> 00:07:59,160
Above that is a thinner layer with a random subset
223
00:07:59,160 --> 00:07:59,880
of those vectors.
224
00:07:59,880 --> 00:08:01,400
The layers get thinner as you go up.
225
00:08:01,400 --> 00:08:03,680
The very top layer might only have a few nodes,
226
00:08:03,680 --> 00:08:05,280
but those nodes are connected to each other
227
00:08:05,280 --> 00:08:07,600
across huge distances in the vector space.
228
00:08:07,600 --> 00:08:10,120
Navigable small world describes how the search actually
229
00:08:10,120 --> 00:08:11,560
moves through this pyramid.
230
00:08:11,560 --> 00:08:13,680
Each node connects to its closest neighbors,
231
00:08:13,680 --> 00:08:15,400
but it also has long-range connections
232
00:08:15,400 --> 00:08:17,320
that let the search jump across the space.
233
00:08:17,320 --> 00:08:20,440
The small world property means any two nodes in the graph
234
00:08:20,440 --> 00:08:22,760
are reachable in a tiny number of hops.
235
00:08:22,760 --> 00:08:24,720
It's a lot like a social network where you're usually
236
00:08:24,720 --> 00:08:26,920
only six people away from anyone else on Earth.
237
00:08:26,920 --> 00:08:29,440
When you put these together, you get a very specific search
238
00:08:29,440 --> 00:08:30,240
strategy.
239
00:08:30,240 --> 00:08:32,720
You start at the top layer using those long-range connections
240
00:08:32,720 --> 00:08:34,480
to find the general neighborhood of the query.
241
00:08:34,480 --> 00:08:36,160
Then you move down through the layers,
242
00:08:36,160 --> 00:08:37,840
zooming in with every step until you're
243
00:08:37,840 --> 00:08:40,240
navigating the detailed connections at the very bottom.
244
00:08:40,240 --> 00:08:42,840
Think of it like highway driving followed by side streets.
245
00:08:42,840 --> 00:08:44,760
It's faster across long distances and precise
246
00:08:44,760 --> 00:08:45,920
once you get close.
247
00:08:45,920 --> 00:08:48,040
Two main settings control how this works.
248
00:08:48,040 --> 00:08:50,600
The first is M, which decides how many connections
249
00:08:50,600 --> 00:08:52,320
each node keeps in the graph.
250
00:08:52,320 --> 00:08:55,120
A higher M value gives you better connectivity and faster
251
00:08:55,120 --> 00:08:57,320
searches, but it also creates a larger graph
252
00:08:57,320 --> 00:08:58,520
that eats up more memory.
253
00:08:58,520 --> 00:09:00,120
The second is efSearch.
254
00:09:00,120 --> 00:09:02,320
This controls how wide the search beam is when you're
255
00:09:02,320 --> 00:09:03,600
actually running a query.
256
00:09:03,600 --> 00:09:05,680
A higher efSearch gives you better recall,
257
00:09:05,680 --> 00:09:08,360
but it requires more computing power for every single search.
258
00:09:08,360 --> 00:09:11,200
These parameters are easy to tune, which is why HNSW
259
00:09:11,200 --> 00:09:13,360
became the default choice for almost everyone.
260
00:09:13,360 --> 00:09:15,600
You can dial in the balance between speed and accuracy
261
00:09:15,600 --> 00:09:16,840
for your specific needs.
262
00:09:16,840 --> 00:09:19,200
You can aim for sub-millisecond speeds on small data sets
263
00:09:19,200 --> 00:09:20,840
or expand the search for better accuracy
264
00:09:20,840 --> 00:09:21,880
when you have more time.
265
00:09:21,880 --> 00:09:24,120
The algorithm is clear, it's well documented,
266
00:09:24,120 --> 00:09:25,880
and the behavior is very predictable.
267
00:09:25,880 --> 00:09:27,920
That predictability is exactly why HNSW
268
00:09:27,920 --> 00:09:29,720
became the industry standard.
269
00:09:29,720 --> 00:09:32,640
When vector databases like Milvus, Qdrant, or Weaviate
270
00:09:32,640 --> 00:09:35,480
needed a default index, they chose HNSW.
271
00:09:35,480 --> 00:09:37,560
When pgvector added search to PostgreSQL,
272
00:09:37,560 --> 00:09:39,280
they used HNSW.
273
00:09:39,280 --> 00:09:41,480
Even the early versions of Azure AI Search
274
00:09:41,480 --> 00:09:43,640
relied on this style of in-memory indexing.
275
00:09:43,640 --> 00:09:46,120
The algorithm earned its spot because it actually works,
276
00:09:46,120 --> 00:09:48,040
but because it works so well, it's easy
277
00:09:48,040 --> 00:09:50,960
to miss the massive assumption buried inside the code.
278
00:09:50,960 --> 00:09:54,200
The entire design of HNSW relies on one single requirement.
279
00:09:54,200 --> 00:09:55,960
The whole graph must fit in RAM.
280
00:09:55,960 --> 00:09:58,280
Every node, every edge, and every vector
281
00:09:58,280 --> 00:10:01,000
has to live in memory before you can even start a query.
282
00:10:01,000 --> 00:10:03,760
When you can meet that condition, HNSW is incredible,
283
00:10:03,760 --> 00:10:05,080
but when that condition breaks,
284
00:10:05,080 --> 00:10:07,160
the algorithm doesn't just slow down a little bit.
285
00:10:07,160 --> 00:10:09,440
Maybe your dataset grew faster than your budget,
286
00:10:09,440 --> 00:10:12,040
or perhaps you had to create a full replica in another region.
287
00:10:12,040 --> 00:10:14,880
Suddenly, your infrastructure costs don't just nudge upward.
288
00:10:14,880 --> 00:10:15,640
They jump.
289
00:10:15,640 --> 00:10:17,600
You're no longer buying standard storage,
290
00:10:17,600 --> 00:10:20,320
and now you're shopping for memory-optimized virtual machines
291
00:10:20,320 --> 00:10:24,360
where a gigabyte of RAM costs way more than a gigabyte of SSD space.
292
00:10:24,360 --> 00:10:27,000
The algorithm has no way to fix this on its own.
293
00:10:27,000 --> 00:10:28,480
It doesn't know about your cloud bill,
294
00:10:28,480 --> 00:10:30,800
and it doesn't care that your index just crossed the line
295
00:10:30,800 --> 00:10:32,600
from expensive to unaffordable.
296
00:10:32,600 --> 00:10:34,440
It just keeps doing what it was designed to do,
297
00:10:34,440 --> 00:10:36,360
which is fast in-memory traversal.
298
00:10:36,360 --> 00:10:39,600
It leaves the massive infrastructure bill entirely in your hands.
299
00:10:39,600 --> 00:10:42,600
That hidden assumption that the graph will always fit in RAM
300
00:10:42,600 --> 00:10:46,120
and that RAM will stay cheap is what Microsoft Research wanted to solve.
301
00:10:46,120 --> 00:10:49,120
They didn't do it because HNSW is a bad algorithm.
302
00:10:49,120 --> 00:10:50,720
They did it because that assumption eventually
303
00:10:50,720 --> 00:10:52,400
fails for every large company,
304
00:10:52,400 --> 00:10:54,000
and most enterprise data is heading
305
00:10:54,000 --> 00:10:56,280
toward that breaking point right now.
306
00:10:56,280 --> 00:11:00,720
The HNSW memory wall. The numbers from Microsoft's own Cosmos DB benchmarks
307
00:11:00,720 --> 00:11:02,360
tell a story that's hard to ignore.
308
00:11:02,360 --> 00:11:03,800
If you want to store one million vectors,
309
00:11:03,800 --> 00:11:05,560
you're looking at over 12 gigabytes of RAM
310
00:11:05,560 --> 00:11:07,200
just for the HNSW index,
311
00:11:07,200 --> 00:11:09,360
and that doesn't include your raw document text,
312
00:11:09,360 --> 00:11:11,560
or the original embeddings you need for retrieval.
313
00:11:11,560 --> 00:11:13,600
We're talking about the index structure itself,
314
00:11:13,600 --> 00:11:16,920
the graph, the connections, and the full-precision vectors,
315
00:11:16,920 --> 00:11:19,840
eating up 12 gigabytes for every million entries you add.
316
00:11:19,840 --> 00:11:22,880
When you scale that up, the math starts to look pretty ugly.
317
00:11:22,880 --> 00:11:26,280
10 million vectors will cost you more than 120 gigabytes,
318
00:11:26,280 --> 00:11:29,520
and 100 million will put you at roughly 1.2 terabytes.
319
00:11:29,520 --> 00:11:31,080
By the time you hit a billion vectors,
320
00:11:31,080 --> 00:11:33,520
you're looking at about six terabytes of RAM.
321
00:11:33,520 --> 00:11:36,640
This assumes the best case scenario where your settings stay the same,
322
00:11:36,640 --> 00:11:39,880
but in reality, larger data sets usually need higher M values
323
00:11:39,880 --> 00:11:41,480
to keep the search quality high,
324
00:11:41,480 --> 00:11:43,880
which only pushes that memory bill further up.
325
00:11:43,880 --> 00:11:47,200
You won't find six terabytes of RAM on a standard virtual machine.
326
00:11:47,200 --> 00:11:49,440
On Azure, the memory-optimized instances
327
00:11:49,440 --> 00:11:51,160
that can actually handle that capacity
328
00:11:51,160 --> 00:11:53,560
like the M-series or Mv2-series are priced
329
00:11:53,560 --> 00:11:56,240
to reflect how rare that kind of hardware really is.
330
00:11:56,240 --> 00:11:59,400
At that level, you aren't paying for the processor cores anymore.
331
00:11:59,400 --> 00:12:01,200
You're paying for the specialized engineering
332
00:12:01,200 --> 00:12:04,520
it takes to cram terabytes of RAM into a single server
333
00:12:04,520 --> 00:12:07,000
and keep it stable under a heavy production load.
334
00:12:07,000 --> 00:12:10,280
The problem gets worse because HNSW doesn't run on just one node
335
00:12:10,280 --> 00:12:11,640
in a real-world setup.
336
00:12:11,640 --> 00:12:13,680
You need replicas for basic availability,
337
00:12:13,680 --> 00:12:16,400
which usually means two or three copies of your data.
338
00:12:16,400 --> 00:12:18,800
Since you can't share the graph across these replicas,
339
00:12:18,800 --> 00:12:22,080
each one needs its own full independent copy sitting in memory.
340
00:12:22,080 --> 00:12:25,480
That six-terabyte requirement for a billion vectors suddenly jumps
341
00:12:25,480 --> 00:12:29,200
to 12 terabytes for two replicas or 18 terabytes for three.
342
00:12:29,200 --> 00:12:32,960
And if you go multi-region, those costs multiply again.
343
00:12:32,960 --> 00:12:36,760
This is the moment where a technical choice turns into a massive budget problem.
344
00:12:36,760 --> 00:12:39,120
A team building a search system with 10 million vectors
345
00:12:39,120 --> 00:12:41,200
didn't make a mistake by choosing HNSW
346
00:12:41,200 --> 00:12:44,160
because the memory is manageable and the speed is great at that size.
347
00:12:44,160 --> 00:12:47,640
The real danger is that the system feels stable when it actually isn't.
348
00:12:47,640 --> 00:12:49,080
The growth curve is still climbing
349
00:12:49,080 --> 00:12:51,520
and the cost curve is just waiting to catch up with it.
350
00:12:51,520 --> 00:12:53,280
There is another limit that people often overlook
351
00:12:53,280 --> 00:12:54,760
and that's the dimensionality cap.
352
00:12:54,760 --> 00:12:58,320
The version of HNSW used in Azure Database for PostgreSQL caps
353
00:12:58,320 --> 00:13:00,720
its native support at 2000 dimensions.
354
00:13:00,720 --> 00:13:05,280
Most modern models from Azure OpenAI use 1,536 dimensions,
355
00:13:05,280 --> 00:13:08,800
which fits for now, but newer models are already pushing those limits.
356
00:13:08,800 --> 00:13:11,760
When your model produces vectors that HNSW can't handle,
357
00:13:11,760 --> 00:13:14,280
you're forced into workarounds like reducing dimensions
358
00:13:14,280 --> 00:13:18,320
or switching index types, both of which add complexity and hurt your accuracy.
359
00:13:18,320 --> 00:13:21,760
In most organizations, the experience follows a very specific pattern.
360
00:13:21,760 --> 00:13:23,560
The system runs perfectly at first,
361
00:13:23,560 --> 00:13:25,200
the team is happy with the performance
362
00:13:25,200 --> 00:13:27,880
and document ingestion continues as it always does.
363
00:13:27,880 --> 00:13:30,440
New product releases and support tickets keep flowing in
364
00:13:30,440 --> 00:13:32,320
and the vector count slowly climbs.
365
00:13:32,320 --> 00:13:35,120
Once you hit a threshold of maybe 50 to 80 million vectors,
366
00:13:35,120 --> 00:13:38,120
the conversation changes from performance tuning to procurement.
367
00:13:38,120 --> 00:13:41,120
Teams hit the memory wall slowly over several months of growth,
368
00:13:41,120 --> 00:13:43,640
but the realization usually happens all at once.
369
00:13:43,640 --> 00:13:45,320
Someone finally runs the projections
370
00:13:45,320 --> 00:13:49,000
and realizes the infrastructure costs won't be sustainable in 18 months.
371
00:13:49,000 --> 00:13:50,960
That moment of panic, usually under a deadline,
372
00:13:50,960 --> 00:13:53,600
is the worst time to realize you should have used a different algorithm
373
00:13:53,600 --> 00:13:54,720
from the very beginning.
374
00:13:54,720 --> 00:13:56,560
Microsoft Research saw this coming years
375
00:13:56,560 --> 00:13:58,440
before most enterprise teams did.
376
00:13:58,440 --> 00:14:00,400
They developed DiskANN as the solution
377
00:14:00,400 --> 00:14:02,520
and the entire design starts with a different premise
378
00:14:02,520 --> 00:14:04,840
about where your data should actually live.
379
00:14:04,840 --> 00:14:07,200
DiskANN: the architecture of the answer.
380
00:14:07,200 --> 00:14:10,160
DiskANN stands for disk-accelerated nearest neighbors
381
00:14:10,160 --> 00:14:12,480
and the name tells you exactly what the goal is.
382
00:14:12,480 --> 00:14:16,720
It isn't just faster or smarter search, it's disk accelerated.
383
00:14:16,720 --> 00:14:18,520
The designers asked a simple question,
384
00:14:18,520 --> 00:14:21,040
what if the index didn't have to live in RAM?
385
00:14:21,040 --> 00:14:25,160
It sounds like an obvious fix because SSDs are significantly cheaper than RAM.
386
00:14:25,160 --> 00:14:28,560
If you can move the index to an SSD without ruining your query speed,
387
00:14:28,560 --> 00:14:30,240
the cost of scaling changes completely.
388
00:14:30,240 --> 00:14:33,920
The hard part is that random disk access is much slower than memory access.
389
00:14:33,920 --> 00:14:38,080
If a search algorithm has to make thousands of random reads for every single query,
390
00:14:38,080 --> 00:14:40,840
it would be too slow to use on anything other than RAM.
391
00:14:40,840 --> 00:14:42,720
The breakthrough with DiskANN is the realization
392
00:14:42,720 --> 00:14:44,640
that the storage medium isn't the bottleneck.
393
00:14:44,640 --> 00:14:46,120
Random access is the bottleneck.
394
00:14:46,120 --> 00:14:48,320
If you design the graph and the layout on the disk
395
00:14:48,320 --> 00:14:51,880
to minimize those random reads, NVMe SSDs are actually fast enough
396
00:14:51,880 --> 00:14:54,000
to compete with in-memory search.
397
00:14:54,000 --> 00:14:55,920
Microsoft bet on this engineering shift
398
00:14:55,920 --> 00:14:58,160
and the production data shows that the bet paid off.
399
00:14:58,160 --> 00:15:00,280
The system uses something called the Vamana graph.
400
00:15:00,280 --> 00:15:02,960
Unlike the multi-layer setup you see in HNSW,
401
00:15:02,960 --> 00:15:07,840
Vamana uses a single layer with edges that are specifically designed to be long and sparse.
402
00:15:07,840 --> 00:15:11,360
While HNSW uses different layers for local and long-range connections,
403
00:15:11,360 --> 00:15:16,000
Vamana focuses on long-range jumps that let the search move through the data in fewer steps.
404
00:15:16,000 --> 00:15:18,000
Fewer steps mean fewer disk reads,
405
00:15:18,000 --> 00:15:21,120
which keeps the latency low even though the data is on a disk.
406
00:15:21,120 --> 00:15:24,600
The way the graph is laid out on the SSD is also highly optimized.
407
00:15:24,600 --> 00:15:28,000
Every node's list of neighbors is stored in a single contiguous block.
408
00:15:28,000 --> 00:15:29,600
When the algorithm visits a node,
409
00:15:29,600 --> 00:15:33,800
it pulls all that information in one sequential operation instead of hunting around the disk.
410
00:15:33,800 --> 00:15:36,720
Essentially the algorithm is teaching the hardware how to be efficient,
411
00:15:36,720 --> 00:15:39,640
rather than treating disk access as a problem to be solved.
412
00:15:39,640 --> 00:15:42,080
Two different structures work together to make this happen.
413
00:15:42,080 --> 00:15:45,120
The first is a compressed navigation layer that lives in memory.
414
00:15:45,120 --> 00:15:46,960
These vectors aren't full precision,
415
00:15:46,960 --> 00:15:50,080
but they are detailed enough to help the search navigate the graph
416
00:15:50,080 --> 00:15:51,760
and find the right neighborhood.
417
00:15:51,760 --> 00:15:56,160
Because they are compressed, they take up a tiny fraction of the space that full vectors would.
418
00:15:56,160 --> 00:16:00,080
The second part is the verification layer which lives on the SSD.
419
00:16:00,080 --> 00:16:04,640
This contains the full precision graph with all the high-quality connections and complete vectors.
420
00:16:04,640 --> 00:16:05,680
When you run a search,
421
00:16:05,680 --> 00:16:10,560
the system uses the RAM-based navigation layer to get close to the answer at memory speeds.
422
00:16:10,560 --> 00:16:15,040
Then it jumps to the SSD to pull the full precision data and re-rank the results.
423
00:16:15,040 --> 00:16:18,480
This two-stage process ensures the final answer is accurate
424
00:16:18,480 --> 00:16:20,640
without needing all that data in RAM.
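If it helps to see that two-stage idea in code, here is a minimal sketch. It is not the Vamana graph traversal and it is not Microsoft's implementation: it brute-forces the first stage, and the crude scalar quantizer, the dictionary standing in for the SSD, and every function name are assumptions made for illustration only.

```python
import math
import random

def quantize(vec, step=0.25):
    """Crude scalar quantization: a stand-in for the compressed in-memory vectors."""
    return tuple(round(x / step) * step for x in vec)

def two_stage_search(query, compressed, full_store, k=3, candidates=10):
    # Stage 1: build a shortlist using only the compressed vectors (the "RAM" tier).
    shortlist = sorted(compressed, key=lambda i: math.dist(query, compressed[i]))[:candidates]
    # Stage 2: fetch full-precision vectors for the shortlist (the "SSD" tier) and re-rank.
    reranked = sorted(shortlist, key=lambda i: math.dist(query, full_store[i]))
    return reranked[:k]

random.seed(0)
# Hypothetical toy corpus: 500 vectors with 8 dimensions each.
full_store = {i: tuple(random.random() for _ in range(8)) for i in range(500)}
compressed = {i: quantize(v) for i, v in full_store.items()}
query = tuple(random.random() for _ in range(8))

approx = two_stage_search(query, compressed, full_store)
exact = sorted(full_store, key=lambda i: math.dist(query, full_store[i]))[:3]
print("two-stage:", approx)
print("exact:    ", exact)
```

The shortlist size is the knob here: a larger shortlist means more full-precision reads but fewer misses caused by compression noise.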
425
00:16:20,640 --> 00:16:24,400
This design is the reason DiskANN can maintain over 95% recall
426
00:16:24,400 --> 00:16:26,400
while keeping most of its data on a disk.
427
00:16:26,400 --> 00:16:29,200
The compression might add a little bit of noise during the initial search,
428
00:16:29,200 --> 00:16:33,040
but the full precision check at the end cleans everything up before you ever see the results.
429
00:16:33,040 --> 00:16:37,280
The impact on your memory footprint is where the cost argument becomes impossible to ignore.
430
00:16:37,280 --> 00:16:41,520
A million vectors in a DiskANN index only need about 200 megabytes of RAM
431
00:16:41,520 --> 00:16:43,600
because the heavy lifting stays on the SSD.
432
00:16:43,600 --> 00:16:48,240
That is roughly one-sixtieth of the memory HNSW requires for the exact same data.
433
00:16:48,240 --> 00:16:52,240
When you scale to a billion vectors, DiskANN needs about 100 gigabytes of RAM
434
00:16:52,240 --> 00:16:54,560
while HNSW would need six terabytes.
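You can sanity-check the billion-vector figures with some back-of-the-envelope arithmetic. This sketch assumes 32-bit floats, the 1,536 dimensions mentioned earlier, and decimal units, and it ignores graph overhead entirely; the 100-gigabyte DiskANN number is simply the one quoted in this episode, not something the code derives.

```python
def raw_vector_bytes(num_vectors, dims=1536, bytes_per_value=4):
    """Bytes needed to hold full-precision float32 vectors (graph overhead ignored)."""
    return num_vectors * dims * bytes_per_value

GB = 10**9
TB = 10**12

billion = 1_000_000_000
hnsw_ram = raw_vector_bytes(billion)   # every full vector resident in memory
diskann_ram = 100 * GB                 # figure quoted in the episode

print(f"Raw vectors in RAM: {hnsw_ram / TB:.2f} TB")
print(f"DiskANN share of that: about 1/{hnsw_ram / diskann_ram:.0f}")
```

The raw vectors alone come to a little over six terabytes, which is where the one-sixtieth ratio comes from.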
435
00:16:54,560 --> 00:16:59,040
That one-sixtieth ratio isn't just a small optimization or a lucky edge case.
436
00:16:59,040 --> 00:17:01,600
It is a fundamental part of how the architecture works.
437
00:17:01,600 --> 00:17:05,760
Every time you add a replica or expand to a new region, that ratio stays the same.
438
00:17:05,760 --> 00:17:08,080
In an Azure environment where memory is expensive,
439
00:17:08,080 --> 00:17:13,360
that difference is what determines if a project is financially viable or a total non-starter.
440
00:17:13,360 --> 00:17:15,280
The architecture is elegant on paper,
441
00:17:15,280 --> 00:17:18,320
but the way it handles a real production load is what really matters.
442
00:17:18,320 --> 00:17:21,440
DiskANN in production: what the numbers actually say.
443
00:17:21,440 --> 00:17:25,280
Architecture diagrams are one thing, but production behavior is something else entirely.
444
00:17:25,280 --> 00:17:29,200
We need to look at what DiskANN actually delivers when real-world workloads hit the system.
445
00:17:29,200 --> 00:17:33,440
Microsoft's fraud detection demo gives us the most direct comparison available today.
446
00:17:33,440 --> 00:17:36,960
The traditional approach, the baseline system doing what it was already doing,
447
00:17:36,960 --> 00:17:39,280
returned results in about 1.1 seconds.
448
00:17:39,280 --> 00:17:43,920
The DiskANN-backed vector search returned those same results in about 47 milliseconds.
449
00:17:43,920 --> 00:17:46,560
This isn't just a marginal improvement on a benchmark.
450
00:17:46,560 --> 00:17:49,200
It's a production system changing its entire character,
451
00:17:49,200 --> 00:17:54,640
moving from a tool that makes analysts wait to one that responds within the window of a human thought.
452
00:17:54,640 --> 00:17:57,360
Detection quality actually improved alongside that speed,
453
00:17:57,360 --> 00:18:03,040
because the semantic search surfaced patterns that the old keyword-based system was simply missing.
454
00:18:03,040 --> 00:18:07,120
The compute reduction figure Microsoft publishes for DiskANN is 95%.
455
00:18:07,120 --> 00:18:12,880
Framed another way, DiskANN uses less than 5% of the compute required by traditional in-memory indexing
456
00:18:12,880 --> 00:18:16,400
to get the same results. When you first see that number, it sounds like marketing fluff,
457
00:18:16,400 --> 00:18:19,680
but it isn't. It's the RAM-to-SSD cost ratio made visible.
458
00:18:19,680 --> 00:18:23,440
The bill for a massive HNSW deployment isn't driven by processing power,
459
00:18:23,440 --> 00:18:28,160
but by the massive cost of keeping terabytes of data sitting in RAM on specialized hardware.
460
00:18:28,160 --> 00:18:31,200
DiskANN collapses that cost because the data moves to the SSD,
461
00:18:31,200 --> 00:18:34,960
and SSD-backed compute lives in a fundamentally different pricing tier.
462
00:18:34,960 --> 00:18:39,360
The Cosmos DB benchmarks against commercial alternatives make this reality concrete.
463
00:18:39,360 --> 00:18:43,920
On 10 million vectors with 768 dimensions, a very realistic mid-scale workload,
464
00:18:43,920 --> 00:18:47,120
Cosmos DB with DiskANN was 43 times cheaper than Pinecone.
465
00:18:47,120 --> 00:18:50,160
It was also 12 times cheaper than Zilliz Serverless Enterprise,
466
00:18:50,160 --> 00:18:54,400
all while keeping latency under 20 milliseconds and recall above 95%.
467
00:18:54,400 --> 00:18:56,480
Those competitors aren't using bad algorithms.
468
00:18:56,480 --> 00:19:00,960
Most commercial vector databases rely on HNSW-style in-memory indexing,
469
00:19:00,960 --> 00:19:04,160
so the cost difference you're seeing is actually an architectural difference.
470
00:19:04,160 --> 00:19:06,080
We do need a necessary calibration here.
471
00:19:06,080 --> 00:19:10,000
DiskANN is not faster than HNSW in absolute latency terms on small datasets.
472
00:19:10,000 --> 00:19:13,920
This needs to be said plainly because it's the most common way people misread the comparison.
473
00:19:13,920 --> 00:19:18,480
When a dataset fits comfortably in RAM and HNSW can navigate its graph at memory speed,
474
00:19:18,480 --> 00:19:22,160
it will return results faster than DiskANN's two-stage process.
475
00:19:22,160 --> 00:19:26,000
Sub-five-millisecond performance on a small corpus is a real thing for HNSW,
476
00:19:26,000 --> 00:19:28,880
and DiskANN simply doesn't match it at that specific scale.
477
00:19:28,880 --> 00:19:33,280
The crossover point usually happens somewhere between 50 and 100 million vectors
478
00:19:33,280 --> 00:19:35,120
for most enterprise workloads.
479
00:19:35,120 --> 00:19:40,080
Below that threshold, the latency advantage of HNSW is genuine and the cost is usually manageable.
480
00:19:40,080 --> 00:19:42,880
Once you go above it, two things happen at the same time.
481
00:19:42,880 --> 00:19:46,400
The memory requirements for HNSW start distorting your budget,
482
00:19:46,400 --> 00:19:51,760
and the 20-millisecond latency of DiskANN becomes perfectly acceptable for a RAG pipeline anyway.
483
00:19:51,760 --> 00:19:56,320
By that point, retrieval is just one stage in a multi-step process that includes re-ranking
484
00:19:56,320 --> 00:20:02,480
and generation, so the total pipeline latency makes the difference between five and 20 milliseconds irrelevant.
485
00:20:02,480 --> 00:20:05,040
That last point matters more than it might seem at first.
486
00:20:05,040 --> 00:20:11,040
Teams that optimize for retrieval latency in isolation often discover they've been solving the wrong problem.
487
00:20:11,040 --> 00:20:14,720
In a hybrid RAG pipeline with a re-ranker operating over the top 50 candidates,
488
00:20:14,720 --> 00:20:17,440
the re-ranking step itself adds tens of milliseconds.
489
00:20:17,440 --> 00:20:20,800
The semantic ranker in Azure AI Search adds even more,
490
00:20:20,800 --> 00:20:23,760
and the LLM generation that follows adds hundreds.
491
00:20:23,760 --> 00:20:28,400
A 15-millisecond DiskANN retrieval inside a two-second end-to-end pipeline is not the bottleneck,
492
00:20:28,400 --> 00:20:30,320
and it isn't even close to being the bottleneck.
493
00:20:30,320 --> 00:20:34,240
The production story for DiskANN isn't about beating the raw speed of HNSW.
494
00:20:34,240 --> 00:20:37,520
It's about matching that quality at a fraction of the infrastructure cost.
495
00:20:37,520 --> 00:20:41,440
It wins at a scale where the cost of hardware has already become the only thing
496
00:20:41,440 --> 00:20:43,280
people are talking about in the architecture review.
497
00:20:43,280 --> 00:20:46,240
Where DiskANN lives in the Azure ecosystem.
498
00:20:46,240 --> 00:20:49,120
DiskANN isn't a service you just turn on in the Azure portal.
499
00:20:49,120 --> 00:20:52,640
It's an algorithm that Microsoft has embedded across multiple services,
500
00:20:52,640 --> 00:20:55,920
and each one has different operational rules and cost models.
501
00:20:55,920 --> 00:20:58,960
Choosing DiskANN means choosing a specific implementation,
502
00:20:58,960 --> 00:21:02,560
and that choice carries consequences that have nothing to do with the algorithm itself.
503
00:21:02,560 --> 00:21:06,160
Azure Cosmos DB for NoSQL is the flagship for this technology.
504
00:21:06,160 --> 00:21:10,800
This is where Microsoft has invested most deeply to make DiskANN ready for the enterprise.
505
00:21:10,800 --> 00:21:16,160
The Cosmos DB version handles frequent writes, manages automatic partitioning as your data grows,
506
00:21:16,160 --> 00:21:19,280
and scales without you having to manually re-shard anything.
507
00:21:19,280 --> 00:21:22,240
The 2025 research paper on cost-effective search
508
00:21:22,240 --> 00:21:26,800
describes a system where the index stays in sync as documents are added or removed.
509
00:21:26,800 --> 00:21:31,280
That operational continuity is what makes Cosmos DB DiskANN viable for RAG systems
510
00:21:31,280 --> 00:21:33,120
that ingest new content every minute.
511
00:21:33,120 --> 00:21:37,040
Your knowledge base grows, the index updates, and the quality doesn't drop.
512
00:21:37,040 --> 00:21:38,960
That is the behavior a real business needs.
513
00:21:38,960 --> 00:21:42,480
SQL Server vector search is also based on DiskANN,
514
00:21:42,480 --> 00:21:46,480
but the operational reality is different enough that it belongs in its own category.
515
00:21:46,480 --> 00:21:48,640
When you build a vector index in SQL Server,
516
00:21:48,640 --> 00:21:52,400
the process might lock your source table in read-only mode while it works.
517
00:21:52,400 --> 00:21:55,360
On large datasets that doesn't take minutes, it takes hours.
518
00:21:55,360 --> 00:21:58,000
Once that index is built, it's effectively frozen in time.
519
00:21:58,000 --> 00:22:01,040
You can't incrementally add vectors the way you can with other methods.
520
00:22:01,040 --> 00:22:04,960
Updating the data requires copying the table, rebuilding the index on the copy,
521
00:22:04,960 --> 00:22:06,160
and then swapping them out.
522
00:22:06,160 --> 00:22:08,480
That's a complex dance that works for monthly updates,
523
00:22:08,480 --> 00:22:11,840
but it breaks down for systems that need to show new information immediately.
524
00:22:11,840 --> 00:22:15,200
SQL Server DiskANN is useful for a very specific class of workload.
525
00:22:15,200 --> 00:22:18,880
It's for large, static datasets that already live in SQL and don't change much
526
00:22:18,880 --> 00:22:22,080
like legal archives or quarterly product catalogs.
527
00:22:22,080 --> 00:22:25,760
For those use cases, it's a practical way to get billion-scale search without migrating
528
00:22:25,760 --> 00:22:27,280
off your existing infrastructure.
529
00:22:27,280 --> 00:22:29,440
For anything that ingests data continuously,
530
00:22:29,440 --> 00:22:32,400
it creates more operational pain than it actually solves.
531
00:22:32,400 --> 00:22:35,680
Azure AI Search occupies a different spot in this ecosystem.
532
00:22:35,680 --> 00:22:39,120
Historically, it relied on in-memory indexing for its native vector search,
533
00:22:39,120 --> 00:22:41,520
and that is still true for its primary structures today.
534
00:22:41,520 --> 00:22:45,520
The 2026 reference architecture from Microsoft now steers organizations
535
00:22:45,520 --> 00:22:48,560
with massive workloads toward Cosmos DB as the retrieval layer.
536
00:22:49,120 --> 00:22:54,720
In this setup, Azure AI Search handles the keyword components and the semantic re-ranking on top.
537
00:22:54,720 --> 00:22:58,320
This pattern treats the two services as partners rather than competitors.
538
00:22:58,320 --> 00:23:01,680
You use Azure AI Search for the hybrid pipeline and Cosmos DB
539
00:23:01,680 --> 00:23:03,760
for the scale-efficient storage beneath it.
540
00:23:03,760 --> 00:23:08,160
The most significant proof point for all of this sits outside the public service catalog.
541
00:23:08,160 --> 00:23:13,840
DiskANN powers the semantic index that makes Microsoft 365 Copilot work at a global scale.
542
00:23:13,840 --> 00:23:17,200
Every time a Copilot user pulls context from SharePoint or Teams,
543
00:23:17,200 --> 00:23:19,680
that search is running against a DiskANN-backed index.
544
00:23:19,680 --> 00:23:22,960
With 15 million paid Copilot seats as of early 2026,
545
00:23:22,960 --> 00:23:24,960
the algorithm isn't just a theory in a lab.
546
00:23:24,960 --> 00:23:29,600
It's operating under massive production load across thousands of organizations simultaneously,
547
00:23:29,600 --> 00:23:30,800
and the system is holding up.
548
00:23:30,800 --> 00:23:35,760
That internal deployment is the proof of concept that no white paper can replace.
549
00:23:35,760 --> 00:23:38,960
Microsoft didn't build DiskANN just to sell a database service.
550
00:23:38,960 --> 00:23:41,360
They built it because they had to run semantic search
551
00:23:41,360 --> 00:23:45,440
at a scale that made the RAM requirements of HNSW impossible to afford.
552
00:23:45,440 --> 00:23:47,520
Copilot is the result of that necessity.
553
00:23:47,520 --> 00:23:49,920
The practical takeaway for architects is simple.
554
00:23:49,920 --> 00:23:53,440
When you select DiskANN, you're selecting a service and an operational model,
555
00:23:53,440 --> 00:23:54,720
not just a piece of math.
556
00:23:54,720 --> 00:23:57,040
Cosmos DB brings dynamic data support,
557
00:23:57,040 --> 00:23:59,680
while SQL Server brings deep relational integration,
558
00:23:59,680 --> 00:24:02,320
and Azure AI Search brings hybrid orchestration.
559
00:24:02,320 --> 00:24:04,240
Each pairing fits a different profile,
560
00:24:04,240 --> 00:24:07,520
and the wrong choice creates friction that the algorithm alone cannot fix.
561
00:24:07,520 --> 00:24:09,920
The next question is how to make that decision correctly,
562
00:24:09,920 --> 00:24:12,320
and the answer isn't just about how many vectors you have today.
563
00:24:12,880 --> 00:24:15,920
The decision framework: scale is not the only variable.
564
00:24:15,920 --> 00:24:19,760
The instinct after everything we've covered is to reduce this to a question of size.
565
00:24:19,760 --> 00:24:24,320
Small dataset, use HNSW; large dataset, use DiskANN.
566
00:24:24,320 --> 00:24:27,600
You draw a line somewhere around 100 million vectors and call it a day,
567
00:24:27,600 --> 00:24:30,160
but in reality, that framing is too simple.
568
00:24:30,160 --> 00:24:33,360
Following it blindly will put you in the wrong architecture for the wrong reasons.
569
00:24:33,360 --> 00:24:36,000
There are actually four variables that drive this decision.
570
00:24:36,000 --> 00:24:37,840
Data set size is just the first one you see,
571
00:24:37,840 --> 00:24:40,320
but update frequency, latency requirements,
572
00:24:40,320 --> 00:24:43,360
and cost sensitivity can each override the size factor.
573
00:24:43,360 --> 00:24:46,400
You need to work through all four before you commit to a direction.
574
00:24:46,400 --> 00:24:48,480
Data set size sets the initial frame.
575
00:24:48,480 --> 00:24:50,400
If you are under 50 million vectors,
576
00:24:50,400 --> 00:24:53,520
HNSW on standard infrastructure is usually the right call.
577
00:24:53,520 --> 00:24:56,400
The memory requirement at that scale is real but manageable,
578
00:24:56,400 --> 00:24:59,200
which means you aren't shopping for exotic VM SKUs.
579
00:24:59,200 --> 00:25:02,160
And the operational simplicity of a well understood algorithm
580
00:25:02,160 --> 00:25:03,440
has genuine value.
581
00:25:03,440 --> 00:25:05,920
When you get between 50 and 100 million vectors,
582
00:25:05,920 --> 00:25:07,600
you enter transition territory.
583
00:25:07,600 --> 00:25:11,520
HNSW still works here, but you should be actively modeling your cost trajectory
584
00:25:11,520 --> 00:25:15,440
and planning a path to DiskANN rather than waiting until you're forced to move.
585
00:25:15,440 --> 00:25:17,440
Once you cross that 100 million vector mark,
586
00:25:17,440 --> 00:25:20,480
the cost advantage of DiskANN becomes structurally decisive.
587
00:25:20,480 --> 00:25:24,160
The infrastructure savings compound so significantly at that scale
588
00:25:24,160 --> 00:25:27,280
that the cost of staying put starts to exceed the cost of migrating.
589
00:25:27,280 --> 00:25:30,080
But here is where update frequency changes the answer.
590
00:25:30,080 --> 00:25:32,880
HNSW handles incremental inserts naturally.
591
00:25:32,880 --> 00:25:36,560
So when you add a new document, the graph updates and the index stays current.
592
00:25:36,560 --> 00:25:40,720
That operational smoothness disappears the moment you move to DiskANN via SQL Server.
593
00:25:40,720 --> 00:25:45,760
As we discussed, SQL Server DiskANN indexes are effectively immutable once you build them.
594
00:25:45,760 --> 00:25:48,240
If your workload involves continuous ingestion,
595
00:25:48,240 --> 00:25:51,200
like new support tickets every hour or policy updates every week,
596
00:25:51,200 --> 00:25:54,640
SQL Server DiskANN creates operational debt that piles up fast.
597
00:25:54,640 --> 00:25:57,920
Cosmos DB DiskANN was engineered specifically for frequent changes.
598
00:25:57,920 --> 00:26:00,800
So for dynamic data, it is the right implementation.
599
00:26:00,800 --> 00:26:03,840
If your team is currently running SQL Server and thinking they will just
600
00:26:03,840 --> 00:26:08,480
add the vector index there, you need to answer the update frequency question before you make that move.
601
00:26:08,480 --> 00:26:12,800
Latency requirements are the third axis and this is the one most likely to tip a borderline decision
602
00:26:12,800 --> 00:26:18,080
back toward HNSW. If your application has a hard SLA for sub-five-millisecond query response
603
00:26:18,080 --> 00:26:20,400
and I mean the vector retrieval step specifically,
604
00:26:20,400 --> 00:26:23,360
not the total pipeline, HNSW is the answer.
605
00:26:23,360 --> 00:26:25,280
There are workloads where this actually matters,
606
00:26:25,280 --> 00:26:29,040
such as real time personalization or high frequency trading intelligence.
607
00:26:29,040 --> 00:26:31,360
For those specific interactive search experiences,
608
00:26:31,360 --> 00:26:35,840
the 15 to 20 millisecond range that DiskANN operates in is a genuine constraint.
609
00:26:35,840 --> 00:26:39,120
It isn't just an engineering preference, it's a product deal breaker.
610
00:26:39,120 --> 00:26:41,840
For most enterprise RAG workloads though,
611
00:26:41,840 --> 00:26:44,640
that sub five millisecond requirement doesn't actually apply.
612
00:26:44,640 --> 00:26:47,040
The user isn't waiting for the retrieval step.
613
00:26:47,040 --> 00:26:50,480
They are waiting for the complete response, which includes retrieval,
614
00:26:50,480 --> 00:26:52,400
re-ranking and generation.
615
00:26:52,400 --> 00:26:56,960
In that context, a 15 millisecond vector retrieval inside a two second end-to-end pipeline
616
00:26:56,960 --> 00:26:58,240
is just a rounding error.
617
00:26:58,240 --> 00:27:01,680
Teams that anchor their decision on retrieval step latency in isolation
618
00:27:01,680 --> 00:27:05,120
are often optimizing a variable that the user can't even see.
619
00:27:05,120 --> 00:27:06,960
Cost sensitivity is the fourth axis,
620
00:27:06,960 --> 00:27:09,920
and at enterprise scale it tends to dominate everything else.
621
00:27:09,920 --> 00:27:13,680
RAM on Azure costs significantly more per gigabyte than managed SSD,
622
00:27:13,680 --> 00:27:16,720
and that ratio stays the same across different tiers and regions.
623
00:27:16,720 --> 00:27:20,560
When you choose between memory optimized VM clusters for HNSW
624
00:27:20,560 --> 00:27:23,040
and SSD-backed partitions for DiskANN,
625
00:27:23,040 --> 00:27:25,280
the pricing difference compounds with every replica
626
00:27:25,280 --> 00:27:27,360
and every percentage point of growth.
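To make those axes concrete, here is a rough sketch of the rules of thumb as a single function. The thresholds are the ones quoted in this episode, cost sensitivity is folded into the size bands rather than modeled separately, and the function name and return strings are purely illustrative, so treat it as a starting point for a conversation rather than a sizing tool.

```python
def suggest_index(num_vectors, continuous_ingestion, hard_sub_5ms_sla):
    """Encode the episode's rules of thumb; thresholds are approximate."""
    # A hard retrieval-step SLA under five milliseconds overrides everything else.
    if hard_sub_5ms_sla:
        return "HNSW"
    # Under roughly 50 million vectors, in-memory cost is real but manageable.
    if num_vectors < 50_000_000:
        return "HNSW"
    # Transition territory: HNSW still works, but model the cost trajectory.
    if num_vectors <= 100_000_000:
        return "HNSW (plan a path to DiskANN)"
    # Above roughly 100 million, the update pattern picks the implementation.
    if continuous_ingestion:
        return "DiskANN via Cosmos DB"
    return "DiskANN via Cosmos DB or SQL Server"

print(suggest_index(10_000_000, continuous_ingestion=True, hard_sub_5ms_sla=False))
print(suggest_index(75_000_000, continuous_ingestion=True, hard_sub_5ms_sla=False))
print(suggest_index(500_000_000, continuous_ingestion=True, hard_sub_5ms_sla=False))
print(suggest_index(500_000_000, continuous_ingestion=False, hard_sub_5ms_sla=False))
```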
627
00:27:27,360 --> 00:27:30,640
The most sophisticated architectures running in 2026
628
00:27:30,640 --> 00:27:32,080
don't force a binary choice.
629
00:27:32,080 --> 00:27:34,880
They use both algorithms for different parts of the workload.
630
00:27:34,880 --> 00:27:38,240
Frequently accessed vectors, the working set that handles most queries,
631
00:27:38,240 --> 00:27:40,960
live in an HNSW-style in-memory cache
632
00:27:40,960 --> 00:27:43,200
where sub-ten-millisecond response is possible.
633
00:27:43,200 --> 00:27:46,800
The full corpus lives in a DiskANN-backed service for deep retrieval
634
00:27:46,800 --> 00:27:48,560
when the hot cache comes up empty.
635
00:27:48,560 --> 00:27:52,000
This tiered pattern lets you get the best performance from both algorithms
636
00:27:52,000 --> 00:27:54,800
without paying the full RAM cost for the entire corpus.
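The routing logic behind that tiered pattern can be sketched in a few lines. Everything here is a hypothetical stand-in: the two search callables represent the in-memory tier and the DiskANN-backed tier, and "comes up empty" is interpreted as "returned no hits", where a real system would more likely use a relevance score threshold.

```python
def tiered_search(query, hot_index, deep_index, k=5):
    """Try the small in-memory tier first; fall back to the full corpus on a miss."""
    hits = hot_index(query, k)
    if hits:
        return "hot", hits
    return "deep", deep_index(query, k)

# Hypothetical stand-ins for the two services.
hot_docs = {"reset password": ["kb-101", "kb-204"]}

def hot_index(query, k):
    # Pretend in-memory lookup over the frequently accessed working set.
    return hot_docs.get(query, [])[:k]

def deep_index(query, k):
    # Pretend disk-backed lookup over the full corpus.
    return [f"archive-{n}" for n in range(k)]

print(tiered_search("reset password", hot_index, deep_index))
print(tiered_search("2019 audit policy", hot_index, deep_index, k=2))
```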
637
00:27:54,800 --> 00:27:56,960
That hybrid architecture is a real option,
638
00:27:56,960 --> 00:27:58,640
but it also adds complexity.
639
00:27:58,640 --> 00:28:02,000
Whether that complexity is worth it depends on whether your hot working set
640
00:28:02,000 --> 00:28:04,000
is measurably smaller than your full corpus.
641
00:28:04,000 --> 00:28:07,840
You also have to decide if the latency difference between those tiers is acceptable
642
00:28:07,840 --> 00:28:09,600
for your specific use case.
643
00:28:09,600 --> 00:28:13,040
The real cost model: RAM versus SSD and Azure pricing.
644
00:28:13,040 --> 00:28:15,520
To understand why the memory wall is so expensive,
645
00:28:15,520 --> 00:28:18,800
you have to look at how Azure AI Search actually bills you.
646
00:28:18,800 --> 00:28:21,840
The pricing model has a structure that makes the RAM problem much worse
647
00:28:21,840 --> 00:28:23,280
than it looks on the surface.
648
00:28:23,280 --> 00:28:26,080
Azure AI Search bills through search units.
649
00:28:26,080 --> 00:28:30,160
A search unit combines compute and storage into a single block of capacity.
650
00:28:30,160 --> 00:28:33,040
You scale the service by adding partitions to hold more data
651
00:28:33,040 --> 00:28:36,160
or replicas to handle more queries and provide availability.
652
00:28:36,160 --> 00:28:40,640
The key reality here is that vector search itself has no extra per query fee.
653
00:28:40,640 --> 00:28:42,880
You aren't billed every time a user hits search.
654
00:28:42,880 --> 00:28:45,920
Instead you pay for the infrastructure capacity you have provisioned
655
00:28:45,920 --> 00:28:49,440
and it runs continuously whether anyone is using it at 3am or not.
656
00:28:49,440 --> 00:28:53,840
That flat capacity model sounds great until you realize what forces you to add more capacity.
657
00:28:53,840 --> 00:28:58,240
For HNSW indexing, the primary driver is the memory needed to keep the index resident.
658
00:28:58,240 --> 00:29:02,320
As your vector count grows, you need more memory per partition to hold that graph.
659
00:29:02,320 --> 00:29:06,720
More memory means moving up SKU tiers, which means a higher hourly rate for every search unit.
660
00:29:06,720 --> 00:29:10,720
That rate applies to every partition and every replica every hour of every day.
661
00:29:10,720 --> 00:29:13,440
Let's look at a concrete scenario to see how this stacks up.
662
00:29:13,440 --> 00:29:17,680
20GB of data with embeddings on an S1 tier Azure AI Search service
663
00:29:17,680 --> 00:29:20,000
costs about $245 per month.
664
00:29:20,000 --> 00:29:21,360
That is just the base service.
665
00:29:21,360 --> 00:29:27,040
Embedding generation adds another $100 to $650, depending on the model you use from Azure OpenAI.
666
00:29:27,040 --> 00:29:30,320
If you add semantic re-ranking, the first thousand requests are free,
667
00:29:30,320 --> 00:29:32,080
but then metered usage starts.
668
00:29:32,080 --> 00:29:36,800
For a mid-size enterprise, a fully configured pipeline usually sits between $409
669
00:29:36,800 --> 00:29:39,760
and $900 per month for modest data volumes.
670
00:29:39,760 --> 00:29:43,280
That isn't alarming, but the alarm goes off when the data volume grows.
671
00:29:43,280 --> 00:29:46,240
At 500 million vectors, the math for HNSW memory
672
00:29:46,240 --> 00:29:49,120
creates an infrastructure bill that belongs in a completely different conversation.
673
00:29:49,120 --> 00:29:51,920
You would need roughly three terabytes of RAM just for the index,
674
00:29:51,920 --> 00:29:55,280
which means you are building a cluster of many memory-optimized nodes.
675
00:29:55,280 --> 00:29:58,000
If you replicate that for availability, you've doubled the number.
676
00:29:58,000 --> 00:29:59,920
If you add a second region, you've tripled it.
677
00:29:59,920 --> 00:30:02,400
The cost isn't measured in hundreds of dollars anymore.
678
00:30:02,400 --> 00:30:04,880
It is measured in multiples of what the team budgeted
679
00:30:04,880 --> 00:30:07,200
when the project was approved at 10 million vectors.
680
00:30:07,200 --> 00:30:10,080
DiskANN via Cosmos DB changes that cost structure entirely.
681
00:30:10,080 --> 00:30:13,280
Instead of RAM-heavy virtual machines sized to hold a graph,
682
00:30:13,280 --> 00:30:17,920
you provision SSD-backed partitions and pay for request units based on your actual query volume.
683
00:30:17,920 --> 00:30:20,800
The per-gigabyte cost of NVMe SSD on Azure
684
00:30:20,800 --> 00:30:23,680
is a tiny fraction of the cost of memory-optimized RAM.
685
00:30:23,680 --> 00:30:27,680
At 500 million vectors, the DiskANN RAM requirement is only about 50 gigabytes,
686
00:30:27,680 --> 00:30:29,520
which fits on standard infrastructure.
687
00:30:29,520 --> 00:30:31,680
The bulk of the storage cost moves to SSD,
688
00:30:31,680 --> 00:30:34,320
where capacity is cheap and scales horizontally,
689
00:30:34,320 --> 00:30:36,080
without that massive per-unit premium.
690
00:30:36,080 --> 00:30:39,120
There is also an operational multiplier in the HNSW model
691
00:30:39,120 --> 00:30:41,360
that makes these numbers even sharper.
692
00:30:41,360 --> 00:30:45,760
Every availability replica needs its own complete copy of the index in memory.
693
00:30:45,760 --> 00:30:47,280
That isn't an optional setting.
694
00:30:47,280 --> 00:30:49,680
It is just how distributed in-memory search works.
695
00:30:49,680 --> 00:30:53,360
Every regional deployment adds another full copy,
696
00:30:53,360 --> 00:30:55,520
and every time you scale for a traffic spike,
697
00:30:55,520 --> 00:30:58,720
you provision more memory-dense capacity that stays running at full cost
698
00:30:58,720 --> 00:31:00,160
even after the traffic drops.
699
00:31:00,160 --> 00:31:02,160
DiskANN doesn't get rid of replication costs,
700
00:31:02,160 --> 00:31:05,600
but it replicates data at SSD pricing rather than RAM pricing.
701
00:31:05,600 --> 00:31:08,560
That difference, multiplied across months and regions,
702
00:31:08,560 --> 00:31:11,520
is where the massive cost gap actually comes from.
703
00:31:11,520 --> 00:31:13,440
The pricing model makes one thing very clear.
704
00:31:13,440 --> 00:31:16,960
The Azure invoice does not care which algorithm had better recall in a benchmark;
705
00:31:17,200 --> 00:31:19,920
it only cares how much memory you needed to deliver that recall
706
00:31:19,920 --> 00:31:21,840
under your real operating conditions.
707
00:31:21,840 --> 00:31:24,560
That is the cost model that should drive your architecture,
708
00:31:24,560 --> 00:31:27,440
and it is rarely on the table early enough in the process.
709
00:31:27,440 --> 00:31:32,080
The mutability problem: updates, deletes, and index drift.
710
00:31:32,080 --> 00:31:34,080
Cost is usually the first thing people look at
711
00:31:34,080 --> 00:31:36,560
when choosing between HNSW and DiskANN,
712
00:31:36,560 --> 00:31:40,160
but in a live environment, it isn't always the biggest headache.
713
00:31:40,160 --> 00:31:43,200
For any team running a RAG system that pulls in new data every day,
714
00:31:43,200 --> 00:31:44,880
the real problem is operational.
715
00:31:44,880 --> 00:31:47,280
You have to ask what actually happens to your index
716
00:31:47,280 --> 00:31:49,280
when the data behind it starts changing.
717
00:31:49,280 --> 00:31:53,040
HNSW is actually pretty good at handling new inserts as they come in.
718
00:31:53,040 --> 00:31:55,920
When you embed a new document and need to add it to the index,
719
00:31:55,920 --> 00:31:59,120
the algorithm just wires it into the existing graph as a new node
720
00:31:59,120 --> 00:32:00,640
without needing a full rebuild.
721
00:32:00,640 --> 00:32:03,680
This incremental approach is why HNSW feels so comfortable
722
00:32:03,680 --> 00:32:05,920
for teams that need to keep their knowledge bases current.
723
00:32:05,920 --> 00:32:07,920
You can publish a new policy on Monday morning,
724
00:32:07,920 --> 00:32:09,200
have it embedded by lunch,
725
00:32:09,200 --> 00:32:12,320
and see it show up in support ticket results by the end of the day.
726
00:32:12,320 --> 00:32:15,120
But when you need to delete something, things get a lot messier.
727
00:32:15,120 --> 00:32:18,080
Most HNSW setups don't actually pull nodes out of the graph
728
00:32:18,080 --> 00:32:20,080
the moment a document is deleted or replaced.
729
00:32:20,080 --> 00:32:22,480
Instead, they just mark that node with a tombstone flag
730
00:32:22,480 --> 00:32:23,840
to hide it from search results
731
00:32:23,840 --> 00:32:26,880
while leaving all its physical connections in the graph structure.
732
00:32:26,880 --> 00:32:29,120
Over time these tombstone nodes start to pile up
733
00:32:29,120 --> 00:32:31,360
and waste memory without helping your search at all.
734
00:32:31,360 --> 00:32:33,280
They can even hurt the quality of your graph
735
00:32:33,280 --> 00:32:36,080
by hogging connections that should belong to live data.
736
00:32:36,080 --> 00:32:38,320
You can run periodic compaction to clean them out,
737
00:32:38,320 --> 00:32:42,240
but doing that on a massive HNSW index is a heavy lift for your hardware.
738
00:32:42,240 --> 00:32:44,480
The process usually requires holding the old graph
739
00:32:44,480 --> 00:32:46,320
and the new one in memory at the same time,
740
00:32:46,320 --> 00:32:47,920
which can easily crash a system
741
00:32:47,920 --> 00:32:49,760
that was already running near its limit.
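The tombstone behavior described above can be sketched with a toy graph. This is a hypothetical illustration, not a real HNSW implementation: `ToyGraphIndex` and its methods are invented names, and a real index would also handle layers, distances, and neighbor selection.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    doc_id: str
    neighbors: set[str] = field(default_factory=set)
    deleted: bool = False  # tombstone flag


class ToyGraphIndex:
    """Hypothetical toy graph used only to illustrate soft deletes."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add(self, doc_id: str, neighbors: list[str]) -> None:
        # Wire the new node into the existing graph with two-way edges.
        node = Node(doc_id)
        self.nodes[doc_id] = node
        for other in neighbors:
            if other in self.nodes:
                node.neighbors.add(other)
                self.nodes[other].neighbors.add(doc_id)

    def delete(self, doc_id: str) -> None:
        # Soft delete: the node and all of its edges stay in memory.
        self.nodes[doc_id].deleted = True

    def live_ids(self) -> list[str]:
        return sorted(d for d, n in self.nodes.items() if not n.deleted)

    def tombstone_ratio(self) -> float:
        if not self.nodes:
            return 0.0
        dead = sum(1 for n in self.nodes.values() if n.deleted)
        return dead / len(self.nodes)

    def compact(self) -> "ToyGraphIndex":
        # Builds a second graph while the old one is still held in memory.
        fresh = ToyGraphIndex()
        for doc_id, node in self.nodes.items():
            if node.deleted:
                continue
            live = [n for n in node.neighbors if not self.nodes[n].deleted]
            fresh.add(doc_id, live)
        return fresh


index = ToyGraphIndex()
index.add("a", [])
index.add("b", ["a"])
index.add("c", ["a", "b"])
index.delete("b")
compacted = index.compact()
print(index.live_ids(), round(index.tombstone_ratio(), 2), sorted(compacted.nodes))
```

After the delete, node "b" is hidden from results but still occupies a slot and keeps its edges; only `compact()` drops it, and during that call both graphs exist at once.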
742
00:32:49,760 --> 00:32:52,880
SQL Server DiskANN has a completely different way of failing.
743
00:32:52,880 --> 00:32:54,400
If you look at the official documentation
744
00:32:54,400 --> 00:32:56,560
for building a vector index in SQL Server,
745
00:32:56,560 --> 00:32:58,240
it's very honest about the trade-offs.
746
00:32:58,240 --> 00:33:00,480
The source table might be locked in read-only mode
747
00:33:00,480 --> 00:33:02,640
for the entire time the index is building.
748
00:33:02,640 --> 00:33:03,680
On a large data set,
749
00:33:03,680 --> 00:33:05,600
we aren't talking about a few minutes of downtime
750
00:33:05,600 --> 00:33:06,880
during a maintenance window.
751
00:33:06,880 --> 00:33:08,240
We are talking about hours.
752
00:33:08,240 --> 00:33:11,040
If you have other services that need to write to that table,
753
00:33:11,040 --> 00:33:13,600
they are completely blocked until the build finishes.
754
00:33:13,600 --> 00:33:16,400
For an enterprise knowledge base that needs constant updates,
755
00:33:16,400 --> 00:33:17,760
this isn't just a minor annoyance;
756
00:33:17,760 --> 00:33:19,680
it's a massive architectural wall.
757
00:33:19,680 --> 00:33:21,440
The limitations don't stop there either.
758
00:33:21,440 --> 00:33:23,680
Once you build a SQL Server DiskANN index,
759
00:33:23,680 --> 00:33:25,360
it stays effectively static.
760
00:33:25,360 --> 00:33:27,120
You can't just add new vectors one by one,
761
00:33:27,120 --> 00:33:28,720
like you can with HNSW.
762
00:33:28,720 --> 00:33:30,080
If you have new documents to index,
763
00:33:30,080 --> 00:33:32,240
the standard move is to copy the table,
764
00:33:32,240 --> 00:33:33,360
add the new content,
765
00:33:33,360 --> 00:33:34,560
build a fresh index,
766
00:33:34,560 --> 00:33:36,000
and then swap it into production.
767
00:33:36,000 --> 00:33:39,440
That works fine if you only update your data once a month or once a quarter,
768
00:33:39,440 --> 00:33:42,960
but if your RAG system needs to show new information within an hour of it being written,
769
00:33:42,960 --> 00:33:45,360
there is a fundamental mismatch between what you need
770
00:33:45,360 --> 00:33:47,360
and what this technology can actually do.
771
00:33:47,360 --> 00:33:50,480
Cosmos DB DiskANN was designed from a totally different perspective.
772
00:33:50,480 --> 00:33:53,040
The 2025 research paper on this implementation
773
00:33:53,040 --> 00:33:55,040
shows that search accuracy stays stable
774
00:33:55,040 --> 00:33:56,560
even as your data shifts around.
775
00:33:56,560 --> 00:33:58,720
You can add, change, or remove vectors
776
00:33:58,720 --> 00:34:00,160
without the index falling apart
777
00:34:00,160 --> 00:34:01,680
or needing a total reset.
778
00:34:01,680 --> 00:34:06,240
The system integrates the DiskANN graph directly into the existing Cosmos DB partition structure
779
00:34:06,240 --> 00:34:08,880
so that everything stays in sync as data flows through.
780
00:34:08,880 --> 00:34:11,920
Having your content available for search in near real time
781
00:34:11,920 --> 00:34:13,600
wasn't just a bonus feature for them.
782
00:34:13,600 --> 00:34:16,480
It was a core requirement they built the entire system around.
783
00:34:16,480 --> 00:34:19,440
The lesson here for how you manage your data is very clear.
784
00:34:19,440 --> 00:34:22,400
If your RAG system is constantly ingesting new product docs,
785
00:34:22,400 --> 00:34:23,680
regulatory updates,
786
00:34:23,680 --> 00:34:25,440
or customer support tickets,
787
00:34:25,440 --> 00:34:27,840
your choice of DiskANN version matters.
788
00:34:27,840 --> 00:34:29,840
It's the difference between a pipeline
789
00:34:29,840 --> 00:34:31,920
that runs quietly in the background
790
00:34:31,920 --> 00:34:35,280
and one that turns into a recurring emergency for your ops team.
791
00:34:35,280 --> 00:34:37,680
The math behind the algorithm might be the same,
792
00:34:37,680 --> 00:34:41,760
but the service wrapped around it is what actually determines your experience in production.
793
00:34:41,760 --> 00:34:44,160
How the index choice affects your RAG pipeline.
794
00:34:44,160 --> 00:34:45,840
By the time we get to 2026,
795
00:34:45,840 --> 00:34:48,640
RAG retrieval isn't just a simple database call anymore.
796
00:34:48,640 --> 00:34:50,480
It has turned into a multi-stage pipeline
797
00:34:50,480 --> 00:34:52,000
where queries get rewritten,
798
00:34:52,000 --> 00:34:53,760
hybrid searches run in parallel,
799
00:34:53,760 --> 00:34:55,680
and a re-ranker scores the best results
800
00:34:55,680 --> 00:34:57,440
before the LLM even sees them.
801
00:34:57,440 --> 00:34:59,680
Every choice you make at the vector index stage
802
00:34:59,680 --> 00:35:02,240
is going to ripple forward through that entire process.
803
00:35:02,240 --> 00:35:04,480
The algorithm itself lives in the retrieval stage
804
00:35:04,480 --> 00:35:07,200
but its impact shows up in two places you might not expect.
805
00:35:07,200 --> 00:35:09,360
It changes your total pipeline latency
806
00:35:09,360 --> 00:35:12,560
and it changes the quality of the data your re-ranker has to work with.
807
00:35:12,560 --> 00:35:16,320
We have to start by admitting that pure vector search isn't a silver bullet.
808
00:35:16,320 --> 00:35:18,080
Embeddings are great at capturing meaning,
809
00:35:18,080 --> 00:35:20,160
but they often struggle with the specific details
810
00:35:20,160 --> 00:35:21,920
a production system needs to get right.
811
00:35:21,920 --> 00:35:23,920
Things like SKU numbers, contract IDs,
812
00:35:23,920 --> 00:35:26,640
or specific ticket references often get blurred
813
00:35:26,640 --> 00:35:28,160
during the embedding process.
814
00:35:28,160 --> 00:35:31,520
The model tries to turn character-level differences into general similarity,
815
00:35:31,520 --> 00:35:32,560
which is usually helpful,
816
00:35:32,560 --> 00:35:34,400
but sometimes exactly what you don't want.
817
00:35:34,400 --> 00:35:35,760
Negation is another big problem
818
00:35:35,760 --> 00:35:38,400
because a sentence about who is eligible for a benefit
819
00:35:38,400 --> 00:35:40,720
looks almost identical to a sentence about who isn't.
820
00:35:40,720 --> 00:35:42,640
The model just doesn't have a reliable way
821
00:35:42,640 --> 00:35:45,440
to turn logical opposites into a large distance in vector space.
822
00:35:45,440 --> 00:35:48,240
This is why hybrid search is now the standard for enterprise RAG
823
00:35:48,240 --> 00:35:50,000
instead of just being an optional upgrade.
824
00:35:50,000 --> 00:35:53,120
You use BM25 to handle exact matches and keywords
825
00:35:53,120 --> 00:35:55,680
while the vector side handles the concepts and meaning.
826
00:35:55,680 --> 00:35:58,640
Azure AI Search runs both of these at the same time
827
00:35:58,640 --> 00:36:01,920
and merges them into one list that covers all your bases.
828
00:36:01,920 --> 00:36:03,760
If your system needs to handle the messy,
829
00:36:03,760 --> 00:36:06,480
specific questions that real users actually ask,
830
00:36:06,480 --> 00:36:09,360
you really can't afford to skip either side of that equation.
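Azure AI Search's documentation describes Reciprocal Rank Fusion (RRF) as the method for merging keyword and vector results. The sketch below is a generic RRF function, not the service's internal code; the constant `k = 60` is a commonly used default taken as an assumption here, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge ranked lists: each document scores 1 / (k + rank) per list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for stability.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from the two retrieval paths.
keyword_hits = ["contract-7731", "policy-12", "faq-3"]
vector_hits = ["policy-12", "faq-3", "contract-7732"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that appear in both lists rise to the top, while an exact-match hit that only the keyword side found still stays in the candidate set for the re-ranker.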
831
00:36:09,360 --> 00:36:13,360
This is where the specific index algorithm finally plugs into the rest of the pipeline.
832
00:36:13,360 --> 00:36:16,560
HNSW pulls results from RAM at incredible speeds,
833
00:36:16,560 --> 00:36:18,480
usually in less than 10 milliseconds.
834
00:36:18,480 --> 00:36:21,120
DiskANN can take a bit longer because of its two-stage process,
835
00:36:21,120 --> 00:36:24,160
usually landing between 15 and 20 milliseconds at scale.
836
00:36:24,160 --> 00:36:26,320
Both of those results go into the same fusion process
837
00:36:26,320 --> 00:36:28,240
and then move on to the re-ranking step.
838
00:36:28,240 --> 00:36:30,080
Once you hit that re-ranking stage,
839
00:36:30,080 --> 00:36:33,120
the speed difference between the two algorithms basically vanishes.
840
00:36:33,120 --> 00:36:37,360
A standard re-ranker looking at the top 100 candidates takes about 30 to 60 milliseconds
841
00:36:37,360 --> 00:36:38,640
to run on most hardware.
842
00:36:38,640 --> 00:36:41,360
When you add the overhead of the semantic ranker
843
00:36:41,360 --> 00:36:44,560
and the hundreds of milliseconds it takes for the LLM to generate a response,
844
00:36:44,560 --> 00:36:45,920
the gap disappears.
845
00:36:45,920 --> 00:36:48,960
The 15-millisecond difference between HNSW and DiskANN
846
00:36:48,960 --> 00:36:51,680
doesn't matter when the whole pipeline takes two full seconds.
847
00:36:51,680 --> 00:36:54,640
Your users won't feel it and your performance logs won't even care.
848
00:36:54,640 --> 00:36:56,880
It's a difference that only exists on a benchmark sheet,
849
00:36:56,880 --> 00:36:58,160
not in the real world.
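The proportions are easy to check with the numbers quoted in this segment. The sketch below uses only those figures; the function name is illustrative.

```python
# Latency figures quoted in the episode, in milliseconds.
QUOTED_INDEX_GAP_MS = 15        # HNSW vs. DiskANN retrieval difference
RERANKER_RANGE_MS = (30, 60)    # re-ranking the top 100 candidates
PIPELINE_TOTAL_MS = 2000        # "two full seconds" end to end


def percent_of_pipeline(stage_ms: float, total_ms: float) -> float:
    """Return a stage's latency as a percentage of the whole pipeline."""
    return 100.0 * stage_ms / total_ms


index_gap_share = percent_of_pipeline(QUOTED_INDEX_GAP_MS, PIPELINE_TOTAL_MS)
reranker_share = tuple(
    percent_of_pipeline(ms, PIPELINE_TOTAL_MS) for ms in RERANKER_RANGE_MS
)

print(f"Index gap: {index_gap_share}% of the pipeline")
print(f"Re-ranker: {reranker_share[0]}% to {reranker_share[1]}% of the pipeline")
```

On these figures the index gap is under one percent of the end-to-end time, smaller than the re-ranker's own share.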
850
00:36:58,160 --> 00:37:01,600
What the index choice actually changes is the quality of the candidates
851
00:37:01,600 --> 00:37:03,600
that the re-ranker has to sort through.
852
00:37:03,600 --> 00:37:08,880
Both HNSW and DiskANN are aiming for the same high accuracy rate of over 95%.
853
00:37:08,880 --> 00:37:12,960
This means the list of results they hand off should be basically the same at the query level.
854
00:37:12,960 --> 00:37:16,160
If a re-ranker gets 50 candidates from a DiskANN index,
855
00:37:16,160 --> 00:37:19,680
it's working with the same quality of material it would get from HNSW.
856
00:37:19,680 --> 00:37:22,880
The algorithm changes how much it costs you to find those candidates,
857
00:37:22,880 --> 00:37:24,880
but it doesn't change how relevant they are.
858
00:37:24,880 --> 00:37:27,680
Your chunking strategy is the final piece of this puzzle.
859
00:37:27,680 --> 00:37:30,400
The way you split your documents before you ever embed them
860
00:37:30,400 --> 00:37:33,120
is what really determines how precise your answers will be.
861
00:37:33,120 --> 00:37:36,800
If you create semantically coherent chunks that follow section headings and logical units,
862
00:37:36,800 --> 00:37:39,600
your retrieval will improve no matter which algorithm you use.
863
00:37:39,600 --> 00:37:41,760
A well-organized set of data running on DiskANN
864
00:37:41,760 --> 00:37:45,760
will always outperform a messy, poorly chunked dataset running on HNSW.
865
00:37:45,760 --> 00:37:49,680
At the end of the day, the index algorithm tells you what the infrastructure costs to run,
866
00:37:49,680 --> 00:37:53,440
but the chunking tells you if the content was even worth finding in the first place.
867
00:37:53,440 --> 00:37:56,160
You have to look at the pipeline as a single system.
868
00:37:56,160 --> 00:37:58,640
If you spend all your time optimizing the index algorithm
869
00:37:58,640 --> 00:38:02,320
without looking at re-ranking or chunking, you are tuning the wrong part of the engine.
870
00:38:02,320 --> 00:38:05,360
The index choice is important, but it's a decision about cost and scale,
871
00:38:05,360 --> 00:38:08,320
not about the tiny millisecond differences you see in a lab report.
872
00:38:08,320 --> 00:38:12,400
Enterprise RAG failure modes and how index choice relates.
873
00:38:12,400 --> 00:38:16,080
Most teams get it backwards when a RAG system starts spitting out wrong answers.
874
00:38:16,080 --> 00:38:17,920
The assumption is that the language model broke,
875
00:38:17,920 --> 00:38:20,080
the model hallucinated, the model made something up,
876
00:38:20,080 --> 00:38:22,000
and that you need to swap it for a better one.
877
00:38:22,000 --> 00:38:26,320
But in reality, the evidence we see heading into 2026 points somewhere else entirely.
878
00:38:26,320 --> 00:38:29,920
The main reason enterprise RAG fails is retrieval, not generation.
879
00:38:29,920 --> 00:38:32,240
When your model gives a plausible but wrong answer,
880
00:38:32,240 --> 00:38:34,560
it usually isn't because the model invented a lie.
881
00:38:34,560 --> 00:38:37,760
It's because the right context never actually made it into the prompt.
882
00:38:37,760 --> 00:38:40,320
The model just answered with the data it was given,
883
00:38:40,320 --> 00:38:43,440
and that data was wrong, incomplete or totally outdated.
884
00:38:43,440 --> 00:38:44,800
That is a retrieval problem.
885
00:38:44,800 --> 00:38:47,280
No amount of model upgrades will fix a system
886
00:38:47,280 --> 00:38:49,680
that feeds the wrong information to the generator.
887
00:38:49,680 --> 00:38:52,240
These failure modes follow very predictable patterns.
888
00:38:52,240 --> 00:38:53,760
Negation is one of the biggest ones.
889
00:38:53,760 --> 00:38:58,080
If you ask about employees not eligible for leave versus employees eligible for leave,
890
00:38:58,080 --> 00:39:01,360
those two queries sit right next to each other in vector space.
891
00:39:01,360 --> 00:39:06,080
The vocabulary is almost identical even though the logical distance between the correct answers is huge.
892
00:39:06,080 --> 00:39:07,760
The semantic distance is tiny.
893
00:39:07,760 --> 00:39:11,920
HR systems and compliance tools built on pure vector search fail this test all the time.
894
00:39:11,920 --> 00:39:13,520
The worst part is they don't fail loudly.
895
00:39:13,520 --> 00:39:17,280
They just give you a confident, well-formatted answer about the wrong group of people.
896
00:39:17,280 --> 00:39:19,920
Exact identifiers are another place where the system breaks.
897
00:39:19,920 --> 00:39:23,200
Think about contract numbers, invoice IDs or SKU codes.
898
00:39:23,200 --> 00:39:27,520
The embedding process takes those character-level differences and squashes them into similarity scores.
899
00:39:27,520 --> 00:39:30,640
A search for contract number 2024 EU7731
900
00:39:30,640 --> 00:39:35,280
might pull up 2024 EU7732 because the model sees them as basically the same thing.
901
00:39:35,280 --> 00:39:38,080
In procurement or legal work, that kind of near miss is a disaster.
902
00:39:38,080 --> 00:39:43,200
It's the wrong document being handed to a decision maker who trusts the system to be right.
903
00:39:43,200 --> 00:39:46,160
Then you have temporal failures which are quieter but get worse over time.
904
00:39:46,160 --> 00:39:49,040
Vector similarity doesn't understand the concept of new.
905
00:39:49,040 --> 00:39:52,720
An old policy and the one that replaced it look almost identical to a vector index
906
00:39:52,720 --> 00:39:55,440
because they use the same language to describe the same ideas.
907
00:39:55,440 --> 00:39:59,120
When both exist in your data the system just grabs whichever one ranks higher.
908
00:39:59,120 --> 00:40:02,160
Without metadata filters to check for version status or dates,
909
00:40:02,160 --> 00:40:05,280
stale content competes with current content on a level playing field.
910
00:40:05,280 --> 00:40:07,760
In a regulated environment that isn't just a bug.
911
00:40:07,760 --> 00:40:11,520
It's a structural risk you built into the architecture the moment you ingested the data.
912
00:40:11,520 --> 00:40:14,080
The most important thing to realize is that none of these
913
00:40:14,080 --> 00:40:16,160
are problems with your index algorithm.
914
00:40:16,160 --> 00:40:19,920
Switching from HNSW to DiskANN won't fix how the system handles negation.
915
00:40:19,920 --> 00:40:23,200
It won't help with ID matching. It won't make the index aware of time.
916
00:40:23,200 --> 00:40:26,480
These failures live in the retrieval architecture itself.
917
00:40:26,480 --> 00:40:28,720
They depend on whether you're running hybrid search,
918
00:40:28,720 --> 00:40:30,240
whether your metadata filters are right,
919
00:40:30,240 --> 00:40:34,000
and whether you have a re-ranker cleaning up the results before they hit the model.
920
00:40:34,000 --> 00:40:38,640
If your pipeline has these gaps, HNSW and DiskANN will fail in exactly the same way.
921
00:40:38,640 --> 00:40:42,320
What the algorithm actually changes is the scale where these errors start to matter.
922
00:40:42,320 --> 00:40:45,200
If you have five million documents, you have fewer chances to mess up
923
00:40:45,200 --> 00:40:46,640
than if you have five hundred million.
924
00:40:46,640 --> 00:40:50,560
The odds of an old document outranking a new one are lower when the pile is small and clean.
925
00:40:50,560 --> 00:40:54,960
But at billion-vector scale, every single weakness in your architecture shows up more often.
926
00:40:54,960 --> 00:40:56,960
DiskANN makes that scale affordable to run,
927
00:40:56,960 --> 00:40:58,400
but it doesn't make the scale safe.
928
00:40:58,400 --> 00:41:00,320
You still need the right architecture around it.
929
00:41:00,320 --> 00:41:01,840
The lesson here is pretty simple.
930
00:41:01,840 --> 00:41:03,440
DiskANN gives you the scale,
931
00:41:03,440 --> 00:41:06,720
but hybrid search and metadata filtering are what give you the right answers.
932
00:41:06,720 --> 00:41:08,160
You can't have one without the other.
933
00:41:08,160 --> 00:41:11,680
If a company moves to DiskANN to save money but ignores hybrid retrieval,
934
00:41:11,680 --> 00:41:15,040
they just end up with cheaper infrastructure that makes the same mistakes
935
00:41:15,040 --> 00:41:16,800
across a much larger data set.
936
00:41:16,800 --> 00:41:18,560
The algorithm choice handles the cost.
937
00:41:18,560 --> 00:41:20,240
The pipeline design handles the quality.
938
00:41:20,240 --> 00:41:21,520
These are two different problems.
939
00:41:21,520 --> 00:41:25,520
If you confuse them, you'll end up with an architecture that is either too expensive to run
940
00:41:25,520 --> 00:41:26,880
or too wrong to use.
941
00:41:26,880 --> 00:41:27,920
Neither of those is a win.
942
00:41:27,920 --> 00:41:31,200
Metadata filtering: the underused lever.
943
00:41:31,200 --> 00:41:34,800
Most teams spend months obsessing over embedding models and chunk sizes.
944
00:41:34,800 --> 00:41:36,960
They spend weeks benchmarking re-rankers.
945
00:41:36,960 --> 00:41:40,000
Then they design the metadata schema in a single afternoon
946
00:41:40,000 --> 00:41:43,040
because they just use whatever fields seem obvious at the time.
947
00:41:43,040 --> 00:41:46,320
That lopsided focus is exactly where production failures come from.
948
00:41:46,320 --> 00:41:50,720
Azure AI Search uses hybrid search to apply metadata filters across both keyword
949
00:41:50,720 --> 00:41:52,240
and vector retrieval at the same time.
950
00:41:52,240 --> 00:41:54,800
This is the specific mechanism that keeps an intern in marketing
951
00:41:54,800 --> 00:41:56,800
from seeing documents meant for the legal team.
952
00:41:56,800 --> 00:41:59,280
It's how you make sure a policy search pulls the active version
953
00:41:59,280 --> 00:42:01,040
instead of something that expired two years ago.
954
00:42:01,040 --> 00:42:04,640
It's also how you keep one customer's data away from another in a shared system.
955
00:42:04,640 --> 00:42:07,120
But none of that works if you didn't design the metadata
956
00:42:07,120 --> 00:42:09,120
before you started ingesting data.
957
00:42:09,120 --> 00:42:12,000
The first rule is that metadata has to be granular.
958
00:42:12,000 --> 00:42:14,800
It needs to live at the chunk level, not the document level.
959
00:42:14,800 --> 00:42:18,160
A single policy might have different sections for different regions
960
00:42:18,160 --> 00:42:19,520
or different levels of sensitivity.
961
00:42:19,520 --> 00:42:22,720
If you only tag the parent document and then chop it into 50 pieces,
962
00:42:22,720 --> 00:42:24,240
every piece gets the same tags.
963
00:42:24,240 --> 00:42:26,640
That means your filters will either hide the whole document
964
00:42:26,640 --> 00:42:29,840
when they shouldn't or show sensitive sections to people who don't have access.
965
00:42:29,840 --> 00:42:32,800
You have to attach the right metadata to each individual chunk
966
00:42:32,800 --> 00:42:35,280
even if that makes the ingestion process more work.
967
00:42:35,280 --> 00:42:37,040
If you want high quality retrieval,
968
00:42:37,040 --> 00:42:39,360
you need to focus on a few specific categories.
969
00:42:39,360 --> 00:42:41,680
You need department and business unit for scope.
970
00:42:41,680 --> 00:42:43,920
You need region and jurisdiction for regulations.
971
00:42:43,920 --> 00:42:47,200
You need document types to tell a policy apart from a reference guide.
972
00:42:47,200 --> 00:42:49,440
You also need version status for recency,
973
00:42:49,440 --> 00:42:51,520
sensitivity labels for security,
974
00:42:51,520 --> 00:42:53,280
and tenant IDs for isolation.
975
00:42:53,280 --> 00:42:56,160
Most enterprise systems already track these things.
976
00:42:56,160 --> 00:42:58,880
The problem is that teams forget to carry those details
977
00:42:58,880 --> 00:43:00,720
through the pipeline and into the index.
978
00:43:00,720 --> 00:43:02,560
They leave them sitting in the source system
979
00:43:02,560 --> 00:43:04,320
where the search engine can't see them.
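A minimal sketch of carrying those categories on every chunk and turning them into a filter string. The field names (`tenant_id`, `version_status`, and so on) and the helper functions are hypothetical; the filter uses the OData `eq` and `and` operators that Azure AI Search filters accept, and it assumes each field was marked filterable when the index was created.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkMetadata:
    """Hypothetical per-chunk metadata covering the categories listed above."""
    tenant_id: str
    department: str
    region: str
    document_type: str
    version_status: str
    sensitivity: str


def odata_string(value: str) -> str:
    """Quote a string literal for OData; single quotes are escaped by doubling."""
    return "'" + value.replace("'", "''") + "'"


def build_filter(**equals: str) -> str:
    """Join field/value pairs into an OData equality filter expression."""
    clauses = [f"{name} eq {odata_string(value)}" for name, value in equals.items()]
    return " and ".join(clauses)


# One chunk of a policy document, tagged individually rather than by its parent.
chunk = ChunkMetadata(
    tenant_id="contoso",
    department="HR",
    region="EU",
    document_type="policy",
    version_status="active",
    sensitivity="internal",
)

query_filter = build_filter(
    tenant_id=chunk.tenant_id,
    version_status="active",
    region="EU",
)
print(query_filter)
```

The resulting string would be passed as the query's filter so that stale versions and other tenants' chunks are excluded before ranking.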
980
00:43:04,320 --> 00:43:07,600
Azure AI Search recently added a parameter called filter override.
981
00:43:07,600 --> 00:43:10,240
This lets you apply one filter to the vector search
982
00:43:10,240 --> 00:43:13,280
and a different one to the keyword search within the same query.
983
00:43:13,280 --> 00:43:16,320
This is huge because sometimes you want your semantic search
984
00:43:16,320 --> 00:43:18,560
to cast a wide net across all departments
985
00:43:18,560 --> 00:43:21,840
while your keyword search stays locked onto official manuals.
986
00:43:21,840 --> 00:43:23,120
Or maybe you want the opposite.
987
00:43:23,120 --> 00:43:25,520
You might want strict security filtering on the vectors
988
00:43:25,520 --> 00:43:27,760
but a broader lexical sweep on the text.
989
00:43:27,760 --> 00:43:30,320
The two sides of a search often need different boundaries
990
00:43:30,320 --> 00:43:32,720
and this parameter stops you from having to compromise.
991
00:43:32,720 --> 00:43:35,360
Security is the biggest reason to get this right.
992
00:43:35,360 --> 00:43:38,000
In Azure AI Search, the filter happens on the server
993
00:43:38,000 --> 00:43:40,240
before any results ever leave the search layer.
994
00:43:40,240 --> 00:43:42,240
This is much safer than trying to filter results
995
00:43:42,240 --> 00:43:44,560
in your application code after the search is done.
996
00:43:44,560 --> 00:43:46,160
Application layer security is brittle.
997
00:43:46,160 --> 00:43:49,040
It requires every single API and every single developer
998
00:43:49,040 --> 00:43:50,880
to get the logic right every time.
999
00:43:50,880 --> 00:43:53,520
If one person misses a check, you have a security breach.
1000
00:43:53,520 --> 00:43:55,760
Server-side filtering puts the lock on the data layer
1001
00:43:55,760 --> 00:43:58,080
where it can't be bypassed by a mistake in the app.
1002
00:43:58,080 --> 00:43:59,680
Your index design is what determines
1003
00:43:59,680 --> 00:44:01,680
if filtering is fast or even possible.
1004
00:44:01,680 --> 00:44:04,400
When you create an index, you have to mark specific fields
1005
00:44:04,400 --> 00:44:05,520
as filterable.
1006
00:44:05,520 --> 00:44:07,920
This tells the system to build the internal structures
1007
00:44:07,920 --> 00:44:10,240
it needs to handle those queries efficiently.
1008
00:44:10,240 --> 00:44:12,400
If you forget to mark a field as filterable,
1009
00:44:12,400 --> 00:44:14,160
you can't use it in a filter at all.
1010
00:44:14,160 --> 00:44:17,600
To fix that later, you have to rebuild the entire index from scratch.
1011
00:44:17,600 --> 00:44:19,600
On a massive dataset, that means downtime
1012
00:44:19,600 --> 00:44:21,120
or a very expensive parallel build.
1013
00:44:21,120 --> 00:44:23,120
You don't pay for a bad schema when you design it.
1014
00:44:23,120 --> 00:44:25,200
You pay for it months later when the system is live
1015
00:44:25,200 --> 00:44:26,560
and you're under pressure to fix the gap
1016
00:44:26,560 --> 00:44:28,400
the index wasn't built to handle.
1017
00:44:28,400 --> 00:44:30,560
Multi-tenant RAG governance at scale.
1018
00:44:30,560 --> 00:44:32,800
Most enterprise RAG deployments don't actually serve
1019
00:44:32,800 --> 00:44:34,080
one single group of people.
1020
00:44:34,080 --> 00:44:36,560
In reality, they serve HR, legal, engineering,
1021
00:44:36,560 --> 00:44:37,600
and finance all at once.
1022
00:44:37,600 --> 00:44:39,840
And every one of those departments expects the system
1023
00:44:39,840 --> 00:44:42,560
to show them only what they are authorized to see.
1024
00:44:42,560 --> 00:44:44,080
These systems handle business units
1025
00:44:44,080 --> 00:44:46,080
in different countries with unique regulations
1026
00:44:46,080 --> 00:44:47,760
and they manage partner organizations
1027
00:44:47,760 --> 00:44:49,920
with specific data sharing agreements.
1028
00:44:49,920 --> 00:44:51,600
We should call it what it is.
1029
00:44:51,600 --> 00:44:53,920
These systems are multi-tenant by nature,
1030
00:44:53,920 --> 00:44:57,040
even if the teams who built them never planned for it to be that way.
1031
00:44:57,040 --> 00:44:58,640
That accidental multi-tenancy
1032
00:44:58,640 --> 00:45:01,040
is exactly where governance starts to fall apart.
1033
00:45:01,040 --> 00:45:02,480
A system that was originally designed
1034
00:45:02,480 --> 00:45:05,040
for one department eventually gets expanded to five
1035
00:45:05,040 --> 00:45:07,840
and the metadata schema that worked for a single access level
1036
00:45:07,840 --> 00:45:11,200
gets stretched across a dozen different sensitivity labels.
1037
00:45:11,200 --> 00:45:12,640
The query routing logic was simple
1038
00:45:12,640 --> 00:45:13,920
when there was only one index
1039
00:45:13,920 --> 00:45:16,800
but it becomes a constant source of bugs once you have many.
1040
00:45:16,800 --> 00:45:19,200
Those architectural decisions that felt like options
1041
00:45:19,200 --> 00:45:20,880
during a pilot program suddenly become
1042
00:45:20,880 --> 00:45:23,040
load-bearing walls in a production system.
1043
00:45:23,040 --> 00:45:25,040
Two specific patterns handle this isolation
1044
00:45:25,040 --> 00:45:26,240
in Azure AI Search.
1045
00:45:26,240 --> 00:45:27,680
The first is a single global index
1046
00:45:27,680 --> 00:45:29,520
that uses rich metadata filters
1047
00:45:29,520 --> 00:45:31,920
which means one index holds content from every tenant
1048
00:45:31,920 --> 00:45:34,000
while every query is restricted by filters
1049
00:45:34,000 --> 00:45:36,400
to keep users inside their own documents.
1050
00:45:36,400 --> 00:45:38,240
The second approach uses separate indexes
1051
00:45:38,240 --> 00:45:39,840
for every tenant or domain.
1052
00:45:39,840 --> 00:45:42,080
These are isolated instances with their own schemas,
1053
00:45:42,080 --> 00:45:44,720
their own capacity and their own specific performance tuning.
1054
00:45:44,720 --> 00:45:47,360
The single global index approach offers some real advantages
1055
00:45:47,360 --> 00:45:48,640
for your operations team.
1056
00:45:48,640 --> 00:45:50,080
You only have one index to manage
1057
00:45:50,080 --> 00:45:51,280
and one schema to maintain
1058
00:45:51,280 --> 00:45:53,920
which makes it much easier to monitor your ingestion pipelines.
1059
00:45:53,920 --> 00:45:55,360
If you need to run analytics to see
1060
00:45:55,360 --> 00:45:57,520
how the whole organization is using the system,
1061
00:45:57,520 --> 00:45:59,120
the data is already in one place.
1062
00:45:59,120 --> 00:46:00,880
From the perspective of the application layer,
1063
00:46:00,880 --> 00:46:01,920
the logic is very simple
1064
00:46:01,920 --> 00:46:03,920
because you just send the query with a tenant filter
1065
00:46:03,920 --> 00:46:06,080
and get back the right results.
1066
00:46:06,080 --> 00:46:07,280
But here is the problem.
1067
00:46:07,280 --> 00:46:09,520
The risk is just as big as the convenience.
1068
00:46:09,520 --> 00:46:10,960
Your entire security model
1069
00:46:10,960 --> 00:46:13,760
depends on those metadata filters being applied correctly
1070
00:46:13,760 --> 00:46:15,360
every single time a query is made.
1071
00:46:15,360 --> 00:46:17,440
If a filter gets dropped in one API endpoint
1072
00:46:17,440 --> 00:46:19,680
or if a developer takes a shortcut during testing
1073
00:46:19,680 --> 00:46:21,600
that accidentally makes it into production,
1074
00:46:21,600 --> 00:46:24,880
you have a massive data exposure incident on your hands.
1075
00:46:24,880 --> 00:46:26,080
The system won't fail loudly
1076
00:46:26,080 --> 00:46:27,920
or throw an error when a filter is missing.
1077
00:46:27,920 --> 00:46:29,920
It will just silently return results
1078
00:46:29,920 --> 00:46:31,600
and you might not realize there is a problem
1079
00:46:31,600 --> 00:46:34,480
until an audit happens or a user sees something they shouldn't.
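One way to shrink that risk is to make it impossible to build a query without a tenant filter. Here is a minimal sketch of that idea. The `tenant_id` field name, the ID pattern, and the `build_tenant_query` helper are all hypothetical, and the function only builds a request dictionary rather than calling any real search SDK.

```python
import re

# Hypothetical allow-list pattern for tenant IDs (an assumption; adapt to your own IDs).
_TENANT_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def build_tenant_query(tenant_id: str, search_text: str, top: int = 10) -> dict:
    """Build a query payload that always carries a tenant filter.

    Raises ValueError instead of quietly producing an unfiltered query.
    """
    if not tenant_id or not _TENANT_ID.fullmatch(tenant_id):
        raise ValueError("a valid tenant_id is required for every query")
    return {
        "search": search_text,
        "top": top,
        # OData-style equality filter on the assumed 'tenant_id' field.
        "filter": f"tenant_id eq '{tenant_id}'",
    }


if __name__ == "__main__":
    print(build_tenant_query("contoso", "vacation policy"))
```

If every endpoint has to go through a helper like this, a missing filter becomes a loud exception instead of a silent leak.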
1080
00:46:34,480 --> 00:46:36,320
The separate index approach trades
1081
00:46:36,320 --> 00:46:39,280
that simplicity for much better architectural isolation.
1082
00:46:39,280 --> 00:46:40,960
Because each tenant has its own index,
1083
00:46:40,960 --> 00:46:42,880
a query for tenant A physically
1084
00:46:42,880 --> 00:46:45,040
cannot return documents from tenant B.
1085
00:46:45,040 --> 00:46:47,520
The security is enforced by the structure of the system
1086
00:46:47,520 --> 00:46:49,400
rather than the logic of a filter.
1087
00:46:49,400 --> 00:46:51,680
This also makes it easier to tune your performance
1088
00:46:51,680 --> 00:46:54,080
as you can give more capacity to high-volume tenants
1089
00:46:54,080 --> 00:46:56,960
without those decisions affecting anyone else in the system.
1090
00:46:56,960 --> 00:46:58,640
The trade-off here is that your routing logic
1091
00:46:58,640 --> 00:46:59,840
becomes more complex
1092
00:46:59,840 --> 00:47:02,400
and running cross-tenant analytics gets a lot harder.
1093
00:47:02,400 --> 00:47:05,440
Your application needs to know exactly which index to talk to
1094
00:47:05,440 --> 00:47:06,640
for every single request
1095
00:47:06,640 --> 00:47:09,520
and that routing has to stay updated as your tenants change.
1096
00:47:09,520 --> 00:47:12,320
DiskANN through Cosmos DB has a massive structural advantage
1097
00:47:12,320 --> 00:47:14,080
if you choose the single global index pattern.
1098
00:47:14,080 --> 00:47:16,160
Because it uses automatic partitioning,
1099
00:47:16,160 --> 00:47:18,560
the index scales alongside your tenant data
1100
00:47:18,560 --> 00:47:20,640
without anyone having to step in and fix it.
1101
00:47:20,640 --> 00:47:23,200
If one tenant grows much faster than the others,
1102
00:47:23,200 --> 00:47:27,200
Cosmos DB rebalances that data across partitions automatically.
1103
00:47:27,200 --> 00:47:31,040
In an HNSW deployment, that same growth would eat up more memory
1104
00:47:31,040 --> 00:47:32,960
across every single node in the index.
1105
00:47:32,960 --> 00:47:36,000
That means your total RAM requirement is dictated by your heaviest tenant,
1106
00:47:36,000 --> 00:47:37,120
not your average one,
1107
00:47:37,120 --> 00:47:41,280
and one fast-growing department can push your entire bill into a higher tier.
1108
00:47:41,280 --> 00:47:43,280
One rule applies to both of these patterns.
1109
00:47:43,280 --> 00:47:46,000
You must enforce access control at the moment of retrieval
1110
00:47:46,000 --> 00:47:47,360
inside the search layer.
1111
00:47:47,360 --> 00:47:49,360
You can't wait for the API gateway
1112
00:47:49,360 --> 00:47:51,360
or try to filter things in the application
1113
00:47:51,360 --> 00:47:53,280
after the results have already arrived.
1114
00:47:53,280 --> 00:47:55,520
The vector index itself should never surface a document
1115
00:47:55,520 --> 00:47:59,280
the user isn't allowed to see no matter how relevant that document is to the search.
1116
00:47:59,280 --> 00:48:02,480
Semantic similarity is a great tool for finding information
1117
00:48:02,480 --> 00:48:04,800
but it is not an authorization mechanism.
1118
00:48:04,800 --> 00:48:06,000
The business case.
1119
00:48:06,000 --> 00:48:08,400
When to invest in DiskANN infrastructure?
1120
00:48:08,400 --> 00:48:11,760
The business case for DiskANN isn't actually about how the algorithm works.
1121
00:48:11,760 --> 00:48:14,160
Executives don't sign off on new infrastructure
1122
00:48:14,160 --> 00:48:15,760
because an algorithm is clever.
1123
00:48:15,760 --> 00:48:19,360
They fund it because the current path is becoming too expensive to sustain.
1124
00:48:19,360 --> 00:48:22,000
That is the conversation where DiskANN actually belongs.
1125
00:48:22,000 --> 00:48:23,440
It isn't for the technical review.
1126
00:48:23,440 --> 00:48:27,200
It's for the budget review that happens six months before your team hits the memory wall.
1127
00:48:27,200 --> 00:48:30,000
The way to explain this to a finance team is pretty simple.
1128
00:48:30,000 --> 00:48:31,920
HNSW is a great algorithm,
1129
00:48:31,920 --> 00:48:35,440
but it has a cost structure that scales directly with your ambitions.
1130
00:48:35,440 --> 00:48:37,280
Every time you add more documents,
1131
00:48:37,280 --> 00:48:39,920
onboard more tenants or move into new regions,
1132
00:48:39,920 --> 00:48:42,240
your RAM requirements grow right along with them.
1133
00:48:42,240 --> 00:48:45,600
Eventually the cost of that growth is more than the project can justify.
1134
00:48:45,600 --> 00:48:48,080
And the conversation changes from "how do we build this?" to
1135
00:48:48,080 --> 00:48:49,840
"can we even afford to run this?"
1136
00:48:49,840 --> 00:48:53,360
DiskANN changes that dynamic by decoupling your costs from your growth.
1137
00:48:53,360 --> 00:48:55,920
When you look at the numbers for 500 million vectors,
1138
00:48:55,920 --> 00:48:57,760
the problem becomes very clear.
1139
00:48:57,760 --> 00:49:01,120
At that scale, HNSW needs about three terabytes of RAM
1140
00:49:01,120 --> 00:49:05,200
just to hold the index and that doesn't even include your raw data or your application.
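You can roughly reproduce that three-terabyte figure with back-of-the-envelope arithmetic. This sketch assumes 1,536-dimensional float32 embeddings and 32 four-byte graph links per vector. Those values are illustrative assumptions, not numbers quoted in the episode, and the `hnsw_ram_bytes` helper is hypothetical.

```python
def hnsw_ram_bytes(num_vectors: int, dims: int = 1536, bytes_per_value: int = 4,
                   links_per_node: int = 32, bytes_per_link: int = 4) -> dict:
    """Estimate in-memory index size: raw vectors plus graph neighbor links."""
    vector_bytes = num_vectors * dims * bytes_per_value
    graph_bytes = num_vectors * links_per_node * bytes_per_link
    return {
        "vectors": vector_bytes,
        "graph": graph_bytes,
        "total": vector_bytes + graph_bytes,
    }


estimate = hnsw_ram_bytes(500_000_000)

if __name__ == "__main__":
    for part, size in estimate.items():
        print(f"{part}: {size / 1e12:.2f} TB")
```

Under these assumptions the raw vectors alone come to roughly 3 TB, and the graph links add a comparatively small amount on top. Change the dimension count or data type and the total moves in direct proportion.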
1141
00:49:05,200 --> 00:49:08,640
On Azure, three terabytes of RAM requires expensive,
1142
00:49:08,640 --> 00:49:12,160
memory-optimized virtual machines across several different nodes.
1143
00:49:12,160 --> 00:49:14,560
If you want high availability, that cost doubles,
1144
00:49:14,560 --> 00:49:17,360
and if you need resilience in a second region, it triples.
1145
00:49:17,360 --> 00:49:20,400
At this scale, your search layer stops being a small line item
1146
00:49:20,400 --> 00:49:23,360
and starts becoming a headline in your quarterly budget review.
1147
00:49:23,360 --> 00:49:27,200
DiskANN only needs about 50 gigabytes of RAM for that same scale.
1148
00:49:27,200 --> 00:49:30,400
While SSD costs are real, they are a tiny fraction of what you would pay
1149
00:49:30,400 --> 00:49:32,000
for memory-optimized compute.
1150
00:49:32,000 --> 00:49:33,440
This gap isn't just a theory,
1151
00:49:33,440 --> 00:49:37,200
it is exactly what you see when you compare the two in a pricing calculator.
1152
00:49:37,200 --> 00:49:40,640
There is also an ROI argument that goes way beyond just the infrastructure bill.
1153
00:49:40,640 --> 00:49:46,160
Data from 2026 shows that hybrid search setups reduce hallucinations by 62%
1154
00:49:46,160 --> 00:49:47,920
compared to using only vectors.
1155
00:49:47,920 --> 00:49:50,400
Fewer hallucinations mean fewer mistakes,
1156
00:49:50,400 --> 00:49:55,520
which leads to fewer hours spent by experts trying to figure out why a copilot gave a user the wrong answer.
1157
00:49:55,520 --> 00:49:56,800
Those costs are very real,
1158
00:49:56,800 --> 00:50:01,120
but they usually show up in support queues and compliance reviews instead of the IT budget.
1159
00:50:01,120 --> 00:50:05,840
The link between your index architecture and these outcomes is a direct cause and effect relationship.
1160
00:50:05,840 --> 00:50:09,600
The timeline for seeing a return on your investment follows that same logic.
1161
00:50:09,600 --> 00:50:14,320
Organizations using hybrid retrieval reach their goals three and a half times faster than those who don't.
1162
00:50:14,320 --> 00:50:16,320
This isn't because the tech is faster to set up,
1163
00:50:16,320 --> 00:50:21,120
but because bad retrieval creates a never-ending cycle of debugging that kills productivity.
1164
00:50:21,120 --> 00:50:26,000
A system that gives the wrong answer with total confidence is actually worse than a system that gives no answer at all.
1165
00:50:26,000 --> 00:50:31,200
The cost of a user acting on a wrong answer is almost always higher than the cost of them getting no result.
1166
00:50:31,200 --> 00:50:34,480
The biggest hidden cost is the price of not investing early enough.
1167
00:50:34,480 --> 00:50:38,800
RAG systems don't usually break in a way that triggers an alarm; they just slowly get worse.
1168
00:50:38,800 --> 00:50:42,560
When a system gives a plausible answer based on the wrong context,
1169
00:50:42,560 --> 00:50:44,080
it doesn't create an error log.
1170
00:50:44,080 --> 00:50:49,280
Instead it changes how people behave, leading them to ignore the results or stop using the system entirely.
1171
00:50:49,280 --> 00:50:51,840
By the time you see low adoption numbers in your metrics,
1172
00:50:51,840 --> 00:50:53,760
the trust has been broken for months,
1173
00:50:53,760 --> 00:50:56,480
and fixing that takes more than just a software update.
1174
00:50:56,480 --> 00:50:59,760
You can usually see the decision trigger coming long before the crisis actually hits.
1175
00:50:59,760 --> 00:51:02,480
If your team is already talking about the cost of the search layer,
1176
00:51:02,480 --> 00:51:05,920
or if scaling the index keeps coming up in meetings, that is your signal.
1177
00:51:05,920 --> 00:51:10,560
The memory wall isn't a sudden crash but a slow increase in the cost of things that used to be easy.
1178
00:51:10,560 --> 00:51:15,040
Architects who look at DiskANN before they actually need it can make a calm, smart decision.
1179
00:51:15,040 --> 00:51:19,840
The ones who wait until they have no choice end up making an expensive move under a lot of pressure.
1180
00:51:19,840 --> 00:51:21,600
The algorithm stays the same in both cases,
1181
00:51:21,600 --> 00:51:24,720
but the options you have available to you definitely do not.
1182
00:51:24,720 --> 00:51:27,520
The hybrid architecture: using both where each wins.
1183
00:51:27,520 --> 00:51:33,040
The most sophisticated enterprise architectures in 2026 don't actually try to settle the HNSW versus DiskANN
1184
00:51:33,040 --> 00:51:37,360
debate. They simply sidestep the conflict by using both algorithms at the same time,
1185
00:51:37,360 --> 00:51:40,960
applying each one to the specific part of the workload where it actually makes sense.
1186
00:51:40,960 --> 00:51:44,800
In database circles, this pattern is known as tiered storage.
1187
00:51:44,800 --> 00:51:48,480
The logic is straightforward: hot data stays in fast, expensive storage,
1188
00:51:48,480 --> 00:51:51,120
while cold data sits in slower, cheaper tiers.
1189
00:51:51,120 --> 00:51:55,280
The system handles the movement between these layers automatically based on how people actually use
1190
00:51:55,280 --> 00:51:59,840
the data, and it turns out that vector search follows the same lopsided access patterns that make
1191
00:51:59,840 --> 00:52:02,080
tiered storage work everywhere else.
1192
00:52:02,080 --> 00:52:06,960
In a typical company deployment, a tiny sliver of the total content handles the vast majority of the
1193
00:52:06,960 --> 00:52:12,000
traffic. People check the same HR policies over and over, or they look for documentation on the five
1194
00:52:12,000 --> 00:52:17,200
products that drive 80% of the support tickets. While the compliance team might reference specific
1195
00:52:17,200 --> 00:52:21,280
regulatory guidance in every single review, the rest of the library just sits there.
1196
00:52:21,280 --> 00:52:25,360
Archived contracts and old incident reports stay in the index, but they almost never actually
1197
00:52:25,360 --> 00:52:30,080
surface in a real production query. HNSW is the right choice for that hot working set because it
1198
00:52:30,080 --> 00:52:35,280
keeps those high-traffic documents in a compact in-memory graph. This allows retrieval to happen
1199
00:52:35,280 --> 00:52:40,480
at memory speeds without ever touching an SSD, making sub-10-millisecond responses a reality.
1200
00:52:40,480 --> 00:52:45,120
For the small group of documents that handles most of your traffic, the cost of the RAM is worth it
1201
00:52:45,120 --> 00:52:49,440
because you aren't trying to shove the entire massive corpus into memory at once.
1202
00:52:49,440 --> 00:52:53,600
DiskANN takes over for the cold tier, which includes the full library of documents that people rarely
1203
00:52:53,600 --> 00:52:57,360
ask for. You can't leave these out of the index because any one of them might be the exact answer
1204
00:52:57,360 --> 00:53:01,680
a user needs someday, but you don't want to pay to keep them in RAM. If the hot cache returns a
1205
00:53:01,680 --> 00:53:05,920
low-confidence match, the system routes the query to the DiskANN service for a deeper search.
1206
00:53:05,920 --> 00:53:11,280
The latency jumps to about 15 or 20 milliseconds, but since the user is already waiting for a more thorough
1207
00:53:11,280 --> 00:53:15,520
answer, that extra time usually doesn't even register against their expectations.
1208
00:53:15,520 --> 00:53:19,680
If you look at an Azure implementation, this usually means pairing Redis and HNSW for the hot
1209
00:53:19,680 --> 00:53:24,640
tier with Cosmos DB and DiskANN for the full corpus. The application layer checks the hot cache
1210
00:53:24,640 --> 00:53:29,360
first, and if the top match is strong enough, it returns the answer immediately. If the match is weak,
1211
00:53:29,360 --> 00:53:34,400
the system fans out to the cold tier to search the complete index. These two layers scale and fail
1212
00:53:34,400 --> 00:53:39,120
independently, so a sudden spike in cold tier traffic won't ever slow down the high speed hot tier.
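The hot-then-cold routing just described can be sketched in a few lines. This is a minimal illustration, assuming each tier is exposed as a plain function that returns `(document_id, score)` pairs with higher scores meaning better matches. The `tiered_search` name and the 0.8 threshold are hypothetical placeholders you would tune against your own data.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]  # (document id, similarity score; higher is better)
SearchFn = Callable[[str], List[Hit]]


def tiered_search(query: str, hot_search: SearchFn, cold_search: SearchFn,
                  min_hot_score: float = 0.8) -> Tuple[str, List[Hit]]:
    """Answer from the hot tier when its best match is strong enough;
    otherwise fall back to the full cold-tier index."""
    hot_hits = hot_search(query)
    best = max((score for _, score in hot_hits), default=float("-inf"))
    if best >= min_hot_score:
        return "hot", hot_hits
    return "cold", cold_search(query)


if __name__ == "__main__":
    # Stand-in tiers for demonstration only.
    def hot(query: str) -> List[Hit]:
        return [("hr-policy", 0.91)] if "vacation" in query else [("hr-policy", 0.42)]

    def cold(query: str) -> List[Hit]:
        return [("contract-2014", 0.77)]

    print(tiered_search("vacation days", hot, cold))
    print(tiered_search("old supplier contract", hot, cold))
```

Returning the tier name alongside the hits makes it easy to log how often you fall through to the cold tier, which tells you whether the hot set is sized correctly.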
1213
00:53:39,120 --> 00:53:43,280
There is a real engineering cost here that you shouldn't ignore. Running two index tiers means
1214
00:53:43,280 --> 00:53:47,840
you have to manage two ingestion pipelines and two sets of parameters, all while keeping the
1215
00:53:47,840 --> 00:53:52,560
routing logic updated as your data evolves. You have to decide which documents deserve to be in
1216
00:53:52,560 --> 00:53:57,120
the hot tier and when to demote files that nobody is reading anymore, and those operational choices
1217
00:53:57,120 --> 00:54:01,440
won't just make themselves. The system is definitely more complex than just picking one algorithm
1218
00:54:01,440 --> 00:54:05,760
and sticking with it, but that complexity pays off once you reach a certain scale. When your
1219
00:54:05,760 --> 00:54:10,080
hot working set is measurably smaller than your full library, the money you save by not buying massive
1220
00:54:10,080 --> 00:54:14,720
amounts of RAM easily covers the cost of the extra engineering. It's a trade off that pays for
1221
00:54:14,720 --> 00:54:19,760
itself very quickly. The decision to build this should be based on data, not a hunch.
1222
00:54:19,760 --> 00:54:24,080
Pull your query logs and see which documents show up in the top results for 90% of your traffic.
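As a rough sketch of that log analysis, assume you can export one document ID per retrieved top result from your query logs. The `hot_set` helper below is hypothetical and simply picks the most frequently retrieved documents until they cover the target share of log entries.

```python
from collections import Counter
from typing import Iterable, List


def hot_set(retrieved_doc_ids: Iterable[str], coverage: float = 0.9) -> List[str]:
    """Return the smallest set of most-retrieved documents that accounts
    for at least `coverage` of all logged retrievals."""
    counts = Counter(retrieved_doc_ids)
    total = sum(counts.values())
    selected: List[str] = []
    covered = 0
    for doc_id, count in counts.most_common():
        if total == 0 or covered / total >= coverage:
            break
        selected.append(doc_id)
        covered += count
    return selected


if __name__ == "__main__":
    log = ["hr-policy"] * 80 + ["product-faq"] * 15 + ["contract-2014"] * 3 + ["incident-77"] * 2
    print(hot_set(log))
```

Compare the length of the returned list with the number of documents in the full index to get the ratio that decides whether a hot tier is worth it.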
1223
00:54:24,080 --> 00:54:28,560
If that list of documents is small compared to your total index, then you have a hot tier that is
1224
00:54:28,560 --> 00:54:33,920
worth building. Evaluating retrieval quality: metrics that actually matter. Most teams build a RAG
1225
00:54:33,920 --> 00:54:37,920
system and run a few test queries on topics they already know by heart. When the answers look
1226
00:54:37,920 --> 00:54:42,480
reasonable, they ship the product to users. But this creates a massive evaluation gap. It isn't a
1227
00:54:42,480 --> 00:54:47,280
matter of laziness, but rather the result of working under tight deadlines without a real baseline
1228
00:54:47,280 --> 00:54:51,440
to measure against. The problem won't show up on the day you launch, but it will show up six months
1229
00:54:51,440 --> 00:54:56,080
later when users have quietly stopped using the tool because they no longer trust it. The metrics
1230
00:54:56,080 --> 00:55:00,560
that actually define retrieval quality are split into two groups. One group tells you if you found
1231
00:55:00,560 --> 00:55:05,120
the right documents and the other tells you if those documents actually help the model give a grounded
1232
00:55:05,120 --> 00:55:09,520
answer. You need both to see the full picture because looking at either one in isolation will lead
1233
00:55:09,520 --> 00:55:13,440
you to the wrong conclusion. Precision at K measures what percentage of your top results are
1234
00:55:13,440 --> 00:55:18,320
actually useful. If the system pulls 10 documents and seven of them help answer the question,
1235
00:55:18,320 --> 00:55:24,080
your precision at 10 is 0.7. This metric is vital because it punishes systems that fill the results with
1236
00:55:24,080 --> 00:55:28,320
nearby documents that don't actually help. High precision matters because irrelevant documents
1237
00:55:28,320 --> 00:55:32,160
take up space in the context window, which makes it much more likely that the model will focus on
1238
00:55:32,160 --> 00:55:36,160
the wrong information. Recall at K asks the opposite question by looking at all the relevant
1239
00:55:36,160 --> 00:55:40,000
documents in your library and seeing how many actually made it into the top results. A system
1240
00:55:40,000 --> 00:55:44,160
might have great precision but terrible recall, which means it finds good documents but misses the
1241
00:55:44,160 --> 00:55:49,520
most important ones. In a corporate setting, missing a specific policy or a legal requirement isn't just
1242
00:55:49,520 --> 00:55:54,000
a minor inconvenience. It is a massive gap in the model's knowledge that leads to incomplete answers
1243
00:55:54,000 --> 00:55:58,880
even if the few documents it did find were technically correct. Mean reciprocal rank tracks exactly
1244
00:55:58,880 --> 00:56:03,360
where that first right answer appears in the list. A relevant document in the first slot is much
1245
00:56:03,360 --> 00:56:07,840
better than one in the fifth slot and both are better than a result at rank 20. Most systems give
1246
00:56:07,840 --> 00:56:12,160
more weight to the first few results they see. So this metric captures whether the right information
1247
00:56:12,160 --> 00:56:17,440
is showing up early enough to actually influence the final answer. Normalized discounted cumulative gain
1248
00:56:17,440 --> 00:56:21,920
takes this a step further by using graded scores for relevance. It recognizes that an official
1249
00:56:21,920 --> 00:56:26,960
policy is more important than a random FAQ and it discounts the value of a result if it shows up too
1250
00:56:26,960 --> 00:56:31,280
low in the ranking. This is the best metric to use when your library contains documents that have
1251
00:56:31,280 --> 00:56:35,760
different levels of authority or importance. Those four metrics tell you how the retrieval layer is
1252
00:56:35,760 --> 00:56:40,880
doing but you also need to bridge the gap to the generation phase. Context precision looks at the
1253
00:56:40,880 --> 00:56:45,040
snippets sent to the prompt and asks how many were actually needed. A high score here means your
1254
00:56:45,040 --> 00:56:49,200
pipeline is efficient and clean while a low score means you are drowning the model in noise
1255
00:56:49,200 --> 00:56:53,760
and hoping it can figure things out on its own. Context recall is the inverse, measuring whether the
1256
00:56:53,760 --> 00:56:58,400
information needed for the answer was actually present in the retrieved text. When this score is low,
1257
00:56:58,400 --> 00:57:01,840
the risk of hallucination goes through the roof because the model will try to fill in the blanks
1258
00:57:01,840 --> 00:57:06,080
using its own internal memory. Faithfulness happens at the very end of the process
1259
00:57:06,080 --> 00:57:10,800
but it is really a reflection of how well retrieval did its job. It checks if every single claim in
1260
00:57:10,800 --> 00:57:15,760
the final answer is backed up by the retrieved text. If a system fails faithfulness tests consistently,
1261
00:57:15,760 --> 00:57:20,080
you are usually looking at a retrieval failure that is just wearing a generation costume. The tools to
1262
00:57:20,080 --> 00:57:24,720
measure all of this are already available. You can use Ragas or ARES for open-source evaluation
1263
00:57:24,720 --> 00:57:29,680
or look at LangSmith and Azure AI Studio for managed pipelines that handle these metrics at scale.
1264
00:57:29,680 --> 00:57:34,800
DeepEval is also great for looking at specific components like contextual precision. However, none of
1265
00:57:34,800 --> 00:57:40,240
these tools will work unless you build a golden dataset which is a curated list of 50 to 100
1266
00:57:40,240 --> 00:57:45,440
real queries with known correct answers. A one-time evaluation is a good start but it isn't enough to
1267
00:57:45,440 --> 00:57:50,560
keep a system healthy. Models get updated, content changes, and user behavior will eventually shift
1268
00:57:50,560 --> 00:57:55,040
away from what you originally planned for. Quality that looks great at launch can drift downward for
1269
00:57:55,040 --> 00:57:59,360
months before anyone notices a problem. To stay ahead of it your measurement has to be a continuous
1270
00:57:59,360 --> 00:58:04,800
part of the process rather than a one-time ceremony. Observability and drift: keeping the system
1271
00:58:04,800 --> 00:58:09,440
honest. A RAG system without observability is just a black box that produces answers you can't
1272
00:58:09,440 --> 00:58:14,000
verify. Your standard logs won't tell you if the model actually used the retrieved context or if
1273
00:58:14,000 --> 00:58:18,400
it just made something up based on its own internal training. The dashboards might show you request
1274
00:58:18,400 --> 00:58:22,240
counts and speed but they tell you absolutely nothing about whether the information fed into the
1275
00:58:22,240 --> 00:58:27,360
response was relevant to the user's question. You have plenty of outputs but you have zero visibility
1276
00:58:27,360 --> 00:58:32,000
into how those outputs were created or why they should be trusted. This isn't just a small gap in
1277
00:58:32,000 --> 00:58:37,200
your operations. It is the architectural equivalent of running a bank with no audit trail where you can
1278
00:58:37,200 --> 00:58:41,760
see the final balance but have no way to trace how the money got there. Your observability stack needs
1279
00:58:41,760 --> 00:58:46,400
to capture the full life cycle of every single query. That means tracking the original question,
1280
00:58:46,400 --> 00:58:50,720
any rewriting that happened before the search, the specific documents pulled from each part of the
1281
00:58:50,720 --> 00:58:55,680
hybrid search and the similarity scores for every candidate. You also need the re-ranker scores
1282
00:58:55,680 --> 00:59:00,160
that decided the final order and the actual answer that came out at the end. Every piece of that chain
1283
00:59:00,160 --> 00:59:05,040
is a signal and any missing link is a dangerous blind spot for your team. These traces serve two
1284
00:59:05,040 --> 00:59:09,520
different masters at the same time. For your engineers they are the diagnostic tools that make it
1285
00:59:09,520 --> 00:59:13,760
possible to analyze a failure. When a user complains that the system gave them the wrong advice,
1286
00:59:13,760 --> 00:59:17,360
the trace shows you exactly which documents were found and what their scores were. If the right
1287
00:59:17,360 --> 00:59:21,520
document was found but ranked eighth instead of first you have a re-ranker problem. If the right
1288
00:59:21,520 --> 00:59:26,000
document wasn't found at all you have a retrieval problem. Without that trace both of those failures
1289
00:59:26,000 --> 00:59:30,880
look exactly the same from the outside and fixing them becomes total guesswork. For your compliance
1290
00:59:30,880 --> 00:59:35,680
teams, those same traces are the official audit record. They need to know which user asked what, when
1291
00:59:35,680 --> 00:59:40,240
they asked it, which documents they saw, and what the system said in response. That is the chain of
1292
00:59:40,240 --> 00:59:44,720
custody that regulators and internal auditors are going to ask for eventually. If you design your
1293
00:59:44,720 --> 00:59:48,880
observability for engineering needs from the start you satisfy those governance requirements
1294
00:59:48,880 --> 00:59:53,200
automatically but that only works if you build it into the foundation rather than trying to patch it
1295
00:59:53,200 --> 00:59:57,760
on after a compliance officer knocks on your door. Embedding drift is the specific type of failure
1296
00:59:57,760 --> 01:00:02,240
that good observability catches before your users even notice a problem. When an embedding model
1297
01:00:02,240 --> 01:00:06,960
gets an update, even a tiny version change, the way it represents data shifts. Documents that were
1298
01:00:06,960 --> 01:00:11,360
indexed with the old model and queries coming in through the new model now live in slightly different
1299
01:00:11,360 --> 01:00:15,840
worlds. Similarity scores that meant one thing yesterday mean something else today. The index
1300
01:00:15,840 --> 01:00:20,000
won't look broken but it will start returning results that are close in the new space but aren't
1301
01:00:20,000 --> 01:00:24,000
the right answers anymore. The way you catch this is actually pretty simple. You just need to track
1302
01:00:24,000 --> 01:00:28,960
the distribution of your similarity scores over time. Under normal conditions the scores for your
1303
01:00:28,960 --> 01:00:33,120
top results should follow a very consistent pattern. When that pattern shifts, like when average
1304
01:00:33,120 --> 01:00:37,760
scores suddenly drop or when scores that used to be high start clustering much lower, you know
1305
01:00:37,760 --> 01:00:42,480
something has changed. It could be the index, the model, or even just the way users are asking questions.
1306
01:00:42,480 --> 01:00:46,880
Whatever it is, you need to investigate it immediately instead of letting the quality degrade
1307
01:00:46,880 --> 01:00:51,840
for weeks. You can take this a step further with document position analysis. This means tracking
1308
01:00:51,840 --> 01:00:56,240
which parts of the retrieved context the model actually uses to write its answer. If the model is
1309
01:00:56,240 --> 01:01:01,360
constantly ignoring the top three results and pulling its answers from documents ranked 7 through 10,
1310
01:01:01,360 --> 01:01:06,400
your re-ranker isn't doing its job. Your retrieval metrics might look fine because the right info is in
1311
01:01:06,400 --> 01:01:11,360
the set, but the order is wrong for what the generator actually needs. That is a problem you can fix,
1312
01:01:11,360 --> 01:01:16,400
but only if you have the data to see it. Finally, cost observability helps you make real infrastructure
1313
01:01:16,400 --> 01:01:20,640
decisions. You need to know the cost per query, the cost per correct answer, and the cost for each
1314
01:01:20,640 --> 01:01:25,440
specific tenant using the system. If you have a massive index tier that isn't seeing much traffic, you
1315
01:01:25,440 --> 01:01:30,240
are just burning your budget for no reason. On the other hand, if one tenant is constantly hitting
1316
01:01:30,240 --> 01:01:35,280
expensive, slow storage, it might be time to move them to a dedicated fast cache. These are business
1317
01:01:35,280 --> 01:01:40,320
decisions and they require the kind of data that only a full observability stack can provide.
1318
01:01:40,320 --> 01:01:45,040
Implementation roadmap: from decision to production. In a RAG project, the order of operations matters
1319
01:01:45,040 --> 01:01:50,000
much more than the specific tools you buy. Most teams get this completely backward by picking an
1320
01:01:50,000 --> 01:01:55,040
algorithm and setting up servers before they even know if their data is ready for that setup. To avoid
1321
01:01:55,040 --> 01:01:58,880
the most expensive mistakes you have to start with a deep assessment of your documents and work
1322
01:01:58,880 --> 01:02:03,040
forward one step at a time. You have to start with a corpus assessment because your choice of algorithm
1323
01:02:03,040 --> 01:02:07,680
depends entirely on the size of your data. Don't just look at where you are today but look at where
1324
01:02:07,680 --> 01:02:12,320
you will be in six months or two years. A system that starts with 8 million vectors and grows to
1325
01:02:12,320 --> 01:02:16,880
100 million is a completely different animal than one that stays small. If your two-year plan puts
1326
01:02:16,880 --> 01:02:21,360
you over the 50 million mark, starting with a basic in-memory index means you'll be forced to migrate
1327
01:02:21,360 --> 01:02:25,680
while the system is live. That is always more expensive and more stressful than just designing for
1328
01:02:25,680 --> 01:02:30,800
that scale on day one. The second big decision is your chunking strategy and this is where most
1329
01:02:30,800 --> 01:02:35,280
teams get lazy. You should use structure-aware chunking that respects things like section boundaries
1330
01:02:35,280 --> 01:02:40,080
and heading hierarchies instead of just cutting text at random intervals. A good starting point is
1331
01:02:40,080 --> 01:02:45,680
usually between 300 and 800 tokens per chunk with a bit of overlap. This gives the embedding model a
1332
01:02:45,680 --> 01:02:50,400
coherent piece of information to work with. Remember that this is just a baseline and you'll need to adjust
1333
01:02:50,400 --> 01:02:55,440
those numbers based on how well the system actually finds information in your specific documents.
1334
01:02:55,440 --> 01:03:00,160
Metadata schema design is another area where early choices become either a huge help or a massive
1335
01:03:00,160 --> 01:03:05,360
headache later on. You need to define your filterable fields like department, region, date,
1336
01:03:05,360 --> 01:03:10,160
and security level before you start indexing anything. If you decide you need a new filter after
1337
01:03:10,160 --> 01:03:14,640
the index is built, you usually have to rebuild the entire thing from scratch. At large scale, that
1338
01:03:14,640 --> 01:03:19,280
means significant downtime. It is much better to include a field you might not use yet than to
1339
01:03:19,280 --> 01:03:23,680
realize you're missing one once you're already in production. Once your schema is set you can finally
1340
01:03:23,680 --> 01:03:27,920
pick your embedding model. The most important rule here is consistency. The model you use to
1341
01:03:27,920 --> 01:03:31,600
index your documents and the model you use for user queries must be exactly the same version.
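A simple guard can enforce that rule at query time. This sketch assumes the index stores the name of its embedding model under a hypothetical `embedding_model` metadata key. The helper name and the model strings in the example are illustrative only.

```python
def assert_same_embedding_model(index_metadata: dict, query_model: str) -> None:
    """Refuse to run a query when the query-time embedding model differs
    from the one recorded when the index was built."""
    indexed_with = index_metadata.get("embedding_model")  # hypothetical key
    if indexed_with is None:
        raise ValueError("index metadata does not record an embedding model")
    if indexed_with != query_model:
        raise ValueError(
            f"embedding model mismatch: index built with {indexed_with!r}, "
            f"query uses {query_model!r}"
        )


if __name__ == "__main__":
    metadata = {"embedding_model": "example-embedder-v2"}
    assert_same_embedding_model(metadata, "example-embedder-v2")  # passes silently
    try:
        assert_same_embedding_model(metadata, "example-embedder-v3")
    except ValueError as err:
        print(err)
```

A check like this turns a silent quality drop into an explicit error you can alert on.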
1342
01:03:31,600 --> 01:03:35,520
If they get out of sync, your retrieval quality will fall off a cliff without ever throwing an error
1343
01:03:35,520 --> 01:03:40,560
message. You should actually document the model version inside the index schema itself so that
1344
01:03:40,560 --> 01:03:44,960
anyone working on it later can see exactly what the dependencies are. Next comes the actual
1345
01:03:44,960 --> 01:03:49,360
configuration of your retrieval pipeline. You should use hybrid search as your default setting
1346
01:03:49,360 --> 01:03:54,000
which means running traditional keyword search and vector search at the same time. You then use
1347
01:03:54,000 --> 01:03:58,560
something like RRF (reciprocal rank fusion) to merge those lists and a semantic ranker to polish the top 50 results.
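To make the merge step concrete, here is a small sketch of reciprocal rank fusion. Each document earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The constant k = 60 is a commonly used default and is treated here as an assumption, and this standalone function is an illustration rather than the service's internal implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one list using RRF."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_hits = ["a", "b", "c"]
    vector_hits = ["b", "c", "d"]
    print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Because RRF only looks at rank positions, it can combine keyword scores and vector similarities even though the two are on completely different scales.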
1348
01:03:58,560 --> 01:04:04,080
If you have the time in your latency budget, adding a cross-encoder re-ranker can significantly improve
1349
01:04:04,080 --> 01:04:08,560
how the system handles complex questions. You should also set a similarity threshold to filter
1350
01:04:08,560 --> 01:04:13,440
out low confidence results so you don't clutter the final steps with useless noise. Before you ever
1351
01:04:13,440 --> 01:04:18,800
launch you need to run an evaluation using a golden dataset. This is a collection of 50 to 100
1352
01:04:18,800 --> 01:04:22,720
real-world questions where you already know what the right answer and the right documents are.
1353
01:04:22,720 --> 01:04:26,880
You measure the system against these and use those numbers as your baseline for performance.
1354
01:04:26,880 --> 01:04:30,960
These aren't just random stats. They become your official performance targets.
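A baseline like this can be computed with a few lines of Python. The sketch below assumes a golden set stored as a list of dictionaries with hypothetical keys `question` and `relevant_docs`, and a `search_fn` you supply that returns ranked document IDs. Recall@k and mean reciprocal rank are example metrics chosen for illustration; the episode does not name specific ones.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the known-relevant documents found in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(golden_set, search_fn, k=10):
    """Average recall@k and reciprocal rank over every golden question."""
    if not golden_set:
        raise ValueError("golden_set must contain at least one question")
    recalls, ranks = [], []
    for item in golden_set:
        retrieved = search_fn(item["question"])
        recalls.append(recall_at_k(retrieved, item["relevant_docs"], k))
        ranks.append(reciprocal_rank(retrieved, item["relevant_docs"]))
    n = len(golden_set)
    return {"recall_at_k": sum(recalls) / n, "mrr": sum(ranks) / n}


# Tiny made-up golden set and a stub search function standing in for the real pipeline.
golden_set = [
    {"question": "q1", "relevant_docs": ["d1", "d2"]},
    {"question": "q2", "relevant_docs": ["d9"]},
]
fake_results = {"q1": ["d1", "d3", "d2"], "q2": ["d4", "d9"]}
baseline = evaluate(golden_set, lambda q: fake_results[q], k=2)
```

Storing the returned dictionary alongside the index configuration gives you the numbers to compare against when you rerun the same golden set later.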
1355
01:04:30,960 --> 01:04:34,960
When things start to feel off later, you'll have these numbers to prove whether the system has actually
1356
01:04:34,960 --> 01:04:40,240
drifted or if it's just a one-off issue. Finally, remember that observability is not something you add
1357
01:04:40,240 --> 01:04:45,360
after the launch. Logging, dashboards, and drift monitoring are core parts of a production-ready system.
1358
01:04:45,360 --> 01:04:49,120
If you go live without full trace logging you won't be able to fix the system when it breaks or
1359
01:04:49,120 --> 01:04:53,760
prove what happened when an auditor asks for proof. It is much easier to build these tools while
1360
01:04:53,760 --> 01:04:58,080
you're setting up the pipeline than it is to try and squeeze them in once the system is already
1361
01:04:58,080 --> 01:05:03,920
carrying live traffic. Making the call: a decision guide for Azure architects. Every architectural decision
1362
01:05:03,920 --> 01:05:08,000
eventually comes down to one question with one specific answer for your unique situation.
1363
01:05:08,000 --> 01:05:12,160
We have spent this episode building a framework of variables but this section is where we turn
1364
01:05:12,160 --> 01:05:16,880
those variables into a decision tree. If your vector corpus stays under 50 million embeddings
1365
01:05:16,880 --> 01:05:22,880
and low latency is your biggest priority, Azure AI Search with HNSW indexing is your best starting point.
1366
01:05:22,880 --> 01:05:28,000
This isn't because HNSW is a perfect solution, but at this specific scale the memory costs stay
1367
01:05:28,000 --> 01:05:32,400
manageable and the operational model is easy for teams to understand. You get a latency profile that
1368
01:05:32,400 --> 01:05:37,520
disk-backed systems simply cannot match and the ecosystem support is incredibly deep. You will find
1369
01:05:37,520 --> 01:05:42,880
endless documentation, community examples, and tuning guides for parameters like efSearch and m
1370
01:05:42,880 --> 01:05:47,360
across the entire Azure stack. You aren't making a mistake by starting here; you are simply making
1371
01:05:47,360 --> 01:05:52,160
the right choice for where your project sits today. When your corpus grows to between 50 and 100
1372
01:05:52,160 --> 01:05:57,040
million vectors, the right move isn't to migrate immediately; it is to start planning. The memory
1373
01:05:57,040 --> 01:06:01,200
wall doesn't just appear out of nowhere on a specific calendar date. It approaches slowly as you
1374
01:06:01,200 --> 01:06:06,320
index new content, onboard more tenants, and add new regions to your deployment. The teams that move
1375
01:06:06,320 --> 01:06:11,280
to DiskANN proactively do so before infrastructure costs become the main headline of an architecture
1376
01:06:11,280 --> 01:06:15,440
review. They make the switch on their own schedule with plenty of time to test for recall parity
1377
01:06:15,440 --> 01:06:20,400
and cut over without any outside pressure. Other teams wait until they are forced to migrate during an
1378
01:06:20,400 --> 01:06:24,960
uncomfortable budget conversation on a compressed timeline. The algorithm stays the same in both
1379
01:06:24,960 --> 01:06:29,360
scenarios, but the actual experience of the migration is completely different. If you are dealing with
1380
01:06:29,360 --> 01:06:34,960
over 100 million vectors or building a multi-tenant system for several organizations, Cosmos DB with
1381
01:06:34,960 --> 01:06:40,240
DiskANN is the only architecture to consider. This shouldn't be a future exploration item; it needs to
1382
01:06:40,240 --> 01:06:44,560
be your architecture right now. The cost savings at this scale are structural and the managed
1383
01:06:44,560 --> 01:06:48,720
benefits of automatic partitioning and elastic scaling will compound on top of your infrastructure
1384
01:06:48,720 --> 01:06:53,760
savings. There is no way to configure HNSW indexing to close the gap once you hit this size.
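To make the memory wall concrete, here is a back-of-the-envelope estimate in Python. Everything in it is an assumption for illustration: uncompressed float32 vectors, 1536 dimensions, roughly `2 * m` neighbor links per vector on the base graph layer with `m=4`, and no allowance for upper layers, quantization, or service overhead. Treat the result as an order-of-magnitude figure, not a sizing guarantee.

```python
def estimate_hnsw_ram_gib(num_vectors, dimensions, m=4,
                          bytes_per_component=4, bytes_per_link=4):
    """Rough RAM estimate (in GiB) for holding an HNSW index fully in memory."""
    # Raw vector storage: one float32 (4 bytes) per dimension by default.
    vector_bytes = num_vectors * dimensions * bytes_per_component
    # Assumed graph cost: about 2 * m neighbor IDs per vector on the base layer.
    graph_bytes = num_vectors * 2 * m * bytes_per_link
    return (vector_bytes + graph_bytes) / 2**30


# Hypothetical corpus: 50 million embeddings with 1536 dimensions each.
ram_50m = estimate_hnsw_ram_gib(50_000_000, 1536)
```

Under these assumptions the 50-million-vector corpus already needs close to 288 GiB, and the figure grows linearly with vector count, which is why no parameter tuning closes the gap at larger scale.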
1385
01:06:53,760 --> 01:06:58,160
The RAM requirements are what they are, and Azure prices its memory-optimized compute resources
1386
01:06:58,160 --> 01:07:03,520
accordingly. SQL Server DiskANN is a viable option if your data is static or only changes every few
1387
01:07:03,520 --> 01:07:07,760
months and you need vector search without adding new service dependencies. You have to respect the
1388
01:07:07,760 --> 01:07:12,240
immutability constraint here so you should plan for batch updates and scheduled rebuilds rather
1389
01:07:12,240 --> 01:07:17,360
than trying to stream data in constantly. If that operational model fits how your content lives,
1390
01:07:17,360 --> 01:07:21,840
then staying within your existing SQL Server infrastructure will reduce your complexity in ways
1391
01:07:21,840 --> 01:07:26,800
that actually matter. But if your pipeline expects to push updates at any time, SQL Server DiskANN
1392
01:07:26,800 --> 01:07:31,120
will likely cause recurring operational headaches no matter how well you set it up initially.
1393
01:07:31,120 --> 01:07:35,120
For data that changes every single minute, like new documents from a knowledge base or support ticket
1394
01:07:35,120 --> 01:07:40,160
resolutions, Cosmos DB DiskANN is the only managed Azure option that works without complex
1395
01:07:40,160 --> 01:07:44,880
workarounds. This dynamic update behavior isn't just a secondary feature; it is the core architectural
1396
01:07:44,880 --> 01:07:49,280
difference between Cosmos DB and the SQL Server version. They use the same underlying algorithm,
1397
01:07:49,280 --> 01:07:53,280
but they offer fundamentally different operational contracts for your team. If you want to build a
1398
01:07:53,280 --> 01:07:57,760
tiered architecture where different queries have different latency needs, you can pair HNSW
1399
01:07:57,760 --> 01:08:02,400
in a Redis cache with DiskANN in Cosmos DB. This pattern allows you to keep your working set in
1400
01:08:02,400 --> 01:08:07,360
memory while the full corpus sits on disk, letting you scale from millions to billions of vectors
1401
01:08:07,360 --> 01:08:11,680
without a total redesign. You should build your routing logic cleanly from the very beginning.
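One way to keep that routing logic clean is to isolate it behind a small class. The sketch below is a hypothetical Python example: `TieredRouter`, the two search callables, and the `0.8` threshold are illustrative names and values, and the stub functions stand in for real clients of an in-memory tier and a disk-backed tier.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A hit is (document ID, similarity score); higher scores mean closer matches.
Hit = Tuple[str, float]


@dataclass
class TieredRouter:
    hot_search: Callable[[str, int], List[Hit]]   # in-memory tier (working set)
    cold_search: Callable[[str, int], List[Hit]]  # disk-backed tier (full corpus)
    confidence_threshold: float = 0.8             # tunable cut-off, assumed value

    def search(self, query: str, top_k: int = 5):
        """Answer from the hot tier when its best hit is confident enough,
        otherwise fall back to the full corpus on the cold tier."""
        hot_hits = self.hot_search(query, top_k)
        if hot_hits and hot_hits[0][1] >= self.confidence_threshold:
            return "hot", hot_hits
        return "cold", self.cold_search(query, top_k)


# Stub backends for demonstration only.
def hot(query, top_k):
    score = 0.91 if query == "popular" else 0.42
    return [("cached-doc", score)]


def cold(query, top_k):
    return [("archive-doc", 0.77)]


router = TieredRouter(hot, cold, confidence_threshold=0.8)
tier, hits = router.search("popular")
```

Because the threshold is a single field on the router, it can be adjusted from configuration as traffic patterns become clearer, without touching either search backend.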
1402
01:08:11,680 --> 01:08:16,320
The confidence threshold that decides which tier handles a query is a parameter you will want to
1403
01:08:16,320 --> 01:08:21,200
tune as you learn your real world traffic patterns. There is one question that cuts through every
1404
01:08:21,200 --> 01:08:25,840
scenario and every qualification. What does your vector corpus look like two years from now and can
1405
01:08:25,840 --> 01:08:30,080
you actually afford the RAM to hold it all in memory at that scale? If you can answer yes with
1406
01:08:30,080 --> 01:08:35,520
total confidence, then HNSW is a reasonable long-term choice for your project. If the answer is yes
1407
01:08:35,520 --> 01:08:40,080
but you feel a bit uncomfortable about the price, you should start planning your transition today.
1408
01:08:40,080 --> 01:08:44,960
If the answer is no, or if your two-year roadmap is still a bit blurry, you should start with DiskANN.
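The thresholds from this decision guide can be written down as a small function so the reasoning is reviewable. This is only a sketch: the function name, the parameter names, and the order in which the rules are checked are assumptions, while the 50 million and 100 million cut-offs come from the guidance in this section.

```python
def recommend_vector_index(projected_vectors, multi_tenant=False,
                           data_is_static=False, on_sql_server=False):
    """Map the episode's rules of thumb to a starting recommendation."""
    # Static data on existing SQL Server infrastructure: stay put, rebuild in batches.
    if data_is_static and on_sql_server:
        return "SQL Server DiskANN with batch updates and scheduled rebuilds"
    # Very large or multi-tenant corpora go straight to the disk-backed option.
    if projected_vectors > 100_000_000 or multi_tenant:
        return "Cosmos DB with DiskANN"
    # The planning zone: keep HNSW for now, but schedule the migration work.
    if projected_vectors >= 50_000_000:
        return "HNSW in Azure AI Search now, plan a DiskANN migration"
    return "HNSW in Azure AI Search"


# Hypothetical two-year projection of 70 million embeddings.
choice = recommend_vector_index(70_000_000)
```

Passing the two-year projection rather than today's count keeps the function aligned with the question this section asks about where the corpus will be.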
1409
01:08:44,960 --> 01:08:49,680
Moving from DiskANN to HNSW later if your needs change is much less painful than trying to
1410
01:08:49,680 --> 01:08:54,560
escape HNSW once you've already hit the memory wall. What's coming? The vector search horizon in
1411
01:08:54,560 --> 01:09:00,880
Azure. By early 2026, the enterprise AI search market hit 12.4 billion dollars as organizations
1412
01:09:00,880 --> 01:09:05,920
moved from small experiments to full hybrid RAG architectures. This number isn't just a fun
1413
01:09:05,920 --> 01:09:10,640
statistic for a slide deck; it is a clear signal of where the big investment is actually going.
1414
01:09:10,640 --> 01:09:15,040
Every infrastructure decision you make today will be judged against a competitive landscape that
1415
01:09:15,040 --> 01:09:19,120
is moving at an incredible speed. The teams building solid retrieval foundations right now
1416
01:09:19,120 --> 01:09:23,280
aren't actually ahead of the curve; they are just reaching the new baseline that the rest of the
1417
01:09:23,280 --> 01:09:28,880
industry is moving toward. Microsoft 365 Copilot had 15 million paid seats by the second quarter of
1418
01:09:28,880 --> 01:09:34,320
fiscal year 2026, which is the most credible proof of concept for DiskANN at scale. Every single one of
1419
01:09:34,320 --> 01:09:38,560
those users relies on a semantic index that runs globally across hundreds of millions of documents
1420
01:09:38,560 --> 01:09:43,600
with high-quality results. That entire infrastructure is backed by DiskANN. This internal deployment at
1421
01:09:43,600 --> 01:09:47,760
Microsoft is your reference implementation. If you are wondering if DiskANN can handle your
1422
01:09:47,760 --> 01:09:51,920
specific workload, the answer has already been proven at a scale your organization will probably
1423
01:09:51,920 --> 01:09:56,400
never even reach. There are two major developments currently reshaping how we build and maintain the
1424
01:09:56,400 --> 01:10:00,800
retrieval layer. The first is self-optimizing retrieval, which uses reinforcement learning from
1425
01:10:00,800 --> 01:10:06,000
user feedback to automatically balance lexical and vector search. When a user clicks a result or
1426
01:10:06,000 --> 01:10:10,800
rephrases a query, that signal goes right back into the system configuration. The balance that
1427
01:10:10,800 --> 01:10:15,600
worked for your team last quarter might not be the best fit as your users change how they search.
1428
01:10:15,600 --> 01:10:19,600
Systems that tune themselves against real human behavior will always get better results than those
1429
01:10:19,600 --> 01:10:24,560
tuned manually against static datasets. This pattern is still in the early stages, but we expect to see
1430
01:10:24,560 --> 01:10:30,400
broad adoption by 2027. The second big shift is "agentic RAG," and it changes the retrieval problem
1431
01:10:30,400 --> 01:10:35,840
at a structural level. In a standard architecture, you send one query and get one set of results back.
1432
01:10:35,840 --> 01:10:40,560
In an agentic system, a complex query gets broken down into several subqueries that might pull from
1433
01:10:40,560 --> 01:10:45,200
different indexes at the same time. Your choice of index algorithm becomes just one small part of a
1434
01:10:45,200 --> 01:10:49,760
larger orchestration layer that handles routing and synthesis. The choice between HNSW and
1435
01:10:49,760 --> 01:10:54,080
DiskANN doesn't go away; it just becomes a nested decision within a much larger workflow.
1436
01:10:54,080 --> 01:10:58,800
Getting your foundation right matters even more when an AI agent is calling the index without waiting
1437
01:10:58,800 --> 01:11:03,040
for a human to check the work. We are also seeing steady improvements in vector compression and storage
1438
01:11:03,040 --> 01:11:07,840
optimization that specifically benefit the DiskANN model. You can now fit more vectors into each
1439
01:11:07,840 --> 01:11:12,400
partition at a lower cost because better quantization reduces the memory footprint of the navigation
1440
01:11:12,400 --> 01:11:17,200
graph. These technical advances are lowering the price floor for disk-based retrieval without
1441
01:11:17,200 --> 01:11:21,840
requiring you to change your fundamental architecture. The gap between the cost of RAM and the cost
1442
01:11:21,840 --> 01:11:26,400
of disk is actually widening rather than closing. The trend toward integrated vectorizers makes
1443
01:11:26,400 --> 01:11:31,440
implementation easier but adds a new layer of complexity to your billing. Azure AI Search can now
1444
01:11:31,440 --> 01:11:36,080
call an embedding model automatically when a query comes in, which removes the need for your
1445
01:11:36,080 --> 01:11:40,640
application to manage that process. The query arrives as plain text and the service handles the
1446
01:11:40,640 --> 01:11:45,520
rest which is much simpler for your developers. However, every one of those automatic calls adds an
1447
01:11:45,520 --> 01:11:50,800
AI metering charge to your per-query cost. While that cost is tiny at low volumes, it can turn into a
1448
01:11:50,800 --> 01:11:55,680
massive budget line item once you reach enterprise level traffic. The direction of the entire ecosystem
1449
01:11:55,680 --> 01:12:00,640
is now very clear. Hybrid retrieval is the new baseline, managed re-ranking is a standard requirement,
1450
01:12:00,640 --> 01:12:05,440
and continuous evaluation is an operational necessity. The organizations building this foundation
1451
01:12:05,440 --> 01:12:10,320
today aren't doing advanced work; they are simply doing the work that everyone will expect from
1452
01:12:10,320 --> 01:12:16,960
them in 12 months. HNSW is fast and reliable, and it remains the right choice for workloads where
1453
01:12:16,960 --> 01:12:21,920
keeping the index in RAM actually makes financial sense, but DiskANN changes the entire math of
1454
01:12:21,920 --> 01:12:28,240
vector search by moving that index to SSD, which cuts your memory needs by a factor of about 60 at scale.
1455
01:12:28,240 --> 01:12:32,240
This shift fundamentally alters the economics for any deployment that is eventually going to
1456
01:12:32,240 --> 01:12:36,640
outgrow what your memory can affordably hold. The decision here is financial and operational
1457
01:12:36,640 --> 01:12:41,360
long before it becomes a technical one. You need to audit your current vector corpus and project
1458
01:12:41,360 --> 01:12:45,280
where you'll be in two years, because if you are approaching 50 million vectors you need to start
1459
01:12:45,280 --> 01:12:49,360
the DiskANN conversation now. You want to solve this before you hit the memory wall, not after it
1460
01:12:49,360 --> 01:12:54,640
happens. Next episode we are diving into how to build the hybrid retrieval pipeline on top of these
1461
01:12:54,640 --> 01:13:00,320
indexes, including chunking, RRF weighting, and semantic ranker tuning. If this episode changed how you
1462
01:13:00,320 --> 01:13:04,800
think about vector infrastructure costs, leave a review. It helps more architects find this before
1463
01:13:04,800 --> 01:13:08,800
they hit the wall. Connect with Mirko Peters on LinkedIn to shape what we cover next.

Founder of m365.fm, m365.show and m365con.net
Mirko Peters is a Microsoft 365 expert, content creator, and founder of m365.fm, a platform dedicated to sharing practical insights on modern workplace technologies. His work focuses on Microsoft 365 governance, security, collaboration, and real-world implementation strategies.
Through his podcast and written content, Mirko provides hands-on guidance for IT professionals, architects, and business leaders navigating the complexities of Microsoft 365. He is known for translating complex topics into clear, actionable advice, often highlighting common mistakes and overlooked risks in real-world environments.
With a strong emphasis on community contribution and knowledge sharing, Mirko is actively building a platform that connects experts, shares experiences, and helps organizations get the most out of their Microsoft 365 investments.









