Durable Functions for Indexing in RAG: A Practical Python Approach
Have you ever tried building an indexing pipeline for a Retrieval-Augmented Generation (RAG) app and struggled to choose between “push” and “pull”? Both have their advantages, but each comes with its own challenges. In this article, we’ll introduce a hybrid approach that combines the best of both while keeping overhead low—using Azure Durable Functions (in Python) to get the right balance. We’ll have a look at the strengths and weaknesses of existing methods before diving into how Durable Functions can address common issues like scaling, state management, and retries—without adding unnecessary complexity.
Setup, Deployment, and Prerequisites
Before you dive in, note that all setup details are provided in our sample repository. We use azd and Bicep to set up the entire infrastructure—including AI Search, Azure Document Intelligence, OpenAI embeddings, and more. A dev container is provided to simplify the setup steps significantly, so you don’t need to worry about specific package versions or manual configurations.
Why Indexing Matters for RAG
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding responses in external enterprise data. The accuracy of the generated output depends entirely on the quality of the retrieved data—garbage in, garbage out. While prompt engineering can fine-tune model behavior, it cannot compensate for irrelevant, outdated, or poorly structured data in the index.
That’s why a robust indexing pipeline is critical. Instead of relying solely on prompt tuning, focus on ensuring the right data is retrieved at the right time. A well-structured RAG pipeline ensures that:
- Enterprise documents (like PDFs or reports) are ingested properly.
- Content is extracted cleanly and structured for retrieval.
- Chunks are meaningful and contextually relevant.
- Embeddings are high-quality, ensuring accurate retrieval.
- The index is continuously refined to reflect the latest ground truth.
This is where Azure Durable Functions come into play. By automating document ingestion, processing, and indexing, you ensure the most relevant and up-to-date information is always available for retrieval. This leads to more reliable, fact-based responses from the LLM, reducing the risk of hallucinations or outdated results.
By investing in indexing, rather than overfitting prompts to imperfect data, you create a scalable, adaptable RAG system that remains robust as your enterprise data evolves.
Push vs. Pull: The Usual Approaches
Push Method
- How it works: You write code or scripts that directly send your documents to AI Search.
- Pros:
- Full control over the process.
- Ability to handle custom processing as needed.
- Cons:
- You must handle code retries and error handling yourself.
- Scaling can be tricky if you have a large volume of documents.
- It can be hard to track the status of each document if something fails.
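To make the first con concrete: with push, even basic resiliency is plumbing you write yourself. Below is a minimal sketch of the kind of retry wrapper a push pipeline ends up carrying; `upload_batch` is hypothetical here — in a real pipeline it would wrap something like `SearchClient.upload_documents` from the `azure-search-documents` library.

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff — the kind of
    plumbing the push method leaves entirely to you."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            time.sleep(base_delay * 2 ** attempt)


# Hypothetical usage — upload_batch would wrap SearchClient.upload_documents
# and raise on transient HTTP errors:
# result = with_retries(lambda: upload_batch(documents))
```

And this still doesn't give you per-document status tracking — that, too, would be yours to build.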
Pull Method
- How it works: AI Search uses a built-in indexer that pulls documents from a data source (like Blob Storage) on a schedule or trigger.
- Pros:
- Automatic retries and production-ready features.
- Less code to write.
- Cons:
- Limited configuration options, which can be frustrating.
- Debugging is harder because logs can be minimal.
- Extending beyond the built-in features is challenging.
- Service limits that might constrain what you can do.
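For contrast, the pull method is mostly configuration rather than code. A hedged sketch of what an indexer definition looks like in the AI Search REST shape (the names here are illustrative, not from the sample repository):

```json
{
  "name": "docs-indexer",
  "dataSourceName": "docs-blob-datasource",
  "targetIndexName": "docs-index",
  "schedule": { "interval": "PT1H" },
  "parameters": {
    "configuration": { "parsingMode": "default" }
  }
}
```

Convenient — until you need a processing step that isn't one of the built-in options.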
My team and I tried both approaches. Push sometimes feels too bare-bones, while pull can be overly rigid. So, we started wondering if there’s a way to combine the flexibility of push with the production-ready capabilities of pull.
Enter Azure Durable Functions (in Python)
Durable Functions enable you to write stateful workflows in code. In Python, you can use the azure-functions and azure-functions-durable libraries to do things like:
- Track State: Durable Functions remember where they left off for each document, so you know exactly what’s happening.
- Scale: They can fan out and process multiple documents in parallel, then fan back in when done.
- Retry: If a step fails, you can retry it without losing your entire workflow.
This approach brings together the flexibility of push (it’s your own Python code) with many of the production-ready features you usually only get with the pull method. You can find an example of indexing with the push method using regular Azure Functions in this previous article.
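The "track state" point is easiest to grasp through replay: Durable Functions record each completed step, and on restart they re-run your orchestrator, answering already-completed steps from that history. The toy driver below (stdlib only, not the real Durable runtime) models that idea for a generator-based orchestrator — just enough to build intuition before the real code that follows.

```python
def replay(orchestrator, history):
    """Drive a generator-based orchestrator, answering each yielded step
    from recorded history — a toy model of Durable Functions replay."""
    gen = orchestrator()
    results = list(history)
    try:
        step = next(gen)  # first yielded activity name
        while True:
            if results:
                outcome = results.pop(0)  # already completed: replay it
            else:
                # Not yet run: the real runtime would schedule this activity
                # and checkpoint the orchestration here.
                return ("pending", step)
            step = gen.send(outcome)
    except StopIteration as done:
        return ("completed", done.value)


def pipeline():
    """Tiny stand-in for an indexing orchestrator."""
    text = yield "document_cracking"
    chunks = yield "chunking"
    return f"{text}:{chunks}"
```

With an empty history, `replay(pipeline, [])` reports the first activity as pending; once the history contains results for both steps, the run completes deterministically — which is also why orchestrator code must be replay-safe.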
A Sample Python Workflow
Note: The following code snippets are illustrative. They’re meant to help you grasp the overall workflow, not serve as fully executable code.
1. Trigger the Orchestrator: An HTTP call or a blob event can start the indexing. For example, uploading a new file to Blob Storage can trigger the workflow automatically.
2. List Documents in Blob Storage: The orchestrator function lists all files in the relevant container. Important: since Durable Functions replay your orchestrator code to maintain state consistency, make sure this step is replay-safe — the listing must return the same "next" files consistently when the orchestrator is restarted.
3. Ensure the Index Exists: Check whether your index is present in AI Search. If not, create it.
4. Fan Out to Index Each Document: A "sub-orchestrator" is launched for each document. It handles document cracking, chunking, embedding, and uploading to AI Search. Running sub-orchestrators in parallel lets you process many documents quickly.
5. Handle Failures Gracefully: If one document fails, it won't bring down the entire pipeline. You'll see exactly which file failed and why, so you can retry it when ready.
Python Code Snippet Highlights
Orchestrator (Main function, reduced to show significant pieces):
```python
from azure.durable_functions import DurableOrchestrationContext
from application.app import app


@app.function_name(name="index")
@app.orchestration_trigger(context_name="context")
def index(context: DurableOrchestrationContext):
    while True:
        # List blobs from Blob Storage in batches (ensure replay-safe code)
        blob_list_result = yield context.call_activity("list_blobs_batch")
        if not blob_list_result["blob_names"]:
            break
        # Ensure that the index exists
        yield context.call_activity("ensure_index_exists")
        task_list = []
        for blob_name in blob_list_result["blob_names"]:
            task_list.append(
                context.call_sub_orchestrator(
                    "index_document",
                    {"blob_url": blob_name},
                    instance_id=context.new_uuid(),
                )
            )
        yield context.task_all(task_list)
```
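The `list_blobs_batch` activity isn't shown in the excerpt above. One way to keep its behavior consistent across orchestrator restarts is to make batch selection deterministic, as in this hypothetical helper: sort the listing and skip blobs that were already indexed. In a real activity, `all_names` would come from `ContainerClient.list_blobs` in the `azure-storage-blob` library, and `processed` from an index lookup or a marker store — both are assumptions here.

```python
def next_batch(all_names, processed, batch_size=10):
    """Deterministically pick the next batch of blobs to index.

    Sorting and filtering out already-processed names means repeated
    calls during one indexing run return consistent, non-overlapping
    batches, which keeps the orchestrator's loop well-behaved.
    """
    remaining = sorted(set(all_names) - set(processed))
    return {"blob_names": remaining[:batch_size]}
```

When `blob_names` comes back empty, the orchestrator's `while True` loop breaks and the run completes.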
Sub-Orchestrator:
```python
@app.function_name(name="index_document")
@app.orchestration_trigger(context_name="context")
def index_document(context: DurableOrchestrationContext):
    data = context.get_input()
    # Document cracking (extract text)
    document = yield context.call_activity("document_cracking", data["blob_url"])
    # Chunking
    chunks = yield context.call_activity("chunking", document)
    # Create embeddings
    chunks_with_embeddings = yield context.call_activity("embedding", chunks)
    # Upload chunks and embeddings to AI Search
    yield context.call_activity("add_documents", {
        "chunks": chunks_with_embeddings
    })
```
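The activities themselves are plain Python. As one illustrative example (not the repository's implementation), a `chunking` activity could start from a fixed-size, overlapping splitter like the one below; the sizes are arbitrary, and production chunkers usually split on document structure such as headings or sentences instead.

```python
def chunk_text(text, size=500, overlap=50):
    """Split text into fixed-size chunks with overlap, so a sentence cut
    at a chunk boundary still appears intact in the neighbouring chunk."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text), 1), step)]
```

Keeping activities pure like this also makes them trivial to unit-test outside the Durable runtime.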
Each activity (e.g., `document_cracking`, `chunking`) is a separate function that handles one piece of work. Durable Functions automatically track the input and output of every step, so you can see the status of each document.
Visualizing the Workflow
Seeing the State in Real Time
One of the biggest advantages of Durable Functions is built-in state tracking. When a run starts, it generates an instance ID, which you can use to check progress: which steps have completed, which failed, and what's still in progress. You can retrieve this information through the `/status/:id` endpoint or directly from the task hub's storage account, which serves as the persistence layer for the durable function. This makes debugging much easier; if a chunking step fails on a specific file, you'll know exactly which file it was and why it failed.
```json
{
  "name": "index_document",
  "instanceId": "283f3d47-6b75-5f18-b6d9-5e6dg8caab59",
  "createdTime": "2025-02-14T09:01:33.000000Z",
  "lastUpdatedTime": "2025-02-14T09:02:10.000000Z",
  "output": null,
  "input": "\"{\\\"blob_url\\\": \\\"example.blob.core.windows.net/source/Agentic%20Frameworks%20Research.pdf\\\", \\\"index_name\\\": \\\"other-index\\\"}\"",
  "runtimeStatus": "Completed",
  "customStatus": null,
  "history": null,
  "historyEvents": [
    {
      "EventType": "ExecutionStarted",
      "Input": "\"{\\\"blob_url\\\": \\\"example.blob.core.windows.net/source/Agentic%20Frameworks%20Research.pdf\\\", \\\"index_name\\\": \\\"other-index\\\"}\"",
      "Timestamp": "2025-02-14T09:01:33.6014815Z",
      "FunctionName": "index_document"
    },
    {
      "EventType": "TaskCompleted",
      "Timestamp": "2025-02-14T09:01:46.345134Z",
      "FunctionName": "document_cracking"
    },
    {
      "EventType": "TaskCompleted",
      "Timestamp": "2025-02-14T09:01:47.8006035Z",
      "FunctionName": "chunking"
    },
    {
      "EventType": "TaskCompleted",
      "Timestamp": "2025-02-14T09:01:51.468373Z",
      "FunctionName": "embedding"
    },
    {
      "EventType": "TaskCompleted",
      "Timestamp": "2025-02-14T09:02:10.2548588Z",
      "FunctionName": "add_documents"
    },
    {
      "EventType": "ExecutionCompleted",
      "OrchestrationStatus": "Completed",
      "Timestamp": "2025-02-14T09:02:10.4357025Z"
    }
  ]
}
```
This JSON shows a successful run. Each step can be evaluated and tracked per document without the overhead of an external tracking system, and when you extend or change the pipeline, the new steps are tracked automatically.
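Because the status payload is plain JSON, summarizing a run takes only a few lines. As a sketch, the helper below reduces `historyEvents` to a per-step report; the field names match the payload above, and `status` is assumed to be the parsed response from the `/status/:id` endpoint.

```python
def summarize_steps(status):
    """Map each activity in historyEvents to its outcome, giving a
    quick per-document progress report for a run."""
    return {
        e["FunctionName"]: e["EventType"]
        for e in status.get("historyEvents", [])
        if e.get("EventType") in ("TaskCompleted", "TaskFailed")
        and "FunctionName" in e
    }
```

For the successful run above, this yields `TaskCompleted` for each of the four activities; a failed chunking step would surface as `{"chunking": "TaskFailed"}` instead.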
Pros and Cons of Using Durable Functions
Pros
- Flexibility: Write Python code without strict configuration limitations.
- State Management: Automatically stores the input/output of each step, so you know what’s happening.
- Scalability: Fan out to handle multiple documents in parallel.
- Retries: Built-in retry mechanisms mean you don’t have to code extensive retry logic.
Cons
- Learning Curve: Durable Functions may be new to some teams, especially those used to plain Azure Functions.
- Service Limits: If your document-cracking or embedding service has rate limits, it can still be overwhelmed if you fan out too aggressively.
- Extra Overhead: You need a function app and some storage for orchestration history.
- Missing Integrations: More code is required to integrate other inputs (like SharePoint or databases) compared to the pull method.
Conclusion
If you’re frustrated by the push or pull methods for indexing in a RAG application, Azure Durable Functions (in Python) might be a breath of fresh air. You get clear visibility into each document’s progress, scalability is built-in, and Durable Functions handle the tricky parts of state management and retries. You still keep full control over how you chunk, embed, or otherwise process your documents, while benefiting from production-ready features.
I’ve shared a sample repository that demonstrates this approach. Feel free to clone it, explore, and adapt it to your own use case. It’s still early, so if you spot any issues or have ideas for improvement, please let me know. And if your scenario doesn’t quite fit Durable Functions, there are many other ways to tackle indexing—just reach out and we can chat.
Happy indexing!