Index data from Azure Blob Storage

Important

These features and functionality support connections to other Microsoft services and third-party services. Use of these services is subject to their respective terms and might result in data processing or storage outside of the Azure compliance boundary, as well as data flowing into the Azure compliance boundary.

It's your responsibility to manage whether your data will flow outside of your organization's compliance and geographic boundaries and any related implications, and that appropriate permissions, boundaries, and approvals are provisioned.

You're responsible for carefully reviewing and testing applications you build in the context of your specific use cases and making all appropriate decisions and customizations. This includes implementing your own responsible AI mitigations, such as metaprompts, content filters, or other safety systems, and ensuring your applications meet appropriate quality, reliability, security, and trustworthiness standards. For more information, see the Azure AI Search Transparency Note.

In this article, you learn how to configure an indexer that imports content from Azure Blob Storage and makes it searchable in Azure AI Search. The indexer receives blobs in a single container as input. The output is a search index that stores searchable content and metadata in individual fields.

This article uses the Search Service REST APIs to demonstrate how to configure and run the indexer. However, you can also use:

An Azure SDK package (any version)
Import data wizard in the Azure portal

Note

Azure AI Search can ingest role-based access control (RBAC) scope during indexing and transfer those permissions to indexed content in a search index. For more information, see Use a blob indexer or knowledge source to ingest RBAC scopes metadata.

Prerequisites

Azure Blob Storage, Standard performance (general-purpose v2).
Access tiers include hot, cool, cold, and archive. Indexers can retrieve blobs on hot, cool, and cold access tiers.
Blobs providing text content and metadata. If blobs contain binary content or unstructured text, consider adding AI enrichment for image and natural language processing. Blob content can't exceed the indexer limits for your pricing tier.
A supported network configuration and data access. At a minimum, you need read permissions in Azure Storage. A storage connection string that includes an access key gives you read access to storage content. If instead you're using Microsoft Entra logins and roles, make sure the search service's managed identity has Storage Blob Data Reader permissions.

By default, both search and storage accept requests from public IP addresses. If network security isn't an immediate concern, you can index blob data using just the connection string and read permissions. When you're ready to add network protections, see Indexer access to content protected by Azure network security features for guidance about data access.
Use a REST client to formulate REST calls similar to the ones shown in this article.

Supported tasks

You can use this indexer for the following tasks:

Data indexing and incremental indexing: The indexer can index files and associated metadata from blob containers and folders. It detects new and updated files and metadata through built-in change detection. You can configure data refresh on a schedule or on demand.
Deletion detection: The indexer can detect deletions through native soft delete or through custom metadata.
Applied AI through skillsets: The indexer fully supports skillsets. This support includes key features like integrated vectorization, which adds data chunking and embedding.
Parsing modes: The indexer supports JSON parsing modes if you want to parse JSON arrays or lines into individual search documents. It also supports Markdown parsing mode.
Compatibility with other features: The indexer works seamlessly with other indexer features, such as debug sessions, indexer cache for incremental enrichments, and knowledge store.

Supported document formats

The blob indexer can extract text from the following document formats:

CSV (see Indexing CSV blobs)
EML
EPUB
GZ
HTML
JSON (see Indexing JSON blobs)
KML (XML for geographic representations)
Markdown
Microsoft Office formats: DOCX/DOC/DOCM, XLSX/XLS/XLSM, PPTX/PPT/PPTM, MSG (Outlook emails), XML (both 2003 and 2006 WORD XML)
Open Document formats: ODT, ODS, ODP
PDF
Plain text files (see also Indexing plain text)
RTF
XML
ZIP

Determine which blobs to index

Before you set up indexing, review your source data to determine whether you need to make any changes. An indexer can index content from one container at a time. By default, the indexer processes all blobs in the container. You have several options for more selective processing:

Place blobs in a virtual folder. An indexer data source definition includes a query parameter that can take a virtual folder. If you specify a virtual folder, the indexer indexes only those blobs in the folder.
Include or exclude blobs by file type. The supported document formats list can help you determine which blobs to exclude. For example, you might want to exclude image or audio files that don't provide searchable text. You control this capability through configuration settings in the indexer.

Include or exclude arbitrary blobs. To skip a specific blob, add the following metadata properties and values to blobs in Azure Blob Storage. When an indexer encounters this property, it skips the blob or its content in the indexing run.

Property name	Property value	Explanation
`AzureSearch_Skip`	`true`	Instructs the blob indexer to completely skip the blob. The indexer doesn't attempt to extract metadata or content. This property is useful when a particular blob fails repeatedly and interrupts the indexing process.
`AzureSearch_SkipContent`	`true`	The indexer skips content and extracts just the metadata. This property is equivalent to the `"dataToExtract": "allMetadata"` setting described in configuration settings, but it's scoped to a particular blob.

If you don't set up inclusion or exclusion criteria, the indexer reports an ineligible blob as an error and moves on. If enough errors occur, processing might stop. You can specify error tolerance in the indexer configuration settings.

An indexer typically creates one search document per blob, where the text content and metadata are captured as searchable fields in an index. If blobs are whole files, you can potentially parse them into multiple search documents. For example, you can parse rows in a CSV file to create one search document per row.

A compound or embedded document (such as a ZIP archive, a Word document with embedded Outlook email containing attachments, or an .MSG file with attachments) is also indexed as a single document. For example, all images extracted from the attachments of an .MSG file are returned in the normalized_images field. If you have images, consider adding AI enrichment to get more search utility from that content.

The indexer extracts the textual content of a document into a string field named content. You can also extract standard and user-defined metadata.

Indexing blob metadata

You can index blob metadata alongside content. The indexer extracts metadata properties and stores them in index fields, which is useful for creating filters and queries.

Define fields in your search index for the metadata properties you want to capture. You don't need to define every available standard or custom metadata property. Just capture the properties you need for your application.

Currently, this indexer doesn't support indexing blob index tags.

Important

The indexer populates only metadata fields that are already defined in your search index. This requirement applies to both standard blob metadata and custom metadata. If a field isn't defined, the metadata value is extracted during indexing but silently discarded, so it doesn't appear in search results. This is the most common source of null metadata fields in search results. For more information, see Metadata fields are null in search results.

Standard blob metadata properties

For standard blob metadata, define fields in your index using the same underscored names. For step-by-step field definition examples, see Add search fields to an index.

For indexer configuration, including the dataToExtract setting that controls which metadata to extract, see Configure and run the blob indexer.

The indexer recognizes and can map these standard metadata properties if you define corresponding fields in your index:

metadata_storage_name (Edm.String) is the file name of the blob. For example, if you have a blob /my-container/my-folder/subfolder/resume.pdf, the value of this field is resume.pdf.
metadata_storage_path (Edm.String) is the full URI of the blob, including the storage account. For example, https://myaccount.blob.core.windows.net/my-container/my-folder/subfolder/resume.pdf. Use this property to include blob URLs in search results for navigation or source attribution.
metadata_storage_content_type (Edm.String) is the content type as specified by the code you used to upload the blob. For example, application/octet-stream.
metadata_storage_last_modified (Edm.DateTimeOffset) is the last modified timestamp for the blob. Azure AI Search uses this timestamp to identify changed blobs and avoid reindexing everything after the initial indexing.
metadata_storage_size (Edm.Int64) is the blob size in bytes.
metadata_storage_content_md5 (Edm.String) is the MD5 hash of the blob content, if available.
metadata_storage_sas_token (Edm.String) is a temporary SAS token that custom skills can use to get access to the blob. Don't store this token for later use, as it might expire.

Custom and content-specific metadata

For custom or user-defined blob metadata, define a field with the exact same name as the blob's metadata key. For example, if your blobs have a metadata key Sensitivity with value High, define a field named Sensitivity of type Edm.String in your index.

You can also represent metadata properties specific to the document format of the blobs you're indexing. For more information, see Content metadata properties.

Define the data source

The data source definition specifies the data to index, credentials, and policies for identifying changes in the data. A data source is defined as an independent resource so that it can be used by multiple indexers.

Create or update a data source to set its definition:

{
    "name" : "my-blob-datasource",
    "type" : "azureblob",
    "credentials" : { "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;" },
    "container" : { "name" : "my-container", "query" : "<optional-virtual-directory-name>" }
}

Set type to azureblob (required).
Set credentials to an Azure Storage connection string. The next section describes the supported formats.
Set container to the blob container, and use query to specify any subfolders.

You can also include soft deletion policies in a data source definition if you want the indexer to delete a search document when the source document is flagged for deletion.

Supported credentials and connection strings

Indexers can connect to a blob container by using the following connections.

Full access storage account connection string
`{ "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<your storage account>;AccountKey=<your account key>;" }`
You can get the connection string from the Storage account page in Azure portal by selecting Access keys in the left pane. Make sure to select a full connection string and not just a key.

Managed identity connection string
`{ "connectionString" : "ResourceId=/subscriptions/<your subscription ID>/resourceGroups/<your resource group name>/providers/Microsoft.Storage/storageAccounts/<your storage account name>/;" }`
This connection string doesn't require an account key, but you must have previously configured a search service to connect using a managed identity.

Storage account shared access signature (SAS) connection string
`{ "connectionString" : "BlobEndpoint=https://<your account>.blob.core.windows.net/;SharedAccessSignature=?sv=2016-05-31&sig=<the signature>&spr=https&se=<the validity end time>&srt=co&ss=b&sp=rl;" }`
The SAS should have the list and read permissions on containers and objects (blobs in this case).

Container shared access signature
`{ "connectionString" : "ContainerSharedAccessUri=https://<your storage account>.blob.core.windows.net/<container name>?sv=2016-05-31&sr=c&sig=<the signature>&se=<the validity end time>&sp=rl;" }`
The SAS should have the list and read permissions on the container. For more information, see Grant limited access to Azure Storage resources using shared access signatures (SAS).

Note

If you use SAS credentials, you need to update the data source credentials periodically with renewed signatures to prevent their expiration. If SAS credentials expire, the indexer fails with an error message similar to "Credentials provided in the connection string are invalid or have expired".

Add search fields to an index

In a search index, add fields to accept the content and metadata of your Azure blobs.

Create or update an index to define search fields that store blob content and metadata:

POST https://[service name].search.windows.net/indexes?api-version=2026-04-01
{
    "name" : "my-search-index",
    "fields": [
        { "name": "ID", "type": "Edm.String", "key": true, "searchable": false },
        { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false },
        { "name": "metadata_storage_name", "type": "Edm.String", "searchable": false, "filterable": true, "sortable": true  },
        { "name": "metadata_storage_size", "type": "Edm.Int64", "searchable": false, "filterable": true, "sortable": true  },
        { "name": "metadata_storage_content_type", "type": "Edm.String", "searchable": false, "filterable": true, "sortable": true },        
    ]
}

Create a document key field ("key": true). For blob content, the best candidates are metadata properties.
- metadata_storage_path (default) is the full path to the object or file. The key field (ID in this example) is populated with values from metadata_storage_path because it's the default.
- metadata_storage_name is usable only if names are unique. If you want this field as the key, move "key": true to this field definition.
- A custom metadata property that you add to blobs. This option requires that your blob upload process adds that metadata property to all blobs. Since the key is a required property, any blobs that are missing a value fail to be indexed. If you use a custom metadata property as a key, avoid making changes to that property. Indexers add duplicate documents for the same blob if the key property changes.
Metadata properties often include characters, such as / and -, which are invalid for document keys. However, the indexer automatically encodes the key metadata property, with no configuration or field mapping required.
Add a content field to store extracted text from each file through the blob's content property. You aren't required to use this name, but by using it, you can take advantage of implicit field mappings.
Add fields for standard metadata properties. The indexer can read custom metadata properties, standard metadata properties, and content-specific metadata properties.

Configure and run the blob indexer

After you create the index and data source, create the indexer. Indexer configuration specifies the inputs, parameters, and properties that control runtime behaviors. You can also specify which parts of a blob to index.

Create or update an indexer by giving it a name and referencing the data source and target index:

POST https://[service name].search.windows.net/indexers?api-version=2026-04-01
{
  "name" : "my-blob-indexer",
  "dataSourceName" : "my-blob-datasource",
  "targetIndexName" : "my-search-index",
  "parameters": {
      "batchSize": null,
      "maxFailedItems": null,
      "maxFailedItemsPerBatch": null,
      "configuration": {
          "indexedFileNameExtensions" : ".pdf,.docx",
          "excludedFileNameExtensions" : ".png,.jpeg",
          "dataToExtract": "contentAndMetadata",
          "parsingMode": "default"
      }
  },
  "schedule" : { },
  "fieldMappings" : [ ]
}

Set batchSize if the default (10 documents) underutilizes or overwhelms available resources. Default batch sizes are data source specific. Blob indexing sets batch size at 10 documents in recognition of the larger average document size.
Under configuration, control which blobs are indexed based on file type, or leave unspecified to retrieve all blobs.

For indexedFileNameExtensions, provide a comma-separated list of file extensions (with a leading dot). Do the same for excludedFileNameExtensions to indicate which extensions the indexer should skip. If the same extension is in both lists, the indexer excludes it from indexing.
Under configuration, set dataToExtract to control which parts of the blobs are indexed:
- contentAndMetadata specifies that the indexer indexes all metadata and textual content extracted from the blob. This is the default value.
- storageMetadata specifies that the indexer indexes only the standard blob properties and user-specified metadata.
- allMetadata specifies that the indexer extracts from the blob content and indexes standard blob properties and any metadata for found content types.
- For a complete list of available standard metadata properties and their field types, see Indexing blob metadata.
Under configuration, set parsingMode. The default parsing mode is one search document per blob. If blobs are plain text, you can get better performance by switching to plain text parsing. If you need more granular parsing that maps blobs to multiple search documents, specify a different mode. One-to-many parsing is supported for blobs consisting of:
- JSON documents
- CSV files
Specify field mappings if there are differences in field name or type, or if you need multiple versions of a source field in the search index.

In blob indexing, you can often omit field mappings because the indexer has built-in support for mapping the content and metadata properties to similarly named and typed fields in an index. For metadata properties, the indexer automatically replaces hyphens - with underscores in the search index.
See Create an indexer for more information about other properties. For the full list of parameter descriptions, see REST API.

An indexer runs automatically when it's created. You can prevent this action by setting disabled to true. To control indexer execution, run an indexer on demand or put it on a schedule.

Index data from multiple Azure Blob containers to a single index

Remember that an indexer can only index data from a single container. If you need to index data from multiple containers and consolidate it into a single AI Search index, configure multiple indexers that all point to the same index. Be aware of the maximum number of indexers available per SKU.

For example, you can use two indexers to pull data from two distinct data sources named my-blob-datasource1 and my-blob-datasource2. Each data source points to a separate Azure Blob container, but both direct to the same index named my-search-index.

First indexer definition example:

POST https://[service name].search.windows.net/indexers?api-version=2026-04-01
{
  "name" : "my-blob-indexer1",
  "dataSourceName" : "my-blob-datasource1",
  "targetIndexName" : "my-search-index",
  "parameters": {
      "batchSize": null,
      "maxFailedItems": null,
      "maxFailedItemsPerBatch": null,
      "configuration": {
          "indexedFileNameExtensions" : ".pdf,.docx",
          "excludedFileNameExtensions" : ".png,.jpeg",
          "dataToExtract": "contentAndMetadata",
          "parsingMode": "default"
      }
  },
  "schedule" : { },
  "fieldMappings" : [ ]
}

Second indexer definition that runs in parallel example:

POST https://[service name].search.windows.net/indexers?api-version=2026-04-01
{
  "name" : "my-blob-indexer2",
  "dataSourceName" : "my-blob-datasource2",
  "targetIndexName" : "my-search-index",
  "parameters": {
      "batchSize": null,
      "maxFailedItems": null,
      "maxFailedItemsPerBatch": null,
      "configuration": {
          "indexedFileNameExtensions" : ".pdf,.docx",
          "excludedFileNameExtensions" : ".png,.jpeg",
          "dataToExtract": "contentAndMetadata",
          "parsingMode": "default"
      }
  },
  "schedule" : { },
  "fieldMappings" : [ ]
}

Check indexer status

To monitor the indexer status and execution history, send a Get Indexer Status request:

GET https://myservice.search.windows.net/indexers/myindexer/status?api-version=2026-04-01
  Content-Type: application/json  
  api-key: [admin key]

The response includes status and the number of items processed. It should look similar to the following example:

    {
        "status":"running",
        "lastResult": {
            "status":"success",
            "errorMessage":null,
            "startTime":"2022-02-21T00:23:24.957Z",
            "endTime":"2022-02-21T00:36:47.752Z",
            "errors":[],
            "itemsProcessed":1599501,
            "itemsFailed":0,
            "initialTrackingState":null,
            "finalTrackingState":null
        },
        "executionHistory":
        [
            {
                "status":"success",
                "errorMessage":null,
                "startTime":"2022-02-21T00:23:24.957Z",
                "endTime":"2022-02-21T00:36:47.752Z",
                "errors":[],
                "itemsProcessed":1599501,
                "itemsFailed":0,
                "initialTrackingState":null,
                "finalTrackingState":null
            },
            ... earlier history items
        ]
    }

Execution history contains up to 50 of the most recently completed executions. The entries are sorted in reverse chronological order, so the latest execution comes first.

Troubleshooting

Use this section to diagnose missing metadata values and common blob indexing failures.

Metadata fields are null in search results

If you're seeing null or empty values for metadata fields in your search results, use this checklist:

Verify the field exists in your index schema: Check that you defined a field for each metadata property you want to capture. Run a GET request on your index to confirm the field is present.
Use the correct field name: For standard blob properties, use the underscored name, such as metadata_storage_path instead of metadata-storage-path. For custom metadata, the field name must exactly match the blob's metadata key.
Confirm the indexer has dataToExtract set correctly:
- contentAndMetadata (default) extracts both content and standard/custom metadata.
- storageMetadata extracts only standard blob properties and custom metadata.
- allMetadata extracts standard properties plus content-type-specific metadata.
Check your indexer configuration to ensure the setting matches your intent.
Re-run the indexer after schema updates: If you added a new field to your index, run the indexer again so documents are processed against the updated schema. Existing documents might need to be reprocessed before the new field is populated.
Check the indexer execution history: Go to your indexer in the Azure portal or use Get Indexer Status (REST API) to view execution details and any error messages.

Indexer failures and unsupported content types

By default, the blob indexer stops as soon as it encounters a blob with an unsupported content type, such as an audio file. You can use the excludedFileNameExtensions parameter to skip certain content types.

If you want indexing to continue when some documents fail, adjust the following parameters and then investigate individual documents later. For more information, see Indexer troubleshooting guidance and Indexer errors and warnings.

When errors occur, five indexer parameters control the indexer's response:

PUT /indexers/[indexer name]?api-version=2026-04-01
{
  "parameters" : { 
    "maxFailedItems" : 10, 
    "maxFailedItemsPerBatch" : 10,
    "configuration" : { 
        "failOnUnsupportedContentType" : false, 
        "failOnUnprocessableDocument" : false,
        "indexStorageMetadataOnlyForOversizedDocuments": false
      }
    }
}

Parameter	Valid values	Description
`maxFailedItems`	-1, null, or 0, positive integer	Continue indexing if errors happen at any point of processing, either while parsing blobs or while adding documents to an index. Set this property to the number of acceptable failures. A value of `-1` allows processing no matter how many errors occur. Otherwise, the value is a positive integer.
`maxFailedItemsPerBatch`	-1, null, or 0, positive integer	Same as above, but used for batch indexing.
`failOnUnsupportedContentType`	true or false	If the indexer can't determine the content type, specify whether to continue or fail the job.
`failOnUnprocessableDocument`	true or false	If the indexer can't process a document of an otherwise supported content type, specify whether to continue or fail the job.
`indexStorageMetadataOnlyForOversizedDocuments`	true or false	Oversized blobs are treated as errors by default. If you set this parameter to true, the indexer tries to index its metadata even if the content can't be indexed. For limits on blob size, see Service limits.

Feedback

Was this page helpful?

Last updated on 2026-06-11