Name	Name	Last commit message	Last commit date
parent directory ..
SAMPLE_DATA.md	SAMPLE_DATA.md
auth_init.ps1	auth_init.ps1
auth_init.py	auth_init.py
auth_init.sh	auth_init.sh
auth_update.ps1	auth_update.ps1
auth_update.py	auth_update.py
auth_update.sh	auth_update.sh
checkquota.sh	checkquota.sh
chunk_documents.py	chunk_documents.py
config.json	config.json
data_preparation.py	data_preparation.py
data_utils.py	data_utils.py
embed_documents.py	embed_documents.py
loadenv.ps1	loadenv.ps1
loadenv.sh	loadenv.sh
prepdocs.ps1	prepdocs.ps1
prepdocs.py	prepdocs.py
prepdocs.sh	prepdocs.sh
quota_check_params.sh	quota_check_params.sh
readme.md	readme.md
role_assignment.sh	role_assignment.sh

Data Preparation

Prepare Data Locally

Setup

Install the necessary packages listed in requirements.txt, e.g. pip install --user -r requirements-dev.txt

Configure

Create a .env file similar to the .env.example file. Fill in the values for the environment variables.
Create a config file like config.json. The format should be a list of JSON objects, with each object specifying a configuration of local data path and target search service and index.

[
    {
        "data_path": "<local path or blob URL>",
        "location": "<azure region, e.g. 'westus2'>", 
        "subscription_id": "<subscription id>",
        "resource_group": "<resource group name>",
        "search_service_name": "<search service name to use or create>",
        "index_name": "<index name to use or create>",
        "chunk_size": 1024, // set to null to disable chunking before ingestion
        "token_overlap": 128 // number of tokens to overlap between chunks
        "semantic_config_name": "default",
        "language": "en" // setting to set language of your documents. Change if your documents are not in English. Look in data_preparation.py for SUPPORTED_LANGUAGE_CODES,
        "vector_config_name": "default" // used if adding vectors to index
    }
]

Note: data_path can be a path to files located locally on your machine, or an Azure Blob URL, e.g. of the format "https://<storage account name>.blob.core.windows.net/<container name>/<path>/". If a blob URL is used, the data will first be downloaded from Blob Storage to a temporary directory on your machine before data preparation proceeds.

Create Indexes and Ingest Data

Disclaimer: Make sure there are no duplicate pages in your data. That could impact the quality of the responses you get in a negative way.

Run the data preparation script, passing in your config file. You can set njobs for parallel parsing of your files.

python data_preparation.py --config config.json --njobs=4

Batch creation of index

Refer to the script run_batch_create_index.py to create multiple indexes in batch using one script.

Optional: Use URL prefix

Each document can be associated with a URL that is stored with each document chunk in the Azure Cognitive Search index in the url field. If your documents were downloaded from the web, you can specify a URL prefix to use to construct the document URLs when ingesting your data. Your config file should have an additional url_prefix parameter like so:

[
    {
        "data_path": "<local path or blob URL>",
        "url_prefix": "https://<source website URL>.com/"
        "location": "<azure region, e.g. 'westus2'>", 
        "subscription_id": "<subscription id>",
        "resource_group": "<resource group name>",
        "search_service_name": "<search service name to use or create>",
        "index_name": "<index name to use or create>",
        "chunk_size": 1024, // set to null to disable chunking before ingestion
        "token_overlap": 128 // number of tokens to overlap between chunks
        "semantic_config_name": "default",
        "language": "en" // setting to set language of your documents. Change if your documents are not in English. Look in data_preparation.py for SUPPORTED_LANGUAGE_CODES,
        "vector_config_name": "default" // used if adding vectors to index
    }
]

For each document, the URL stored with chunks from that document will be url_prefix concatenated with the relative path of the document in data_path. For example, if my data_path is mydata containing the following structure:

└───mydata
    │   overview.html
    │
    └───examples
            example1.html
            example2.html

And url_prefix is "https://my-wiki.com/", the resulting URLs will be:

File	URL
overview.html	`"https://my-wiki.com/overview.html"`
example1.html	`"https://my-wiki.com/examples/example1.html"`
example2.html	`"https://my-wiki.com/examples/example2.html"`

These URLs can then be used in the citation display in the web app. See the README for more detail.

If you have documents from multiple source websites, you can specify multiple paths and prefixes following the example in config_multiple_url.json.

[
    {
        "data_paths": [
            {
                "path": "data/source1",
                "url_prefix": "https://<URL for source 1>.com/"
            },
            {
                "path": "data/source2",
                "url_prefix": "https://<URL for source 2>.com/"
            }
        ],
        "subscription_id": "<subscription id>",
        "resource_group": "<resource group name>",
        "search_service_name": "<search service name to use or create>",
        "index_name": "<index name to use or create>",
        "chunk_size": 1024,
        "token_overlap": 128,
        "semantic_config_name": "default",
        "language": "<Language to support for example use 'en' for English. Checked supported languages here under lucene - https://learn.microsoft.com/en-us/azure/search/index-add-language-analyzers"
    }
]

The ingestion script will loop through each path in data_paths and construct the document URLs following the same pattern as described above, using the specific URL prefix for each data path.

You can modify the URL construction logic in process_file() in data_utils.py:

url_path = None
rel_file_path = os.path.relpath(file_path, directory_path)
if url_prefix:
    url_path = url_prefix + rel_file_path
    url_path = convert_escaped_to_posix(url_path)

Optional: Add vector embeddings

Azure Cognitive Search supports vector search in public preview. See the docs for more information.

To add vectors to your index, you will first need an Azure OpenAI resource with an Ada embedding model deployment. The text-embedding-ada-002 model is supported.

Get the endpoint for embedding model deployment. The endpoint will generally be of the format https://<azure openai resource name>.openai.azure.com/openai/deployments/<ada deployment name>/embeddings?api-version=2023-06-01-preview.
Run the data preparation script, passing in your config file and the embedding endpoint and key as extra arguments:
```
`python data_preparation.py --config config.json --embedding-model-endpoint "<embedding endpoint>"`
```

Optional: Crack PDFs to Text

If your data is in PDF format, you'll first need to convert from PDF to .txt format. You can use your own script for this, or use the provided conversion code here.

Setup for PDF Cracking

Create a Form Recognizer resource in your subscription
Make sure you have the Form Recognizer SDK: pip install azure-ai-formrecognizer
Run the following command to get an access key for your Form Recognizer resource: az cognitiveservices account keys list --name "<form-rec-resource-name>" --resource-group "<resource-group-name>"

Copy one of the keys returned by this command.

Create Indexes and Ingest Data from PDF with Form Recognizer

Pass in your Form Recognizer resource name and key when running the data preparation script:

python data_preparation.py --config config.json --njobs=4 --form-rec-resource <form-rec-resource-name> --form-rec-key <form-rec-key>

This will use the Form Recognizer Read model by default.

If your documents have a lot of tables and relevant layout information, you can use the Form Recognizer Layout model, which is more costly and slower to run but will preserve table information with better quality. The Layout model will also help preserve some of the formatting information in your document such as titles and sub-headings, which will make the citations more readable. To use the Layout model instead of the default Read model, pass in the argument --form-rec-use-layout.

python data_preparation.py --config config.json --njobs=4 --form-rec-resource <form-rec-resource-name> --form-rec-key <form-rec-key> --form-rec-use-layout

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scripts

scripts

readme.md

Data Preparation

Prepare Data Locally

Setup

Configure

Create Indexes and Ingest Data

Batch creation of index

Optional: Use URL prefix

Optional: Add vector embeddings

Optional: Crack PDFs to Text

Setup for PDF Cracking

Create Indexes and Ingest Data from PDF with Form Recognizer

Files

scripts

Directory actions

More options

Directory actions

More options

Latest commit

History

scripts

Folders and files

parent directory

readme.md

Data Preparation

Prepare Data Locally

Setup

Configure

Create Indexes and Ingest Data

Batch creation of index

Optional: Use URL prefix

Optional: Add vector embeddings

Optional: Crack PDFs to Text

Setup for PDF Cracking

Create Indexes and Ingest Data from PDF with Form Recognizer