What is it about?
This is an integration that downloads Parquet files from an Azure Blob Storage account and loads them into our platform. The download is performed without any data transformation, so the source fields must already be in the proper format to be imported into our platform.
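Conceptually, the integration behaves like the sketch below: it downloads the raw Parquet bytes and reads them as-is, with no transformation in between. This is only an illustration assuming the azure-storage-blob and pandas packages; the connection string, container name, and blob name are placeholders, not real values.

```python
# Minimal sketch of what the integration does conceptually: download a
# Parquet blob and load it without modifying the data.
import io

import pandas as pd
from azure.storage.blob import BlobServiceClient

ACCOUNT_CONNECTION_STRING = "<connection string from the Azure portal>"  # placeholder

service = BlobServiceClient.from_connection_string(ACCOUNT_CONNECTION_STRING)
container = service.get_container_client("my-container")  # placeholder container

# Download the raw Parquet bytes and read them as-is.
raw = container.download_blob("exports/data_0001.parquet").readall()  # placeholder blob
df = pd.read_parquet(io.BytesIO(raw))  # requires pyarrow or fastparquet

# The column types must already match what the platform expects.
print(df.dtypes)
```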
What is needed to configure an integration?
The prerequisites to configure an integration are:
- A desired container, which must already exist.
- An account with permission to read from that container. The customer authorizes our platform by providing this account's credentials. The account must have read permission but, as a security recommendation, it should only have permissions on the desired container (and/or other containers meant for similar purposes). A connection string carrying those credentials must be provided.
- A consistent layout for the data files: typically a directory under which all the files are stored, plus a consistent search pattern, with * wildcards, to enumerate them (see the sketch after this list).
- An internal process that only generates new data files when new data arrives, and never updates existing data files.
- A new custom table name (entirely the user's choice, but following the usual SQL table-naming guidelines) to store the dumped data.
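As an illustration of the expected layout, the following sketch enumerates the files that a directory-plus-wildcard pattern would match. It assumes the azure-storage-blob package; the connection string, container name, prefix, and pattern are placeholders, not real values.

```python
# Hypothetical sketch of enumerating a "directory plus * wildcard" layout.
import fnmatch

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection string>")  # placeholder
container = service.get_container_client("my-container")                   # placeholder

prefix = "daily-exports/"             # directory that holds every data file (example)
pattern = prefix + "sales_*.parquet"  # consistent search pattern with * wildcards (example)

matching = [
    blob.name
    for blob in container.list_blobs(name_starts_with=prefix)
    if fnmatch.fnmatch(blob.name, pattern)
]
print(matching)
```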
How do I find the container name?
In the “Storage accounts” section, choose one account and go to the “Containers” section. You can use any of the container names listed there.
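If it is more convenient to check from code than from the portal, the same list of container names can be obtained with the azure-storage-blob package. This is an optional sketch using a placeholder connection string.

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection string>")  # placeholder
for container in service.list_containers():
    print(container.name)  # any of these names can be used as the container name
```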
How do I find the connection string?
In the “Storage accounts” section, choose one account and go to the “Access keys” section. Click “Show” next to the connection string, then copy it.
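Before handing the connection string over, it can be useful to verify that it actually grants access to the desired container. The sketch below is a minimal smoke test with placeholder values, assuming the azure-storage-blob package.

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<pasted connection string>")  # placeholder
container = service.get_container_client("my-container")                          # placeholder

# exists() only needs read-level access, so it is a harmless check.
print("container reachable:", container.exists())
```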
Is there any limitation on databases and data types?
The source data comes as a set of Parquet files (which come from one or more dataframes). There are no particular restrictions on creating such a dataframe, except for the following (a validation sketch follows this list):
- It will typically be created with tools like pandas or Dask and written out in the standard Apache Parquet format. This is a pretty standard process.
- The dataframe must not be empty. It must have at least one defined column.
- The column names in the dataframe must not repeat with case-only variants. For example, it is wrong to have both x1 and X1 as columns in the dataframe: when imported into BigQuery, column names become case-insensitive and the schema definition would end up with a duplicate X1 column.
- The names of the columns must not start with a number; they must start with a letter or an underscore and, in general, follow standard SQL column naming rules.
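To make these restrictions concrete, here is a hypothetical pre-flight check that could be run before writing a dataframe to Parquet. The column names and file name are illustrative assumptions, not values required by the platform.

```python
# Hypothetical pre-flight check for a dataframe before writing it to Parquet.
import re

import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],                              # example columns
    "amount_eur": [10.5, 22.0, 7.25],
    "_loaded_at": pd.to_datetime(["2024-01-01"] * 3),
})

# The dataframe must have at least one defined column.
assert len(df.columns) > 0, "dataframe must have at least one column"

# No case-only duplicates (e.g. both x1 and X1), since columns become
# case-insensitive after import.
lowered = [c.lower() for c in df.columns]
assert len(lowered) == len(set(lowered)), "case-insensitive duplicate column names"

# Names must start with a letter or underscore and fit a SQL-like format.
name_re = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
assert all(name_re.match(c) for c in df.columns), "invalid column name"

df.to_parquet("sales_2024_01_01.parquet")  # requires pyarrow or fastparquet
```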