What is needed to configure an integration?

To use BigQuery integrations, you need a Google account with permissions on at least one BigQuery project (specifically, the project containing the dataset this integration will pull data from). Once the account is chosen and authenticated through Datagran's OAuth client, a database must be selected as the source for the integration.

IMPORTANT: Make sure you grant all requested permissions; otherwise, the integration will fail.
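For reference, here is a minimal sketch (not Datagran's actual implementation) of how you could verify that the authenticated account has enough read access to the target project and dataset, using the google-cloud-bigquery client library. The project and dataset names are placeholders.

```python
# Minimal permission-check sketch: the project and dataset names below are
# placeholders, and this only illustrates the read access the integration needs.
from google.cloud import bigquery

PROJECT_ID = "my-gcp-project"   # hypothetical project containing the dataset
DATASET_ID = "sales_dataset"    # hypothetical dataset the integration reads

client = bigquery.Client(project=PROJECT_ID)

# Listing the dataset's tables raises a 403 error if the account lacks the
# BigQuery read permissions the integration relies on.
tables = list(client.list_tables(f"{PROJECT_ID}.{DATASET_ID}"))
print("Readable tables:", [t.table_id for t in tables])
```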

Is there any limitation on databases and data types?

Since BigQuery is a cloud service provided by Google, there is no limitation on databases as long as the service and permissions are properly configured for the Google user authenticating and using the service. However, only a subset of BigQuery data types is supported (notably, non-scalar types are not supported):

Supported data types

Note: Columns of other, more complex types are not supported and will be silently ignored during data synchronization.
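As an illustration of how such complex columns can be identified, here is a minimal sketch that inspects a table's schema with the google-cloud-bigquery client. The table name is a placeholder, and this is not Datagran's actual synchronization code.

```python
# Illustrative sketch: list which columns are simple scalar types and which
# are complex (RECORD/STRUCT or REPEATED/ARRAY) and would therefore be
# skipped during synchronization. The table name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-gcp-project.sales_dataset.sold_product_units")

for field in table.schema:
    non_scalar = field.field_type in ("RECORD", "STRUCT") or field.mode == "REPEATED"
    status = "ignored (non-scalar)" if non_scalar else "synchronized"
    print(f"{field.name}: {field.field_type} ({field.mode}) -> {status}")
```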

How does the stream replication work?

Only one mode is supported: "Key Based" / "Incremental".

"Key Based" (also called "Incremental") requires a particular field to be selected as "key". That field is meant to be treated as a sequential or "sortable" field which increases by time (i.e. records added in the future will have never descending values in this field), and will be tracked across several extraction executions from the (same) integration's stream.

On the first run, the whole table is downloaded into our platform, and the maximum value of the tracked key (i.e. the value from the last record retrieved) is stored for future runs. This run can be time-consuming and may place a heavy load on the source servers, so it should be scheduled during what the customer considers low-activity hours (e.g. nightly).

When a later extraction runs for this incremental stream, only a subset of the data is retrieved: the records whose values in the chosen key field are greater than or equal to the last stored value. A minimal sketch of this flow is shown below.
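The sketch assumes a placeholder table and key field and a simple file-based state store standing in for the platform's internal bookkeeping; it is not the actual extraction code.

```python
# Minimal sketch of the key-based / incremental flow. The table, key field,
# and the file-based state store below are placeholders, not the platform's
# actual implementation.
import json
import pathlib

from google.cloud import bigquery

TABLE = "my-gcp-project.sales_dataset.sold_product_units"  # hypothetical table
KEY_FIELD = "sold_at"                                       # chosen key field
STATE_FILE = pathlib.Path("stream_state.json")              # stand-in state store


def load_last_key():
    # Hypothetical helper: return the last stored key value, or None on the first run.
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_key"]
    return None


def save_last_key(value):
    # Hypothetical helper: persist the key value for the next run.
    STATE_FILE.write_text(json.dumps({"last_key": str(value)}))


client = bigquery.Client()
last_key = load_last_key()

if last_key is None:
    # First run: download the whole table.
    job = client.query(f"SELECT * FROM `{TABLE}` ORDER BY {KEY_FIELD}")
else:
    # Later runs: only records whose key is >= the last stored value.
    job = client.query(
        f"SELECT * FROM `{TABLE}` WHERE {KEY_FIELD} >= @last_key ORDER BY {KEY_FIELD}",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("last_key", "TIMESTAMP", last_key)
            ]
        ),
    )

rows = list(job.result())
if rows:
    # Keep the maximum key value (the last row, since results are ordered)
    # so the next run can resume from it.
    save_last_key(rows[-1][KEY_FIELD])
```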

As an example, let's say the source table stores "sold product units" and contains these fields:

Typically, this table gains new records every day, even every minute. The appropriate stream mode in this case is key-based, choosing the sold_at field as the key. The first extraction downloads all the data that exists in the source table so far and keeps the last, and therefore greatest, value of the sold_at field. As an example, suppose the very last value retrieved for this key field is "Monday, September 20th, 2021 at 15:00:01".

The next extraction will only retrieve records from the same "sold product units" table whose sold_at value is greater than or equal to "Monday, September 20th, 2021 at 15:00:01". This implies the following: