Datastream with BigQuery

A guide to loading Datastream data from your Google Cloud Storage bucket into BigQuery.

Prerequisites

To get started on this tutorial, you’ll need:

  • A working Datastream feed delivering to your Google Cloud Storage bucket. You’ll want the name of this bucket and access credentials handy.

  • A BigQuery dataset and credentials for a user with sufficient permissions.

Step 1: Table creation

Start by creating a table on your BigQuery instance. The easiest way to create a table is using the GCP web interface.

In the Google Cloud console, from BigQuery select “Resources” and the dataset you wish to add a table to, then choose “Create a table.” From there, enter the appropriate name, source, and schema settings for your new table.

Important notes:

  • We recommend using schema auto-detection.

  • We recommend partitioning based on either last_ping_timestamp or import time.
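If you prefer to script table creation rather than use the console, the same options map onto the BigQuery tables.insert REST API. Below is a minimal sketch in Python; the project, dataset, and table names are placeholders, not values from this guide.

```python
import json

def make_table_body(project, dataset, table, partition_field):
    """Build a tables.insert request body for a day-partitioned table.

    Partitioning by day on a timestamp column (e.g. last_ping_timestamp,
    as recommended above) keeps queries over recent data cheap.
    """
    return {
        "tableReference": {
            "projectId": project,
            "datasetId": dataset,
            "tableId": table,
        },
        # Day partitioning on the chosen timestamp column.
        "timePartitioning": {"type": "DAY", "field": partition_field},
    }

# Placeholder identifiers for illustration only.
body = make_table_body("my-project", "my_dataset", "datastream_data",
                       "last_ping_timestamp")
print(json.dumps(body, indent=2))
```

POSTing this body to the tables.insert endpoint (or passing the equivalent options to your client library of choice) creates the partitioned table.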

Step 2: Loading

Once you’ve loaded an initial folder into BigQuery, you’ll want to programmatically load future data as it is delivered to your bucket.

Step 2a: Loading data in batch

BigQuery offers APIs for loading data for a number of languages, detailed here.

  • Each file comes with a header row, so make sure to load data with skip_leading_rows set to 1.

  • Files are pipe-delimited, so set fieldDelimiter to ‘|’.
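Whichever client library you use, the two settings above end up in the same load job configuration. The sketch below builds the jobs.insert request body directly; the bucket path and table names are placeholders.

```python
def make_load_config(source_uri, project, dataset, table):
    """Build a jobs.insert configuration for loading one delivered file."""
    return {
        "configuration": {
            "load": {
                "sourceUris": [source_uri],
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                "sourceFormat": "CSV",
                "autodetect": True,     # schema auto-detection
                "skipLeadingRows": 1,   # each file has a header row
                "fieldDelimiter": "|",  # files are pipe-delimited
            }
        }
    }

# Placeholder URI and identifiers for illustration only.
config = make_load_config("gs://my-bucket/my-feed/file.csv",
                          "my-project", "my_dataset", "datastream_data")
```

In the Python client library the same options appear as the `skip_leading_rows` and `field_delimiter` attributes of a `LoadJobConfig`.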

Step 2b: Loading streaming data

Alternatively, data can be streamed from Google Cloud Storage to BigQuery. Details of an example streaming architecture can be found here.
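Whatever streaming architecture you choose, each row ultimately arrives at BigQuery as an entry in a tabledata.insertAll request. A rough sketch of that request body, with placeholder row contents, assuming per-row insertIds for BigQuery’s best-effort deduplication:

```python
import uuid

def make_insert_all_body(rows):
    """Build a tabledata.insertAll request body.

    Each row gets a unique insertId, which BigQuery uses as a
    best-effort deduplication key if the same row is retried.
    """
    return {
        "rows": [
            {"insertId": str(uuid.uuid4()), "json": row}
            for row in rows
        ]
    }

# Placeholder row for illustration only.
body = make_insert_all_body(
    [{"id": 1, "last_ping_timestamp": "2020-01-01T00:00:00Z"}]
)
```

A streaming pipeline would batch incoming records and POST bodies like this to the destination table’s insertAll endpoint.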