To get started on this tutorial, you’ll need:
A working datastream feed delivering to your Google Cloud Storage bucket. Keep the name of this bucket and access credentials handy.
A BigQuery instance and credentials for a user with sufficient permissions
Start by creating a table on your BigQuery instance. The easiest way to create a table is using the GCP web interface.
In the Google Cloud console, open BigQuery, select “Resources” and the dataset you wish to add the table to, then choose “Create a table.” From there, enter the appropriate information:
Important notes:
We recommend using schema auto-detection.
We recommend partitioning based on either last_ping_timestamp or import time.
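If you would rather script table creation than use the console, the sketch below does the same thing with the Python BigQuery client library. The project, dataset, and table names are placeholders, and the table is partitioned by ingestion (import) time, one of the two options noted above.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table ID; substitute your own project, dataset, and table names.
table = bigquery.Table("my-project.my_dataset.pings")

# Partition by ingestion (import) time, one of the two options recommended above.
# To partition on last_ping_timestamp instead, define that column in the table
# schema and pass field="last_ping_timestamp" to TimePartitioning.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)

table = client.create_table(table)
print(f"Created {table.full_table_id}, partitioned by ingestion time")
```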
Having loaded one folder into BigQuery, you will want to programmatically load future data delivered to your systems.
Step 2a: Loading data in batch
BigQuery offers APIs for loading data for a number of languages, detailed here.
Each file comes with a header row, so make sure to load data with skip_leading_rows set to 1.
Files are pipe-delimited, so set fieldDelimiter to ‘|’.
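As a sketch of what this looks like in practice, the example below runs a batch load job with the Python client library. The bucket path and table ID are placeholders; the job applies the header-row and delimiter settings above, along with schema auto-detection as recommended earlier.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names; substitute your own bucket, delivery folder, and table.
uri = "gs://my-delivery-bucket/feed/2021-01-01/*.csv"
table_id = "my-project.my_dataset.pings"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,    # each file has a header row
    field_delimiter="|",    # files are pipe-delimited
    autodetect=True,        # schema auto-detection, as recommended above
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to complete

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")
```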
Step 2b: Loading streaming data
Alternatively, data can be streamed from Google Cloud Storage to BigQuery. Details of an example streaming architecture can be found here.
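As one minimal sketch of the streaming approach, the snippet below reads a single delivered file from Cloud Storage and streams its rows into BigQuery with the streaming-insert API. The bucket, object path, and table ID are placeholders; in a production architecture this logic would typically run inside a Cloud Function or Dataflow pipeline triggered when new files land in the bucket, with values converted to match your table schema.

```python
import csv

from google.cloud import bigquery, storage

bq_client = bigquery.Client()
gcs_client = storage.Client()

# Placeholder names; substitute your own bucket, object path, and table.
table_id = "my-project.my_dataset.pings"
blob = gcs_client.bucket("my-delivery-bucket").blob("feed/2021-01-01/part-000.csv")

# Read one delivered file and stream its rows into BigQuery using the
# streaming-insert (insertAll) API rather than a batch load job.
# DictReader yields string values; convert them to your schema's column
# types before inserting in a real pipeline.
rows = csv.DictReader(blob.download_as_text().splitlines(), delimiter="|")

errors = bq_client.insert_rows_json(table_id, list(rows))
if errors:
    print(f"Encountered errors while streaming rows: {errors}")
```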