To get started on this tutorial, you’ll need:
A working Datastream feed delivering to your Cloud Storage bucket. You’ll want the name of this bucket and access credentials handy.
A BigQuery instance and credentials for a user with sufficient permissions.
Start by creating a table on your BigQuery instance. The easiest way to create a table is using the GCP web interface.
In the Google Cloud console, open BigQuery, select “Resources” and the dataset you wish to add a table to, then choose “Create a table.” From there, enter the appropriate information for your table.
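If you prefer to work programmatically, the table can also be created with the google-cloud-bigquery Python client. The sketch below uses placeholder project and dataset names, and infers the column types from the deduplication query later in this tutorial (last_ping_timestamp is stored as Unix seconds, since it is passed to TIMESTAMP_SECONDS):

from google.cloud import bigquery

client = bigquery.Client()

# Assumed schema: column names come from the deduplication query below;
# last_ping_timestamp is INT64 because it is wrapped in TIMESTAMP_SECONDS().
schema = [
    bigquery.SchemaField("cookie_id", "STRING"),
    bigquery.SchemaField("page_session_id", "STRING"),
    bigquery.SchemaField("last_ping_timestamp", "INT64"),
]

table = bigquery.Table("your-project.your_dataset.EXAMPLE_TABLE", schema=schema)
client.create_table(table)  # raises google.api_core.exceptions.Conflict if the table exists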
Having loaded one folder into BigQuery, you will want to load future deliveries programmatically.
Step 2a: Loading data in batch
BigQuery offers data-loading APIs in a number of languages, detailed here.
Each file comes with a header row, so make sure to load data with skip_leading_rows set to 1. Files are pipe-delimited, so set fieldDelimiter to ‘|’. A batch-load sketch follows.
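For illustration, here is a minimal batch-load sketch using the google-cloud-bigquery Python client; the bucket path and table ID are placeholders to replace with your own. (In the Python client, the two settings above are spelled skip_leading_rows and field_delimiter.)

from google.cloud import bigquery

client = bigquery.Client()

# Pipe-delimited files with one header row, per the feed format above.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="|",
    skip_leading_rows=1,
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/path/to/delivery/*",      # placeholder bucket path
    "your-project.your_dataset.EXAMPLE_TABLE",  # placeholder table ID
    job_config=job_config,
)
load_job.result()  # block until the load job finishes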
Step 2b: Loading streaming data
Alternatively, data can be streamed from Google Cloud Storage to BigQuery. Details of an example streaming architecture can be found here.
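One common pattern, sketched below under assumptions of our own (a handler invoked whenever a file lands in the bucket, with rows already parsed into dicts keyed by column name), uses the Python client’s streaming inserts:

from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "your-project.your_dataset.EXAMPLE_TABLE"  # placeholder table ID

def stream_rows(rows):
    # `rows` is a list of dicts keyed by column name, e.g.
    # {"cookie_id": ..., "page_session_id": ..., "last_ping_timestamp": ...}
    errors = client.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"Streaming insert failed: {errors}")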
To ensure that data is delivered in real time, Datastream guarantees that a record of each pageview is delivered to your Cloud Storage bucket at least once. While duplicates are rare, they can occur, so for accuracy we recommend an additional deduplication step after each data load. The deduplication statement below looks back over a two-hour window for duplicated records and always keeps the most recent row:
DELETE FROM EXAMPLE_TABLE
WHERE
  TIMESTAMP_SECONDS(last_ping_timestamp) >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
  AND STRUCT(cookie_id, page_session_id, last_ping_timestamp) NOT IN (
    SELECT STRUCT(cookie_id, page_session_id, MAX(last_ping_timestamp) AS max_ts)
    FROM EXAMPLE_TABLE
    WHERE TIMESTAMP_SECONDS(last_ping_timestamp) >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
    GROUP BY cookie_id, page_session_id
  )
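To run this cleanup programmatically after each load, one option is to execute the statement with the same Python client; a minimal sketch:

from google.cloud import bigquery

client = bigquery.Client()

# The deduplication statement above, as a query string.
dedup_sql = """
DELETE FROM EXAMPLE_TABLE
WHERE
  TIMESTAMP_SECONDS(last_ping_timestamp) >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
  AND STRUCT(cookie_id, page_session_id, last_ping_timestamp) NOT IN (
    SELECT STRUCT(cookie_id, page_session_id, MAX(last_ping_timestamp) AS max_ts)
    FROM EXAMPLE_TABLE
    WHERE TIMESTAMP_SECONDS(last_ping_timestamp) >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
    GROUP BY cookie_id, page_session_id
  )
"""

client.query(dedup_sql).result()  # wait for the DELETE to complete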