Datastream with BigQuery

A guide to loading Datastream data from your Google Cloud Storage bucket into BigQuery.

Prerequisites

To get started on this tutorial, you’ll need:

A working datastream feed delivering to your BigQuery bucket. You’ll want the name of this bucket and access credentials handy.
A BigQuery and credentials for a user with sufficient permissions

Step 1: Table creation

Start by creating a table on your BigQuery instance. The easiest way to create a table is using the GCP web interface.

In the Google Cloud console, from BigQuery select “Resources” and the dataset you wish to add a table to, then choose “Create a table.” From there, enter the appropriate information:

Important notes:

We recommend using schema auto-detection.
We recommend partitioning based on either last_ping_timestamp or import time.

Step 2: Loading

Having loaded one folder into BigQuery, you will want to programmatically load future data delivered to your systems.

Step 2a: Loading data in batch

BigQuery offers APIs for loading data for a number of languages, detailed here.

Each file comes with a header row, so make sure to load data with skip_leading_rows set to 1.
Files are pipe-delimited, so set fieldDelimiter to ‘|’.

Step 2b: Loading streaming data

Alternatively, data can be streamed from Google Cloud Storage to BigQuery. Details of an example streaming architecture can be found here.

Step 3: Deduplication (optional, but strongly recommended)

To ensure that data is delivered in real-time, Datastream guarantees that a record of each pageview is delivered to your BigQuery bucket at least once. While duplicates are rare, they can occur, so for the purposes of accuracy we recommend an additional deduplication step after each data load. In the below deduplication statement we look back for duplicated records for a two hour window and always keep the more recent row:

DELETE
FROM
  EXAMPLE_TABLE
WHERE
  TIMESTAMP_SECONDS(last_ping_timestamp) >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
  AND STRUCT(cookie_id, page_session_id, last_ping_timestamp) NOT IN (
  SELECT
    STRUCT(cookie_id, page_session_id, MAX(last_ping_timestamp) AS max_ts)
  FROM
    EXAMPLE_TABLE
  WHERE
    TIMESTAMP_SECONDS(last_ping_timestamp) >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
  GROUP BY cookie_id, page_session_id)

PreviousDatastream with Redshift NextBasic Tracking

Last updated 2 months ago

Was this helpful?