# Datastream with BigQuery

## **Prerequisites**

To get started on this tutorial, you’ll need:

* A working Datastream feed delivering to your Cloud Storage bucket. You’ll want the name of this bucket and access credentials handy.
* A BigQuery dataset and credentials for a user with sufficient permissions

### **Step 1: Table creation**

Start by creating a table in your BigQuery dataset. The easiest way to create a table is through the GCP web interface.

In the Google Cloud console, from BigQuery select “Resources” and the dataset you wish to add a table to, then choose “Create a table.” From there, enter the appropriate information:

{% hint style="info" %}
**Important notes:**

* We recommend using schema auto-detection.
* We recommend partitioning based on either last\_ping\_timestamp or import time.
{% endhint %}
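If you prefer SQL to the console, the table can also be sketched in DDL. The column names below are taken from the deduplication query later in this guide; your actual schema will come from your feed files (or from schema auto-detection at load time), and the project and dataset names are placeholders:

```sql
-- Hypothetical project/dataset names; columns mirror those used in the
-- deduplication example in this guide. Partitioning here is on
-- ingestion (import) time.
CREATE TABLE `my-project.my_dataset.EXAMPLE_TABLE` (
  cookie_id STRING,
  page_session_id STRING,
  last_ping_timestamp INT64
)
PARTITION BY _PARTITIONDATE;
```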

### **Step 2: Loading**

After loading an initial folder into BigQuery, you will want to programmatically load future data as it is delivered to your systems.

**Step 2a: Loading data in batch**

BigQuery offers data-loading APIs in a number of languages, detailed [**here**](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#python_1).

* Each file comes with a header row, so make sure to load data with `skip_leading_rows` set to 1.
* Files are pipe-delimited, so set `fieldDelimiter` to `|`.
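As a minimal sketch, the two settings above map onto the load job configuration you would pass to BigQuery's `jobs.insert` REST API (the Python client library exposes the same options through `LoadJobConfig`). The project, dataset, table, and bucket names here are placeholders:

```python
import json

# Sketch of a load job configuration for the BigQuery jobs.insert REST API.
# All project/dataset/table/bucket names are hypothetical placeholders.
# The Python client library equivalent is
# LoadJobConfig(skip_leading_rows=1, field_delimiter="|", autodetect=True).
load_config = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://my-bucket/path/to/files/*.csv"],
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "my_dataset",
                "tableId": "EXAMPLE_TABLE",
            },
            "sourceFormat": "CSV",
            "skipLeadingRows": 1,   # each file has a header row
            "fieldDelimiter": "|",  # files are pipe-delimited
            "autodetect": True,     # recommended: schema auto-detection
        }
    }
}

print(json.dumps(load_config, indent=2))
```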

**Step 2b: Loading streaming data**

Alternatively, data can be streamed from Google Cloud Storage to BigQuery. Details of an example streaming architecture can be found [**here**](https://cloud.google.com/solutions/streaming-data-from-cloud-storage-into-bigquery-using-cloud-functions).
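Whatever the surrounding architecture, the per-file transformation is the same: parse the pipe-delimited file and turn each record into a row for BigQuery's `tabledata.insertAll` streaming API. A minimal sketch, with a hypothetical sample line using the column names from the deduplication query in this guide:

```python
import csv
import io
import json

def to_insert_all_payload(file_text: str) -> dict:
    """Convert one pipe-delimited Datastream file (header row included)
    into a request body for BigQuery's tabledata.insertAll streaming API.
    Column names are whatever the file's header row declares."""
    reader = csv.DictReader(io.StringIO(file_text), delimiter="|")
    return {"rows": [{"json": row} for row in reader]}

# Hypothetical sample file content for illustration:
sample = "cookie_id|page_session_id|last_ping_timestamp\nabc|s1|1700000000\n"
payload = to_insert_all_payload(sample)
print(json.dumps(payload))
```

In production you would also set an `insertId` on each row, which gives BigQuery a best-effort deduplication key for retried inserts.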

### **Step 3: Deduplication (optional, but strongly recommended)**

To ensure that data is delivered in real time, Datastream guarantees that a record of each pageview is delivered to your Cloud Storage bucket at least once. While duplicates are rare, they can occur, so for accuracy we recommend an additional deduplication step after each data load. The deduplication statement below looks back over a two-hour window for duplicated records and always keeps the most recent row:

```sql
DELETE
FROM
  EXAMPLE_TABLE
WHERE
  TIMESTAMP_SECONDS(last_ping_timestamp) >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
  AND STRUCT(cookie_id, page_session_id, last_ping_timestamp) NOT IN (
  SELECT
    STRUCT(cookie_id, page_session_id, MAX(last_ping_timestamp) AS max_ts)
  FROM
    EXAMPLE_TABLE
  WHERE
    TIMESTAMP_SECONDS(last_ping_timestamp) >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
  GROUP BY cookie_id, page_session_id)
```
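The effect of this statement can be illustrated in plain Python: group rows by (cookie\_id, page\_session\_id) and keep only the row with the greatest last\_ping\_timestamp. The sample rows are hypothetical:

```python
def deduplicate(rows):
    """Keep only the most recent row per (cookie_id, page_session_id) key,
    mirroring the SQL DELETE above. Rows are
    (cookie_id, page_session_id, last_ping_timestamp) tuples."""
    latest = {}
    for cookie_id, session_id, ts in rows:
        key = (cookie_id, session_id)
        if key not in latest or ts > latest[key]:
            latest[key] = ts
    return [(c, s, ts) for (c, s), ts in latest.items()]

# Hypothetical sample rows: one duplicated pageview record.
rows = [
    ("abc", "s1", 1700000000),
    ("abc", "s1", 1700000300),  # duplicate; more recent ping wins
    ("def", "s2", 1700000100),
]
print(deduplicate(rows))
```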
