Google Analytics 360 offers a way to be able to track website traffic and, by using its BigQuery integration, store the detail measures that may be useful from an analytical point of view.

The detail measures stored can include traffic source, content or transactional data in the form of sessions. These sessions are stored in date-sharded tables i.e. a table of sessions stored for each day.

Although copying datasets may seem like a one-click job where one could now copy a dataset from any region just like that * snaps fingers *, migrating this dataset would involve extra precautionary checks when dealing with a large number of tables.

This blog describes the usage of the Beta version (as of Jan 2020) of the BigQuery Data Transfer Service to copy datasets across regions from US to Australia South-East region on Google Cloud Platform and compares the efficiency of the different methods that have previously been used prior to the release of this beta version.

Copying Datasets

The different options to be able to copy datasets across regions are as follows:

  • By using a federated architecture which would involve moving all data to GCS buckets, copying data to the appropriate region wise bucket and then uploading tables:
    • Using Cloud Composer
    • Using Dataflow
      • Write up a pipeline job integrated with BigQuery that copies tables to and fro GCS buckets
  • By using BigQuery’s beta version of cross-region copy:
Screenshot 2020-01-15 at 10.54.02 AM.png
Fig: BigQuery Data Transfer Service

Now, you may be wondering that if there’s a solution to do it all in a few clicks, why bother go through the other options?

Quotas and Limitations

The general limitations for cross-region copy jobs allow a maximum of 1000 tables to be copied per run.

As per Google’s documentation,

“For a cross-region copy of a dataset with 8,000 tables in it, the BigQuery Data Transfer Service will automatically create 8 runs in a sequential manner. The first run will copy 1,000 tables, then 24 hours later another run will copy 1,000 tables, etc., until all tables in the dataset are copied — up to the maximum of 20,000 tables per dataset.”.

Since the case chosen has around 10,000 tables to migrate, this limitation is applicable.

The following points out the comparison between the different methods discussed to be able to copy datasets:

Screenshot 2020-01-15 at 12.14.25 PM.png
Fig: Comparing Dataset copy methods

Though BigQuery Data Transfer Service seems like the simplest choice, the limitations are not obvious on first sight.

Since the limitation on the number of tables per destination dataset is applicable in a case with more than 10,000 tables, the following steps describe a possible workaround to copy tables from US to australia-southeast1.

Data description:

  • A single dataset that contains:
    • ~1000 date-sharded tables, one per day that records GA sessions
    • 10 scheduled queries, that performs analysis queries on the main session table
    • ~1000 date-sharded tables for each query, one per day that records results of the queries

A possible workaround: 

  • Tables from the source dataset are copied to different temporary datasets within the same region (US) so that each temporary dataset contains a maximum of 1000 tables.
    • This is an in-region dataset copy that should allow a maximum of 20,000 jobs.
  • Each of these datasets are simultaneously copied to pre-created datasets at the new region (australia-southeast1).
    • This is a BigQuery Data Transfer Service that should allow a cross-region copy.
  • Finally, the tables from the dataset in the new region (australia-southeast1) are then copied to the final dataset.
    • This is again an in-region dataset copy that should allow a maximum of 20,000 jobs.

For ~10,000 tables to be copied, the above procedure took around 2 hours which could be optimised further to execute the script in parallel over a larger number of copy jobs.

Screenshot 2020-01-15 at 12.39.33 PM.png

Migration and Deployment of GA360

Now that the tables have been migrated to the appropriate region, the last few points to note would be to validate the data, dashboards and the queries.

GA360 tables are noted to have a few nested resources which may be easier to use with Legacy SQL. However, to be able to use BigQuery scheduled queries to populate the tables, it must be noted that it supports only Standard SQL.

Differences to be noted while migrating to Standard SQL from Legacy SQL can be found here and here

At last but not the least, you’re a click away from changing the region at the GA360 dashboard and the steps to be able to do so are described here

Leave a Reply

%d bloggers like this: