Unnesting Legacy to Standard SQL

Unnesting Legacy to Standard SQL

Updating your application to migrate from Legacy to Standard SQL isn’t a straightforward process of replacing colons with dots and brackets with quotes.

There’s a bit more of a nested mess than that. Working with nested resources has its advantage in Legacy SQL to be able easily correlate a resource with its nested attributes by using a ‘.’ operator

This blog addresses the problems that may arise when attempting to convert Legacy to Standard SQL.

What are nested fields?

Nested fields are a struct of arrays that have a set of attributes that are conceptually nested within. They could be imagined as a table within a table. Traditionally, they would be separate tables and to be able to perform an analysis on them, a join would be required.

Screenshot 2020-01-15 at 3.30.29 PM.png
Fig: Google Analytics 360 schema with ‘totals’ as a nested field 

As an example, let’s use the BigQuery public dataset for Google Analytics. Its description for the schema can be found here.

Each row in the ga_session table refers to one session. To be able to address the nested fields within each session, we would need to unnest or unflatten the data per session as explained below.

Nested fields with Legacy SQL

#LegacySQL
SELECT fullVisitorId, visitStartTime, date, hits.hitNumber FROM [bigquery-public-data:google_analytics_sample.ga_sessions_20170801]

This query results in rows that are flattened or normalised by hits. Each row refers to a hit within a session .

A nested field can be addressed using the ‘.’ operator and Legacy SQL magically understands to unflattens the data as per the level of dimensionality required.

Nested fields with Standard SQL

#StandardSQL
SELECT fullVisitorId, visitStartTime, date, count(*) as sessions FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` GROUP BY fullVisitorId, visitStartTime, date 

This query returns the count as the number of sessions per unique visitor where the rows are denormalised or unflattened. To be able to access a nested resource, we would need to unnest the field as shown below,

#StandardSQL 
SELECT fullVisitorId, visitStartTime, date, count(*) as hits FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`, UNNEST(hits) as h GROUP BY fullVisitorId, visitStartTime, date

This query returns the count as the number of hits per unique visitor where the rows are normalised or flattened out by unnesting the hits. If we were to conceptually visualise the nested fields as a nested table, the comma in the FROM clause represents a cross-join between the tables. 

Maintaining the level of dimensions when unnesting

Depending on the outcome of query, each field that is to be returned by the query must be validated that it is at the right level of dimension to be addressed.  This can be implemented in a query in the following ways:

  • Create sub-queries at each level of dimension required using WITH statements and join these sub-queries to arrive at the correct level that is required.

For example,

#StandardSQL
WITH 
hit_level AS (
SELECT
CONCAT(fullVisitorId, CAST (visitStartTime AS string)) AS uniqueVisitId,
h.hitNumber

FROM 
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` AS ga, UNNEST(hits) AS h
),

user_level AS (
SELECT
CONCAT(fullVisitorId, CAST (visitStartTime AS string)) AS user_uniqueVisitId, 
SUM(totals.pageviews) as pageviews

FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` AS ga
GROUP BY user_uniqueVisitId
)

SELECT * FROM 
user_level LEFT JOIN hit_level
ON user_level.user_uniqueVisitId = hit_level.uniqueVisitId

This allows us to segregate the tables of different levels (where the hits are a nested resource) and perform an explicit join so that the summation of revenue is performed at a user/session level rather than being summed over a duplication of rows when the hits are unnested.

  • Using the analytical functions OVER(PARTITION BY …) .

This allows to select the fields in a window frame of the table that allows a nested field to be returned at the appropriate level 

For example,

#StandardSQL
SELECT DISTINCT
CONCAT(fullVisitorId, CAST (visitStartTime as string)) AS uniqueVisitId,
h.hitNumber,
SUM(totals.pageviews)  OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY date) as pageviews
     
FROM 
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` AS ga, UNNEST(hits) as h

Using the window function, we’re now able to replicate the previous explicit join by processing over a partition of the table. Note that, the query is over a cross join with the unnested hits as well. By partitioning over the identifier of the level of dimension required, we’re now able to arrive at the number of pageviews over a user level rather than a hit level.

A journey through conversion

#LegacySQL
SELECT  
(fullVisitorID + CAST(visitStartTime AS STRING)) AS visitIdentifier,
SUM(totals.pageviews) AS pageviews,  
COUNT(hits.product.productsku) AS num_products

FROM 
[bigquery-public-data:google_analytics_sample.ga_sessions_20170801]
GROUP BY visitIdentifier

To convert the above Legacy SQL to Standard SQL, the first attempt would be to UNNEST each of the nested fields in the FROM clause. However, from the previous section we’ve understood that UNNESTing the fields would result in a duplicate number of rows which would lead to an incorrect calculation over the session-level fields.

The next attempt would be to group by the keys of the nested columns to be able to group these duplicated rows

#StandardSQL
SELECT  
DISTINCT CONCAT(fullVisitorID , CAST(visitStartTime AS STRING)) AS visitIdentifier,
SUM(totals.pageviews) AS pageviews,  
COUNT(hp.productsku) AS num_products

FROM 
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`, UNNEST(hits) as h, UNNEST(h.product) as hp
GROUP BY visitIdentifier, h.hitNumber, hp.productSku

Unfortunately, this grouping is for all the fields and would not lead to the correct result.

The next attempt minimises this group to query the respective fields within a window,

#StandardSQL
SELECT  
CONCAT(fullVisitorID , CAST(visitStartTime AS STRING)) AS visitIdentifier,
SUM(totals.pageviews) AS pageviews,
SUM((select COUNT(hp.productsku)  FROM  UNNEST(hits) as h, UNNEST(h.product) as hp)) as num_products,

FROM 
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
GROUP BY visitIdentifier

This results with the correct answer with each field being processed at the right level.

To be able to validate the results of the query, it must be remembered that Standard and Legacy SQL have different implementations of COUNT(DISTINCT … A list of all differences can be found here

And with that, hopefully this blog helped you untangle an unnested mess and be able to see daylight once again.

Thanks to Aaron Brown for his moral support and invaluable help to get through the unnest.  

tanya.srinidhi@shinesolutions.com
No Comments

Leave a Reply