
01 Nov 2024 Snowflake World Tour – Sydney 2024
It’s no secret that when it comes to building data platforms, I’ve spent a fair amount of time building cloud provider native solutions, both in AWS and GCP. So when I saw that Snowflake had a conference showcasing all their latest and greatest features, I thought it was high time to catch up on what was going on in the wider data platform ecosystem. Whilst I’ve worked on other platforms like Databricks, I’ve seen more and more customers choosing Snowflake.
Would you like a side of AI with that?
The keynote was as you’d expect (particularly given the featured image for this post): it heavily featured AI and what Snowflake termed the ‘AI Data Cloud’. They defined the ‘AI Data Cloud’ as using AI both to support the platform, through generation of metadata and optimisation, and to provide insights into the data stored within the data warehouse. They also highlighted their newer AI offering, Cortex AI, which includes a studio plus Cortex Analyst and Cortex Search, which let you use LLMs to query structured and unstructured data respectively.
The demos were impressive, particularly seeing how Cortex Analyst shows the end user the SQL it runs against your structured data to generate its answers. For me that was nice to see, so I could check the LLM’s homework. There was also a short demo of how you could easily fine-tune LLMs to get improved performance on specific tasks, all through a notebook without much effort. These showcases highlighted some of the challenges that companies face when first starting to use AI, such as complexity, controlling costs, and security and privacy. They neatly laid out how Snowflake’s approach drives simplicity and ease of governance of data, and how that is integrated throughout their offerings and feature sets. They did note that costs can be controlled through the sizing of the warehouses, and that Snowflake has built-in cost governance; even so, the choice to use a credit system for pricing rather than a straight $/hr is still one of my major concerns.
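To make the credit concern a little more concrete, here is a rough sketch of the arithmetic involved in turning warehouse runtime into dollars. The credits-per-hour figures reflect my understanding of Snowflake's standard warehouse sizes (doubling with each size step); the function name and the price per credit are hypothetical placeholders, since the real price depends on edition, region and contract.

```python
# Credits consumed per hour by standard warehouse size.
# Assumption: each size step doubles the rate, starting from 1 for X-Small.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}


def estimated_cost(size: str, hours: float, price_per_credit: float) -> float:
    """Approximate dollar cost of running one warehouse of `size` for `hours`.

    `price_per_credit` is whatever your contract says; it is not a fixed number.
    """
    return CREDITS_PER_HOUR[size] * hours * price_per_credit


# Hypothetical example: a Medium warehouse for 3 hours at a made-up $3.00 per credit.
print(estimated_cost("M", 3, 3.0))
```

The extra multiplication is simple enough, but it means the dollar figure is always one step removed from what the platform reports, which is the part I find harder to reason about than a straight hourly rate.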
Open Data Lakes and Lakehouses
One thing that caught my attention on the agenda was the large number of sessions devoted to their new support for Iceberg tables and catalogues within Snowflake. This has been achieved through the use of Apache Polaris, upon which they have built the Snowflake Open Catalog. Polaris provides a standards-compliant Apache Iceberg metastore that can be used to interact with data stored in platforms using other catalogues, such as AWS Glue. They mentioned how this approach also allows them to query data in Delta tables without the need for a Unity metastore. They explained use-cases for this feature, such as customers who have existing data lakes and want to use Snowflake for processing, and customers who need to support multiple data tools, where Iceberg tables are the common point of compatibility. They also demoed how Iceberg tables, whether stored within Snowflake or elsewhere, could be manipulated and managed as if they were native tables. Most features are already supported, and those that aren’t were listed as being in private preview.
For me this proved to be one of the more interesting elements of the day; I started to think about how some of the existing data platforms that I’d built could leverage this approach to allow end-users to make use of Snowflake without needing to re-import the data. They could gain the benefits it brings in terms of transformations and data governance.

Central Governance of Data
One of the key features that was talked about in almost every session was Data Governance. For example, in the session about Snowflake native apps, it was highlighted just how fine-grained the permission model is, and how, as a Snowflake administrator, you control exactly what data native applications can access. The key technology underpinning this was Snowflake Horizon, which is their all-encompassing name for all their data governance and security features. These features stretch from some of the more typical items, such as data set discovery and row-level and column-level security, to live data masking and aggregation policies that enforce de-identification of sensitive data sets. These demos clearly showed some of the real problems that data engineering teams try to solve, addressed in a way that makes them simple to understand and audit.
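As a way of picturing what an aggregation policy is for, here is a minimal Python sketch of the underlying idea: only release an aggregate when enough rows sit behind it, so no individual can be singled out. This is an illustration of the concept only, not Snowflake's implementation or API; the function name, the field names and the choice to simply drop small groups are all my own assumptions.

```python
from collections import defaultdict


def aggregate_with_min_group_size(rows, group_key, value_key, min_group_size):
    """Return the mean of `value_key` per group, hiding groups that are too small.

    Hypothetical helper: groups with fewer than `min_group_size` rows are
    left out of the result entirely, so a single record is never exposed.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        key: sum(values) / len(values)
        for key, values in groups.items()
        if len(values) >= min_group_size
    }


# Made-up sample data: three customers in one postcode, a single customer in another.
rows = [
    {"postcode": "2000", "spend": 10},
    {"postcode": "2000", "spend": 20},
    {"postcode": "2000", "spend": 30},
    {"postcode": "2010", "spend": 99},
]
print(aggregate_with_min_group_size(rows, "postcode", "spend", min_group_size=3))
```

The appeal of having the platform enforce this is that the rule lives with the data rather than in every query or extract that touches it.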
Data Sharing
One of the unexpected gems was the final session on Data Sharing, and how the features of Snowflake Horizon, such as aggregation policies, can help in data sharing between organisations and even between teams in the same organisation. The session went through how to think about what data to share, and the concept of creating a data product which could be shared (or commercialised), rather than creating yet another file feed. Private and public marketplaces can be used to provide third parties and other teams with curated views into your datasets, without needing to make new copies or extracts. I may have dived head first into the kool-aid, but a future where I don’t have to chase upstream teams for missing feed files, or for an updated schema when a feed breaks because no-one told you… sign me up. On a more serious note, think about the data provided to external teams and partners with a product mindset. Taking the time to understand what they would do with it, rather than just throwing a pile of data at them, certainly has benefits, especially when coupled with the fact that there are no feed files to monitor and maintain.
Drawing it to a close
Am I converted? Not quite. I still appreciate the flexibility that some of the cloud provider native data platforms can offer. But it was certainly great to spend the day seeing how Snowflake have solved some of the common problems in Data Engineering, and the cool technology they’re developing. I was more impressed than I expected to be, and will certainly be considering whether Snowflake has a place in future data platform and data engineering projects.
