
Beam me up Google – porting your Dataflow applications to 2.x
22 May 2017
Will this post interest me?
If you use (or intend to use) Google Cloud Dataflow, if you’ve heard about Apache Beam, or if you’re simply bored at work today and looking to waste some time, then yes, please do read on. This short post will cover why our team finally took the plunge to start porting some of our Dataflow applications (using the 1.x Java SDKs) to the new Apache Beam model (2.x Java SDK). Spoiler – it has something to do with this. It will also highlight the biggest changes we needed to make when making the switch (pretty much just fixing some compile errors).
Firstly, some background
Many of you avid readers will already know that the boffins at Google decided to donate the Dataflow SDKs to the Apache Foundation last year. If you’re already confused by that sentence, don’t worry. Head over here and here for a quick refresher on that announcement. If you’re still confused as to what that means, then I’d suggest studying this – for at least a few hours – and coming back to this post afterward.
Now, even though they decided to donate the Dataflow SDKs to Apache, they still want users to run Beam pipelines on their cloud, using their ‘Cloud Dataflow’ runner. As such, Google will now (since 2.x) distribute the most relevant parts of the Beam codebase as the Dataflow SDKs. Yes, I know it’s a little confusing, but stay with me. Apparently, these distributions are more rigorously tested and validated, and I’m guessing they are also tweaked/tuned to run on Google’s hardware.
It all boils down to this: if you previously developed Dataflow pipelines using the SDK (1.x), and now want to upgrade to the newer (Beam-based) SDK (2.x), then you’ll need to do some refactoring work in your application. But, because of your high test coverage, that obviously won’t be a problem – now will it?
Making a boo-boo
Shine were very early adopters of Dataflow. In fact, we were one of the first to get access to it when it was announced at Google I/O back in 2014. We immediately loved it because it helps us solve a lot of the big data processing problems our clients face on a daily basis. Needless to say, over the last few years we’ve built an array of solutions on top of Dataflow, using the SDK that was available at the time i.e. the original SDK that has now been donated to Apache.
Since the Apache Beam announcement was made last year, we’ve always known that at some point we’d need to make the cutover to the new SDK. But, because it’s still in beta, and more so because we’re lazy, we kept putting it on the long finger, and continued to lean on the 1.x versions.
However, recently we were finally forced to make the move after we developed a new pipeline using Dataflow’s new templating feature. Why? Well, let me explain to you, my good friends. When we developed the templated pipeline (which had a BigQuery sink), we missed one glaring caveat in the docs: a template that uses a BigQuery sink can only be executed once.
That meant we could only execute our pipeline once. Yup, no joking. Not twice. Not thrice. Not…errr…well, you get the idea. Only being able to execute it once made it utterly useless to us. “So what?“, I hear you say. Well, the problem was that we had already rolled it out to production. Yay, high fives all round.
Yes, we had made a bit of a boo-boo with this one.
Taking a leap of faith
To recap, we’d deployed a templated Dataflow pipeline that could only be executed once *eye roll*. We obviously needed it to be executed many times. It was built using the 1.x Dataflow SDK, and written in Java. After some frantic Googling and fingernail biting, we stumbled across this merge request in the Beam code.
Parsing the MR, it looked very promising, and appeared to address the limitation of a BigQuery-bound pipeline only being allowed to run once. However, this meant that if it worked as intended, we’d need to migrate the pipeline from 1.x to 2.x to leverage it.
To be sure, we reached out to our Google friends on the Dataflow team, and got confirmation straight from the horse’s mouth; “Yes Graham, this MR will allow your pipeline to be executed more than once. Now, please stop calling us all the time. We’ve more important things to be doing than answering your annoying questions“[*]
*My (miserable) attempt at humour. Never happened (the get lost part). Google love me calling them all the time. No, really.
Moving right along, there was, however, a pretty big gotcha with all this. There’s always a gotcha, huh? The SDK is only in beta. But we quickly came to the conclusion that it couldn’t be any worse than what we had in production at that time (in other words, a big steaming pile of #!@&).
Porting to Beam – at warp speed
The first thing we needed to do was make the good ‘auld switcheroo in our (Gradle) dependencies from 1.x to 2.x. This would effectively move our pipeline onto the new Apache Beam model, and in the process cause the whole thing to break because of backward compatibility issues. Doing this yourself will depend on your project setup, but for us it was an absolute walk in the park. We simply pointed to the new SDK in our build.gradle file like so (old one commented out just for brevity’s sake):
dependencies {
    //compile 'com.google.cloud.dataflow:google-cloud-dataflow-java-sdk-all:1.9.0'
    compile 'com.google.cloud.dataflow:google-cloud-dataflow-java-sdk-all:2.0.0-beta3'
}
Once we did that, we were then confronted with a gazillion compile errors. Straight away, we thought our initial fears had been realised, and it was going to take a heap of work to port the pipeline. But once we’d wiped away the tears, and looked at the errors properly, it quickly became evident that about 99% of them were simply import errors. Here’s an example of one of the classes that needed to be ported, pulled from our Git history.
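The gist of that diff looks something like this – a sketch using typical imports, not our actual class:

// 1.x (Dataflow SDK)
// import com.google.cloud.dataflow.sdk.Pipeline;
// import com.google.cloud.dataflow.sdk.transforms.DoFn;
// import com.google.cloud.dataflow.sdk.transforms.ParDo;
// import com.google.cloud.dataflow.sdk.values.PCollection;

// 2.x (Apache Beam) – same classes, new home
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;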
You’ll notice how the core change in the packages is from com.google.cloud.dataflow to org.apache.beam, along with some other minor package renaming/moves. This was easy to fix up, and took all of about 10 minutes. Finally, there were still some other compile errors lying around. Following the docs here, these were also easy to iron out. For example, you now need to annotate your ParDos, and remove the named(String) method:
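In other words, something like this (again, just a sketch – the WordCount-style DoFn is for illustration only, and assumes lines is a PCollection<String>):

// 2.x style DoFn: mark the processing method with the @ProcessElement
// annotation, instead of overriding processElement() as in 1.x.
static class ExtractWordsFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    for (String word : c.element().split("\\s+")) {
      c.output(word);
    }
  }
}

// 1.x: the transform carried its own name via named(String)...
// PCollection<String> words = lines.apply(ParDo.named("ExtractWords").of(new ExtractWordsFn()));

// 2.x: named(String) is gone – pass the name to apply() instead.
PCollection<String> words = lines.apply("ExtractWords", ParDo.of(new ExtractWordsFn()));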
OK, so once all the compile errors were fixed up, the last thing to do was update the pipeline options i.e. the program arguments. For example, the DataflowPipelineRunner class has been changed to just DataflowRunner. For us, we were using the TemplatingDataflowPipelineRunner, but we couldn’t find its new name/class. Then we RTFM’d again, and there it was staring us in the face (it would appear we have a real problem reading the Google docs properly):
Replaced TemplatingDataflowPipelineRunner with --templateLocation
Users affected: All | Impact: Compile error | JIRA Issue: BEAM-551
The functionality in TemplatingDataflowPipelineRunner (Dataflow SDK 1.9+ for Java) has been replaced by using the --templateLocation with DataflowRunner.
So, our pipeline (program arguments) now looked like this:
--project=foo --runner=DataflowRunner --templateLocation=gs://bar/ --stagingLocation=gs://foobar/
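For completeness, wiring those arguments into the pipeline itself is the usual boilerplate – a minimal sketch, with our actual transforms elided and placeholder names throughout:

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class TemplatedPipeline {
  public static void main(String[] args) {
    // Parses --project, --runner, --templateLocation etc. from the program arguments.
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);

    Pipeline p = Pipeline.create(options);
    // ... add your transforms here ...

    // With --templateLocation set, run() stages the template to GCS
    // rather than launching a job straight away.
    p.run();
  }
}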
It’s as easy as 1-2-3
And with that, we were done. We had ported our pipeline to the newer Beam 2.x SDK, fixed our problem with templated pipelines, and we had learned (yet again) to always RTFM. The whole process took about an hour. Easy peasy lemon squeezy.
OK, so this pipeline wasn’t particularly large or complex, but even for larger codebases it’s a straightforward affair – swap out the SDKs, fix up some imports and compile errors, and finally update your runners and program arguments. That’s it. Even my manager could do it, seriously.