Scheduling BigQuery jobs: this time using Cloud Storage & Cloud Functions

Intro

Post update: My good friend Lak over at Google has come up with a fifth option! He suggests using Cloud Dataprep to achieve the same. You can read his blog post about that over here. I had thought about using Dataprep, but because it actually spins up a Dataflow job under-the-hood, I decided to omit it from my list. That’s because it will take a lot longer to run (the cluster needs to spin up and it issues export and import commands to BigQuery), rather than issuing a query job directly to the BigQuery API. Also, there are extra costs involved with this approach (the query itself, the Dataflow job, and a Dataprep surcharge – ouch!). But, as Lak pointed out, this would be a good solution if you want to transform your data, instead of issuing a pure SQL request. However, I’d argue that can be done directly in SQL too 😉

Not so long ago, I wrote a blog post about how you can use Google Apps Script to schedule BigQuery jobs. You can find that post right here. Go have a read of it now. I promise you’ll enjoy it. The post got quite a bit of attention, and I was genuinely surprised that people actually take the time out to read my drivel.

It’s clear that BigQuery’s popularity is growing fast. I’m seeing more content popping up in my feeds than ever before (mostly from me because that’s all I really blog about). However, as awesome as BigQuery is, one glaring gap in its arsenal of weapons is the lack of a built-in job scheduler, or an easy way to do it outside of BigQuery.

That said however, I’m pretty sure that the boffins over in Googley-woogley-world are currently working on remedying that – by either adding schedulers to Cloud Functions, or by baking something directly into the BigQuery API itself. Or maybe both? Who knows!

But until they do, let’s do a quick recap of the options you have right now to schedule a job in BigQuery, shall we:

  1. Google Apps Script (GAS)
  2. Task scheduling with Google App Engine (GAE)
  3. Spin up a GCE instance and use a cron

GAS is a nice enough way of doing it. You don’t need to maintain anything. It’s – dare I say it – a completely serverless solution (*murmurs from the crowd at the use of the latest buzzword*). However, when you write your GAS application, it’s inherently tied to your personal Google account, and not a service account. That’s a big no-no for production systems.

“using a cron is usually an admission of failure”

Using GAE is also a decent workaround. Again, there’s nothing to maintain – yippee! But in my opinion, it’s an overly heavyweight solution to a very simple problem. Kind of like cracking an egg with a sledgehammer. And you need to pay for GAE running 24/7.

The last option, spinning up a GCE instance and using a good ‘auld cron, isn’t that bad either. But, there’s something to maintain, and it’s so 2017. Anyway, as someone eloquently put it to me recently: “using a cron is usually an admission of failure”. Fair enough.

Yet another option to consider (I’m sorry)

Recently, I was working with the Google Cloud Storage (GCS) transfer service. It’s a simple mechanism for GCP users to automate the copying of GCS buckets/objects (and S3 too). I noticed that, in addition to performing a one off execution of the transfer job, you also have the option to schedule daily executions. That’s kind of neat, and it got me thinking – if you can schedule an object to be copied each day from a bucket, could that be used as the trigger for a Cloud Function that runs a BigQuery job?

Well, of course it can, silly! And here’s how it works:

  1. Create a dummy GCS bucket. Let’s call it faking-it-source
  2. Next, create a folder in there called trigger (you’ll see why later)
  3. Upload a dummy object (file) to trigger (it can be anything you want)
  4. Create a GCS bucket to deploy and trigger a Cloud Function on. Let’s call it faking-it-destination
  5. Write a simple Cloud Function that encapsulates the job/query/SQL that you want BigQuery to execute daily
  6. Create a daily GCS transfer service job that copies faking-it-source to faking-it-destination
  7. Finally, lean back in your chair, crack open a cold one, and inform your boss of your triumphant hard work

“it got me thinking – if you can schedule an object to be copied each day from a bucket, could that be used as the trigger for a Cloud Function that runs a BigQuery job?”

OK, so it’s a rather crude workaround, and misusing a tool isn’t really a cool thing to do. But why not use the other tools on the stack to plug the gap until a better solution is available? Just like using GAS, there’s nothing to maintain, but it’s got the added advantage of not being tied to your personal Google account. The Cloud Function uses a proper service account to execute. So, when your company gives you the boot for deploying this workaround, at least your scheduled BigQuery job will live on in your honour even after they delete your account. High-five!

Let’s do this thing

I’m going to skip holding your hand through the first four steps (creating the GCS buckets). If you can’t create two buckets, and upload an object, then you probably shouldn’t be reading this post. Or using a computer.
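(Alright, fine. If you’d rather script those first four steps than click around the console, here’s a rough sketch using the Node.js Storage client. The bucket names match the list above; the dummy file name and its contents are just placeholders I’ve plucked out of thin air.)

// Rough sketch only: create the two buckets and upload a dummy trigger object.
// Bucket names are from the steps above; the file name/contents are arbitrary.
const storage = require('@google-cloud/storage')();

storage.createBucket('faking-it-source', function(err) {
  if (err) return console.error('Could not create source bucket: ', err);

  storage.createBucket('faking-it-destination', function(err) {
    if (err) return console.error('Could not create destination bucket: ', err);

    // Any old object will do - it only exists to fire the Cloud Function later.
    storage
      .bucket('faking-it-source')
      .file('trigger/dummy.txt')
      .save('I am just here to trigger a Cloud Function', function(err) {
        if (err) return console.error('Could not upload dummy object: ', err);
        console.log('Buckets and dummy trigger object created.');
      });
  });
});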

Now, let’s jump straight to step five, and create a Cloud Function to execute our BigQuery job. You’ll be glad to know you can copy and paste the code below, or if you prefer, grab it from GitHub here. You’ll need an index.js file that uses the Node.js library to run your BigQuery job (the Cloud Function):

exports.fakingItWithGCS = function(event, callback) {
  const file = event.data;
  // Only fire for new/updated objects under trigger/ - ignore deletes and
  // uploads of the Cloud Function code itself.
  if (
    file.resourceState === "exists" &&
    file.name &&
    file.name.indexOf("trigger/") !== -1
  ) {

    const bigquery = require('@google-cloud/bigquery')();
    const query =
      "SELECT url FROM `publicdata.samples.github_nested` LIMIT 100";

    // Submit the query as an async job - we don't wait around for it to finish.
    bigquery.createQueryJob(
      {
        destination: bigquery.dataset("faking_it").table("my_table_created_from_a_cloud_function"),
        query: query,
        useLegacySql: false,
        allowLargeResults: true,
        writeDisposition: "WRITE_TRUNCATE",
        createDisposition: "CREATE_IF_NEEDED"
      },
      function(err, job) {
        if (err) {
          console.error("Something went wrong with submitting the BigQuery job. Error: ", err);
          return callback(err);
        }
        console.log("BigQuery job successfully submitted");
        callback();
      }
    );
  } else {
    console.log("Nothing to do here, ignoring.");
    callback();
  }
};

Note the first if statement. It’s checking for two things. The first is the type of event that was received: we’re only interested in a new or updated object, not a delete event, and we check for that by making sure the resource state equals exists.

Secondly, remember I said I’d explain why we created that trigger folder? Well, this is it. Because we’re going to use the same bucket to host the Cloud Function and to trigger on, we don’t want to be firing on new versions of the Cloud Function being uploaded to it. So, we check that the name of the object in the event contains trigger in it. Easy-peasy-lemon-squeezy!
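For reference, here’s a trimmed-down illustration of the sort of event the function receives from a GCS trigger. Only the fields the checks above care about are shown, and the object name is just an example:

// Illustrative only - a cut-down GCS trigger event showing just the fields
// the function inspects. The object name here is made up.
const exampleEvent = {
  data: {
    name: "trigger/dummy.txt",       // object path - checked for "trigger/"
    bucket: "faking-it-destination", // the bucket the function is bound to
    resourceState: "exists"          // "exists" for create/update, "not_exists" for delete
  }
};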

You’ll also need a package.json file to declare the dependency on the BigQuery module:

{
  "main": "index.js",
  "dependencies": {
    "@google-cloud/bigquery": "1.0.0"
  }
}

Now, deploy your Cloud Function like so (you need to be in the same dir as index.js and package.json). Change the parameters to suit you, sir:

gcloud beta functions deploy fakingItWithGCS --stage-bucket faking-it-destination --trigger-bucket faking-it-destination

“Easy-peasy-lemon-squeezy”

Now you’re ready to create a scheduled transfer service. Navigate to GCS in the console, and select ‘Transfer’ from the left side menu. Configure the transfer to copy from faking-it-source to faking-it-destination. But pay attention folks! Here’s the trick: make sure you select the option to ‘Overwrite destination with source, even when identical’:

This tells the transfer service to copy the object even when it already exists in the destination bucket. This is now our trigger for the Cloud Function. Woop!

Select a time that you’d like it to run, or a few minutes from now so that you can test it without having to wait (you can edit the configuration afterwards to the actual time you want it to run daily):

If you’ve followed my instructions to a tee, you should see everything working when the transfer kicks off:

Now, that only proves that it worked the first time, when the object was copied over from faking-it-source to faking-it-destination. However, it will still work on subsequent runs. But don’t just take my word for it. I tend to be wrong about things a lot. So, go ahead and run it again to make sure it triggers when the object already exists.

But if the job fails, how will I know?

The more astute readers among you will most likely have noticed that when the job/query is successfully submitted to BigQuery, the Cloud Function immediately returns. For those not paying attention in class, here’s the snippet where that happens:

function(err, job) { 
    if (err) { 
        console.error("Something went wrong with submitting the BigQuery job. Error: ", err); 
        return callback(err);
    } 
    console.log("BigQuery job successfully submitted"); 
    callback(); 
}

All jobs run async in BigQuery. But, what if you want to check if your job succeeded or not? Well, you’ve two options:

  1. Poll the status of the job. Sit and wait for the job to finish, and check if the status contains any errors. Cloud Functions have a time limit of 540 seconds (9 minutes), so you should be golden. If your queries are taking longer than that, you should probably go and ask someone to review your SQL. Seriously. There’s a rough sketch of this option just after the list.
  2. Create a Stackdriver metric/alert and get notified if a BigQuery job fails.
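If you do fancy option 1, here’s a rough sketch of what it could look like. It assumes the 1.x Node.js client, which emits ‘complete’ and ‘error’ events on the job object that createQueryJob hands back (jobOptions below just stands in for the same options object we used earlier). Treat it as a sketch, not gospel:

// A sketch of option 1: submit the job, then hang around for it to finish and
// inspect the final status, instead of returning straight away. Assumes the
// query finishes well within the 540 second Cloud Functions limit.
bigquery.createQueryJob(jobOptions, function(err, job) {
  if (err) {
    console.error("Something went wrong with submitting the BigQuery job. Error: ", err);
    return callback(err);
  }

  job.on("error", function(jobErr) {
    // The job itself failed (e.g. dodgy SQL, exceeded quotas).
    console.error("BigQuery job failed. Error: ", jobErr);
    callback(jobErr);
  });

  job.on("complete", function(metadata) {
    // The job finished - metadata.status holds the final state and any errors.
    console.log("BigQuery job finished with state: ", metadata.status.state);
    callback();
  });
});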

The Stackdriver route is my preferred choice. It’s easy. Here’s how:

First, create a metric from the following filter (you filter by the service account email so you don’t get spammed when you’re writing dodgy SQL in the UI and running lots of queries):

resource.type="bigquery_resource"
severity="ERROR"
protoPayload.authenticationInfo.principalEmail="<your-project-id>@appspot.gserviceaccount.com"
protoPayload.serviceData.jobCompletedEvent.job.jobStatus.error.message=*

Save it as a metric, and then create a Stackdriver Alert from it:

Boom! You’re done. You’ll now be notified when your BigQuery jobs fail first thing in the morning as you’re sipping your skinny-soy-latte.

Time for some contemplation

That’s now four options you have for scheduling BigQuery jobs. I’ll be the first to admit that it’s not a beautiful solution (but then, what workaround is?). However, it’s serverless, and an absolute walk in the park to set up. It’s also fully automated, and can scale too. You’d need a bucket anyway to host your Cloud Function, so having one more to host the dummy object/trigger isn’t the end of the world, now is it. Just make sure you don’t go deleting it (you could set up a Stackdriver alert to fire in case someone does accidentally).

Now, before I run off to try and think of a fifth way of doing this, I should probably mention this. Annoyingly, you can only configure the transfer to run once a day. If you need to run it N times a day, then your only option is to create N transfer configurations. Luckily, there’s an API, so you can script something up (see the sketch below) and not need to make a gazillion clicks in the console to make this happen. This gives you more time to waste, e.g. surfing the internet and reading my other blog posts.
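For the curious, here’s a rough (and untested, so don’t shoot me) sketch of what scripting that could look like with the googleapis Node.js client and the Storage Transfer API. The field names follow the v1 REST resource; the project ID, start date and the hours in the array are all placeholders you’d swap out for your own:

// Untested sketch: create several daily transfer configurations, one per
// desired run time of day. Project ID, dates and hours are placeholders.
const google = require("googleapis");
const storagetransfer = google.storagetransfer("v1");

google.auth.getApplicationDefault(function(err, authClient) {
  if (err) throw err;
  if (authClient.createScopedRequired && authClient.createScopedRequired()) {
    authClient = authClient.createScoped(["https://www.googleapis.com/auth/cloud-platform"]);
  }

  [6, 12, 18].forEach(function(hour) {
    storagetransfer.transferJobs.create(
      {
        auth: authClient,
        resource: {
          description: "faking-it trigger at " + hour + ":00",
          projectId: "your-project-id",
          status: "ENABLED",
          schedule: {
            scheduleStartDate: { year: 2018, month: 1, day: 30 },
            startTimeOfDay: { hours: hour, minutes: 0 }
          },
          transferSpec: {
            gcsDataSource: { bucketName: "faking-it-source" },
            gcsDataSink: { bucketName: "faking-it-destination" },
            transferOptions: { overwriteObjectsAlreadyExistingInSink: true }
          }
        }
      },
      function(createErr, transferJob) {
        if (createErr) {
          return console.error("Failed to create transfer job: ", createErr);
        }
        console.log("Created transfer job: ", transferJob.name);
      }
    );
  });
});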

Finally, when scheduling for Cloud Functions does arrive, all you’ll need to do is delete the GCS dummy bucket (and the transfer service job). Because you’ll already have your Cloud Function written, setting a scheduler on it will be a doddle.

Have fun my fellow BigQuery’ians.

7 Comments
  • Lucas Arruda
    Posted at 23:21h, 22 January

    Interesting approach! I have a question though: what would you do if your queries run for a few minutes or hours? Cloud Functions expect to return in 60 seconds. I mean, the only possible thing would be to trigger the job on BigQuery’s API but not have a way to know whether it failed or succeeded. Please advise.

    • Graham Polley
      Posted at 00:12h, 24 January

      I’ve updated the post 😉

  • Graham Polley
    Posted at 23:26h, 23 January

    All jobs run async when submitted to BigQuery. You’re right that you will not know if it succeeded or not – unless you wait and poll the job status. The time limit of Cloud Functions is 540 seconds (9 minutes: https://cloud.google.com/functions/quotas#time_limits). So, you should be alright to sit and poll. If your queries take longer than that, then you’ve probably got something wrong in your SQL. Alternatively, use Stackdriver logging to create an alert on failed BigQuery jobs. Easy!

  • Lucas Arruda
    Posted at 10:46h, 29 January

    I have some queries that take hours to finish, because they process a *lot* of data and join many tables. We will certainly optimize this in the future, but that’s how it works now, mostly because the queries have been ported from other on-prem database solutions. If we take into consideration that BigQuery allows queries to run for 6h, we should account for that in our solutions too. I liked your approach of using Stackdriver for handling failure notifications. Thank you for the update on the article too!!

  • SW
    Posted at 05:13h, 04 July

    Any thoughts on sequencing via this method?

    E.g. SQL A runs. SQL B cannot run until SQL A is finished. SQL C runs when A & B are finished.

  • Vladimir Yushkevich
    Posted at 02:48h, 20 August

    I have noticed that the transfer event is sometimes triggered 2 times (and even 3 times) per minute. I tried to find any differences in the event’s payload, but they are always the same. Looks like something is still missing in this approach for non-idempotent function calls. The conditions in lines 4-6 can’t protect against multiple function calls on the same event.

  • Alberto
    Posted at 19:56h, 09 January

    Thanks! Exactly what I was looking for.
