Dask and TensorFlow in Production at Grubhub

Hugo Bowne-Anderson
August 3, 2020

We recently caught up with Alex Egg, Senior Data Scientist at Grubhub, about the modern data science and machine learning methods Grubhub uses to understand the intent of someone using Grubhub Search. As Alex told us,

“Search is the top-of-funnel at Grubhub. That means when a user interacts with the Grubhub search engine, they want to be able to service their request with high precision and recall. One way to do that is to understand the intent of the user. Do they have a favourite restaurant in mind or are they just browsing cuisines? Are they looking for an obscure dish?”

In this post, we’ll summarize the key takeaways from the stream. We covered

  • Classic distributed ETL pipelines with Dask Dataframes,
  • Weak-supervision (labeling) w/ Snorkel & Dask (what a modern combo!),
  • Language Modeling and deployment w/ TensorFlow.

You can find the code for the session here. You can also check out the YouTube video here:


The task

Figuring out user intent is highly non-trivial, particularly when you don’t have labelled data mapping search queries to actual intent. As motivation, Alex posed a question straight away: if a user types “French” into the search field, what do you think they’re looking for?

Grubhub search results for "french" with "Le French Tart" as the first result.

There are several approaches to deal with the challenge of not having labeled data. A common approach is hand-labelling, but this can be both time- and resource-intensive: you can have employees, friends, and family do it, outsource to crowd-sourcing platforms such as MTurk, or use products like ScaleAI.

A more modern approach is to use weak supervision, in which “noisier or higher-level supervision is used as a more expedient and flexible way to get supervision signal, in particular from subject matter experts (SMEs).”

In Grubhub’s case, they provide a set of target labels and the options for weak supervision include utilizing

  • a pre-existing knowledge base (Distance Supervision)
  • pre-trained models
  • Keyword lists
  • Heuristics (such as using regular expressions to identify addresses!)
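
As a toy illustration of the keyword-list option, a labeling function can simply vote for a label whenever a query contains a known keyword. The label mapping, keyword set, and function below are hypothetical stand-ins, not Grubhub's actual code:

```python
# Hypothetical label mapping and keyword list for illustration only.
LABEL_MAPPING = {"CUISINE": 0}
ABSTAIN = -1  # convention: labelers abstain when they have no opinion

CUISINE_KEYWORDS = {"sushi", "thai", "pizza", "french", "mexican"}

def cuisine_keywords(query: str) -> int:
    """Vote CUISINE if any token matches a known cuisine keyword; else abstain."""
    tokens = query.lower().split()
    if any(tok in CUISINE_KEYWORDS for tok in tokens):
        return LABEL_MAPPING["CUISINE"]
    return ABSTAIN
```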

Such weak supervision is embedded in a broader production pipeline, schematized here:

A diagram depicting the flow from data platform to repository to production to deployment (as well as client to deployment).

We see therein

  • ETL, powered by Dask,
  • Weak supervision (Snorkel and Dask),
  • Modeling, using Keras and TensorFlow,
  • nbdev, a key infrastructure piece that allows Grubhub to integrate quickly and make deployment easy, and
  • Model deployment and maintenance, using TensorFlow

Let’s now go through each of these in a bit more detail!

ETL, Dask DataFrames, and Weak Supervision with Snorkel

A diagram depicting the flow from task graph through multi-task model.

Image from Ratner et al., 2019.

At Grubhub, Alex and his team have so many queries (millions per day) to build even a basic model from that pandas DataFrames aren’t enough. For this reason, they use Dask DataFrames.

For this session, Alex used an Amazon EC2 P2 instance: he had a GPU and 4 CPUs and the plan was to get a Dask local cluster up and running on them (and he succeeded!). For a bit more context, all ETL was from a data lake and powered by Dask.

As stated above, for the weak supervision, he required a label set to map search queries to:


He also required a set of rules / heuristics to determine the most likely label for each search. Here you can see the function that determines how likely it is for a query to be an address:

def address(x, futures):
    normd_query = x["query"].lower()
    exp = r'\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)'
    regex = re.compile(exp, re.IGNORECASE)
    if regex.search(normd_query):
        return LABEL_MAPPING['ADDRESS']
    return ABSTAIN

Note that the function can also ABSTAIN. This speaks to the fact that, for any given query, we essentially get all the functions for the different labels to vote on what the most likely label is.

There were several steps involved in getting this up and running, and we got to see the Dask dashboards in action while extracting the search query texts from the data lake. Hearing Alex and Matt geek out over the realtime value of the dashboards was a lot of fun (for Hugo, anyway):

Alex Egg from Grubhub using the Dask Status plot.

Then Alex performed weak supervision with Snorkel: the rules / functions (called “labeling functions” in the Snorkel docs) vote, and label predictions are made. Note that the outputs are soft predictions / probabilities, and that you need to threshold them to get the actual prediction.

We saw rules / heuristics such as the gibberish detector (can you guess what this does?) and entropy detector (detects accidental searches such as ‘cdcssdcads’).
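
One plausible reading of the entropy idea: keyboard mashes tend to have character statistics unlike real restaurant or dish names. The sketch below is our guess at the concept, not Grubhub's implementation; a detector would compare this score against a threshold:

```python
import math
from collections import Counter

def char_entropy(query: str) -> float:
    """Shannon entropy (bits per character) of a query's character
    distribution -- one rough signal for flagging accidental searches."""
    counts = Counter(query)
    n = len(query)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```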

We also saw the labeler in action with the example “sushi”, as a sanity check:

test_labelers("sushi") code and results in a Jupyter Notebook cell.

Note that there is disagreement (conflict) between labeling functions here and that this is fine and natural. How do you deal with this? Well, Snorkel has sophisticated methods to do this for you, which you can find out more about here (the relevant section starts at “Our goal is now to convert the labels from our LFs into a single noise-aware probabilistic label per data point”)!
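
Snorkel's LabelModel is far smarter than this (it learns the accuracies and correlations of the labeling functions), but a naive majority vote over the label matrix conveys the basic idea of turning conflicting votes, including abstentions, into per-class probabilities. This is a simplified stand-in, not Snorkel's algorithm:

```python
import numpy as np

ABSTAIN = -1

def majority_vote_probs(L: np.ndarray, n_classes: int) -> np.ndarray:
    """Turn a (n_queries, n_labelers) matrix of votes (ABSTAIN = -1)
    into per-class probabilities by simple vote counting."""
    probs = np.zeros((L.shape[0], n_classes))
    for i, row in enumerate(L):
        votes = row[row != ABSTAIN]
        if len(votes) == 0:
            probs[i] = 1.0 / n_classes          # no signal: uniform
        else:
            counts = np.bincount(votes, minlength=n_classes)
            probs[i] = counts / counts.sum()
    return probs
```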

Determining User Intent across a huge dataset

It was then time to use our labeling functions to determine user intent across a large number of queries. Essentially, the task here is to map these labeling functions across our huge dataset with Dask! (Side note: Matt was pleasantly surprised to find that Snorkel has a Dask sub-module!) In the process, we once again saw the Dask workers in action, and it was as hypnotic as ever.

Matt also showed us the profile plot and pointed out where you can see SSL, decompression, boto and other AWS machinery, and S3 data reads (likely Parquet):

A screenshot of Alex Egg from Grubhub using the Dask Profile.

We wrapped up this section by converting the output soft labels / probabilities to hard labels in order to then train the model and deploy it.
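
The soft-to-hard conversion can be as simple as an argmax with an abstain threshold. This is a sketch of the general technique, not the exact code from the session:

```python
import numpy as np

def to_hard_labels(probs: np.ndarray, threshold: float = 0.5,
                   abstain: int = -1) -> np.ndarray:
    """Pick the argmax class per row, but fall back to abstain when the
    top probability doesn't clear the confidence threshold."""
    preds = probs.argmax(axis=1)
    preds[probs.max(axis=1) < threshold] = abstain
    return preds
```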

Training the model using Keras

Before jumping into training immediately, like any good data scientist, Alex checked out the data to see what was going on. First, he noticed a serious class imbalance: the data was skewed, likely according to some sort of power law.

We discussed methods for dealing with such imbalances, including downsampling and weighting the loss function (for example, according to class weights and/or counts). Alex used a Dask utility truncate_tail that he wrote to downsample the data, mentioning that we would weight the loss function later also.
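
On the loss-weighting side, one common recipe is inverse-frequency class weights (the same formula as scikit-learn's "balanced" mode), which Keras accepts via model.fit(class_weight=...). A sketch, not Grubhub's truncate_tail utility:

```python
import numpy as np

def class_weights(labels: np.ndarray) -> dict:
    """Inverse-frequency weights: rare classes in the power-law tail get
    proportionally larger weight in the loss."""
    classes, counts = np.unique(labels, return_counts=True)
    total = len(labels)
    return {int(c): total / (len(classes) * n)
            for c, n in zip(classes, counts)}
```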

We discussed the type of models we could use:

  • Naive Bayes generally works pretty well out of the box but doesn’t care about order,
  • Subword tokenization (creating sets of n-grams) is relatively lightweight and considers character order so Alex went with this, combined with a logistic regression model.

He used Keras to pass the n-gram embeddings into a pooling layer, which essentially averages together all of the embeddings, and then we watched the epochs roll on in as he trained the model and plotted the learning curves:
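
A minimal sketch of such a model: n-gram embeddings, an averaging pooling layer, and a softmax head (effectively a logistic regression over the averaged embeddings). The vocabulary size, embedding dimension, and class count below are made up for illustration:

```python
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, N_CLASSES = 20_000, 64, 10  # illustrative sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.GlobalAveragePooling1D(),  # averages the n-gram embeddings
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Class weights from the downsampling discussion above would be passed at training time via model.fit(..., class_weight=...).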

Training and validation accuracy and training and validation loss plots side-by-side.

We also saw, for example, that the model gave perfect predictions for ‘Tobacco’ searches, even though there was relatively little data in the training set with this label!

We tested a variety of queries and saw that, even when ‘cigarettes’ was misspelt ‘cigaretes’, the model worked correctly (the model is robust to spelling!):

A screenshot of a Jupyter Notebook cell with a matplotlib plot depicting "cigarettes" query results.

Deploying the model using TensorFlow

Now we came up against a pretty serious challenge: how to get this model into production. For productionized prediction, you would need to reimplement a lot of things in Java, and a great deal would get lost in translation.

An alternative to this is to do everything in TensorFlow or, as Alex said, in a “nice hermetically sealed TensorFlow runtime.” Alex also made the key point that this is not merely a technical question, but inherently a social, cultural, and political challenge: if you want to get your ML models deployed within your org, you had better make it as easy as possible for your colleagues to do so.

To this end,

  • Alex reimplemented the pipeline using the TensorFlow Estimator API,
  • exported the model using the SavedModel format,
  • walked us through the Python client and then the Java client,
  • mentioned that he can then easily turn this into a PyPI package using nbdev (the point of deploying this as a package is so that they can instrumentalize it using their internal job system), and
  • told us that this is the model that is actually running in production at Grubhub: Dask, Snorkel, TensorFlow, and many other packages for the win!

TensorFlow diagram with raw input, preprocessing, and model sections.
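
The export step can be sketched with a toy model. SavedModel is language-agnostic, which is what lets the Java client load the exact same artifact the Python side produced; the model, path, and shapes here are illustrative only:

```python
import tensorflow as tf

class ToyIntentModel(tf.Module):
    """Stand-in for the intent model: a single linear layer plus softmax."""
    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.ones([4, 3]))

    @tf.function(input_signature=[tf.TensorSpec([None, 4], tf.float32)])
    def predict(self, x):
        return tf.nn.softmax(x @ self.w)

model = ToyIntentModel()
# Export in the language-agnostic SavedModel format (hypothetical path).
tf.saved_model.save(model, "/tmp/intent_model/1")

# Any SavedModel-aware runtime (Python, Java, TF Serving) can now load it.
restored = tf.saved_model.load("/tmp/intent_model/1")
probs = restored.predict(tf.zeros([2, 4]))
```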

If you enjoyed this recap, check out the video and sign up to our newsletter for updates on future #ScienceThursday live streams!

Level up your Dask using Coiled

Coiled makes it easy to scale Dask maturely in the cloud.