High-performance Inferencing with Transformer Models on Spark

A tutorial with code using PySpark, Hugging Face, and AWS GPU instances

Do you want up to a 100x speed-up and 50% cost savings when running Hugging Face or TensorFlow models? With GPU instances and Spark, we can run inferencing concurrently on anywhere from two to hundreds of GPUs, scaling performance with little extra effort.

Overview

  • Set up Driver and Worker instances
  • Partitioning data for parallelization
  • Inferencing with Transformer Models
  • Discussion

 

Set up Driver and Worker instances

For this tutorial, we will be using Databricks; you can sign up for a free account if you do not already have one. Databricks must connect to a cloud hosting provider such as AWS, Google Cloud Platform, or Microsoft Azure to run GPU instances.

This exercise uses the AWS GPU instance type “g4dn.xlarge.” However, you can still follow these instructions on Google Cloud or Microsoft Azure by selecting an equivalent GPU instance there.

Once you have set up your Databricks account, log in and create a cluster with the configuration shown below:

[Screenshot: Databricks cluster configuration with GPU driver and worker instances]

Next, create a notebook and attach it to the cluster by selecting the cluster in the dropdown menu:

[Screenshot: attaching the notebook to the cluster via the dropdown menu]

Now, we are all set to code.

 

Installing Hugging Face Transformers

First, let us install the Hugging Face Transformers library on the cluster.

Run this in the first cell of the notebook:

%pip install transformers==4.2

[Screenshot: Hugging Face Transformers Python library installed on the cluster]

Libraries installed this way are called Notebook-scoped Python libraries. They are convenient, but because installing them resets the Python interpreter, the %pip command must be run at the start of a session, before any other code.

At this point, we start with actual Python code. In the next cell, run:
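A minimal check is enough here, for example importing the library and printing its version:

# Confirm that the library installed with %pip can be imported
import transformers

print(transformers.__version__)  # expect something like 4.2.0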

Congratulations! If the cell above runs without errors, Hugging Face Transformers has been installed successfully.

 

Partitioning data for parallelization

The easiest way to create data that Spark can process in parallel is by creating a Spark DataFrame. For this exercise, a DataFrame with two rows of data will suffice:
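For example, a couple of illustrative rows like these will do (the titles and abstracts below are placeholders; use your own documents):

from pyspark.sql import Row

# Two illustrative rows; replace with your own titles and abstracts
rows = [
    Row(title="Attention Is All You Need",
        abstract="The dominant sequence transduction models are based on..."),
    Row(title="BERT: Pre-training of Deep Bidirectional Transformers",
        abstract="We introduce a new language representation model..."),
]

df = spark.createDataFrame(rows)
df.show(truncate=40)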

We get this DataFrame:
[Screenshot: the resulting two-row DataFrame with “title” and “abstract” columns]

The Transformer model for this exercise takes in two text inputs per row. We name them “title” and “abstract” here.

For the curious, here is an excellent article by Laurent Leturgez that dives into Spark partitioning strategies:
On Spark Performance and partitioning strategies

 

Inferencing with Transformer Models

We shall use the fantastic Pandas UDF for PySpark to process the Spark DataFrame in memory-efficient partitions. Each partition of the DataFrame is presented to our code as a Pandas DataFrame, which, as you will see below, arrives as the parameter “df” of the function “embed_func.” A Pandas DataFrame makes it convenient to process the data in Python.

Code for embed_func():
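As a sketch, embed_func can look something like the following, assuming the SPECTER checkpoint published under the Hugging Face model id “allenai/specter” and a scalar Pandas UDF that receives the “title” and “abstract” columns as a single struct column:

import pandas as pd
import torch
from pyspark.sql.functions import pandas_udf
from transformers import AutoModel, AutoTokenizer

@pandas_udf("array<float>")
def embed_func(df: pd.DataFrame) -> pd.Series:
    # Load the tokenizer and model once per partition and move the model to the GPU
    tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
    model = AutoModel.from_pretrained("allenai/specter")
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()

    batch_size = 20  # number of rows inferenced per pass, to limit GPU memory usage
    embeddings = []
    for start in range(0, len(df), batch_size):
        chunk = df.iloc[start:start + batch_size]
        # SPECTER expects "title [SEP] abstract" as a single input string
        texts = (chunk["title"] + tokenizer.sep_token + chunk["abstract"]).tolist()
        inputs = tokenizer(texts, padding=True, truncation=True,
                           max_length=512, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model(**inputs)
        # Take the [CLS] token embedding and copy it off the GPU into CPU memory
        batch_embeddings = outputs.last_hidden_state[:, 0, :].cpu().detach().numpy()
        embeddings.extend(batch_embeddings.tolist())
    return pd.Series(embeddings)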

You might have noticed two things from the code above:

  • The code further splits the input text in the Pandas DataFrame into chunks of 20 rows, as defined by the variable “batch_size.”
  • We use SPECTER by AllenAI, a pre-trained language model that generates document-level embeddings of documents (pre-print here). We can easily swap it for another Hugging Face model such as BERT.

Working around GPU memory limits

When the GPU runs inference with this Hugging Face Transformer model, the inputs and outputs are stored in GPU memory. GPU memory is limited, and a large transformer model already needs much of it to store its parameters, leaving comparatively little to hold the inputs and outputs.

Hence, we control memory usage by inferencing just 20 rows at a time. Each time 20 rows are processed, the call to “.cpu().detach().numpy()” in embed_func copies the output from the GPU into a NumPy array, which resides in the far more abundant CPU memory.

Finally - Transformer model inferencing on GPU

As mentioned above, this is where the Pandas UDF for PySpark is executed. In this case, the Pandas UDF is “embed_func” itself. Read the link above to learn more about this powerful PySpark feature.
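Applying the UDF can look like this sketch, where the “title” and “abstract” columns are packed into a struct for embed_func (display() is the Databricks helper for rendering a DataFrame; result.show() works anywhere):

from pyspark.sql.functions import struct

# Spark splits the DataFrame into partitions and runs embed_func on each one in parallel
result = df.withColumn("embedding", embed_func(struct("title", "abstract")))
display(result)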

The resulting output of this exercise is a DataFrame containing rows of floating-point arrays that represent document embeddings. SPECTER creates embeddings that are vectors of 768 floats.

[Screenshot: the output DataFrame with an “embedding” column of 768-float vectors]

 

Discussion

I hope you see how Spark, Databricks, and GPU instances make scaling up inferencing with large transformer models relatively trivial.

The technique shown here allows for running inference on millions of rows and completing it in hours instead of days or weeks. Thus, processing big data sets on large transformer models becomes feasible in more situations.

Cost Savings

But wait, there is more. Despite costing 5 to 20 times as much as a CPU instance, a GPU instance can be more cost-effective for inferencing because it is 30 to 100 times faster. For example, an instance that costs 10 times as much per hour but finishes the job 50 times faster ends up costing one-fifth as much.

Since we pay for the instance on an hourly basis, time is money here.

Less time spent on plumbing

Data is easily imported into Databricks and saved as Parquet files on AWS S3 buckets or, even better, Delta Lake tables (a.k.a. Hive tables on steroids). Then, we can manipulate the data via Spark DataFrames, which, as this article shows, are trivial to parallelize for transformation and inferencing.
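For example, persisting the embeddings computed above takes a single line; the S3 path and table name below are placeholders for your own:

# Save the embeddings as Parquet on S3 (placeholder path)
result.write.mode("overwrite").parquet("s3://your-bucket/specter-embeddings/")

# Or register them as a Delta Lake table
result.write.format("delta").mode("overwrite").saveAsTable("specter_embeddings")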

All data, code, and computing are accessible and managed in one place “on the cloud.” This neat solution is also intrinsically scalable as the data grows from gigabytes to terabytes, making it more “future-proof.”

Collaborate Seamlessly

Being a cloud-based solution means that as the team grows, we can add more people to the project with secure access to the data and the notebook code. We can also create charts for reports to share with other teams in just a few clicks.

Stay tuned for a TensorFlow take on this article and for more articles I plan to write. If you found this helpful, please follow me. I am a new writer, and I need your help. Please post your thoughts and questions if you have any.