A tutorial with code using PySpark, Hugging Face, and AWS GPU instances
Do you want up to a 100x speed-up and 50% cost savings when running inference with Hugging Face or TensorFlow models? With GPU instances and Spark, we can run inference concurrently on anything from two to hundreds of GPUs, gaining even more performance almost effortlessly.
For this tutorial, we will be using Databricks; you may sign up for a free account if you do not already have one. Databricks must connect to a cloud hosting provider like AWS, Google Cloud Platform, or Microsoft Azure to run GPU instances.
This exercise will use AWS GPU instances of type “g4dn.xlarge.” However, you can still follow these instructions on Google Cloud or Microsoft Azure by selecting an equivalent GPU instance there.
Once you have set up your Databricks account, log in and create a cluster with the configuration shown below:
Next, create a notebook and attach it to the cluster by selecting it in the dropdown menu:
Now, we are all set to code.
First, let us install the Hugging Face Transformers library on the cluster.
Run this in the first cell of the notebook:
%pip install transformers==4.2
Hugging Face Transformers Python library installed on the cluster
Libraries installed this way are called notebook-scoped Python libraries. They are convenient, but because installing them resets the Python interpreter, the %pip command must be run at the start of a session, before any other code.
At this point, we start with actual Python code. In the next cell, run:
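A minimal sanity check, such as importing the library and printing its version, is enough here:

import transformers

# Should print something like 4.2.x if the install above succeeded.
print(transformers.__version__)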
Congratulations! If the cell above runs without errors, Hugging Face Transformers is installed and ready to use.
The easiest way to create data that Spark can process in parallel is by creating a Spark DataFrame. For this exercise, a DataFrame with two rows of data will suffice:
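For example, a two-row DataFrame with made-up titles and abstracts (the sample text below is purely illustrative) can be built like this:

# Two rows of sample data; each row carries the two text inputs the model expects.
data = [
    ("A study of Spark partitioning",
     "We examine how partitioning strategies affect Spark performance."),
    ("Scaling transformer inference",
     "We describe how to run transformer model inference on many GPUs."),
]

# "spark" is the SparkSession that Databricks provides in every notebook.
df = spark.createDataFrame(data, schema="title string, abstract string")
df.show(truncate=False)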
We get this DataFrame:
The Transformer model for this exercise takes in two text inputs per row. We name them “title” and “abstract” here.
For the curious, here is an excellent article by Laurent Leturgez that dives into Spark partitioning strategies:
On Spark Performance and partitioning strategies
We shall use the fantastic Pandas UDF support in PySpark to process the Spark DataFrame in memory-efficient partitions. Each partition is handed to our code as a pandas DataFrame, which, as you will see below, arrives as the parameter “df” of the function “embed_func.” A pandas DataFrame makes it convenient to process the data in Python.
Code for embed_func():
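A minimal sketch of such a function appears below. It assumes the SPECTER checkpoint "allenai/specter" from the Hugging Face hub, a batch size of 20 rows, and that the [CLS] token output is taken as the document embedding; treat these details as illustrative rather than definitive.

import pandas as pd
import torch
from transformers import AutoModel, AutoTokenizer

def embed_func(df: pd.DataFrame) -> pd.DataFrame:
    # Load the tokenizer and model once per partition and move the model to the GPU.
    tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
    model = AutoModel.from_pretrained("allenai/specter").to("cuda")
    model.eval()

    batch_size = 20  # infer 20 rows at a time to keep GPU memory usage bounded
    embeddings = []
    for start in range(0, len(df), batch_size):
        batch = df.iloc[start:start + batch_size]
        # SPECTER expects "title [SEP] abstract" as a single input string per document.
        texts = (batch["title"] + tokenizer.sep_token + batch["abstract"]).tolist()
        inputs = tokenizer(texts, padding=True, truncation=True,
                           max_length=512, return_tensors="pt").to("cuda")
        with torch.no_grad():
            outputs = model(**inputs)
        # Take the [CLS] token embedding and copy it from GPU memory to CPU memory.
        embeddings.extend(outputs.last_hidden_state[:, 0, :].cpu().detach().numpy().tolist())

    result = df.copy()
    result["embedding"] = embeddings
    return result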
You might have noticed two things from the code above:
When the GPU runs inference with this Hugging Face transformer model, the inputs and outputs are stored in GPU memory. GPU memory is limited, and a large transformer model needs much of it just to hold its parameters, leaving comparatively little room for the inputs and outputs.
Hence, we control memory usage by running inference on just 20 rows at a time. Each time 20 rows are processed, the call to “.cpu().detach().numpy()” copies the output from the GPU to a NumPy array in CPU memory, which is far more abundant.
As mentioned above, this is where the Pandas UDF for PySpark comes in: the Pandas UDF is “embed_func” itself, and we apply it to the Spark DataFrame as shown below. Read the link above to learn more about this powerful PySpark feature.
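One way to apply it is the mapInPandas Pandas Function API, which feeds embed_func batches of rows as pandas DataFrames. Here is a sketch, assuming the column names used above:

from typing import Iterator

import pandas as pd

def embed_partitions(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # mapInPandas hands us an iterator of pandas DataFrames; embed each batch in turn.
    for pdf in batches:
        yield embed_func(pdf)

# Each row gains an "embedding" column holding an array of 768 floating-point values.
result_df = df.mapInPandas(
    embed_partitions,
    schema="title string, abstract string, embedding array<double>",
)

display(result_df)  # or result_df.show() outside Databricks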
The resulting output of this exercise is a DataFrame containing rows of floating-point arrays that represent document embeddings. SPECTER creates embeddings that are vectors of 768 floats.
I hope you see how Spark, Databricks, and GPU instances make scaling up inference with large transformer models relatively trivial.
The technique shown here lets us run inference on millions of rows and complete it in hours instead of days or weeks. Thus, processing big data sets with large transformer models becomes feasible in many more situations.
But wait, there is more. Although a GPU instance costs 5 to 20 times as much per hour as a CPU instance, inference on it can still be more cost-effective because it runs 30 to 100 times faster. For example, an instance that costs 10 times as much per hour but finishes 50 times sooner cuts the total bill by 80%. Since we pay for the instance by the hour, time is money here.
Data is easily imported into Databricks and saved as Parquet files in AWS S3 buckets or, even better, Delta Lake tables (a.k.a. Hive tables on steroids). Then, we can manipulate the data via Spark DataFrames, which, as this article shows, are trivial to parallelize for transformation and inference.
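For instance, persisting the embeddings DataFrame from this exercise (result_df in the sketch above) as a Delta Lake table takes a single line; the table name here is just an example:

# Save the embeddings as a managed Delta Lake table so other jobs and teams can query it.
result_df.write.format("delta").mode("overwrite").saveAsTable("paper_embeddings")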
All data, code, and computing are accessible and managed in one place “on the cloud.” This neat solution is also intrinsically scalable as the data grows from gigabytes to terabytes, making it more “future-proof.”
Being a cloud-based solution also means that as the team grows, we can add more people to the project with secure access to the data and the notebook code. And with just a few clicks, we can create charts and reports to share with other teams.
Stay tuned for a TensorFlow take on this article and for more articles I plan to write. If you found this helpful, please follow me; I am a new writer, and I need your help. Please post your thoughts and questions if you have any.