We and selected third parties use cookies or similar technologies for technical purposes and, with your consent, for other purposes as specified in the cookie policy. Denying consent may make related features unavailable.
You can consent to the use of such technologies by using the “Accept” button, by closing this notice, by scrolling this page, by interacting with any link or button outside of this notice or by continuing to browse otherwise.
What We Learned from Deploying Fine-tuned LLMs in Production
Rohith Mukku, AI Researcher at Roots Automation; NYU, IIT Kanpur
August 26, 2024
Executive Summary
With advancements in Generative AI happening every day, more organizations are incorporating these models into their services to meet business requirements. As this trend grows, fine-tuning generative models for specific use cases has become increasingly important. In this article, we share our findings from deploying customer-specific fine-tuned LLMs in production.
We evaluated several frameworks for self-hosting large language models, including Hugging Face, NVIDIA Triton, and vLLM. While both NVIDIA Triton and vLLM emerged as leading solutions, we preferred vLLM due to a more favorable experience during our initial testing.
Using a fine-tuned 7B Mistral model, we demonstrate vLLM's performance in production by considering input tokens, output tokens, batch size, and parallel requests. Our results indicate that vLLM, with minimal manual tweaking, achieves a throughput of up to 130 tokens per second on an A100 with large text inputs (averaging 8k tokens). It can handle up to 32 concurrent requests, simulating a real-time workflow where requests are processed in parallel without prior batching. This makes vLLM ideal for hosting models used in real-time applications such as document classification, extraction, and summarization for large documents (averaging 8-10 pages). This setup can process about 20-30 million documents annually at an on-demand cost of $30,000, offering a more cost-effective solution compared to alternatives and reducing dependency on third parties and their API quotas. We also compare performance with a less expensive T4 option ($5000 annually) and a consumer-grade GPU like the RTX 3090 (typically not suitable for most businesses).
While the low cost and high volume of self-hosting are appealing, the main motivation is accuracy. Fine-tuned models consistently outperform GPT-4 models in specialized tasks such as business-specific entity extraction. Further details on the training process and accuracy enhancements will be explored in a subsequent article.
Introduction
In this article, we dive into the practicalities of deploying fine-tuned large language models (LLMs) in production environments. You'll take away three key insights:
Understanding the vLLM framework and its advantages.
Comparing vLLM's performance as a function of input/output tokens, batch size, GPU machines, and more.
Practical advice on selecting GPU configurations for business needs.
Before we start, let's address the question:
Why do we need to fine-tune instead of using other methods like Prompt Engineering [1], RAG [15]?
Each approach has its strengths and is suitable for different scenarios, and sometimes they are used together in a complementary manner. Prompt engineering involves designing prompts to maximize the model's efficiency, while retrieval-augmented generation (RAG) combines LLMs with external knowledge sources to incorporate up-to-date information.
Fine-tuning, on the other hand, is crucial when you need to teach a model "new skills" and capture the nuances of specific use cases and domains, such as healthcare, insurance, finance etc. For example, in the insurance industry, fine-tuning can help an LLM accurately identify business-specific claim numbers and claimant names in documents, achieving a level of accuracy necessary for business operations.
In the following sections, the article will focus on fine-tuning and specifically, on deploying a fine-tuned model in production using vLLM.
vLLM
vLLM [4] (Virtually Large Language Model) is an open-source inference optimization framework for LLMs, developed at UC Berkeley. This framework was introduced in June 2023 and has become a popular framework rivalling the likes of TensorRT-LLM [5] by Nvidia.
Developers of vLLM are also the main authors of PagedAttention [4], a hardware-efficient attention algorithm which mirrors the idea of paging in operating systems. This has significantly contributed to vLLM's current popularity and its status as one of the most efficient frameworks for LLM inference. During inference, a decoder LLM uses the previous context (seen tokens) to generate a new token and continues repeating this step autoregressively until it reaches maximum output length, or it outputs an end-of-sentence (EOS) token or stop sequence.
How does vLLM work?
KV Caching:
When predicting a token, its likelihood is determined by applying softmax on attention scores. These scores are obtained by scaling the dot product between the query vector and key vectors of all previously seen tokens, followed by multiplication with their value vector. Caching these key and value vectors avoids the need to recalculate attention scores and probabilities for each token, thereby speeding up the inference process. Traditional storage methods can lead to inefficient memory usage. To combat this, the vLLM model adopts a Paged Attention mechanism, drawing inspiration from the Virtual Paging system in Operating Systems. This approach segments the KV Cache into blocks, each storing keys and values for a specific number of tokens. By eliminating the need for contiguous memory, this strategy prevents memory fragmentation and enhances the efficiency of the KV cache.
This is another popular technique that is used by almost all inference frameworks. It helps enhance the efficiency and throughput of LLM inference by dynamically managing incoming requests through batching. The server handles incoming requests, which arrive asynchronously and in varying sizes, by grouping them efficiently for next token prediction. This grouping, or batching, can occur in two ways: either by assembling requests in the order they arrive until a batch is complete, or by setting a time limit to wait for additional requests before forming a batch.
Speculative decoding aims to optimize model's generation speed by predicting multiple potential paths of the predictions using a smaller model in parallel. In essence, to boost the performance of a larger model, a smaller counterpart operates alongside it to forecast multiple potential outcomes. These predictions are then assessed through different methods to identify the most accurate sequence of tokens. Once determined, this sequence allows the larger model to bypass generating these tokens itself, effectively speeding up its text generation process.
Why vLLM?
vLLM is easy to install with few additional dependencies.
vLLM includes an OpenAI-style server implementation that can serve as a replacement for OpenAI models.
vLLM also supports RoPE [11] scaling (linear interpolation [12] and dynamic [13]) to extend the context length of models. This can be useful for times when you want to run the model inference on longer texts.
Experiment Setup
Metrics
Throughput (Output tokens generated per second)
Latency (Time taken per request)
Model
We use a Mistral 7B Instruct v2 model as it performs very well for its size and is manageable with ease on a single GPU.
We compare this model with the baseline that is the Huggingface Mistral model, which can be treated as a minimalistic model for inference.
Data
Internal dataset of around 200 diverse samples whose input tokens range from 1000 to over 30000. (This dataset is used for analysis sections: 1, 2, 3, 4 and 5. A different/bigger dataset is used for analysis 6.
This dataset contains documents used for entity extraction. Expected output token length usually varies from 100 to 200 (predicted output length can sometimes go over 200 as per observations).
Most of the documents have less than 20 pages and less than 20000 input tokens.
Analysis 1: vLLM vs Hugging Face models
This comparison studies the throughput of vLLM inference and the performance of the native Hugging Face (HF) model. The results indicate that vLLM improves generation speed by approximately 25 times, even with KV caching enabled on the HF model.
Key Observations:
Across all vLLM and HF configurations, there is a general trend of a slight decrease in generation speed as the number of input tokens increases. This decrease is more pronounced for vLLM compared to HF chat models.
The quantized version of vLLM ("vLLM+awq") consistently shows higher generation speeds compared to the unquantized version.
Surprisingly, the quantized version of HF models ("hf_chat+awq") has a much lower generation speed compared to its unquantized counterpart.
The difference in generation speed between quantized and unquantized versions is much more significant in the HF chat models than in vLLM models.
Note that these results are obtained from offline inference. Therefore, there are no asynchronous requests to the inference server.
Analysis 2: vLLM's Throughput vs Input Context Length
We study the throughput (in tokens/second) of the quantized vLLM model ("vLLM+awq") across varying input token sizes.
Configuration:
Inference method: vLLM + AWQ
Batch Size: 1
Offline inference
GPU machine: 80GB A100
Output tokens range: 100-200
Key Observations
As the number of input tokens increases, there's a noticeable decline in the model's throughput.
A significant drop in throughput occurs around 8k input tokens. The exact reason for this drop is unclear.
Interestingly, the increase in processing time is not linear, suggesting some efficiency in handling larger inputs.
Analysis 3: vLLM's Throughput vs Output Length
Configuration:
Inference method: vLLM + AWQ
Batch Size: 1
Offline inference
GPU machine: 80GB A100
Input tokens range: 4096 to 16384
Key Obervations
We notice a gradual increase in throughput as the number of output tokens increases, which suggests an efficiency gain as the model generates more tokens, potentially due to amortizing fixed overheads over a larger number of tokens.
The performance also depends on the input token range: smaller input token counts yield higher throughput than larger ones.
Analysis 4: vLLM's Throughput vs Batch Size
This experiment analyzes vLLM's performance at different batch sizes.
Configuration:
Inference method: vLLM + AWQ
Max Input Tokens: 16384
Max Out Tokens: 400
Offline inference
GPU machine: 80GB A100
Key Obervations
The total processing time increases non-linearly as a function of batch size, pointing to a complex relationship between batch size and processing efficiency.
Average generation speed increases up to a batch size of 8 or 16. We're not sure why it plateaus after.
Shorter inputs (1024 tokens), larger batch sizes (e.g., 32) significantly improve efficiency. However, as input length increases, the efficiency gains from larger batch sizes become less pronounced.
Identifying the optimal batch size involves balancing throughput maximization against computational costs, especially as input lengths vary. There's a delicate balance between processing speed and computational load, with efficiency gains plateauing or diminishing beyond certain batch sizes.
Out-of-Memory Errors occur when batch sizes exceed 64
Analysis - 5: vLLM's Performance on Different GPU Machines
This study compares vLLM's performance on different GPU machines. It's important to note that the evaluation was conducted under specific conditions: both the maximum input prompt and the maximum output lengths were set to 8192 and 400 tokens, respectively. The version of the model tested is an AWQ-quantized variant, as the non-quantized version does not fit on lower-end GPUs. The vLLM model demonstrates remarkable performance across all tested machines, with the Nvidia A100 GPU delivering the superior performance of the group. For this study, we did not include the H100 GPUs, as they are not utilized for inference in our production workflow, despite being our choice for training all fine-tuned models.
Configuration used:
Inference method: vLLM + AWQ
Max Input Tokens: 16384
Max Out Tokens: 400
Batch Size: 1
Offline inference
GPU Machine
GPU VRAM
Annual Cost (on demand)
Throughput (out tokens/sec)
A100
80GB
~ $30,000 USD
83.00
T4
16GB
~ $10,000 USD
21.96
RTX 3090
24GB
~ $5,000 USD
72.14
Key Observations:
FlashAttention-2 [8] backend is not supported for Volta and Turing GPUs.
The V100 GPUs lack AWQ support, precluding the possibility of running quantized vLLM inference on them. However, the T4 GPUs do not face this issue.
Interestingly, the RTX 3090 GPU achieves performance levels comparable to the A100, which is impressive. On the other hand, the T4 GPU's performance is markedly lower, which is expected due to the absence of flash attention support.
The Nvidia A100 GPU stands out as the leading choice for running the vLLM model based on our tests, with the RTX 3090 and T4 following in performance. It's important to note that the H100, which will likely outperform the others, was not included in our tests. In choosing a GPU, factors such as the specific task demands, including performance, power efficiency, and budget, as well as the operational setting's infrastructure support, such as cooling systems and power supply, should be considered.
Although the A100 provides unmatched performance, its higher cost may not be justifiable for all organizations. The RTX 3090 or T4 may represent more cost-effective alternatives that still meet specific needs for a balance between performance and efficiency. However a consumer-grade 3090 may not be an option for most businesses.
Analysis 6: vLLM's Memory Usage, Throughput on Concurrent Requests
Until now, the analysis has focused on vLLM's offline inference. This section explores vLLM's ability to handle online inference by assessing its throughput with different numbers of concurrent requests. It is important to distinguish concurrent requests from batched requests: unlike batch processing, concurrency involves the server handling multiple simultaneous requests, each with a batch size of one, which is a scenario more reflective of real-time production environments.
Configuration:
Inference method: vLLM + AWQ
Online inference
Max Input Tokens: 16384
Max Out Tokens: 400
Number of parallel requests: 1, 2, 4, 8, 16, 32, 64
Total number of requests: 256
GPU machine: 80GB A100
The vLLM framework exhibits excellent scalability, demonstrating efficient utilization of GPU resources with increased parallel requests.
Compared to the A100, the T4 GPU has limited scalability, handling only up to 4 parallel requests before server errors occur, whereas the A100 can manage up to 32 concurrent requests.
Throughput on the T4 GPU starts at approximately 10 tokens/sec for a single request and rises to 12 tokens/sec with four parallel requests. In contrast, the A100's throughput jumps from 55 tokens/sec for a single request to 130 tokens/sec for 32 requests.
Conclusion
The introduction of the vLLM framework represents a significant advancement in optimizing LLM deployments for efficiency and scalability. The framework enhances memory usage and computational efficiency, particularly through its PagedAttention feature. While NVIDIA Triton is a strong competitor to vLLM, we chose vLLM due to a more favorable experience during the initial testing phase
A100 GPUs, offering high scalability and throughput, are ideal for demanding applications but come with a high cost—up to $30,000 annually—limiting their use to high-volume, high-ROI projects. Conversely, T4 GPUs present a more budget-friendly option at $5,000 to $10,000, suitable for businesses with stricter inference budget constraints. The RTX3090’s performance, closely mirroring that of the A100, suggests the untapped potential of consumer-grade hardware.
The article shifts focus from batch sizes to managing concurrent requests, aligning more closely with real-world production scenarios. We are hoping it helps the user find a balance between computational efficiency and response speed in LLM applications and assist them in selecting the optimal GPU configuration for their needs by considering key factors such as input size, output size, incoming volume of requests, and budget constraints.
The article discusses the vLLM framework's benefits and its transformative role in generative AI, touching on its applications in diverse fields such as customer service and predictive analysis. However, it also notes a gap in research on GPU memory usage for vLLM, marking this as an area for future exploration.
About the Author
Rohith Mukku is an AI Researcher at Roots Automation, where he is developing a universal document understanding model with a focus on optimizing inference for large language and vision models. Prior to joining Roots, he earned his master's in computer science from New York University, with a focus on advancing Behavioral Cloning in robotics and evaluating the effects of red-teaming on large language models (LLMs). Rohith completed his undergraduate degree in computer science at IIT Kanpur and previously worked as a software engineer at Samsung R&D Institute Delhi, focusing on Tizen kernel and Visual Display applications.
References
[1] White, Jules, et al. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv:2302.11382, arXiv, 21 Feb. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2302.11382.
[3] Hu, Edward J., et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, arXiv, 16 Oct. 2021. arXiv.org, https://doi.org/10.48550/arXiv.2106.09685.
[4] Kwon, Woosuk, et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv:2309.06180, arXiv, 12 Sept. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2309.06180.
[7] Leviathan, Yaniv, et al. Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192, arXiv, 18 May 2023. arXiv.org, https://doi.org/10.48550/arXiv.2211.17192.
[8] Frantar, Elias, et al. GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers. arXiv:2210.17323, arXiv, 22 Mar. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2210.17323.
[9] Lin, Ji, et al. AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978, arXiv, 23 Apr. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2306.00978.
[11] Su, Jianlin, et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864, arXiv, 8 Nov. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2104.09864.
[12] Chen, Shouyuan, et al. Extending Context Window of Large Language Models via Positional Interpolation. arXiv:2306.15595, arXiv, 28 June 2023. arXiv.org, https://doi.org/10.48550/arXiv.2306.15595.
[14] Dao, Tri, et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135, arXiv, 23 June 2022. arXiv.org, https://doi.org/10.48550/arXiv.2205.14135.
The rich text element allows you to create and format headings, paragraphs, blockquotes, images, and video all in one place instead of having to add and format them individually. Just double-click and easily create content.
Static and dynamic content editing
A rich text element can be used with static or dynamic content. For static content, just drop it into any page and begin editing. For dynamic content, add a rich text field to any collection and then connect a rich text element to that field in the settings panel. Voila!
How to customize formatting for each rich text
Headings, paragraphs, blockquotes, figures, images, and figure captions can all be styled after a class is added to the rich text element using the "When inside of" nested selector system.
Fusce non convallis mi. Curabitur nec rutrum orci. Etiam vitae diam ut tellus venenatis ultricies. Fusce vitae ipsum sed urna tempor tempor et vitae dui.
Fusce vulputate molestie est
Fusce non convallis mi. Curabitur nec rutrum orci. Etiam vitae diam ut tellus venenatis ultricies. Fusce vitae ipsum sed urna tempor tempor et vitae dui. Aliquam nibh ante, tempus vel ultricies nec, tempus sed felis. Nullam et efficitur velit. Aenean odio nulla, facilisis a commodo eu, suscipit at augue.
Aliquam rutrum dui sapien. Aliquam pulvinar lectus accumsan est dictum, et faucibus justo ornare. Mauris placerat placerat consequat. Donec commodo consectetur nunc, et posuere orci lacinia sed. Duis mollis, eros quis porta laoreet, mi est euismod lectus, vitae volutpat quam enim congue tellus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin ornare laoreet consequat. Integer at accumsan lacus, eget ultricies augue. Vestibulum semper sapien at venenatis pretium. Integer nec iaculis lacus. Sed elit nisi, luctus sit amet vehicula nec, mattis nec purus. Nulla facilisi. Nam ornare in justo eget facilisis.
Praesent sit amet lectus quis metus sagittis tempor.
Sed mattis ipsum vitae turpis laoreet condimentum
Sed orci erat, rhoncus efficitur eros a, sollicitudin commodo tortor
Sed accumsan ex viverra est tincidunt bibendum a non nulla curabitur eget ligula mauris
Nam ut sagittis velit suspendisse ullamcorper quis lorem vitae hendrerit
Vivamus diam orci, dignissim ac nulla hendrerit, porttitor posuere risus
Cras vel leo mattis viverra tellus eget vestibulum est
Praesent sit amet lectus quis metus sagittis tempor.
Sed mattis ipsum vitae turpis laoreet condimentum.
Sed orci erat, rhoncus efficitur eros a, sollicitudin commodo tortor.
Sed accumsan ex viverra est tincidunt bibendum a non nulla curabitur eget ligula mauris.
Curabitur sit amet auctor tellus, at scelerisque sem. In sit amet convallis arcu, id vulputate velit. Proin feugiat interdum nulla, eu malesuada massa commodo quis.
Vivamus diam orci, dignissim ac nulla hendrerit, porttitor posuere risus.
Cras vel leo mattis viverra tellus eget vestibulum est
Etiam arcu metus, vestibulum et consequat sit amet, imperdiet at augue donec condimentum risus at consequat sollicitudin.
In sit amet nisi vitae odio tristique posuere integer vel magna dignissim, sodales mauris a, tempus odio nullam orci sapien, posuere non posuere et, laoreet vel velit.
Quisque eleifend tempor eros aenean et tempus neque nam ut porttitor velit maecenas consectetur, lacus at commodo efficitur, est neque tincidunt leo, et dictum nunc lorem a est.
Maecenas viverra turpis vitae eros tempus porttitor nulla tempor nunc eros, eu elementum arcu dapibus a etiam a tristique metus.
100x improvement to claims document processing for Eastern Alliance
Our customer, Eastern Alliance (“Eastern”), a commercial carrier based in the US, specializing in Workers Compensation, identified a strategic need to modernize operations using various technologies, including AI. Claims document processing was a critical use case, so it was selected as the first area to deploy a Digital Coworker.