Part 2: Benchmarking Multimodality with Real-World Insurance Data

Hunter Heidenreich
March 4, 2024

Previously, in Part I of this article series, we took a tour through the world of multimodality in Document AI, considering text, layout structure, and visual features as separate input signals that can be used to automate document processing.

Here in Part II, we will see how Document AI performs on a real-world task necessary for all automated document processing pipelines: Page Stream Segmentation (PSS).

As a quick recap from Part I, PSS is the task of taking an unstructured collection of document pages and segmenting them into their logical documents.

To perform this comparison, we will use an in-house PSS dataset from the insurance sector. We will perform supervised fine-tuning for the Encoder models we discussed in Part I. As a baseline, we’ll also include XGBoost to demonstrate how traditional algorithms compare. Finally, we’ll compare ChatGPT and GPT-4 head-to-head with the multimodal Encoders to reason about whether the extra modalities are worth the complexity they add.

Table 1: A Summary of Models Considered

Our goal is to find the most effective base model for PSS. For brevity, this comparison is non-exhaustive; we expect to offer a more comprehensive dissection of performance differences following future analysis.

Section 5: In-House Dataset Analysis

Having outlined some processes essential to document segmentation in Part I, we now shift to an evaluation of the different PSS models.

To do this in a relevant context, we’ll use an internal dataset from the insurance industry consisting of 6K document packages (i.e., 12K unique documents comprising 30K pages) that we will look to accurately split.

Table 2: Summary of internal dataset statistics

We evaluate model performance using precision, recall, and F1-score at both the page level and the document level. Document-level metrics are harsher: a predicted document counts as correct only if every page in it is correctly classified. Document precision only credits perfectly segmented documents, and document recall only credits segmented documents that exist in the ground truth data. To evaluate models consistently, we split this data 80/10/10 into training, validation, and test sets at the package level.
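
To make the difference between page-level and document-level scoring concrete, here is a minimal Python sketch. It is one plausible reading of the definitions above, not necessarily the exact evaluation code behind our results. The function names are hypothetical. It assumes each package is a list of 0/1 flags, where 1 marks a page that ends a document, and that the final page of a package always closes a document.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (first page index, last page index), inclusive


def _prf(hits: int, n_pred: int, n_true: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def page_prf(true_ends: List[int], pred_ends: List[int]) -> Tuple[float, float, float]:
    """Score each page independently as an end-of-document prediction."""
    hits = sum(1 for t, p in zip(true_ends, pred_ends) if t and p)
    return _prf(hits, sum(pred_ends), sum(true_ends))


def ends_to_spans(ends: List[int]) -> Set[Span]:
    """Turn per-page end flags into the set of document spans they imply."""
    spans: Set[Span] = set()
    start = 0
    for i, is_end in enumerate(ends):
        # Assumption: the last page of a package always closes a document.
        if is_end or i == len(ends) - 1:
            spans.add((start, i))
            start = i + 1
    return spans


def document_prf(true_ends: List[int], pred_ends: List[int]) -> Tuple[float, float, float]:
    """A predicted document is a hit only if its exact span is in the ground truth."""
    true_spans = ends_to_spans(true_ends)
    pred_spans = ends_to_spans(pred_ends)
    return _prf(len(true_spans & pred_spans), len(pred_spans), len(true_spans))


# Toy package: three true documents (pages 0-1, 2-4, 5) and one spurious split.
true_ends = [0, 1, 0, 0, 1, 1]
pred_ends = [0, 1, 0, 1, 1, 1]
page_scores = page_prf(true_ends, pred_ends)
doc_scores = document_prf(true_ends, pred_ends)
```

In this toy package, the single spurious split costs little at the page level (precision 0.75, recall 1.0), but it breaks a three-page document into two wrong ones, so document precision falls to 0.5 and document recall to about 0.67.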

Section 6: Model Architecture and Methods

To compare DocAI base models on equal footing, we use an identical model architecture across foundation models. For each page we seek to classify as a potential end-of-document, we look one page before and one page after. Each page is encoded using the base model, with its first token vector output used as its summary vector. These three vectors are then concatenated into one long vector and fed into a binary classification layer.

Diagram 5: Supervised models classify a center page by encoding it and neighboring pages

After concatenating these independent page representations, we train a classification head with supervision to determine whether the center page ends a document.
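
The windowing step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not our training code. The function names are hypothetical, and the summary vectors are taken as given (in practice they would come from the base model). Padding missing neighbors at the start and end of a package with a zero vector is an assumption. The single logistic layer stands in for the binary classification layer.

```python
import math
from typing import List, Sequence

Vector = List[float]


def build_windows(page_vectors: Sequence[Sequence[float]]) -> List[Vector]:
    """Concatenate [previous, center, next] summary vectors for every page."""
    if not page_vectors:
        return []
    dim = len(page_vectors[0])
    # Assumption: a zero vector stands in for a missing neighbor
    # (before the first page and after the last page of a package).
    pad = [0.0] * dim
    padded = [pad] + [list(v) for v in page_vectors] + [pad]
    return [
        padded[i] + padded[i + 1] + padded[i + 2]
        for i in range(len(page_vectors))
    ]


def end_of_document_probability(window: Vector, weights: Vector, bias: float) -> float:
    """Binary classification layer: a logistic unit over the concatenated window."""
    logit = sum(w * x for w, x in zip(weights, window)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Three pages with 2-dimensional summary vectors (toy values).
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
windows = build_windows(vectors)
# Each window has 3 * 2 = 6 features; untrained zero weights give p = 0.5.
prob = end_of_document_probability(windows[1], [0.0] * 6, 0.0)
```

Because every page is encoded independently, the classifier input is always three times the encoder's hidden size, regardless of which base model produced the vectors.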

All fine-tuned models are trained using the AdamW optimizer with a learning rate of 1e-5, a batch size of 32, and a weight decay of 0.01, for up to ten epochs. The best model (as measured by validation accuracy during training) is kept for final evaluation. As a traditional machine learning approach, we include XGBoost trained on count and TF-IDF bag-of-words vectors. For the zero-shot comparison of ChatGPT and GPT-4, only 10% of the test data was used to evaluate model performance due to the low performance of the zero-shot approach.
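
For readers less familiar with bag-of-words features, the sketch below shows what count and TF-IDF vectors look like for a few toy pages. It is only an illustration. The function names are hypothetical, whitespace tokenization and the unsmoothed textbook weighting tf * log(N / df) are assumptions, and a library implementation would typically add smoothing and normalization. Feeding the resulting vectors to XGBoost is not shown.

```python
import math
from collections import Counter
from typing import Dict, List


def count_vectors(pages: List[str]) -> List[Counter]:
    """Bag-of-words counts per page (lowercased, whitespace-tokenized)."""
    return [Counter(page.lower().split()) for page in pages]


def tfidf_vectors(pages: List[str]) -> List[Dict[str, float]]:
    """Textbook TF-IDF: raw term count times log(N / document frequency)."""
    counts = count_vectors(pages)
    n_pages = len(pages)
    doc_freq: Counter = Counter()
    for page_counts in counts:
        # Count each term once per page it appears in.
        doc_freq.update(page_counts.keys())
    idf = {term: math.log(n_pages / df) for term, df in doc_freq.items()}
    return [
        {term: tf * idf[term] for term, tf in page_counts.items()}
        for page_counts in counts
    ]


# Toy pages; the text is made up for illustration.
pages = ["Total due total", "claim form", "claim total"]
counts = count_vectors(pages)
tfidf = tfidf_vectors(pages)
```

Terms that appear on few pages, such as "due" or "form" here, receive a higher weight than terms spread across many pages, which is what lets a tree-based model pick up on vocabulary that signals a document boundary.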

Section 7: Results and Discussion

After training our supervised models, we evaluate them on the test split of our in-house dataset and display the results below in Table 3. Immediately, we observe that ChatGPT and GPT-4 fail to achieve satisfactory performance at this task. While GPT-4’s zero-shot performance begins to approach what other models can attain with supervision, it still leaves much to be desired and comes at a much higher cost (both computational and financial). In contrast, our simple XGBoost baseline is extremely competitive, outperforming almost all the DocAI models and offering a strong alternative to LiLT at a very low computational cost.

Table 3: Model performance. *Indicates models evaluated only on the first 10% of the test split.

Comparing our text-only RoBERTa model with the vision-only Donut model, we see a strong difference between the two in their precision versus recall. While RoBERTa has higher precision (more of its predicted document ends are true ends), Donut achieves higher recall (more of the true document-ending pages are correctly classified). These trends bubble up from the page-level metrics into the document-level metrics, where Donut has the highest document recall of all models considered. Despite its high recall, Donut also has the worst precision of all models. Nevertheless, as a unimodal Encoder, Donut appears to surpass RoBERTa.

By contrasting RoBERTa with LiLT, we can reason quite directly about the benefits of the spatial modality since the LiLT model is pre-initialized with RoBERTa’s weights. While the page precision is identical, we can observe that the main benefit is in boosting the model’s recall. Through significantly increasing the recall, LiLT produces a more balanced classifier that is quite competitive with Donut. Additionally, since LiLT does not use images, it avoids the larger memory requirement that Donut suffers from.

The most interesting performance stems from the multimodal LayoutLMv3 model. It slightly outperforms LiLT in both precision and recall at the page level, but drastically outperforms it at the document level. This could be an artifact of a smaller dataset, and a precise evaluation would require repetition across random seeds. Another interesting point of contrast is that LayoutLMv3 does not achieve the high recall of the Donut model. This has to do with the different image resolutions used by the two models: Donut achieves higher recall because it uses a higher image resolution. That said, LayoutLMv3 has significantly better recall than the text-only LLM and reaches the highest precision of all considered models.


And the Winner Is…

So, for PSS, what is the ideal AI model?

In our case, the results show that the LayoutLMv3 model has the absolute best performance. However, it has a non-commercial license and costs more to train and host because of the long sequence lengths its visual inputs require. In contrast, LiLT offers a close alternative that has a computational cost similar to your average text-only LLM Encoder and a non-restrictive license. Additionally, XGBoost offers a cheap and transparent model with similar performance.

Unimodal Encoders are a suboptimal selection. While Donut benefits from multimodal pre-training to perform OCR, it has an extreme memory cost. Donut's high recall is intriguing enough to warrant interest, but the fact that LiLT produces a more balanced model that is cheaper to train and serve makes it a more appealing choice in this context.

This journey through the landscape of DocAI models reveals a complex tapestry of trade-offs. This exploration underscores the critical importance of considering not just the technical capabilities, but also the practical implications of deploying these models. As DocAI continues to evolve, it becomes increasingly clear that the most effective solutions may not always be the most advanced ones, but those that align best with the specific needs and constraints of their application. This article, hopefully, has not only illuminated the nuances of selecting the right DocAI for PSS but also inspired you to delve deeper into this fascinating and ever-evolving domain.

About the Author: Hunter Heidenreich is a Machine Learning Researcher at Roots Automation focused on developing their universal document understanding model. He joined Roots after completing his master’s in computer science from Harvard University, where he researched time-series forecasting of physical systems with an emphasis on Transformer models.


[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S. and Avila, R., 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[2] Chen, T. and Guestrin, C., 2016, August. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).

[3] Huang, Y., Lv, T., Cui, L., Lu, Y. and Wei, F., 2022, October. LayoutLMv3: Pre-training for Document AI with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia (pp. 4083-4091).

[4] Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D. and Park, S., 2022, October. OCR-free document understanding transformer. In European Conference on Computer Vision (pp. 498-517). Cham: Springer Nature Switzerland.

[5] Liu, Z., Lin, W., Shi, Y. and Zhao, J., 2021, August. A robustly optimized BERT pre-training approach with post-training. In China National Conference on Chinese Computational Linguistics (pp. 471-484). Cham: Springer International Publishing.

[6] Wang, J., Jin, L. and Ding, K., 2022, May. LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 7747-7757).
