
Part I: Understanding Multimodality for Document Processing

Hunter Heidenreich
March 4, 2024

Executive Summary—Three Takeaways:

In this two-part article, we will:

  • Explain the different ways Document AI models use multimodality to automate the processing of documents, with a focus on Page Stream Segmentation (PSS) as a vital first step in any automation pipeline (Part I).
  • Explore PSS as a benchmark for assessing Document AI performance using a real-world, in-house dataset from the insurance sector (Part II).
  • Outline the performance advantages of multimodal models like LayoutLMv3, the competitiveness of simple XGBoost baselines [4], the limitations of unimodal generative models like GPT-4, and the trade-offs between technical capabilities and practical considerations when selecting the right DocAI model for a given application (Part II).

It’s commonplace across the insurance industry to transact business through unstructured documents. These documents come in many forms, including receipts, contracts, purchase orders, medical records, and general correspondence, and they arrive via email, fax, and paper.

The sheer volume of documents generated is mind-boggling. Before people can act on a document (e.g., legal teams dissecting dense contracts; insurance analysts navigating policy documents), they must first split pages into coherent documents so they can classify them and extract any relevant data.

Diagram 1: An unstructured collection of documents to be ingested, split, and classified to assure that the right information is extracted from the relevant documents.

Enter the modern Document AI (DocAI) – a fusion of a large language model (LLM) with the added modalities found in digital documents (e.g., layouts and images). These pre-trained models are meant to serve as a starting point for automated document processing. Examples of multimodal DocAI that have appeared to tackle aspects of this domain include Donut [7], with vision input and text output, and LayoutLM [16], with text, layout, and vision inputs.

Is multimodal DocAI required, or is a text-based LLM such as ChatGPT sufficient? As a starting point to implementing DocAI, it’s critical to assess need. How much do spatial and visual features matter for document processing? Would a vision-centric model obviate the need for preprocessing digital documents into a language-centric format for LLMs? Do models that fuse all modalities perform best?

Diagram 2: Different modalities offer different kinds of information; while unimodal models must use only one, multimodal models learn to integrate information across modalities to perform their predictions.

PSS (Page Stream Segmentation): Building Order from Chaos

Here we view these questions through the lens of Page Stream Segmentation (PSS) – a vital step in any digital document processing pipeline – where a continuous stream of pages is intelligently grouped into distinct, coherent documents.

Poorly executed or incorrect segmentation can have serious financial implications. Consider the submission of a claim to an insurance company, which might comprise multimodal documentation: bills, medical records, email exchanges, legal documents, photographs, and more.

To best automate the processing of this claim, an essential step is to segment these individual documents so that the correct extraction and post-processing steps may be applied depending on the type of document. For these applications, PSS isn’t merely a preliminary task; it’s essential to building the foundation for any subsequent classification and processing activities that demand outstanding accuracy and efficiency.
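Concretely, PSS can be framed as a binary decision at every page transition: does this page start a new document? A minimal sketch of the downstream grouping step, assuming such boundary decisions are already available (function and variable names are illustrative):

```python
def segment_stream(pages, starts_new_document):
    """Group a flat stream of pages into documents.

    starts_new_document[i] is True when pages[i] begins a new document;
    the first page always opens the first document.
    """
    documents = []
    for page, is_start in zip(pages, starts_new_document):
        if is_start or not documents:
            documents.append([page])      # open a new document
        else:
            documents[-1].append(page)    # continue the current one
    return documents

# A five-page stream containing two documents.
stream = ["page1", "page2", "page3", "page4", "page5"]
flags = [True, False, True, False, False]
# segment_stream(stream, flags)
# -> [["page1", "page2"], ["page3", "page4", "page5"]]
```

Everything the models in the following sections differ on is how those boundary flags are predicted; the grouping itself is trivial once they exist.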

Diagram 3: To maximize classification accuracy, PSS requires a model to faithfully identify document boundaries in a collated stream of pages.

PSS stands out as a unique benchmark for document processing, requiring both an understanding of single pages and their sequences to decide document boundaries. This requirement adds computational complexity, particularly for digital documents where pages can be quite lengthy. As we will solely consider DocAI with a Transformer architecture (with attention operations that scale quadratically in input sequence length), these concerns become central.

PSS Model Functionality

Section 1: Reading Documents with Language-Only Models

The explosion of Large Language Models (LLMs) such as GPT and RoBERTa has revolutionized language processing. Here, we will consider both unidirectional, generative language models (ChatGPT, GPT-4 [1]) as well as bidirectional, encoding models (RoBERTa [11]). For large models like GPT-4, we only consider a zero-shot evaluation scenario in which we present the model with text extracted from two pages and ask it to decide whether they belong to the same document. In contrast, we will perform a full fine-tuning of RoBERTa, using its bidirectional attention to produce useful representations for PSS.
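For the zero-shot setting, pairwise prompts of this kind can be assembled directly from extracted page text; the wording below is illustrative, not the exact prompt used in our evaluation:

```python
def build_pss_prompt(page_a_text, page_b_text):
    """Compose a zero-shot prompt asking whether two consecutive pages
    belong to the same document. Wording is illustrative only."""
    return (
        "You will see the OCR text of two consecutive pages from a page "
        "stream.\n"
        "Answer with exactly one word: NEW if the second page starts a "
        "new document, SAME if it continues the first page's document.\n\n"
        f"--- Page A ---\n{page_a_text}\n\n"
        f"--- Page B ---\n{page_b_text}\n\n"
        "Answer:"
    )

def parse_boundary(completion):
    """Map the model's free-form reply onto a boundary decision:
    True means 'Page B starts a new document'."""
    return completion.strip().upper().startswith("NEW")
```

The returned string would be sent as a chat message to the model; `parse_boundary` then turns the reply into the same true/false boundary flag a fine-tuned classifier would produce.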

To use a text-based LLM for PSS, you must preprocess digital documents with an Optical Character Recognition (OCR) system. An OCR system takes an image of a page, isolates its lines and words with bounding boxes, and parses the recognized text into digital characters. This step is pivotal as it transforms the inherent two-dimensional layout of pages – encompassing spatial arrangements and visual cues – into a one-dimensional, linear sequence of text. Without otherwise supplying the identified bounding boxes, this linearization strips away all layout and visual information, presenting a unique challenge: how can a model understand the structure of a document when spatial and visual cues are lost? While LLMs are constrained by their reliance on linearized text, their representational ability for text could still enable them to perform well on PSS. At the very least, they serve as a crucial baseline for what can be achieved using text alone.
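The linearization itself can be as simple as sorting OCR word boxes into reading order, top-to-bottom and then left-to-right within each line. A minimal sketch, assuming each OCR word arrives as `(text, x, y)` with pixel coordinates of its box's top-left corner (the `line_tolerance` heuristic is illustrative):

```python
def linearize(ocr_words, line_tolerance=10):
    """Flatten OCR word boxes into a one-dimensional text sequence.

    Words whose y coordinates are within line_tolerance pixels of the
    line's first word are treated as one line; lines are read top to
    bottom, and words within a line left to right.
    """
    words = sorted(ocr_words, key=lambda w: w[2])  # sort by y first
    lines, current, current_y = [], [], None
    for word in words:
        if current and abs(word[2] - current_y) > line_tolerance:
            lines.append(current)  # this word starts a new line
            current = []
        if not current:
            current_y = word[2]    # anchor the new line's y position
        current.append(word)
    if current:
        lines.append(current)
    return " ".join(
        w[0]
        for line in lines
        for w in sorted(line, key=lambda w: w[1])  # left to right
    )

# Two words on one line, one word below them:
# linearize([("world", 120, 11), ("hello", 10, 12), ("next", 10, 40)])
# -> "hello world next"
```

Real OCR output includes full bounding boxes and confidence scores, and multi-column pages need a smarter reading-order heuristic, but the principle is the same: 2D structure is collapsed into a 1D token stream.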

Diagram 4: An illustration of the OCR process. On the left, the original PDF page. On the right, the words and locations identified by the OCR software.

Section 2: Taking a Closer Look with Vision-Only Models

All text-based models introduce an OCR dependency to our document processing pipeline. While this gives us text in a usable fashion, our model’s performance becomes limited by the accuracy and efficiency of the underlying OCR system. Because of this, researchers have proposed vision-only approaches like Donut to remove the OCR dependency.

Vision-only models are not just about seeing; they are about understanding the visual language of documents. In the context of Page Stream Segmentation (PSS), these models could detect cues like changes in layout, graphical separators, or variations in font styles. Visual comprehension could be crucial for documents with minimal text or where layout carries more information than text content. For our comparison in Part II, we will consider Donut as our representative of the vision-only model.

Donut [7] is an Encoder-Decoder Transformer. It uses a vision Encoder (a Swin-B Encoder [12]) and a language Decoder, pre-trained to perform OCR. At 143M parameters, its authors show Donut to be faster and more precise than an OCR + RoBERTa combination. For a direct comparison with RoBERTa on PSS, we will drop the Decoder half of Donut and use the vision Encoder directly.

Vision-only models have a natural computation bottleneck as high-resolution images are needed to resolve small text. High-resolution images become long sequences fed into quadratic attention operations. Because of this computational bottleneck, researchers have not focused as much on vision-only models for document processing; other models of note include the Document Image Transformer (DiT [10]). Instead, the common trend has been to seek alternate ways of providing 2D aspects of the data domain to the model without paying the full cost of a vision Encoder.
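A back-of-envelope calculation makes the bottleneck concrete. Assuming a ViT-style tokenizer that cuts the page image into fixed-size patches (the resolutions and patch size below are illustrative, not Donut's exact configuration), the quadratic attention cost grows rapidly with resolution:

```python
def num_patch_tokens(height, width, patch=16):
    """Number of vision 'tokens' a ViT-style encoder produces when the
    image is split into non-overlapping patch x patch squares."""
    return (height // patch) * (width // patch)

def attention_cost(seq_len):
    """Self-attention cost grows with the square of sequence length."""
    return seq_len ** 2

low = num_patch_tokens(224, 224)     # 196 tokens at thumbnail resolution
high = num_patch_tokens(1280, 960)   # 4800 tokens at a text-legible resolution
ratio = attention_cost(high) / attention_cost(low)  # roughly 600x the cost
```

A ~24x longer sequence means a ~600x larger attention computation, which is why resolutions high enough to resolve small print are so expensive for vision-only models.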

Section 3: Word-Based Proprioception with Layout Models

How might we give an LLM access to spatial information without giving it full visual access to a document? We can give it information about the relative locations and sizes of words. In other words, word bounding boxes. Luckily, as OCR is already a pre-processing step to generate text for LLMs to train on, we have the bounding boxes as well.
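In practice, the pixel coordinates from OCR are normalized onto a fixed, page-size-independent grid before being embedded; the LayoutLM family uses a 0–1000 scale. A minimal sketch:

```python
def normalize_bbox(bbox, page_width, page_height, scale=1000):
    """Normalize an OCR bounding box (x0, y0, x1, y1) in pixels onto a
    0..scale grid, so boxes are comparable across page sizes. This is
    the convention used by the LayoutLM family of models."""
    x0, y0, x1, y1 = bbox
    return (
        int(scale * x0 / page_width),
        int(scale * y0 / page_height),
        int(scale * x1 / page_width),
        int(scale * y1 / page_height),
    )

# A word box on an 850 x 1100 pixel page:
# normalize_bbox((85, 110, 340, 130), 850, 1100) -> (100, 100, 400, 118)
```

Each normalized box is then embedded as a layout "token" alongside the corresponding text token.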

LiLT (Language-Independent Layout Transformer) [15] is a lightweight approach to learning a multimodal DocAI that uses textual and spatial information. LiLT takes a pre-trained LLM Encoder and couples it with spatial information by learning a separate Transformer Encoder tower that operates on bounding box “tokens.” To couple the modalities, LiLT uses a bidirectional attention complementation, which forces the towers to share attention weights. LiLT computes identical attention scores for both by summing together the logits generated from the text and spatial components independently. The final output of the LiLT model is the concatenation of the spatial and textual token representations. This process is visualized in Figure 2 of their paper.
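A toy illustration of that coupling, with plain Python lists standing in for one head's attention logit matrices (real LiLT does this per head inside every Transformer layer):

```python
import math

def softmax(row):
    """Numerically stable softmax over one attention row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def shared_attention(text_logits, layout_logits):
    """LiLT-style coupling: each tower computes its own raw attention
    logits, the two logit matrices are summed elementwise, and BOTH
    towers attend with the softmax of that shared sum."""
    joint = [
        [t + l for t, l in zip(t_row, l_row)]
        for t_row, l_row in zip(text_logits, layout_logits)
    ]
    return [softmax(row) for row in joint]

# When the text and layout logits exactly cancel each other's
# preferences, the shared attention falls back to uniform weights.
weights = shared_attention([[0.0, 1.0], [1.0, 0.0]],
                           [[1.0, 0.0], [0.0, 1.0]])
```

The key design point is that neither tower can attend differently from the other: spatial evidence can sharpen or dampen a text-driven attention pattern, and vice versa.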

Due to its usage of a pre-trained language model, LiLT is fast to pre-train. This also means that when a new unimodal text Encoder is developed, it is easy to create a LiLT variant. Additionally, because LiLT’s spatial Encoder uses a smaller dimensionality than its LLM, LiLT adds only 6.1M parameters to whatever base model it augments; paired with RoBERTa, the model totals approximately 130M parameters. As we will see in Part II of this series, LiLT is surprisingly effective and raises the question of what is gained when using all three modalities: text, vision, and layout.

While not included in our comparison in Part II, a handful of other multimodal DocAI models also leverage layout alongside textual information: BROS [5], StructuralLM [9], and FormNet [8].

Section 4: Information Fusion with Multimodal Models

Producing DocAI that combines all three modalities is the latest trend. The two versions of DocFormer [2, 3], LayoutLMv2 [17] and LayoutLMv3 [6], UDOP [14], and TILT [13] all attempt tri-modal fusion. As many of these models are Encoder-Decoder architectures, we focus on LayoutLMv3 as an Encoder-only tri-modal base model.

What distinguishes LayoutLMv3 from any other Transformer Encoder is its preprocessing and its pre-training. Text is embedded by summing together the embeddings for its token type, 1D position in the token sequence, and 2D position on the page. This directly entangles text and spatial information upon embedding, in contrast with LiLT, which keeps the modalities disentangled. Visual features are extracted as visual “tokens” by segmenting the document image into patches and linearly projecting the patches into a “token” sequence. The spatially augmented text token sequence and the visual token sequence are then concatenated and fed through a typical Transformer Encoder stack. This function is nicely depicted in Figure 3 of their published paper.
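The two steps can be sketched in miniature, with short Python lists standing in for embedding vectors (dimensions and values are illustrative):

```python
def embed_text_token(token_emb, pos_emb_1d, pos_emb_2d):
    """LayoutLMv3-style text embedding: token identity, 1D sequence
    position, and 2D page position are summed into a single vector,
    entangling text and layout at the embedding stage."""
    return [t + p1 + p2 for t, p1, p2 in zip(token_emb, pos_emb_1d, pos_emb_2d)]

def build_input_sequence(text_embeddings, patch_embeddings):
    """The spatially augmented text tokens and the projected visual
    patch tokens are simply concatenated before the Encoder stack."""
    return text_embeddings + patch_embeddings

# One 2-dimensional toy token:
# embed_text_token([1.0, 0.0], [0.5, 0.5], [0.0, 1.0]) -> [1.5, 1.5]
```

Contrast this with LiLT, where text and layout keep separate representations and interact only through shared attention weights; here the fusion happens before the first attention operation.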

Because of the concatenation of image “tokens” and text, LayoutLMv3 inherits sequence length bottlenecks like Donut (although LayoutLMv3 uses a lower image resolution which helps to reduce the sequence length). Due to its simplified processing of visual “tokens,” LayoutLMv3 has a base model with 133M parameters making it smaller than Donut but larger than LiLT.


From this introduction to Document AI models, it quickly becomes clear that choosing the right Document AI model requires understanding our problem’s needs. The usage of visual features always incurs a substantial computational cost due to the expansion of the input sequence length. This means that unless we absolutely need visual features, we might seek a computationally cheaper Document AI model that leverages the layout structure without fully “seeing” the document. Even after selecting the right modalities for our model, there remains the need to compare different Document AI models within a particular class (e.g., LayoutLMv3 versus DocFormerv2).

Nonetheless, you might still be wondering: why not just use GPT-4? Do I really gain much from all this added multimodal complexity?

In Part II of this publication, we’ll take a concrete look at this question by taking the DocAI models discussed here and putting them head-to-head against GPT-4 on a real-world insurance challenge.

About the Author: Hunter Heidenreich is a Machine Learning Researcher at Roots Automation focused on developing their universal document understanding model. He joined Roots after completing his master’s in computer science from Harvard University, where he researched time-series forecasting of physical systems with an emphasis on Transformer models.

Discover how Roots Automation is using LLMs to solve insurance businesses' most difficult unstructured data challenges.



[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S. and Avila, R., 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

[2] Appalaraju, S., Jasani, B., Kota, B.U., Xie, Y. and Manmatha, R., 2021. Docformer: End-to-end transformer for document understanding. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 993-1003).

[3] Appalaraju, S., Tang, P., Dong, Q., Sankaran, N., Zhou, Y. and Manmatha, R., 2023. DocFormerv2: Local Features for Document Understanding. arXiv preprint arXiv:2306.01733.

[4] Chen, T. and Guestrin, C., 2016, August. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).

[5] Hong, T., Kim, D., Ji, M., Hwang, W., Nam, D. and Park, S., 2022, June. Bros: A pre-trained language model focusing on text and layout for better key information extraction from documents. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 10, pp. 10767-10775).

[6] Huang, Y., Lv, T., Cui, L., Lu, Y. and Wei, F., 2022, October. Layoutlmv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia (pp. 4083-4091).

[7] Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D. and Park, S., 2022, October. Ocr-free document understanding transformer. In European Conference on Computer Vision (pp. 498-517). Cham: Springer Nature Switzerland.

[8] Lee, C.Y., Li, C.L., Dozat, T., Perot, V., Su, G., Hua, N., Ainslie, J., Wang, R., Fujii, Y. and Pfister, T., 2022, May. FormNet: Structural Encoding beyond Sequential Modeling in Form Document Information Extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3735-3754).

[9] Li, C., Bi, B., Yan, M., Wang, W., Huang, S., Huang, F. and Si, L., 2021, August. StructuralLM: Structural Pre-training for Form Understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 6309-6318).

[10] Li, J., Xu, Y., Lv, T., Cui, L., Zhang, C. and Wei, F., 2022, October. Dit: Self-supervised pre-training for document image transformer. In Proceedings of the 30th ACM International Conference on Multimedia (pp. 3530-3539).

[11] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V., 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

[12] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S. and Guo, B., 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012-10022).

[13] Powalski, R., Borchmann, Ł., Jurkiewicz, D., Dwojak, T., Pietruszka, M. and Pałka, G., 2021. Going full-tilt boogie on document understanding with text-image-layout transformer. In Document Analysis and Recognition–ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part II 16 (pp. 732-747). Springer International Publishing.

[14] Tang, Z., Yang, Z., Wang, G., Fang, Y., Liu, Y., Zhu, C., Zeng, M., Zhang, C. and Bansal, M., 2023. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 19254-19264).

[15] Wang, J., Jin, L. and Ding, K., 2022, May. LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 7747-7757).

[16] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F. and Zhou, M., 2020, August. Layoutlm: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1192-1200).

[17] Xu, Y., Xu, Y., Lv, T., Cui, L., Wei, F., Wang, G., Lu, Y., Florencio, D., Zhang, C., Che, W. and Zhang, M., 2021, August. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 2579-2591).
