Does Computer Vision add value to self-attention NLP?


Maximilian Schneider



Multimodal AI

This research explores the integration of Computer Vision (CV) with Natural Language Processing (NLP) to enhance document classification, particularly in the context of real estate financing. The study investigates whether CV adds value to self-attention NLP models and how this combination can improve the accuracy of classifying various documents based on both their visual and textual elements.

Business Background

  • Document classification is crucial in processing a large volume of documents, especially in sectors like banking and insurance. Automated classification saves time by providing context in business processes.
  • Traditional Optical Character Recognition (OCR) and Input Management Software largely depend on rules and patterns, requiring substantial maintenance and customization.


Key Findings

  • Integrating CV with NLP improves classification accuracy. Adding visual features from a pre-trained EfficientNet to the text features raised accuracy by 6 percentage points compared to using text features alone.
  • Accuracy increased from 87% (text features only) to 93% (text and image features).
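
To put the reported numbers in perspective, the short Python sketch below derives three views of the same improvement from the two accuracies stated above (87% and 93%). The function name `accuracy_gain` is made up for this illustration; only the two input values come from the study.

```python
def accuracy_gain(baseline: float, combined: float) -> dict:
    """Compare two accuracies given as fractions in [0, 1]."""
    # Absolute difference, expressed in percentage points.
    absolute_points = (combined - baseline) * 100
    # Improvement relative to the baseline accuracy, in percent.
    relative_gain = (combined - baseline) / baseline * 100
    # Share of the baseline's errors that the combined model removes, in percent.
    error_reduction = (combined - baseline) / (1 - baseline) * 100
    return {
        "absolute_points": round(absolute_points, 1),
        "relative_gain_pct": round(relative_gain, 1),
        "error_reduction_pct": round(error_reduction, 1),
    }


# Values reported in the study: text only vs. text plus image features.
gain = accuracy_gain(0.87, 0.93)
print(gain)
```

Read this way, the 6-point gain corresponds to roughly 46% fewer misclassified documents, since the error rate drops from 13% to 7%.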

Data and Methodology

  • A unique dataset from the real estate financing process in Germany was used. This dataset included various document types, such as land registration extracts, notarized contracts, declarations of partition, and rental agreements.
  • For text features, a Neural Bag of Words (NBOW) model with a multi-headed self-attention layer was employed. For visual features, EfficientNet, pretrained on ImageNet, was used.
  • The final architecture consisted of two branches (text and image features) with a classifier combining these features.
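
The two-branch design described above can be sketched in plain Python. This is a toy illustration under several assumptions, not the study's implementation: all weights are random and untrained, the dimensions are arbitrary, the attention heads are simply concatenated without an output projection, and the image branch is replaced by a stand-in feature vector where the study would use EfficientNet features. All names (`text_branch`, `classify`, and so on) are hypothetical.

```python
import math
import random

random.seed(0)

# Arbitrary toy sizes (assumptions, not taken from the study).
VOCAB_SIZE, EMBED_DIM = 50, 8
NUM_HEADS, HEAD_DIM = 2, 4
IMAGE_DIM, NUM_CLASSES = 6, 4


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """One head of scaled dot-product self-attention over token vectors."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        attended.append([sum(w * v[i] for w, v in zip(weights, values))
                         for i in range(len(values[0]))])
    return attended


def text_branch(token_ids, embeddings, heads):
    """Embed tokens, apply multi-head self-attention, then average-pool."""
    tokens = [embeddings[i] for i in token_ids]
    per_head = [self_attention(tokens, *head) for head in heads]
    # Concatenate the head outputs for each token position.
    combined = [sum((head_out[t] for head_out in per_head), [])
                for t in range(len(tokens))]
    # Bag-of-words style pooling: mean over positions.
    return [sum(column) / len(combined) for column in zip(*combined)]


def classify(text_vec, image_vec, w_out, bias):
    """Fuse both branches by concatenation and apply a linear softmax classifier."""
    fused = text_vec + image_vec
    logits = [z + b for z, b in zip(matvec(w_out, fused), bias)]
    return softmax(logits)


embeddings = rand_matrix(VOCAB_SIZE, EMBED_DIM)
heads = [tuple(rand_matrix(HEAD_DIM, EMBED_DIM) for _ in range(3))
         for _ in range(NUM_HEADS)]
w_out = rand_matrix(NUM_CLASSES, NUM_HEADS * HEAD_DIM + IMAGE_DIM)
bias = [0.0] * NUM_CLASSES

token_ids = [3, 17, 17, 42, 8]  # hypothetical tokenized page
# Stand-in for the pooled EfficientNet features of the page image.
image_features = [random.uniform(0, 1) for _ in range(IMAGE_DIM)]

text_vec = text_branch(token_ids, embeddings, heads)
probs = classify(text_vec, image_features, w_out, bias)
print(probs)
```

Concatenating the two feature vectors before a shared classifier is one common way to combine modalities; the summary does not say exactly how the study's classifier merges them, so treat the fusion step here as an assumption.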


Limitations

  • Performance may suffer if documents are unevenly distributed across categories or if some categories have too few examples.
  • The model may struggle with blank pages or multilingual documents, depending on the tokenizers and models used.
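
One way to check for the class-imbalance risk mentioned above is to count the examples per category before training and flag rare ones. The sketch below is illustrative only: the label counts and the `min_examples` threshold are invented, and the weighting formula is the common "balanced" heuristic (the one scikit-learn uses for `class_weight="balanced"`), not something the study reports using.

```python
from collections import Counter


def class_report(labels, min_examples=20):
    """Return per-class counts, balanced class weights, and under-represented classes."""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    # Balanced weighting: rarer classes get proportionally larger weights.
    weights = {label: total / (num_classes * n) for label, n in counts.items()}
    # Classes with fewer examples than the (arbitrary) threshold.
    rare = sorted(label for label, n in counts.items() if n < min_examples)
    return counts, weights, rare


# Hypothetical label distribution, made up for this example.
labels = (
    ["land_register_extract"] * 120
    + ["notarized_contract"] * 60
    + ["rental_agreement"] * 15
    + ["declaration_of_partition"] * 5
)

counts, weights, rare = class_report(labels)
print(counts)
print(weights)
print("Under-represented:", rare)
```

The resulting weights could be passed to a weighted loss function, and the flagged categories are candidates for collecting more examples or merging with related classes.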


Conclusion

  • Combining CV and NLP is more effective for document classification than text alone, because the model can recognize categories from both textual and visual elements.
  • This approach is particularly advantageous for documents with distinct layouts or formats that are easy to recognize visually.

The full analysis is available to readers who want to explore the dataset and findings in more depth. Access can be requested via e-mail.

