Computer Vision and NLP

Does Computer Vision add value to self-attention NLP?

Computer vision (CV) can support natural language processing (NLP) when documents are classified by both their visual and their textual elements. We describe the combined NLP and CV approach in detail, evaluate it in a real-world real estate financing scenario, and list its limitations.

Business Background

Why is document classification important (again)?

Document classification refers to the process of assigning a document to one or more categories or classes. Automatic classification is especially important when enterprises process a large number of documents or emails, as classification provides a context in the business process and thereby saves a lot of time. For example, the banking and insurance sector uses a wide variety of scanned documents or emails that have to be organized, stored, and understood manually. The information within documents is then used in financial processes to finance, invest in or insure. Automated (con-)text understanding in documents reduces the manual effort to input data and create human understandable knowledge.

How does traditional OCR differ from AI approaches?

Document classification is not new to the insurance or financial industry. In fact, the insurance industry coined its own name for this type of software: Input Management. Sometimes the term OCR is used to describe any activity that digitizes, categorizes, or extracts information from documents. Traditional Input Management Software relies on various rule-based patterns.

  • Traditional Input Management Software uses expert rules, such as regular expressions (RegEx). Updating and maintaining those rules over decades is costly.
  • Traditional Input Management Software uses layout-based patterns, so information is expected at a fixed position on a page.
  • Traditional Input Management Software relies on consulting. In contrast to recent AI technology, Input Management Software does not learn from examples but needs to be customized.
  • Document processing and classification rely on outdated OCR. Over the past 10 years, the cost of OCR has dropped by about 90 %.
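
To make the contrast with learned models concrete, a rule-based classifier of the kind described above might look like the following minimal sketch. The patterns, the category keywords, and the function name classify_by_rules are hypothetical illustrations and are not taken from any actual Input Management product.

```python
import re

# Hypothetical expert rules: each category is tied to a hand-written pattern.
RULES = {
    "Rental Agreement": re.compile(r"\b(tenant|landlord|monthly rent)\b", re.IGNORECASE),
    "Land Registration": re.compile(r"\bland register\b", re.IGNORECASE),
}


def classify_by_rules(text):
    """Return the first category whose pattern matches the text, else None."""
    for category, pattern in RULES.items():
        if pattern.search(text):
            return category
    # No rule matched: a new wording requires a new hand-written rule.
    return None
```

Every new document type or phrasing requires another rule, which is where the maintenance cost mentioned above comes from.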


Computer Vision improves the accuracy by 6 percentage points

Let us start with the results: a pre-trained EfficientNet improves the accuracy of an NBOW model by 6 percentage points in our document classification task. When text and image features are combined, both the accuracy and the F1-score are higher than with text features alone. In this article, we describe our dataset and the models we used for document classification in detail.

Features used (NLP and/or CV)   Accuracy   F1-score
Only text features              87 %       84 %
Text + image features           93 %       92 %
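
For readers who want to reproduce such figures on their own data, the two metrics can be computed as in the following sketch. The article does not state which F1 averaging was used, so the macro average here is an assumption, and the labels in the usage note are toy values, not project data.

```python
def accuracy(y_true, y_pred):
    """Share of documents whose predicted category equals the true one."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-category F1-scores (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

With the toy labels y_true = ["A", "A", "B", "B"] and y_pred = ["A", "B", "B", "B"], the accuracy is 0.75, while the macro F1-score is lower because category A is only found half of the time.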


The real estate financing process in Germany provides us with a complex document classification problem

For this document classification problem, we have selected a unique use case that deals with documents from real estate financing. A bank has to check several documents that are submitted with a loan application before it can decide whether to grant the loan. Interestingly, in this case every type of document adds more information about the property. After reviewing all types of documents, the bank gains a 360-degree perspective on each property. We describe and visualize all documents below as they exist in Germany. The goal was to develop a deep-learning-based solution that can automatically classify 356 documents by their visual and text-based features.


Land Registration

The land register extract is a copy of all entries in the land register. It is required, among other things, for the mortgage lending assessment on the basis of the mortgage lending documents and before the purchase of a property. In Germany, copies of the land register are only accessible if a justified interest has been proven.

Land Registration Extract

Notarized Contract

A land purchase agreement, i.e. a notarized contract, is a purchase agreement for the acquisition of land, rights equivalent to land, or heritable building rights. Due to the special economic significance and the more complicated legal issues and legal relationships, a wide range of purchaser protection mechanisms are in place for land by law and case law, so that a land purchase agreement cannot be concluded informally like most everyday purchase agreements. Rather, the involvement of a notary is mandatory, irrespective of whether the contracting parties are experienced merchants or consumers.

Prior to the conclusion of the land purchase agreement, the notary verifies, for example, the conditions under land register law by inspecting the land register (land ownership conditions, location, size, encumbrances), the possible assumption of existing obligations from rental and lease agreements, and any impairment of the use of the property (public or private building encumbrances, pre-emptive rights). This information serves the notary as a basis for preparing the land purchase agreement, because the agreement must reflect these actual or legal circumstances.

Notarized Contract real estate financing

Declaration of Partition

In German residential property law, the declaration of partition is the declaration of the property owner to the land registry that the ownership of the property is divided into co-ownership shares. The declaration of partition is an important legal basis of residential or partial ownership and must be submitted once when it is created. It is a unilateral declaration made by the property owner and is intended to provide the land registry office with precise information on matters relating to land registry law.

Prospectus


In the real estate industry, a prospectus is a description of a property. In most cases, a prospectus, i.e. “Exposé”, is prepared by the seller or landlord or by one of their agents, such as an estate agent, an architect, or a credit institution. The goal is usually to market the property better, whether for letting or for sale. In addition to the site plan and photos, it contains basic data on the property, such as the area, purchase price or rent, and year of construction, as well as a brief description of the property. Since May 1, 2014, it has been mandatory to include information on energy consumption.


Rental Agreement

In Germany, a rental agreement is a mutual contract under the law of obligations for the temporary transfer of use in return for payment. One party to the contract (the landlord) undertakes to grant the other party (the tenant) the use of the rented item, while the tenant in return pays the agreed rent. If the property is rented out, rental agreements document the income that provides the liquidity to repay the loan.


Effective Area Calculation

Usable space refers to the rooms in a building that can be used for a specific purpose, but not necessarily inhabited. Usable space therefore always includes living space.

Effective Area Calculation

Ground Plan

A floor plan, i.e. ground plan, of a house or apartment is not only important for building, but also for renting or selling. It gives prospective buyers a good overview of the premises, so that they can get an idea of the room layout before the viewing appointment and decide whether the property really suits their needs.


Cadastral Map

The cadastral map - also known as the real estate map - is a scaled representation of all real estate (parcels, plots of land) and, together with the appraisal map, forms the representative part of the real estate cadastre. With its proof of location and demarcation, it is the official map basis of the land register with its properties and thus the basis for securing ownership of land and a fair land tax assessment. The cadastral maps have now been completely replaced in Germany by the Automated Real Estate Map (ALK) and are therefore to be regarded as historical.


Why do visual elements matter for natural language processing in documents?

For the document classification in the real estate financing use case, a combination of visual and text features was used. Visual features help to identify documents that are easily recognizable by their figures, e.g. a ground plan. These visual features are provided by EfficientNet, a family of convolutional neural network models designed to be more efficient than previous computer vision models, especially in terms of the number of parameters and FLOPS, while maintaining equivalent image classification performance. [1] The visual model benefits from transfer learning from an EfficientNet trained on a large dataset called ImageNet, which contains millions of images. [2]

For the text features, an NBOW (neural bag of words) model with a multi-headed self-attention layer was used, which allows the words to be contextualized. A plain bag-of-words model represents a text as a vector of word occurrence counts. Multi-headed means that the input vectors are split and the attention is calculated for each part in parallel. Afterwards, the independent attention outputs are combined at the output of the multi-head attention module. This approach allows the model to attend to parts of the sequence differently, e.g. longer-term dependencies versus shorter-term dependencies. Enhanced by self-attention, the model is able to pay more attention to specific words that are more important for one specific category, as it considers the context surrounding them. [3]
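
The split, attend, and combine steps can be sketched in plain Python as follows. This is a simplified illustration and not the project's implementation: the learned query, key, and value projections of a real attention layer [3] are omitted, so each head uses its slice of the embedding directly, and the final mean pooling over tokens is an assumption about how the page vector is formed.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(head):
    """Scaled dot-product self-attention over one head (a list of token vectors)."""
    scale = math.sqrt(len(head[0]))
    outputs = []
    for query in head:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in head]
        weights = softmax(scores)
        # Each output is a weighted mix of all token vectors in this head.
        outputs.append([sum(w * value[i] for w, value in zip(weights, head))
                        for i in range(len(query))])
    return outputs


def multi_head_self_attention(tokens, num_heads):
    """Split each embedding into heads, attend per head, concatenate the results."""
    dim = len(tokens[0])
    assert dim % num_heads == 0, "embedding size must be divisible by num_heads"
    size = dim // num_heads
    per_head = [attention([tok[h * size:(h + 1) * size] for tok in tokens])
                for h in range(num_heads)]
    return [sum((per_head[h][t] for h in range(num_heads)), [])
            for t in range(len(tokens))]


def nbow_pool(tokens):
    """Average the contextualized token vectors into one page vector (assumed pooling)."""
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(len(tokens[0]))]
```

Calling nbow_pool(multi_head_self_attention(tokens, num_heads=2)) on a list of token embeddings yields one fixed-size vector per page, regardless of how many tokens the page contains.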

This raises the question of how the selected models were adapted to our own training dataset. Our final architecture consisted of two branches with a classifier on top. The two branches extracted the features from text and images using the original weights of the trained models, and the classification layers were trained on our own training dataset. Using only the NBOW with self-attention, the accuracy on the test set was 87 %. Adding the image features improved this figure by 6 percentage points (to 93 %). This increase arises because some documents in our dataset, e.g. floor plans or land register certificates, are recognized more easily by their visual features. Their layout alone distinguishes them from other documents, without any understanding of the written text. Thus, the combination of computer vision with NLP is preferred because the categories are more recognizable when both the text and the contained figures are considered.

Are there alternative models?

Of course, there are also alternative methods you can use for your specific use case. One alternative to EfficientNet could be the VGG model introduced in 2014 [4], which is a different convolutional neural network architecture. Versions of these architectures pretrained on ImageNet, a dataset very commonly used for image classification, are easy to find. By using pretrained models, you benefit from weights that were already trained on a dataset that is probably larger than your own. General features like shapes and contours are already learned, which makes it easier to differentiate images. We used EfficientNet in our example of classifying real estate financing documents because it is one of the latest models and because it introduced a new baseline architecture as well as a scaling method that achieves better performance while being smaller and having fewer parameters than other architectures of the same type.

For the text features, simpler alternative models can also be used. For example, a simple bag of words only converts the text into a vector representation by counting the frequency of each word in that text. In comparison, the NBOW creates the vector representations by passing the text through an embedding layer. We decided to use NBOW with multi-head self-attention because of the added context described above. BERT models, which are bidirectional Transformers, or LSTM (Long Short-Term Memory) networks, which are recurrent neural networks that handle long-term dependencies, can also be used as alternatives. [5]
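
The difference between the two representations can be illustrated with a short sketch. The vocabulary and the two-dimensional embedding table are hypothetical toy values; in a real NBOW model the embedding vectors are learned during training.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Simple bag of words: one occurrence count per vocabulary entry."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def embed_tokens(text, embeddings, dim):
    """NBOW input: one embedding vector per token; unknown tokens map to zeros."""
    return [embeddings.get(token, [0.0] * dim) for token in text.lower().split()]


# Hypothetical toy vocabulary and embedding table for illustration only.
VOCABULARY = ["rent", "tenant", "floor"]
EMBEDDINGS = {"rent": [0.1, 0.9], "tenant": [0.2, 0.8]}
```

The bag of words for "Rent rent tenant" is a single count vector with one entry per vocabulary word, whereas embed_tokens returns one dense vector per token that later layers, such as self-attention, can contextualize.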

Details about the results of the classification by combining Computer Vision and NLP

The process of document classification is conducted per page. This results in a final classification of a document based on the average classification of its pages. Let’s take for example a document with three pages, classified into two categories A and B.

Category     Page 1   Page 2   Page 3   Document
Category A   0.5      0.6      0.9      0.67
Category B   0.5      0.4      0.1      0.33
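
The averaging step in this example can be written as a short sketch. The function name classify_document and the dictionary layout are assumptions made for illustration.

```python
def classify_document(page_scores):
    """Average the per-page category scores and pick the best category.

    page_scores is a list with one dict per page, mapping category to score.
    """
    categories = page_scores[0].keys()
    document = {category: sum(page[category] for page in page_scores) / len(page_scores)
                for category in categories}
    best = max(document, key=document.get)
    return best, document


pages = [
    {"A": 0.5, "B": 0.5},
    {"A": 0.6, "B": 0.4},
    {"A": 0.9, "B": 0.1},
]
best, document = classify_document(pages)
```

For the three pages above, the document scores round to 0.67 for category A and 0.33 for category B, so the document is assigned to category A.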

Below you will find an example of the classification vector of one rental agreement from the actual test dataset of the project with documents from real estate financing. Here, the model distinguishes between eight categories (document types) for each page (three pages in this example). Every page has a prediction for each of those eight categories, shown in the table below. The document type Rental Agreement (number 4 when counting from zero) was predicted from the average of the page predictions, as this category achieves a confidence of 99.87 %.

Category                     Page 1     Page 2     Page 3     Document
Land Registration            < 0.1 %    < 0.1 %    < 0.1 %    < 0.1 %
Notarized Contract           < 0.1 %    < 0.1 %    0.19 %     0.10 %
Declaration of Partition     < 0.1 %    < 0.1 %    < 0.1 %    < 0.1 %
Prospectus                   < 0.1 %    < 0.1 %    < 0.1 %    < 0.1 %
Rental Agreement             99.90 %    99.97 %    99.74 %    99.87 %
Effective Area Calculation   < 0.1 %    < 0.1 %    < 0.1 %    < 0.1 %
Ground Plan                  < 0.1 %    < 0.1 %    < 0.1 %    < 0.1 %
Cadastral Map                < 0.1 %    < 0.1 %    < 0.1 %    < 0.1 %

Limitations of NLP and CV document classification approach

The exception proves the rule. Of course, there are also a few limitations that we encounter when using the document classification tool. We have collected a few questions that may have crossed your mind while you have been reading.

1. What happens if only little training data is available in one category?

Since the optimal dataset contains an evenly distributed number of documents per category, an uneven distribution or too few examples in one category can worsen the performance. It is therefore recommended to first combine the underrepresented category with another one and to split them later, once enough examples have been collected to refine the classification. However, the behavior of the classification depends heavily on the documents that have to be classified. If the text and/or images in the documents of the less represented class are clearly distinguishable from the other categories, a correct classification is very likely. In general, an ideal dataset contains at least 50 documents in each category.
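
The recommendation to merge an underrepresented category can be sketched as follows. The threshold of 50 comes from the text above; the function name merge_small_categories and the merge_map argument, which states which category absorbs which, are hypothetical.

```python
from collections import Counter

MIN_DOCUMENTS = 50  # minimum number of documents per category named above


def merge_small_categories(labels, merge_map, min_documents=MIN_DOCUMENTS):
    """Relabel documents of too-small categories with their merge target.

    merge_map maps an underrepresented category to the category it should
    be combined with until enough examples are available.
    """
    counts = Counter(labels)
    merged = []
    for label in labels:
        if counts[label] < min_documents and label in merge_map:
            merged.append(merge_map[label])
        else:
            merged.append(label)
    return merged
```

Once the small category has collected enough examples, the entry is removed from merge_map and the model is retrained with the categories split again.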

2. How do we handle blank pages?

The table below shows the classification results of the document per page and the final classification as category A. The blank page will be ignored as the input to the model is empty. We found out that the model is more robust when it learns per page. The processing of the whole document is not disturbed by the insertion of a blank page.

Category     Page 1   Page 2   Page 3 (blank)   Page 4   Document
Category A   0.5      0.6      NaN              0.9      0.67
Category B   0.5      0.4      NaN              0.1      0.33
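
Skipping blank pages when averaging can be sketched like this. Representing a blank page as NaN follows the table above; the function name average_non_blank is an assumption.

```python
import math


def average_non_blank(page_scores):
    """Average the page scores of one category, ignoring blank pages (NaN)."""
    valid = [score for score in page_scores if not math.isnan(score)]
    if not valid:
        # A document consisting only of blank pages has no prediction.
        return float("nan")
    return sum(valid) / len(valid)
```

For category A above, the scores 0.5, 0.6, NaN, and 0.9 are averaged over the three non-blank pages only, which again rounds to 0.67.
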

3. What happens when documents of different languages are included?

If multiple languages are to be processed, the model must be adapted to these languages. Image and text features handle documents of different languages in one dataset differently. The image features are independent of the language. For the text features, it depends on the tokenizers and models used. When using an NBOW model, the language is handled by the tokenizer. Consequently, a tokenizer that does not depend on the language (e.g. one that splits on whitespace) can be used. However, if a pre-trained deep learning model such as BERT is used, the language it was trained on and the tokenizer it uses have to be considered. In those cases, a model that supports the language of the majority of the documents is used.

4. What happens when humans substitute documents, e.g. a bank statement is used as a substitute for a rental agreement?

Substitution of a document might be no problem for the bank clerk who has to manually extract the rent payment amount out of the bank statement. But the AI decides which classification to make based on the training dataset it has learned. Thus, when a novel document is introduced, the AI needs training on the new document type. Otherwise, the AI may assign it to an incorrect category.

Wrap up on how document classification with a combination of computer vision and NLP actually works

As mentioned above, the model learns per page. For each page, we take the tokens from the text and pass them through an embedding layer. After the embedding layer, a self-attention mechanism considers the embeddings of all other tokens within the sequence and thereby provides the context. The result is a vector encoding the text of the page that takes the relations between the tokens into consideration (NBOW with self-attention). When a category contains unique keywords, the model learns to distinguish that category based on those keywords, because they are represented within the text features of the documents of that category. In addition, every page is visually encoded by a pre-trained image classifier model from whose last layers the feature map is extracted. In the final step, we concatenate the image feature map with the text features and apply a classifier (two fully connected layers and a softmax function) to the result.
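
The final fusion step can be sketched in plain Python as follows. This is an illustration under stated assumptions: the weights would be learned in practice, the function names are hypothetical, and the ReLU between the two fully connected layers is an assumption, since the text only names the two layers and the softmax.

```python
import math


def dense(vector, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def relu(vector):
    """Assumed non-linearity between the two fully connected layers."""
    return [max(0.0, x) for x in vector]


def softmax(logits):
    """Convert logits into category probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_page(text_features, image_features, layer1, layer2):
    """Concatenate both feature vectors and apply the two-layer classifier.

    layer1 and layer2 are (weights, bias) tuples.
    """
    fused = list(text_features) + list(image_features)
    hidden = relu(dense(fused, *layer1))
    return softmax(dense(hidden, *layer2))
```

Because the two branches are only joined by concatenation, either feature extractor can be swapped for one of the alternatives mentioned earlier without changing the classifier code, as long as the size of the first layer matches the fused vector.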

Background of the research

We hope you enjoyed reading about our research. An ad-hoc trainable REST API for any kind of document classification is now available in our document AI platform Konfuzio. Konfuzio provides a modular solution for document classification that allows the use of all models mentioned above out of the box. For every user-defined classification category, the best combination of those models is chosen. Furthermore, the solution is part of the range of tasks provided by the Konfuzio App, which allows further processing of data from classified documents. For example, data becomes accessible through the extraction of individual pieces of information or entire text sections.


[1] Tan, M., & Le, Q. (2019, May). EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (pp. 6105-6114). PMLR.

[2] Stanford Vision Lab (2020). ImageNet.

[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.

[4] Helm & Nagel GmbH (2021). Training documentation: Image modules.

[5] Helm & Nagel GmbH (2021). Training documentation: Text modules.

Photo by Sound On from Pexels