Automated Regex Generator

Automated Regex Generator in Python based on examples

To adapt Named Entity Recognition (NER) models to a business context, training data is needed. When adapting a model to a new context, our team is very concerned about flaws in the training data: we have seen that even the smallest labeling error in the training data causes significant losses in accuracy and F1-Score. In this article, we show how we control the quality of our training data by summarizing all annotations via automatically generated Regex. We evaluate our approach on five real-world use cases for extracting information from digitized business documents.

How to generate Regex from strings with Python?

If you are looking for the code, please visit our Python GitHub Konfuzio SDK. Using the SDK allows you to replicate the analysis below using your own data. Disclaimer: Helm & Nagel GmbH, a company based in Germany, develops the software Konfuzio.com, which was used in the video and to create this blog post.

If you want to dive into an example, watch the video:

Why do we care about automating Regex in Python?

Even though NER models have become powerful tools over the past decade, adapting those models requires high-quality, domain-specific training data. In addition, after the initial training, the amount of training data in the production environment keeps growing. Commonly, practitioners want to understand which complexity the training data exposes to the model and which patterns experts are aware of but are not yet covered by the training data.

Maintaining high-quality NER data for more than 1,500 NLP data sets of clients is especially hard, as the software Konfuzio allows users to provide continuous feedback, see the video. To reduce the manual work needed to understand data sets of clients quickly and across languages, we created an automated approach.

Per NER entity, we automatically summarize the data by a set of Regex.

Example of how the visual revision of training data looks

In general, we aim for explicit knowledge representation in training data to enable rapid development, (semi-)explainable AI, and ease of maintenance. The easier we can summarize the data, the easier we can detect errors or explain edge cases of the AI to our customers. Years ago, we implemented a multi-stage visual revision process with a user-friendly interface, as shown in the gif below.

NLP Annotation of Bank Account PDF

Even though the visual revision process lowers error rates, annotation errors still exist. This is especially true if multiple subject-matter experts annotate the data. The visual approach improves the consistency of labeling within a business document but hardly helps to detect annotation issues across documents.

  1. In this article, we go one step further than visual analysis. We summarize domain-specific NLP data using automated Regex. We show on five data sets how data scientists can manually review hundreds of annotations within seconds using this Regex-based summarization of the training and test data.

  2. In addition, we provide a script to use Regex, either manually or automatically created, to annotate training data automatically with a minimum number of examples (N=1). Based on our experience, the cost to review pre-annotated text data drops by ca. 50 %. Using this small-data approach to pre-annotate text data reduces the cost of providing high-quality data to Deep Learning algorithms like Named Entity Recognition (NER); a simplified sketch follows this list.

  3. Finally, our Regex approach can be exploited to create, monitor and optimize Regex automatically to then use them in traditional IT systems in a non-invasive way.
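
To illustrate the pre-annotation idea, here is a minimal sketch (not the Konfuzio SDK implementation) that derives a generalized Regex from a single annotated example and uses it to pre-annotate new text:

import re

def suggest_regex(example: str) -> str:
    """Replace digits by \\d and escape everything else (simplified N=1 strategy)."""
    return "".join(r"\d" if char.isdigit() else re.escape(char) for char in example)

pattern = suggest_regex("31.09.21")  # '\\d\\d\\.\\d\\d\\.\\d\\d'
print(re.findall(pattern, "Rechnungsdatum: 01.10.21, Lieferdatum 05.10.21"))
# ['01.10.21', '05.10.21']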

Auto Regex summarizes information in an abstract but consistent way

In theoretical computer science, a regular expression (abbreviation RegExp or Regex) is defined as a character string that is used to describe sets of character strings with the help of certain abstract rules. In addition to implementations in many programming languages, many text editors also process regular expressions in the “Find and Replace” function. In use cases where experts aim to extract information from or classify digitized business documents, we have seen two common patterns they use.

1. Use domain-specific keywords and phrases as signals for relevant information

Pretend that an expert tries to classify invoices and distinguish them from other documents. Invoices contain mandatory contents which must be included for an invoice to be legally correct. An invoice is not the same as a receipt, which is an acknowledgment of payment. In this case, experts can use keyphrases so that any document which contains “invoice” or “Invoice” will be classified as an invoice. Furthermore, a keyword approach can be used to identify the information which needs to be extracted, like EURO, Euro, EUR, or € for the currency of an invoice. Suppose that an expert aims to extract the currency. A potential Regex could be (EURO|Euro|EUR|€).
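
As a minimal sketch, both keyword ideas above can be expressed directly in Python (the sample text is illustrative):

import re

document_text = "Invoice No. 2021-42\nTotal amount: 5,533.05 EUR"

# Any document containing "invoice" or "Invoice" is classified as an invoice.
is_invoice = re.search(r"(invoice|Invoice)", document_text) is not None

# Keyword approach for the currency of an invoice.
currencies = re.findall(r"(EURO|Euro|EUR|€)", document_text)

print(is_invoice, currencies)  # True ['EUR']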

2. Identify previously unknown information to be extracted from documents

If the information in the text is unknown beforehand, keyphrases are not helpful. Suppose that an expert aims to extract the date of the invoice. As the date itself and its format are unknown, the expert needs to define rules which match the information to be extracted from the document without knowing it beforehand. Assume you look at three invoices: the first contains the date 31.9.2021, the second contains the invoice date 1.09.21, and the third contains the same date as the first invoice, written as 31.09.21. As the dates are unknown, the expert can define a so-called Regex: use the first date which matches [0-3]*[0-9]\.[01]?[0-9]\.\d{2,4}, i.e. a date that is formatted like 31.9.2021, 1.09.21, or 31.09.21.
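
A minimal sketch of this date Regex in Python (the invoice texts are illustrative):

import re

date_regex = re.compile(r"[0-3]*[0-9]\.[01]?[0-9]\.\d{2,4}")

for text in ["Rechnungsdatum: 31.9.2021", "Datum: 1.09.21", "31.09.21"]:
    match = date_regex.search(text)
    print(match.group(0) if match else None)
# 31.9.2021
# 1.09.21
# 31.09.21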

In addition, both approaches can be combined: the keywords can be used as a prefix or suffix to the information to be extracted, like any number followed by the Euro currency to identify potential payment amounts. Suppose that an expert aims to extract all payment amounts in Euro. A potential Regex could be ([\d,\.]+[\.,]\d\d) (EURO|Euro|EUR|€).
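
A minimal sketch of the combined Regex in Python (the amounts are illustrative):

import re

amount_regex = re.compile(r"([\d,\.]+[\.,]\d\d) (EURO|Euro|EUR|€)")

for amount, currency in amount_regex.findall("Zwischensumme 5.533,05 EUR, Gesamtbetrag 44,54 Euro"):
    print(amount, currency)
# 5.533,05 EUR
# 44,54 Euro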

Regex are hard to create, maintain, deduplicate and evaluate

When a Regex is created, experts iterate between looking at the data which needs to be extracted, e.g. the currency, writing the rules, and testing them. This manual approach goes hand in hand with several challenges:

  • Hard to create: Writing Regex is time-consuming and manual.
  • Hard to evaluate: When should a keyword be used and when a more abstract Regex? Normally, keywords provide a low false-positive rate and a high false-negative rate, whereas a more abstract Regex provides a high false-positive rate and a low false-negative rate. That is to say, keywords match only what they were made for, while a Regex provides a more generalized matcher (see the evaluation sketch after this list). Irrespective of recall and precision, the Regex should run fast.
  • Hard to maintain: Over time it is unclear when to update a previously created Regex. It is also hard to understand which examples were used to create a Regex initially. Think of ([\d,\.]+[\.,]\d\d) (EURO|Euro|EUR|€): it can match 5,533.05 Euro, 5.533,05 €, or 44.54 EURO. Over time, it might be hard to comprehend whether the decimal delimiter . was intended to be supported.
  • Hard to deduplicate: Regex which match identical information are hard to identify and often cause the expert to maintain every Regex ever written. Without examples, it might be hard to find out whether two Regex should be merged or whether two separate Regex provide a higher precision.
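
To make the evaluation challenge concrete, the following minimal sketch (with hypothetical data structures, not the Konfuzio SDK) scores a candidate Regex against annotated spans, similar to the Recall and Precision columns reported in the tables below:

import re

# Hypothetical annotations: (document text, set of annotated substrings for one Label).
annotations = [
    ("Betrag: 5.533,05 EUR", {"EUR"}),
    ("Summe 44,54 Euro", {"Euro"}),
]

def evaluate(pattern: str, annotations) -> dict:
    regex = re.compile(pattern)
    tp = fp = fn = 0
    for text, gold in annotations:
        # Simplified: compare matched substrings instead of exact character spans.
        matches = {m.group(0) for m in regex.finditer(text)}
        tp += len(matches & gold)
        fp += len(matches - gold)
        fn += len(gold - matches)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

print(evaluate(r"(EURO|Euro|EUR|€)", annotations))
# {'recall': 1.0, 'precision': 1.0, 'f1': 1.0}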

Automated Regex generation to identify inconsistencies in NLP data

  1. Load all annotations available. In the following scenario, we define a Label as a name of a group of individual pieces of information annotated in a type of document. An Annotation is a set of characters and/or bounding boxes that a label has been assigned to. In the Appendix, you find Technical details about the data available per Annotation and Label. The following image illustrates that a user has labeled Mustermann GmbH as Entity Name, i.e. Entity Name is the Label of the Annotation Mustermann GmbH. Annotation in Documents
  2. Run the Genetic Algorithm on those annotations so that the generated Regex match all available annotations. Based on the approach proposed by Bartoli et al. (2019), we developed an improved approach in Python. 1 To summarize this approach: use different strategies to create at least one Regex per Annotation, try to deduplicate the list of Regex, and optimize the runtime. Finally, make sure the deduplicated list is able to extract all available annotations. To make sure we understand which Annotation was used to create a Regex, we use a standardized Regex structure (see the sketch after this list). Take (?P<currency_W_772213>EUR) as an example:
    • currency: When generating the Regex we keep the Label of the Annotation so that we can compare multiple Regex for one Label
    • W: Per annotation we create multiple Regex. The capital letter refers to the strategy used to create the Regex.
    • 772213: We keep the ID of the original annotation which was used to create the Regex. Later on, this will help to detect issues in the training data.
    • EUR: Finally, the Regex to extract the text. In this case, it is a keyword approach that will simply match any EUR.
  3. Review the list of Regex and find Annotations that cause non-intuitive Regex
  4. Correct those Annotations
  5. Use the Regex to pre-annotate further documents and have them reviewed by subject-matter experts. You can find a code example on GitHub.
  6. Sleep, eat and repeat…
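
The standardized group names make every match traceable. A minimal sketch of how generated Regex can be combined and each match traced back to its originating Annotation (combining them into one alternation is our illustration, not necessarily how the SDK does it):

import re

# Two generated Regex for the Label "currency", combined into one pattern.
generated_regex = re.compile(r"(?P<currency_W_772213>EUR)|(?P<currency_W_772660>€)")

for match in generated_regex.finditer("Gesamtbetrag 44,54 EUR"):
    for group_name, value in match.groupdict().items():
        if value is None:
            continue
        label, strategy, annotation_id = group_name.rsplit("_", 2)
        print(label, strategy, annotation_id, value)
# currency W 772213 EUR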

We provide the algorithm as a REST API service to selected experts. On request, we grant experts access to the API free of charge for testing purposes. Feel free to contact us.

Results: Data-centric improvements of NER models using automated Regex

We show which data-centric improvements can be derived from our approach. To this end, we use five real-world data sets that were used to train NER models extracting information from business documents.

Background on the document extraction: Traditionally, Optical Character Recognition (OCR) was defined as a technique to recover text information from image files or scanned documents. In recent years, the term OCR has become omnipresent and is used in much broader ways. Providers of OCR, so-called Input Management software vendors, use the term OCR to describe further use cases on top of the recovered text. Today, OCR or Input Management refers to activities that capture business-relevant data (“content”) and connect this data to subsequent business applications.

Data Set 1: Extracting information from invoices with a 95 % F1-Score can even be improved

This data set was used to train a Document AI which extracts information from German invoices. In the Appendix, Table A.1 provides a detailed overview of all Labels in the project. The Label Currency reached the highest F1-Score of 84.8 % among all Labels in the project. The automated Regex approach summarized 144 Annotations into 4 Regex, see Table 1. After manually reviewing the Regex, we find the following ways to improve the training data and thereby the Document AI:

  1. There is one wrong annotation with ID 764309 that does not extract a currency.
  2. The dataset is currently limited to Euro currency.

Table 1: 144 Currency Annotations summarized by 4 automatically generated Regex

Regex Runtime Recall Precision F1-Score Add. Matches
(?P<currency_W_772660>€) 0.0000 0.5417 0.7647 0.6341 78
(?P<currency_W_772213>EUR) 0.0000 0.4444 0.7619 0.5614 64
(?P<currency_N_764309>\d\,\d\d) 0.0000 0.0069 0.0043 0.0053 1
(?P<currency_F_774548>[A-ZÄÖÜ][a-zäöüß]+) 0.0003 0.0069 0.0003 0.0006 1

Data Set 2: Coverage inquiries

This data set was used to train a Document AI which extracts information from coverage inquiries. In the Appendix, Table A.2 provides a detailed overview of all Labels in the project. The Label VN_MNR, i.e. the membership number of the policyholder, reached the highest F1-Score of 88.9 % among all Labels in the project. The automated Regex approach summarized 34 Annotations into 5 Regex, see Table 2. After manually reviewing the Regex, we find that the insurance company seems to use a standardized 10-digit format. Any change in this format will heavily affect the extraction results for this Label.

Table 2: 34 Membership number Annotations summarized by 5 automatically generated Regex

Regex Runtime Recall Precision F1-Score Add. Matches
(?P<Deckungsanf_VN__MNR_N_4504443>\d\d\d\d\d\d\d\d\d\d) 0.0001 0.8824 0.2752 0.4196 30
(?P<Deckungsanf_VN__MNR_N_4494481>\d\d\d[ ]+\d\d\d[ ]+\d\d[ ]+\d\d) 0.0001 0.0294 0.1667 0.0500 1
(?P<Deckungsanf_VN__MNR_N_4492655>\d\d\d\d\d\d\d\d\d\d\d) 0.0001 0.0294 0.0213 0.0247 1
(?P<Deckungsanf_VN__MNR_N_4314192>\d\d\d\d\d\d\d\d\d) 0.0001 0.0294 0.0072 0.0116 1
(?P<Deckungsanf_VN__MNR_N_4505807>\d\d\d\d\d\d\d) 0.0001 0.0294 0.0039 0.0068 1

Data Set 3: Hearing forms sent by the police

This data set was used to train a Document AI which extracts information from hearing forms sent by the police. In the Appendix, Table A.3 provides a detailed overview of all Labels in the project. The Label “Schadendatum”, i.e. damage date, reached the highest F1-Score of 100 % among all Labels in the project. The automated Regex approach summarized 60 Annotations into 3 Regex, see Table 3. After manually reviewing the Regex, we find the following:

  1. The AI will be very confident when extracting damage dates with the format DD.MM.YYYY
  2. The document AI can be improved by adding more damage dates with formats DD. Month YYYY and DD.MM.YY.

Table 3: 60 Damage Date Annotations summarized by 3 automatically generated Regex

Regex Runtime Recall Precision F1-Score Add. Matches
(?P<Schadendatum_N_4000104>\d\d\.\d\d\.\d\d\d\d) 0.0000 0.9667 0.1908 0.3187 58
(?P<Schadendatum_N_3999808>\d\d\.[ ]+November[ ]+\d\d\d\d) 0.0000 0.0167 0.2000 0.0308 1
(?P<Schadendatum_N_4506647>\d\d\.\d\d\.\d\d) 0.0000 0.0167 0.0032 0.0054 1

Data Set 4: Bank account statement across all major banks in Germany

This data set was used to train a Document AI which extracts bank account statements from Banks in Germany. In the Appendix, Table A.4 provides a detailed overview of all Labels in the project. The Label IBAN reached the highest F1-Score of 98.6 % among all Labels in the project. The automated Regex approach summarized 186 Annotations into 16 Regex, see Table 4. After manually reviewing the Regex we find the following:

  1. The OCR quality seems to be high as 132 of 186 IBAN Annotations follow the standardized IBAN structure incl. whitespaces (?P<IBAN_F_655016>[A-ZÄÖÜ][A-ZÄÖÜ]\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d).
  2. Multiple Annotations are wrong, i.e. (?P<IBAN_W_824432>EUR) or (?P<IBAN_N_819527>\d\d\.\d\d\.\d\d\d\d).

Table 4: 186 IBAN Annotations summarized by 16 automatically generated Regex

Regex Runtime Recall Precision F1-Score Add. Matches
(?P<IBAN_F_655016>[A-ZÄÖÜ][A-ZÄÖÜ]\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d) 0.0001 0.7097 0.9706 0.8199 132
(?P<IBAN_N_654643>DE\d\d \d\d\d\d \d\d\d\d \d\d\d\d \d\d\d\d \d\d) 0.0000 0.1129 0.9130 0.2010 21
(?P<IBAN_N_709835>DE\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d) 0.0000 0.0591 0.4074 0.1033 11
(?P<IBAN_W_824432>EUR) 0.0000 0.0269 0.0153 0.0195 5
(?P<IBAN_F_798198>[A-ZÄÖÜ][A-ZÄÖÜ]\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d) 0.0001 0.0161 0.0216 0.0185 3
(?P<IBAN_N_654939>LT\d\d \d\d\d\d \d\d\d\d \d\d\d\d \d\d\d\d ) 0.0000 0.0108 1.0000 0.0213 2
(?P<IBAN_F_655032>[A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ]\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d) 0.0001 0.0108 0.6667 0.0212 2
(?P<IBAN_F_654961>[A-ZÄÖÜ][A-ZÄÖÜ]\d\d \d\d\d\d \d\d\d\d \d\d\d\d \d\d\d\d \d\d ) 0.0000 0.0108 0.0870 0.0191 2
(?P<IBAN_F_681873>[A-ZÄÖÜ][A-ZÄÖÜ]\d\d \d\d\d\d \d\d\d\d \d\d\d\d \d\d\d\d \d\d\d\d) 0.0000 0.0054 1.0000 0.0107 1
(?P<IBAN_F_654948>[A-ZÄÖÜ][A-ZÄÖÜ]\d\d [A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ] \d\d\d\d \d\d\d\d \d\d\d\d \d\d) 0.0000 0.0054 1.0000 0.0107 1
(?P<IBAN_F_655377>[A-ZÄÖÜ][A-ZÄÖÜ]\d\d[ ]{2,}[A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ][ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d) 0.0001 0.0054 1.0000 0.0107 1
(?P<IBAN_F_655392>[A-ZÄÖÜ][A-ZÄÖÜ]\d\d[ ]+[A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ][ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]{2,}\d\d) 0.0001 0.0054 1.0000 0.0107 1
(?P<IBAN_F_819473>[A-ZÄÖÜ][A-ZÄÖÜ]\d[A-ZÄÖÜ]\d[ ]+\d\d\d\d[ ]{2,}\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d) 0.0001 0.0054 1.0000 0.0107 1
(?P<IBAN_N_681878>EE\d\d \d\d\d\d \d\d\d\d \d\d\d\d \d\d\d\d) 0.0000 0.0054 1.0000 0.0107 1
(?P<IBAN_N_671398>BE\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d[ ]+\d\d\d\d) 0.0000 0.0054 1.0000 0.0107 1
(?P<IBAN_N_819527>\d\d\.\d\d\.\d\d\d\d) 0.0001 0.0054 0.0017 0.0026 1

Data Set 5: Detect the Contract Type

This data set was used to train a Document AI which extracts rent agreements. In the Appendix, Table A.5 provides a detailed overview of all Labels in the project and the evaluation of the NER model on the test data. The Label “Vertragsbezeichnung”, i.e. type of contract, reached the highest F1-Score of 100 % among all Labels in the project. The automated Regex approach summarized 22 Annotations into 5 Regex, see Table 5. After manually reviewing the Regex, we find the following:

  1. The dataset seems to be dominated by two contract types as the Keyphrase approach creates precise matches incl. a high recall.
  2. Annotation 4306665 is the only multi-line annotation.

Table 5: 22 contract type Annotations summarized by 5 automatically generated Regex

Regex Runtime Recall Precision F1-Score Add. Matches
(?P<Vertragsbezeichnung_W_4303006>MIETVERTRAG) 0.0001 0.6818 0.7895 0.7317 15
(?P<Vertragsbezeichnung_W_4301825>PKW[-]Stellplatz[-]Mietvertrag) 0.0001 0.1364 0.7500 0.2308 3
(?P<Vertragsbezeichnung_W_4306798>STELLPLATZMIETVERTRAG) 0.0001 0.0909 1.0000 0.1667 2
(?P<Vertragsbezeichnung_F_4306665>[A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ][A-ZÄÖÜ]\n[ ]{2,}[a-zäöüß]+[ ]+[a-zäöüß]+[ ]+[A-ZÄÖÜ][a-zäöüß]+) 0.0007 0.0455 1.0000 0.0870 1
(?P<Vertragsbezeichnung_F_4306695>[A-ZÄÖÜ][a-zäöüß]+) 0.0018 0.0455 1.0000 0.0870 1

The expectation that the legal description of the type of contract is somewhat standardized, e.g. “Rent Agreement”, presumably needs to be corrected. When looking at Annotation 4306665 in detail, it consists of two parts, Mietvertrag and für Kontore, gewerbliche Räume und Grundstücke. For this Annotation, it might be good to talk to the subject-matter expert, as the question arises whether the type of the contract, “Mietvertrag”, i.e. rent agreement, can be separated from the type of the real estate to be rented.

Using the regex we can calculate the Bounding Box for the characters 54 to 65 and 93 to 139.
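
A minimal sketch of how the character offsets are obtained from the Regex matches; mapping the offsets to bounding boxes requires the character coordinates returned by the OCR and is omitted here (the document text is illustrative):

import re

contract_regex = re.compile(r"(?P<Vertragsbezeichnung_W_4303006>MIETVERTRAG)")

document_text = "MIETVERTRAG\n  für gewerbliche Räume und Grundstücke"
for match in contract_regex.finditer(document_text):
    print(match.start(), match.end(), match.group(0))
# 0 11 MIETVERTRAG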

Extract Rent Agreements with NLP

Limitations of automated Regex in Python

  1. The Regex approach performs best on individual pieces of information. The runtime to build and evaluate Regex for information spanning multiple lines \n or even multiple pages \f limits the approach in production. A Faster R-CNN approach provides a more appropriate way to differentiate titles, paragraphs, tables, images, and lists. In previous posts, we reported how Computer Vision can be used to detect the page layout 2 and how to use those abstract visual features to improve the accuracy of document classification 3.
  2. A Regex lacks information about the visual position. Even though we can calculate the bounding box and page number for a given text, see GitHub, we do not use this information to compare annotations. Further research might incorporate visual features of NLP annotations in documents.

To the reader

Independent of our contribution on how to improve the consistency of NLP data, we would like to add three remarks:

  1. The Regex approach allows analyzing text in a foreign language. The conclusions drawn from the analysis of data sets 2 to 5 were based on the Regex, even though the annotations were created in German.
  2. The Regex approach will not replace text embeddings. One can see in the Appendix that the F1-Score of a Regex is always outperformed by the AI model.
  3. We provide the algorithm as a REST API service to selected experts. On request, we grant experts access to this API free of charge for testing purposes. If you are eager to automate Regex, feel free to contact us incl. a summary of your use case.

Appendix

Appendix: Technical details about the data available per Annotation and Label.

The Label Entity Name with ID 836 is used in the project to extract German invoices and is defined as “Name of the legal entity”. There will be no type conversion as the Label is defined to be text.

{
  "id": 39,
  "name": "Rechnung (German)",
  "labels": [
    ...
    {
      "id": 836,
      "text": "entity name",
      "text_clean": "entity_name",
      "description": "Name of the legal entity",
      "get_data_type_display": "Text",
      ...
    },

The Annotation Mustermann GmbH with ID 29319840 starts at character 258 and ends at 274 in document expense_c19d7305-5d45-4aba-9c4e-f2c1c3abda91.pdf. We use the ID of the Annotation later on to trace errors in the training data.

{
  "id": 78639,
  "project": 39,
  "number_of_pages": 1,
  "callback_url": null,
  "sync": false,
  "file_url": "/doc/show/78639/",
  "data_file_name": "expense_c19d7305-5d45-4aba-9c4e-f2c1c3abda91.pdf",
  "labels": {
    ...
    "groups": {
      "Vendor": [
        ...
        "entity name"
        :
        {
          "extractions": [
            {
              "id": 29319840,
              "value": "Mustermann  GmbH",
              "correct": null,
              "accuracy": 0.146,
              "bbox": {
                "bottom": 137.58,
                "page_index": 0,
                "top": 126.602,
                "x0": 69.36,
                "x1": 164.753,
                "y0": 704.34,
                "y1": 715.318,
                "line_index": 1
              },
              "start_offset": 258,
              "end_offset": 274
            }
          ]
        },
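
As a minimal sketch (with a shortened stand-in text), the start_offset and end_offset of an Annotation point directly into the document text, so the annotated value can be recovered and checked; in the real document the Annotation with ID 29319840 spans characters 258 to 274:

document_text = "Rechnung von Mustermann  GmbH, Musterstrasse 1"  # stand-in for the OCR text
start_offset, end_offset = 13, 29
print(document_text[start_offset:end_offset])  # Mustermann  GmbH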

Table A.1: Extract information from invoices

Label Accuracy Balanced Accuracy F1-score Precision Recall
general/all annotations 94.7% 53.6% 94.5% nan% nan%
all annotations except TP of NO_LABEL 47.9% 47.1% 48.5% nan% nan%
NO_LABEL nan% nan% 97.2% 97.4% 96.9%
currency nan% nan% 84.8% 81.7% 88.1%
date nan% nan% 76.2% 94.1% 64.0%
name nan% nan% 71.4% 60.4% 87.3%
price nan% nan% 64.6% 70.5% 59.6%
cip code nan% nan% 61.5% 72.7% 53.3%
number nan% nan% 57.1% 47.1% 72.7%
gross amount nan% nan% 53.7% 64.7% 45.8%
house number nan% nan% 53.3% 57.1% 50.0%
place nan% nan% 50.0% 52.4% 47.8%
street nan% nan% 48.9% 50.0% 47.8%
quantity nan% nan% 43.9% 47.4% 40.9%
entity name nan% nan% 15.4% 8.3% 100.0%

Table A.2: Extract information from coverage inquiries

Label Accuracy Balanced Accuracy F1-score Precision Recall
general/all annotations 98.9% 46.9% 98.6% nan% nan%
all annotations except TP of NO_LABEL 43.2% 38.6% 42.7% nan% nan%
NO_LABEL nan% nan% 99.4% 99.7% 99.1%
Deckungsanf_VN__MNR nan% nan% 88.9% 100.0% 80.0%
Deckungsanf_Absender_Adr.__PLZ nan% nan% 85.7% 75.0% 100.0%
Deckungsanf_Absender_Adr.__Ort nan% nan% 75.0% 75.0% 75.0%
Deckungsanf_Abs_Bank__IBAN nan% nan% 72.7% 80.0% 66.7%
Deckungsanf_Absender_Adr.__Postfachnummer nan% nan% 66.7% 100.0% 50.0%
Deckungsanf_Absender_Adr.__Hausnummer nan% nan% 50.0% 33.3% 100.0%

Table A.3: Extract information from police hearing forms

Label Accuracy Balanced Accuracy F1-score Precision Recall
general/all annotations 99.7% 88.8% 99.7% nan% nan%
all annotations except TP of NO_LABEL 73.4% 83.2% 70.4% nan% nan%
Schadendatum nan% nan% 100.0% 100.0% 100.0%
Anhörungsbogen_Fahrzeug__Fahrzeugtyp nan% nan% 100.0% 100.0% 100.0%
Anhörungsbogen_Fahrzeug__Kennzeichen nan% nan% 100.0% 100.0% 100.0%
Anhörungsbogen_Empfänger__Anrede nan% nan% 100.0% 100.0% 100.0%
Anhörungsbogen_Verfahren__Tatbestandsnummer nan% nan% 100.0% 100.0% 100.0%
Anhörungsbogen_Empfänger__Geburtsdatum nan% nan% 100.0% 100.0% 100.0%
NO_LABEL nan% nan% 99.8% 99.8% 99.9%
Anhörungsbogen_Empfänger__Nachname nan% nan% 85.7% 90.0% 81.8%
Anhörungsbogen_Verfahren__Tatbestand nan% nan% 85.7% 100.0% 75.0%
Anhörungsbogen_Empfänger__Straße nan% nan% 83.3% 83.3% 83.3%
Anhörungsbogen_Empfänger__PLZ nan% nan% 80.0% 100.0% 66.7%
Anhörungsbogen_Empfänger__Hausnummer nan% nan% 80.0% 66.7% 100.0%
Anhörungsbo_Absender_Adr.__PLZ nan% nan% 72.7% 100.0% 57.1%
Anhörungsbogen_Empfänger__Vorname nan% nan% 72.7% 66.7% 80.0%
Anhörungsbo_Absender_Adr.__Postfachnummer nan% nan% 66.7% 100.0% 50.0%
Anhörungsbogen_Empfänger__Ort nan% nan% 66.7% 75.0% 60.0%
Anhörungsbo_Absender_Adr.__Firma/Behörde nan% nan% 57.1% 50.0% 66.7%
Anhörungsbo_Absender_Adr.__Ort nan% nan% 50.0% 66.7% 40.0%

Table A.4: Extract information from bank account statements

Label Accuracy Balanced Accuracy F1-score Precision Recall
general/all annotations 97.7% 94.8% 97.8% nan% nan%
all annotations except TP of NO_LABEL 94.6% 87.8% 92.8% nan% nan%
NO_LABEL nan% nan% 98.6% 97.3% 99.9%
IBAN nan% nan% 98.6% 97.6% 99.6%
Geschäftsvorfall nan% nan% 97.9% 97.6% 98.3%
Zahlungspartner nan% nan% 97.8% 98.1% 97.6%
Wertstellungsdatum nan% nan% 97.1% 99.2% 95.2%
Verwendungszweck nan% nan% 96.7% 99.6% 94.0%
Soll Einzelbetrag (-) nan% nan% 96.0% 99.0% 93.2%
Buchungsdatum nan% nan% 93.4% 100.0% 87.6%
Kontostand zu Beginn nan% nan% 91.4% 94.1% 88.9%
Haben Einzelbetrag (+) nan% nan% 91.1% 95.3% 87.2%
Währung nan% nan% 90.1% 94.3% 86.2%
Beginn nan% nan% 90.0% 90.0% 90.0%
Letzter Kontostand nan% nan% 89.5% 85.0% 94.4%
Ende nan% nan% 85.1% 80.0% 90.9%

Runtime in seconds to generate the Regex per Label in this data set:

Label Runtime (s)
Kontostand zu Beginn 46.80
Letzter Kontostand 13.38
Geschäftsvorfall 35.09
Buchungsdatum 27.70
Soll Einzelbetrag (-) 26.56
Währung 14.79
Haben Einzelbetrag (+) 22.52
Beginn 5.36
Ende 6.26
IBAN 34.41
Zahlungspartner 643.44
Verwendungszweck 2619.23
Wertstellungsdatum 44.22

Table A.5: Extract information from rent agreements

Label Accuracy Balanced Accuracy F1-score Precision Recall
general/all annotations 100.0% 72.3% 100.0% nan% nan%
all annotations except TP of NO_LABEL 74.7% 64.6% 77.2% nan% nan%
Vertragsbezeichnung nan% nan% 100.0% 100.0% 100.0%
Typ nan% nan% 100.0% 100.0% 100.0%
NO_LABEL nan% nan% 100.0% 100.0% 100.0%
PLZ nan% nan% 95.2% 100.0% 90.9%
Ort nan% nan% 95.2% 100.0% 90.9%
IBAN Zahlungskonto nan% nan% 85.7% 75.0% 100.0%
Hausnummer nan% nan% 84.2% 80.0% 88.9%
Straße nan% nan% 82.4% 70.0% 100.0%
Vertreten_durch nan% nan% 80.0% 75.0% 85.7%
Mieternummer nan% nan% 66.7% 100.0% 50.0%
Bruttomiete nan% nan% 57.1% 40.0% 100.0%

Resources for Regex Generation and Active Learning

Photo by Francesco Ungaro on Pexels