Unsupervised Key-Value Extraction from Invoices and Contracts using Positional Knowledge

Extracting important information from differently formatted contracts and documents can be challenging. Here we will look into unsupervised methods using simple page geometry for extracting key-value pairs from PDF documents.

Debanjan Chaudhuri
4 min read · Feb 21, 2021
Searching for meaningful knowledge

Table extraction from PDFs is a well-known [1], yet challenging and messy task. For those of us working in industry, the challenge can be daunting given the diverse formats of invoices and bills we receive from different sources. Several solutions [2] [3] are available for table extraction, but most of them assume that the information is laid out in a well-structured format in the PDF, which is rarely the case. Let us look directly at an example account statement.

Sample Account Statement

Although the product descriptions in the account statement are fairly well formatted, most of the important information, such as the IBAN and the account number, is not. Popular commercial tools can easily extract fields such as the date and the balance, but they fail on the other important information on the right-hand side of the layout.

In this blog, we will try to extract such information using simple methods: OCR plus the position of each piece of text with respect to the page layout. In the context of this blog, an example key-value pair is Sort Code: 05–09–59.

Just show me the code, will ya?

Well, there will not be much coding here; I will mostly talk about the algorithm and demonstrate some code snippets in Python so that you can easily implement the unsupervised method yourself.

The main algorithm is based on optical character recognition (OCR) applied to the account statement. OCR recognizes the text in an image along with bounding boxes around it. Some examples of extracted bounding boxes are shown in the image below, taken from [4].

The bounding boxes have X and Y coordinates denoting their position on the page. The key-value extraction method proposed in this blog is based on these X and Y coordinates, i.e. the location of each piece of text in the document image.

In the example above, the leftmost coordinates of the bounding boxes for Sort Code and its corresponding value, 05–09–59, are aligned. We can simply iterate over the different elements (pieces of text) in the document, collect the probable key-value pairs, and then pick out the specific keys required for your use case.
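Concretely, the method only needs, for each piece of recognized text, the string and its bounding box. Here is a toy sketch of that structure as it is assumed in the rest of this post (the coordinate values are invented for illustration):

# Toy OCR elements: each entry carries the recognized text and a
# "left,top,width,height" bounding box string (coordinates are invented).
elements = [
    {'text': 'Sort Code',      'boundingBox': '420,110,80,12'},
    {'text': '05-09-59',       'boundingBox': '421,130,76,12'},   # left edge lines up with 'Sort Code'
    {'text': 'Account Number', 'boundingBox': '420,160,120,12'},
    {'text': '39418977',       'boundingBox': '424,180,112,12'},  # left edge lines up with 'Account Number'
]

A key and its value end up with roughly the same x-centroid, which is exactly what the extraction method below checks for.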

Getting directly into the Implementation

For our implementation we use Azure OCR [5], but you can easily use any OCR of your choice; we just need the coordinates of the bounding boxes for the recognized text.
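As a minimal sketch of that step, assuming the legacy Computer Vision OCR REST endpoint (the endpoint URL, key, and API version below are placeholders, and you may prefer to merge at the word rather than the line level), the response's regions, lines, and words can be flattened into the element format used above:

import requests

AZURE_ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
AZURE_KEY = "<your-subscription-key>"                                   # placeholder

def ocr_elements(image_url):
    """Call the OCR API and merge word boxes into line-level elements."""
    resp = requests.post(
        f"{AZURE_ENDPOINT}/vision/v3.2/ocr",
        headers={"Ocp-Apim-Subscription-Key": AZURE_KEY},
        json={"url": image_url},
    )
    resp.raise_for_status()
    elements = []
    for region in resp.json().get("regions", []):
        for line in region["lines"]:
            elements.append({
                "text": " ".join(w["text"] for w in line["words"]),
                "boundingBox": line["boundingBox"],  # "left,top,width,height"
            })
    return elements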

It should be noted that we extract each key together with a set of probable values. The main method used to extract the probable key-value pairs is shown below:

from collections import defaultdict


def get_prob_ele(elements, x_coord_thr=5.0, y_coord_thr=5.0):
    """
    Get probable key-value pairs from merged OCR output.
    :param elements: list of OCR elements, each with a 'text' field and a
                     'boundingBox' string of the form 'left,top,width,height'
    :param x_coord_thr: maximum x-centroid difference for column (vertical) alignment
    :param y_coord_thr: maximum y-centroid difference for row (horizontal) alignment
    :return: dictionary mapping probable keys to sets of probable values
    """
    ele_dict = defaultdict(set)
    for j, e in enumerate(elements):  # iterate over all elements
        # Strip common punctuation so that labels such as 'Sort Code:' qualify as keys
        e_text = e['text'].replace(' ', '').replace('.', '').replace(':', '').replace('/', '')
        if e_text.isalpha() or e_text.isdigit():  # keep only purely alphabetic or purely numeric keys
            e_coord = [int(c) for c in e['boundingBox'].split(',')]
            # Centroid of the key's box (left + width/2, top + height/2)
            e_centroid_x = e_coord[0] + (e_coord[2] / 2)
            e_centroid_y = e_coord[1] + (e_coord[3] / 2)
            # OCR reads left-to-right, top-to-bottom, so candidate values come after the key
            for n_e in elements[j + 1:]:
                n_e_coord = [int(c) for c in n_e['boundingBox'].split(',')]
                # Centroid of the candidate value's box
                ne_centroid_x = n_e_coord[0] + (n_e_coord[2] / 2)
                ne_centroid_y = n_e_coord[1] + (n_e_coord[3] / 2)
                # If the centroids are (almost) aligned, keep the text as a probable value
                if abs(ne_centroid_y - e_centroid_y) < y_coord_thr:
                    # Same row: the value must lie to the right of the key
                    if e_coord[0] < n_e_coord[0]:
                        ele_dict[e['text']].add(n_e['text'])
                elif abs(ne_centroid_x - e_centroid_x) < x_coord_thr:
                    # Same column: the value must lie below the key
                    if e_coord[1] < n_e_coord[1]:
                        ele_dict[e['text']].add(n_e['text'])

    return ele_dict

There you go! That's the main code to extract the probable key-value pairs. For the above example, the set of probable values extracted by the algorithm for the key Sort Code is: [‘Payment’, ‘£3,001.67’, ‘39418977’, ‘05–09–59’, ‘£3,904.48’, ‘Current’, ‘£654.27’, ‘£1,557.08’, ‘Debits E’]. As you can see, the required value is present in the extracted set. One can easily rank these candidates by the distance between their centroids and the key's centroid, or filter them by expected type (for example, a price should be numeric), as sketched below.
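As a sketch of that ranking step (the helper below is illustrative and not part of the original snippet; it assumes the candidates' bounding boxes are still available), candidates can be filtered by an expected pattern and sorted by centroid distance to the key:

import math
import re

def centroid(bbox):
    """Centre of a 'left,top,width,height' bounding box string."""
    left, top, width, height = (int(c) for c in bbox.split(','))
    return left + width / 2, top + height / 2

def rank_candidates(key_element, candidate_elements, pattern=None):
    """Sort probable values by distance to the key; optionally filter by a regex."""
    kx, ky = centroid(key_element['boundingBox'])
    if pattern is not None:
        candidate_elements = [c for c in candidate_elements
                              if re.fullmatch(pattern, c['text'])]
    return sorted(candidate_elements,
                  key=lambda c: math.dist((kx, ky), centroid(c['boundingBox'])))

# Illustrative usage with made-up boxes: a sort code looks like dd-dd-dd,
# so the aligned, nearby candidate wins.
key = {'text': 'Sort Code', 'boundingBox': '420,110,80,12'}
candidates = [
    {'text': '£3,001.67', 'boundingBox': '200,125,70,12'},
    {'text': '05-09-59',  'boundingBox': '421,130,76,12'},
]
best = rank_candidates(key, candidates, pattern=r'\d{2}-\d{2}-\d{2}')
print(best[0]['text'])  # -> 05-09-59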

It must be noted that, although in this sample case it is relatively easy to get the value from the OCR text (it is simply the next element after the key), in real-world documents there can be a lot of unrelated text between a key and its value in reading order. As long as the key and the value are positionally aligned, the algorithm can still extract the pair, as illustrated below.
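A quick illustration with invented coordinates (assuming get_prob_ele from the snippet above is in scope): unrelated text sits between the key and its value in reading order, yet the column check still pairs them, while the genuinely adjacent pair Payment / £3,001.67 is picked up by the row check.

# Invented example: 'Payment' and '£3,001.67' appear between 'Sort Code'
# and its value in reading order, but positional alignment still recovers the pair.
elements = [
    {'text': 'Sort Code', 'boundingBox': '420,110,80,12'},
    {'text': 'Payment',   'boundingBox': '60,125,70,12'},   # unrelated row in between
    {'text': '£3,001.67', 'boundingBox': '200,125,70,12'},
    {'text': '05-09-59',  'boundingBox': '421,130,76,12'},  # x-aligned with 'Sort Code'
]
print(dict(get_prob_ele(elements)))
# -> {'Sort Code': {'05-09-59'}, 'Payment': {'£3,001.67'}}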

Many commercial models (such as Microsoft Form Recognizer) can also handle such complicated examples, but they typically require additional training; this approach works out of the box, without any training, and only needs a few tweaks for your specific use case.

Happy Extraction!

References

[1] Corrêa, Andreiwid Sheffer, and Pär-Ola Zander. “Unleashing tabular content to open data: A survey on pdf table extraction methods and tools.” Proceedings of the 18th Annual International Conference on Digital Government Research. 2017.

[2] Extract table: https://www.extracttable.com/

[3] Nanonets: https://nanonets.com/

[4] https://jaafarbenabderrazak-info.medium.com/ocr-with-tesseract-opencv-and-python-d2c4ec097866

[5] https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text
