Table Header Detection and Classification using Machine Learning — Aditya Vyas

An important application of machine learning is document processing: extracting key information from documents. This was one of my projects at Shipmnts, a shipping logistics company. The idea was to identify potential table headers in PDFs and then extend the approach to identify table rows. Machine learning suits this problem because most tables follow a regular structure in their rows, and table headers are usually styled to stand out from the rest of the words in a document. These visual cues translate naturally into features for a machine learning model. Some of the features I used were:

Spacing between the words
Boldness of words
Height and width of the words
The font of the words
Amount of whitespace between words in a row
Average length of words
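
To make the feature list concrete, here is a minimal sketch of how one row could be turned into numeric features. It is not the project code: the Word record, its field names, and the row_features helper are all hypothetical, and it assumes a PDF extraction step has already produced word positions, font names, and a bold flag. The font feature is reduced to a count of distinct fonts purely for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Word:
    # Hypothetical record for one word extracted from a PDF page.
    text: str
    x0: float       # left edge
    x1: float       # right edge
    top: float      # top edge
    bottom: float   # bottom edge
    fontname: str
    bold: bool


def row_features(words):
    """Turn one non-empty row (a list of Word objects) into a feature dict."""
    if not words:
        raise ValueError("a row must contain at least one word")
    words = sorted(words, key=lambda w: w.x0)
    # Horizontal gaps between consecutive words in the row.
    gaps = [b.x0 - a.x1 for a, b in zip(words, words[1:])]
    return {
        "mean_gap": mean(gaps) if gaps else 0.0,
        "total_gap": sum(gaps),
        "bold_fraction": mean(1.0 if w.bold else 0.0 for w in words),
        "mean_height": mean(w.bottom - w.top for w in words),
        "mean_width": mean(w.x1 - w.x0 for w in words),
        "mean_word_length": mean(len(w.text) for w in words),
        "distinct_fonts": len({w.fontname for w in words}),
    }


# Made-up example row that looks like a table header.
row = [
    Word("Qty", 10.0, 30.0, 100.0, 110.0, "Helvetica-Bold", True),
    Word("Description", 50.0, 110.0, 100.0, 110.0, "Helvetica-Bold", True),
]
features = row_features(row)
```

Each row then becomes one fixed-length vector, which is the shape a classifier such as a Random Forest expects.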

With no prior dataset available, I had to manually construct a labelled dataset by decomposing the PDFs into their rows and labelling each row (a list of words) as “Header” or “Not a Header”. This is an imbalanced classification problem, because a document contains far fewer header rows than non-header rows, so I tried different sampling techniques:

Undersampling
Oversampling
SMOTE

Of these, SMOTE yielded the best results with a Random Forest model and gave me a 70% recall score.
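
For readers unfamiliar with SMOTE, the core idea is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that idea in plain Python. It is not the project code or a library implementation; the smote function, its parameters, and the sample values are assumptions made up for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the chosen sample (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up feature vectors for three "Header" rows (the minority class).
headers = [[1.0, 0.9], [0.8, 1.0], [0.9, 0.7]]
new_rows = smote(headers, n_new=4, k=2)
```

In practice, oversampling of this kind is normally applied only to the training split, so that the recall score is measured on untouched real rows.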

Note: Due to an NDA with Shipmnts, I cannot link the code.