Scanning documents for personal data

The business idea of identification of sensitive data needs no introduction. In my opinion, it is one of the most popular use-cases for Machine Learning. In this simple application (github.com/konvica/read-presonal-data), I leverage several libraries (spacy, presidio, pymupdf, pytesseract, OpenCV, streamlit) to process PDF or image documents and extract all personal and sensitive data like photos of a face or personal information.

PDF files are extracted with the pymupdf library. Each page of pdf is processed sequentially. Named Entity Recognition NER using spacy and presidio is applied on the text of the whole page. Only English is supported. Supported entities are described here. Detected entities and personal information are then visualized in a streamlit app.

Image files and any images extracted from PDF files are scanned for faces using OpenCV and its haarcascade_frontalface detector. Based on pytesseract, presidio OCR with entity recognition is applied on text in image. Found faces and personal text data are highlighted with bounding boxes in the streamlit app.

To set it up you need anaconda or miniconda and to follow instructions in the repository github.com/konvica/read-presonal-data.

Written on June 27, 2021