Denoising Noisy Documents

Computer Vision Techniques

Chinmay Wyawahare
Towards Data Science

Image by author

Numerous scientific papers, historical documents and artifacts, recipes, and books exist only on paper, whether handwritten or typewritten. Over time, these papers accumulate noise and dirt through fingerprints, weakening of the paper fibers, coffee and tea stains, abrasions, wrinkling, etc. Several surface cleaning methods are used for both preservation and cleaning, but they have certain limits, the major one being that the original document might get altered during the process.

Michael Lally, Kartikeya Shukla, and I worked on a data set of noisy documents from the UC Irvine NoisyOffice Data Set. Denoising dirty documents enables the creation of higher-fidelity digital recreations of the originals. Several denoising methods (Median Filtering; Edge Detection, Dilation & Erosion; Adaptive Thresholding; Linear Regression; and Autoencoding) are applied to a test dataset, and their results are evaluated, discussed, and compared.

Source: UCI

Median Filtering

Median filtering is the simplest denoising technique here, and it follows two basic steps: first, estimate the “background” of an image using a median filter with a kernel size of 23 × 23; then subtract that background from the image. Only the “foreground” remains, clear of any noise that existed in the background. In this context, the “foreground” is the text or significant details of the document, and the “background” is the noise and the white space between document elements.
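A minimal sketch of this two-step pipeline, using `scipy.ndimage.median_filter` as the median-filter implementation (the 23 × 23 kernel size comes from the text; the tiny synthetic “document” is purely illustrative):

```python
import numpy as np
from scipy.ndimage import median_filter

def denoise_median(img, kernel=23):
    """Estimate the background with a median filter, then subtract it."""
    background = median_filter(img, size=kernel)
    # Dark text sits below the background level, so the difference
    # isolates the foreground; re-invert to get dark text on white.
    diff = background - img
    return np.clip(255 - diff, 0, 255)

# Synthetic "document": light page (200) with a small dark mark (50).
page = np.full((100, 100), 200.0)
page[48:51, 48:51] = 50.0

clean = denoise_median(page)
```

Because the 3 × 3 mark occupies only 9 of the 529 pixels in each 23 × 23 window, the median still sees the page level, so the background estimate stays flat and the mark survives the subtraction.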

Median Filtering Pipeline

Edge Detection, Dilation & Erosion (EDE Method)

Edge detection methods identify the points where the image brightness changes sharply, to organize them into edges.

Canny edge detection is particularly helpful for extracting these edges.

Before applying these techniques, the images were preprocessed to separate text edges from noise edges. First apply dilation, which makes lines thicker by adding pixels to their boundaries; notice this “fills in” the text, while the edges surrounding stains remain hollow. Then, by applying the reverse operation, erosion, one can completely remove the thin lines while preserving the thicker ones.
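The dilate-then-erode trick can be sketched with binary morphology from SciPy. The Canny step is omitted here (OpenCV’s `cv2.Canny` or `skimage.feature.canny` would supply the edge map), and the structuring-element sizes are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

# Toy edge map: a 1-pixel-thin "stain outline" and an 11-pixel-thick
# filled "text stroke".
edges = np.zeros((50, 50), dtype=bool)
edges[5:45, 10] = True          # thin line (hollow stain edge)
edges[20:31, 30:41] = True      # thick filled region (text)

# Dilation thickens every line by one pixel on each side.
dilated = binary_dilation(edges, structure=np.ones((3, 3)))

# Erosion with a larger structuring element wipes out anything that
# stayed thin, while the filled text region survives (shrunken).
eroded = binary_erosion(dilated, structure=np.ones((7, 7)))
```

The thin line becomes 3 pixels wide after dilation, still narrower than the 7 × 7 erosion window, so it disappears; the filled block is wide enough to survive.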

Edge Detection Pipeline

Adaptive Thresholding

Another characteristic of the dirty images is that the text tends to be darker than the noise; even inside a dark stain, the text remains darker still.

Thus the objective is to preserve pixels that are the darkest locally

Thresholding sets every pixel whose intensity is above a threshold to 1 (background) and the remaining pixels to 0 (foreground). With adaptive thresholding there is no single global threshold: a threshold value is computed for each pixel. To determine the threshold, we use Gaussian thresholding: the threshold for a pixel is the weighted sum of the neighboring pixel intensities, where the weights form a Gaussian window.
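A sketch of Gaussian adaptive thresholding, using `scipy.ndimage.gaussian_filter` to produce the per-pixel Gaussian-weighted neighborhood mean (OpenCV’s `cv2.adaptiveThreshold` with `ADAPTIVE_THRESH_GAUSSIAN_C` implements the same idea; the sigma and offset values here are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def adaptive_gaussian_threshold(img, sigma=5.0, offset=10.0):
    """Each pixel's threshold is its Gaussian-weighted local mean minus
    a small offset; pixels above it become background (1), the locally
    darkest pixels become foreground (0)."""
    local_mean = gaussian_filter(img, sigma=sigma)
    return (img > local_mean - offset).astype(np.uint8)

page = np.full((100, 100), 200.0)   # light page...
page[48:51, 48:51] = 50.0           # ...with a dark 3x3 mark

binary = adaptive_gaussian_threshold(page)
```

The mark barely shifts its own local mean, so it stays well below its threshold and is kept as foreground, while the uniform page lands just above its threshold and is classified as background.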

Adaptive Thresholding Pipeline

Linear Regression

Instead of modelling the entire image at once, we predicted a cleaned-up intensity for each pixel and constructed the cleaned image by combining the predicted pixel intensities, using linear regression as the per-pixel model. After creating a vector of y values (clean intensities) and a matrix of x values, the simplest data set is one where the x values are just the pixel intensities of the dirty image.

Except at the extremes, there is a linear relationship between the brightness of the dirty images and the cleaned images. There is a broad spread of x values as y approaches 1, and these pixels probably represent stains that need to be removed. The linear model has done a brightness and contrast correction. That’s quite good performance for a simple least squares linear regression!
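The per-pixel model can be sketched with an ordinary least-squares fit. Here the “dirty” intensities are generated synthetically from a known linear map so the fit is easy to check; a real run would pair dirty and clean training images from the data set:

```python
import numpy as np

# Clean intensities and a synthetic "dirty" version: darker and
# lower-contrast (dirty = 0.5 * clean + 30).
clean = np.linspace(0.0, 255.0, 100)
dirty = 0.5 * clean + 30.0

# Design matrix [1, x]: an intercept and a slope on the dirty intensity.
X = np.column_stack([np.ones_like(dirty), dirty])
coef, *_ = np.linalg.lstsq(X, clean, rcond=None)

predicted = X @ coef   # brightness/contrast-corrected intensities
```

The recovered coefficients invert the synthetic degradation exactly (clean = 2 · dirty − 60), which is the brightness-and-contrast correction described above.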

Intensity Plot for Linear Regression
Linear Regression Pipeline

Autoencoders

Autoencoders are neural networks composed of an encoder and a decoder. The encoder compresses the input data into a lower-dimensional representation. The decoder reconstructs the representation to obtain an output that mimics the input as closely as possible. In doing so, the autoencoder learns the most salient features of the input data.

Autoencoders are closely related to principal component analysis (PCA). If the activation function used within the autoencoder is linear in each layer, the latent variables present at the bottleneck (the smallest layer in the network, also known as the code) directly correspond to the principal components from PCA.

The network is composed of five convolutional layers that extract meaningful features from the images. The inner convolutions use 32 or 64 kernels each (see the model summary below). Each kernel has its own weights, performs a different convolution on its input, and produces a different feature map, so the output of a 64-kernel convolution has 64 channels.

The encoder uses max-pooling for compression. A sliding filter runs over the input image, to construct a smaller image where each pixel is the max of a region represented by the filter in the original image. The decoder uses up-sampling to restore the image to its original dimensions, by simply repeating the rows and columns of the layer input before feeding it to a convolutional layer.
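The 2 × 2 max-pooling and the repeat-based up-sampling can be sketched in plain NumPy:

```python
import numpy as np

def max_pool_2x2(a):
    """Take the max over non-overlapping 2x2 windows."""
    h, w = a.shape
    return a.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample_2x2(a):
    """Restore the original size by repeating rows and columns."""
    return np.repeat(np.repeat(a, 2, axis=0), 2, axis=1)

img = np.array([[ 1,  2,  3,  4],
                [ 5,  6,  7,  8],
                [ 9, 10, 11, 12],
                [13, 14, 15, 16]])

pooled = max_pool_2x2(img)       # shape (2, 2): one max per window
restored = upsample_2x2(pooled)  # shape (4, 4): size back, detail lost
```

The round trip makes the compression explicit: the restored image has the original dimensions, but each 2 × 2 block now holds a single repeated value.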

Batch normalization reduces internal covariate shift, the change in the distribution of activations between layers, and allows each layer of the model to learn more independently of the other layers.

Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
image_input (InputLayer) (None, 420, 540, 1) 0
_________________________________________________________________
Conv1 (Conv2D) (None, 420, 540, 32) 320
_________________________________________________________________
pool1 (MaxPooling2D) (None, 210, 270, 32) 0
_________________________________________________________________
Conv2 (Conv2D) (None, 210, 270, 64) 18496
_________________________________________________________________
pool2 (MaxPooling2D) (None, 105, 135, 64) 0
_________________________________________________________________
Conv3 (Conv2D) (None, 105, 135, 64) 36928
_________________________________________________________________
upsample1 (UpSampling2D) (None, 210, 270, 64) 0
_________________________________________________________________
Conv4 (Conv2D) (None, 210, 270, 32) 18464
_________________________________________________________________
upsample2 (UpSampling2D) (None, 420, 540, 32) 0
_________________________________________________________________
Conv5 (Conv2D) (None, 420, 540, 1) 289
=================================================================
Total params: 74,497
Trainable params: 74,497
Non-trainable params: 0
_________________________________________________________________
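The summary above can be reproduced with the Keras functional API. The padding, kernel size 3, and activations (ReLU inside, sigmoid on the output) are assumptions chosen so the layer shapes and parameter counts match the printed summary:

```python
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(420, 540, 1), name="image_input")
x = layers.Conv2D(32, 3, padding="same", activation="relu", name="Conv1")(inp)
x = layers.MaxPooling2D(2, name="pool1")(x)            # 420x540 -> 210x270
x = layers.Conv2D(64, 3, padding="same", activation="relu", name="Conv2")(x)
x = layers.MaxPooling2D(2, name="pool2")(x)            # 210x270 -> 105x135
x = layers.Conv2D(64, 3, padding="same", activation="relu", name="Conv3")(x)
x = layers.UpSampling2D(2, name="upsample1")(x)        # 105x135 -> 210x270
x = layers.Conv2D(32, 3, padding="same", activation="relu", name="Conv4")(x)
x = layers.UpSampling2D(2, name="upsample2")(x)        # 210x270 -> 420x540
out = layers.Conv2D(1, 3, padding="same", activation="sigmoid", name="Conv5")(x)

autoencoder = Model(inp, out, name="model_1")
autoencoder.compile(optimizer="adam", loss="mse")
```

With 3 × 3 kernels, the parameter counts line up with the summary (e.g. Conv1: 3·3·1·32 + 32 = 320; Conv2: 3·3·32·64 + 64 = 18,496), for 74,497 in total.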
Autoencoder Pipeline

Results

Median Filtering

Median Filtering Results

Edge Detection, Dilation & Erosion

Edge Detection Results

Adaptive Thresholding

Adaptive Thresholding Results

Linear Regression

Linear Regression Results
Autoencoder Results

Performance Metrics

PSNR

PSNR Comparative Results

RMSE

RMSE Comparative Results

UQI

UQI Comparative Results
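The three metrics can be computed directly with NumPy. Note that the UQI below uses the global (single-window) form of the Wang and Bovik index rather than the sliding-window version from the original paper:

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square error between two images."""
    return np.sqrt(np.mean((a - b) ** 2))

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    err = rmse(a, b)
    return np.inf if err == 0 else 20 * np.log10(max_val / err)

def uqi(a, b):
    """Global Universal Quality Index (Wang & Bovik, 2002): combines
    correlation, luminance distortion, and contrast distortion."""
    ma, mb = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - ma) * (b - mb)).mean()
    return 4 * cov * ma * mb / ((va + vb) * (ma ** 2 + mb ** 2))

clean = np.array([[1.0, 2.0], [3.0, 4.0]])
noisy = clean + 3.0   # a uniform brightness shift
```

UQI is 1 only for identical images; a pure brightness shift leaves the correlation intact but is still penalized through the luminance term, which is what makes it stricter than PSNR alone.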

Architecture

After observing the results of the various computer vision, machine learning, and neural network techniques, we deployed them as a software tool, “Denoizer”, hosted on AWS.

Denoizer Architecture

Code

Here’s the complete code for the comparative study along with the application hosted on AWS:

References

[1] F. Zamora-Martinez, S. España-Boquera and M. J. Castro-Bleda, Behaviour-based Clustering of Neural Networks applied to Document Enhancement, in: Computational and Ambient Intelligence, pages 144–151, Springer, 2007.

[2] Z. Wang and A. C. Bovik, “A Universal Image Quality Index”, IEEE Signal Processing Letters, Volume 9, Issue 3, pages 81–84, August 2002.

[3] UC Irvine NoisyOffice Data Set: https://archive.ics.uci.edu/ml/datasets/NoisyOffice

[4] https://medium.com/illuin/cleaning-up-dirty-scanned-documents-with-deep-learning-2e8e6de6cfa6
