Exploitation des Tickets de Caisse pour le Suivi Budgétaire : Solution Data Science pour l'Extraction, Data Wrangling et l'Analyse Sémantique

Engineer: Islame KACHADE
Organisation: EXVIVO
Language: French
Promotion: 2019
Year: 3

Abstract

This document presents my end-of-study project, carried out at Exvivo, on extracting data from till receipts to support budget tracking. The objective is to analyze and extract purchase data from the receipts of various supermarkets in order to draw up balance sheets for individual budget forecasting. To this end, the project aims to produce a generic prototype that recognizes the different formats of cash register receipts, together with an OCR component that extracts the purchase data. The study also adopts a clustering model to identify product names that refer to the same product but are spelled differently. Then, after several syntactic and semantic processing steps on the data, the OCR results are used to build balance sheets for individual budgeting.

Before implementation began, the existing system was studied and the functional requirements were specified in order to set the objectives to be achieved. We then reviewed the technologies and techniques needed to build the proposed solution, as well as work related to our project.

Finally, the last part presents the implementation of the successive design steps. It starts with applying a CNN model to our dataset to recognize receipts. The receipts are then preprocessed with image processing techniques, a step that prepares for the two that follow: segmentation and extraction. Next, the data extracted from the receipts is analyzed with a clustering model, affinity propagation, which, alongside Data Wrangling, serves on the one hand to correct extraction errors and on the other hand to unify the names of products that are the same but written differently. Finally, the processed data is used to establish forecast balances and to draw conclusions about customer consumption habits.

Keywords: OCR, Machine Learning, budget monitoring, semantic analysis, Clustering, receipt, Data Wrangling
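
To make the product-name unification step more concrete, here is a minimal sketch of grouping OCR'd product names that differ only by spelling or recognition errors. It is not the project's implementation: the abstract names affinity propagation, whereas this sketch swaps in a much simpler greedy grouping based on string similarity from Python's standard `difflib` module. The function names, the sample product names, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so casing and spacing do not matter."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized product names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_product_names(names, threshold=0.8):
    """Greedy grouping (a stand-in for affinity propagation, not the same algorithm).

    Each name joins the first existing group whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new group.
    The threshold value is an illustrative assumption.
    """
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            # No sufficiently similar group was found.
            groups.append([name])
    return groups


def canonical_map(groups):
    """Map every name to the representative of its group."""
    return {name: group[0] for group in groups for name in group}


# Hypothetical OCR output containing typical recognition errors.
receipt_names = [
    "LAIT DEMI ECREME",
    "YAOURT NATURE",
    "LA1T DEMI ECREME",
    "PAIN COMPLET",
    "lait demi ecreme",
    "YAOURT NATVRE",
]
groups = group_product_names(receipt_names)
unified = canonical_map(groups)
```

Unlike this greedy pass, affinity propagation selects the representative names (exemplars) itself and does not depend on input order, which is one plausible reason to prefer it when the number of distinct products is unknown in advance.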