Using Till Receipts for Budget Monitoring: A Data Science Solution for Extraction, Data Wrangling, and Semantic Analysis
Abstract
This document presents my end-of-studies project, carried out at Exvivo, which
concerns the extraction of data from till receipts to support budget monitoring.
The objective of this project is to analyze and extract purchase data from the receipts of the
various supermarkets in order to draw up balance sheets for individual budget forecasting.
In this respect, this project aims to produce a generic prototype to recognize the different shapes
of cash register receipts, as well as an OCR component for the extraction of purchase data. The study also
adopted a clustering model to identify names of products that are similar but spelled differently.
Then, after several syntactic and semantic processing steps applied to the data, the OCR results are
used to produce balance sheets for individual budgeting.
Before starting the implementation, we studied the existing system and specified the
functional requirements in order to set the objectives to be achieved. We then
reviewed the technologies and techniques needed to build the proposed solution, as well as
work related to our project.
Finally, the last part presents the implementation of the different design steps. We start by
applying a CNN model to our dataset in order to recognize the receipts. The receipts are then
preprocessed using image processing techniques; this step prepares the two that follow,
segmentation and extraction. Next, the data extracted from the receipts are analyzed
with a clustering model, affinity propagation. Together with Data Wrangling, it serves on the one
hand to correct extraction errors and on the other hand to unify the names of products that are
the same but written differently. Finally, we use the processed data to establish forecast
balances and infer customer consumption habits.
Keywords: OCR, Machine Learning, budget monitoring, semantic analysis, Clustering,
receipt, Data Wrangling
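
As an illustration of the product-name unification step summarized above, here is a minimal sketch of affinity propagation applied to OCR-extracted product names. It is not the project's implementation. The similarity measure (`difflib.SequenceMatcher`), the median preference, the damping value, and the sample product names are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from statistics import median


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [-1, 0]; 0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() - 1.0


def affinity_propagation(names, damping=0.7, iterations=200):
    """Return, for each name, the index of the exemplar it is assigned to."""
    n = len(names)
    if n == 0:
        return []
    S = [[similarity(a, b) for b in names] for a in names]
    # Assumed preference: the median off-diagonal similarity (a common default).
    off_diag = [S[i][k] for i in range(n) for k in range(n) if i != k]
    pref = median(off_diag) if off_diag else 0.0
    for i in range(n):
        S[i][i] = pref

    R = [[0.0] * n for _ in range(n)]  # responsibilities
    A = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        # Responsibility: how well k suits i compared with the best rival.
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                rival = max((scores[j] for j in range(n) if j != k), default=0.0)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # Availability: accumulated support for k acting as an exemplar.
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            pos[k] = R[k][k]  # the self-responsibility is not clipped
            total = sum(pos)
            for i in range(n):
                if i == k:
                    new = total - pos[k]
                else:
                    new = min(0.0, total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new

    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        # No exemplar emerged: leave every name in its own cluster.
        return list(range(n))
    labels = [max(exemplars, key=lambda k: S[i][k]) for i in range(n)]
    for k in exemplars:
        labels[k] = k
    return labels


if __name__ == "__main__":
    # Hypothetical OCR outputs for two products.
    sample = ["LAIT DEMI ECREME", "LAIT 1/2 ECREME", "LAIT DEMI ECR",
              "PAIN DE MIE", "PAIN MIE", "PAIN DE MlE"]
    for name, label in zip(sample, affinity_propagation(sample)):
        print(f"{name} -> {sample[label]}")
```

Each name is mapped to an exemplar, which can then serve as the canonical spelling for its group. The number of clusters is not fixed in advance; it depends on the preference value, so that parameter would need tuning on real receipt data.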