Tackling Product Recognition at Checkouts Using Neural Networks: Part 1 of 3

By Eleftherios Fanioudakis and Nimesh Patel

Introduction

This is part 1 of a 3-part series detailing the technical journey SeeLabs has been on to solve the challenge we were presented with. The parts are as follows:

Part 1: Retail Loss at Checkouts: Introduction to the Problem and the Detector-Classifier System

Part 2: Retail Loss at Checkouts: The Two-Headed Classifier

Part 3: Retail Loss at Checkouts: Smoothing Algorithms to Count Products

Challenge

Can we create a system which can detect products as they are scanned, and classify those that may be of interest to the retail store at checkout (self-service or staffed)? The hardware we were constrained to was an NVIDIA Jetson Xavier AGX, and the nature of this problem requires the system to produce real-time insights for later application.

Immediately it was clear that this was a Machine Learning problem, more specifically an Object Detection/Recognition one. We knew there exist many open-source, pre-trained network architectures and decided to investigate which would fit our needs – why reinvent the wheel?!

Detector-Classifier

After some investigation and knowledge sharing within the team, we concluded that our best bet was the FasterRCNN model provided by Google’s TensorFlow model zoo. This was due to its high accuracy on various well-known datasets, as well as TensorRT’s support for most of the operations in this model. This was (and is!) important for getting the best performance on the Xavier AGX.
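For illustration, here is a minimal sketch of the kind of TF-TRT conversion this involves, using TensorFlow’s TrtGraphConverterV2; the model paths and the FP16 precision mode are illustrative assumptions rather than our exact build configuration.

    # Sketch: optimising a TensorFlow SavedModel with TF-TRT for the Xavier AGX.
    # Paths and precision mode are illustrative, not our production settings.
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    params = trt.TrtConversionParams(precision_mode=trt.TrtPrecisionMode.FP16)
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir="faster_rcnn_saved_model",  # hypothetical path
        conversion_params=params,
    )
    converter.convert()  # supported subgraphs are replaced with TensorRT ops
    converter.save("faster_rcnn_trt")  # hypothetical output path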

At a retail store, the product inventory can easily grow to the order of 10,000 items. If a large subset of this is of interest to the retail store, then we need to be able to recognise these products accurately. Many products share similar ‘features’, so our view was that a single detector would not be enough: without a great deal of data, its feature space would not be rich enough to support this potentially huge number of classes. While our PoC covers a small list of products, we wanted to make the system as scalable as possible without the need for mass data collection, so we needed something in addition to the detector that could do a finer classification job. We planned to bolt a classifier onto the output of the detector, keeping the additional latency as low as possible.

Our first step was to collect data. There was no suitable data online, and we needed to emulate the end environment as closely as possible: a self-checkout in a store. So we set up a space in SeeLabs, with a mock till area and cameras positioned around it. As with any ML model, the data had to cover a variety of angles, lighting conditions and people. Due to time constraints and privacy, we had to label the data ourselves, and we undertook a mass labelling exercise, which wasn’t fun!

Figure 1. An example field of view from one of our mock-checkout cameras, ready for labelling (a person scanning a bottle of wine at the mock till)

About the system

Our base system comprises two main stages of processing. First we localise the product (locate it in the frame) and classify it into a super-category such as bottles, boxed items, etc. The output of this Object Detection stage is used in the second stage, the classifier, which performs a further, more precise classification to identify the specific product within the localised super-category.

The Object Detection is performed by a fine-tuned FasterRCNN, which determines the super-category of a product (e.g. bottle, boxed item, can). The model has been trained to reliably identify a core set of super-categories, and can be further extended to identify different types of products if required.

The classifier is also a deep convolutional neural network, but it relies on the object detector to indicate exactly where in the frame to focus in order to identify the product. The classifier needs to be trained on each product that is to be identified, a process that we have automated. It adds very little latency to the overall system while providing accurate results.
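To make the flow concrete, here is a simplified sketch of the two-stage pipeline; the detector and classifier interfaces and the confidence threshold are assumptions for illustration, not our production code.

    # Sketch of the detector-classifier pipeline. `detector` is assumed to
    # return boxes, confidence scores and super-category labels; `classifier`
    # is assumed to map an image crop to a (sku, confidence) pair.
    DETECTION_THRESHOLD = 0.5  # assumed confidence cut-off

    def recognise_products(frame, detector, classifier):
        """Stage 1 localises super-categories; stage 2 identifies the SKU."""
        results = []
        for box, score, super_category in zip(*detector(frame)):
            if score < DETECTION_THRESHOLD:
                continue
            x1, y1, x2, y2 = (int(v) for v in box)
            crop = frame[y1:y2, x1:x2]          # focus only on the product
            sku, confidence = classifier(crop)  # finer-grained classification
            results.append((super_category, sku, confidence))
        return results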

The two-stage object detector/classifier can be used when products must be identified across a wide area of the camera view. In situations where products only need to be identified in a specific area of the frame (for example the bagging area or the scanning area), a lighter-weight approach becomes possible – an idea we return to in the conclusion of this post.

Figure 1b. The two stages of our product detection and recognition pipeline: the camera feed goes to the detector, which locates a packet, tin or bottle, and the classifier then identifies the product SKU

The SeeChange Object Detector and Classifier models discussed here are comparable to the state of the art. Further to this, our models have been optimised at the level of individual model operations for the highest performance on our reference device.

Performance

Model        Inference time (ms)
Detector                      90
Classifier                    20
Overall                      110

Table 1: Inference performance on the NVIDIA Jetson Xavier AGX
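For context, per-stage figures like these can be measured with a simple timing loop; a minimal sketch follows, where `model` stands in for either stage and warm-up runs are discarded so one-off initialisation (such as TensorRT engine construction) does not skew the mean.

    import time

    # Sketch: mean per-call latency in milliseconds over `runs` inferences,
    # after `warmup` untimed calls to absorb lazy initialisation.
    def mean_latency_ms(model, inputs, warmup=10, runs=100):
        for x in inputs[:warmup]:
            model(x)
        start = time.perf_counter()
        for x in inputs[:runs]:
            model(x)
        return 1000.0 * (time.perf_counter() - start) / runs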

For a more realistic test, we took all our test data (video footage of scanning events that had not been used for training) and stitched it into one long video representing a continuous sequence of products being scanned one by one. In these tests, at most one product is in frame at any one time.

The test dataset looks like this:

  • 23,585 frames from 1,965 seconds (about 33 minutes) of video
  • There were 21 known products and 6 unknown products; none of this test footage had been used for training
  • The products were all scanned several times; in total there were 455 scanning events
  • 350 of these events involved known products, and 105 involved unknown products

We then worked out how many times:

  1. We correctly identified that it was a known product (True Positive)
  2. We correctly identified that it was an unknown product (True Negative)
  3. We missed a known product (False Negative)
  4. We classified a product as the wrong known product – either an unknown product classified as known, or a known product classified as the wrong one (False Positive)

The False Positives are the real worry here: depending on the deployment model, each one could mean raising a false alarm for a customer who has scanned everything correctly.

                   Predicted Negative   Predicted Positive   Total
Actual Negative                   104                    1     105
Actual Positive                    96                  254     350
                                                               455

Accuracy    78.7%   (254 + 104) / 455
Recall      72.6%   254 / (254 + 96)
Precision   99.6%   254 / (254 + 1)

Table 2: Precision & Recall
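The figures in Table 2 follow directly from the event counts:

    # Deriving the Table 2 metrics from the raw confusion-matrix counts.
    tp, tn, fn, fp = 254, 104, 96, 1   # counts over the 455 scanning events
    total = tp + tn + fn + fp          # 455

    accuracy = (tp + tn) / total       # 358 / 455 = 78.7%
    recall = tp / (tp + fn)            # 254 / 350 = 72.6%
    precision = tp / (tp + fp)         # 254 / 255 = 99.6%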

The model was tuned to avoid False Positives as much as possible, even at the cost of fewer True Positives. The idea is that each false positive is potentially an occasion when a customer is incorrectly deemed to have not scanned something.

Out of the 455 scanning events, a single unknown product was classified as a known product. This was a bag of coffee beans classified as a bottle of Jamaican Ginger Beer. In technical terms, this means the Precision was 99.6% (the proportion of positive predictions that were correct).

Figure 2. An example of misclassification: coffee beans incorrectly detected as Jamaican Ginger Beer at the mock checkout
Figure 3. An example of correct classification: a Jack Daniel’s bottle being scanned at the mock checkout

Of the known products, the system successfully identified the product 254 times out of 350. Technically this gives a Recall of 72.6% (the proportion of actual positives that we classified correctly).

There are a number of things we could do to improve performance, to rule out the single False Positive and to increase the True Positives:

  • Train on more data for the current known and unknown products
  • Add more known and unknown products, which would allow the model to better distinguish between products of similar color and shape

Conclusion

Our system can detect and classify the products on which it has been trained, at a point of sale. However, as we can see, there are various improvements to be made before this can be production-ready.

Firstly, we do not need to correctly recognise a product in every frame; rather, during a “scanning event”, we need to recognise the product in enough frames to classify it. This implies that if we can process more frames during a scanning event, we increase this chance. In other words, we need to increase the throughput of our system.
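As a simple illustration of this idea (a sketch only; the smoothing algorithms we actually use are the subject of part 3), one could majority-vote over the per-frame predictions gathered within a scanning event:

    from collections import Counter

    # Sketch: aggregate per-frame (sku, confidence) predictions for one
    # scanning event; both thresholds are illustrative assumptions.
    def classify_event(frame_predictions, min_confidence=0.5, min_votes=3):
        votes = Counter(sku for sku, conf in frame_predictions
                        if conf >= min_confidence)
        if not votes:
            return None                    # nothing confidently seen
        sku, count = votes.most_common(1)[0]
        return sku if count >= min_votes else None  # demand corroboration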

Secondly, the detector consumes the whole frame, when only a small area of it, around the till, is of interest. This leads to false positives from the detector in the background, and classifier processing time grows as the number of false positives increases.
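A natural mitigation, sketched below with hypothetical coordinates, is to crop each frame to a fixed region of interest around the till before any inference runs:

    # Sketch: restrict inference to a fixed scan area. The coordinates are
    # hypothetical placeholders for a per-camera calibration.
    SCAN_AREA = (420, 260, 900, 620)   # (x1, y1, x2, y2) in pixels

    def crop_to_scan_area(frame, roi=SCAN_AREA):
        x1, y1, x2, y2 = roi
        return frame[y1:y2, x1:x2]     # only this region reaches the models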

In the next part of this series we will discuss our approach to making our system detector-less, thereby increasing performance while maintaining or improving accuracy.

To learn more about SeeChange computer vision technologies & product SKU recognition software, and their use cases for tackling loss prevention and stock shrinkage, please visit https://seechange.ai/retail-stores/#stock-shrinkage or contact us here.

About Lefteris

Lefteris is part of the technology team. His work is focused on the integration and evaluation of AI models to provide solutions for various tasks. He is passionate about exploring and applying state-of-the-art machine learning models and algorithms on different devices and platforms.


About Nimesh

Nimesh is part of the technology team. He has had various responsibilities while at SeeChange, including the development of AI models and Technology Innovation and Demo work. Currently Nimesh is working on the automation of end-to-end model training pipelines for various deployments. Nimesh is passionate about the promotion of D&I in technology.
