March 28
Abstract
We improved our computer vision tools for street sign detection. Our algorithm is based on deep learning, using the real-time object detection model YOLOv8 (You Only Look Once). YOLO was initially developed by researchers from the University of Washington, the Allen Institute for AI, and Facebook AI Research. Aiming to deliver high performance, Cognitivo implemented its computer vision tool with an updated model version, further developed by the Ultralytics team.
Keywords: Computer Vision, Artificial Intelligence, Machine Learning, Visual Object Detection.
Introduction
Computer vision models are a branch of artificial intelligence that trains machines to understand the visual world. This artificial intelligence – machine learning (AI-ML) approach is gaining popularity thanks to the efficiency of the process and the evident workload reduction for users. With the advent of YOLO models for image processing, with their fast architecture, it is possible to achieve highly accurate results on challenging computer vision problems such as object detection, image classification, and face recognition.
In this use case, we built an AI-powered algorithm able to detect and classify street signs. Thanks to the AI Factory User Interface (UI), every user can upload images, run an AI model, and get street signs detected and classified. In addition, the user can also train the model and define new classes for street sign classification (e.g., stop sign, speed limit sign, etc.).
Method
- Model selection
We tested different AI algorithms and selected the best-performing one. The models analysed are reported below:
· Bootstrapping Language-Image Pre-training (BLIP) model.
· Grounding DINO.
· YOLOv8.
The appeal of the BLIP model and Grounding DINO is that they identify objects without the need to retrain the model or to collect and label new data. This is also called zero-shot detection.
BLIP is a multi-tasking model, performing visual question answering, image-text matching, and image captioning. The model divides the input image into patches and encodes them alongside word embeddings (see Figure 1). The model showed good results in detecting street signs, but it often failed to identify the correct class (e.g., a stop sign was classified as a give way sign).
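The patch-splitting step can be sketched in a few lines of NumPy. This is a simplified illustration of ViT-style patch extraction, not BLIP's actual implementation; the 16-pixel patch size and image dimensions are assumptions for the example:

```python
import numpy as np

def split_into_patches(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C); each row
    would later be projected to an embedding vector by the vision encoder.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must be divisible by patch size"
    # Reshape into a grid of patches, then flatten each patch into a vector.
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)    # (num_patches, patch*patch*C)

# A 224x224 RGB image yields 14 x 14 = 196 patches of 16*16*3 = 768 values each.
patches = split_into_patches(np.zeros((224, 224, 3)))
```

Each flattened patch plays the same role for the image that a token does for the text, which is what lets the model match image regions against word embeddings.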
Grounding DINO is based on a two-stream architecture where features are extracted from the image in one stream and from the text in the other. The two feature streams are then combined into a single unified representation. Below is a schematisation of the model architecture.
An example of the model output is shown in Figure 3. The number in the picture is a text similarity score and assesses the quality of the detection. Recall that this model combines text and images.
BLIP returns scores for all prompt classes, while Grounding DINO returns only the best score. On the other hand, with Grounding DINO we were able to obtain output images with labelled detected objects, whereas BLIP only printed classes with confidence levels and did not produce an image with bounding boxes.
We hence decided to update our AI computer vision model to YOLOv8. Initially created by researchers from the University of Washington, the Allen Institute for AI, and Facebook AI Research, YOLO has been further developed by the Ultralytics team. YOLOv8 is built on a fast and efficient architecture (see Figure 4) and is adaptable to several hardware platforms, from cloud-based APIs to edge devices.
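Running inference with the Ultralytics API takes only a few lines. The snippet below is a minimal sketch, assuming the `ultralytics` package is installed and using the pretrained `yolov8n.pt` weights; the input image `street.jpg` and the `summarise_detections` helper are hypothetical, added here for illustration:

```python
def summarise_detections(class_ids, confidences, names):
    """Turn raw YOLO detection arrays into readable (label, confidence) pairs."""
    return [(names[int(c)], round(float(p), 2)) for c, p in zip(class_ids, confidences)]

if __name__ == "__main__":
    from ultralytics import YOLO      # requires: pip install ultralytics
    model = YOLO("yolov8n.pt")        # pretrained weights, fetched on first use
    results = model("street.jpg")     # hypothetical input image
    for r in results:
        # r.boxes.cls holds class indices, r.boxes.conf the confidence scores,
        # and r.names maps each class index to its label
        print(summarise_detections(r.boxes.cls, r.boxes.conf, r.names))
```

The same `YOLO` object is used for training, validation, and export, which is part of what makes the model easy to deploy across different platforms.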
- Training
YOLOv8 required training before it could detect street signs. A simplified overview of the work methodology is shown in Figure 5. We labelled thousands of images with different properties (e.g., format, luminosity, resolution, environment) to build an AI algorithm able to analyse images and perform object detection. This is therefore a supervised algorithm which required human intervention in the training phase. Once the training was over, the tool became independent from humans and kept improving by analysing new images.
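In the Ultralytics framework, labelled training data is described by a small dataset configuration file. The fragment below is a sketch of such a file; the paths and the street-sign class names are hypothetical placeholders, not the actual dataset used in this work:

```yaml
# Hypothetical YOLOv8 dataset configuration (signs.yaml)
path: datasets/street-signs      # dataset root (placeholder path)
train: images/train              # training images, relative to path
val: images/val                  # validation images, relative to path

# class index -> class name (example classes only)
names:
  0: stop
  1: give_way
  2: speed_limit
```

Training is then started with a single call, e.g. `YOLO("yolov8n.pt").train(data="signs.yaml", epochs=50)`, which fine-tunes the pretrained weights on the labelled street-sign images.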
Object detection on the AI Factory
The new YOLO deep learning algorithm runs on the Cognitivo AI Factory, where the user can upload images with GPS data and run the AI model. Each street sign is classified, and its location is shown on a map. See below.
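The mapping step can be illustrated with a small helper that pairs each classified sign with its image's GPS coordinates as a GeoJSON feature. This is a hypothetical sketch of the idea, not the AI Factory's actual data model; the coordinates in the example are invented:

```python
def signs_to_geojson(detections):
    """Convert (label, confidence, lat, lon) tuples into a GeoJSON FeatureCollection."""
    features = [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},  # GeoJSON order is lon, lat
            "properties": {"label": label, "confidence": conf},
        }
        for label, conf, lat, lon in detections
    ]
    return {"type": "FeatureCollection", "features": features}

# Example: one detected stop sign at an invented coordinate
gj = signs_to_geojson([("stop", 0.93, 51.5074, -0.1278)])
```

A FeatureCollection like this can be dropped directly onto most web map libraries, which is all the UI needs to plot each sign at its recorded location.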
Summary and conclusions
We experimented with different ML/AI models for object detection and integrated the most suitable one into our AI algorithm running on the AI Factory. Thanks to the user-friendly interface of the AI Factory, any user can easily run AI models without needing technical skills.
Possible developments
We presented an AI neural network application for street sign detection. However, the same procedure can be applied to other computer vision use cases.
Author: Daniele d'Antonio, AI/ML Data Engineer, Cognitivo