March 28
Abstract
We improved our computer vision tools for street sign detection. Our algorithm is based on deep learning, using the real-time object detection model YOLOv8 (You Only Look Once). YOLO was initially developed by researchers from the University of Washington, the Allen Institute for AI, and Facebook AI Research. Aiming to deliver high performance, Cognitivo implemented its computer vision tool with an updated model version, further developed by the Ultralytics team.
Keywords: Computer Vision, Artificial Intelligence, Machine Learning, Visual Object Detection.
Introduction
Computer vision models are a branch of artificial intelligence that trains machines to understand the visual world. This artificial intelligence – machine learning (AI-ML) approach is gaining popularity thanks to the efficiency of the process and the evident workload reduction for users. With the advent of YOLO models for image processing, with their fast architecture, it is possible to achieve highly accurate results on challenging computer vision problems such as object detection, image classification, and face recognition.
In this use case, we built an AI-powered algorithm able to detect and classify street signs. Thanks to the AI Factory User Interface (UI), every user can upload images, run an AI model, and get street signs detected and classified. In addition, the user can also train the model and define new classes for street sign classification (e.g., stop sign, speed limit sign, etc.).
Method
- Model selection
We tested different AI algorithms and selected the best-performing one. The models analysed are reported below:
· Bootstrapping Language-Image Pre-training (BLIP) model.
· Grounding DINO.
· YOLOv8.
The appeal of the BLIP model and Grounding DINO is that they identify objects without the need to retrain the model or to collect and label new data. This is also called zero-shot detection.
BLIP is a multi-tasking model, performing visual question answering, image-text matching, and image captioning. The model divides the input image into patches and encodes them alongside word embeddings (see Figure 1). The model showed good results in detecting street signs, but it often failed to identify the correct class (e.g., a stop sign was classified as a give way sign).
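The patch-splitting step can be sketched in a few lines of NumPy. This is a simplified illustration of ViT-style patch extraction, not BLIP's actual implementation; the 16-pixel patch size and image dimensions are assumptions for the example:

```python
import numpy as np

def split_into_patches(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C); each row
    would later be projected to an embedding vector by the vision encoder.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must be divisible by patch size"
    # Reshape into a grid of patches, then flatten each patch into a vector.
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)    # (num_patches, patch*patch*C)

# A 224x224 RGB image yields 14 x 14 = 196 patches of 16*16*3 = 768 values each.
patches = split_into_patches(np.zeros((224, 224, 3)))
```

Each flattened patch plays the same role for the image that a token does for the text, which is what lets the model match image regions against word embeddings.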
Grounding DINO is based on a two-stream architecture where features are extracted from the image in one stream and from the text in the other. The two feature streams are then combined into a single unified representation. Below is a schematisation of the model architecture.
An example of the model output is shown in Figure 3. The number in the picture is a text similarity score and assesses the quality of the detection. Recall that this model combines text and images.
BLIP returns scores for all prompt classes, while Grounding DINO returns only the best score. On the other hand, with Grounding DINO we were able to obtain output images with labelled detected objects, whereas BLIP only printed classes with confidence levels and did not produce an image with bounding boxes.
We hence decided to update our AI computer vision model to YOLOv8. Initially created by researchers from the University of Washington, the Allen Institute for AI, and Facebook AI Research, YOLO has been further developed by the Ultralytics team. YOLOv8 is built on a fast and efficient architecture (see Figure 4) and is adaptable to several hardware platforms, from cloud-based APIs to edge devices.
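Running inference with the Ultralytics API takes only a few lines. The snippet below is a minimal sketch, assuming the `ultralytics` package is installed and using the pretrained `yolov8n.pt` weights; the input image `street.jpg` and the `summarise_detections` helper are hypothetical, added here for illustration:

```python
def summarise_detections(class_ids, confidences, names):
    """Turn raw YOLO detection arrays into readable (label, confidence) pairs."""
    return [(names[int(c)], round(float(p), 2)) for c, p in zip(class_ids, confidences)]

if __name__ == "__main__":
    from ultralytics import YOLO      # requires: pip install ultralytics
    model = YOLO("yolov8n.pt")        # pretrained weights, fetched on first use
    results = model("street.jpg")     # hypothetical input image
    for r in results:
        # r.boxes.cls holds class indices, r.boxes.conf the confidence scores,
        # and r.names maps each class index to its label
        print(summarise_detections(r.boxes.cls, r.boxes.conf, r.names))
```

The same `YOLO` object is used for training, validation, and export, which is part of what makes the model easy to deploy across different platforms.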
- Training
YOLOv8 required training before it could detect street signs. A simplified overview of the work methodology is shown in Figure 5. We labelled thousands of images with different properties (e.g., format, luminosity, resolution, environment) to build an AI algorithm able to analyse images and perform object detection. This is therefore a supervised algorithm which required human intervention in the training phase. Once the training was over, the tool became independent from humans and kept improving by analysing new images.
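In the Ultralytics framework, labelled training data is described by a small dataset configuration file. The fragment below is a sketch of such a file; the paths and the street-sign class names are hypothetical placeholders, not the actual dataset used in this work:

```yaml
# Hypothetical YOLOv8 dataset configuration (signs.yaml)
path: datasets/street-signs      # dataset root (placeholder path)
train: images/train              # training images, relative to path
val: images/val                  # validation images, relative to path

# class index -> class name (example classes only)
names:
  0: stop
  1: give_way
  2: speed_limit
```

Training is then started with a single call, e.g. `YOLO("yolov8n.pt").train(data="signs.yaml", epochs=50)`, which fine-tunes the pretrained weights on the labelled street-sign images.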
Object detection on the AI Factory
The new YOLO deep learning algorithm runs on the Cognitivo AI Factory, where the user can upload images with GPS data and run the AI model. Each street sign is classified, and its location is shown on a map. See below.
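The mapping step can be illustrated with a small helper that pairs each classified sign with its image's GPS coordinates as a GeoJSON feature. This is a hypothetical sketch of the idea, not the AI Factory's actual data model; the coordinates in the example are invented:

```python
def signs_to_geojson(detections):
    """Convert (label, confidence, lat, lon) tuples into a GeoJSON FeatureCollection."""
    features = [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},  # GeoJSON order is lon, lat
            "properties": {"label": label, "confidence": conf},
        }
        for label, conf, lat, lon in detections
    ]
    return {"type": "FeatureCollection", "features": features}

# Example: one detected stop sign at an invented coordinate
gj = signs_to_geojson([("stop", 0.93, 51.5074, -0.1278)])
```

A FeatureCollection like this can be dropped directly onto most web map libraries, which is all the UI needs to plot each sign at its recorded location.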
Summary and conclusions
We experimented with different ML/AI models for object detection and integrated the most suitable one into our AI algorithm running on the AI Factory. Thanks to the user-friendly interface of the AI Factory, any user can easily run AI models without needing technical skills.
Possible developments
We presented an AI neural network application for street sign detection. However, the same procedure can be applied to other computer vision use cases.
Author: Daniele d'Antonio, AI/ML Data Engineer, Cognitivo