Object detection is a key discipline in AI. It allows computer systems to detect entities – such as people or objects – in images or videos. It has applications in many areas of computer vision, but video surveillance is one of the most common. The focus here is to detect any suspicious activity or human presence during the day or at night.
What’s most important in video surveillance is that object detection happens in real time – if there’s a lag of even a few seconds, then it can defeat the whole purpose of having surveillance in the first place.
Historically, however, the deep learning algorithms that power real-time object detection with adequate precision have required vast computing power: many graphics processing units (GPUs) and a computationally heavy platform. This is not only incredibly expensive – the GPU system will likely cost more than the AI itself – but also challenging to deploy and manage.
Lighter-weight alternatives exist for image object detection, such as You Only Look Once (YOLO), which highlights an entity’s location in an image by drawing a box around it. YOLO models for video object detection do essentially the same – they work on a frame-by-frame basis, analyzing a single image, detecting any entities present, and then moving on to the next image.
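The frame-by-frame approach described above can be sketched as follows. This is a toy illustration, not a real YOLO model: `detect_frame` is a hypothetical stand-in that simply thresholds bright pixels in a grayscale frame. The key point is structural – no state is carried from one frame to the next.

```python
# Minimal sketch of stateless, frame-by-frame detection. `detect_frame`
# is a hypothetical stand-in for a real YOLO model; here it just flags
# bright pixels in a toy grayscale frame.

def detect_frame(frame, threshold=200):
    """Return bounding boxes (x, y, w, h) around pixels above `threshold`.

    Each frame is processed in isolation -- no state is carried between
    calls, which is exactly why temporal context is lost.
    """
    boxes = []
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            if value >= threshold:
                boxes.append((x, y, 1, 1))  # one box per hot pixel (toy logic)
    return boxes

def run_video(frames):
    # The detector sees each frame independently.
    return [detect_frame(f) for f in frames]

# Two 4x4 frames: an object appears at (2, 1) in the second frame only.
frames = [
    [[0, 0, 0, 0] for _ in range(4)],
    [[0, 0, 0, 0], [0, 0, 255, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
]
print(run_video(frames))  # [[], [(2, 1, 1, 1)]]
```

Because each call to `detect_frame` starts from scratch, an ambiguous blob in one frame cannot be disambiguated by what came before it – which is the weakness discussed next.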
This is a sub-optimal approach because something quite fundamental is missing: context. Imagine I show you daytime drone footage taken from a height of 150 meters. Because the video is captured from such a high elevation, people appear very small. If I pause the video and show you a single frame, those humans could be confused with artifacts. Viewed from directly overhead, you may see only a head, which could be anything – a rock or a pothole. Without context, the system can lack accuracy.
Recognizing this problem, I wanted to help develop a real-time video object detection system that would run well on a computationally limited platform yet provide state-of-the-art accuracy. It needed to be easy to deploy on a simple laptop or other edge devices, and I wanted it to be fundamentally better than the simple solutions we’ve seen in the past.
To capture that missing context, we added a transformer layer onto a YOLO model, giving the model a memory. This meant that it could now keep a summary of what has happened in the last few frames and focus attention on areas where there is a high probability of an object or person appearing.
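The core mechanism behind that memory can be illustrated with a minimal attention sketch. This is an assumption-laden toy, not the paper’s actual architecture: each past frame is reduced to a small feature vector held in a rolling memory, and scaled dot-product attention blends those past summaries according to how similar they are to the current frame.

```python
import math

# Illustrative sketch (not the paper's exact design): a rolling "memory"
# of past frame summaries is blended via scaled dot-product attention,
# so frames that resemble the current one contribute the most context.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memory):
    """Blend past frame summaries, weighted by similarity to `query`.

    `query` and each entry of `memory` are feature vectors (plain lists).
    Returns the blended context vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
    weights = softmax(scores)
    context = [
        sum(w * vec[i] for w, vec in zip(weights, memory))
        for i in range(len(query))
    ]
    return context, weights

# Memory of three past frames; the current frame resembles the first and last.
memory = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
current = [1.0, 0.0]
context, weights = attend(current, memory)
# The dissimilar middle frame receives the lowest attention weight.
assert weights[1] < weights[0] and weights[1] < weights[2]
```

In the real model this happens over spatial feature maps rather than two-element vectors, but the principle is the same: similarity to recent history decides how much each past frame informs the current prediction.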
When a program like ChatGPT uses a transformer model, it essentially identifies the most important words and predicts the next one. We’re doing the same at the pixel level: we identify which pixels are important based on the video summary. This way, we can exclude what previously might have been inaccurately flagged as a suspicious person or object. Our model does all of this in real time while delivering state-of-the-art performance. What’s more, it can do so from any edge device without needing a data connection.
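A toy illustration of that pixel-level filtering, under assumed mechanics rather than the paper’s exact method: a summary map built from recent frames marks where objects have plausibly appeared, and detections falling in low-attention regions are suppressed as likely artifacts.

```python
# Toy sketch of pixel-level attention filtering (assumed mechanics):
# a summary map from recent frames marks plausible object regions, and
# detections outside those regions are discarded as likely artifacts.

def attention_mask(summary, floor=0.5):
    """Binary mask: 1 where recent history makes an object plausible."""
    return [[1 if value >= floor else 0 for value in row] for row in summary]

def filter_detections(boxes, mask):
    # Keep only boxes (x, y, w, h) whose top-left pixel lies in a
    # high-attention region of the mask.
    return [(x, y, w, h) for (x, y, w, h) in boxes if mask[y][x] == 1]

# Summary: objects have recently appeared only in the right half of the scene.
summary = [
    [0.1, 0.2, 0.8, 0.9],
    [0.0, 0.1, 0.7, 0.9],
]
mask = attention_mask(summary)
boxes = [(0, 0, 1, 1), (3, 1, 1, 1)]  # the left box is a likely artifact
print(filter_detections(boxes, mask))  # [(3, 1, 1, 1)]
```

In practice the weighting would be soft and learned rather than a hard threshold, but the effect is the same: ambiguous blobs in regions the history says are empty get excluded, cutting false alarms without slowing inference.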
The potential is huge – our model can be customized to perform in almost any scenario where monitoring is required. As well as being used in surveillance, it is already proving to be incredibly beneficial in defense, where it’s essential to have real-time intelligent object detection for drone footage. It’s also proved useful in monitoring social distancing violations.
But this is just the start. There are countless other scenarios where this type of solution might prove transformative. This might be simple asset monitoring, for example. It could be used on a train track to detect whether a person is walking towards the tracks and issue an alert that might prevent a fatality from occurring. It might be used by autonomous vehicles to detect obstacles or in traditional vehicles to recognize that a driver is unresponsive, for example, and to issue a real-time alert that the person urgently needs medical attention. It might even be used in augmented reality situations in the metaverse.
The number of potential applications will grow even further as we see the growing adoption of AI-based solutions, especially those leveraging transformer models. I’m excited to see what the future holds.
Kunal Singh’s research paper ‘3D attention based YOLO-SWINF for real-time video object detection’ is available to download here.