Detecting Faces and Face Landmarks in Realtime

Welcome to another article exploring Apple’s Vision and VisionKit frameworks. I have already written two articles and code samples that use these frameworks.

The first one, which I wrote a while ago, is “Barcode Scanner in SwiftUI”.

The second article is quite recent and is called “Document Scanner in SwiftUI”.

Now it’s time for something new and even more fun.

Introduction

Exploring the existing frameworks for Face Detection and Face Landmark Detection is something I had wanted to do for a long time.

I find it fascinating that clever algorithms can detect and track faces even when people wear glasses or masks.

Thanks to Apple, we can have all this functionality right on our phones. Nothing has to be sent to any servers for processing; that would be too slow anyway.

We want everything in real time, right on our device.

Therefore, I started this project, which resulted in this article and a fully working iOS app. You can download the accompanying Xcode project and use the source code freely in your own projects.

Let’s first talk about what the app actually does.

App Features

The title already states that this article is about Face Detection and Face Landmark Detection. The app does a little more, though.

Apple’s Vision and VisionKit frameworks deliver the algorithms out of the box, and I have included the following features in the sample app:

  1. Detect and visualise bounding box
  2. Detect and visualise face landmarks
  3. Determine image capture quality
  4. Determine head position

The complete Xcode workspace with the fully functional app can be downloaded from here.

This is how the app looks in action:

https://youtu.be/ngdOJ7f7dLo

The app was written in Xcode 13.2.1, Swift Language Version 5, and tested on an iPhone 11 Pro Max with iOS 15.2.1.

In the next sections, I explain these features in more detail.

What is a Bounding Box?

When we want to track a person’s face as a whole entity, we use the bounding box. We get these values straight from the detection algorithms and can use them to visualise, for example, a green square around the face.

This is exactly what the sample app does. You can change the visualisation of the bounding box to suit your use case. Maybe you need another color or dashed lines instead of solid ones? This is an easy task with SwiftUI and the way I have decoupled the detection logic from its visualisation.

https://developer.apple.com/machine-learning/api/
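
To make this concrete, here is a minimal sketch of how such a bounding box can be obtained with the Vision framework. The function name and the dispatch handling are my own illustration and not taken from the sample project; Vision itself returns the box in a normalised coordinate space.

```swift
import CoreGraphics
import Foundation
import Vision

// Illustrative helper (not from the sample project): detect face
// bounding boxes in a still image and hand back the normalised rects.
func detectFaceBoundingBoxes(in image: CGImage,
                             completion: @escaping ([CGRect]) -> Void) {
    let request = VNDetectFaceRectanglesRequest { request, error in
        guard error == nil,
              let faces = request.results as? [VNFaceObservation] else {
            completion([])
            return
        }
        // boundingBox is normalised (0...1) with the origin at the bottom left.
        completion(faces.map { $0.boundingBox })
    }

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    DispatchQueue.global(qos: .userInitiated).async {
        try? handler.perform([request])
    }
}
```

Before drawing the green rectangle, the normalised rect still has to be mapped into the coordinate space of the preview, for example with VNImageRectForNormalizedRect.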

What are Face Landmarks?

Instead of viewing the whole face as one entity, face landmarks give us more detail on specific facial features. What we receive from the detection algorithms are sets of coordinates that depict features like the mouth, nose, eyes, and so on.

https://developer.apple.com/machine-learning/api/
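
As a rough sketch of what this data looks like (again illustrative rather than the sample project’s exact code), each VNFaceObservation can carry a VNFaceLandmarks2D value whose regions are optional:

```swift
import Vision

// Sketch: request face landmarks and read out a few of the regions.
let landmarksRequest = VNDetectFaceLandmarksRequest { request, error in
    guard let faces = request.results as? [VNFaceObservation] else { return }
    for face in faces {
        guard let landmarks = face.landmarks else { continue }
        // Each region is a VNFaceLandmarkRegion2D whose points are
        // normalised to the face's bounding box.
        let leftEye = landmarks.leftEye?.normalizedPoints ?? []
        let nose    = landmarks.nose?.normalizedPoints ?? []
        let lips    = landmarks.outerLips?.normalizedPoints ?? []
        print("left eye: \(leftEye.count), nose: \(nose.count), lips: \(lips.count) points")
    }
}
```

For drawing, VNFaceLandmarkRegion2D also offers pointsInImage(imageSize:) to convert the points into image coordinates.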

What is Capture Quality?

The Capture Quality indicator provides a single score for how well suited the captured face is for further processing. The higher the value, the better the quality; the range is from 0.0 to 1.0.

This is especially useful when you have a series of images of the same subject, e.g., for selecting the best shot for further processing.

https://developer.apple.com/machine-learning/api/
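
The snippet below illustrates the idea of picking the best observation by this score; it is a sketch, not the sample project’s code. Note that faceCaptureQuality is only populated by VNDetectFaceCaptureQualityRequest, and in practice you would compare the scores across a sequence of frames of the same person.

```swift
import Vision

// Sketch: among the returned observations, pick the one with the
// highest capture quality (0.0 = poor, 1.0 = good).
let qualityRequest = VNDetectFaceCaptureQualityRequest { request, _ in
    guard let faces = request.results as? [VNFaceObservation] else { return }
    let best = faces.max {
        ($0.faceCaptureQuality ?? 0) < ($1.faceCaptureQuality ?? 0)
    }
    print("Best capture quality: \(best?.faceCaptureQuality ?? 0)")
}
```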

What are the different Head Positions?

Another fascinating metric is the head position. Actually, we get three metrics: roll, yaw, and pitch. The image below depicts the differences between these values.

https://developer.apple.com/videos/play/wwdc2021/10040/
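
A small sketch of how these values are read (illustrative only): VNFaceObservation exposes roll, yaw and, from iOS 15 onwards, pitch as optional numbers measured in radians.

```swift
import Vision

// Sketch: read roll, yaw and pitch (in radians) from a face observation.
// pitch requires iOS 15 (face rectangles request revision 3).
let headPoseRequest = VNDetectFaceRectanglesRequest { request, _ in
    guard let face = (request.results as? [VNFaceObservation])?.first else { return }
    let roll  = face.roll?.doubleValue  ?? 0   // tilting the head towards a shoulder
    let yaw   = face.yaw?.doubleValue   ?? 0   // turning the head left or right
    let pitch = face.pitch?.doubleValue ?? 0   // nodding the head up or down
    print(String(format: "roll %.2f, yaw %.2f, pitch %.2f", roll, yaw, pitch))
}
```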

Now that we know what the sample app does, we will learn how it is organised to implement these features.

App Architecture and Code Organization

While researching this article, I came across many examples. Almost all of them had the same problem: the “Massive View Controller” anti-pattern.

On a high level, the sample app has three processing steps:

  • Capturing an image sequence
  • Running the detection algorithms
  • Visualising the result

In a “Massive View Controller” style application, almost all of these steps are implemented by one huge class. In classic UIKit apps, this is usually done in UIViewControllers, hence the name. Been there, done that.

What we want instead is good code design in the form of “Separation of Concerns,” where each class serves one specific purpose. We can also look at the SOLID principles from Uncle Bob (Robert C. Martin). The one we want here is the “Single-Responsibility Principle.”

Therefore, my main goal for this project was to have a clear and concise code structure by separating the concerns of capturing, detection, and visualisation into distinct classes and connecting them via a pipeline mechanism.

The project in Xcode is structured as follows:

  • FaceDetectorApp: The entry point of the application, holding the application delegate. The application delegate is where we instantiate the different functional components and connect them.
  • ContentView: The top-most view for the application.
  • CameraView: There is no native camera view for SwiftUI yet. Therefore we need this helper class to wrap the video preview layer, which is native to UIKit.
  • CaptureSession: This class is responsible for capturing the image sequence (the video feed).
  • FaceDetector: This is where the magic happens! All the detection algorithms are called from this class.
  • AVCaptureVideoOrientation: A helper for converting UIDeviceOrientation to AVCaptureVideoOrientation, which is needed for correct visualisation (see the sketch after this list).
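
The sketch below shows the kind of mapping such a helper performs; the sample project’s implementation may differ in detail. The landscape cases are intentionally flipped, because the capture video orientation is mirrored relative to the device orientation.

```swift
import AVFoundation
import UIKit

// Illustrative mapping from UIDeviceOrientation to AVCaptureVideoOrientation.
extension AVCaptureVideoOrientation {
    init?(deviceOrientation: UIDeviceOrientation) {
        switch deviceOrientation {
        case .portrait:           self = .portrait
        case .portraitUpsideDown: self = .portraitUpsideDown
        // Landscape is mirrored between device and capture orientation.
        case .landscapeLeft:      self = .landscapeRight
        case .landscapeRight:     self = .landscapeLeft
        default:                  return nil
        }
    }
}
```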

The classes on their own aren’t really useful. Therefore they need to be connected. I prefer the pipeline pattern. I love writing code by implementing the elements of a pipeline and then connecting them in a declarative style. That is why I embraced Apple’s introduction of the Combine framework. I use it where it makes sense and does not interfere with the clarity and understandability of the code.

The image below shows the elements of our pipeline and their input and output data types. Every output is realised as a Combine publisher (@Published), and the next pipeline element subscribes to that publisher, as sketched below.
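
To illustrate the idea, here is a hedged sketch of two pipeline stages connected with Combine. The type and property names are my own and not the exact ones from the sample project: the capture stage publishes frames, and the detector subscribes to them and publishes its observations.

```swift
import Combine
import CoreMedia
import Foundation
import Vision

// Sketch of the pipeline idea: each stage publishes its output,
// and the next stage subscribes to it.
final class CaptureStage {
    // Emits every captured video frame.
    @Published var frame: CMSampleBuffer?
}

final class DetectionStage {
    // Emits the detected faces for the latest frame.
    @Published var faces: [VNFaceObservation] = []
    private var cancellables = Set<AnyCancellable>()

    func connect(to capture: CaptureStage) {
        capture.$frame
            .compactMap { $0 }                                  // ignore nil frames
            .receive(on: DispatchQueue.global(qos: .userInitiated))
            .sink { [weak self] buffer in
                self?.detectFaces(in: buffer)                   // run the Vision request
            }
            .store(in: &cancellables)
    }

    private func detectFaces(in buffer: CMSampleBuffer) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(buffer) else { return }
        let request = VNDetectFaceRectanglesRequest { [weak self] request, _ in
            self?.faces = (request.results as? [VNFaceObservation]) ?? []
        }
        try? VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:]).perform([request])
    }
}
```

The visualisation layer then subscribes to the detector’s publisher in the same way and redraws whenever new observations arrive.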

That’s about it for how the sample application is structured to implement our required features.

It is now time to wrap it all up.

Conclusion

This tutorial should have given you an in-depth look at what is necessary to implement Face and Face Landmark Detection.

After explaining what kind of information we can gather from Apple’s algorithms, I laid out in detail how I structured the code of the sample application.

My main goal was to provide a clean code design to disentangle the detection code into a simple-to-understand and reusable pipeline.

The complete Xcode workspace with the fully functional app can be downloaded from here.

Thank you for reading!


Resources