Extracting Text From Documents Using Artificial Intelligence

In this tutorial we will learn how to scan documents and extract the contained text using document segmentation and optical character recognition.
A fully working iOS application, provided as source code, will help us understand which of Apple’s frameworks and classes are required to accomplish this task.
Introduction
Phones are not just phones anymore; among other things, they have become our offices. Not many of my generation (I’m a bit older …) would have thought that today’s smartphones could replace so many other devices.
One of those devices is the document scanner.
In 2017, Apple introduced a document scanner in Notes, Mail, Files and Messages. It provides cleaned-up document “scans” that are perspective-corrected and evenly lit.
Since WWDC 2019, we as developers can leverage that feature in our apps as well.
I decided to build the most minimal but fully working SwiftUI app possible. The result is this tutorial and the accompanying app source code.
Here is a demo of how the app works:
The complete Xcode project can be downloaded by following this link.
At the time of writing, my setup is:
- MacBook Pro, 16-inch, 2021, M1 Max
- macOS Monterey, Version 12.0.1
- Xcode, Version 13.1
Here is what we will cover:
- Overview of frameworks
- Document segmentation
- Optical character recognition
- Reference application walkthrough
- Optimisations and best practices
- Conclusion
Let’s get started.
Overview
There are basically two steps necessary to accomplish our task.

First, we need to “find” the document in our image: document segmentation.
Second, we need to “find” the characters and extract text: optical character recognition.
Fortunately, Apple provides plenty of ready-made code that we can use to build a polished app.
The graphic below shows the main classes and frameworks involved.

However, only two of these classes are really important, and those are the ones we will focus on.
The first is VNDocumentCameraViewController:
A view controller that shows what the document camera sees.
The second is VNRecognizeTextRequest:
An image analysis request that finds and recognizes text in an image.
Additionally, we use a bit of Combine to “glue” things together. For user interaction we use SwiftUI.
Document Segmentation
Let’s focus on document segmentation to understand the background necessary to implement our app.
Wikipedia defines image segmentation as:
In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as image objects).
Apple provides VNRequest:
The abstract superclass for analysis requests.
Vision offers a number of concrete VNRequest subclasses for document analysis, such as:
- barcode detection
- text recognition
- contour detection
- rectangle detection
- NEW for 2021: document segmentation detection
The one we need is VNDetectDocumentSegmentationRequest:
An object that detects rectangular regions that contain text in the input image.
It is a machine-learning-based detector that runs in real time on devices with Apple’s Neural Engine. It provides a segmentation mask and corner points and is used by VNDocumentCameraViewController.
Previously we had to use rectangle detection with the more general VNDetectRectanglesRequest to segment a document:
An image analysis request that finds projected rectangular regions in an image.
It uses a traditional CPU-bound algorithm that detects edges and intersections to form quadrilaterals (corner points only). It can detect multiple rectangles including nested ones.
In comparison, VNDetectDocumentSegmentationRequest uses a machine-learning algorithm that can run on the Neural Engine, GPU, or CPU. It has been trained on documents, labels, and signs, including non-rectangular shapes. It typically finds one document and provides a segmentation mask and corner points.
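If you ever want to run the segmentation yourself, the request is straightforward to use. Below is a minimal sketch that runs it on a UIImage and returns the detected corner points; the helper name detectDocumentCorners is my own for illustration and is not part of the project or the framework.

```swift
import UIKit
import Vision

// Minimal sketch: run VNDetectDocumentSegmentationRequest directly on a UIImage (iOS 15+).
// Returns the detected document's four corner points in normalized image coordinates.
func detectDocumentCorners(in image: UIImage) throws -> [CGPoint]? {
    guard let cgImage = image.cgImage else { return nil }

    let request = VNDetectDocumentSegmentationRequest()
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try handler.perform([request])

    // The request usually yields a single observation describing the document.
    guard let document = request.results?.first else { return nil }
    return [document.topLeft, document.topRight, document.bottomRight, document.bottomLeft]
}
```

Note that Vision returns normalized coordinates, so the points have to be scaled to the image or view size before you can draw them.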
For our convenience, we don’t even have to use VNDetectDocumentSegmentationRequest directly because it is already encapsulated in VNDocumentCameraViewController.
We now have everything needed for the next step: extracting the text from the document.
Optical Character Recognition
Wikipedia defines optical character recognition (OCR) as:
Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example, text on signs and billboards), or from subtitle text superimposed on an image.
Until 2019 we would have had to use VNDetectRectanglesRequest and VNTextObservation, then perform multiple steps to extract the text:
- Iterate over character boxes in the observation.
- Train a Core ML model to do character recognition.
- Run the model on each character box.
- Threshold against possible garbage results.
- Concatenate characters into strings.
- Fix recognized characters in strings.
- Correct recognized words based on a dictionary and heuristics.
That’s a lot of work, but not anymore.
Thanks to VNRecognizeTextRequest, many of those steps are reduced to what is essentially a one-liner.
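To see how compact this has become, here is a minimal sketch that recognises the text in a single image; the helper name recognizeText is made up for illustration.

```swift
import Vision

// Minimal sketch: recognise text in one image with VNRecognizeTextRequest (iOS 15 SDK).
// Returns the top string candidate for every piece of text that was found.
func recognizeText(in cgImage: CGImage) throws -> [String] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate   // favour accuracy over speed

    try VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])

    return request.results?.compactMap { $0.topCandidates(1).first?.string } ?? []
}
```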
It really is that simple. We now have everything we need to build an app, and I have already done that for you.
Next I provide an overview of the Xcode project, point out the classes that do the work, and walk you through a few features.
Reference Application Walkthrough
Everything starts in Xcode with a standard SwiftUI application. After adding the necessary code, the project structure looks as follows:

The two most important classes are DocumentCameraView and TextScanner.
We need DocumentCameraView to “wrap” VNDocumentCameraViewController, which is a UIKit class and not native to SwiftUI.
I described this in my tutorial:
https://itnext.io/using-uiview-in-swiftui-ec4e2b39451b
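For reference, a wrapper along these lines could look roughly like the sketch below; the DocumentCameraView in the downloadable project may differ in detail, and for brevity this version hands the scan back through a plain completion closure instead of the Combine publisher mentioned earlier.

```swift
import SwiftUI
import VisionKit

// Sketch of a SwiftUI wrapper around VNDocumentCameraViewController.
struct DocumentCameraView: UIViewControllerRepresentable {
    /// Called with the finished scan when the user taps "Save".
    let completion: (VNDocumentCameraScan) -> Void

    func makeUIViewController(context: Context) -> VNDocumentCameraViewController {
        let controller = VNDocumentCameraViewController()
        controller.delegate = context.coordinator
        return controller
    }

    func updateUIViewController(_ controller: VNDocumentCameraViewController, context: Context) {}

    func makeCoordinator() -> Coordinator { Coordinator(completion: completion) }

    final class Coordinator: NSObject, VNDocumentCameraViewControllerDelegate {
        let completion: (VNDocumentCameraScan) -> Void

        init(completion: @escaping (VNDocumentCameraScan) -> Void) {
            self.completion = completion
        }

        func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                          didFinishWith scan: VNDocumentCameraScan) {
            completion(scan)
            controller.dismiss(animated: true)
        }

        func documentCameraViewControllerDidCancel(_ controller: VNDocumentCameraViewController) {
            controller.dismiss(animated: true)
        }
    }
}
```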
The “result” of VNDocumentCameraViewController is a VNDocumentCameraScan.
At this point, the TextScanner class takes the VNDocumentCameraScan for further processing: extracting the text from the scanned pages.
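A TextScanner-style helper could be sketched like this, walking over every page of the scan and collecting the recognised strings; the class in the project may be organised differently.

```swift
import Vision
import VisionKit

// Sketch of a helper that extracts text from all pages of a VNDocumentCameraScan.
struct TextScanner {
    func text(from scan: VNDocumentCameraScan) -> String {
        var lines: [String] = []

        for pageIndex in 0..<scan.pageCount {
            guard let cgImage = scan.imageOfPage(at: pageIndex).cgImage else { continue }

            let request = VNRecognizeTextRequest()
            request.recognitionLevel = .accurate

            let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
            try? handler.perform([request])

            lines += request.results?.compactMap { $0.topCandidates(1).first?.string } ?? []
        }

        return lines.joined(separator: "\n")
    }
}
```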
To keep the code simple, I glued everything together in the main view (ContentView). I don’t think it is necessary to introduce a fancy app architecture for an app that is meant to stay as simple as possible.
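Glued together, the main view could look roughly like the sketch below; the actual ContentView in the project may differ, and for larger scans the recognition work should be pushed off the main thread.

```swift
import SwiftUI

// Sketch of a ContentView that presents the scanner in a sheet and shows the result.
struct ContentView: View {
    @State private var showScanner = false
    @State private var recognizedText = ""

    var body: some View {
        NavigationView {
            ScrollView {
                Text(recognizedText.isEmpty ? "No scan yet" : recognizedText)
                    .padding()
            }
            .navigationTitle("Text Scanner")
            .toolbar {
                // The "Scan" icon in the top-right corner opens the document camera.
                Button { showScanner = true } label: {
                    Image(systemName: "doc.text.viewfinder")
                }
            }
            .sheet(isPresented: $showScanner) {
                DocumentCameraView { scan in
                    // NOTE: for many pages, run this in the background instead.
                    recognizedText = TextScanner().text(from: scan)
                }
            }
        }
    }
}
```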
Now let’s introduce a few features of the app that come with VNDocumentCameraViewController.
There are several options in the camera view that help optimise scan quality and recognition accuracy.

- Flash can be set to auto (default), always on, or off.
- Document mode options are color (default), grayscale, black and white, and photo.
- Shutter behaviour can be automatic (default), where the camera “presses” the shutter as soon as it thinks it has recognised a document, or manual, if you prefer to press the shutter yourself.
When the shutter is in manual mode, the document will not be segmented automatically. Segmentation must be done manually by adjusting the corner points.

When the “Cancel” button is pressed, the camera view is dismissed and we land in the result view, which is empty if no picture has been taken.

In the result view we can return to the camera view by pressing the “Scan” icon in the top-right corner.
And that’s about it. Let’s wrap up.
Conclusion
What a ride. Starting from the basic building blocks, document segmentation and text recognition, we learned which classes and frameworks are needed to build an app.
Then we worked through a fully working iOS app that can be adapted and extended for whatever use case or requirements you want to realise.
I hope this helps jump-start your next big app idea!
The complete Xcode project can be downloaded by following this link.
Thank you for reading!