Chaining iOS Machine Learning, Computer Vision, and Augmented Reality to Make the Magical Real
Etsy recently released a feature in our buyer-facing iOS app that allows users to visualize wall art within their environments. Seeing a personal piece of art in the context of your own space can be a meaningful way to determine whether the artwork will look just as good in your room as it does on your screen. The new feature uses augmented reality to bridge that gap, meshing the virtual and real worlds. Read on to learn how we made this possible using aspects of machine learning and computer vision to present the best version of Etsy sellers’ artwork in augmented reality. It didn’t even require a PhD-level education or an expensive third-party vendor; we did it all with tools provided by iOS.
Building a Chain
Using Computers to See
Early in 2019, I put together a quick proof of concept that displayed wall art on a vertical plane, which required a standalone image in which the artwork fills the entire frame. Oftentimes, though, Etsy sellers upload images that show their item in context, like on a living room wall, to show scale. This complicates the process because these listing images can’t be placed onto vertical planes in augmented reality as-is; they need to be reformatted and cropped.
Two engineers, Chris Morris and Jake Kirshner, developed a solution that used computer vision to find a rectangle within an image, perhaps a frame, and crop the image for use. Using the Vision framework in iOS, they were able to pull out the artwork we needed to place in 3D space. We found that asking Vision to detect only one rectangle, as opposed to all of them, improved performance and returned the shape the system was most confident in. Afterwards, we used Core Image to crop the image, adjusting for any perspective skew that might be present. Apple has an example that uses a frame buffer, but the same approach can be applied to any UIImage.
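A minimal Swift sketch of the detect-then-crop step described above might look like the following. It is not Etsy's code: the function names `pixelPoint` and `cropToStrongestRectangle` and the 0.6 confidence threshold are illustrative assumptions, and it assumes a recent SDK in which `VNDetectRectanglesRequest.results` is typed as `[VNRectangleObservation]`.

```swift
import CoreImage
import Vision

/// Hypothetical helper: converts a Vision corner point, given in normalized
/// (0...1) coordinates, into the pixel coordinates of an image extent.
func pixelPoint(_ normalized: CGPoint, in extent: CGRect) -> CGPoint {
    CGPoint(x: extent.origin.x + normalized.x * extent.width,
            y: extent.origin.y + normalized.y * extent.height)
}

/// Hypothetical helper: finds the single most confident rectangle in `image`
/// and returns a perspective-corrected crop, or nil if nothing is detected.
func cropToStrongestRectangle(in image: CIImage) throws -> CIImage? {
    let request = VNDetectRectanglesRequest()
    request.maximumObservations = 1   // ask Vision for one rectangle only
    request.minimumConfidence = 0.6   // assumed threshold; tune for your images

    let handler = VNImageRequestHandler(ciImage: image, options: [:])
    try handler.perform([request])

    guard let rectangle = request.results?.first,
          let filter = CIFilter(name: "CIPerspectiveCorrection") else {
        return nil
    }

    // Vision and Core Image both use a bottom-left origin, so the corners
    // only need to be scaled from normalized to pixel coordinates.
    let extent = image.extent
    filter.setValue(image, forKey: kCIInputImageKey)
    filter.setValue(CIVector(cgPoint: pixelPoint(rectangle.topLeft, in: extent)),
                    forKey: "inputTopLeft")
    filter.setValue(CIVector(cgPoint: pixelPoint(rectangle.topRight, in: extent)),
                    forKey: "inputTopRight")
    filter.setValue(CIVector(cgPoint: pixelPoint(rectangle.bottomLeft, in: extent)),
                    forKey: "inputBottomLeft")
    filter.setValue(CIVector(cgPoint: pixelPoint(rectangle.bottomRight, in: extent)),
                    forKey: "inputBottomRight")

    // The filter both crops to the quadrilateral and removes perspective skew.
    return filter.outputImage
}
```

Starting from a `UIImage`, one option is to build the input with `CIImage(image:)` and render the result back through a `CIContext`.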
To Crop or Not to Crop
As I mentioned before, some Etsy sellers upload their artwork as standalone images, while others depict their artwork in different environments. We wanted to present the former as-is, and we needed to crop the latter, but we had no way to automatically categorize the more than 5 million artwork listings available on our marketplace.
To solve this, we used on-device machine learning provided by Core ML. The team sifted through more than 1,200 listings and sorted the images into those that should be cropped and those that should not. To create the machine learning model, we first used an iOS Playground and, later, a Mac application called Create ML. The process was as easy as dropping a directory with two subdirectories of correctly sorted images, “no_frames” and “frames”, into the application, along with a corresponding smaller set of different images used to test the resulting model. Once this model was created and verified, we used VNCoreMLRequest to check a listing’s image and determine whether we should crop it or present it as-is. This type of model is known as an image classifier.
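As a rough illustration of the classification check, the Swift sketch below runs a Core ML model through `VNCoreMLRequest` and maps the top label to a decision. The labels reuse the directory names mentioned above; treating “frames” as the crop case, the 0.8 threshold, and the names `CropDecision`, `decision(forLabel:confidence:threshold:)`, and `classify(_:with:)` are assumptions made for this example.

```swift
import CoreML
import Vision

enum CropDecision {
    case crop
    case presentAsIs
}

/// Hypothetical mapping from a classifier label to a decision. Assumes the
/// "frames" label marks images that should be cropped, and falls back to
/// presenting the image as-is when the model is not confident.
func decision(forLabel label: String,
              confidence: Float,
              threshold: Float = 0.8) -> CropDecision {
    (label == "frames" && confidence >= threshold) ? .crop : .presentAsIs
}

/// Runs an image classifier through Vision and returns the decision for the
/// top-ranked label. `mlModel` would be the compiled Create ML model.
func classify(_ image: CGImage, with mlModel: MLModel) throws -> CropDecision {
    let visionModel = try VNCoreMLModel(for: mlModel)
    let request = VNCoreMLRequest(model: visionModel)

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    // Classification results arrive sorted by descending confidence.
    guard let top = (request.results as? [VNClassificationObservation])?.first else {
        return .presentAsIs
    }
    return decision(forLabel: top.identifier, confidence: top.confidence)
}
```

Keeping the label-to-decision mapping in its own pure function makes the threshold easy to unit test without loading a model.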
We also investigated a different type of model called object detection, which finds the existence and coordinates of a frame within an image. This technique had two downsides: training the model required laborious manual object marking for each image provided, and the resulting model, which would be included in our app bundle, would be well over 60 MB vs. the 15 KB model for image classification. That’s right, kilobytes.
Translating Two Dimensions to Three
Once we had the process for determining whether the image needs to be reformatted, we used a combination of iOS’ SceneKit and ARKit to place the artwork as a material on a rudimentary shape. With Apple focusing heavily on this space, we were able to find plenty of great examples and tutorials to get us started with augmented reality on iOS. We started with the easy-to-use RealityKit framework, but the iOS 13-only restriction was a blocker as we supported back to iOS 11 at the time.
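One way to picture “artwork as a material on a rudimentary shape” is the Swift sketch below, which assumes the shape is a flat `SCNPlane`. The function names and the idea of passing the physical size in meters are assumptions for illustration rather than a description of Etsy's implementation.

```swift
import ARKit
import SceneKit
import UIKit

/// Builds a session configuration that looks for walls.
/// Vertical plane detection requires iOS 11.3 or later.
func makeVerticalPlaneConfiguration() -> ARWorldTrackingConfiguration {
    let configuration = ARWorldTrackingConfiguration()
    configuration.planeDetection = [.vertical]
    return configuration
}

/// Hypothetical helper: wraps an artwork image in a flat plane node.
/// SceneKit units are meters in ARKit, so the sizes are physical dimensions.
func makeArtworkNode(image: UIImage,
                     widthMeters: CGFloat,
                     heightMeters: CGFloat) -> SCNNode {
    let plane = SCNPlane(width: widthMeters, height: heightMeters)

    let material = SCNMaterial()
    material.diffuse.contents = image   // the artwork becomes the surface
    plane.materials = [material]

    return SCNNode(geometry: plane)
}
```

When such a node is added as a child of the node ARKit creates for a vertical `ARPlaneAnchor`, it typically needs a rotation of -π/2 about the x-axis (`node.eulerAngles.x = -.pi / 2`), because an `SCNPlane` stands in its local x-y plane while the anchor's surface lies in its local x-z plane.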
The implementation in ARKit was relatively straightforward, technically, but working for the first time in 3D space vs. a flat screen, it was a challenge to develop a vocabulary and way of thinking about the physical space being altered by the virtual. It was difficult to put into words, for example, how moving an item along the y-axis differed from scaling it in size. While this was eventually smoothed out with experience, we knew we had to keep this in mind for Etsy buyers, as augmented reality is not a common experience for most people. For example, how would we coach them through the fact that ARKit needs them to use the camera to scan the room to find the edges of the wall in order to discern the vertical plane? What makes it apparent that they can tap on the screen? To give our users an idea of how to use this feature successfully, our designer, Kate Matsumoto, product manager, Han Cho, and copywriter, Jed Baker, designed an onboarding flow, based on user testing, that walks our buyers through this new experience.
Wrapping it All Up
Using machine learning to determine whether we should crop an image, cropping it based on a strong rectangle, and presenting the artwork on a real wall was only part of the picture here. Assisted by Evan Wolf and Nook Harquail, we also dealt with complex problems including parsing item descriptions to gather dimensions, raytraced hit-testing, and color averaging to make this feature feel as seamless and lifelike as possible for Etsy buyers. From here, we have plenty of ideas for continuing to improve this experience, but in the meantime, I encourage you to consider the fantastic frameworks you have at your disposal, and how you can link them together to create an experience that seemed impossible just years ago.