How We Built PlantVision
An in-depth, behind-the-scenes look at our Auto-magical AR Plant Camera!
In the last few years, investments in AR technology by major corporations like Apple and Meta have been very interesting to watch. On one hand, advancements in AR tech appear to be accelerating, and we may be on the precipice of some big product launches as we head into 2023. On the other hand, things are still pretty awkward when it comes to finding useful and compelling applications built on the currently available AR platforms, like Apple’s ARKit.
Despite the current dearth of killer apps, it’s clear that these advancements in both software and hardware have the potential to revolutionize a variety of industries — including the way we care for plants. With AR, we can imagine a science-fiction inspired future where plant care is made easier and more intuitive, with virtual overlays providing information about a plant's needs and real-time guidance on how to care for it.
At Greg, we know the key to keeping your plants happy and healthy is to understand the relationship between each plant and its environment. Things like the plant’s species, the size of its pot, and the natural lighting conditions are all important factors in its wellbeing – as well as key inputs into the proprietary Evapotranspiration model that powers our Smart Reminders. We realized that by combining state-of-the-art computer vision models with algorithms built on modern AR frameworks, we could theoretically measure all of those things directly from a video feed in the app. Thus, the idea for PlantVision was born :)
After months of R&D and product development, we built PlantVision: an in-app camera capable of instantly determining
- The species of a plant
- The size of its container
- If it’s indoors or outdoors
- Its distance from the nearest window
And, with the tap of a button, Greg uses that information (and more!) to schedule accurately tailored care reminders for each individual plant in its own unique environment.
Check out the hype reel!
PlantVision is now available on iPhone as part of Greg 2.0, and is currently in beta testing on Android. We’re really excited to be able to share some of the technical achievements that went into creating it!
Short for Augmented Reality Plant Interface, ARPI is the internal name for the underlying engine that powers PlantVision. The engine is responsible for coordinating the 3 main components that make up the PlantVision experience:
- AR Framework — ARKit + SceneKit on iOS
- PlantVision ML — in-house, custom ML models for Species Prediction, Plant Object Detection and Indoor/Outdoor classification
- 3D Scanning Algorithms — combining the outputs of 1 and 2 to predict 3D characteristics like pot and foliage dimensions, locate nearby windows, and navigate closely grouped plants
Each of these tasks alone presents numerous challenges around latency, energy consumption, and memory usage, to say nothing of variations in precision and recall. As such, running them all in parallel while maintaining a cool 60 fps video feed was no small feat.
One of the principal challenges in developing the engine was resource management and process coordination. With 3-5 possible ML models, depth scanning and point-cloud processing algorithms, plant detection and tracking code (and so on), it's clear that we can't naively run every process on every frame. Instead, the engine needs to maintain a rich state for a given scanning session, and on each frame selectively choose which processes need to run for each plant in a given environment. At 60 fps, or about 17ms per frame, the budget is small to begin with, and many processes take longer than a single frame to complete, which is fine up to a certain point. While CoreML is generally fast on most iOS devices (thanks to the Apple Neural Engine), latency can be unpredictable depending on overall system load, which further complicates the coordination of results.
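To make the scheduling idea concrete, here is a minimal sketch of a per-frame budgeter, written in Python for readability rather than Swift. The `Process` type, its fields, and the 17ms budget are illustrative assumptions, not our actual engine internals:

```python
from dataclasses import dataclass

@dataclass
class Process:
    name: str
    est_ms: float             # rolling estimate of how long this process takes
    frames_since_run: int = 0
    interval: int = 1         # desired cadence: run once every `interval` frames

def schedule_frame(processes, budget_ms=17.0):
    """Pick which processes run this frame without blowing the frame budget.

    The most overdue processes (relative to their desired interval) go
    first; anything that doesn't fit in the remaining budget waits.
    """
    chosen, remaining = [], budget_ms
    for p in sorted(processes, key=lambda p: p.frames_since_run / p.interval,
                    reverse=True):
        if p.frames_since_run >= p.interval and p.est_ms <= remaining:
            chosen.append(p)
            remaining -= p.est_ms
            p.frames_since_run = 0
        else:
            p.frames_since_run += 1
    return chosen
```

The point is simply that overdue work gets priority, and anything that would overrun the budget is deferred to a later frame rather than dropped.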
Another challenge that the engine needs to overcome is the inevitably sparse and noisy outputs from the detection processes. The ML models, the point-cloud analysis and depth estimation techniques, and the AR tracking itself can all produce incorrect outputs from time to time, and we need the system to be robust in its ability to filter those false readings if we want a smooth and confidence-inducing user experience.
For instance, when classifying a Baby Rubber Plant, a single frame queued for inference might happen to be a little blurry and catch the light at a particular angle, such that the model temporarily classifies the same plant as a Heartleaf Philodendron. Or perhaps in another frame, our plant detection model decides that the green bath towel hanging out of a laundry basket is actually a large potted plant. Maybe when trying to deduce the dimensions of a particular flower pot, our initial depth estimate is way off and we mistake the 6-inch pot resting 3 feet away for a 22-inch pot that's 10 feet away.
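That last mix-up comes down to angular size: from a single 2D frame, a small pot up close and a big pot far away can subtend nearly the same angle, so they are indistinguishable without depth. A quick back-of-the-envelope check (a hypothetical helper, just geometry):

```python
import math

def angular_size_deg(width_in, distance_ft):
    """Angle (degrees) subtended by an object width_in inches wide,
    viewed from distance_ft feet away."""
    return math.degrees(2 * math.atan((width_in / 2) / (distance_ft * 12)))

near = angular_size_deg(6, 3)    # 6-inch pot, 3 feet away  (~9.5 degrees)
far = angular_size_deg(22, 10)   # 22-inch pot, 10 feet away (~10.5 degrees)
# Both pots fill almost the same fraction of the camera frame.
```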
Most of the noisiness is introduced simply by running our models on video frames instead of individually exposed photographs. However, doing so also provides its own means of error correction. Since all of the ML inference and processing is done locally on the device, it can be performed relatively frequently, producing streams of outputs from each individual process. Real-time stream processing techniques can then be used to filter out outliers and produce stable results. The end result is that, in much less time than a round-trip photo upload to a huge model on a remote server would take, we've combined the results of dozens of classifications and analyses to infer as much information about your plant as possible. As you move your camera around and show different angles of your plant in the room, those results are continuously updated and refined.
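As a rough illustration of the stream-filtering idea (a simplified Python sketch, not our production filter), a sliding-window vote only lets a label become "stable" once it dominates recent frames:

```python
from collections import Counter, deque

class StableLabel:
    """Smooth a noisy stream of per-frame labels with a sliding-window vote.

    A candidate label is only reported once it accounts for at least
    `min_share` of the last `window` observations; otherwise the previous
    stable answer is kept. Window size and threshold are illustrative.
    """
    def __init__(self, window=15, min_share=0.6):
        self.window = deque(maxlen=window)
        self.min_share = min_share
        self.stable = None

    def update(self, label):
        self.window.append(label)
        top, count = Counter(self.window).most_common(1)[0]
        if count / len(self.window) >= self.min_share:
            self.stable = top
        return self.stable
```

With a filter like this, a single blurry-frame misclassification never reaches the user; the displayed species only changes when the evidence is consistent.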
Included in this initial release are 3 proprietary computer vision models that provide classification and object detection for numerous components within PlantVision.
- Species Identification: Built on the EfficientNet Lite B3 architecture, adapted via transfer learning from a model pre-trained on ImageNet. The training dataset contains 2,632 species and over 1 million images.
- Plant Object Detection: Built on the YOLOv5 architecture, trained from scratch. The training set includes 20k images with 200k+ object instances. Deploying it on iOS required a custom spatial pyramid pooling implementation in CoreML.
- Indoor/Outdoor Scene Classification: Built on a custom CNN trained with a binary cross-entropy loss. The training set contains 180k images.
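For reference, the binary cross-entropy loss behind the indoor/outdoor classifier is simple enough to write out directly (a textbook formulation, not our training code; the indoor/outdoor label convention is an assumption):

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean BCE over a batch: y in {0, 1} (say, 1 = indoor),
    p is the model's predicted probability of the positive class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)
```

Confident correct predictions drive the loss toward zero, while confident wrong ones are penalized steeply, which is exactly the behavior you want from a binary scene classifier.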
To achieve our performance targets, our ML engineering efforts needed to balance model size, inference latency, and prediction accuracy. We went through many iterations of these models, each of which required extensive testing and analysis. Part of the challenge in evaluating them is that their real-world performance can differ significantly from standard metrics such as the F0.5 score. The main culprit is the use of video frames instead of individually exposed images: there is a not-so-subtle drop in quality and clarity when running production inference on video frames, whereas our primary source of training images is still conventional photographs.
To mitigate this, we tried several different image data augmentation strategies during training before landing on one that compensated most effectively. Additional processing then further mitigates these quality issues, in the form of a video pre-processing pipeline that enhances contrast, reduces the impact of motion blur, and selectively crops the salient portion of the image before it reaches the models. This pipeline runs in parallel with the engine and ensures that only the minimum amount of processing required by a particular set of processes is performed for each frame.
Sample Confusion Matrix from one of our YOLOv5 experiments
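One piece of a pipeline like the one described above can be sketched as a cheap sharpness gate that keeps blurry frames away from the models entirely. The mean-squared-gradient proxy and the threshold here are illustrative assumptions, not our actual implementation:

```python
def sharpness_score(gray, width):
    """Crude sharpness proxy: mean squared horizontal gradient over a
    flattened grayscale image (row-major list of 0-255 values)."""
    total, n = 0.0, 0
    for i in range(len(gray) - 1):
        if (i + 1) % width != 0:  # skip pairs that straddle a row boundary
            d = gray[i + 1] - gray[i]
            total += d * d
            n += 1
    return total / n if n else 0.0

def should_infer(gray, width, threshold=100.0):
    """Only hand a frame to the ML models if it looks sharp enough."""
    return sharpness_score(gray, width) >= threshold
```

A frame dominated by motion blur has weak gradients everywhere, so it scores low and is simply skipped; the next, sharper frame is only milliseconds away.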
To accomplish things like detecting and highlighting the dimensions of a flowerpot, or tracking the position of the windows relative to a plant, we need to combine the output of the PlantVision ML models with accurate depth estimation provided by ARKit. The engine coordinates which frames require further processing with depth information and ensures that the timing lines up with the correct frame data, even when ML results arrive multiple frames later.
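One way to handle those late-arriving results (a hypothetical sketch; the `FrameStore` name and shape are ours for illustration only) is a small ring buffer of recent frame metadata keyed by timestamp, so a classification computed from frame N can still be paired with frame N's camera pose and depth data:

```python
from collections import OrderedDict

class FrameStore:
    """Keep metadata (pose, depth, etc.) for the last `capacity` frames so
    ML results that arrive a few frames late can be matched back up."""
    def __init__(self, capacity=30):
        self.capacity = capacity
        self.frames = OrderedDict()  # timestamp -> metadata

    def push(self, timestamp, metadata):
        self.frames[timestamp] = metadata
        while len(self.frames) > self.capacity:
            self.frames.popitem(last=False)  # evict the oldest frame

    def match(self, timestamp):
        """Return the stored metadata for a late-arriving result, or None
        if the frame has already been evicted."""
        return self.frames.get(timestamp)
```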
For devices with LiDAR sensors built in, we take advantage of the full captured depth data to quickly build a complete understanding of the physical environment. This process is fast and very accurate, providing near-instant response times.
For devices without it, however, things get much harder. Rather than directly processing the estimated depth of a scene, we need to query for depth at the individual pixel level, relying on the built-in ARKit models to do so. This comes with a significant latency cost (about 1000x slower than LiDAR), so to maintain a fast and fluid user experience, we estimate the salient factors with lower precision and a much smaller input threshold. The end result is that all devices can take accurate measurements with PlantVision, but computing the physical dimensions can take a little longer.
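To illustrate the fewer-queries idea (hypothetical helper names, not the actual ARKit API), depth can be sampled on a coarse grid inside a detected bounding box and reduced to a robust middle value instead of being queried at every pixel:

```python
def sample_depth(query_depth, box, grid=4):
    """Query depth on a coarse grid inside a bounding box and return an
    (upper) median — far fewer queries than a full per-pixel pass.

    `query_depth(x, y)` stands in for an expensive per-pixel depth lookup;
    `box` is (x0, y0, x1, y1) in pixels.
    """
    x0, y0, x1, y1 = box
    samples = []
    for i in range(grid):
        for j in range(grid):
            x = x0 + (i + 0.5) * (x1 - x0) / grid
            y = y0 + (j + 0.5) * (y1 - y0) / grid
            samples.append(query_depth(x, y))
    samples.sort()
    return samples[len(samples) // 2]  # 16 queries instead of thousands
```

Taking a middle value rather than a mean also shrugs off the occasional wildly wrong per-pixel estimate.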
While PlantVision currently only scans one plant at a time, it’s processing every plant visible in the frame to ensure accurate segmentation and to provide the user an intuitive sense of ‘focus’ when panning the camera around. Graphical debugging techniques were critical to understanding the behavior of the system and updating the algorithms to achieve more robust results.
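That sense of 'focus' can be approximated as picking, among all detected plants, the bounding box whose center sits nearest the viewport center. A minimal sketch with illustrative names:

```python
def pick_focused(boxes, viewport):
    """Among detected plant boxes (x0, y0, x1, y1), pick the one whose
    center is closest to the viewport center — the plant the user is
    'aiming' the camera at. Returns None if nothing is detected."""
    vw, vh = viewport
    cx, cy = vw / 2, vh / 2

    def dist2(box):
        x0, y0, x1, y1 = box
        bx, by = (x0 + x1) / 2, (y0 + y1) / 2
        return (bx - cx) ** 2 + (by - cy) ** 2

    return min(boxes, key=dist2) if boxes else None
```

Because every visible plant is already segmented each frame, re-evaluating this as the user pans is essentially free, and the highlighted plant follows the center of the screen naturally.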