New AI system to give Microsoft Seeing AI a boost

A new object captioning AI model from Microsoft strengthens the tech giant's accessibility services, which the vendor has championed for years.

Mark Labbe

Published: 15 Oct 2020

Microsoft developed a new AI system that can accurately caption images in a push to make its products more accessible to people who are blind.

The vendor made the system available in the Azure Cognitive Services Computer Vision product, and is incorporating it into its Seeing AI application, as well as Microsoft Word, Outlook and PowerPoint.

The AI captioning platform appears to be a response to the need for software for people with disabilities and augments the tech giant's Seeing AI system, first introduced in 2017.

"People with disabilities have been overlooked when it comes to technology development as a whole," said Nick McQuire, head of enterprise and AI research at CCS Insight.

AI for accessibility

Seeing AI is an iOS app that uses AI to describe objects, people, scenery, colors, currency and documents. The app -- designed to audibly describe the world to people who are visually impaired -- marked one of the first major strides into accessibility by a big tech vendor, McQuire noted.

The new AI system, revealed in a blog post by the company on Oct. 14, will improve the accuracy of Seeing AI's ability to describe objects. In Microsoft Word, Outlook and PowerPoint, people who are visually impaired can use the system to provide alternate text on images.

"Accessibility has long been a requirement that quietly was ignored," because it was an expensive activity, said Alan Pelz-Sharpe, founder of market advisory and research firm Deep Analysis.

Technological advancements are changing that, however.

The new system, according to Microsoft, can caption images more accurately than humans can in certain tests.

Microsoft's claims have some merit: Its system currently tops the leaderboard of a novel object captioning challenge, hosted on an online tech competition site, that AI researchers developed to benchmark such systems.

Object captioning is made possible by advances in deep learning, computer vision and natural language models, as well as the accurate labeling of robust datasets, according to Forrester analyst Mike Gualtieri.

While Microsoft currently has an advantage in some standardized image captioning competitions, where image captioning really matters is in specific domains such as manufacturing quality, agriculture, transportation and medical imaging, Gualtieri added.

Those highly specialized domains require domain experts to label images, as the layperson wouldn't have the knowledge to correctly say what they're seeing. It's unclear, then, how Microsoft's system might fare in a specialized setting, such as correctly labeling medical X-rays, for example.

"Success in domain-specific image captioning is all about the training data, and the best training data has been labeled by human experts," Gualtieri said.

Microsoft's advancements in object captioning will help automate managing rich media assets and enrich metadata significantly, Pelz-Sharpe said.

"It will have strong value in diverse sectors such as defense, insurance and retail. But as of now, its use is in its infancy, so those will develop further over time," he said.
