OCR Support in Office 365

Last year at Ignite, Microsoft announced OCR capability using “computer vision technology” would be coming to Office 365! This stirred up a lot of excitement as this was a feature many people have been wanting for years! Microsoft posted an article at techcommunity about the new advancement in Intelligent Search using OCR which can be found here. Well I am happy to say it exists and it sure is awesome!

Supported Types

Per the techcommunity article above, the supported types are “bmp”, “png”, “jpeg”, “jpg”, “gif”, “tif”, “tiff”, “raw”, and also “arw”, “cr2”, “crw”, “erf”, “mef”, “mrw”, “nef”, “nrw”, “orf”, “pef”, “rw2”, “rw1”, “sr2”.

To test this capability, I have uploaded a mock up design for a Help Desk Add-In I have built for Office 365 into a Document library.

SupportTicket

The jpg looks like this

HelpDesk

You’ll notice inside the image there are 4 tickets have been mocked up with a bit of bacon ipsum. The ticket titles are “Trouble with SharePoint”, “Computer not Working”, “No Trouble Just Saying Hi” and “Trouble with SharePoint” again.  My plan to test out OCR is to search for the contents of that Title using SharePoint search!

Test: Search for Ticket values

My first test will be trying to find one of the tickets labeled “Trouble with SharePoint” and look at the results! Not only does it match with the values in the image, but it’s also picking up the other ticket values as well.

 

SearchResult

It should be no surprise that when I search for some of the description of the tickets, that they should return in search results as well. I’ve decided to search for the first sentence of the description in the mock up. Here is the result!

SearchResult2Does it work in Modern too?

In my previous examples I was using SharePoint Classic Search. If you were wondering if it works in Modern search as well, you bet!

modernsearchHow does it work?

My guess is that OCR in SharePoint is using Azure Media Services to convert text content in digital files into digital text. The reason for assuming this is because of the following naming convention used in SharePoint.

Whenever SharePoint finds text within your images, the values get stored on the item in a field called MediaServiceOCR. Take a look at the JSON response from querying for the list item.

MediaServiceOCR

I wasn’t able to find a default managed property for this field but that isn’t a huge problem because SharePoint automatically creates a crawled property called ows_MediaServiceOCR. Using this crawled property, I can create whatever managed property mappings that I want.

Crawled.png

 

Some Comments

OCR PDFs have native support in Office 365. However, scanned documents which are PDFs currently aren’t generating values in the MediaServiceOCR column. I’ve been testing this functionality with no success — yet.

I have noticed some inconsistencies with the OCR functionality. I have tested this on multiple libraries and I have noticed it hasn’t been creating the MediaServiceOCR values on some items (doesn’t exist). I’ll keep you posted when I find more information about this.

 

 

3 thoughts on “OCR Support in Office 365

  1. bryan reisner April 19, 2018 / 7:01 am

    Curious that this does not include PDFs. Do PDFs already support OCR in SharePoint? Is there a way to get native PDF OCR all within SharePoint??

    Like

    • Beau Cameron April 19, 2018 / 9:42 am

      Great Question. There is PDF support native to Office 365. However, scanned PDFs or Image PDFs do not have OCR functionality. I’m not sure if there is any timeline for something like this.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s