Last year at Ignite, Microsoft announced that OCR capability using “computer vision technology” would be coming to Office 365! This stirred up a lot of excitement, as it is a feature many people have wanted for years. Microsoft posted an article on techcommunity about the new advancement in Intelligent Search using OCR, which can be found here. Well, I am happy to say it exists and it sure is awesome!
Per the techcommunity article above, the supported types are “bmp”, “png”, “jpeg”, “jpg”, “gif”, “tif”, “tiff”, “raw”, and also “arw”, “cr2”, “crw”, “erf”, “mef”, “mrw”, “nef”, “nrw”, “orf”, “pef”, “rw2”, “rw1”, “sr2”.
To test this capability, I have uploaded a mock up design for a Help Desk Add-In I have built for Office 365 into a Document library.
The jpg looks like this:
You’ll notice that inside the image, 4 tickets have been mocked up with a bit of bacon ipsum. The ticket titles are “Trouble with SharePoint”, “Computer not Working”, “No Trouble Just Saying Hi” and “Trouble with SharePoint” again. My plan to test out OCR is to search for the contents of those titles using SharePoint search!
Test: Search for Ticket values
My first test is to search for one of the tickets labeled “Trouble with SharePoint” and look at the results! Not only does the search match the values in the image, it also picks up the other ticket values.
It should be no surprise that searching for part of a ticket description returns the image in the search results as well. I decided to search for the first sentence of the description in the mockup. Here is the result!
Does it work in Modern too?
In my previous examples I was using SharePoint Classic Search. If you were wondering if it works in Modern search as well, you bet!
How does it work?
My guess is that OCR in SharePoint is using Azure Media Services to convert text found inside image files into searchable digital text. I assume this because of the following naming convention used in SharePoint.
Whenever SharePoint finds text within your images, the values get stored on the item in a field called MediaServiceOCR. Take a look at the JSON response from querying for the list item.
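To make the lookup above concrete, here is a minimal Python sketch of how the field could be read through the SharePoint REST API. It only builds the request URL and parses a response body; authentication and the actual HTTP call are left out. The site URL, library title, and item ID are hypothetical placeholders, and the assumption that MediaServiceOCR can be requested with $select like any other column comes from the behavior described in this post rather than from official documentation.

```python
import json
from urllib.parse import quote


def build_item_url(site_url, list_title, item_id):
    """Build a REST URL that asks for one list item's OCR field.

    site_url, list_title and item_id are placeholders for illustration.
    """
    # Single quotes inside an OData string literal are doubled, then the
    # title is percent-encoded so spaces survive in the URL.
    safe_title = quote(list_title.replace("'", "''"))
    return (
        f"{site_url.rstrip('/')}/_api/web/lists/getbytitle('{safe_title}')"
        f"/items({item_id})?$select=Id,MediaServiceOCR"
    )


def extract_ocr_text(response_body):
    """Return the MediaServiceOCR value from a JSON response, or None.

    Handles both the flat shape (odata=nometadata) and the verbose shape,
    where the item is wrapped in a top-level "d" object.
    """
    item = json.loads(response_body)
    item = item.get("d", item)
    return item.get("MediaServiceOCR")


if __name__ == "__main__":
    url = build_item_url(
        "https://contoso.sharepoint.com/sites/helpdesk", "Design Mockups", 7
    )
    print(url)
    # A made-up response body, shaped like an odata=nometadata reply.
    sample = '{"Id": 7, "MediaServiceOCR": "Trouble with SharePoint"}'
    print(extract_ocr_text(sample))
```

A return value of None from extract_ocr_text means the field is missing from the item, which makes it easy to spot items where OCR has not produced a value.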
I wasn’t able to find a default managed property for this field, but that isn’t a huge problem because SharePoint automatically creates a crawled property called ows_MediaServiceOCR. Using this crawled property, I can create whatever managed property mappings I want.
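Once the crawled property is mapped, the OCR text can be targeted directly in a search query. The sketch below builds a SharePoint Search REST URL that restricts the query to a single managed property. It assumes ows_MediaServiceOCR has been mapped to RefinableString00; that mapping is a hypothetical choice for illustration, so substitute whichever managed property you actually used. The site URL is also a placeholder, and authentication is not shown.

```python
from urllib.parse import quote


def build_ocr_search_url(site_url, managed_property, phrase):
    """Build a Search REST URL that matches a phrase in one managed property.

    managed_property is whatever property ows_MediaServiceOCR was mapped to
    (RefinableString00 in the example below is only an assumption).
    """
    # KQL property restriction, e.g. RefinableString00:"some phrase"
    kql = f'{managed_property}:"{phrase}"'
    # Double any single quotes for the OData literal, then percent-encode.
    query_text = quote(kql.replace("'", "''"))
    select = quote(f"Title,Path,{managed_property}")
    return (
        f"{site_url.rstrip('/')}/_api/search/query"
        f"?querytext='{query_text}'&selectproperties='{select}'"
    )


if __name__ == "__main__":
    print(
        build_ocr_search_url(
            "https://contoso.sharepoint.com",
            "RefinableString00",
            "Trouble with SharePoint",
        )
    )
```

Restricting the query to the mapped property makes it easier to tell whether a hit came from the OCR text or from other metadata such as the file name.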
PDFs that have already been through OCR have native support in Office 365. However, scanned documents saved as PDFs currently aren’t generating values in the MediaServiceOCR column. I’ve been testing this functionality with no success so far.
I have also noticed some inconsistencies with the OCR functionality. I have tested it on multiple libraries, and on some items the MediaServiceOCR value hasn’t been created at all. I’ll keep you posted when I find more information about this.