OCR Support in Office 365

Last year at Ignite, Microsoft announced that OCR capability using “computer vision technology” would be coming to Office 365! This stirred up a lot of excitement, as many people have wanted this feature for years. Microsoft posted an article on techcommunity about the new advancement in Intelligent Search using OCR, which can be found here. Well, I am happy to say it exists, and it sure is awesome!

Supported Types

Per the techcommunity article above, the supported types are “bmp”, “png”, “jpeg”, “jpg”, “gif”, “tif”, “tiff”, “raw”, and also “arw”, “cr2”, “crw”, “erf”, “mef”, “mrw”, “nef”, “nrw”, “orf”, “pef”, “rw2”, “rw1”, “sr2”.
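
If you want to check a batch of files before uploading them, the list above is easy to turn into a small helper. This is only a sketch in Python: the function name `is_ocr_candidate` is made up for illustration, and being on the list does not guarantee that SharePoint will actually OCR the file.

```python
from pathlib import PurePosixPath

# Extensions copied from the list quoted above.
OCR_EXTENSIONS = frozenset({
    "bmp", "png", "jpeg", "jpg", "gif", "tif", "tiff", "raw",
    "arw", "cr2", "crw", "erf", "mef", "mrw", "nef", "nrw",
    "orf", "pef", "rw2", "rw1", "sr2",
})


def is_ocr_candidate(filename: str) -> bool:
    """Return True if the file extension is on the list above (case-insensitive)."""
    suffix = PurePosixPath(filename).suffix  # ".JPG" for "HelpDesk.JPG", "" if none
    return suffix[1:].lower() in OCR_EXTENSIONS
```

For example, `is_ocr_candidate("HelpDesk.JPG")` is true, while a PDF or a file without an extension is not a candidate.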

To test this capability, I have uploaded a mock up design for a Help Desk Add-In I have built for Office 365 into a Document library.

SupportTicket

The jpg looks like this

HelpDesk

You’ll notice that inside the image, 4 tickets have been mocked up with a bit of bacon ipsum. The ticket titles are “Trouble with SharePoint”, “Computer not Working”, “No Trouble Just Saying Hi” and “Trouble with SharePoint” again. My plan for testing OCR is to search for the contents of those titles using SharePoint search!

Test: Search for Ticket values

My first test is to search for one of the tickets labeled “Trouble with SharePoint”. Look at the results! Not only does the search match the value in the image, it also picks up the other ticket values.

 

SearchResult

It should be no surprise that searching for part of a ticket description returns the image in the search results as well. I decided to search for the first sentence of the description in the mock-up. Here is the result!
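
The same kind of phrase search can also be run outside the search box, through the SharePoint search REST endpoint (`/_api/search/query`). Below is a minimal Python sketch that only builds the request URL. The site URL is a placeholder, the function name `build_search_url` is hypothetical, and authentication and the HTTP call itself are left out.

```python
from urllib.parse import quote


def build_search_url(site_url: str, phrase: str) -> str:
    """Build a search REST URL that looks for an exact phrase."""
    # Double quotes make it a phrase query; drop any stray quotes in the input.
    kql = '"' + phrase.replace('"', "") + '"'
    # The REST API expects querytext as a single-quoted literal,
    # with embedded single quotes doubled.
    literal = "'" + kql.replace("'", "''") + "'"
    return f"{site_url.rstrip('/')}/_api/search/query?querytext={quote(literal, safe='')}"


# Placeholder tenant URL, not a real site.
print(build_search_url("https://contoso.sharepoint.com", "Trouble with SharePoint"))
```

Sending a GET request to that URL with a valid access token should return the same kind of hits shown in the screenshots, although I have only sketched the URL here.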

SearchResult2

Does it work in Modern too?

In my previous examples I was using SharePoint Classic Search. If you were wondering if it works in Modern search as well, you bet!

modernsearch

How does it work?

My guess is that OCR in SharePoint is using Azure Media Services to convert the text inside image files into digital text. I’m assuming this because of the naming convention SharePoint uses for the field described below.

Whenever SharePoint finds text within your images, the values get stored on the item in a field called MediaServiceOCR. Take a look at the JSON response from querying for the list item.

MediaServiceOCR
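
If you want to pull the field yourself, a request along these lines against the SharePoint REST API should do it. This is a rough Python sketch: `build_item_url` and `extract_ocr_text` are names I made up, the site URL and library title are placeholders, and I am assuming the field can be requested with `$select` since it shows up in the item JSON. Authentication and the HTTP call are not shown.

```python
from typing import Any, Dict, Optional
from urllib.parse import quote


def build_item_url(site_url: str, list_title: str, item_id: int) -> str:
    """Build a REST URL for one list item, selecting only the OCR field."""
    # Single quotes inside the title are doubled for the REST string literal.
    title = quote(list_title.replace("'", "''"), safe="'")
    base = site_url.rstrip("/")
    return (
        f"{base}/_api/web/lists/getbytitle('{title}')"
        f"/items({item_id})?$select=MediaServiceOCR"
    )


def extract_ocr_text(payload: Dict[str, Any]) -> Optional[str]:
    """Return the OCR text from an item response, or None if it is missing or empty."""
    # odata=verbose wraps the item in "d"; other metadata levels return it flat.
    item = payload.get("d", payload)
    text = item.get("MediaServiceOCR")
    if isinstance(text, str) and text.strip():
        return text.strip()
    return None
```

Returning `None` for a missing or empty value makes it easy to spot the items where SharePoint has not populated the field.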

I wasn’t able to find a default managed property for this field, but that isn’t a huge problem because SharePoint automatically creates a crawled property called ows_MediaServiceOCR. Using this crawled property, I can create whatever managed property mappings I want.

Crawled.png
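
Once ows_MediaServiceOCR is mapped to a managed property, a KQL property restriction can target the OCR text directly. The Python sketch below assumes a managed property named `HelpDeskOCRText`, which is a hypothetical name you would replace with your own mapping. It also assumes the search response uses the flat `PrimaryQueryResult` → `RelevantResults` → `Table` → `Rows` → `Cells` layout returned with `odata=nometadata`; the verbose format nests things differently.

```python
from typing import Any, Dict, List


def ocr_property_query(managed_property: str, phrase: str) -> str:
    """Build a KQL property restriction such as HelpDeskOCRText:"some phrase"."""
    cleaned = phrase.replace('"', " ").strip()
    return f'{managed_property}:"{cleaned}"'


def rows_to_dicts(search_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten the Key/Value cells of each result row into a plain dict."""
    table = search_response["PrimaryQueryResult"]["RelevantResults"]["Table"]
    return [
        {cell["Key"]: cell["Value"] for cell in row["Cells"]}
        for row in table["Rows"]
    ]


# "HelpDeskOCRText" is a made-up managed property name.
print(ocr_property_query("HelpDeskOCRText", "Trouble with SharePoint"))
```

The resulting string can be used as the query text in the search box or in a search REST call.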

 

Some Comments

PDFs that already contain text have native search support in Office 365. However, scanned documents saved as PDFs currently aren’t generating values in the MediaServiceOCR column. I’ve been testing this functionality with no success so far.

I have noticed some inconsistencies with the OCR functionality. I have tested this on multiple libraries, and on some items the MediaServiceOCR value is never created (the field doesn’t exist on the item). I’ll keep you posted when I find more information about this.

 

 

19 thoughts on “OCR Support in Office 365”

  1. bryan reisner April 19, 2018 / 7:01 am

    Curious that this does not include PDFs. Do PDFs already support OCR in SharePoint? Is there a way to get native PDF OCR all within SharePoint??

    • Beau Cameron April 19, 2018 / 9:42 am

      Great Question. There is PDF support native to Office 365. However, scanned PDFs or Image PDFs do not have OCR functionality. I’m not sure if there is any timeline for something like this.

  2. Everton Batista July 16, 2018 / 7:39 pm

    Hi, thanks for the post! I’m using SharePoint Online, but the MediaServiceOCR field does not appear in the view (allitems.aspx). Do I need to enable any features?

    • Beau Cameron July 17, 2018 / 8:12 am

      MediaServiceOCR is a hidden field used for the search capabilities. You likely wouldn’t want to show this field in a view anyways 🙂

      • Everton Batista July 18, 2018 / 7:06 pm

        Ok, Thank you!!!

      • Hoang Bao Phu December 18, 2018 / 8:27 pm

        Hi, thank you very much. But I have a question: must we configure MediaServiceOCR, or is it filled in automatically when we upload an image to the library? I uploaded an image to a library, but this field is still empty.

      • Beau Cameron December 19, 2018 / 5:36 am

        It should fill in automatically if the file is one of the supported types. It’s inconsistent, unfortunately, and I think it happens during some timer job, not immediately.

  3. master July 26, 2018 / 3:37 am

    PDF image OCR is on document libraries. Image OCR is within image libraries. Vice versa doesn’t work. OCR is new to SharePoint Online and only came into effect 3 months ago, so it is going through some teething issues, but it’s better than paying for a 3rd-party tool to OCR documents on Azure and then post the document back to SharePoint OCR’ed.

  4. Daniel Moerland June 7, 2019 / 10:46 am

    Did anyone else see the snippet on the SharePoint Conference Key Note regarding Image Only PDFs being indexed through Cognitive services? It said coming soon and not sure if anyone has anymore insight on this?

    • Beau Cameron June 11, 2019 / 7:03 am

      Yea I’m not sure when they may come. Microsoft Search is continually evolving, I expect to hear more during Ignite.

  5. DIMITRI ILIUK July 7, 2019 / 8:46 pm

    It works for me when I use Office Lens to create the PDF, which is then automatically saved to OneDrive for Business. If I then search for body text of the newly created PDF, the results are returned as required. I then use a product called MacroView DMF to move it to a SharePoint library – details for DMF can be found at https://macroview365.com.

  6. Kurt Henderson October 30, 2019 / 2:58 pm

    Beau I work with a few clients on Office 365 and on my tenant and most other clients, the OCR’ing of images works fine and I can find the files in a search but I have one client tenant that it does not work on. Any suggestions?

  7. Kurt Henderson November 6, 2019 / 6:21 am

    Beau, Have you heard any more on SharePoint being able to OCR documents? It has been really hit or miss with my testing. It seems to work for PNG and JPG but I can’t get TIF formats to work out of the box.

  8. Kurt Henderson November 21, 2019 / 9:58 am

    Beau have you heard any more updates on this. I went round and round with Microsoft support on what file types were supported and what was not. The only thing they would commit to was JPG and PNG formats which of course have no multi-page support. Nothing else really worked.

    • Beau Cameron January 27, 2020 / 7:30 am

      Hi Kurt, sorry I missed this. I haven’t heard any new updates on this unfortunately 😦

      • Kurt January 27, 2020 / 7:42 am

        Thanks for the feedback. Have you implemented any other solutions for other clients? I did find out that Foxit Pro has an option to “Make PDFs Readable”. I used this on a set of image based PDF documents then uploaded those to SharePoint and they were searchable in SharePoint. That’s kind of the direction we are thinking of going for my client.
