Insights and Discovery Accelerator
Unlock insights and discover intelligent connections in unstructured data.
The Insights and Discovery Accelerator (IDA) uses Microsoft AI to speed up and scale investigations and research by helping journalists find critical content in a sea of data.
The Insights and Discovery Accelerator will:
- Intelligently identify entities: IDA recognises entities within text, intelligently linking related topics to help you understand key figures, organisations, concepts etc.
- Retrieve full transcripts from scanned physical books and text: IDA uses computer vision to identify layout characteristics such as columns and hyphenation in scanned text, giving you a full and accurate transcript.
- Create data sets from your information: IDA lets you create data sets from which you can visualise relationships and chart many facets, making it a great visual tool for investigative journalism.
- Scan video and audio: Using Video Indexer, IDA can recognise people, topics and entities, and pull transcripts from both video and audio files.
Case study: Finding insights hidden in 160+ years of archives
Azure Cognitive Search ingests a variety of file formats and then applies custom AI models, OCR, entity extraction, and document classification. This lets us identify and label document segments, entities, and other key components such as author, page number, and issue with a high degree of accuracy.
Technical details for the Insights and Discovery Accelerator
For brands like The Atlantic, which has been publishing since 1857, there are massive archives of photos, articles, long-form stories, ads and more that must be searched through.
Working side by side with 4,500 of the world’s most influential and respected news outlets, Microsoft News learned that content search was a universal pain point for journalists. Searching archival news is hampered by several challenges:
- Tagging: Manual tagging is slow, cumbersome, non-standardised, subjective, and often inaccurate. Would something you search for in the future have been tagged as important when the content was digitised?
- Media types: Some archives are digital, but much material is still physical, which means thumbing through delicate books and documents that need preservation.
- Formatting: Magazines vary layout elements such as two- or three-column structure, pull quotes, jump pages, and embedded ads. Templates for any magazine can evolve over time. Forced layouts produce hyphenation that is inappropriate when the text is reflowed into new layouts. A typical OCR scan reads left to right across the page, ignoring column breaks, which causes indexing issues.
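The column and hyphenation problems described in the last point can be illustrated with a minimal sketch. It assumes the OCR step yields text lines with page coordinates and that the column start positions are already known (for example, from a layout model); the `Line` type, `reading_order`, `join_lines`, and the sample page are all hypothetical and are not part of the accelerator itself.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    x: float  # left edge of the line on the page
    y: float  # top edge of the line on the page


def reading_order(lines, column_edges):
    """Sort lines column by column, then top to bottom within each column.

    column_edges lists the x position where each column starts, ascending.
    """
    def column_of(line):
        # Index of the right-most column that starts at or before the line.
        return max(
            (i for i, edge in enumerate(column_edges) if line.x >= edge),
            default=0,
        )

    return sorted(lines, key=lambda line: (column_of(line), line.y, line.x))


def join_lines(texts):
    """Join lines into one string, merging words split by a trailing hyphen.

    Naive on purpose: a genuine compound broken at its hyphen is merged too.
    """
    joined = ""
    for text in texts:
        text = text.strip()
        if joined.endswith("-"):
            joined = joined[:-1] + text
        elif joined:
            joined += " " + text
        else:
            joined = text
    return joined


# Made-up two-column page: the left column starts at x=0, the right at x=300.
page = [
    Line("The archive con-", 10, 10),
    Line("of the maga-", 310, 10),
    Line("tains every issue", 10, 30),
    Line("zine since 1857.", 310, 30),
]

naive = sorted(page, key=lambda line: (line.y, line.x))  # straight across the page
ordered = reading_order(page, column_edges=[0, 300])     # one column at a time

print(join_lines(line.text for line in naive))
print(join_lines(line.text for line in ordered))
```

Reading straight across the page interleaves the two columns and glues unrelated word fragments together, while the column-aware order yields a readable sentence. A production pipeline would also need a dictionary or language model to decide which end-of-line hyphens are genuine.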
Without organisation, a rich data set can be unwieldy. The solution was to build tools that could more easily navigate key facets within the data, such as issue, author, and year of publication.
Microsoft News and Unify, a Microsoft Digital Transformation partner, created a solution to help media organisations like The Atlantic crack open their archives and easily access valuable insights. Part of that work was to analyse generational changes within the archives with respect to identified core features. Once these subsets were created, the Insights and Discovery Accelerator labelled approximately 500 articles, randomly sampled from each generational age, to determine the preliminary discriminating characteristics of the document segment labels. It was key to explore the content as it was presented in different layouts.
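Random sampling from each generational subset can be sketched as follows. This is an illustration only: the era boundaries, the `per_group` size, the article records, and the function names are assumptions chosen for the example, not details of the actual project.

```python
import random
from collections import defaultdict


def sample_per_group(articles, group_of, per_group, seed=0):
    """Randomly sample up to per_group articles from each group."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for article in articles:
        groups[group_of(article)].append(article)
    return {
        name: rng.sample(members, min(per_group, len(members)))
        for name, members in groups.items()
    }


def era(article):
    """Assign an article to a hypothetical layout era by publication year."""
    year = article["year"]
    if year < 1900:
        return "1857-1899"
    if year < 1950:
        return "1900-1949"
    return "1950-present"


# Made-up archive index: 1,600 records spread evenly over 160 years.
articles = [{"id": i, "year": 1857 + i % 160} for i in range(1600)]

samples = sample_per_group(articles, era, per_group=5)
for name, chosen in sorted(samples.items()):
    print(name, [a["id"] for a in chosen])
```

Sampling within each era, rather than across the whole archive, keeps older and less numerous layouts represented in the labelled set.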
Following the exercise in labelling, the documents and bounding box labels were uploaded to Azure to train an Azure Custom Vision model to identify document segments, columns and other boundaries.
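Before upload, bounding box labels usually need to be put into the form the training service expects. The sketch below assumes the labelling tool exports pixel coordinates and that the training service wants each region expressed as fractions of the image size (left, top, width, height between 0 and 1). The `Region` type and `normalise_box` function are hypothetical helpers and do not call any Azure API.

```python
from dataclasses import dataclass


@dataclass
class Region:
    label: str
    left: float    # fraction of image width
    top: float     # fraction of image height
    width: float   # fraction of image width
    height: float  # fraction of image height


def normalise_box(label, x0, y0, x1, y1, image_width, image_height):
    """Convert a pixel box with corners (x0, y0) and (x1, y1) to image fractions."""
    if not (0 <= x0 < x1 <= image_width and 0 <= y0 < y1 <= image_height):
        raise ValueError("box must lie inside the image and have positive area")
    return Region(
        label=label,
        left=x0 / image_width,
        top=y0 / image_height,
        width=(x1 - x0) / image_width,
        height=(y1 - y0) / image_height,
    )


# Made-up label: one text column on a 1000 x 2000 pixel page scan.
region = normalise_box("column", 100, 200, 500, 1000, 1000, 2000)
print(region)
```

Storing labels as fractions means the same label stays valid if the scan is later resized.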
This new skill was integrated into an Azure Cognitive Search enrichment pipeline to extract known entities (such as people, places and dates) and display their relationships in an easy-to-navigate data visualisation, as well as to power a faceted search experience.
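The idea behind faceted search can be shown with a small in-memory sketch: count how many documents carry each value of a field, then narrow the set by one value and count again. The documents, field names, and helper functions below are invented for illustration; in the real solution the search service computes these counts over its index.

```python
from collections import Counter


def values_of(document, field):
    """Return a field's values as a list, whether stored as a scalar or a list."""
    value = document.get(field)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def facet_counts(documents, field):
    """Count how many documents carry each value of a field."""
    counts = Counter()
    for document in documents:
        counts.update(set(values_of(document, field)))
    return counts


def filter_by(documents, field, wanted):
    """Keep only the documents whose field contains the wanted value."""
    return [d for d in documents if wanted in values_of(d, field)]


# Made-up enriched documents: metadata plus extracted entities.
documents = [
    {"title": "Article A", "author": "Author One", "year": 1862,
     "people": ["Person X", "Person Y"]},
    {"title": "Article B", "author": "Author Two", "year": 1862,
     "people": ["Person X"]},
    {"title": "Article C", "author": "Author One", "year": 1901,
     "people": []},
]

print(facet_counts(documents, "year"))
print(facet_counts(documents, "people"))

# Drill down: which authors wrote about "Person X"?
narrowed = filter_by(documents, "people", "Person X")
print(facet_counts(narrowed, "author"))
```

Each facet value with its count becomes a clickable filter in the search interface, and the same counts can feed a relationship visualisation of which entities appear together.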