For government agencies, open source intelligence (OSINT) fuels the never-ending charge to make informed decisions. The mainstay of OSINT for government agencies is the social and local news posts found on the web, with 90%-95% of these being non-English text and sources.
With so much data being created every day, how can agencies focused on national security, like the NSA, DHS, CIA and others, gain a higher level of confidence in identifying and acting on the posts? Combine Import.io’s market-leading web data extraction platform with Basis Technology’s Rosette text analytics platform to ensure a continuous flow of contextually accurate, disambiguated data about people, places, organizations and things into analytical and intelligence systems.
All that Open Data ― The Challenge
We’ve all heard of the 3 V’s of big data: volume, variety and velocity. Now layer on the characteristics that come with multiple languages, authors and sources: another layer of variety, ambiguity and ghosts.
- Variety: One thing can have many names. For example, Franklin D. Roosevelt, President Roosevelt, Frank Delano Roosevelt and FDR can all be the same person, not to mention the foreign spellings of those variations.
- Ambiguity: Many things can share similar names. You can find thousands of people named George Bush, and two of them are former US Presidents.
- Ghosts: People, organizations, and entities that exist in your data that haven’t been catalogued.
Adding these 3 elements to the mix makes it even more difficult to sift through the data and derive contextual meaning.
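To make the three elements concrete, here is a minimal sketch of how a knowledge base lookup might sort a mention into one of three outcomes. The `KnowledgeBase` class, the entity IDs and the exact-match alias index are all hypothetical and chosen for illustration; they are not how Rosette or any particular product works, and real systems use context rather than plain string lookup.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    # Hypothetical index: lowercased alias -> set of entity IDs using that alias
    alias_index: dict = field(default_factory=dict)

    def add(self, entity_id, aliases):
        """Register every known name (alias) for an entity."""
        for alias in aliases:
            self.alias_index.setdefault(alias.lower(), set()).add(entity_id)

    def resolve(self, mention):
        """Classify a mention as 'resolved', 'ambiguous' or 'ghost'."""
        candidates = sorted(self.alias_index.get(mention.lower(), set()))
        if not candidates:
            return ("ghost", candidates)      # not catalogued yet
        if len(candidates) == 1:
            return ("resolved", candidates)   # variety: many names, one entity
        return ("ambiguous", candidates)      # one name, many entities


kb = KnowledgeBase()
kb.add("person:fdr", ["Franklin D. Roosevelt", "President Roosevelt", "FDR"])
kb.add("person:george_h_w_bush", ["George Bush", "George H. W. Bush"])
kb.add("person:george_w_bush", ["George Bush", "George W. Bush"])

print(kb.resolve("FDR"))
print(kb.resolve("George Bush"))
print(kb.resolve("Some Unlisted Name"))
```

An exact-match index like this only detects ambiguity and ghosts. Deciding which George Bush a post refers to, or whether a ghost deserves a new entry, needs the surrounding context of the mention.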
Gain Contextual Meaning by Building a Better Knowledge Base
By combining a powerful web data extraction platform with a proven text analytics solution, you build yourself a better, more connected and intelligent knowledge base. You need to find the things you care about, understand the relationships those things have with other people, places and things and, ultimately, take action. Here’s how it would work:
- Get as Much as You Can but Be a Bit Selective – When designing an intelligence exploitation system, you need to consider monitoring and extracting content from the biggest source of open source information in the last 10 years: all the stuff being posted to news sites and social media. But, let’s be realistic, there is probably a set (though it may be a few hundred or a few thousand) of news and social media sites you care about most and want to target as a source of content. For this step you leverage a web data extraction platform like Import.io.
- Identify the Things You Care About – Whether the native news story or post is in Russian, Korean, Chinese, Arabic, Pashto, Urdu or some other language, harvesting the newly posted content can be done with ease. But multilingual unstructured text makes extracting actionable information and meaningful relationships more difficult. Identifying the entities within the content can be done by leveraging Basis Technology’s Rosette solution. Not only will Rosette identify the high priority items, it will also tag, index and resolve the content against an easy-to-augment knowledge base. Once items are identified, you layer on natural language processing (NLP) to identify the high priority items that should be sent for human translation.
- Integrate and Visualize in Your Existing Intelligence System – The ability to collect, identify and correlate data and expose it in your existing systems allows OSINT to become a first-class citizen within your analytical infrastructure.
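The flow of the steps above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the post records, the `watchlist`, and the `find_mentions` and `prioritize` functions are hypothetical stand-ins, not Import.io or Rosette APIs. The naive substring match stands in for real multilingual entity extraction, and the sample text is shown in English for readability.

```python
def find_mentions(text, watchlist):
    """Return the watchlist terms found in the text (naive substring match)."""
    lowered = text.lower()
    return [term for term in watchlist if term.lower() in lowered]


def prioritize(posts, watchlist, min_hits=1):
    """Keep posts with at least min_hits watchlist terms, most hits first."""
    flagged = []
    for post in posts:
        hits = find_mentions(post["text"], watchlist)
        if len(hits) >= min_hits:
            flagged.append({**post, "hits": hits})
    return sorted(flagged, key=lambda p: len(p["hits"]), reverse=True)


# Step 1 (stand-in): posts already harvested by a web data extraction tool
posts = [
    {"id": 1, "lang": "ru", "text": "Acme Shipping opened an office near Port Example."},
    {"id": 2, "lang": "ko", "text": "Local weather was mild today."},
    {"id": 3, "lang": "ar", "text": "Port Example closed for repairs."},
]

# Step 2 (stand-in): the things you care about
watchlist = ["Acme Shipping", "Port Example"]

# Step 3 (stand-in): the queue handed to analysts or human translators
translation_queue = prioritize(posts, watchlist)
for post in translation_queue:
    print(post["id"], post["lang"], post["hits"])
```

In practice, substring matching breaks down quickly on inflected languages and non-Latin scripts, which is why the identification step relies on a dedicated text analytics platform.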
Monitoring open source data is a mammoth, multifaceted task that is required for many missions – and the ability to find the things you care about (people, places and things) will provide you with a higher level of confidence in analytics and intelligence.