Web data extraction is a powerful tool for gathering information from around the web. It can help organizations gain competitive and market intelligence, keep abreast of changes to regulation and compliance terms, or simply stay up-to-date with developments in their industry. A prime example of this is the Connotate platform, which extracts data not just from easily accessible web pages, but also from the deep web. This level of data extraction provides access to a large repository of high-value content that’s typically hidden.
To better understand how this process works, it’s important to know some web data extraction basics. A good starting point is to look at the different layers of the internet ― the surface web, deep web and dark web.
The Three Layers: Surface Web, Deep Web and Dark Web
In terms of content accessibility, the web can be divided into three layers: the surface web, the deep web, and the dark web. The surface web is the top layer and includes content that is readily accessible (indexable) by search engines. The deep web is the layer below ― it's not indexed by search engines ― and accessing it often requires filling out a form to retrieve specific content, searching a database, or logging in to a specific set of content pages. At the bottom is the dark web, which is accessible only with special software and is often used for niche purposes, some of them criminal, that demand a high degree of encryption and anonymity.
When most people think of web content, they think of the content they find by searching or simply browsing via links. This includes news articles, blog posts, and general information found on websites. Google and Bing index this content and make it easy for people to find. The indexing process relies on the links on each page and the connections between them: broadly, the greater the number and quality of links pointing to a page, the higher that page tends to rank in search results. Google's original formulation of this idea was PageRank; the publicly visible PageRank score has since been retired, but links remain a core ranking signal. Links, therefore, are a major factor in what shows up in search results. But what about content that isn't linked at all? This is where the deep web comes in.
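To make the link-based ranking idea concrete, here is a minimal, simplified PageRank-style iteration in Python. The tiny link graph is invented purely for illustration; real search engines operate at enormous scale and combine links with many other signals.

```python
# Toy illustration of link-based ranking (simplified PageRank-style iteration).
# The link graph is made up for illustration only.
links = {
    "home": {"news", "blog"},
    "news": {"home"},
    "blog": {"home", "news"},
    "orphan": set(),  # nothing links to this page, and it links to nothing
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # simplified: dangling pages just leak their score
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank(links))  # "home", with the most inbound links, ends up ranked highest
```

The unlinked "orphan" page never receives more than the baseline score, which is the same reason a page with no inbound links is effectively invisible to link-driven search.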
Deep is Not Dark
First, let's look at what the deep web is not. In the last few years, the scare phrase “dark web” has popped up in news articles describing a wild west of the internet filled with unabated criminal activity. Many people continue to confuse the dark web with the deep web. While the two layers share some traits ― like being outside the reach of search engines ― there is an important distinction.
The dark web is only accessible through certain browsers like Tor, using protocols that provide anonymity. As might be expected under such conditions, the dark web frequently harbors criminal activity, although it can also be used to access encrypted sites outside the control of government censors. Except for law enforcement and the intelligence community, most organizations will have no interest in gathering data from this part of the web. Between the surface web and the dark web, though, is a valuable trove of content: the deep web.
The Deep Web: A Goldmine of Information
The deep web is not indexed by search engines, but it is reachable through normal channels and doesn't require special software. Getting to this content often requires one or more of the following (a rough sketch of automating this kind of access follows the list):
- login access to the site,
- filling out several form fields that identify the content you're seeking, or
- typing in information that allows you to search the contents of a database.
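As a rough illustration of what "getting past the form" looks like in practice, the sketch below uses Python's requests library to log in and then submit a search form. Every URL, field name, and credential here is a hypothetical placeholder rather than a real endpoint; an actual integration depends entirely on the target site's forms and its terms of use.

```python
# Minimal sketch of reaching form-gated ("deep web") content programmatically.
# URLs, field names, and credentials are hypothetical placeholders.
import requests

session = requests.Session()

# 1. Log in to the site (if it requires an account).
session.post(
    "https://records.example.gov/login",        # hypothetical login endpoint
    data={"username": "analyst", "password": "not-a-real-password"},
)

# 2. Submit the search form that fronts the database.
response = session.post(
    "https://records.example.gov/case-search",  # hypothetical search endpoint
    data={"case_number": "2023-CV-0142", "county": "Suffolk"},
)

# The returned HTML holds content no search engine ever indexed,
# because it only exists in response to this specific query.
print(response.status_code, len(response.text))
```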
Since search engines currently crawl the web by following links from page to page, they can't reach this content. Think, for example, about the information available in federal and state databases, such as court or property records ― you need to know something about the individual, the case, or the property to retrieve anything. Other examples include online forums and scanned documents. To get to these, you must go to the site and search its records, which sometimes requires credentials or a document number.
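The reason link-following crawlers miss this material is easy to see in a bare-bones crawler sketch: the only pages it can ever visit are those that some already-discovered page links to, so anything that exists only as the response to a form submission never enters the queue. This sketch assumes the third-party requests and beautifulsoup4 packages and skips robots.txt handling, politeness delays, and error handling.

```python
# Bare-bones breadth-first crawler: it can only reach pages that are linked
# from pages it has already seen. Form-gated content never enters the queue.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=20):
    seen, queue, fetched = {seed_url}, deque([seed_url]), []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        html = requests.get(url, timeout=10).text
        fetched.append(url)
        # Discover new pages only through <a href> links on pages already fetched.
        for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched  # only pages reachable by following links from the seed
```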
In addition to restricted content, some web pages are simply never linked from other sites, or carry directives (such as a noindex meta tag or robots.txt rules) that tell search engine crawlers not to index them. This frequently includes archived content and file formats that search engines can't index. While search engines may eventually reach some of this material, for now much of it is outside their grasp and, therefore, part of the deep web.
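For the "directives" case, the two standard mechanisms are the robots.txt exclusion protocol and the noindex robots meta tag. The sketch below checks both for a given URL; it uses Python's standard urllib.robotparser plus the third-party requests and beautifulsoup4 packages, and the user agent string is just an example.

```python
# Check the two standard opt-out mechanisms that keep a public page out of the index.
from urllib import robotparser
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def indexing_blocked(url, user_agent="ExampleBot"):
    # 1. Does robots.txt disallow crawling this URL?
    parts = urlparse(url)
    robots = robotparser.RobotFileParser(
        urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt")
    )
    robots.read()
    if not robots.can_fetch(user_agent, url):
        return True

    # 2. Does the page carry a <meta name="robots" content="noindex"> directive?
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    meta = soup.find("meta", attrs={"name": "robots"})
    return bool(meta and "noindex" in meta.get("content", "").lower())
```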
The deep web, while less accessible, is potentially more valuable for businesses and organizations: by most estimates it is hundreds of times larger than the surface web, and the content it serves up is often a more authoritative source of information. For organizations relying on technical or legal information, this can be particularly valuable.
Harvesting Data from Both the Surface Web and Deep Web
Many organizations want to aggregate data from a large number of different websites, focusing mainly on the easily accessible, searchable content on those pages. Although this surface web extraction covers the same terrain as search engines, it needs to be significantly more powerful and intelligent to be effective: it requires software that can precisely target content and monitor that content for changes. Import.io's Connotate is an industry leader in this kind of web content harvesting.
Import.io's Connotate monitors and extracts data from the public sites that make up the surface web. A large government agency may need news articles from around the world; the Connotate platform can quickly scan the surface web to extract this content and present it as easily digestible data. From monitoring job listings for a job board site to helping customer care specialists retrieve information faster, Import.io's Connotate has helped businesses succeed. It helped Thomson Reuters monitor more than 20,000 web data sources, keeping the news organization up-to-date on thousands of companies in 50 different countries, a project whose scale made an automated platform essential.
Other organizations are looking for content found in the deep web. Even when only a few targeted sites are involved, custom scripts and manual extraction aren't reliable ways to access and harvest this content. Import.io's Connotate can navigate the deep web effectively, using machine learning to fill in search boxes and forms, which dramatically reduces the cost of harvesting.
A Deep Web Example
Import.io's Connotate helped MassHousing, a public agency that provides financing for homeowners and developments in Massachusetts, switch from manual data collection to automated data extraction. Connotate harvested content from Department of Housing and Urban Development (HUD) deep web pages to find the latest compliance requirements, keeping MassHousing up-to-date on Section 8 compliance and making sure it didn't miss a single contract milestone. This was done by regularly monitoring HUD's website, guided by specific keywords to detect changes.
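For a sense of what keyword-guided change monitoring can look like under the hood, here is a minimal sketch (not Connotate's actual implementation): it fetches a page, keeps a fingerprint of just the passages containing watched keywords, and flags a change when the fingerprint differs between runs. The URL and keyword list are hypothetical placeholders.

```python
# Minimal sketch of keyword-guided change detection on a monitored page.
# WATCH_URL and KEYWORDS are hypothetical placeholders.
import hashlib

import requests
from bs4 import BeautifulSoup

WATCH_URL = "https://www.example.gov/section-8-notices"     # hypothetical page to watch
KEYWORDS = ("section 8", "compliance", "contract renewal")  # example watch terms

def keyword_fingerprint(url, keywords):
    # Extract visible text, keep only lines mentioning a watched keyword,
    # and hash them so changes can be detected without storing the whole page.
    text = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser").get_text("\n")
    relevant = [line.strip() for line in text.splitlines()
                if any(k in line.lower() for k in keywords)]
    return hashlib.sha256("\n".join(relevant).encode("utf-8")).hexdigest()

# Run this on a schedule (e.g. cron); a changed fingerprint means the monitored
# passages were added, removed, or edited since the last check.
print(keyword_fingerprint(WATCH_URL, KEYWORDS))
```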
The deep web is a treasure trove of information that’s otherwise hard to access efficiently in bulk. Import.io’s Connotate provides the platform that can reach it, harvest it, and deliver it in a usable format.
If your organization is looking for a massively scalable and automated approach that produces precise and intelligent results, Import.io’s Connotate platform can harvest both the surface web and deep web to provide you with the most relevant data for your business needs.