Research: Categorizing content on the dark web via a novel crawler

DeepDotWeb

27 Jan 2019


The dark web is similar to other parts of the deep web in that its content cannot be indexed by conventional search engines such as Google, Yahoo, and others. Even though special search engines now exist that can crawl the content of several darknets, they are still immature and cannot index all parts of the dark web. This represents a challenge not only to ordinary users, but also to internet security professionals and law enforcement agents concerned with collecting data from various darknets in order to monitor illegal activities taking place on this dark side of the World Wide Web.

Dognaedis is an internet security company that offers its customers various tools to protect themselves online. Nevertheless, the company recently identified a gap in the sources of information it monitors. Accordingly, the company published a paper that specifies and implements a novel dark web intelligence solution for one of Dognaedis's products, known as Portolan.

The goal of this solution is to crawl hidden services on the dark web and extract intelligence data in order to boost the company's information sources. Throughout this article, we will take a look at this new dark web crawler and the results obtained from testing it on the live network.
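
Although the paper does not detail the crawler's internals, the basic mechanics of reaching a hidden service are well established: requests are routed through a local Tor client's SOCKS proxy so that .onion addresses resolve inside the Tor network. Below is a minimal sketch in Python, assuming the requests library with SOCKS support; the proxy port and the onion address are illustrative placeholders, not taken from the paper:

    # Minimal sketch: fetching a page from a Tor hidden service through
    # the local Tor SOCKS proxy (default port 9050).
    # Requires: pip install requests[socks]
    import requests

    # socks5h (not socks5) makes hostname resolution happen inside Tor,
    # which is required for .onion addresses.
    TOR_PROXY = {
        "http": "socks5h://127.0.0.1:9050",
        "https": "socks5h://127.0.0.1:9050",
    }

    def fetch_hidden_service(onion_url: str, timeout: int = 60) -> str:
        """Fetch one page from a hidden service and return its HTML."""
        resp = requests.get(onion_url, proxies=TOR_PROXY, timeout=timeout)
        resp.raise_for_status()
        return resp.text

    # Hypothetical onion address, used here only for illustration.
    html = fetch_hidden_service("http://exampleonionaddress.onion/")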

Method of data collection:

The study collected data through the company's newly developed dark web crawler, which was built specifically to crawl hidden services on the Tor network. Results were assessed in two steps: first, an initial group of hidden services was manually sorted into categories and used to train a document classifier based on a Support Vector Machine (SVM), a statistical machine learning algorithm widely used for content classification; second, the trained classifier was used to automatically categorize the remaining dark web pages.
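
The paper does not state which SVM implementation was used, but the two-step workflow it describes maps naturally onto a standard text-classification pipeline. The following is a minimal sketch assuming Python and scikit-learn; the example texts and category labels are illustrative, not taken from the study:

    # Step 1: train an SVM document classifier on manually labeled pages.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    labeled_texts = [
        "bitcoin tumbler and anonymous mixing service",
        "buy lsd and mdma, worldwide stealth shipping",
    ]
    labels = ["Finance", "Drugs"]  # illustrative category names

    classifier = make_pipeline(
        TfidfVectorizer(stop_words="english"),  # text -> TF-IDF features
        LinearSVC(),                            # linear Support Vector Machine
    )
    classifier.fit(labeled_texts, labels)

    # Step 2: the trained model categorizes the remaining crawled pages.
    unlabeled_pages = ["coin mixing wallet with escrow"]
    print(classifier.predict(unlabeled_pages))  # e.g. ['Finance']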

In order to prevent the inclusion of content related to terrorist organizations and child pornography, only textual content was obtained automatically. Any other content was either immediately discarded or filtered out.
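
One plausible way to enforce such a text-only policy (the paper does not specify the mechanism) is to inspect the server's Content-Type header before downloading the response body, so that images and other binary media are never retrieved. A sketch, again assuming the requests library:

    import requests

    TEXTUAL_TYPES = ("text/html", "text/plain")

    def fetch_text_only(url, proxies, timeout=60):
        # stream=True defers the body download until the headers are checked
        resp = requests.get(url, proxies=proxies, timeout=timeout, stream=True)
        content_type = resp.headers.get("Content-Type", "")
        if not content_type.startswith(TEXTUAL_TYPES):
            resp.close()   # discard non-textual content without reading it
            return None
        return resp.text   # safe to read: textual content only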

Analysis was conducted in two steps, the first of which was a prerequisite for the second. The first step involved manual classification of hidden services: the collected data was randomly sampled, and a set of hidden services' web pages was assembled and used to train the automatic document classifier. This yielded 12 categories of content on the dark web, as shown in table (1). The second step involved assigning the rest of the hidden services to these 12 categories. Each hidden service's category was deduced via aggregate classification of all pages under a single URL (onion address), rather than by classifying only the hidden service's home page, as done in most previous studies.
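
The paper describes the per-service label as an aggregate over all of a service's pages without giving the exact aggregation rule; a simple majority vote over page-level predictions is one natural reading. A sketch in Python (the onion addresses and labels are made up for illustration):

    from collections import Counter

    def classify_service(page_labels):
        """Assign a hidden service the most common category among its pages."""
        return Counter(page_labels).most_common(1)[0][0]

    pages_by_service = {
        "abc123.onion": ["Drugs", "Drugs", "Finance"],
        "def456.onion": ["Forum"],
    }
    for onion, labels in pages_by_service.items():
        print(onion, "->", classify_service(labels))
    # abc123.onion -> Drugs
    # def456.onion -> Forum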

Table (1): Categories of hidden services

Results:

The categories of hidden services obtained via the crawler are shown in table (2). Analysis of the crawled data highlighted the high percentage of illicit content present on the dark web. The crawler's scans yielded a total of 5,205 live hidden services, of which 2,723 could be grouped under the aforementioned categories with acceptable levels of accuracy. The category "None" denotes an absence of content and was thus considered neither illicit nor licit, while the category "Unknown" refers to content that was accessible but whose exact nature the researchers could not determine, as it was either illegible or sparse. The results obtained via automatic classification were manually rechecked to guarantee their accuracy. Collectively, the crawler managed to access around 300,000 hidden services on the Tor network, which yielded a significant and diverse volume of data in the form of 205,000 pages from various hidden services.

Table (2): Classification of crawled hidden services.

Final thoughts:

The paper presents a novel dark web crawler that introduces a new classification of content on the Tor network. The results of the study indicate that the majority of hidden services on the Tor network are related to criminal activities such as illicit drug trading, illegal finance, and illegal pornography. One important finding was the confirmation of the near absence of content linked to Islamic extremism on the Tor network, as only a handful of sites could be linked to such content.

 



Source: TheOnionWeb
