Intelligent Web Crawler by Supervised Learning

Advancements in Forum Crawling with Supervised Learning

Authors

  • Deepak Ranoji Naik
  • Prof. Dr. Satish R. Todmal

Keywords:

Intelligent Web Crawler, Supervised Learning, web scale forum crawler, URL, forum threads, forum content, minimum overhead, regular expression patterns, page type classifiers, accuracy, precision, crawling time, Question Answer sites, blog sites, social media sites

Abstract

In this paper we present the Intelligent Web Crawler (IWC), a supervised, intelligent, web-scale forum crawler. The goal of IWC is to crawl relevant forum content from the web with minimum overhead. The information content that forum crawlers collect resides in URLs and forum threads. We reduce the web forum crawling problem to a URL type recognition problem and show how to learn accurate and effective regular expression patterns of constant navigation paths from automatically created training sets, which are built from the aggregated results of weak page type classifiers. Although forums have different layouts or styles and run on different forum software packages, they always have homogeneous constant navigation paths, connected by specific URL types, that direct users from entry pages to thread pages. Robust page type classifiers can be obtained from as few as five annotated forums and applied to a large set of unseen forums. To obtain an accurate specification, we apply the supervised machine learning process to a large set of forums. Compared with other forum crawlers, IWC gives the best performance; the results show that it performs better in terms of precision and crawling time. In future, we would like to extend this crawler to other sites such as Question Answer (Q&A) sites, blog sites, and other social media sites, to develop IWC into a better crawler.
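As an illustration of the URL type recognition idea described in the abstract, the following minimal Python sketch learns regular expression patterns from noisy page type labels. It is not the authors' implementation: the digit-run generalization, the majority vote, the `min_support` threshold, the function names, and the `forum.example` URLs are all assumptions chosen for the example.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def url_tail(url):
    """Return path plus query string, the part that usually encodes the URL type."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


def generalize(url):
    """Build a regex for a URL by replacing every digit run with \\d+ (assumed heuristic)."""
    pieces = re.split(r"(\d+)", url_tail(url))
    body = "".join(r"\d+" if p.isdigit() else re.escape(p) for p in pieces)
    return "^" + body + "$"


def learn_patterns(weak_labels, min_support=2):
    """Aggregate noisy (url, page_type) votes per pattern and keep the majority type.

    Patterns seen fewer than min_support times are discarded as unreliable.
    """
    votes = defaultdict(Counter)
    for url, page_type in weak_labels:
        votes[generalize(url)][page_type] += 1
    patterns = {}
    for pattern, counter in votes.items():
        if sum(counter.values()) >= min_support:
            patterns[pattern] = counter.most_common(1)[0][0]
    return patterns


def classify(url, patterns):
    """Return the learned page type for a URL, or None if no pattern matches."""
    tail = url_tail(url)
    for pattern, page_type in patterns.items():
        if re.match(pattern, tail):
            return page_type
    return None


# Hypothetical output of a weak page type classifier; one thread URL is mislabeled.
weak = [
    ("http://forum.example/viewforum.php?f=1", "index"),
    ("http://forum.example/viewforum.php?f=2", "index"),
    ("http://forum.example/viewtopic.php?t=10", "thread"),
    ("http://forum.example/viewtopic.php?t=11", "thread"),
    ("http://forum.example/viewtopic.php?t=12", "index"),
    ("http://forum.example/login.php", "other"),
]
patterns = learn_patterns(weak)
print(classify("http://forum.example/viewtopic.php?t=999", patterns))
```

In this sketch, aggregating votes per pattern lets the two correct thread labels outvote the single mislabeled one, which is the sense in which weak per-page predictions can still yield a reliable URL-level pattern. A crawler could then follow only URLs whose learned type is an index or thread page and skip the rest.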

Published

2018-06-02

How to Cite

[1]
“Intelligent Web Crawler by Supervised Learning: Advancements in Forum Crawling with Supervised Learning”, JASRAE, vol. 15, no. 4, pp. 99–109, Jun. 2018, Accessed: Nov. 08, 2024. [Online]. Available: https://ignited.in/index.php/jasrae/article/view/8182
