Web Scraping Based on Tag and Value Similarities | Original Article

Vijay R. Thombare*, Shailesh Patil, in Journal of Advances and Scholarly Researches in Allied Education | Multidisciplinary Academic Research


Web scraping is a technique for extracting large amounts of data from websites. The text data on a website is generally not available for direct download and therefore cannot readily be reused by other applications; it can only be accessed through a web browser via the site's HTML query interface. A web page also contains irrelevant data, such as advertisements, comments, GIFs and other links. We present a technique to automatically extract result records from the dynamically generated result page returned by a search engine. This paper presents an efficient extraction and alignment procedure called EXCTVS, which considers both tag and value similarity. It extracts data from query result pages by first identifying and then segmenting the Query Result Records (QRRs) based on tag and value similarities. Once extraction is complete, it aligns the segmented QRRs into a table, placing data values that belong to the same attribute in the same column. The paper also proposes a new method to handle the case where the QRRs are not contiguous in a web page, which may be due to the presence of auxiliary data such as comments, recommendations or promotions. The method takes the nested structure of web pages into account while processing the QRRs. The record alignment algorithm aligns the attributes in a record first pairwise and then holistically, by combining tag and value similarity information.
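The idea of combining tag and value similarity to place fields in columns can be illustrated with a small Python sketch. This is not the EXCTVS implementation, which the abstract does not specify. The function names (`shape`, `similarity`, `align`), the equal weighting of tag and value scores, the 0.6 threshold, and the use of the first record as a column template are all assumptions made for illustration. The sketch also uses a simple greedy alignment against that template in place of the pairwise-then-holistic procedure the paper describes, and it assumes the QRRs have already been segmented into lists of (tag path, value) pairs.

```python
from difflib import SequenceMatcher


def shape(value):
    """Collapse a value to a coarse pattern: digits -> '9', letters -> 'a'.

    Runs of the same class are squeezed, so "2013" becomes "9".
    (Hypothetical stand-in for a data-type similarity measure.)
    """
    out = []
    for ch in value:
        if ch.isdigit():
            code = "9"
        elif ch.isalpha():
            code = "a"
        else:
            code = ch
        if not out or out[-1] != code:
            out.append(code)
    return "".join(out)


def similarity(field_a, field_b, tag_weight=0.5):
    """Weighted mix of tag-path similarity and value-shape similarity."""
    (tag_a, val_a), (tag_b, val_b) = field_a, field_b
    tag_sim = SequenceMatcher(None, tag_a.split("/"), tag_b.split("/")).ratio()
    val_sim = SequenceMatcher(None, shape(val_a), shape(val_b)).ratio()
    return tag_weight * tag_sim + (1 - tag_weight) * val_sim


def align(records, threshold=0.6):
    """Greedily align each record's fields to column prototypes.

    Assumption: the first record seeds the columns; a field that matches
    no free column well enough opens a new column.
    """
    columns = list(records[0])
    table = []
    for record in records:
        row = [None] * len(columns)
        for field in record:
            # Score only columns this row has not filled yet.
            scores = [
                similarity(field, col) if row[i] is None else -1.0
                for i, col in enumerate(columns)
            ]
            best = max(range(len(columns)), key=scores.__getitem__)
            if scores[best] >= threshold:
                row[best] = field[1]
            else:
                columns.append(field)
                row.append(field[1])
        table.append(row)
    # Pad earlier rows so every row has one cell per column.
    width = len(columns)
    return [row + [None] * (width - len(row)) for row in table]


# Two made-up records; the second has no price field.
records = [
    [("div/a", "Learning Python"), ("div/span", "$39.99"), ("div/i", "2013")],
    [("div/a", "Web Data Mining"), ("div/i", "2011")],
]
table = align(records)
for row in table:
    print(row)
```

In this example the second record's year lands in the third column and its missing price is left as `None`, because the year's tag path and digit-only shape match the third column far better than the price column.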