Data Mining the Content of Food Articles Using Web Crawling

Exploring Association Rules in Web Crawled Food Article Data

by Ramil Gupta*,

- Published in Journal of Advances and Scholarly Researches in Allied Education, E-ISSN: 2230-7540

Volume 15, Issue No. 5, Jul 2018, Pages 236 - 239 (4)

Published by: Ignited Minds Journals


ABSTRACT

With the growth of the internet, searching the web has become an essential task. Web crawlers are used to retrieve web pages automatically: a crawler starts from a seed URL and visits the subsequent URLs to gather information. The processed information is stored in JSON documents. Association rule mining is then used to find relationships between the web pages. The frequent itemsets are found using the Apriori algorithm, and association rules are formed from these frequent itemsets. In this paper we propose a crawler that crawls a recipe site and then predicts association rules from the structured data in the JSON file.

KEYWORD

data mining, content, food articles, web crawling, internet, web crawler, seed URL, JSON documents, association rule mining, Apriori algorithm, recipe site

INTRODUCTION

With the aim of providing the most comprehensive information, structuring the data is an important concern, and data mining is an effective process for doing so (Estlick, et al., 2001). The data is scattered all across the web. In this paper, we consider structuring the data for various recipes. The data about recipes from one of the well-known recipe websites has been gathered with the help of crawling, and data mining has then been applied to the collected data to extract useful information. Association rules are used to find the interlinks between items (Ahmed, et al., 2006; Bonchi & Lucchese, 2006; Chi, et al., 2006). The frequent itemsets can be determined using the Apriori algorithm (Elmasri & Shamkant, 2009).

Overview of Association rule mining

Association rules are presented in the form A -> B, which implies that wherever A occurs, B also occurs. A is an item found in the data and B is an item found in combination with A. The goal is to extract the important correlations among the items in the database, namely those that satisfy a minimum support and a minimum confidence. Support is the probability that a particular itemset occurs in the set of transactions. Confidence measures how often the condition "if A occurs then B also occurs" is true. Association rule mining is accomplished with the help of certain algorithms, and one of the best known of these is the Apriori algorithm.
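To make the two measures concrete, the short Python sketch below computes support and confidence on a toy set of hypothetical ingredient transactions; the data is illustrative only and is not taken from the crawled site.

```python
# Toy "transactions": each set is the ingredient list of one hypothetical recipe.
transactions = [
    {"onion", "garlic", "tomato"},
    {"onion", "garlic"},
    {"onion", "rice"},
    {"garlic", "tomato"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A union B) / support(A), i.e. how often B appears when A does."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"onion", "garlic"}, transactions))      # 0.5  (2 of 4 transactions)
print(confidence({"onion"}, {"garlic"}, transactions))  # 0.666... (2 of the 3 onion transactions)
```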

Overview of Apriori Algorithm

One of the most popular algorithms for association rule mining is the Apriori algorithm. It uses a breadth-first, level-wise search in which the frequent k-itemsets are used to explore the (k+1)-itemsets, iterating in several passes over the database (Charanjeer, 2013). In the first pass it finds the large (frequent) single items, which are then used to discover larger itemsets in the subsequent passes. The functioning of the algorithm is based on a minimum support: only the itemsets at or above that minimum level of support are considered. A further constraint that can be added is a minimum confidence level.
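The level-wise behaviour described above can be summarised in a minimal Python sketch; this is a simplified illustration of the candidate-generation and pruning steps, not the exact implementation used in the paper.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori sketch. `transactions` is a list of sets of items."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Pass 1: frequent individual items.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {s for s in items if support(s) >= min_support}
    all_frequent = set(frequent)

    k = 2
    while frequent:
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune: a candidate survives only if every (k-1)-subset is frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

# Example run on toy data: returns all itemsets with support >= 0.5.
print(apriori([{"onion", "garlic"}, {"onion", "tomato"},
               {"onion", "garlic", "tomato"}], min_support=0.5))
```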

Overview of Web crawler

A web crawler is a program that visits web pages the way humans do, with the objective of validating, analyzing and visualizing them (Chakrabarti, et al., 1998; Bharat & Henzinger, 1998). There are two steps in the focused crawling process. The first is identification of the seed URL (i.e. the starting URL); this step is essential because without a starting URL the crawler cannot start. All the pages identified from the seed URL are retrieved, and these pages are then checked for the presence of further URLs. The second step is to choose a crawling technique: the URLs found in the pages reached from the seed URL are placed in a queue and processed according to their priority.
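The following is a minimal breadth-first crawl loop illustrating the seed URL and frontier queue described above; it assumes the `requests` and Beautiful Soup libraries and uses a simple FIFO queue rather than a priority-based one.

```python
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=50):
    """Breadth-first crawl starting from a seed URL (illustrative sketch)."""
    queue = deque([seed_url])   # frontier of URLs waiting to be processed
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        # Every link found on the page is appended to the frontier.
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))
    return visited
```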

Proposed Method

Our implementation works in the following steps: (1) choosing a seed URL; (2) fetching the list of URLs of the recipes present; (3) retrieving the desired information from the HTML documents and storing it in a JSON document; (4) applying Apriori to the JSON data. In our implementation we have chosen "sanjeevkapoor.com" as our seed URL. The pages of this URL are visited and a list of recipe URLs is retrieved with Python and stored in a text file. Various open source crawlers can be used for this activity; here we have used Scrapy [5]. The list obtained is then normalized to remove duplicate URLs: duplicates are dropped using the set function, the list is sorted, and the result is stored in a text file for easy retrieval.
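A sketch of the URL-collection step as a Scrapy spider is given below; the spider name, the assumption that recipe URLs contain "recipe" in their path, and the CSS selector are illustrative choices, not the exact spider used in the paper.

```python
import scrapy

class RecipeUrlSpider(scrapy.Spider):
    """Collects candidate recipe URLs starting from the seed site."""
    name = "recipe_urls"
    allowed_domains = ["sanjeevkapoor.com"]
    start_urls = ["https://www.sanjeevkapoor.com/"]

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            # Assumption for illustration: recipe pages have "recipe" in the URL.
            if "recipe" in url.lower():
                yield {"url": url}
            # Follow links within the allowed domain to keep crawling.
            yield scrapy.Request(url, callback=self.parse)
```

The collected URLs can then be deduplicated and sorted as described, for example with `sorted(set(urls))`, before being written to the text file.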

Fig 1: List of URLs retrieved

After the retrieval of the list of URLs, the crawler has been configured. The data now needs to be extracted from the different web pages and structured as required.

Fig 2: Retrieved data

When the pages are hit using Scrapy, the result is HTML tags. These tags may also contain irrelevant data, as shown in figure 2. To find the relevant information the page has to be navigated. Data cleaning has been done with the help of Beautiful Soup, which parses the HTML in Python and supports matching with regular expressions. After the relevant information is found, it is stored in a structured format, i.e. in a JSON document.
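The cleaning and structuring step can be sketched as below; the tag and class names used to locate the title and ingredient list, and the file names, are placeholders, since the site's actual markup is not reproduced in the paper.

```python
import json
from bs4 import BeautifulSoup

def html_to_record(html):
    """Extract a recipe title and ingredient list from one crawled page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1").get_text(strip=True) if soup.find("h1") else ""
    # Placeholder selector: the real ingredient markup may differ.
    ingredients = [li.get_text(strip=True)
                   for li in soup.select("ul.ingredients li")]
    return {"recipe": title, "ingredients": ingredients}

# Store the structured records as a JSON document (hypothetical file names).
records = [html_to_record(open("recipe.html", encoding="utf-8").read())]
with open("recipes.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)
```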

Fig 3: Processed JSON data


Fig 4: Graph depicting ingredients and the list of recipes of Chinese cuisine

From this processed JSON data we have gathered the results in the form of a graph, which depicts the list of ingredients used in a particular cuisine. We selected cuisine as the criterion for making the graphs; one example is depicted above in figure 4. On moving the mouse cursor over a particular recipe name, all the ingredients used in that recipe are highlighted. This makes it easy to identify the prerequisites for making a particular recipe.
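A static version of such a recipe-ingredient graph can be sketched with the networkx library, reading the processed JSON produced earlier; the interactive hover-highlighting shown in figure 5 would require an interactive plotting tool and is not reproduced here.

```python
import json
import networkx as nx
import matplotlib.pyplot as plt

# Load the processed records (hypothetical file name from the earlier sketch).
with open("recipes.json", encoding="utf-8") as f:
    records = json.load(f)

G = nx.Graph()
for rec in records:
    for ing in rec["ingredients"]:
        G.add_edge(rec["recipe"], ing)   # edge links a recipe to one of its ingredients

nx.draw(G, with_labels=True, node_size=300, font_size=8)
plt.show()
```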

Fig 5: Highlighted ingredients on selecting the recipe

The Apriori algorithm is now applied to this processed data. For finding associations between web pages on the basis of keywords, association rule mining is the best-suited technique (Agrawal & Srikant, 1994), and the Apriori algorithm has been chosen to handle the web pages (Estlick, et al., 2001). We set the minimum support to 20% and the minimum confidence to 0.80. If an item does not meet the support criterion, the Apriori algorithm does not consider it for further evaluation but rejects it. Support is the ratio of the frequency of an item to the total number of items. Confidence deals with conditional probability: it calculates the occurrence of B whenever A has occurred (i.e. it captures the if-else possibilities). By applying association rule mining with a minimum support of 20% and a minimum confidence of 80%, we found a total of 18 association rules, as shown in figure 6.
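The rule-mining step can be sketched with the mlxtend library (an assumption, since the paper does not name the Apriori implementation it used), applying the same thresholds of 20% support and 0.80 confidence to the processed JSON data.

```python
import json
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Load the ingredient lists from the processed JSON (hypothetical file name).
with open("recipes.json", encoding="utf-8") as f:
    transactions = [rec["ingredients"] for rec in json.load(f)]

# One-hot encode the ingredient lists into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets at 20% support, then rules at 0.80 confidence.
frequent = apriori(onehot, min_support=0.20, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.80)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```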

Fig 6: Association rules retrieved

CONCLUSION AND FUTURE WORK

In this paper we have implemented a crawler for a recipe website. It detects the entry URL first and then processes the pages to find the subsequent URLs, crawling them in sorted order. Then, using the Apriori algorithm, association rules have been formed. These association rules show how often two ingredients occur together, so we can easily see how two ingredients are linked. In future, the crawler could be extended to other sites such as blog sites, forum sites and social networking sites.

REFERENCES

[1] Ramez Elmasri, Shamkant B. Navathe (2009). "Fundamentals of Database Systems", Pearson, fifth edition, ISBN 978-81-317-1625-0.

[2] Charanjeer Kaur (2013). Association Rule Mining Using Apriori Algorithm: A Survey. International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 2, Issue 6, June 2013.

[3] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg (1998). "Automatic resource compilation by analyzing hyperlink structure and associated text", in Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia.

[4] K. Bharat and M. Henzinger (1998). "Improved algorithms for topic distillation in hyperlinked environments", in Proceedings of the 21st International ACM SIGIR Conference, 1998.

[5] Scrapy web crawling framework, http://scrapy.org/.

[6] Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference, pp. 487-499.

[7] M. Estlick, M. Leeser, J. Szymanski, and J. Theiler (2001). Algorithmic transformations in the implementation of K-means clustering on reconfigurable hardware. In Proceedings of the Ninth Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '01), 2001.

[8] C. Wolinski, M. Gokhale, and K. McCabe (2004). A reconfigurable computing fabric. In Proceedings of the Engineering of Reconfigurable Systems and Algorithms (ERSA '02), 2004.

[9] Q. Zhang, R. D. Chamberlain, R. Indeck, B. M. West, and J. White (2004). Massively parallel data mining using reconfigurable hardware: approximate string matching. In Proceedings of the 18th Annual IEEE International Parallel and Distributed Processing Symposium (IPDPS '04), 2004.

[10] Ahmed S., Coenen F., Leng P.H. (2006). Tree-based partitioning of data for association rule mining. Knowl. Inf. Syst. 10(3): pp. 315-331.

[11] Bonchi F., Lucchese C. (2006). On condensed representations of constrained frequent patterns. Knowl. Inf. Syst. 9(2): pp. 180-201.

[12] Chi Y., Wang H., Yu P.S., Muntz R.R. (2006). Catch the moment: maintaining closed frequent itemsets over a data stream sliding window. Knowl. Inf. Syst. 10(3): pp. 265-294.

[13] M. Yuvarani, N. Ch. S. N. Iyengar and A. Kannan (2006). "LSCrawler: A Framework for an Enhanced Focused Web Crawler Based on Link Semantics", in Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, 2006.

Corresponding Author

Ramil Gupta*

Baba Farid College of Engineering and Technology, Bathinda, India

E-Mail – ramilgupta.bfcet@gmail.com