A Review Paper on Data Mining Concepts and Its Techniques

Ankita .

A Review Paper on Data Mining Concepts and Its Techniques

Exploring the Concept and Techniques of Data Mining

by Ankita .*,

- Published in Journal of Advances and Scholarly Researches in Allied Education, E-ISSN: 2230-7540

Volume 14, Issue No. 2, Jan 2018, Pages 1369 - 1374 (6)

Published by: Ignited Minds Journals

ABSTRACT

This paper offers an overview to the basic concept of data mining. That provides an explanation how data mining is used to collect meaningful information and to establish significant relationships between variables contained in a large data set data warehouse. In the case study reported in this paper, a data mining approach is used to extract knowledge from a data set. Data mining is one of the most critical information exploration phases in the database cycle and is known to be a major sub-field of knowledge management. Data mining activity continues to grow in market and learning organizations over the coming decades. A review paper discusses the uses of data mining technologies that have been developed to help the information management method.

KEYWORD

data mining, concepts, techniques, information, relationships, variables, data set, data warehouse, knowledge, database cycle

INTRODUCTION

Data mining is the path toward discovering patterns in colossal data sets including methods at the intersection purpose of AI, bits of knowledge, and database systems.[1] Data mining is an interdisciplinary subfield of computer science and estimations with a general target to remove data (with canny methodologies) from a data set and change the data into a fathomable structure for extra usage. Data mining is the analysis adventure of the "knowledge discovery in databases" technique or KDD.[2] Aside from the unrefined analysis step, it also incorporates database and data the board points of view, data pre-dealing with, model and deriving contemplations, fascinating quality estimations, unpredictability contemplations, post-getting ready of discovered structures, visualization, and online updating.[3] The articulation "data mining" is a misnomer, considering the way that the goal is the extraction of patterns and knowledge from a great deal of data, not simply the extraction (mining) of data. It likewise is a buzzword and is much of the time applied to any type of enormous scale data or data handling (assortment, extraction, warehousing, analysis, and measurements) just as any utilization of computer decision support system, including man-made consciousness (e.g., AI) and business insight. The genuine data mining task is the self-loader or programmed analysis of huge amounts of data to separate beforehand obscure, fascinating patterns, for example, gatherings of data records (cluster analysis), strange records (anomaly detection), and conditions (affiliation rule mining, successive example mining). This for the most part incorporates using database systems, for instance, spatial records. These patterns would then have the option to be seen as a kind of layout of the data, and may be used in further analysis or, for example, in AI and judicious assessment. For example, the data mining step may perceive different social occasions in the data, which would then have the option to be used to obtain dynamically exact estimate results by a decision support system. Neither the data combination, data arranging, nor result understanding and uncovering is a bit of the data mining step, anyway have a spot with the general KDD process as additional advances. The qualification between data analysis and data mining is that data analysis is used to test models and speculations on the dataset, e.g., separating the reasonability of an advancing exertion, paying little personality to the proportion of data; strikingly, data mining uses statistical models and statistical models to uncover subtle or hidden patterns in a gigantic volume of data.[4]

KNOWLEDGE DISCOVERY DATABASES

The expression Knowledge Discovery in Databases, or KDD, to put it plainly, alludes to the expansive procedure of discovering knowledge in data and underscores the "significant level" utilization of explicit data mining techniques. Analysts are engaged with AI, design acknowledgment, databases, investigation, man-made brainpower, data creation for master systems and data visualization. The bringing together objective of the KDD technique is to get data from data as enormous databases. measurements and limits, utilizing the database alongside any pre-handling, sub-testing and change criteria of the database.

Figure 1: Step of KDD Process

A few people don't separate data mining from knowledge discovery, while others see data mining as a fundamental advance during the time spent knowledge discovery. Here is an outline of the means engaged with the data discovery process – Data Cleaning − In this stage, commotion and clashing data are disposed of. Data integration − Multiple data sources are integrated in this phase. Data Selection − In this step, data relevant to the analysis task are retrieved from the database. Data transformation–In this phase, the data is converted or compiled into forms appropriate for mining by carrying out description or aggregation operations. Data Mining − Intelligent approaches are used in this phase to collect data trends. Pattern Evaluation − Data patterns are analyzed in this phase.

Knowledge Presentation − Knowledge is expressed in this phase.

DATA MINING CONCEPTS

Data mining is a set of techniques for the effective and automatic detection of previously unknown, true, new, useful and understandable trends in large databases. The results must be actionable so that they can be used in the organization's decision-making process. Traditionally used by business intelligence companies and financial analysts, it is primarily used in analysis to extract information from large data sets provided by new experimental and analytical methods[5]. Data mining strategies can be grouped as follows: described. Another consideration may be whether a consumer has to be listed as a trustworthy client or defaulter in a credit card transaction report, despite the different demographic and prior payment characteristics. • Estimation- Like arrangement, the motivation behind an estimation model is to decide an incentive for an obscure yield trait. Notwithstanding, in contrast to order, the yield characteristic for an estimation issue are numeric as opposed to straight out. A model can be "Gauge the compensation of a person who possesses a games vehicle?" • Prediction- It is difficult to separate forecast from characterization or estimation. The main distinction is that as opposed to determining the present conduct, the prescient model predicts a future result. The yield trait can be absolute or numeric. A model can be "Anticipate one week from now's end cost for the Dow Jones Industrial Average". clarifies the development of a decision tree and its prescient applications. • Association rule mining - There, fascinating secret laws called association rules are mined in a massive transactional database. For example, the milk, butter->biscuit} rule provides information that, if milk and butter are bought together, biscuits are purchased in such a manner that they can be priced together to increase the total profits of each item.

• Clustering- Clustering is a special type of grouping in which the goal groups are not defined. Of eg, in the case of 100 consumers, they have to be categorized on the basis of certain similarity parameters and it is not preconceived the groups will eventually be divided into[6].

DATA MINING MODELS

There are many common models that can be used successfully in a number of data mining issues. There are few decision trees, neural networks, Naive Bayes classifiers, Lazy trainers, Vector help devices, and regression dependent classifiers. Depending on the type of use, the complexity of the data and the characteristics, the most suitable model can be used. There is still no straightforward answer to the question of which is the best data mining platform. One can only assume that we model is better than the other for a particular application.

• Decision trees

a tree-like structure where each internal node reflects a judgement of the importance of the attribute. -the branch represents the result of the decision and the tree leaves represent categories. The decision tree is a construction that is both analytical and concise. The decision tree shows the relationship in the training data.

• Neural networks

Neural networks deliver a mathematical model which aims to imitate the human brain. Knowledge is referred to as a complex system of intertwined receptors called neurons. Every hub has a weighted association with different hubs in neighboring layers. Singular hubs take the info got from associated hubs and utilize the loads together with a straightforward capacity to register yield esteems. Training in neural networks is developed by integrating weight changes, while a number of learning instances have been through the method more than once. When prepared, an obscure occasion going through the system is grouped by the qualities seen at the yield layer. studies existing work on neural system development, endeavoring to distinguish the significant issues included, bearings the work has taken and the present best in class. Normally, a neural system model is having a design as appeared in figure 2 in its essential structure. Neurons possibly fire when info is greater than some edge. It should, nonetheless, be noticed that terminating doesn't get greater as the boost expands, it is a win big or bust course of action [7].

Figure 2: A neural network configuration • Naive Bayes classifier

This classifier offers a straightforward yet amazing directed classification method. The model accept all info credits to be of equivalent significance and autonomous of each other. Naive Bayes classifier depends on the traditional Bayes hypothesis exhibited in 1763 which takes a shot at the likelihood hypothesis. In fundamental phrase, the naive classifier of Bayes agrees that the proximity (or non-compliance) of a particular class variable is incompatible with the proximity (or non-appearance) of some other item. Despite the fact that these presumptions are likely to be false, the Bayes Classifier nevertheless functions very well. Contingent to the accurate initiative of a produced in a regulated pick-up setting. In a variety of practical implementations, the parameter estimation for the Naive Bayes model utilizes the most severe likelihood method.

• Association rules

They are truly not unreasonably unique in relation to classification decides aside from that they can foresee any trait, not simply the class, and this gives them the opportunity to anticipate blends of qualities as well. Additionally, affiliation rules are not planned to be utilized all together, as classification rules seem to be. Distinctive affiliation decides express various regularities that underlie the dataset, and they by and large anticipate various things. Since such a large number of various affiliation rules can be gotten from even a modest dataset, intrigue is confined to those that apply to a sensibly huge number of cases and have a sensibly high precision on the examples to which they apply to. The inclusion of an affiliation rule is the quantity of occurrences for which it predicts accurately. This is regularly called its support. Its exactness is regularly called certainty. It is the quantity of cases that it predicts effectively, communicated as an extent of all occurrences to which it applies.

• Machine learning and statistics

Data mining can be considered as a juncture of measurements and AI. In truth, one ought not search for a separating line between AI and insights on the grounds that there is a continuum. Some get from the aptitudes instructed in standard measurements courses, and others are more firmly connected with the sort of AI that has emerged out of computer science. Verifiably, the different sides have had rather various customs. While insights is progressively worried about testing theories, AI is increasingly worried about detailing the procedure of speculation as a pursuit through potential theories. In any case, this is a gross distortion. Measurements is definitely more than speculation testing, and many AI methods don't include any looking through whatsoever. Previously, numerous comparable strategies were created in parallel in AI and insights. One is decision tree acceptance.

LITERATURE REVIEW

Nikita Jain et al (2013) Data mining alludes to separating or mining the knowledge from huge measure of data. The term data mining is properly named as 'Knowledge mining from data' or "Knowledge mining". In this paper, the idea of data mining was abridged and its hugeness towards its philosophies was represented. The data mining dependent on Neural Network and Genetic Algorithm is investigated in detail and the key survey of the zone of rule extraction from ANN and GA[8]. Suyog Dhokpande et al. (2013) The Data Warehousing supports business analysis and decision making by making an endeavor wide planned database of sketched out, recorded data. Data mining, the extraction of hidden perceptive data from immense databases, is an astonishing new development with remarkable potential to help associations with focusing on the most critical data in their data dispersion focuses. Data mining mechanical assemblies envision future patterns and works on, empowering associations to make proactive, knowledge-driven decisions. The computerized, expected examinations offered by data mining move past the assessments of past events gave by audit instruments typical of decision support systems. This paper depicts about the fundamental building of data warehousing, its item and technique of data warehousing. It in like manner shows different systems followed in data mining[9]. Vishal et al. (2014) Data mining is one of the most pertinent regions of research in computer applications among the different sorts of data mining. This paper is going to concentrate on web mining. This is the audit paper which shows profound and extraordinary investigation of different procedures accessible for web mining Tools and Techniques. Web mining - for example the utilization of data mining procedures to extricate knowledge from Web substance, structure, and use - is the assortment of advancements to satisfy this potential. Above meaning of web mining is investigated in this paper[10]. Hemlata Sahu et al. (2012) This paper gives a prologue to the fundamental idea of data mining. Which gives diagram of Data mining is utilized to remove important data and to create critical connections among factors put away in huge data set/data distribution center. For the situation study detailed in this paper, a data mining approach is applied to extricate knowledge from a data set. Data Mining, additionally famously known as Knowledge Discovery in Databases (KDD), alludes to the nontrivial extraction of verifiable, already obscure and possibly valuable data from data in databases. While data mining and knowledge discovery in databases (or KDD) are much of the time treated as equivalent words, data mining is quite of the knowledge discovery process. Data mining is the way toward finding possibly helpful, intriguing, and beforehand obscure patterns from an enormous assortment of data. Data mining is a multidisciplinary field, drawing work from territories including database innovation, AI, measurements, design acknowledgment, data recovery, neural networks, knowledge-based systems, computerized reasoning, elite registering, and data visualization. We present methods for the discovery of patterns hidden in huge data sets, concentrating on issues identifying with their achievability, helpfulness, review instruments run of the mill of decision support systems[11]. Aakanksha Bhatnagar etb al. (2012) Data mining is a procedure which finds valuable patterns from huge measure of data by transforming assortment of data into knowledge. The idea of data mining is focal point of fascination for the clients due to numerous elements as high accessibility of data which should be changed over from masses of data to valuable data. The rundown of sources that produce these data is perpetual. Organizations overall produce colossal arrangements of data regular that may incorporate stock, exchanges and a lot a greater amount of comparable sorts. So there comes the need of incredible and in particular programmed apparatuses for revealing important openings of composed data from huge measure of data. Considering any long range interpersonal communication site or an internet searcher, they get a great many inquiries consistently. Right off the bat, the Database Management Systems developed to deal with the questions of comparative sorts. At that point the methodology was adjusted to cutting edge Database the executives system, Data Warehousing and Data mining for advance data analysis and online databases. Data mining has colossally infiltrated in every single field of everyday life[12]. K.Murugan et al. (2013) Data mining utilizing incorporation of clustering and decision tree calculation has been proposed for foreseeing the financial exchange costs. This system includes concentrating stock value patterns in time by endeavoring to foresee future consequences of a period arrangement by just contemplating patterns in the time-arrangement of stock costs. The objective of this venture is to actualize data mining so as to foresee the Time-Series Stock costs by incorporating clustering and Decision Tree Algorithm. The stock costs are gathered into clusters to such an extent that the data are like each other inside a cluster. These clusters of data are then used to foresee the stock Prices utilizing decision tree[13]. Ranbir Gagat et al. (2016) Extraction the hidden prescient data from the enormous databases is known as data mining. Organizations or associations have had the option to center and recover the data from their data stockrooms according to the prerequisite. Data mining has been used effectively in the enormous number of organizations .The organizations that were included here from the start were primarily the data escalated ventures including the monetary administrations just as regular postal mail promoting. Data mining is the methodology which is applied to remove valuable data from the crude data. The strategy of clustering, the comparable and divergent kind of data are clustered together to break down complex data. The past occasions, different kinds of clustering have been

thickness based clustering and their procedures have been inspected and looked at regarding different parameters. Catchphrases: Hierarchical clustering, partitional clustering, Density-based clustering, Grid – based clustering[14]. Sachin Kumar et al. (2016) Data mining has been demonstrated as a solid procedure to examine street mishaps and give profitable outcomes. A large portion of the street mishap data analysis use data mining systems, concentrating on distinguishing factors that influence the seriousness of a mishap. Be that as it may, any harm coming about because of street mishaps is constantly unsuitable regarding wellbeing, property harm and other financial components. Once in a while, it is discovered that street mishap events are progressively visit at certain particular areas. The analysis of these areas can help in recognizing certain street mishap includes that make a street mishap to happen often in these areas. Affiliation rule mining is one of the famous data mining methods that distinguish the relationship in different traits of street mishap. In this paper, we initially applied k-implies calculation to amass the mishap areas into three classes, high-recurrence, moderate-recurrence and low-recurrence mishap areas. k-implies calculation takes mishap recurrence consider a parameter to cluster the areas. At that point we utilized affiliation rule mining to describe these areas. The standards uncovered various variables related with street mishaps at various areas with fluctuating mishap frequencies. The affiliation rules for high-recurrence mishap area unveiled that crossing points on thruways are progressively hazardous for each sort of mishaps. High-recurrence mishap areas for the most part included bike mishaps at bumpy districts. In moderate-recurrence mishap areas, settlements close to nearby streets and convergence on expressway streets are discovered risky for passerby hit mishaps. Low-recurrence mishap areas are dispersed all through the locale and the greater part of the mishaps at these areas were not basic. In spite of the fact that the data set was restricted to some chosen properties, our methodology extricated some valuable hidden data from the data which can be used to take some preventive endeavors in these areas[15].

CONCLUSION

Data mining is worried about extricating helpful guidelines or fascinating patterns from the mass measure of data gathered through different sources. There are numerous data mining systems which can be utilized to play out the activity proficiently. In this article represent the different data mining procedures prevalently utilized. Models like decision tree, neural system and Naive Bayes classifier are portrayed in detail. Significant research works completed utilizing these models are assessed in this section. Decision trees are easy to decipher and work quicker. Neural networks oversee missing qualities and all out qualities data ought to be typically disseminated. This examination gives the thought regarding different data mining procedures, various strategies, various procedures and a few issues identified with data mining.

REFERENCES

[1]. Ming-Syan Chen, Jiawei Han, Philip S Yu (1996). Data Mining: An Overview from a Database Perspective[J]. IEEE Transactions on Knowledge and Data Engineering, 8(6): pp. 866-883. [2]. R Agrawal, T, mielinski, A. Swami (1993). Database Mining: A Performance Perspective[J]• IEEE Transactions on Knowledge and Data Engineering, 12: pp. 914-925. [3]. Fayyad, Usama; Gregory Piatetsky-Shapiro, and Padhraic Smyth (1996). "From Data Mining to Knowledge Discovery in Databases". http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf Retrieved 2008-12-17. [4]. Y. Peng, G. Kou, Y. Shi, Z. Chen (2008). "A Descriptive Framework for the Field of Data Mining and Knowledge Discovery" International Journal of Information Technology and Decision Making, Volume 7, Issue 4 7: pp. 639–682. DOI:10.1142/S0219622008003204. [5]. Data mining:Ford, C.W.; Chia-Chu Chiang; Hao Wu; Chilka, R.R.; Talburt, J.R. (2005). Information Technology: Coding and Computing, 2005. ITCC 2005 InternationalConference Volume: Digital Object Identifier: 10.1109/ITCC.2005.270 Publication Year:, Page(s): pp. 122-127 Vol. 1 [6]. Han, J. & M. Kamber (2001). Data mining: concepts and techniques, San Francisco: Morgan Kaufman. [7]. ―Data mining tools‖, by Ralf Mikut, Markus Reischl (2011). Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. [8]. Nikita Jain, Vishal Srivastava (2013). ―DATA MINING TECHNIQUES: A SURVEY PAPER‖, IJRET: International Journal of Research in Engineering and Technology [9]. Suyog Dhokpande, Hitesh Raut (2013). ―Introduction to data warehousing and data mining‖, International Journal of Scientific & Engineering Research, Volume 4, Issue 12, December-2013 ISSN 2229-5518 [10]. Vishal (2014). ―DATA MINING TOOLS AND TECHNIQUES‖, International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 1707 ISSN 2229-5518 [11]. Hemlata Sahu, Shalini Shrma, Seema Gondhalakar (2012). ―A Brief Overview on Data Mining Survey‖, International Journal of Computer Technology and Electronics Engineering (IJCTEE) Volume 1, Issue 3 2012 [12]. Aakanksha Bhatnagar, Shweta P. Jadye, Madan Mohan Nagar (2012). ―Data Mining Techniques & Distinct Applications: A Literature Review‖, International Journal of Engineering Research & Technology (IJERT) Vol. 1 Issue 9, November- 2012 ISSN: 2278-0181 [13]. K. Murugan, Varalakshmi (2013). ―DATA MINING USING INTEGRATION OF CLUSTERING AND DECISION TREE‖, International Journal of Recent Advances in Engineering & Technology (IJRAET) ISSN (Online): 2347 - 2812, Volume-1, Issue -2. [14]. Ranbir Gagat (2016). ―Clustering Techniques of Data Mining- A Review‖, International Journal of Computer Science and Mobile Computing, Vol.8 Issue.7, July- 2016, pg. 152-160 [15]. Sachin Kumar, Durga Toshniwa (2016). ―A data mining approach to characterize road accident locations‖, springer J. Mod. Transport. 24(1): pp. 62–72 DOI 10.1007/s40534-016-0095-5

Corresponding Author Ankita*

# 677, Huda Sector - 01, Shahabad Markanda, Distt. Kurukshetra – 136135, Haryana-India