
International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 4, Issue 2, March-April (2013), pp. 535-542
IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI) www.jifactor.com

A COMPARATIVE STUDY ON DIFFERENT TYPES OF EFFECTIVE METHODS IN TEXT MINING: A SURVEY


#1 Inje Bhushan V. and #2 Prof. Mrs. Ujwala Patil

#1 Post Graduate Student, M.E. (Computer), Department of Computer Science and Engineering, R.C. Patel Institute of Technology, Shirpur, Dist. Dhule, Maharashtra, India.

#2 Associate Professor, Department of Computer Science and Engineering, R.C. Patel Institute of Technology, Shirpur, Dist. Dhule, Maharashtra, India.

ABSTRACT

Text mining is one of the most recent areas of research. Because databases store much of their information in text form, extracting information from that text is a challenging problem, and this problem is what motivates text mining. This survey paper tries to cover the text mining methods that address these challenges. We present an exhaustive survey of different pattern mining methods proposed in the literature. Pattern mining methods have been used to analyze this data and identify patterns. Text mining is the discovery by computer of new, previously unknown information by automatically extracting information from different written resources. In this survey paper we discuss the successful techniques that improve the effectiveness of information retrieval in text mining.

Keywords: Text mining, Information Retrieval, Sequential pattern model, Pattern taxonomy model.

1 INTRODUCTION

Nowadays most of the information in business, industry, government and other institutions is stored as text in databases. Such text databases contain semi-structured data, which is neither completely unstructured nor completely structured. For example,

a document may contain a few structured fields, such as title, names of authors, date of publication, category, and so on, but also some largely unstructured text components, such as the abstract and the detailed content. There has been a great deal of study on the modeling and implementation of semi-structured data in recent database research. Information retrieval techniques [18], such as text indexing methods, have been developed to handle unstructured documents. In traditional search, on the other hand, the user is typically looking for terms that are already known and that have been written by someone else. The problem is that the results include all the material that is not currently relevant to the user's needs, and the user must work through it to find the relevant information. The goal of text mining, by contrast, is to discover unknown information, something that no one yet knows and so could not have yet written down.

Text mining is a variation on a field called data mining [2] that tries to find interesting patterns from large databases. Text mining, also known as Intelligent Text Analysis, Text Data Mining or Knowledge-Discovery in Text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Figure 1 shows a generic process model for a text mining application [1]. Starting with a collection of documents, a text mining tool would retrieve a particular document and preprocess it by checking format and character sets. Then it would go through a text analysis phase, sometimes repeating techniques until information is extracted. Three text analysis techniques are shown in the example, but many other combinations of techniques could be used depending on the goals of the organization. The resulting information can be placed in a management information system, yielding an abundant amount of knowledge for the user of that system.

Information Extraction

A first step for computers in analyzing unstructured text is to use information extraction [2].
An information extraction technique identifies key phrases and relationships within text. It does this by looking for predefined sequences in text, a process called pattern matching. The technique infers the relationships between all the identified people, places, and time to provide the user with meaningful information. This technology can be very useful when dealing with large volumes of text. Traditional data mining techniques assume that the information to be mined is already in the form of a relational database. Unfortunately, for many applications, electronic information is only available in the form of free natural-language documents rather than structured databases.

Figure 1. Generic Process Model for a Text Mining Application

Since IE addresses the problem of transforming a corpus of textual documents into a more structured database, the database constructed by an IE module can be provided to the KDD module for further mining of knowledge as illustrated in Figure 2 [2].

Figure 2. Overview of Information Extraction Based Text Mining
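
The pattern matching step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the method of any cited system: the sentence template ("<person> joined <organization> in <year>"), the regular expression `PATTERN` and the function name `extract_records` are all assumptions made for this example. The sketch turns free text into structured records of the kind that a later KDD step could mine.

```python
import re

# Hypothetical predefined sequence: "<Person> joined <Organization> in <Year>".
# Names are assumed to be runs of capitalized words; real IE systems use far
# richer patterns and linguistic resources.
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) joined "
    r"(?P<organization>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*) in "
    r"(?P<year>\d{4})"
)


def extract_records(documents):
    """Return one dict per match, i.e. rows for a structured database."""
    records = []
    for doc_id, text in enumerate(documents):
        for match in PATTERN.finditer(text):
            record = match.groupdict()   # person, organization, year
            record["doc_id"] = doc_id    # remember which document it came from
            records.append(record)
    return records


if __name__ == "__main__":
    corpus = [
        "Alice Smith joined Acme Corp in 2009. Later, Bob Jones joined Globex in 2011.",
        "This document contains no matching facts.",
    ]
    for row in extract_records(corpus):
        print(row)
```

Each returned dictionary corresponds to one row of the structured database that the IE module hands to the mining module.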

Knowledge discovery

Knowledge discovery [19] and data mining have attracted a great deal of attention, driven by an imminent need to turn large amounts of data into useful information and knowledge. Many applications, such as market analysis and business management, can benefit from the information and knowledge extracted from a large amount of data. Knowledge discovery can be viewed as the process of nontrivial extraction of information from large databases, information that is implicitly present in the data, previously unknown and potentially useful for users. Data mining is therefore an essential step in the process of knowledge discovery in databases. In the past decade, a significant number of data mining techniques have been presented in order to perform different knowledge tasks. These techniques include association rule mining, frequent itemset mining, sequential pattern mining, maximum pattern mining and closed pattern mining. Most of them are proposed for the purpose of developing efficient mining algorithms to find particular patterns within a reasonable and acceptable time frame. With a large number of patterns generated by using data mining approaches, how to effectively use and update these patterns is still an open research issue. In this paper, we focus on the development of a knowledge discovery model to effectively use and update the discovered patterns and apply it to the field of text mining.

Text mining is the discovery of interesting knowledge in text documents. It is a challenging issue to find accurate knowledge (or features) in text documents to help users find what they want. In the beginning, Information Retrieval (IR) provided many term-based methods to solve this challenge, such as Rocchio and probabilistic models [4], rough set models [4], and Okapi BM25 and SVM [20] based filtering models. Term-based methods offer advantages in terms of performance for IR and machine learning. However, term-based methods suffer from the problems of polysemy and synonymy, where polysemy means a word has multiple meanings, and synonymy means multiple words have the same meaning. The semantic meaning of many discovered terms is uncertain for answering what users want.
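
To make the synonymy and polysemy problems concrete, here is a small sketch of term-based ranking with the Okapi BM25 weighting mentioned above. It is a simplified illustration under stated assumptions: tokenization is plain whitespace splitting, the parameters `k1=1.5` and `b=0.75` are commonly used defaults rather than values from the surveyed papers, the IDF uses the widespread "+1 inside the logarithm" variant, and the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # a term-based model sees no evidence at all here
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            length_norm = 1 - b + b * len(tokens) / avg_len
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        "car repair manual",
        "automobile repair manual",        # synonym of "car": scores zero
        "jaguar car dealership",
        "jaguar habitat in the rainforest",  # other sense of "jaguar"
    ]
    print(bm25_scores("car", corpus))
    print(bm25_scores("jaguar", corpus))
```

In this toy corpus the query "car" gives the "automobile" document a score of zero (synonymy), while the query "jaguar" gives a positive score to both the dealership and the rainforest documents (polysemy), because the model matches surface terms only.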


2. METHODS AND MODELS USED IN TEXT MINING

Traditionally, many techniques have been developed to solve the central problem in text mining, which is the retrieval of relevant information according to the user's requirements. Research in text mining can therefore be broadly divided by the approach taken to this problem. The data mining techniques often applied to it include association rule mining [8], sequential pattern mining [16], closed pattern mining [4], frequent itemset mining [16], maximum pattern mining [4] and minimum pattern mining [4].

From the information retrieval perspective, four basic methods are used:
1) Term Based Method (TBM).
2) Phrase Based Method (PBM).
3) Concept Based Method (CBM).
4) Pattern Taxonomy Method (PTM).

Some further models are used to evaluate and improve efficiency in text mining:
A. Sequential pattern mining (SPM).
B. Sequential closed pattern mining (SCPM).
C. Frequent itemset mining (NSPM).
D. Frequent closed itemset mining (NSCPM).

The algorithms from the data mining community inherited some characteristics from the association rule mining algorithms, and are best suited to work with many (from hundreds of thousands to millions) sequences of relatively small length (from 4 to 20). The first algorithms proposed for this task were AprioriAll [2] and GSP [16], from Agrawal and Srikant. Other algorithms such as FreeSpan [8], PrefixSpan [4], SPADE [19], CloSpan [18] and SPAM [3] were developed afterwards and successively improved the task of finding frequent sequence patterns. Algorithms with particular features, such as MEMISP [11], which is a memory indexing approach, or SPIRIT [5], which integrates constraints into the mining process through regular expressions, can also be found in the literature.

Term Based Method (TBM)

The advantage of TBM [3] is its efficient computational performance. Its limitation is that it suffers from the polysemy and synonymy problems: polysemy means a word has multiple meanings, and synonymy means multiple words have the same meaning. Some methods based on TBM are:
1. Rocchio and probabilistic models [4].
2. Rough set models [4].
3. BM25 and SVM based filtering models [4].

Phrase Based Method (PBM)

In PBM [4], phrases are less ambiguous and more discriminative than individual terms. Nevertheless, phrase-based methods have shown discouraging performance, and the likely reasons include:
1) Phrases have inferior statistical properties to terms,
2) They have low frequency of occurrence, and
3) There are large numbers of redundant and noisy phrases among them [4].
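
Reasons 1) and 2) above can be observed even on a tiny corpus. The following sketch, which treats a phrase simply as a pair of adjacent words (a bigram), compares how often terms and phrases repeat. The helper names `ngrams` and `frequency_profile` and the three toy documents are assumptions made for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """All runs of n adjacent tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_profile(docs, n):
    """Summarize how often n-grams repeat across a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(doc.lower().split(), n))
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "distinct": len(counts),                      # vocabulary size
        "max_count": max(counts.values()),            # most frequent n-gram
        "singleton_share": singletons / len(counts),  # fraction seen only once
    }


if __name__ == "__main__":
    corpus = [
        "text mining finds patterns in text",
        "pattern mining finds frequent patterns",
        "text retrieval ranks text documents",
    ]
    print("terms  :", frequency_profile(corpus, 1))
    print("phrases:", frequency_profile(corpus, 2))
```

On this corpus almost every two-word phrase occurs exactly once, whereas several single terms repeat, which is the low-frequency behaviour that makes phrase statistics harder to estimate.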


Concept Based Method

Text mining techniques are mostly based on statistical analysis of a word or phrase. Statistical analysis of term frequency captures the importance of a term within a document only. However, two terms can have the same frequency in the same document, while the meaning that one term contributes might be more appropriate than the meaning contributed by the other. Hence, the terms that capture the semantics of the text should be given more importance. To this end, a concept-based mining model has been introduced [6]. In "Concept-Based Information Retrieval Using Explicit Semantic Analysis" [5], the authors describe concept-based IR using Explicit Semantic Analysis (ESA), which makes use of concepts that encompass human world knowledge, encoded into resources such as Wikipedia (from which an ESA model is generated), and that allow intuitive reasoning and analysis. Feature selection is applied to the query concepts to optimize the representation and remove noise and ambiguity.

Pattern Taxonomy Method

Pattern mining has been extensively studied in data mining communities for many years. Many data mining techniques have been proposed in the last decade. These techniques include association rule mining, frequent itemset mining, sequential pattern mining, maximum pattern mining, and closed pattern mining. However, using this discovered knowledge (or these patterns) in the field of text mining is difficult and ineffective. The reason is that some useful long patterns with high specificity lack support (i.e., the low-frequency problem). Ning Zhong et al. [3] also argue that not all frequent short patterns are useful. Hence, misinterpretations of patterns derived from data mining techniques lead to ineffective performance. In their work, an effective pattern discovery technique [3] has been proposed to overcome the low-frequency and misinterpretation problems for text mining.
The proposed technique uses two processes, pattern deploying and pattern evolving, to refine the discovered patterns in text documents. The experimental results show that the proposed model outperforms not only other pure data mining-based methods and the concept-based model, but also term-based state-of-the-art models, such as BM25 and SVM-based models.

Sequential pattern mining (SPM)

Before elaborating on SPM, we first consider what sequence data is. Sequence data is omnipresent. Customer shopping sequences, medical treatment data, data related to natural disasters, science and engineering process data, stock and market data, telephone calling patterns, weblog click streams, program execution sequences, DNA sequences, and gene expression and structure data are some examples of sequence data.

A sequential pattern mining algorithm should:
A. Find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold,
B. Be highly efficient and scalable, involving only a small number of database scans, and
C. Be able to incorporate various kinds of user-specific constraints.

There are two major difficulties in sequential pattern mining: (1) Effectiveness: the mining may return a huge number of patterns, many of which could be uninteresting to users, and (2) Efficiency: it often takes substantial computational time and space to mine the complete set of sequential patterns in a large sequence database.
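
As a rough illustration of requirement A (finding every pattern that meets the minimum support threshold), the sketch below mines frequent sequential patterns by prefix projection. It is written in the spirit of pattern-growth algorithms such as PrefixSpan but is heavily simplified: every sequence element is assumed to be a single item rather than an itemset, no constraints or optimizations are included, and the function name `mine_sequential_patterns` and the toy click-stream data are illustrative assumptions.

```python
def mine_sequential_patterns(sequences, min_sup):
    """Return (pattern, support) pairs for all frequent subsequences.

    Support is the number of input sequences that contain the pattern
    as a (not necessarily contiguous) subsequence.
    """
    patterns = []

    def grow(prefix, projected):
        # `projected` holds (sequence, start) pairs: the suffix of each
        # sequence that follows the first occurrence of `prefix`.
        support = {}
        for seq, start in projected:
            for item in set(seq[start:]):  # count each sequence once per item
                support[item] = support.get(item, 0) + 1

        for item in sorted(support):
            if support[item] < min_sup:
                continue
            pattern = prefix + (item,)
            patterns.append((pattern, support[item]))
            # Project: keep only sequences containing `item`, and move the
            # start position just past its first occurrence.
            next_projected = [
                (seq, seq.index(item, start) + 1)
                for seq, start in projected
                if item in seq[start:]
            ]
            grow(pattern, next_projected)

    grow((), [(tuple(seq), 0) for seq in sequences])
    return patterns


if __name__ == "__main__":
    clicks = [
        ["home", "search", "product"],
        ["home", "product"],
        ["search", "product"],
        ["home", "search"],
    ]
    for pattern, sup in mine_sequential_patterns(clicks, min_sup=2):
        print(pattern, sup)
```

Even on four short sequences the output lists six patterns, which hints at the effectiveness problem noted above: the number of returned patterns grows quickly as the support threshold is lowered.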

Table 1. Comparative summary of text mining methods

| Model / Method | Approach / Algorithm | Author | Parameters | Inference | Inputs |
|---|---|---|---|---|---|
| Information Retrieval | Rocchio [24] | Thorsten Joachims | Models documents and queries as TF-IDF vectors | Learning is very fast in this method | Set of documents that are not relevant |
| Association Rule Mining [8] | Apriori [8] | Agrawal and Srikant, 1994 | Association rules with high support and confidence | The proposed algorithms always outperform AIS and SETM | Items in a large database of transactions |
| Sequential Pattern Mining [16] | SPADE [14] | Mohammed J. Zaki et al. | Database D for sequence mining consists of a collection of Sid, Eid | SPADE outperforms by a factor of two, and by an order of magnitude with some pre-processed data | Text document |
| Sequential Pattern Mining [16] | SPAM [] | Jay Ayres et al., 2002 | Sequence of document mining consists of a collection | SPAM outperforms previous works by up to an order of magnitude | Vertical bitmap representation of the database with efficient support counting |
| Close Pattern Mining [7] | CHARM [23] | Mohammed J. Zaki et al. | Set of all frequent closed itemsets | CHARM performed to discover the longest pattern | IBM Almaden; pumsb and pumsb contain census data |
| Close Pattern Mining [7] | CloSpan [7] | X. Yan, J. Han, and R. Afshar | Frequent patterns in the dataset | It mines long sequences for KDD and produces a significantly smaller number of discovered sequences | Sequence of s, Projected DB D8 and min_sup |
| Frequent Itemset Mining [13][27] | FP-growth [27] | C. Borgelt, 2005 | Prefix tree representation of the given database of transactions | This algorithm can save considerable amounts of memory for storing the transactions | Set of transactions |
| Maximal Pattern Mining [4][28] | MaxMiner [28] | Mohammed J. Zaki | Real and synthetic datasets | MaxMiner shows good performance on some datasets, which were not used in previous studies | Frequent itemset |
| Maximal Pattern Mining [4][28] | GenMax [22] | Karam Gouda, Mohammed J. Zaki, 2003 | Frequent items and the frequent 2-itemsets | This algorithm works 2 times faster than others like Mafia PP using the datasets Chess and pumsb | Dataset is in the vertical tidset format |
| Pattern Taxonomy [3] | D-Pattern Mining [3] | Ning Zhong et al., 2012 | Deploying process, which consists of the d-pattern discovery and term support evaluation | The proposed technique uses two processes, pattern deploying and pattern evolving, to refine the discovered patterns in text documents | Positive documents D+, minimum support min_sup |
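
Table 1 distinguishes frequent, closed and maximal pattern mining. The relationship between these three families can be sketched with a brute-force enumeration over a toy transaction database. This is only an illustration under stated assumptions: it enumerates candidates exhaustively rather than using any of the algorithms in the table (Apriori, CHARM, MaxMiner, GenMax and so on), and the function names and data are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_sup):
    """Map each frequent itemset (a frozenset) to its support count."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_sup:
                frequent[frozenset(candidate)] = support
                found = True
        if not found:
            break  # no frequent set of this size, so none of a larger size
    return frequent


def closed_itemsets(frequent):
    """Frequent itemsets with no proper superset of equal support."""
    return {
        s: sup for s, sup in frequent.items()
        if not any(s < t and sup == t_sup for t, t_sup in frequent.items())
    }


def maximal_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset at all."""
    return {
        s: sup for s, sup in frequent.items()
        if not any(s < t for t in frequent)
    }


if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
    freq = frequent_itemsets(db, min_sup=2)
    print("frequent:", {tuple(sorted(s)): n for s, n in freq.items()})
    print("closed  :", {tuple(sorted(s)): n for s, n in closed_itemsets(freq).items()})
    print("maximal :", {tuple(sorted(s)): n for s, n in maximal_itemsets(freq).items()})
```

Every maximal itemset is closed and every closed itemset is frequent, so the three result sets shrink in that order; closed itemsets still preserve all support counts, while maximal itemsets preserve only which sets are frequent.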


CONCLUSIONS

We discussed the basics of text mining methods. We presented an exhaustive survey of different pattern mining methods proposed in the literature. Pattern mining methods have been used to analyze this data and identify patterns. Such patterns have been used to implement efficient systems that can make recommendations based on previously observed patterns, help in making predictions, improve the usability of systems, detect events and, in general, help in making strategic product decisions. We envision that the power of text mining methods such as sequential pattern mining and the pattern taxonomy model has not yet been fully exploited. We hope to see many more strong applications of these methods in a variety of domains in the years to come.

REFERENCES
1. Weiguo Fan, Linda Wallace, Stephanie Rich, and Zhongju Zhang (2005), Tapping into the Power of Text Mining, Journal of ACM, Blacksburg.
2. N. Kanya and S. Geetha (2007), Information Extraction: A Text Mining Approach, IET-UK International Conference on Information and Communication Technology in Electrical Sciences, IEEE, Dr. M.G.R. University, Chennai, Tamil Nadu, India, 1111-1118.
3. Ning Zhong, Yuefeng Li, Effective Pattern Discovery in Text Mining, IEEE Transactions on Knowledge and Data Engineering, VOL. , NO.
4. Y. Li and N. Zhong, Interpretations of Association Rules by Granular Computing, Proc. IEEE Third Int'l Conf. Data Mining (ICDM '03), pp. 593-596, 2003.
5. Ofer Egozi, Shaul Markovitch, and Evgeniy Gabrilovich, Concept-Based Information Retrieval Using Explicit Semantic Analysis, ACM Transactions on Information Systems, Vol. 29, No. 2, Article 8, April 2011.
6. Shady Shehata, Fakhri Karray, and Mohamed Kamel, Enhancing Text Clustering Using Concept-Based Mining Model, Proc. 6th IEEE International Conference on Data Mining (ICDM 2006), pp. 1043-1048, Hong Kong, 2006.
7. X. Yan, J. Han, and R. Afshar, CloSpan: Mining Closed Sequential Patterns in Large Datasets, Proc. SIAM Int'l Conf. Data Mining (SDM '03), pp. 166-177, 2003.
8. R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules in Large Databases, Proc. 20th Int'l Conf. Very Large Data Bases (VLDB '94), pp. 478-499, 1994.
9. J.S. Park, M.S. Chen, and P.S. Yu, An Effective Hash-Based Algorithm for Mining Association Rules, Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '95), pp. 175-186, 1995.
10. R. Srikant and R. Agrawal, Mining Generalized Association Rules, Proc. 21st Int'l Conf. Very Large Data Bases (VLDB '95), pp. 407-419, 1995.
11. J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu, PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth, Proc. 17th Int'l Conf. Data Eng. (ICDE '01), pp. 215-224, 2001.
12. J. Han and K.C.-C. Chang, Data Mining for Web Intelligence, Computer, vol. 35, no. 11, pp. 64-70, Nov. 2002.
13. J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns without Candidate Generation, Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '00), pp. 1-12, 2000.

14. M. Zaki, SPADE: An Efficient Algorithm for Mining Frequent Sequences, Machine Learning, vol. 42, pp. 31-60, 2001.
15. M. Seno and G. Karypis, SLPMiner: An Algorithm for Finding Frequent Sequential Patterns Using Length-Decreasing Support Constraint, Proc. IEEE Second Int'l Conf. Data Mining (ICDM '02), pp. 418-425, 2002.
16. Y. Huang and S. Lin, Mining Sequential Patterns Using Graph Search Techniques, Proc. 27th Ann. Int'l Computer Software and Applications Conf., pp. 4-9, 2003.
17. M. Gupta and J. Han, Approaches for Pattern Discovery Using Sequential Data Mining, Information Science Reference, 2011.
18. S.T. Dumais, Improving the Retrieval of Information from External Sources, Behavior Research Methods, Instruments, and Computers, vol. 23, no. 2, pp. 229-236, 1991.
19. Fatudimu I.T., Musa A.G. and Ayo C.K., Knowledge Discovery in Online Repositories: A Text Mining Approach, European Journal of Scientific Research, ISSN 1450-216X, Vol. 22, No. 2 (2008), pp. 241-250, EuroJournals Publishing, Inc., 2008.
20. S. Robertson and I. Soboroff, The TREC 2002 Filtering Track Report, TREC, 2002, trec.nist.gov/pubs/trec11/papers/OVER.FILTERING.ps.gz.
21. R. Srikant and R. Agrawal, Mining Sequential Patterns: Generalizations and Performance Improvements, 5th Int'l Conf. on Extending Database Technology (EDBT), Avignon, France, March 1996.
22. Karam Gouda and Mohammed J. Zaki, GenMax: An Efficient Algorithm for Mining Maximal Frequent Itemsets, Data Mining and Knowledge Discovery 11, 1-20, 2005, Springer Science + Business Media, Inc.
23. Mohammed J. Zaki and Ching-Jui Hsiao, CHARM: An Efficient Algorithm for Closed Itemset Mining.
24. T. Joachims, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Proc. 14th Int'l Conf. Machine Learning (ICML '97), pp. 143-151, 1997.
25. Bart Goethals, Frequent Set Mining, Data Mining and Knowledge Discovery Handbook, chapter no. 17.
26. Gosta Grahne and Jianfei Zhu, Efficiently Using Prefix-trees in Mining Frequent Itemsets.
27. C. Borgelt, An Implementation of the FP-growth Algorithm, Workshop Open Source Data Mining Software (OSDM '05), Chicago, IL, 1-5, ACM Press, USA, 2005.
28. Mohammed J. Zaki, Mining Closed & Maximal Frequent Itemsets, NSF CAREER Award IIS-0092978, DOE Early Career Award DE-FG02-02ER25538, NSF grant EIA-0103708.
29. Prakasha S, Shashidhar Hr and Dr. G T Raju, A Survey on Various Architectures, Models and Methodologies for Information Retrieval, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 182-194, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
30. M. Karthikeyan, M. Suriya Kumar and Dr. S. Karthikeyan, A Literature Review on the Data Mining and Information Security, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 141-146, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
