A GENETIC PROGRAMMING APPROACH TO RECORD DEDUPLICATION

ABSTRACT: In most systems, problems such as low response time, availability, security, and quality assurance become more difficult to handle as the amount of data grows, which degrades high-quality services such as digital libraries and e-commerce brokers. A major cause is the presence of duplicates, quasi-replicas, or near-duplicates in these repositories, mainly those constructed by the aggregation or integration of distinct data sources. The problem of detecting and removing duplicate entries in a repository is generally known as record deduplication. Record deduplication is the task of identifying, in a data repository, records that refer to the same real-world entity or object in spite of misspelled words, typos, different writing styles, or even different schema representations or data types. Private and government organizations have therefore made large investments in methods for removing replicas from data repositories. In our project, we present a genetic programming (GP) approach to record deduplication. Our approach combines several different pieces of evidence extracted from the data content to produce a deduplication function that is able to identify whether two or more entries in a repository are replicas or not. Since record deduplication is a time-consuming task even for small repositories, our aim is to develop a method that finds a proper combination of the best pieces of evidence, yielding a deduplication function that maximizes performance while using only a small, representative portion of the corresponding data for training. Our approach outperforms an existing state-of-the-art method found in the literature. Moreover, the resulting functions are computationally less demanding since they use less evidence. In addition, we show that our approach is able to adapt the suggested deduplication function to changes in the replica identification boundary used to classify a pair of records as a match or not, releasing the user from the burden of choosing and tuning this parameter.
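To make the idea concrete, the following is a minimal C# (.NET) sketch of what one such deduplication function could look like: it combines two pieces of similarity evidence and compares the result with a replica identification boundary. The Record type, the attribute names, the weights, and the token-based similarity measure are illustrative assumptions for this sketch, not the functions actually evolved by the system.

using System.Collections.Generic;
using System.Linq;

// Hypothetical record shape used only for this sketch.
class Record
{
    public string Name;
    public string Address;
}

static class DedupSketch
{
    // One possible deduplication function: it combines similarity evidence
    // from two attributes and compares the score with the boundary.
    public static bool IsReplica(Record a, Record b, double boundary)
    {
        double nameSim    = TokenSimilarity(a.Name, b.Name);
        double addressSim = TokenSimilarity(a.Address, b.Address);

        // Example hand-written combination; a GP run would evolve this part.
        double score = 0.6 * nameSim + 0.4 * addressSim;

        return score >= boundary;
    }

    // Simple Jaccard similarity over word tokens, for illustration only.
    static double TokenSimilarity(string x, string y)
    {
        var xs = new HashSet<string>(x.ToLowerInvariant().Split(' '));
        var ys = new HashSet<string>(y.ToLowerInvariant().Split(' '));
        int unionCount = xs.Union(ys).Count();
        return unionCount == 0 ? 0.0 : (double)xs.Intersect(ys).Count() / unionCount;
    }
}

In the actual GP approach, the combination (the weights and the structure of the expression) is not fixed by hand as above but is found automatically from a small training sample.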

EXISTING SYSTEM: In the existing system, problems regarding low response time, availability, security, and quality assurance become more difficult to handle as the amount of data gets larger, which affects overall speed and performance. The solutions available for addressing this problem require more than technical effort; they need management and cultural changes as well. Removing repeated data also requires large investments. DRAWBACKS:
1) PERFORMANCE DEGRADATION: as additional useless data demand more processing, more time is required to answer simple user queries;

2) QUALITY LOSS: the presence of replicas and other inconsistencies leads to distortions in reports and misleading conclusions based on the existing data;

3) INCREASING OPERATIONAL COSTS: because of the additional volume of useless data, investments are required in more storage media and extra computational processing power to keep response times acceptable.

PROPOSED SYSTEM: A major cause of these problems is the presence of duplicates, quasi-replicas, or near-duplicates in the repositories, mainly those constructed by the aggregation or integration of distinct data sources. In our proposed system, we address the problem of detecting and removing duplicate entries in a repository, generally known as record deduplication. Clean, replica-free repositories not only allow the retrieval of higher-quality information but also lead to a more concise data representation and to potential savings in computational time and resources when processing the data. This record deduplication process is carried out by a Genetic Programming approach, which:

1) outperforms an existing state-of-the-art machine-learning-based method found in the literature;

2) provides less computationally intensive solutions, since it suggests deduplication functions that use the available evidence more efficiently;

3) frees the user from the burden of choosing how to combine similarity functions and repository attributes, which distinguishes our approach from existing methods that require user-provided settings;

4) frees the user from the burden of choosing the replica identification boundary value, since it automatically selects the deduplication functions that better fit this deduplication parameter.
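As a rough illustration of the Genetic Programming side, the sketch below shows one way a GP individual could represent a deduplication function as an expression tree over the available similarity evidence. The node types and evaluation scheme here are assumptions made for this example, not the project's actual implementation.

// Each GP individual is a small expression tree; its leaves read pieces of
// similarity evidence for a record pair and its internal nodes combine them.
abstract class GpNode
{
    // evidence[i] holds the i-th similarity score computed for a record pair.
    public abstract double Evaluate(double[] evidence);
}

class EvidenceNode : GpNode   // leaf: one piece of evidence
{
    public int Index;
    public override double Evaluate(double[] evidence) => evidence[Index];
}

class ConstNode : GpNode      // leaf: a numeric constant
{
    public double Value;
    public override double Evaluate(double[] evidence) => Value;
}

class AddNode : GpNode        // internal node: sum of two subtrees
{
    public GpNode Left, Right;
    public override double Evaluate(double[] evidence) =>
        Left.Evaluate(evidence) + Right.Evaluate(evidence);
}

class MulNode : GpNode        // internal node: product of two subtrees
{
    public GpNode Left, Right;
    public override double Evaluate(double[] evidence) =>
        Left.Evaluate(evidence) * Right.Evaluate(evidence);
}

A pair of records is classified as a replica when the tree's output reaches the replica identification boundary; GP evolves the tree structure through selection, crossover, and mutation so that the best-performing trees on the training sample survive.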

SYSTEM REQUIREMENTS

HARDWARE REQUIREMENT:

PROCESSOR : Pentium III & above
RAM       : 256MB
HARD DISK : 40GB

SOFTWARE REQUIREMENT:

OPERATING SYSTEM : Windows XP
FRONT END        : .NET
BACK END         : SQL Server

MODULES

LIST OF MODULES:

UPLOAD DATASET
AUTOMATIC DETECT
  o NORMAL TREE
  o MODIFIED TREE
  o NORMAL DATA
  o MODIFIED DATA
MANUAL DETECT
GRAPH
  o AUTO DETECT
  o MANUAL DETECT

MODULES DESCRIPTION

UPLOAD DATASET: The upload dataset process is used to upload the datasets into the database. It consists of two steps: Browse and Upload.
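As an illustration of the Upload step, the following is a minimal C#/ADO.NET sketch of how an uploaded record might be written to the SQL Server back end. The connection string, table name, and column names are placeholders, not the project's actual schema.

using System.Data.SqlClient;

class DatasetUploader
{
    private readonly string connectionString;

    public DatasetUploader(string connectionString)
    {
        this.connectionString = connectionString;
    }

    // Inserts one uploaded record into a placeholder Dataset table.
    public void InsertRecord(string name, string address)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "INSERT INTO Dataset (Name, Address) VALUES (@name, @address)",
            connection))
        {
            command.Parameters.AddWithValue("@name", name);
            command.Parameters.AddWithValue("@address", address);
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}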

AUTOMATIC DETECT: Automatic detect is the process of detecting repeated data in both a tree format and a data format. The tree format shows only the heading element; the data appears after the heading is pressed. The data format shows all the details under a particular number or name. The automatic detect process uses the following structures: Normal Tree, Modified Tree, Normal Data, and Modified Data. Normal Tree: The normal tree process fetches all the data in the database and shows it to the user. Modified Tree: It fetches all the data from the database, removes the repeated data, and shows the repeated data as error data, so the user can easily access the data.

Normal Data: Normal data is the process that produces the full data in table format, so all data are covered in that process. Modified Data: Modified data is the process that deletes the repeated data and produces a clean result (a sketch of this repeated-data split appears after the module descriptions). MANUAL DETECT: Manual detect is the process used to detect a particular name or number as needed. Manual detect also provides the functions of NORMAL TREE, MODIFIED TREE, NORMAL DATA, and MODIFIED DATA. GRAPH: The graph function produces graph elements for the database. It is carried out in two steps: Auto Detect and Manual Detect.

Auto Detect: The auto detect graph provides a graph for all elements in the data, separating the database records from the repeated data. Manual Detect: The manual detect graph provides a graph only for the particular elements we need; it also separates the database records from the repeated data.
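The sketch below (referred to in the Modified Data description above) shows one simple way the repeated data could be separated from the unique rows before building the modified views and the graphs. The Record shape (the same hypothetical type as in the earlier sketch) and the choice of key fields are illustrative assumptions.

using System.Collections.Generic;
using System.Linq;

class Record { public string Name; public string Address; }

static class RepeatDetector
{
    // Splits the rows into unique rows and repeated (error) rows.
    public static void Split(IEnumerable<Record> rows,
                             out List<Record> unique,
                             out List<Record> repeated)
    {
        unique = new List<Record>();
        repeated = new List<Record>();

        // Rows sharing the same key fields are grouped together; the first
        // row of each group is kept, the rest are reported as repeated data.
        foreach (var group in rows.GroupBy(r => new { r.Name, r.Address }))
        {
            unique.Add(group.First());
            repeated.AddRange(group.Skip(1));
        }
    }
}

For the manual detect, the same split would be applied after first filtering the rows by the particular name or number the user entered.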
