When you develop a new system using object-oriented technologies, you are sometimes in a position to design your data schema from scratch. If so, consider yourself among the lucky few: the vast majority of developers are forced to work with one or more existing legacy data designs.
We applied our methods to three real-world datasets. On one dataset with 31,023 clusters and 80,451 duplicate records, our approach generated 72,239 matching rules.
For example, the C# application and its XML database that you deployed last week are now considered to be legacy assets, even though they are built from the most modern technologies within your organization.
A legacy data source is any file, database, or software asset (such as a web service or business application) that supplies or produces data and that has already been deployed.
Four key processes in data integration are: data preparation (extracting, transforming, and cleaning data); schema integration (identifying attribute correspondences); entity resolution (finding clusters of records that represent the same entity); and entity consolidation (merging each cluster into a 'golden record' that contains the canonical value for each attribute).
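To make the last step concrete, here is a minimal sketch of entity consolidation, assuming records are plain dictionaries and using a simple majority vote per attribute as the merge policy. The record layout and the voting policy are illustrative assumptions, not the method described in this text; real systems also weigh source reliability and apply learned transformation rules.

```python
from collections import Counter

def consolidate(cluster):
    """Merge one cluster of duplicate records into a 'golden record'.

    Each record is a dict mapping attribute -> value; for every attribute,
    the most frequent non-empty value is chosen as the canonical one.
    """
    golden = {}
    attributes = {attr for record in cluster for attr in record}
    for attr in attributes:
        # Ignore missing or empty values when voting.
        values = [r[attr] for r in cluster if r.get(attr)]
        if values:
            golden[attr] = Counter(values).most_common(1)[0][0]
    return golden

cluster = [
    {"name": "IBM Corp.", "city": "Armonk"},
    {"name": "IBM", "city": "Armonk"},
    {"name": "IBM", "city": ""},
]
print(consolidate(cluster)["name"])  # -> IBM
print(consolidate(cluster)["city"])  # -> Armonk
```

Majority voting is only one consolidation policy; its appeal is that it needs no training data, but it can pick a popular wrong value when most sources share the same error.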
Infrastructure and applications consolidation have dominated the agendas of the UK's IT departments during the past few years, and consolidation projects are easy to justify financially. Most enlightened IT departments have long realised that they need to demonstrate an appetite for squeezing greater efficiency from their infrastructure and operations. Unless consolidation is done carefully, however, a user organisation can find itself tied into tighter dependency on a small number of suppliers. There is always a risk in changing operational systems, as they affect not only the technology but also the people who use and manage it and, potentially, business processes. Risks continue even after the project has been completed.

In this paper, we propose a scalable entity consolidation algorithm to merge these clusters into 'golden records'. We first automatically generate matching rules from the clusters, then group these rules into sets with common characteristics to cut down on the number that must be verified by a human.

The goal of this article is to introduce both application developers and Agile DBAs to the realities of working with legacy data. For our purposes, any computer artifact, including but not limited to data and software, is considered to be a legacy asset once it is deployed and in production. Legacy data is itself often constrained by the other applications that work with it, and those constraints are then passed on to your team. Legacy data is often difficult to work with because of a combination of quality, design, architecture, or political issues.

Figure 1 indicates that there are many sources from which you may obtain legacy data. These include existing databases, often relational, as well as non-relational databases such as hierarchical, network, object, XML, object/relational, and NoSQL databases.
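The rule-generation-and-grouping idea can be sketched as follows. This is a toy illustration under assumed data structures, not the actual algorithm: candidate matching rules are taken to be pairs of attribute values that co-occur within a duplicate cluster, and rules are grouped by the tokens in which their two sides differ, so a human can accept or reject a whole group at once instead of verifying each rule individually.

```python
from itertools import combinations
from collections import defaultdict

def generate_rules(clusters, attr):
    """Collect candidate matching rules: pairs of distinct values
    that co-occur for the same attribute within one duplicate cluster."""
    rules = set()
    for cluster in clusters:
        values = {r[attr] for r in cluster if r.get(attr)}
        for a, b in combinations(sorted(values), 2):
            rules.add((a, b))
    return rules

def group_rules(rules):
    """Group rules that differ in the same way (here: by the tokens
    not shared between the two sides), so each group can be verified
    by a human in one step."""
    groups = defaultdict(list)
    for a, b in rules:
        ta, tb = set(a.split()), set(b.split())
        signature = (frozenset(ta - tb), frozenset(tb - ta))
        groups[signature].append((a, b))
    return groups

clusters = [
    [{"name": "Intl Business Machines"},
     {"name": "International Business Machines"}],
    [{"name": "Intl Paper"},
     {"name": "International Paper"}],
]
rules = generate_rules(clusters, "name")
groups = group_rules(rules)
# Both rules differ only in "Intl" vs "International", so they share one
# signature and can be verified together as a single group.
```

Grouping pays off at scale: with tens of thousands of generated rules, verifying a few hundred groups is far cheaper than verifying every rule.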