Finding Duplicate Records In a Database Table

Ideally, business rules that can be enforced at the database layer should be implemented at the database layer, so that the integrity of the data can be relied upon and does not lead to unexpected behaviour in the application.

Rules such as:

  • validation rules, e.g., which fields/columns cannot be null
  • common-sense rules derived from business requirements, e.g., which combinations of column values are not allowed to repeat

The fact is, the application sitting on top of the database may not be the only source of data entering the database. Other sources include:

  • A database administrator making authorized changes/entries to the database using data manipulation language (DML)
  • Importing data through external files such as comma-separated value (CSV) files
  • Migrating data from other databases or tables

More often than not, developers do not take the time to implement these basic requirements at the database layer. As a result, applications that do not do a good job of handling these uniqueness scenarios end up allowing these unwanted situations to arise.

CONTEXTUAL SCENARIO

A requirement that states that there can be only one entry for a currency pair on a given day should translate into a unique index being defined on the identified attributes, i.e., the currency_pair and date columns of the table.
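As a quick illustration of declaring such a rule, here is a minimal sketch using Python's built-in sqlite3 module; the table layout, column names, and sample values are assumptions based on the scenario, not the article's actual schema:

```python
import sqlite3

# In-memory database for illustration only.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE forex (
        currency_pair TEXT NOT NULL,
        rate          REAL NOT NULL,
        date          TEXT NOT NULL
    )
""")

# The business rule: only one entry per currency pair per day.
con.execute("CREATE UNIQUE INDEX ux_forex_pair_date ON forex (currency_pair, date)")

con.execute("INSERT INTO forex VALUES ('EURUSD', 1.0842, '2020-01-02')")

# A second entry for the same pair and day is rejected by the unique index.
rejected = False
try:
    con.execute("INSERT INTO forex VALUES ('EURUSD', 1.0850, '2020-01-02')")
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

With the index in place, duplicates are stopped at the database layer regardless of which application or import path tries to insert them.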

The forex table below does not have the required constraint defined and, as a result, contains duplicate entries, shown in the highlighted rows with the white background colour.

Original Table Containing Duplicates

The How-tos

Case 1: Detecting duplicate entries in given columns

SELECT
	COUNT(*) duplicate_count, currency_pair, date
FROM forex
GROUP BY currency_pair, date
HAVING COUNT(*) > 1;

Executing the SQL statement above retrieves the combinations of the currency_pair and date columns that are duplicated, together with the number of times each repeats. The result is shown below.

Duplicating Records Summary
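The detection query can be reproduced end to end with Python's built-in sqlite3 module; the sample rows below are made up, chosen so that two currency_pair/date combinations repeat (five duplicating rows in total):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE forex (currency_pair TEXT, rate REAL, date TEXT)")
# Assumed sample data: EURUSD three times and GBPUSD twice on the same day,
# plus one non-duplicating USDJPY row.
con.executemany("INSERT INTO forex VALUES (?, ?, ?)", [
    ("EURUSD", 1.0842, "2020-01-02"),
    ("EURUSD", 1.0850, "2020-01-02"),
    ("EURUSD", 1.0861, "2020-01-02"),
    ("GBPUSD", 1.3112, "2020-01-02"),
    ("GBPUSD", 1.3120, "2020-01-02"),
    ("USDJPY", 108.61, "2020-01-02"),
])

# The Case 1 query: group on the columns that must be unique together,
# and keep only the groups that occur more than once.
summary = con.execute("""
    SELECT COUNT(*) AS duplicate_count, currency_pair, date
    FROM forex
    GROUP BY currency_pair, date
    HAVING COUNT(*) > 1
    ORDER BY currency_pair
""").fetchall()
print(summary)  # [(3, 'EURUSD', '2020-01-02'), (2, 'GBPUSD', '2020-01-02')]
```

The single USDJPY row is filtered out by the HAVING clause because its group count is 1.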

Case 2: Selecting duplicate records

To select all the duplicating records with all their columns, perhaps for data-cleaning purposes, use the statement in Case 1 as a subquery.

Hint: a subquery (nested query) that already knows how to detect duplicates in the underlying table, i.e., the one defined in Case 1 above, is required.

SELECT
	f.*
FROM forex f
	INNER JOIN (
		SELECT currency_pair, date
		FROM forex
		GROUP BY currency_pair, date
		HAVING COUNT(*) > 1
	) d
	ON d.currency_pair = f.currency_pair AND d.date = f.date
ORDER BY f.currency_pair, f.date;

The outcome of the SQL statement above is depicted in the table below, which shows all the columns for the duplicating records. Note that the record count is 5 and not 2, as in Case 1, because this shows all the duplicating entries rather than an aggregation over the repeating columns, namely currency_pair and date.

All Duplicating Records
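Here is a runnable sketch of the same join, again using sqlite3 with the assumed sample data from before (EURUSD three times and GBPUSD twice on the same day, plus one non-duplicating USDJPY row):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE forex (currency_pair TEXT, rate REAL, date TEXT)")
con.executemany("INSERT INTO forex VALUES (?, ?, ?)", [
    ("EURUSD", 1.0842, "2020-01-02"),
    ("EURUSD", 1.0850, "2020-01-02"),
    ("EURUSD", 1.0861, "2020-01-02"),
    ("GBPUSD", 1.3112, "2020-01-02"),
    ("GBPUSD", 1.3120, "2020-01-02"),
    ("USDJPY", 108.61, "2020-01-02"),
])

# Join the full table back onto the Case 1 subquery, so every row whose
# (currency_pair, date) combination duplicates is returned in full.
dupes = con.execute("""
    SELECT f.*
    FROM forex f
        INNER JOIN (
            SELECT currency_pair, date
            FROM forex
            GROUP BY currency_pair, date
            HAVING COUNT(*) > 1
        ) d
        ON d.currency_pair = f.currency_pair AND d.date = f.date
    ORDER BY f.currency_pair, f.date
""").fetchall()
print(len(dupes))  # 5 -- all duplicating rows, not the 2 aggregated groups
```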

Case 3: Selecting non-duplicating records

To migrate only the valid records, i.e., those that do not duplicate, you only have to change the conditional operator from > (greater than) to = (equal to) in the solution above.

SELECT
	f.*
FROM forex f
	INNER JOIN (
		SELECT currency_pair, date
		FROM forex
		GROUP BY currency_pair, date
		HAVING COUNT(*) = 1
	) d
	ON d.currency_pair = f.currency_pair AND d.date = f.date
ORDER BY f.currency_pair, f.date;

As expected, the resulting records below are the non-repeating ones only.

Non-duplicating Records
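The Case 3 variant can be sketched the same way with the assumed sample data; only the single USDJPY row survives the COUNT(*) = 1 filter:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE forex (currency_pair TEXT, rate REAL, date TEXT)")
con.executemany("INSERT INTO forex VALUES (?, ?, ?)", [
    ("EURUSD", 1.0842, "2020-01-02"),
    ("EURUSD", 1.0850, "2020-01-02"),
    ("EURUSD", 1.0861, "2020-01-02"),
    ("GBPUSD", 1.3112, "2020-01-02"),
    ("GBPUSD", 1.3120, "2020-01-02"),
    ("USDJPY", 108.61, "2020-01-02"),
])

# Identical to Case 2, except the HAVING clause keeps only the groups
# that occur exactly once -- the clean, non-duplicating rows.
clean = con.execute("""
    SELECT f.*
    FROM forex f
        INNER JOIN (
            SELECT currency_pair, date
            FROM forex
            GROUP BY currency_pair, date
            HAVING COUNT(*) = 1
        ) d
        ON d.currency_pair = f.currency_pair AND d.date = f.date
    ORDER BY f.currency_pair, f.date
""").fetchall()
print(clean)  # [('USDJPY', 108.61, '2020-01-02')]
```

These are the rows that could be migrated safely into a table that does have the unique index defined.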

As you can see, it is not difficult at all to fish out duplicate entries in database tables and to use that information as a foundation for retrieving other data in various use cases/scenarios.
