Ideally, business rules that can be enforced at the database layer should be implemented at the database layer, so that the integrity of the data can be relied on and does not lead to unexpected behaviour in the application.
Such rules include:
- validation rules, e.g., which fields/columns cannot be null
- common-sense rules derived from business requirements, e.g., which combinations of column values are not allowed to repeat
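Both kinds of rules can be declared directly in a table definition. The sketch below is purely illustrative; the customer table and its columns are not from this article:

```sql
-- A minimal sketch of enforcing rules at the database layer.
-- Table and column names are hypothetical examples.
CREATE TABLE customer (
    id    INTEGER PRIMARY KEY,
    email VARCHAR(255) NOT NULL,        -- validation rule: cannot be null
    age   INTEGER CHECK (age >= 18),    -- business rule: adults only
    UNIQUE (email)                      -- business rule: no repeated emails
);
```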
The fact is, the application sitting on top of the database may not be the only source from which data reaches the database. Other sources include:
- A database administrator making authorized changes/entries to the database using data manipulation language (DML)
- Importing data through external files such as comma-separated value (CSV) files
- Migrating data from other databases or tables
More often than not, developers do not take the time to define these basic requirements at the database layer and, as a result, applications that do not handle these uniqueness scenarios well end up allowing such unwanted situations to arise.
A requirement that states there can be only one entry per currency pair for a given day should translate into a unique index defined on the identified attributes, i.e. the currency_pair and date columns of the table.
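Declaring such a constraint is a one-line statement; the index name below is an illustrative choice, not one from the article:

```sql
-- Enforce "one entry per currency pair per day" at the database layer.
-- The index name ux_forex_pair_date is a hypothetical example.
CREATE UNIQUE INDEX ux_forex_pair_date ON forex (currency_pair, date);
```

With this index in place, any insert or import that would create a second row for the same currency_pair and date is rejected by the database itself, regardless of which source the data comes from.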
The forex table below does not have the required constraint defined and, as a result, contains duplicate entries, shown in the highlighted rows with the white background colour.
Case 1: Detecting duplicate entries in given columns
SELECT COUNT(*) AS duplicate_count, currency_pair, date
FROM forex
GROUP BY currency_pair, date
HAVING COUNT(*) > 1;
Executing the SQL statement above retrieves the records that duplicate on the combination of the currency_pair and date columns, together with the number of times each combination repeats. The result is shown below.
Case 2: Selecting duplicate records
To select all the duplicating records with all their columns, perhaps for data cleaning purposes, use the statement in case 1 as a subquery.
Hint: A subquery, or nested query, such as the one defined in case 1 above, is required because it already knows how to detect duplicates in the underlying table.
SELECT f.*
FROM forex f
INNER JOIN (
    SELECT currency_pair, date
    FROM forex
    GROUP BY currency_pair, date
    HAVING COUNT(*) > 1
) d ON d.currency_pair = f.currency_pair AND d.date = f.date
ORDER BY f.currency_pair, f.date;
The outcome of the SQL statement above is depicted in the table below, which shows all the columns of the duplicating records. Note that the record count is 5, not 2 as in case 1, because this result lists every duplicating row rather than an aggregation over the repeating columns, namely currency_pair and date.
Case 3: Selecting non-duplicating records
To migrate only the valid records, i.e. those that do not duplicate, simply change the conditional operator in the solution above from > (greater than) to = (equal to).
SELECT f.*
FROM forex f
INNER JOIN (
    SELECT currency_pair, date
    FROM forex
    GROUP BY currency_pair, date
    HAVING COUNT(*) = 1
) d ON d.currency_pair = f.currency_pair AND d.date = f.date
ORDER BY f.currency_pair, f.date;
And, as expected, the resulting records below are the non-repeating ones only.
As you can see, fishing out duplicate entries in database tables is not difficult at all, and that information can serve as a foundation for retrieving other data in various use cases.
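As a further data-cleaning step, the same grouping logic can drive a clean-up statement. The sketch below assumes the forex table has a unique id column, which the article does not show; it keeps the row with the smallest id in each duplicating group and deletes the rest:

```sql
-- Remove duplicates, keeping one row (the smallest id) per currency_pair/date group.
-- Assumes an id primary-key column not shown in the article.
DELETE FROM forex
WHERE id NOT IN (
    SELECT MIN(id)
    FROM forex
    GROUP BY currency_pair, date
);
```

Note that some databases, MySQL among them, do not allow a DELETE to reference the same table in a plain subquery, in which case the subquery must first be wrapped in a derived table.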