Eliminating Duplicates in SQL

Analyzing Duplicates


My friend David Penton wrote this article like a million years ago, and since I'm too lazy to actually commit the logic to memory, when I have duplicates I just go find the article. Had a bit of an issue today, and when I finally found it I decided to post it here so that at least I will know where it is. I would abbreviate it, but someone might find it more valuable in its entirety as written by the talented Mr. Penton.

15 Seconds: Advanced SQL Techniques - Part 1: Analyzing Duplicate Records

David Penton 
10/09/2001 (Right - 2001!! Does that date us or what!)

This article offers an analysis of Structured Query Language (SQL) and presents techniques that can be used in building SQL statements. People tend to think of SQL as a programming language, but in reality, SQL is a Set language. One could say SQL is Set Theory or Relational Algebra; these terms are interchangeable. Some people think of this in terms of Venn diagrams or DeMorgan's Law. Although those can be quite complicated, the basic forms are not difficult to understand. The cases presented here are intended for certain circumstances, but they can be applied to many situations. Each article in this four-part series will address one case.

Deleting Duplicate Records

The following is an analysis of records in a table, some of which may be duplicated. Many different things can be accomplished with this method, including looking for duplicates, looking for records that are duplicated n times, deleting duplicates, etc.
People develop many ways to delete duplicate records from a table. You can make a new table and select the unique records into that table. You can concatenate fields together, get a distinct representation of those records, and place those in another table. But these methods are normally used when people think they do not have any other way to accomplish this.
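
For example, the "new table" workaround described above might be sketched like this (the table and column names here are purely hypothetical, since we have not introduced our example table yet):

/* Hypothetical "new table" workaround - shown only for contrast */
SELECT DISTINCT
  [col_a], [col_b], [col_c]
INTO
  ORDERS_DEDUPED
FROM
  ORDERS_WITH_DUPES

It works, but it copies every surviving row into a second table and tells us nothing about which records were duplicated in the first place.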
Let's suppose we have a table structure with a multiple-field Primary Key. The key comes from an old ordering system whose data we imported into a table that includes an IDENTITY field:

CREATE TABLE SHIPMENT_ORDERS (
    [id] [int] IDENTITY (1, 1) NOT NULL
    , [shipment] [varchar] (20) NOT NULL
    , [purchase_order] [varchar] (20) NOT NULL
    , [item_sku] [char] (20) NOT NULL
    , [units_ordered] [int] NOT NULL
    , [units_shipped] [int] NOT NULL
    , [shipment_date] [datetime] NOT NULL
    , [order_date] [datetime] NOT NULL
    , [last_name] [varchar] (30) NOT NULL
    , [first_name] [varchar] (30) NOT NULL
    , [full_address] [varchar] (30) NOT NULL
    , [email] [varchar] (30) NOT NULL
)

For the sake of argument, there is no Primary Key on this table. Let's assume that some duplicate data was introduced into it. For our discussion, we will assume the target Primary Key to be [shipment], [purchase_order], [item_sku]. We want to get some statistical information from the data first; our end goal is to remove the duplicate records. I will now provide the base query for all of our operations:

/* 1.  Query heading */
SELECT
  *  /* Remember: never use SELECT * in real code - write out the field names */
FROM
  /* 2.  Table with duplicates */
  SHIPMENT_ORDERS
WHERE
  EXISTS  (
    SELECT
      NULL
    FROM
      /* 3.  Table with duplicates, with an alias */
      SHIPMENT_ORDERS b
    WHERE
      /* 4.  Join each field with *itself*.  These are fields that could be Primary Keys */
      b.[shipment] = SHIPMENT_ORDERS.[shipment]
      AND b.[purchase_order] = SHIPMENT_ORDERS.[purchase_order]
      AND b.[item_sku] = SHIPMENT_ORDERS.[item_sku]
    GROUP BY
      /* 5.  I must GROUP BY these fields because of the HAVING
         clause and because these are the possible PK */
      b.[shipment], b.[purchase_order], b.[item_sku]
    HAVING
      /* 6.  This is the determining factor.  We can control our
         output from here.  In this case, we want to pick records
         where the ID is less than the MAX ID */
      SHIPMENT_ORDERS.[id] < MAX(b.[id])
  )


  1. The kind of query we want will be either SELECT * or DELETE FROM. We will view our results first, then change this to delete the offending records if necessary.
  2. This is the table we are performing the above action on.
  3. This is the same table, duplicated. We are self-referencing our target table.
  4. We must compare each field to its copy. This is 1,000 times better than concatenation, since we do not need to worry about the data lining up. The same technique can be used when you have two (or more) tables with similar data that you need to compare. The way most people do this is to concatenate all of the target fields together, which is about the most intensive thing you can do when comparing a group of fields in a table (a contrast sketch follows this list).
  5. We need to GROUP BY the same fields because we want to ensure that our data is unique in our analysis.
  6. This ultimately determines the condition that the resulting data must satisfy.
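
For comparison, here is a rough sketch of the concatenation approach warned about in item 4 (the '|' separator is an assumption and only behaves if none of the key fields can contain it):

/* Concatenation approach - shown only for contrast, not recommended */
SELECT
  [shipment] + '|' + [purchase_order] + '|' + [item_sku] AS dup_key
  , COUNT(*) AS copies
FROM
  SHIPMENT_ORDERS
GROUP BY
  [shipment] + '|' + [purchase_order] + '|' + [item_sku]
HAVING
  COUNT(*) > 1

Every row pays the cost of building that string, and the output only identifies the duplicated keys; it still cannot tell us which physical rows to delete.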
After running the base EXISTS query, we would see all of the duplicated records. For instance, if we had the following records in the table:

id   shipment   purchase_order   item_sku
3    435436     A123765          453987001201
46   435436     A123765          453987001201
354  435436     A123765          453987001201
23   981123     C543219          843209132209
613  981123     C543219          843209132209
the result after running the above query would be:

id   shipment   purchase_order   item_sku
3    435436     A123765          453987001201
46   435436     A123765          453987001201
23   981123     C543219          843209132209
This would be a report of the duplicated records that are present in the data. Now, if we look at section 6 in the query:

      SHIPMENT_ORDERS.[id] < MAX(b.[id])

we can see that an alternate query could use MIN() instead of MAX(), flipping the comparison to SHIPMENT_ORDERS.[id] > MIN(b.[id]). The number of rows reported is the same; the only difference is which copy is treated as the keeper - the row with the minimum [id] instead of the maximum. In the case of our table, the only field that is not duplicated is [id]. If the table we were dealing with had something like a datetime field, it could be used as a pseudo-IDENTITY field.

Suppose we want a report of all records that have exactly one duplicate (two copies in total). This is easily obtainable by changing one part of this query (section 6):


      COUNT(b.[id]) = 2

Based on the previous data, our output would be:

id   shipment   purchase_order   item_sku
23   981123     C543219          843209132209
613  981123     C543219          843209132209

Notice that we get both records.

Remember the first query we ran? It showed all of the duplicated records in the table. If we change its first two words (SELECT *) to DELETE, we accomplish our goal of removing all duplicates from the table.
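
Spelled out, the DELETE form is the same base query with only its opening changed:

/* The base query with SELECT * replaced by DELETE */
DELETE
FROM
  SHIPMENT_ORDERS
WHERE
  EXISTS (
    SELECT
      NULL
    FROM
      SHIPMENT_ORDERS b
    WHERE
      b.[shipment] = SHIPMENT_ORDERS.[shipment]
      AND b.[purchase_order] = SHIPMENT_ORDERS.[purchase_order]
      AND b.[item_sku] = SHIPMENT_ORDERS.[item_sku]
    GROUP BY
      b.[shipment], b.[purchase_order], b.[item_sku]
    HAVING
      SHIPMENT_ORDERS.[id] < MAX(b.[id])
  )

After running it, only the row with the highest [id] in each group survives: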

id   shipment   purchase_order   item_sku
354  435436     A123765          453987001201
613  981123     C543219          843209132209

Conclusion

To recap, we did not:
  1. create a new table
  2. concatenate fields
  3. use the DISTINCT keyword
  4. alter the table
and we did:
  1. use the original table to compare
  2. compare field to field without altering them
  3. rely on the existence of data, not generate new data
  4. use the IDENTITY field to ensure unique records
This query template can be used for many different types of reports and operations on data. At first glance, the query may seem rather complicated, but when it is dissected, the sections are not complex at all. The next article will discuss the uses and differences of UNION and UNION ALL on single-table and multiple-table instances.
Remember, Set Theory is cool!
