Reply to comment

roger's picture

Removing duplicates from a table

Let's assume that you've got a table:

CREATE TABLE #t (
  Id INT IDENTITY,
  Name VARCHAR(50),
  Number INT )

... into which you've accidentally inserted some duplicate rows:

INSERT INTO #t(Name, Number) VALUES('Alice', 3)
INSERT INTO #t(Name, Number) VALUES('Alice', 3)
INSERT INTO #t(Name, Number) VALUES('Alice', 3)
INSERT INTO #t(Name, Number) VALUES('Alice', 4)
INSERT INTO #t(Name, Number) VALUES('Bob', 1)
INSERT INTO #t(Name, Number) VALUES('Bob', 2)
INSERT INTO #t(Name, Number) VALUES('Bob', 3)
INSERT INTO #t(Name, Number) VALUES('Charlie', 1)
INSERT INTO #t(Name, Number) VALUES('Charlie', 2)
INSERT INTO #t(Name, Number) VALUES('Charlie', 3)
INSERT INTO #t(Name, Number) VALUES('Charlie', 4)
INSERT INTO #t(Name, Number) VALUES('David', 1)
INSERT INTO #t(Name, Number) VALUES('David', 1)
INSERT INTO #t(Name, Number) VALUES('David', 2)
INSERT INTO #t(Name, Number) VALUES('David', 2)

(The names are those used when discussing crypto.)

As you can see, I've managed to insert two duplicate rows for Alice, and two duplicate (but different) rows for David.

It's pretty easy to find the non-duplicates, by doing the following:

SELECT MIN(Id) AS Id, Name, Number FROM #t
GROUP BY Name, Number
ORDER BY MIN(Id)

Actually deleting them, on the other hand, is more tricky:

DELETE #t FROM
	( SELECT MIN(L.Id) AS MinId, L.Name, L.Number FROM #t AS L
		INNER JOIN #t AS R
		ON L.Name = R.Name AND L.Number = R.Number
		AND L.Id <> R.Id
		GROUP BY L.Name, L.Number ) AS Q
INNER JOIN #t
	ON #t.Id > Q.MinId
	AND #t.Name = Q.Name
	AND #t.Number = Q.Number

The inner portion...

SELECT MIN(L.Id) AS MinId, L.Name, L.Number FROM #t AS L
	INNER JOIN #t AS R
	ON L.Name = R.Name AND L.Number = R.Number
	AND L.Id <> R.Id
	GROUP BY L.Name, L.Number

...selects the first row of each set of duplicates (in no particular order):

27	David	1
29	David	2
16	Alice	3

If we join this subquery against the original table, using the greater-than operator...

SELECT #t.* FROM
	( SELECT MIN(L.Id) AS MinId, L.Name, L.Number FROM #t AS L
		INNER JOIN #t AS R
		ON L.Name = R.Name AND L.Number = R.Number
		AND L.Id <> R.Id
		GROUP BY L.Name, L.Number ) AS Q
INNER JOIN #t
	ON #t.Id > Q.MinId
	AND #t.Name = Q.Name
	AND #t.Number = Q.Number

...then we end up with the duplicate rows...

17	Alice	3
18	Alice	3
28	David	1
30	David	2

...which we can then delete by changing the SELECT to a DELETE.

Ta-da!

Reply

The content of this field is kept private and will not be shown publicly.
  • Allowed HTML tags: <a> <b> <blockquote> <br> <code> <dd> <dl> <dt> <hr> <h1> <h2> <h3> <i> <img> <li> <ol> <p> <pre> <table> <td> <th> <tr> <tt> <u> <ul>
  • Images can be added to this post.

More information about formatting options