How to find duplicate rows with PostgreSQL?
I was working on a project where we needed to aggregate information on employees from 10 different tables and make the resulting table clear (no duplicate rows), containing full information on people working in the big company.
While making this I understood that the emergence of duplicates (or duplicate rows) is inevitable when you work with a large amount of data aggregating several tables into one. Fortunately PostgreSQL has some features that are extremely useful while working with detection and elimination of duplicates.
I want to put your attention on these features and help you to never have problems with duplicates.
Duplicate or Duplicate row is a row in a table looking exactly or almost exactly like some another row (original row) in this table.
So we can deal with absolutely identical rows and almost identical rows. For example theirs ids can differ but all other properties are exactly the same.
So, what can you do with the duplicates?
For absolutely identical rows:
- Find them
- Delete them
For almost identical rows (identical except for one or more properties):
- Combine information from duplicate rows into one row
- Select one of the rows according to some criteria and delete the remaining ones.
Finding duplicate rows
If the table has few rows, you can see which ones are duplicate immediately. However, it is not the case with the big table.
The find the duplicate rows, you use the following statement:
COUNT( fruit )
COUNT( fruit )> 1