Kyra's Java Blog: SQL: Distinct vs group by

...Okay, this isn't Java related, but I'm mostly doing PHP + SQL Server queries these days and I'm too lazy to make a new blog.

As I'm still a bit of a beginner when it comes to SQL (I mostly just know basic selects and inserts... even my joins are rather rusty), I had a problem with a select statement. Even though I wanted to get data cross-referenced from a bunch of tables, I only wanted rows where the first column was unique.

"So this is a job for distinct, right?"

Wrong. This is really a job for group by, and this is why:

(quoted from "Jeff's SQL Server Blog"):

http://weblogs.sqlteam.com/jeffs/archive/2007/10/12/sql-distinct-group-by.aspx

I'm reproducing the text below because I hate it when links become outdated.

By The Way ... DISTINCT is not a function ...

Have you ever seen (or written) code like this:

select distinct(employeeID), salary
from salaryhist

That compiles and executes without returning any errors. I've seen that attempted many times over the years, and of course people think DISTINCT is "broken" and "not working" because they see multiple rows for each employeeID. "But I asked for only distinct employeeIDs!" they say.

Well, the DISTINCT has nothing to do with the EmployeeID column; it is not a function that accepts arguments! It is just a tag that you can put after the word SELECT to indicate that you want only distinct combinations of all columns in the result set returned.
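To see this in action, here is a minimal sketch. It assumes an in-memory SQLite database (through Python's sqlite3 module) as a stand-in for SQL Server, and the rows in salaryhist are made up for illustration. Only exact duplicates of the whole (employeeID, salary) pair are collapsed, so employee 1 still appears twice:

```python
import sqlite3

# Hypothetical sample data: a salaryhist table with only the two
# columns used in the post. SQLite stands in for SQL Server here.
conn = sqlite3.connect(":memory:")
conn.execute("create table salaryhist (employeeID integer, salary integer)")
conn.executemany(
    "insert into salaryhist values (?, ?)",
    [(1, 50000), (1, 55000), (1, 55000), (2, 40000)],
)

# DISTINCT applies to the (employeeID, salary) combination,
# not to employeeID alone, despite the parentheses.
rows = conn.execute(
    "select distinct(employeeID), salary "
    "from salaryhist order by employeeID, salary"
).fetchall()

print(rows)  # [(1, 50000), (1, 55000), (2, 40000)]
```

The repeated (1, 55000) row is removed, but (1, 50000) and (1, 55000) both survive because they differ in salary.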

That syntax is accepted because (employeeID) is just an expression, a reference to a column, which happens to be surrounded by parentheses. For example, you could write:

select distinct (employeeID), (salary)
from salaryhist

or:

select (employeeID), (salary)
from salaryhist

or even:

select distinct ((employeeID)), ((salary))
from salaryhist

Nothing is indicating that DISTINCT should be "operating" on the employeeID column; it is just a column reference in the SELECT clause that happens to be surrounded by parentheses.
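One way to convince yourself that the parentheses change nothing is to run the DISTINCT variants side by side and compare their results. This is a rough sketch under the same assumptions as before: SQLite via Python's sqlite3 standing in for SQL Server, with invented sample rows.

```python
import sqlite3

# Hypothetical sample data; SQLite is used as a stand-in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("create table salaryhist (employeeID integer, salary integer)")
conn.executemany(
    "insert into salaryhist values (?, ?)",
    [(1, 50000), (1, 55000), (2, 40000), (2, 40000)],
)

# The plain form, with no parentheses at all.
plain = "select distinct employeeID, salary from salaryhist order by 1, 2"

# The parenthesized forms of the same query.
variants = [
    "select distinct(employeeID), salary from salaryhist order by 1, 2",
    "select distinct (employeeID), (salary) from salaryhist order by 1, 2",
    "select distinct ((employeeID)), ((salary)) from salaryhist order by 1, 2",
]

expected = conn.execute(plain).fetchall()
results = [conn.execute(q).fetchall() for q in variants]

# Every variant returns exactly what the plain query returns.
all_same = all(r == expected for r in results)
print(expected)  # [(1, 50000), (1, 55000), (2, 40000)]
print(all_same)  # True
```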

So, remember:

DISTINCT always operates on all columns in the final result
DISTINCT is not a function that accepts a column as an argument

When you do want to return multiple columns in your result, but only have them be distinct for a subset of those columns, you would use GROUP BY. And, of course, you must specify how you'd like to summarize each column you return that is not part of the grouping:

select employeeID, max(salary) as MaxSalary
from salaryhist
group by employeeID

Notice that now we are getting distinct EmployeeID values, and the max salary per EmployeeID. This will truly return exactly one row per EmployeeID, unlike the initial DISTINCT example. And, in this case, MAX() is indeed a function that accepts and acts upon an argument -- unlike DISTINCT!
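Here is the same GROUP BY query run against some sample data, as a small sketch. As in any such demo, the setup is an assumption: SQLite via Python's sqlite3 in place of SQL Server, and salaryhist rows invented for illustration.

```python
import sqlite3

# Hypothetical sample data; SQLite is used as a stand-in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("create table salaryhist (employeeID integer, salary integer)")
conn.executemany(
    "insert into salaryhist values (?, ?)",
    [(1, 50000), (1, 55000), (2, 40000), (2, 42000), (3, 30000)],
)

# One row per employeeID; max() summarizes the salaries in each group.
rows = conn.execute(
    "select employeeID, max(salary) as MaxSalary "
    "from salaryhist "
    "group by employeeID "
    "order by employeeID"
).fetchall()

print(rows)  # [(1, 55000), (2, 42000), (3, 30000)]
```

Each employeeID shows up exactly once, however many salary rows it has in the table.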


Thursday, April 9, 2009
