Thursday, March 29, 2012

Alternate to a not in query

-- tested schema below --
-- create tables --
create table tbl_test
(serialnumber char(12))
go
create table tbl_test2
(serialnumber char(12),
exportedflag int)
go
--insert data --
insert into tbl_test2 values ('123456789010',0)
insert into tbl_test2 values ('123456789011',0)
insert into tbl_test2 values ('123456789012',0)
insert into tbl_test2 values ('123456789013',0)
insert into tbl_test2 values ('123456789014',0)
insert into tbl_test2 values ('123456789015',0)
insert into tbl_test2 values ('123456789016',0)
insert into tbl_test2 values ('123456789017',0)
insert into tbl_test2 values ('123456789018',0)
insert into tbl_test2 values ('123456789019',0)

insert into tbl_test values ('123456789011')
insert into tbl_test values ('123456789012')
insert into tbl_test values ('123456789013')
insert into tbl_test values ('123456789014')
insert into tbl_test values ('123456789015')

-- query --
Select serialnumber from tbl_test2
where serialnumber
not in (select serialnumber from tbl_test) and
exportedflag=0

This query runs quite fast with only the data above but when both
tables get million plus rows, the query simply bogs down. Is there a
better way to write this query?Select serialnumber
from tbl_test2 a
left joint tbl_test b on a. serialnumber = b.serialnumber
where (b.serialnumber IS NULL)
AND (a.exportedflag=0)|||There is another way to write the query, but it's not better (in fact,
I think it's worse):

Select tbl_test2.serialnumber from tbl_test2
left join tbl_test on tbl_test2.serialnumber=tbl_test.serialnumber
where exportedflag=0 and tbl_test.serialnumber is null

To improve the performance of this query, you should create primary
keys on the tables. Besides the conceptual benefits of a proper design,
this would accomplish (at least) the following things:
- create an index on the serialnumber column
- declare that the serialnumber column does not allow duplicates
- declare that the serialnumber column does not allow nulls
These things will help the Query Optimizer very much to create a better
execution plan.

Razvan|||
Razvan Socol wrote:
> There is another way to write the query, but it's not better (in fact,
> I think it's worse):
> Select tbl_test2.serialnumber from tbl_test2
> left join tbl_test on tbl_test2.serialnumber=tbl_test.serialnumber
> where exportedflag=0 and tbl_test.serialnumber is null

Razvan,

Why worse?

The common wisdom seems to be that it is always more efficient
eliminate nested subqueries, if possible.

My understanding is that the optimizer will internally eliminate the
subquery by doing a left join as above if it can.|||Ira Gladnick (IraGladnick@.yahoo.com) writes:
> Why worse?
> The common wisdom seems to be that it is always more efficient
> eliminate nested subqueries, if possible.

It's worse, becase it does not express the intent of the query equally
well, and therefore can contribute to higher maintenance costs.

> My understanding is that the optimizer will internally eliminate the
> subquery by doing a left join as above if it can.

I don't know if this is the case, but in such case there is even less
reason to rewrite the query in an obscure way.

I would write the query as:

Select serialnumber
from tbl_test2 t2
where not exists (select *
from tbl_test t
where t2.serialnuber = t.serialnumber)
and exportedflag=0

In SQL 6.5 this would typically perform better than NOT IN. But I believe
SQL 2000 will rewrite NOT IN to NOT EXISTS internally, so it is not that
much of an issue for performance. But NOT EXISTS is more general to use
than NOT IN, because you can handle multi-column conditions. Furthermore,
if there are NULL values involved, NOT IN can give you surpriese.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||(kjaggi@.hotmail.com) writes:
> -- query --
> Select serialnumber from tbl_test2
> where serialnumber
> not in (select serialnumber from tbl_test) and
> exportedflag=0
> This query runs quite fast with only the data above but when both
> tables get million plus rows, the query simply bogs down. Is there a
> better way to write this query?

Beside the obvious point from Razvan about indexes, if you are on a multi-
CPU box, you can try this at the end of the query:

OPTION (MAXDOP 1)

this turns off parallelism. I've seen SQL Server use massive parallel
plans for this type of query, when a non-parallel plan have been much
faster.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.aspsql

No comments:

Post a Comment