Sunday, February 19, 2012

in vs. exists

What is the difference between in and exists performance wise? I have
tuned(aatempted to tune) queries where a subselect was introduced with an in
.
I have replaced th in with an exists and found the performance and plan to
be the same.> What is the difference between in and exists performance wise?
It depends on the selectivity of the column(s) being checked.
IN() can never perform better than EXISTS(). EXISTS() *can* perform better
since it will "short circuit" (loose definition here) as soon as it sees a
single row that satisfies the condition. IN(), on the other hand, has to
continue resolving the entire query.|||Hi,
In some situation EXISTS will be good and some times IN will be good.
When Exists is good:-
If your second table (Table used inside inner table) contains more distinct
values (High cardinality) then EXISTS will be good.
When IN is good :-
If your second table (Table used inside inner table) contains more duplicate
values (Low cardinality) then IN will be good.
Thanks
Hari
SQL Server MVP
"Jeffrey K. Ericson" <JeffreyKEricson@.discussions.microsoft.com> wrote in
message news:5F6D97A3-94A2-4236-A467-13D53496A726@.microsoft.com...
> What is the difference between in and exists performance wise? I have
> tuned(aatempted to tune) queries where a subselect was introduced with an
> in.
> I have replaced th in with an exists and found the performance and plan to
> be the same.|||> When IN is good :-
> If your second table (Table used inside inner table) contains more
> duplicate values (Low cardinality) then IN will be good.
I may be mistaken, but I don't believe there is any case where IN can
perform better than an equivalent query using either EXISTS() or NOT
EXISTS().|||>> What is the difference between in and exists performance wise?
Nothing. These are simply SQL operators asking its difference would be like
asking the performance difference between say, addition and subtraction.
Without testing your queries under the specific setup ( indexes, dataset
size, hints, processor involvements, memory etc. ) you have, it is
impossible to consistently claim one performs better than the other.
For a very small illustration, check the plans for two queries with no
"actual" disk access performs for NOT IN & NOT EXISTS:
SELECT *
FROM ( SELECT 1 UNION SELECT 2 UNION
SELECT 3 UNION SELECT 4 ) t1 ( c )
WHERE t1.c NOT IN ( SELECT t2.c FROM ( SELECT 1 ) t2 ( c ) ) ;
SELECT *
FROM ( SELECT 1 UNION SELECT 2 UNION
SELECT 3 UNION SELECT 4 ) t1 ( c )
WHERE NOT EXISTS ( SELECT * FROM ( SELECT 1 ) t2 ( c )
WHERE t2.c = t1.c ) ;
Simpy change the number of union-ed items in the derived tables and see how
the plan varies.
Anith|||I used to see such differences in the plans in 7.0, but in 2000 IN is
usually treated like EXISTS, where a similar short circuit you're referring
to is used also for IN.
Check out the plans and io stats for both queries:
set statistics io on;
use northwind;
select * from employees as e
where exists
(select * from orders as o
where o.employeeid = e.employeeid);
select * from employees as e
where employeeid in
(select employeeid from orders);
And you will see that in both cases the plans perform a single s
operation in the index on orders.employeeid for each outer employee.
The index is not fully scanned to retrieve the list of employees. The
optimizer is more sofisticated now.
I'd just add a caution when considering whether to use EXISTS or IN, that
they are not logically identical when NULLs are involved.
BG, SQL Server MVP
www.SolidQualityLearning.com
"Aaron Bertrand [SQL Server MVP]" <ten.xoc@.dnartreb.noraa> wrote in message
news:uY3zMUfmFHA.4056@.TK2MSFTNGP10.phx.gbl...
> It depends on the selectivity of the column(s) being checked.
> IN() can never perform better than EXISTS(). EXISTS() *can* perform
> better since it will "short circuit" (loose definition here) as soon as it
> sees a single row that satisfies the condition. IN(), on the other hand,
> has to continue resolving the entire query.
>|||Jeffrey,
That is my experience too (that SQL2K handles IN as efficiently as
EXISTS).
In theory there are cases in which IN would be faster than EXISTS (for
example when the selection in the IN-clause has many duplicates).
However, in practice the query optimizer is smart enough to determine
whether the IN strategy (selection with aggregation to remove
duplicates) or the EXISTS (lookups) is faster, regardless whether you
specify IN or EXISTS.
Gert-Jan|||I agree. Post your DDL along with some sample data and we can help you find
the optimal solution for your specific situation.
Aincent Romans created a name for this approach: casuistic (by case). And
it's still used today... :)
ML|||> In theory there are cases in which IN would be faster than EXISTS (for
> example when the selection in the IN-clause has many duplicates).
How? In would have duplicates too? I always use EXISTS because it is just
as easy to code and you never have to worry about the fact that IN only
works with single values and EXISTS covers all.
----
Louis Davidson - http://spaces.msn.com/members/drsql/
SQL Server MVP
"Gert-Jan Strik" <sorry@.toomuchspamalready.nl> wrote in message
news:42F3F200.3BB4CE81@.toomuchspamalready.nl...
> Jeffrey,
> That is my experience too (that SQL2K handles IN as efficiently as
> EXISTS).
> In theory there are cases in which IN would be faster than EXISTS (for
> example when the selection in the IN-clause has many duplicates).
> However, in practice the query optimizer is smart enough to determine
> whether the IN strategy (selection with aggregation to remove
> duplicates) or the EXISTS (lookups) is faster, regardless whether you
> specify IN or EXISTS.
> Gert-Jan|||It is just academic because (as explained before) the optimizer will
select the best plan regardless of the IN/EXISTS syntax. But here it
goes...
For an IN subquery, only the unique values are relevant, so under the
covers the query will contain the DISTINCT keyword. If there are very
few distinct values, this will result in a set of hashing keys which is
is very cheap to match against the main query and will use very limited
memory (very compact).
The same query with an EXISTS would have to do a lookup for each row of
the main query, so there will be many logical reads. If there is a good
index in place for the lookups in the correlated subquery (and again
assuming few distinct values), then in practice you will get good
performance too, because all relevant pages are likely to be cached.
However, if the key is narrow and the distinct values are scattered over
the pages (with on average one distinct key per page), then the EXISTS
version will require much more memory to get the same performance as the
IN-variant, because on average each (cached) page will only contain one
distinct key, whereas the hash set of the IN version is a very compact
set, not dependant on the page cache.
The IN version will have to read all those pages too, but only needs to
do that once, something which is not true for the EXISTS version.
If there is no useful index for the subquery, then theoretically the
EXISTS version will run like a dog (a table scan for each lookup), and
the IN version will need 'just' one table scan.
Gert-Jan
Louis Davidson wrote:
>
> How? In would have duplicates too? I always use EXISTS because it is jus
t
> as easy to code and you never have to worry about the fact that IN only
> works with single values and EXISTS covers all.
> --
> ----
--
> Louis Davidson - http://spaces.msn.com/members/drsql/
> SQL Server MVP
> "Gert-Jan Strik" <sorry@.toomuchspamalready.nl> wrote in message
> news:42F3F200.3BB4CE81@.toomuchspamalready.nl...

No comments:

Post a Comment