PostgreSQL 对IN,EXISTS,ANY/ALL,JOIN的sql优化方案

2023-07-06 22:48:03 274

测试环境：

postgres=#selectversion();
version
---------------------------------------------------------------------------------------------------------
PostgreSQL11.9onx86_64-pc-linux-gnu,compiledbygcc(GCC)4.8.520150623(RedHat4.8.5-39),64-bit
(1row)
postgres=#

数据准备：

$pgbench-i-s10

postgres=#\d
Listofrelations
Schema|Name|Type|Owner
--------+------------------+-------+----------
public|pgbench_accounts|table|postgres
public|pgbench_branches|table|postgres
public|pgbench_history|table|postgres
public|pgbench_tellers|table|postgres
(4rows)

postgres=#select*frompgbench_accountslimit1;
aid|bid|abalance|filler
-----+-----+----------+--------------------------------------------------------------------------------------
1|1|0|
(1row)

postgres=#select*frompgbench_brancheslimit1;
bid|bbalance|filler
-----+----------+--------
1|0|
(1row)

postgres=#select*frompgbench_historylimit1;
tid|bid|aid|delta|mtime|filler
-----+-----+-----+-------+-------+--------
(0rows)

postgres=#select*frompgbench_tellerslimit1;
tid|bid|tbalance|filler
-----+-----+----------+--------
1|1|0|
(1row)

postgres=#select*frompgbench_branches;
bid|bbalance|filler
-----+----------+--------
1|0|
2|0|
3|0|
4|0|
5|0|
6|0|
7|0|
8|0|
9|0|
10|0|
(10rows)

postgres=#updatepgbench_branchessetbbalance=4500000wherebidin(4,7);
UPDATE2
postgres=#

IN语句

查询要求：找出那些余额(balance)大于0的每个分支(branch)在表在pgbench_accounts中有多少个账户

1.使用IN子句

SELECT
count(aid),bid
FROM
pgbench_accounts
WHERE
bidIN(SELECTbidFROMpgbench_branchesWHEREbbalance>0)
GROUPBY
bid;

2.使用ANY子句

SELECT
count(aid),bid
FROM
pgbench_accounts
WHERE
bid=ANY(SELECTbidFROMpgbench_branchesWHEREbbalance>0)
GROUPBY
bid;

3.使用EXISTS子句

SELECT
count(aid),bid
FROM
pgbench_accounts
WHERE
EXISTS(SELECTbidFROMpgbench_branchesWHEREbbalance>0ANDpgbench_accounts.bid=pgbench_branches.bid)
GROUPBY
bid;

4.使用INNERJOIN

SELECT
count(aid),a.bid
FROM
pgbench_accountsa
JOINpgbench_branchesbONa.bid=b.bid
WHERE
b.bbalance>0
GROUPBY
a.bid;

在完成这个查询要求的时候，有人可能会假设exists和innerjoin性能可能会更好，因为他们可以使用两表连接的逻辑和优化。而IN和ANY子句需要使用子查询。

然而，PostgreSQL(10版本之后)已经智能的足以对上面四种写法产生相同的执行计划！

所有上面的写法都会产生相同的执行计划：

QUERYPLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
FinalizeGroupAggregate(cost=23327.73..23330.26rows=10width=12)(actualtime=97.199..99.014rows=2loops=1)
GroupKey:a.bid
->GatherMerge(cost=23327.73..23330.06rows=20width=12)(actualtime=97.191..99.006rows=6loops=1)
WorkersPlanned:2
WorkersLaunched:2
->Sort(cost=22327.70..22327.73rows=10width=12)(actualtime=93.762..93.766rows=2loops=3)
SortKey:a.bid
SortMethod:quicksortMemory:25kB
Worker0:SortMethod:quicksortMemory:25kB
Worker1:SortMethod:quicksortMemory:25kB
->PartialHashAggregate(cost=22327.44..22327.54rows=10width=12)(actualtime=93.723..93.727rows=2loops=3)
GroupKey:a.bid
->HashJoin(cost=1.14..22119.10rows=41667width=8)(actualtime=24.024..83.263rows=66667loops=3)
HashCond:(a.bid=b.bid)
->ParallelSeqScanonpgbench_accountsa(cost=0.00..20560.67rows=416667width=8)(actualtime=0.023..43.151rows=333333loops=3)
->Hash(cost=1.12..1.12rows=1width=4)(actualtime=0.027..0.028rows=2loops=3)
Buckets:1024Batches:1MemoryUsage:9kB
->SeqScanonpgbench_branchesb(cost=0.00..1.12rows=1width=4)(actualtime=0.018..0.020rows=2loops=3)
Filter:(bbalance>0)
RowsRemovedbyFilter:8
PlanningTime:0.342ms
ExecutionTime:99.164ms
(22rows)

那么，我们是否可以得出这样的结论:我们可以随意地编写查询，而PostgreSQL的智能将会处理其余的问题?！

等等!

如果我们考虑排除情况，事情会变得不同。

排除查询

查询要求：找出那些余额(balance)不大于0的每个分支(branch)在表在pgbench_accounts中有多少个账户

1.使用NOTIN

SELECT
count(aid),bid
FROM
pgbench_accounts
WHERE
bidNOTIN(SELECTbidFROMpgbench_branchesWHEREbbalance>0)
GROUPBY
bid;

执行计划:

QUERYPLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
FinalizeGroupAggregate(cost=23645.42..23647.95rows=10width=12)(actualtime=128.606..130.502rows=8loops=1)
GroupKey:pgbench_accounts.bid
->GatherMerge(cost=23645.42..23647.75rows=20width=12)(actualtime=128.598..130.490rows=24loops=1)
WorkersPlanned:2
WorkersLaunched:2
->Sort(cost=22645.39..22645.42rows=10width=12)(actualtime=124.960..124.963rows=8loops=3)
SortKey:pgbench_accounts.bid
SortMethod:quicksortMemory:25kB
Worker0:SortMethod:quicksortMemory:25kB
Worker1:SortMethod:quicksortMemory:25kB
->PartialHashAggregate(cost=22645.13..22645.23rows=10width=12)(actualtime=124.917..124.920rows=8loops=3)
GroupKey:pgbench_accounts.bid
->ParallelSeqScanonpgbench_accounts(cost=1.13..21603.46rows=208333width=8)(actualtime=0.078..83.134rows=266667loops=3)
Filter:(NOT(hashedSubPlan1))
RowsRemovedbyFilter:66667
SubPlan1
->SeqScanonpgbench_branches(cost=0.00..1.12rows=1width=4)(actualtime=0.020..0.021rows=2loops=3)
Filter:(bbalance>0)
RowsRemovedbyFilter:8
PlanningTime:0.310ms
ExecutionTime:130.620ms
(21rows)

postgres=#

2.使用<>ALL

SELECT
count(aid),bid
FROM
pgbench_accounts
WHERE
bid<>ALL(SELECTbidFROMpgbench_branchesWHEREbbalance>0)
GROUPBY
bid;

执行计划:

QUERYPLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
FinalizeGroupAggregate(cost=259581.79..259584.32rows=10width=12)(actualtime=418.220..419.913rows=8loops=1)
GroupKey:pgbench_accounts.bid
->GatherMerge(cost=259581.79..259584.12rows=20width=12)(actualtime=418.212..419.902rows=24loops=1)
WorkersPlanned:2
WorkersLaunched:2
->Sort(cost=258581.76..258581.79rows=10width=12)(actualtime=413.906..413.909rows=8loops=3)
SortKey:pgbench_accounts.bid
SortMethod:quicksortMemory:25kB
Worker0:SortMethod:quicksortMemory:25kB
Worker1:SortMethod:quicksortMemory:25kB
->PartialHashAggregate(cost=258581.50..258581.60rows=10width=12)(actualtime=413.872..413.875rows=8loops=3)
GroupKey:pgbench_accounts.bid
->ParallelSeqScanonpgbench_accounts(cost=0.00..257539.83rows=208333width=8)(actualtime=0.054..367.244rows=266667loops=3)
Filter:(SubPlan1)
RowsRemovedbyFilter:66667
SubPlan1
->Materialize(cost=0.00..1.13rows=1width=4)(actualtime=0.000..0.001rows=2loops=1000000)
->SeqScanonpgbench_branches(cost=0.00..1.12rows=1width=4)(actualtime=0.001..0.001rows=2loops=337880)
Filter:(bbalance>0)
RowsRemovedbyFilter:8
PlanningTime:0.218ms
ExecutionTime:420.035ms
(22rows)
postgres=#

3.使用NOTEXISTS

SELECT
count(aid),bid
FROM
pgbench_accounts
WHERE
NOTEXISTS(SELECTbidFROMpgbench_branchesWHEREbbalance>0ANDpgbench_accounts.bid=pgbench_branches.bid)
GROUPBY
bid;

执行计划:

QUERYPLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
FinalizeGroupAggregate(cost=28327.72..28330.25rows=10width=12)(actualtime=152.024..153.931rows=8loops=1)
GroupKey:pgbench_accounts.bid
->GatherMerge(cost=28327.72..28330.05rows=20width=12)(actualtime=152.014..153.917rows=24loops=1)
WorkersPlanned:2
WorkersLaunched:2
->Sort(cost=27327.70..27327.72rows=10width=12)(actualtime=147.782..147.786rows=8loops=3)
SortKey:pgbench_accounts.bid
SortMethod:quicksortMemory:25kB
Worker0:SortMethod:quicksortMemory:25kB
Worker1:SortMethod:quicksortMemory:25kB
->PartialHashAggregate(cost=27327.43..27327.53rows=10width=12)(actualtime=147.732..147.737rows=8loops=3)
GroupKey:pgbench_accounts.bid
->HashAntiJoin(cost=1.14..25452.43rows=375000width=8)(actualtime=0.134..101.884rows=266667loops=3)
HashCond:(pgbench_accounts.bid=pgbench_branches.bid)
->ParallelSeqScanonpgbench_accounts(cost=0.00..20560.67rows=416667width=8)(actualtime=0.032..45.174rows=333333loops=3)
->Hash(cost=1.12..1.12rows=1width=4)(actualtime=0.036..0.037rows=2loops=3)
Buckets:1024Batches:1MemoryUsage:9kB
->SeqScanonpgbench_branches(cost=0.00..1.12rows=1width=4)(actualtime=0.025..0.027rows=2loops=3)
Filter:(bbalance>0)
RowsRemovedbyFilter:8
PlanningTime:0.322ms
ExecutionTime:154.040ms
(22rows)
postgres=#

4.使用LEFTJOIN和ISNULL

SELECT
count(aid),a.bid
FROM
pgbench_accountsa
LEFTJOINpgbench_branchesbONa.bid=b.bidANDb.bbalance>0
WHERE
b.bidISNULL
GROUPBY
a.bid;

执行计划:

QUERYPLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
FinalizeGroupAggregate(cost=28327.72..28330.25rows=10width=12)(actualtime=145.298..147.096rows=8loops=1)
GroupKey:a.bid
->GatherMerge(cost=28327.72..28330.05rows=20width=12)(actualtime=145.288..147.083rows=24loops=1)
WorkersPlanned:2
WorkersLaunched:2
->Sort(cost=27327.70..27327.72rows=10width=12)(actualtime=141.883..141.887rows=8loops=3)
SortKey:a.bid
SortMethod:quicksortMemory:25kB
Worker0:SortMethod:quicksortMemory:25kB
Worker1:SortMethod:quicksortMemory:25kB
->PartialHashAggregate(cost=27327.43..27327.53rows=10width=12)(actualtime=141.842..141.847rows=8loops=3)
GroupKey:a.bid
->HashAntiJoin(cost=1.14..25452.43rows=375000width=8)(actualtime=0.087..99.535rows=266667loops=3)
HashCond:(a.bid=b.bid)
->ParallelSeqScanonpgbench_accountsa(cost=0.00..20560.67rows=416667width=8)(actualtime=0.025..44.337rows=333333loops=3)
->Hash(cost=1.12..1.12rows=1width=4)(actualtime=0.026..0.027rows=2loops=3)
Buckets:1024Batches:1MemoryUsage:9kB
->SeqScanonpgbench_branchesb(cost=0.00..1.12rows=1width=4)(actualtime=0.019..0.020rows=2loops=3)
Filter:(bbalance>0)
RowsRemovedbyFilter:8
PlanningTime:0.231ms
ExecutionTime:147.180ms
(22rows)
postgres=#

NOTIN和<>ALL生成执行计划都包含了一个子查询。他们是各自独立的。

而NOTEXISTS和LEFTJOIN生成了相同的执行计划。

这些hash连接(或hashantijoin)是完成查询要求的最灵活的方式。这也是推荐exists或join的原因。因此，推荐使用exists或join的经验法则是有效的。

但是，我们继续往下看!即使有了子查询执行计划，NOTIN子句的执行时间也会更好?

是的。PostgreSQL做了出色的优化，PostgreSQL将子查询计划进行了hash处理。因此PostgreSQL对如何处理IN子句有了更好的理解，这是一种逻辑思维方式，因为很多人倾向于使用IN子句。子查询返回的行很少，但即使子查询返回几百行，也会发生同样的情况。

但是，如果子查询返回大量行(几十万行)怎么办?让我们尝试一个简单的测试：

CREATETABLEt1AS
SELECT*FROMgenerate_series(0,500000)id;

CREATETABLEt2AS
SELECT(random()*4000000)::integerid
FROMgenerate_series(0,4000000);

ANALYZEt1;
ANALYZEt2;

EXPLAINSELECTid
FROMt1
WHEREidNOTIN(SELECTidFROMt2);

执行计划：

QUERYPLAN
--------------------------------------------------------------------------------
Gather(cost=1000.00..15195064853.01rows=250000width=4)
WorkersPlanned:1
->ParallelSeqScanont1(cost=0.00..15195038853.01rows=147059width=4)
Filter:(NOT(SubPlan1))
SubPlan1
->Materialize(cost=0.00..93326.01rows=4000001width=4)
->SeqScanont2(cost=0.00..57700.01rows=4000001width=4)
(7rows)

postgres=#

这里，执行计划将子查询进行了物化。代价评估变成了15195038853.01。(PostgreSQL的默认设置，如果t2表的行低于100k，会将子查询进行hash)。这样就会严重影响性能。因此，对于那种子查询返回的行数很少的场景，IN子句可以起到很好的作用。

其它注意点

有的！在我们用不同的方式写查询的时候，可能有数据类型的转换。

比如，语句：

EXPLAINANALYZESELECT*FROMempWHEREgen=ANY(ARRAY['M','F']);

就会发生隐式的类型转换：

SeqScanonemp(cost=0.00..1.04rows=2width=43)(actualtime=0.023..0.026rows=3loops=1)
Filter:((gen)::text=ANY('{M,F}'::text[]))

这里的(gen)::text就发生了类型转换。如果在大表上，这种类型转换的代价会很高，因此，PostgreSQL对IN子句做了更好的处理。

EXPLAINANALYZESELECT*FROMempWHEREgenIN('M','F');

SeqScanonemp(cost=0.00..1.04rows=3width=43)(actualtime=0.030..0.034rows=3loops=1)
Filter:(gen=ANY('{M,F}'::bpchar[]))

将IN子句转换成了ANY子句，没有对gen列进行类型转换。而是将M\F转成了bpchar(内部等价于char)

总结

简单来说，exists和直接join表通常比较好。

很多情况下，PostgreSQL将IN子句换成被hash的子计划。在一些特殊场景下，IN可以获得更好的执行计划。

以上为个人经验，希望能给大家一个参考，也希望大家多多支持毛票票。如有错误或未考虑完全的地方，望不吝赐教。

PostgreSQL 对IN,EXISTS,ANY/ALL,JOIN的sql优化方案

IN语句

1.使用IN子句

2.使用ANY子句

3.使用EXISTS子句

4.使用INNERJOIN

排除查询

1.使用NOTIN

2.使用<>ALL

3.使用NOTEXISTS

4.使用LEFTJOIN和ISNULL

其它注意点

总结

热门推荐

随机推荐