PostgreSQL 对IN,EXISTS,ANY/ALL,JOIN的sql优化方案
测试环境:
postgres=#selectversion(); version --------------------------------------------------------------------------------------------------------- PostgreSQL11.9onx86_64-pc-linux-gnu,compiledbygcc(GCC)4.8.520150623(RedHat4.8.5-39),64-bit (1row) postgres=#
数据准备:
$pgbench-i-s10
postgres=#\d Listofrelations Schema|Name|Type|Owner --------+------------------+-------+---------- public|pgbench_accounts|table|postgres public|pgbench_branches|table|postgres public|pgbench_history|table|postgres public|pgbench_tellers|table|postgres (4rows) postgres=#select*frompgbench_accountslimit1; aid|bid|abalance|filler -----+-----+----------+-------------------------------------------------------------------------------------- 1|1|0| (1row) postgres=#select*frompgbench_brancheslimit1; bid|bbalance|filler -----+----------+-------- 1|0| (1row) postgres=#select*frompgbench_historylimit1; tid|bid|aid|delta|mtime|filler -----+-----+-----+-------+-------+-------- (0rows) postgres=#select*frompgbench_tellerslimit1; tid|bid|tbalance|filler -----+-----+----------+-------- 1|1|0| (1row) postgres=#select*frompgbench_branches; bid|bbalance|filler -----+----------+-------- 1|0| 2|0| 3|0| 4|0| 5|0| 6|0| 7|0| 8|0| 9|0| 10|0| (10rows) postgres=#updatepgbench_branchessetbbalance=4500000wherebidin(4,7); UPDATE2 postgres=#
IN语句
查询要求:找出那些余额(balance)大于0的每个分支(branch)在表在pgbench_accounts中有多少个账户
1.使用IN子句
SELECT count(aid),bid FROM pgbench_accounts WHERE bidIN(SELECTbidFROMpgbench_branchesWHEREbbalance>0) GROUPBY bid;
2.使用ANY子句
SELECT count(aid),bid FROM pgbench_accounts WHERE bid=ANY(SELECTbidFROMpgbench_branchesWHEREbbalance>0) GROUPBY bid;
3.使用EXISTS子句
SELECT count(aid),bid FROM pgbench_accounts WHERE EXISTS(SELECTbidFROMpgbench_branchesWHEREbbalance>0ANDpgbench_accounts.bid=pgbench_branches.bid) GROUPBY bid;
4.使用INNERJOIN
SELECT count(aid),a.bid FROM pgbench_accountsa JOINpgbench_branchesbONa.bid=b.bid WHERE b.bbalance>0 GROUPBY a.bid;
在完成这个查询要求的时候,有人可能会假设exists和innerjoin性能可能会更好,因为他们可以使用两表连接的逻辑和优化。而IN和ANY子句需要使用子查询。
然而,PostgreSQL(10版本之后)已经智能的足以对上面四种写法产生相同的执行计划!
所有上面的写法都会产生相同的执行计划:
QUERYPLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------ FinalizeGroupAggregate(cost=23327.73..23330.26rows=10width=12)(actualtime=97.199..99.014rows=2loops=1) GroupKey:a.bid ->GatherMerge(cost=23327.73..23330.06rows=20width=12)(actualtime=97.191..99.006rows=6loops=1) WorkersPlanned:2 WorkersLaunched:2 ->Sort(cost=22327.70..22327.73rows=10width=12)(actualtime=93.762..93.766rows=2loops=3) SortKey:a.bid SortMethod:quicksortMemory:25kB Worker0:SortMethod:quicksortMemory:25kB Worker1:SortMethod:quicksortMemory:25kB ->PartialHashAggregate(cost=22327.44..22327.54rows=10width=12)(actualtime=93.723..93.727rows=2loops=3) GroupKey:a.bid ->HashJoin(cost=1.14..22119.10rows=41667width=8)(actualtime=24.024..83.263rows=66667loops=3) HashCond:(a.bid=b.bid) ->ParallelSeqScanonpgbench_accountsa(cost=0.00..20560.67rows=416667width=8)(actualtime=0.023..43.151rows=333333loops=3) ->Hash(cost=1.12..1.12rows=1width=4)(actualtime=0.027..0.028rows=2loops=3) Buckets:1024Batches:1MemoryUsage:9kB ->SeqScanonpgbench_branchesb(cost=0.00..1.12rows=1width=4)(actualtime=0.018..0.020rows=2loops=3) Filter:(bbalance>0) RowsRemovedbyFilter:8 PlanningTime:0.342ms ExecutionTime:99.164ms (22rows)
那么,我们是否可以得出这样的结论:我们可以随意地编写查询,而PostgreSQL的智能将会处理其余的问题?!
等等!
如果我们考虑排除情况,事情会变得不同。
排除查询
查询要求:找出那些余额(balance)不大于0的每个分支(branch)在表在pgbench_accounts中有多少个账户
1.使用NOTIN
SELECT count(aid),bid FROM pgbench_accounts WHERE bidNOTIN(SELECTbidFROMpgbench_branchesWHEREbbalance>0) GROUPBY bid;
执行计划:
QUERYPLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------- FinalizeGroupAggregate(cost=23645.42..23647.95rows=10width=12)(actualtime=128.606..130.502rows=8loops=1) GroupKey:pgbench_accounts.bid ->GatherMerge(cost=23645.42..23647.75rows=20width=12)(actualtime=128.598..130.490rows=24loops=1) WorkersPlanned:2 WorkersLaunched:2 ->Sort(cost=22645.39..22645.42rows=10width=12)(actualtime=124.960..124.963rows=8loops=3) SortKey:pgbench_accounts.bid SortMethod:quicksortMemory:25kB Worker0:SortMethod:quicksortMemory:25kB Worker1:SortMethod:quicksortMemory:25kB ->PartialHashAggregate(cost=22645.13..22645.23rows=10width=12)(actualtime=124.917..124.920rows=8loops=3) GroupKey:pgbench_accounts.bid ->ParallelSeqScanonpgbench_accounts(cost=1.13..21603.46rows=208333width=8)(actualtime=0.078..83.134rows=266667loops=3) Filter:(NOT(hashedSubPlan1)) RowsRemovedbyFilter:66667 SubPlan1 ->SeqScanonpgbench_branches(cost=0.00..1.12rows=1width=4)(actualtime=0.020..0.021rows=2loops=3) Filter:(bbalance>0) RowsRemovedbyFilter:8 PlanningTime:0.310ms ExecutionTime:130.620ms (21rows) postgres=#
2.使用<>ALL
SELECT count(aid),bid FROM pgbench_accounts WHERE bid<>ALL(SELECTbidFROMpgbench_branchesWHEREbbalance>0) GROUPBY bid;
执行计划:
QUERYPLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------ FinalizeGroupAggregate(cost=259581.79..259584.32rows=10width=12)(actualtime=418.220..419.913rows=8loops=1) GroupKey:pgbench_accounts.bid ->GatherMerge(cost=259581.79..259584.12rows=20width=12)(actualtime=418.212..419.902rows=24loops=1) WorkersPlanned:2 WorkersLaunched:2 ->Sort(cost=258581.76..258581.79rows=10width=12)(actualtime=413.906..413.909rows=8loops=3) SortKey:pgbench_accounts.bid SortMethod:quicksortMemory:25kB Worker0:SortMethod:quicksortMemory:25kB Worker1:SortMethod:quicksortMemory:25kB ->PartialHashAggregate(cost=258581.50..258581.60rows=10width=12)(actualtime=413.872..413.875rows=8loops=3) GroupKey:pgbench_accounts.bid ->ParallelSeqScanonpgbench_accounts(cost=0.00..257539.83rows=208333width=8)(actualtime=0.054..367.244rows=266667loops=3) Filter:(SubPlan1) RowsRemovedbyFilter:66667 SubPlan1 ->Materialize(cost=0.00..1.13rows=1width=4)(actualtime=0.000..0.001rows=2loops=1000000) ->SeqScanonpgbench_branches(cost=0.00..1.12rows=1width=4)(actualtime=0.001..0.001rows=2loops=337880) Filter:(bbalance>0) RowsRemovedbyFilter:8 PlanningTime:0.218ms ExecutionTime:420.035ms (22rows) postgres=#
3.使用NOTEXISTS
SELECT count(aid),bid FROM pgbench_accounts WHERE NOTEXISTS(SELECTbidFROMpgbench_branchesWHEREbbalance>0ANDpgbench_accounts.bid=pgbench_branches.bid) GROUPBY bid;
执行计划:
QUERYPLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------- FinalizeGroupAggregate(cost=28327.72..28330.25rows=10width=12)(actualtime=152.024..153.931rows=8loops=1) GroupKey:pgbench_accounts.bid ->GatherMerge(cost=28327.72..28330.05rows=20width=12)(actualtime=152.014..153.917rows=24loops=1) WorkersPlanned:2 WorkersLaunched:2 ->Sort(cost=27327.70..27327.72rows=10width=12)(actualtime=147.782..147.786rows=8loops=3) SortKey:pgbench_accounts.bid SortMethod:quicksortMemory:25kB Worker0:SortMethod:quicksortMemory:25kB Worker1:SortMethod:quicksortMemory:25kB ->PartialHashAggregate(cost=27327.43..27327.53rows=10width=12)(actualtime=147.732..147.737rows=8loops=3) GroupKey:pgbench_accounts.bid ->HashAntiJoin(cost=1.14..25452.43rows=375000width=8)(actualtime=0.134..101.884rows=266667loops=3) HashCond:(pgbench_accounts.bid=pgbench_branches.bid) ->ParallelSeqScanonpgbench_accounts(cost=0.00..20560.67rows=416667width=8)(actualtime=0.032..45.174rows=333333loops=3) ->Hash(cost=1.12..1.12rows=1width=4)(actualtime=0.036..0.037rows=2loops=3) Buckets:1024Batches:1MemoryUsage:9kB ->SeqScanonpgbench_branches(cost=0.00..1.12rows=1width=4)(actualtime=0.025..0.027rows=2loops=3) Filter:(bbalance>0) RowsRemovedbyFilter:8 PlanningTime:0.322ms ExecutionTime:154.040ms (22rows) postgres=#
4.使用LEFTJOIN和ISNULL
SELECT count(aid),a.bid FROM pgbench_accountsa LEFTJOINpgbench_branchesbONa.bid=b.bidANDb.bbalance>0 WHERE b.bidISNULL GROUPBY a.bid;
执行计划:
QUERYPLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------ FinalizeGroupAggregate(cost=28327.72..28330.25rows=10width=12)(actualtime=145.298..147.096rows=8loops=1) GroupKey:a.bid ->GatherMerge(cost=28327.72..28330.05rows=20width=12)(actualtime=145.288..147.083rows=24loops=1) WorkersPlanned:2 WorkersLaunched:2 ->Sort(cost=27327.70..27327.72rows=10width=12)(actualtime=141.883..141.887rows=8loops=3) SortKey:a.bid SortMethod:quicksortMemory:25kB Worker0:SortMethod:quicksortMemory:25kB Worker1:SortMethod:quicksortMemory:25kB ->PartialHashAggregate(cost=27327.43..27327.53rows=10width=12)(actualtime=141.842..141.847rows=8loops=3) GroupKey:a.bid ->HashAntiJoin(cost=1.14..25452.43rows=375000width=8)(actualtime=0.087..99.535rows=266667loops=3) HashCond:(a.bid=b.bid) ->ParallelSeqScanonpgbench_accountsa(cost=0.00..20560.67rows=416667width=8)(actualtime=0.025..44.337rows=333333loops=3) ->Hash(cost=1.12..1.12rows=1width=4)(actualtime=0.026..0.027rows=2loops=3) Buckets:1024Batches:1MemoryUsage:9kB ->SeqScanonpgbench_branchesb(cost=0.00..1.12rows=1width=4)(actualtime=0.019..0.020rows=2loops=3) Filter:(bbalance>0) RowsRemovedbyFilter:8 PlanningTime:0.231ms ExecutionTime:147.180ms (22rows) postgres=#
NOTIN和<>ALL生成执行计划都包含了一个子查询。他们是各自独立的。
而NOTEXISTS和LEFTJOIN生成了相同的执行计划。
这些hash连接(或hashantijoin)是完成查询要求的最灵活的方式。这也是推荐exists或join的原因。因此,推荐使用exists或join的经验法则是有效的。
但是,我们继续往下看!即使有了子查询执行计划,NOTIN子句的执行时间也会更好?
是的。PostgreSQL做了出色的优化,PostgreSQL将子查询计划进行了hash处理。因此PostgreSQL对如何处理IN子句有了更好的理解,这是一种逻辑思维方式,因为很多人倾向于使用IN子句。子查询返回的行很少,但即使子查询返回几百行,也会发生同样的情况。
但是,如果子查询返回大量行(几十万行)怎么办?让我们尝试一个简单的测试:
CREATETABLEt1AS SELECT*FROMgenerate_series(0,500000)id; CREATETABLEt2AS SELECT(random()*4000000)::integerid FROMgenerate_series(0,4000000); ANALYZEt1; ANALYZEt2; EXPLAINSELECTid FROMt1 WHEREidNOTIN(SELECTidFROMt2);
执行计划:
QUERYPLAN -------------------------------------------------------------------------------- Gather(cost=1000.00..15195064853.01rows=250000width=4) WorkersPlanned:1 ->ParallelSeqScanont1(cost=0.00..15195038853.01rows=147059width=4) Filter:(NOT(SubPlan1)) SubPlan1 ->Materialize(cost=0.00..93326.01rows=4000001width=4) ->SeqScanont2(cost=0.00..57700.01rows=4000001width=4) (7rows) postgres=#
这里,执行计划将子查询进行了物化。代价评估变成了15195038853.01。(PostgreSQL的默认设置,如果t2表的行低于100k,会将子查询进行hash)。这样就会严重影响性能。因此,对于那种子查询返回的行数很少的场景,IN子句可以起到很好的作用。
其它注意点
有的!在我们用不同的方式写查询的时候,可能有数据类型的转换。
比如,语句:
EXPLAINANALYZESELECT*FROMempWHEREgen=ANY(ARRAY['M','F']);
就会发生隐式的类型转换:
SeqScanonemp(cost=0.00..1.04rows=2width=43)(actualtime=0.023..0.026rows=3loops=1) Filter:((gen)::text=ANY('{M,F}'::text[]))
这里的(gen)::text就发生了类型转换。如果在大表上,这种类型转换的代价会很高,因此,PostgreSQL对IN子句做了更好的处理。
EXPLAINANALYZESELECT*FROMempWHEREgenIN('M','F'); SeqScanonemp(cost=0.00..1.04rows=3width=43)(actualtime=0.030..0.034rows=3loops=1) Filter:(gen=ANY('{M,F}'::bpchar[]))
将IN子句转换成了ANY子句,没有对gen列进行类型转换。而是将M\F转成了bpchar(内部等价于char)
总结
简单来说,exists和直接join表通常比较好。
很多情况下,PostgreSQL将IN子句换成被hash的子计划。在一些特殊场景下,IN可以获得更好的执行计划。
以上为个人经验,希望能给大家一个参考,也希望大家多多支持毛票票。如有错误或未考虑完全的地方,望不吝赐教。