Explaining and Fixing an Error When Migrating Data from SQL Server to PostgreSQL
Reproducing the problem:
1. In the PostgreSQL client:
postgres=# create table text_test(id int, info text);
CREATE TABLE
postgres=# insert into text_test values(1, E'\0x00');
ERROR:  invalid byte sequence for encoding "UTF8": 0x00
2. Generate the data in SQL Server:
create table test_varchar(id int, name varchar(20));
insert into test_varchar values(1, 'name' + char(0));
insert into test_varchar values(1, 'name' + '');
Then fetch the data with a Java program and insert it into PostgreSQL. This produces the same error:
invalid byte sequence for encoding "UTF8": 0x00
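At first we assumed the error came from a failed conversion from GB2312 to UTF-8. On checking, UTF-8 is a variable-length encoding that, in its original definition, uses 1 to 6 bytes per character (the current standard, RFC 3629, restricts it to at most 4 bytes). Its encoding rules are as follows: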
Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
7     U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
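The encoding rules can be checked with a small sketch. The function below is a hypothetical helper (the name utf8_encode_original is not from any library) that follows the 1-to-6-byte scheme literally. It is written in Python purely for illustration and shows that code point 0 encodes to the single byte 0x00.

```python
def utf8_encode_original(cp: int) -> bytes:
    """Encode one code point using the original 1-to-6-byte UTF-8 scheme."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:
        return bytes([cp])  # single byte: 0xxxxxxx
    # (last code point, total bytes, lead-byte prefix) for each longer form
    for last, nbytes, prefix in (
        (0x7FF, 2, 0xC0),
        (0xFFFF, 3, 0xE0),
        (0x1FFFFF, 4, 0xF0),
        (0x3FFFFFF, 5, 0xF8),
        (0x7FFFFFFF, 6, 0xFC),
    ):
        if cp <= last:
            tail = []
            for _ in range(nbytes - 1):
                tail.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
                cp >>= 6
            return bytes([prefix | cp] + tail[::-1])
    raise AssertionError("unreachable")


print(utf8_encode_original(0x00).hex())    # 00
print(utf8_encode_original(0x4E2D).hex())  # e4b8ad
```

For code points that modern UTF-8 also covers, the result matches Python's built-in str.encode("utf-8").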
Yet 0x00 is perfectly valid under these UTF-8 rules, which surprised us. We then found two pieces of information that pinned down the problem:
1.
PostgreSQL doesn't support storing NULL (\0x00) characters in text fields (this is obviously different from the database NULL value, which is fully supported).
If you need to store the NULL character, you must use a bytea field - which should store anything you want, but won't support text operations on it.
Given that PostgreSQL doesn't support it in text values, there's no good way to get it to remove it. You could import your data into bytea and later convert it to text using a special function (in perl or something, maybe?), but it's likely going to be easier to do that in preprocessing before you load it.
Source: http://stackoverflow.com/questions/1347646/postgres-error-on-insert-error-invalid-byte-sequence-for-encoding-utf8-0x0
2.
Tab: \t (This is the default field terminator.)
Newline character: \n (This is the default row terminator.)
Carriage return/line feed: \r
Backslash: \\
Null terminator (nonvisible terminator): \0
Any printable character (control characters are not printable, except null, tab, newline, and carriage return): (*, A, t, l, and so on)
String of up to 10 printable characters, including some or all of the terminators listed earlier: (**\t**, end, !!!!!!!!!!, \t—\n, and so on)
Source: http://msdn.microsoft.com/en-us/library/ms191485.aspx
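From this we concluded that PostgreSQL and SQL Server handle the null character (0x00) differently: SQL Server accepts it inside a varchar value, while PostgreSQL rejects it in text values. That is why the error appears here.
The PostgreSQL code responsible is the following (pg_verify_mbstr_len in src/backend/utils/mb/wchar.c):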
if (!IS_HIGHBIT_SET(*mbstr))
{
    if (*mbstr != '\0')
    {
        mb_len++;
        mbstr++;
        len--;
        continue;
    }
    if (noError)
        return -1;
    report_invalid_encoding(encoding, mbstr, len);
}

#define IS_HIGHBIT_SET(ch) ((unsigned char)(ch) & HIGHBIT)
#define HIGHBIT (0x80)
The report_invalid_encoding function reports the error message, that is:
invalid byte sequence for encoding "UTF8": 0x00
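The part that actually triggers the problem is this:
When *mbstr is 0x00, !IS_HIGHBIT_SET(*mbstr) is true, so the outer branch is entered. The inner test *mbstr != '\0' then fails, so the code skips the continue and, unless noError is set (in which case it returns -1), calls report_invalid_encoding, which raises the error.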
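The effect of this check can be modeled in a few lines. The sketch below is a simplified Python illustration, not PostgreSQL's implementation: the function name verify_text_like_pg is hypothetical, well-formedness checking is delegated to Python's own UTF-8 decoder, and only the rejection of the zero byte mirrors the logic described here.

```python
def verify_text_like_pg(data: bytes) -> int:
    """Return the character count, or raise ValueError.

    Simplified model: a zero byte is rejected even though it is
    well-formed UTF-8; everything else is left to Python's decoder.
    """
    nul_at = data.find(b"\x00")
    if nul_at != -1:
        raise ValueError(
            f'invalid byte sequence for encoding "UTF8": 0x00 (offset {nul_at})'
        )
    # Raises UnicodeDecodeError (a ValueError subclass) on malformed input.
    return len(data.decode("utf-8"))


sample = "name\x00".encode("utf-8")
sample.decode("utf-8")  # succeeds: the bytes are valid UTF-8
try:
    verify_text_like_pg(sample)
except ValueError as exc:
    print(exc)  # invalid byte sequence for encoding "UTF8": 0x00 (offset 4)
```

The point of the demonstration is that the same bytes pass a plain UTF-8 decoder and still fail a verifier that treats 0x00 as invalid.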
So the root cause is that PostgreSQL and SQL Server treat the null character differently.
Solutions:
1. Fix the source data in SQL Server:
UPDATE: This seems to work:
Select * from TABLE
where UNICODE(SUBSTRING(naughtyField, LEN(naughtyField), 1)) = 0
So:
Update TABLE
SET naughtyField = SUBSTRING(naughtyField, 1, LEN(naughtyField) - 1)
where UNICODE(SUBSTRING(naughtyField, LEN(naughtyField), 1)) = 0
Source: http://stackoverflow.com/questions/3533320/sql-server-remove-end-string-character-0-from-data
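2. Modify the application: after it reads the data from SQL Server, convert the data (remove the null characters) before inserting it into PostgreSQL. This achieves the same result as the first approach.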
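A minimal sketch of the application-side cleanup follows. The migration program described above was written in Java; Python is used here only for illustration, and the helper names strip_nul and sanitize_row as well as the sample rows are assumptions rather than part of the original program.

```python
from typing import Any, Sequence


def strip_nul(value: Any) -> Any:
    """Remove every 0x00 character from string values; leave other types alone."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    return value


def sanitize_row(row: Sequence[Any]) -> tuple:
    """Clean one fetched row before it is passed to the PostgreSQL insert."""
    return tuple(strip_nul(v) for v in row)


# Hypothetical rows as they might arrive from the SQL Server test table.
rows_from_sql_server = [(1, "name\x00"), (1, "name")]
clean_rows = [sanitize_row(r) for r in rows_from_sql_server]
print(clean_rows)  # [(1, 'name'), (1, 'name')]
```

Unlike the SQL update in the first approach, which only trims a null character at the end of the value, this version removes null characters at any position. Whether that is appropriate depends on the data.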