Explanation and Solutions for an Error When Migrating SQL Server Data to PostgreSQL

2024-04-08 03:44:03

Reproducing the problem:

1. PG client:

postgres=# create table text_test(id int, info text);
CREATE TABLE
postgres=# insert into text_test values (1, E'\0x00');
ERROR:  invalid byte sequence for encoding "UTF8": 0x00

2. Data generated in SQL Server:

create table test_varchar(id int, name varchar(20));
insert into test_varchar values (1, 'name' + char(0));
insert into test_varchar values (1, 'name' + '');

A Java program then reads this data and inserts it into PG, which produces the same error:

invalid byte sequence for encoding "UTF8": 0x00

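At first we assumed that the error came from the GB2312-to-UTF8 conversion, that is, that some character could not be converted. Looking it up, UTF8 is a variable-length encoding that uses 1 to 6 bytes in its original definition (RFC 3629 later limited it to at most 4 bytes). Its encoding rules are as follows: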

Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
7     U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
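
The first row of the table above can be checked with a short sketch. The helper name `utf8_length` and the `LIMITS` list are hypothetical, introduced only for this illustration; the helper applies the upper bounds from the table, and the last lines ask Python's own UTF-8 codec how it treats U+0000.

```python
# Upper bound of each row in the table, paired with its byte count.
# LIMITS and utf8_length are hypothetical names used only for illustration.
LIMITS = [
    (0x7F, 1),
    (0x7FF, 2),
    (0xFFFF, 3),
    (0x1FFFFF, 4),
    (0x3FFFFFF, 5),
    (0x7FFFFFFF, 6),
]


def utf8_length(code_point: int) -> int:
    """Return the number of bytes the table assigns to a code point."""
    for last_code_point, n_bytes in LIMITS:
        if code_point <= last_code_point:
            return n_bytes
    raise ValueError("code point out of range")


# U+0000 falls in the first row, so it is encoded as the single byte 0x00.
nul_bytes = "\x00".encode("utf-8")
print(utf8_length(0x00), nul_bytes)

# Python's decoder accepts that byte, so 0x00 on its own is valid UTF-8.
print(nul_bytes.decode("utf-8") == "\x00")
```

So the encoding rules alone do not explain why PG rejects the byte.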

Yet 0x00 is valid under the UTF8 rules, which puzzled us. We then found two pieces of information that pinned down the problem:

1.

PostgreSQL doesn't support storing NULL (\0x00) characters in text fields (this is obviously different from the database NULL value, which is fully supported).

If you need to store the NULL character, you must use a bytea field - which should store anything you want, but won't support text operations on it.

Given that PostgreSQL doesn't support it in text values, there's no good way to get it to remove it. You could import your data into bytea and later convert it to text using a special function (in perl or something, maybe?), but it's likely going to be easier to do that in preprocessing before you load it.

Source: http://stackoverflow.com/questions/1347646/postgres-error-on-insert-error-invalid-byte-sequence-for-encoding-utf8-0x0

2.

Terminating character | Indicated by
Tab | \t (this is the default field terminator)
Newline character | \n (this is the default row terminator)
Carriage return/line feed | \r
Backslash | \\
Null terminator (nonvisible terminator) | \0
Any printable character (control characters are not printable, except null, tab, newline, and carriage return) | (*, A, t, l, and so on)
String of up to 10 printable characters, including some or all of the terminators listed earlier | (**\t**, end, !!!!!!!!!!, \t—\n, and so on)

Source: http://msdn.microsoft.com/en-us/library/ms191485.aspx

From this we concluded that PG and SQL Server handle the NUL character (0x00) differently, and that this difference is what triggers the error.

The PG code responsible is the following (pg_verify_mbstr_len in src/backend/utils/mb/wchar.c):

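/* single-byte (ASCII) character: high bit not set */
if (!IS_HIGHBIT_SET(*mbstr))
{
    if (*mbstr != '\0')
    {
        /* ordinary ASCII byte: count it and move on */
        mb_len++;
        mbstr++;
        len--;
        continue;
    }
    /* a 0x00 byte is rejected */
    if (noError)
        return -1;
    report_invalid_encoding(encoding, mbstr, len);
}

#define IS_HIGHBIT_SET(ch) ((unsigned char)(ch) & HIGHBIT)
#define HIGHBIT (0x80)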

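report_invalid_encoding is the function that reports the error, namely:

invalid byte sequence for encoding "UTF8": 0x00

The real cause is this: when *mbstr is 0x00, !IS_HIGHBIT_SET(*mbstr) is true and the branch is entered. The code then checks whether *mbstr is \0; because it is, and unless noError is set, execution goes straight to report_invalid_encoding and the error is raised.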
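
The effect of that branch can be mimicked in a few lines of Python. This is only a rough model of the single-byte branch quoted above, not a port of pg_verify_mbstr_len: the function name `first_rejected_nul` is hypothetical, and bytes with the high bit set are simply skipped rather than validated as multibyte sequences.

```python
HIGHBIT = 0x80


def first_rejected_nul(data: bytes) -> int:
    """Return the offset of the first 0x00 byte, or -1 if there is none.

    Models only the single-byte branch: a byte without the high bit set is
    accepted unless it is 0x00. Bytes with the high bit set are skipped here,
    whereas the real function validates them as multibyte sequences.
    """
    for offset, byte in enumerate(data):
        if not (byte & HIGHBIT) and byte == 0:
            return offset
    return -1


print(first_rejected_nul("name".encode("utf-8")))      # -1: nothing rejected
print(first_rejected_nul("name\x00".encode("utf-8")))  # 4: the trailing 0x00
```

The second call corresponds to the 'name' + char(0) row created in SQL Server.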

So the problem arises because PG and SQL Server handle the NUL character differently.

Solutions:

1. Fix the source data in SQL Server:

UPDATE: This seems to work:

Select * from TABLE
where UNICODE(SUBSTRING(naughtyField, LEN(naughtyField), 1)) = 0

So:

Update TABLE
SET naughtyField = SUBSTRING(naughtyField, 1, LEN(naughtyField) - 1)
where UNICODE(SUBSTRING(naughtyField, LEN(naughtyField), 1)) = 0

Source: http://stackoverflow.com/questions/3533320/sql-server-remove-end-string-character-0-from-data

2. Modify the application so that it converts the data after reading it from SQL Server. This achieves the same result as the first approach.
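
The conversion in the second approach can be sketched as follows. The article's program is written in Java; Python is used here only to keep the illustration short, and the helper names `strip_nul` and `clean_row` are hypothetical. Unlike the UPDATE statement above, which removes only a trailing character, this sketch drops every 0x00 character in a string, which is an assumption about what the application should do.

```python
from typing import Any, Sequence, Tuple


def strip_nul(value: Any) -> Any:
    """Remove NUL characters from strings; leave other values untouched."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    return value


def clean_row(row: Sequence[Any]) -> Tuple[Any, ...]:
    """Apply strip_nul to every column of a row fetched from SQL Server."""
    return tuple(strip_nul(column) for column in row)


# Rows shaped like the test_varchar example: (id, name)
rows = [(1, "name\x00"), (1, "name"), (2, None)]
cleaned = [clean_row(row) for row in rows]
print(cleaned)
```

The cleaned tuples could then be passed as parameters to the INSERT statement issued against PG.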
