Unable to import CSV into Postgres (unexpected character found at location 4194303)

Date: 2022-12-25 23:06:23

When I try to import a CSV into my Redshift database, I get this error:

Missing newline: Unexpected character 0x75 found at location 4194303                                

Everything seems to be fine with the CSV itself. The stl_load_errors table tells me the error is on line 70269 of the CSV, which contains this string:

10:00:10,2014-07-28,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0),Not Listed,Not Listed,Not Listed,Not Listed,multiRetrieve,Not Listed,OS-Preview-logItemUsage,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,"[{""PubEndDate""=>""2013/12/31"", ""ItmId""=>""1353296053"", ""SourceType""=>""Scholarly Journals"", ""ReasonCode""=>""Free"", ""MyResearchUser""=>""246763"", ""ProjectCode""=>"""", ""PublicationCode""=>"""", ""PubStartDate""=>""2013/01/01"", ""ItmFrmt""=>""AbstractPreview"", ""Subrole""=>""AbstractPreview"", ""PaymentType""=>""Transactional"", ""UsageInfo""=>""P-1008275-154977-CUSTOMER-10000137-2950635"", ""Role""=>""AbstractPreview"", ""RetailPrice""=>0, ""EffectivePrice""=>0, ""ParentItemId""=>""53628""}]","[""optype:Online"", ""location:null"", ""target:null""]",192.234.111.8,DIALOG,20140728131712007:882391,1119643,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,Not Listed,"2014-07-28 10:00:10-0400,421 {""Items"":[{""PubEndDate"":""2013/12/31"",""ItmId"":""1353296053"",""SourceType"":""Scholarly Journals"",""ReasonCode"":""Free"",""MyResearchUser"":""246763"",""ProjectCode"":"""",""PublicationCode"":"""",""PubStartDate"":""2013/01/01"",""ItmFrmt"":""AbstractPreview"",""Subrole"":""AbstractPreview"",""PaymentType"":""Transactional"",""UsageInfo"":""P-1008275-154977-CUSTOMER-10000137-2950635"",""Role"":""AbstractPreview"",""RetailPrice"":0,""EffectivePrice"":0,""ParentItemId"":""53628""}],""Operation"":[""optype:Online"",""location:null"",""target:null""],""UserAgent"":""Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"",""UserInfo"":{""IP"":""192.234.111.8"",""AppId"":""DIALOG"",""SessId"":""20140728131712007:882391"",""UsageGroupId"":""1119643""},""UsageType"":""multiRetrieve"",""BreadCrumb"":""OS-Preview-logItemUsage""}

Any ideas why it won't load?

EDIT: It clearly has to do with the number 4194303. Many of my Redshift uploads have failed; here is a brief sample from my stl_load_errors:

Missing newline: Unexpected character 0x3a found at location 4194303                                
Missing newline: Unexpected character 0x63 found at location 4194303                                
Missing newline: Unexpected character 0x6c found at location 4194303                                
Missing newline: Unexpected character 0x22 found at location 4194303                                
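
For reference, a sample like this can be pulled straight from the system table with a minimal query (the columns shown are standard stl_load_errors columns):

SELECT starttime, filename, line_number, colname, err_code, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;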

All of the columns in the table where these errors occur are of type 'text', and there are about 30 columns. The CSV itself contains many thousands of records (it's quite a large file).

WORKAROUND (not a solution)

I've found that the number 4194303 comes from the 4 MB limit set by the TRUNCATECOLUMNS feature of Redshift's COPY. By disabling this feature, I instead get a "String length exceeds DDL length" error (which is why I use TRUNCATECOLUMNS in the first place).

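Incidentally, 4194303 is exactly one byte short of 4 MB, which is easy to verify:

SELECT 4 * 1024 * 1024 - 1 AS max_offset;  -- returns 4194303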

So the problem is that many of my records are over 4 MB, and Redshift does not support such records if any of the attributes need to be truncated.

However, by using the MAXERROR 1000 option of the COPY command, I am able to ignore the 4 MB+ records and end up with a table that contains only the rows I wanted, i.e. those under 4 MB.

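A minimal sketch of such a COPY command, assuming the file lives in S3 and the cluster has a role that can read it (the table name, bucket path, and IAM role below are placeholders):

COPY my_table
FROM 's3://my-bucket/path/my_file.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
TRUNCATECOLUMNS
MAXERROR 1000;

With MAXERROR 1000, COPY keeps going past the oversized rows instead of aborting on the first failure.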

2 Answers

#1



Can you try your COPY command with the options below added?

ACCEPTINVCHARS ESCAPE

Sometimes CSV files created on Mac or Windows contain special characters.

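For example, a sketch only (note that in Redshift COPY, ESCAPE applies to delimiter-separated input and can't be combined with the CSV keyword; the table, bucket, and role below are placeholders):

COPY my_table
FROM 's3://my-bucket/path/my_file.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER ','
ACCEPTINVCHARS
ESCAPE;

ACCEPTINVCHARS replaces invalid UTF-8 bytes with a placeholder character ('?' by default) instead of failing the row.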

#2



The issue is with the EOL (end-of-line) character. I had the same problem today, and it was that my CSV had Mac EOLs (probably a bare CR). I changed it to Unix line endings (which use LF) and the copy went through.
