Replace or remove new lines in a CSV file using PHP, but only between single or double quotes

Time: 2022-09-15 15:18:49

I have a CSV file that holds about 200,000 - 300,000 records. Most of the records can be separated and inserted into a MySQL database with a simple

$line = explode("\n", $fileData);

and then the values separated with

$lineValues = explode(',', $line);

and then inserted into the database using the proper data type, i.e. int, float, string, text, etc.

However, some of the records have a text column whose string contains a \n, which breaks the $line = explode("\n", $fileData); method. Each line of data that needs to be inserted into the database has approximately 216 columns. Not every line has a field with a \n in it, but whenever a \n does occur, it is enclosed between a pair of single quotes (').

each line is set up in the following format:

id,data,data,data,text,more data

example:

1,0,0,0,'Hello World',0
2,0,0,0,'Hello
    World',0
3,0,0,0,'Hi',0
4,0,0,0,,0

As you can see from the example, most records can be easily split with the methods shown above. It's the second record in the example that causes the problem.

Line breaks are only \n; the file does not contain \r at all.

5 solutions

#1


1  

If the csv data is in a file, you can just use fgetcsv() as others have pointed out. fgetcsv handles embedded newlines correctly.
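
As a rough sketch of that approach: fgetcsv() expects double quotes as the enclosure by default, so for data quoted like the question's example the single quote probably has to be passed explicitly. The temporary file and its contents below are made up for illustration and only mirror the sample rows from the question.

```php
<?php
// Hypothetical sample file built from the question's example data.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents(
    $path,
    "1,0,0,0,'Hello World',0\n2,0,0,0,'Hello\n    World',0\n3,0,0,0,'Hi',0\n4,0,0,0,,0\n"
);

$rows = [];
$handle = fopen($path, 'r');

// Arguments: handle, length (0 = no limit), separator, enclosure, escape.
// The enclosure is set to a single quote to match the question's data.
while (($row = fgetcsv($handle, 0, ',', "'", '\\')) !== false) {
    $rows[] = $row;
}

fclose($handle);
unlink($path);

// Each element of $rows is one logical record, even when the quoted
// text field spans two physical lines.
echo count($rows), " records\n";
```

The newline inside the second record stays part of its text field, so it can be kept, replaced, or removed with str_replace() on that one field before inserting into the database.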

However if your csv data is in a string (like $fileData in your example) the following method may be useful as str_getcsv() only works on a row at a time and cannot split a whole file into records.

You can detect the embedded newlines by counting the quotes in each line. If there are an odd number of quotes, you have an incomplete line, so concatenate this line with the following line. Once you have an even number of quotes, you have a complete record.

Once you have a complete record, split it at the quotes (again using explode()). Counting from zero, odd-numbered chunks are quoted (thus embedded commas are not special) and even-numbered chunks are not.

Example:

# Split file into physical lines (records may span lines)
$lines = explode("\n", $fileData);

# Re-assemble records
$records = array ();
$record = '';
$lineSep = '';
foreach ($lines as $line) {
  # Escape @ symbol so we can use it as a marker (as it does not conflict with
  # any special CSV character.)
  $line = str_replace('@', '@a', $line);

  # Escape commas as we don't yet know which ones are separators
  $line = str_replace(',', '@c', $line);

  # Escape quotes in a form that uses no special characters
  $line = str_replace("\\'", '@q', $line);
  $line = str_replace('\\', '@b', $line);

  $record .= $lineSep . $line;
  $lineSep = "\n";

  # Must have an even number of quotes in a complete record!
  if (substr_count($record, "'") % 2 == 0) {
    $records[] = $record;
    $record = '';
    $lineSep = '';
  }
}
if (strlen($record) > 0) {
  $records[] = $record;
}

$rows = array ();

foreach ($records as $record) {
  $chunks_in = explode("'", $record);
  $chunks_out = array ();

  # Decode escaped quotes/backslashes.
  # Decode field-separating commas (unless quoted)
  foreach ($chunks_in as $i => $chunk) {
    # Unescape quotes & backslashes
    $chunk = str_replace('@q', "'", $chunk);
    $chunk = str_replace('@b', '\\', $chunk);
    if ($i % 2 == 0) {
      # Unescape commas
      $chunk = str_replace('@c', ',', $chunk);
    }
    $chunks_out[] = $chunk;
  }

  # Join back together, discarding unescaped quotes
  $record = join('', $chunks_out);

  $chunks_in = explode(',', $record);
  $row = array ();
  foreach ($chunks_in as $chunk) {
    $chunk = str_replace('@c', ',', $chunk);
    $chunk = str_replace('@a', '@', $chunk);
    $row[] = $chunk;
  }
  $rows[] = $row;
}

#2


3  

The other advice here is, of course, valid, especially if you aim to write your own CSV parser. However, if you just want to get the data out, use the fgetcsv() function and don't worry about implementation details.
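
If the data is already in a string, like $fileData in the question, one possible way to still use fgetcsv() is to copy the string into an in-memory stream first. This is only a sketch: the sample string is invented from the question's example, and the single-quote enclosure is an assumption based on that example.

```php
<?php
// $fileData stands in for the string from the question (sample data only).
$fileData = "1,0,0,0,'Hello World',0\n2,0,0,0,'Hello\n    World',0\n3,0,0,0,'Hi',0\n";

// php://temp is a read/write stream kept in memory (spilling to a
// temporary file if it grows large), so no real file is needed.
$stream = fopen('php://temp', 'r+');
fwrite($stream, $fileData);
rewind($stream);

$records = [];
while (($fields = fgetcsv($stream, 0, ',', "'", '\\')) !== false) {
    $records[] = $fields;
}
fclose($stream);

echo count($records), " records\n";
```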

#3


1  

How about manually iterating through the data, from start to finish, with a for-loop or two? It's slower than explode(), but it's easier to get consistent and reliable results regarding quotes.

If you choose this method, remember to take escaped quotes into account.
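
A minimal sketch of such a loop might look like the following. The function name parseQuotedCsv is hypothetical, and the code assumes that quotes are escaped with a backslash (\'), which the question does not actually state.

```php
<?php
// Hypothetical helper: walks the data one character at a time and returns
// an array of records, each an array of field strings.
function parseQuotedCsv(string $data, string $quote = "'"): array
{
    $records = [];
    $record = [];
    $field = '';
    $inQuotes = false;
    $length = strlen($data);

    for ($i = 0; $i < $length; $i++) {
        $char = $data[$i];

        if ($char === '\\' && $i + 1 < $length) {
            // Assumed escape style: a backslash makes the next character literal.
            $field .= $data[++$i];
        } elseif ($char === $quote) {
            // An unescaped quote toggles quoted mode and is not kept.
            $inQuotes = !$inQuotes;
        } elseif ($char === ',' && !$inQuotes) {
            // Field separator outside quotes: close the current field.
            $record[] = $field;
            $field = '';
        } elseif ($char === "\n" && !$inQuotes) {
            // Record separator outside quotes: close the current record.
            $record[] = $field;
            $records[] = $record;
            $record = [];
            $field = '';
        } else {
            // Ordinary character, or a comma/newline inside quotes.
            $field .= $char;
        }
    }

    // Flush the last record when the data does not end with a newline.
    if ($field !== '' || $record !== []) {
        $record[] = $field;
        $records[] = $record;
    }

    return $records;
}

$rows = parseQuotedCsv("2,0,0,0,'Hello\n    World',0\n3,0,0,0,'It\\'s, ok',0");
echo count($rows), " records\n";
```

Tracking a single $inQuotes flag is what keeps commas and newlines inside a quoted field from being treated as separators.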

#4


0  

If you could be guaranteed that each new line beginning with a number is a valid new-line (i.e. not in the middle of a text description) then you could try something like the below:

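// $data holds the whole file contents as one string.
// Replace every new-line + id pattern with new-line + "0" + id
$data = preg_replace('/\n(\d)/', "\n0$1", $data);

// Split on new-line then digit (this consumes the "0" added above)
$linevalues = preg_split("/\n\d/", $data);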

The first step finds every new-line that is followed by a digit and prepends "0" to that digit. The second step splits wherever it finds a new-line followed by a digit.

The "0" is added to the front of the id because preg_split removes the characters it matches, so the padding digit is discarded instead of the first digit of the id.

As I say, this will only work if you're sure that the text which breaks a line won't start a new line with a number.
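
Under the same assumption, a lookahead may be a simpler way to get the same split in one step, since it tests for the digit without consuming it. This is a sketch of a variant, not the method described above, and the sample string is invented from the question's example.

```php
<?php
// Sample data; the embedded line break is followed by spaces, not a digit.
$data = "1,0,0,0,'Hello World',0\n2,0,0,0,'Hello\n    World',0\n3,0,0,0,'Hi',0";

// (?=\d) checks that a digit follows the newline without consuming it,
// so each record keeps its full id and no "0" padding is needed.
$records = preg_split('/\n(?=\d)/', $data);

echo count($records), " records\n";
```

Like the two-step version, this misfires if a text field continues on a new line that happens to start with a digit.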

#5


0  

Use fgetcsv and it'll take care of all of that for you, unless there's some overriding reason you need to have your own CSV parser.
