Validating a file with a tricky format using Java

Time: 2021-07-13 22:29:12

I need to parse and validate a file whose format is a little bit tricky.

Basically the file comes in this format:

   \n -- just to make clear it may have empty lines
   CLIENT_ID
   A_NUMERIC_VALUE
   ONE_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT
   ANOTHER_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT
   \n
   \n
   CLIENT_ID_2
   A_NUMERIC_VALUE_2
   ONE_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT_2
   ANOTHER_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT_2
   OHH_THIS_ONE_HAS_THREE_LINES_OF_COMMENTS

The file is rarely large: the biggest one I've ever seen was about 10 MB, and they are usually around 900 KB to 1 MB.

So I have two problems:

1) How can I validate the format of the file effectively? With regex + Scanner? (This looks very feasible to me if I can turn each client entry into a single string, so that I can apply the regex to it.)

2) I need to transform each entry in the file into a Client object. Should I validate the whole file before transforming it into Java objects, or should I validate it as I transform its entries into Java objects? (Bear in mind that if any client entry is invalid, processing halts immediately and an exception is thrown, so any objects already created are discarded.)

I'm really keen to see your suggestions about question #1. Question #2 is more a curiosity of mine on how you would handle this situation. Ignore #2 if you will, but please answer #1 =)

By the way, does anyone know of a framework that could help me handle the file?

Thanks.

Update:

I saw this question and the problem is very similar to mine, but I'm not sure regex is the best way to solve it. The file may contain quite a lot of "\n", a varying number of comments per client entry, and an optional ID, so the regex would have to be quite complex. That's why I mentioned turning each entry into a single row in question #1: a regex to validate that would be much easier to write. Still, this solution doesn't sound very elegant to me :(

Cheers.

1 solution

#1


If you intend to fail the batch if any part is found invalid, then validate the file first.

There are several advantages. One is that validation and processing need not be synchronous. If, for example, you process batches daily but receive files throughout the day, you can validate each file as it arrives and notify the sender so that problems are corrected before your scheduled processing. Another is that checking whether a file is well-formed is very fast.

A short, simple Perl script would certainly do the job. If I understand the pattern correctly, there is no need to transform the data, and the file can be read strictly forward:

read past any newlines
read and validate a client id
read and validate a numeric value
read and validate one or more comments until a blank line is found
repeat the above four steps until EOF or invalid data detected
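
The answer suggests Perl, but since the question is about Java, here is a rough sketch of the same five steps in Java. Several things in it are assumptions, because the post does not give them: the class name `ClientFileValidator`, the pattern for a client id (letters, digits, underscores), the pattern for the numeric value (an optionally signed decimal), and the rule that any non-blank line counts as a comment. It also assumes the id is always present; the optional ID mentioned in the update would need an extra rule to tell an id apart from a comment.

```java
import java.util.List;
import java.util.regex.Pattern;

final class ClientFileValidator {
    // Assumed formats: the original post does not specify them.
    private static final Pattern CLIENT_ID = Pattern.compile("[A-Za-z0-9_]+");
    private static final Pattern NUMERIC = Pattern.compile("-?\\d+(\\.\\d+)?");

    /**
     * Walks the lines strictly forward and returns the number of well-formed
     * client entries. Throws IllegalArgumentException at the first invalid line.
     */
    static int validate(List<String> lines) {
        final int n = lines.size();
        int i = 0;
        int entries = 0;
        while (true) {
            // Step 1: read past any blank lines.
            while (i < n && lines.get(i).trim().isEmpty()) {
                i++;
            }
            if (i == n) {
                return entries; // EOF reached between entries: the file is valid.
            }
            // Step 2: read and validate a client id.
            require(CLIENT_ID, lines, i++, "client id");
            // Step 3: read and validate a numeric value.
            require(NUMERIC, lines, i++, "numeric value");
            // Step 4: read one or more comments until a blank line (or EOF).
            int comments = 0;
            while (i < n && !lines.get(i).trim().isEmpty()) {
                i++;
                comments++;
            }
            if (comments == 0) {
                throw new IllegalArgumentException(
                        "line " + (i + 1) + ": expected at least one comment");
            }
            entries++;
            // Step 5: loop back and repeat until EOF or invalid data.
        }
    }

    // Checks that line i exists and matches the pattern; reports 1-based line numbers.
    private static void require(Pattern p, List<String> lines, int i, String what) {
        if (i >= lines.size() || !p.matcher(lines.get(i).trim()).matches()) {
            String found = i < lines.size()
                    ? "'" + lines.get(i).trim() + "'"
                    : "end of file";
            throw new IllegalArgumentException(
                    "line " + (i + 1) + ": expected " + what + ", found " + found);
        }
    }
}
```

With files of the size described in the question (around 1 MB, at most about 10 MB), reading everything up front with `java.nio.file.Files.readAllLines` and passing the list to `validate` should be fine. Building the Client objects could then be a second pass over the same list once validation has succeeded.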
