MS Word splits words in its XML format

Time: 2023-01-14 08:12:06

I have a Word 2003 document saved as XML in WordprocessingML format. It contains a few placeholders that will be dynamically replaced with the appropriate content. The problem is that Word, seemingly at random, splits them into separate pieces. For example, instead of this:

<w:t>${dl.d.out.ecs_rev}</w:t>

I have this:

...
  <w:t>${</w:t>
</w:r>
<w:r wsp:rsidR="005D11C0">
  <w:rPr>
    <w:sz w:val="20" />
    <w:sz-cs w:val="20" />
  </w:rPr>
  <w:t>dl.</w:t>
</w:r>
<w:r wsp:rsidRPr="00696324">
  <w:rPr>
    <w:sz w:val="20" />
    <w:sz-cs w:val="20" />
  </w:rPr>
  <w:t>d.out.ecs_rev}</w:t>
...

Is there any way to save a "clean" XML document from Word 2003, or is there an existing solution that can do the cleaning?

I tried writing a Java method that concatenates the separated parts of the placeholders, but because the number of possible split combinations is relatively large, that algorithm ends up far more complex than the original task I have to do, so it becomes a problem in itself.

3 solutions

#1


2  

You can use Aspose.Words and invoke this:

Document.JoinRunsWithSameFormatting.

http://www.aspose.com/documentation/.net-components/aspose.words-for-.net-and-java/aspose.words.document.joinrunswithsameformatting.html

#2


3  

If you have control over the original Word documents, you can stop Word from inserting rsid attributes and from marking grammar/spelling errors.

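    // wordApp is assumed to be an existing Word.Application instance.
    Word.Options opts = wordApp.Options;
    opts.CheckGrammarAsYouType = false;
    opts.CheckGrammarWithSpelling = false;
    opts.CheckSpellingAsYouType = false;
    // Stop Word from writing rsid attributes when the document is saved.
    opts.StoreRSIDOnSave = false;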

Words will still get split if, for example, you change the font partway through the word.

Hmm, I have a simple but ugly bit of XSLT that I've used to clean WordML like the example you posted. I could commit it to docx4j if you want it, but as you say, there are various combinations it wouldn't cover. Anyway, if you want it, please post to the docx4j forum.

A more robust approach would be to extract the plain text and relate it back to the XML, so you can search the plain text and go from there to the XML.
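The plain-text idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from docx4j or any other library: the class `PlaceholderJoiner` and its `join` method are hypothetical names, the input is assumed to be the text of each w:t element of a single paragraph in document order, and placeholders are assumed to look like `${...}` with no nested braces. The sketch concatenates the run texts, records where each run starts, finds placeholders in the concatenated text, and then moves every run boundary that falls inside a placeholder to the end of that placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderJoiner {
    // Assumption: placeholders have the form ${...} and contain no '}' inside.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{[^}]*\\}");

    /**
     * Takes the text of each run of one paragraph, in document order, and
     * returns the same number of strings, with every placeholder moved
     * entirely into the run in which it starts.
     */
    public static List<String> join(List<String> runs) {
        // Concatenate the run texts and remember the offset where each run starts.
        StringBuilder plain = new StringBuilder();
        int[] start = new int[runs.size() + 1];
        for (int i = 0; i < runs.size(); i++) {
            start[i] = plain.length();
            plain.append(runs.get(i));
        }
        start[runs.size()] = plain.length();

        // cut[i] is the (possibly moved) boundary between run i-1 and run i.
        int[] cut = start.clone();
        Matcher m = PLACEHOLDER.matcher(plain);
        while (m.find()) {
            for (int i = 1; i < runs.size(); i++) {
                // A boundary strictly inside a placeholder is pushed to its end.
                if (start[i] > m.start() && start[i] < m.end()) {
                    cut[i] = m.end();
                }
            }
        }

        // Re-slice the plain text along the adjusted boundaries.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < runs.size(); i++) {
            result.add(plain.substring(cut[i], cut[i + 1]));
        }
        return result;
    }
}
```

For the three runs in the question ("${", "dl." and "d.out.ecs_rev}"), the first returned string holds the whole placeholder and the other two are empty. Writing the strings back into the corresponding w:t elements is left out; the placeholder then takes the formatting of the run in which it starts, which may or may not be what you want.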

#3


1  

Word 2003 XML is unusually complex and hard to decode. The reason you are getting multiple tags is that WordML splits text into elements called runs (the w:r tag). As far as I know, there is no easy way to clean the XML above. I would recommend using HTML instead of WordML; it is much easier to manipulate and to replace your placeholders with the appropriate content. If cost is not an issue, use a product like Aspose. It does everything for you and is simple to use.
