Removing special characters and doing search-and-replace with XSLT

Time: 2022-02-28 17:07:34

I'm trying to replace certain strings of text and then remove all RTF tags from the same text string.

So the initial value is:

<test>
<data>{\rtf1\ansi\ansicpg1252\uc1\htmautsp\deff2{\fonttbl{\f0\fcharset0     Times New Roman;}{\f2\fcharset0 Segoe UI;}{\f3\fcharset0 arial;}}{\colortbl\red0\green0\blue0;\red255\green255\blue255;}\loch\hich\dbch\pard\plain\ltrpar\itap0{\lang1033\fs16\f3\cf0 \cf0\ql{\ql{{\ltrch Ingredients: roast British chicken breast \'b7 chicken stock mayo and smoked  \'b7 prawns with mayo on malted brown bread \'b7 smoked British ham with mustard mayo on oatmeal bread \'b7 .}\li0\ri0\sa0\sb0\fi0\ql\par}
{{\ltrch }{\ltrch }{\ltrch  }\li0\ri0\sa0\sb0\fi0\ql\par}
{{\ltrch roast British chicken breast \'b7 chicken stock mayo and smoked  : Chicken Breast (25.89%) \'b7 }{\ltrch {\b Wheatflour}}{\ltrch  contains }{\ltrch {\b Gluten}}{\ltrch  (with Wheatflour \'b7 Calcium Carbonate \'b7 Iron \'b7 Niacin \'b7 Thiamin) \'b7 Water \'b7 Pork (10.32%) \'b7 Malted }{\ltrch {\b Wheatflakes}}{\ltrch  (contain }{\ltrch {\b Gluten}}{\ltrch ) \'b7 Rapeseed Oil \'b7 }{\ltrch {\b Wheat}}\li0\ri0\sa0\sb0\fi0\ql\par}
{{\ltArch }{\ltrch }{\ltrch  }\li0\ri0\sa0\sb0\fi0\ql\par}

}
}
</test>

So what needs to be done:

  1. Values like {\b Wheat} should become <bold>Wheat</bold>, where Wheat can be any text.

  2. \'b7 should become a comma (',')

The result would be:

<test>
<data>Ingredients: roast British chicken breast , chicken stock mayo and smoked  , prawns with mayo on malted brown bread , smoked British ham with mustard mayo on oatmeal bread , .
roast British chicken breast , chicken stock mayo and smoked  : Chicken Breast (25.89%) , <bold> Wheatflour</bold> contains <bold>Gluten</bold>(with Wheatflour , Calcium Carbonate , Iron , Niacin , Thiamin) , Water , Pork (10.32%) , Malted <bold> Wheatflakes</bold>contain <bold> Gluten</bold>, Rapeseed Oil , <bold> Wheat</bold>
</data>
</test>

Can this be done? If so, how?

1 Solution

#1


This isn't terribly difficult if you can use XSLT 2.0 or newer, which includes regular expression functionality. The key for you is the replace() function.
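As a minimal illustration of how replace() behaves, here is a tiny stand-alone stylesheet. The sample string and the square-bracket replacement are made up for this sketch; only the pattern comes from the question. `$1` in the replacement refers to the first parenthesised group, and `.*?` is a reluctant (non-greedy) match, so it stops at the first closing brace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="text"/>

    <!-- Runs against any input document; the input itself is ignored. -->
    <xsl:template match="/">
        <!-- replace(input, pattern, replacement)
             \{ and \} match literal braces, \\ matches a literal backslash. -->
        <xsl:value-of select="replace('a {\b Wheat} b', '\{\\b (.*?)\}', '[$1]')"/>
    </xsl:template>
</xsl:stylesheet>
```

With an XSLT 2.0 processor this should print `a [Wheat] b`. Note that replace() is not available in XSLT 1.0.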

Here's a snippet of code that begins to clean up your RTF mess:

<xsl:template match="data">
    <xsl:copy>
        <!-- Note: XSL variables are _immutable_: once created, their values 
            cannot be changed.  I use a chain of variables here simply for 
            purposes of illustration, as a means of showing each regex 
            replacement operation on its own.  These could all be stacked
            into a single statement, but that is somewhat harder for
            humans to read. :) -->
        <xsl:variable name="bolded" select="replace(., '\{\\b (.*?)\}', '&lt;bold&gt;$1&lt;/bold&gt;')"/>
        <xsl:variable name="commas" select="replace($bolded, '\\''b7', ',')"/>
        <xsl:variable name="unfonted" select="replace($commas, '\{\\fonttbl\{.*?\}\}', '')"/>
        <xsl:variable name="uncolored" select="replace($unfonted, '\{\\colortbl\\.*?\}', '')"/>
        <xsl:variable name="no-ltrch" select="replace($uncolored, '\{\\ltrch (.*?)\}', '$1')"/>
        <xsl:value-of select="$no-ltrch" disable-output-escaping="yes"/>
    </xsl:copy>
</xsl:template>

This currently outputs (after adding the closing </data> tag that's missing in your sample input XML):

<?xml version="1.0" encoding="UTF-8"?><test>
    <data>{\rtf1\ansi\ansicpg1252\uc1\htmautsp\deff2\loch\hich\dbch\pard\plain\ltrpar\itap0{\lang1033\fs16\f3\cf0 \cf0\ql{\ql{Ingredients: roast British chicken breast , chicken stock mayo and smoked  , prawns with mayo on malted brown bread , smoked British ham with mustard mayo on oatmeal bread , .\li0\ri0\sa0\sb0\fi0\ql\par}
        { \li0\ri0\sa0\sb0\fi0\ql\par}
        {roast British chicken breast , chicken stock mayo and smoked  : Chicken Breast (25.89%) , <bold>Wheatflour</bold> contains <bold>Gluten</bold> (with Wheatflour , Calcium Carbonate , Iron , Niacin , Thiamin) , Water , Pork (10.32%) , Malted <bold>Wheatflakes</bold> (contain <bold>Gluten</bold>) , Rapeseed Oil , <bold>Wheat</bold>\li0\ri0\sa0\sb0\fi0\ql\par}
        {{\ltArch } \li0\ri0\sa0\sb0\fi0\ql\par}

        }
        }</data>
</test>

At this point, you just need to figure out the rest of the regular expressions needed to strip out the remainder of the RTF codes.
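One possible way to finish the job is sketched below. The last four variables (`no-words`, `no-braces`, `one-per-line`, `trimmed`) and their patterns are my own assumptions, not part of the snippet above, and they are only a rough cleanup rather than a real RTF parser: they strip every remaining control word, then the leftover braces, then tidy the whitespace. Hex escapes other than \'b7, escaped braces or backslashes in the text, and literal `<` or `&` characters (which matter because of disable-output-escaping) are not handled.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Identity template: copy anything not handled below. -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="data">
        <xsl:copy>
            <!-- The five replacements from the snippet above, unchanged. -->
            <xsl:variable name="bolded" select="replace(., '\{\\b (.*?)\}', '&lt;bold&gt;$1&lt;/bold&gt;')"/>
            <xsl:variable name="commas" select="replace($bolded, '\\''b7', ',')"/>
            <xsl:variable name="unfonted" select="replace($commas, '\{\\fonttbl\{.*?\}\}', '')"/>
            <xsl:variable name="uncolored" select="replace($unfonted, '\{\\colortbl\\.*?\}', '')"/>
            <xsl:variable name="no-ltrch" select="replace($uncolored, '\{\\ltrch (.*?)\}', '$1')"/>

            <!-- Assumed extra steps (not from the original answer). -->
            <!-- Drop any remaining control word such as \par, \fs16 or \li0,
                 plus the single space that may terminate it. -->
            <xsl:variable name="no-words" select="replace($no-ltrch, '\\[a-zA-Z]+-?[0-9]* ?', '')"/>
            <!-- Drop the group braces that are left over. -->
            <xsl:variable name="no-braces" select="replace($no-words, '[\{\}]', '')"/>
            <!-- Collapse each line break plus surrounding whitespace into one newline. -->
            <xsl:variable name="one-per-line" select="replace($no-braces, '[ \t]*\n\s*', '&#10;')"/>
            <!-- Trim whitespace at the very start and end of the string. -->
            <xsl:variable name="trimmed" select="replace($one-per-line, '^\s+|\s+$', '')"/>

            <xsl:value-of select="$trimmed" disable-output-escaping="yes"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
```

The order matters here: the bold and \'b7 replacements have to run before the generic control-word step, otherwise `\b` would be stripped along with everything else and the `<bold>` markup would be lost. On the sample input this should leave roughly the two lines of text from the desired result, but check it against your real data before relying on it.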
