OpenRefine GREL to change p tags to h2

Time: 2022-10-31 11:20:20

I'm using OpenRefine to clean about 300 records and have some html text that has multiple paragraph tags with a specific class (class="essay-header") that wraps text that I'd like to convert to h2 tags. What kind of GREL would I need to use to transform these cells properly? I figure my html selector is probably "p.essay-header" but I'm having trouble sorting out the way to replace the tag element without losing the inner text of the paragraph.

Example Text to Transform

<div> <p>Some text of lesser importance.</p> <p class="essay-header">Text to Make a Header</p>. <p>More less important text.</p><p class="essay-header">Again with the Important Text.</p> </div>

1 solution

#1


While it is generally a bad idea to try to parse HTML with regular expressions, if you want to do this with GREL you could use:

with(value.match(/(.*)<p class="essay-header">(.*?)<\/p>(.*)/),v,if(v.length()>0,v[0]+"<h2>"+v[1]+"</h2>"+v[2],v))

Because there is no 'global' option for regex in OpenRefine, you'd have to use the option to "Re-transform up to X times" to match multiple occurrences of the pattern in a single cell.
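For comparison, here is a minimal sketch of the same substitution in Python, where re.sub replaces every occurrence in one pass, so no repeated re-transform is needed. OpenRefine's expression editor also offers a Python/Jython mode, but this is written as a standalone script. The names ESSAY_HEADER and headers_to_h2 are made up for illustration, and the pattern assumes the opening tag is written exactly as in the example text.

```python
import re

# Assumed pattern: the opening tag appears exactly as in the example text.
# The non-greedy group captures the inner text of one paragraph;
# DOTALL lets that text span line breaks.
ESSAY_HEADER = re.compile(r'<p class="essay-header">(.*?)</p>', re.DOTALL)


def headers_to_h2(html):
    """Replace every essay-header paragraph with an h2, keeping its inner text."""
    return ESSAY_HEADER.sub(r"<h2>\1</h2>", html)


sample = ('<div> <p>Some text of lesser importance.</p> '
          '<p class="essay-header">Text to Make a Header</p>. '
          '<p>More less important text.</p>'
          '<p class="essay-header">Again with the Important Text.</p> </div>')

print(headers_to_h2(sample))
```

In OpenRefine's Python/Jython mode the expression would probably reduce to importing re and returning the result of the substitution applied to value, but treat that as an untested assumption.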

Another option is to split the HTML into segments first, then parse each segment to replace the essay-header paragraphs with h2 tags:

forEach(value.split("<p").join("|<p").split("|"),v,if(v.parseHtml().select(".essay-header").length()>0,v.parseHtml().select(".essay-header")[0].replace('<p class="essay-header">',"<h2>").replace("</p>","</h2>"),v)).join("")
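The split-then-rewrite idea can also be sketched in plain Python to make the steps easier to follow. This is an illustration, not a translation of the GREL above: it uses simple string operations instead of an HTML parser, it needs no temporary "|" delimiter, and it keeps any text that follows the closing tag within a segment. OPEN_TAG and convert_by_segments are hypothetical names, and the sketch assumes an essay-header paragraph contains no nested tag beginning with "<p".

```python
# Assumed opening tag, written exactly as in the example text.
OPEN_TAG = '<p class="essay-header">'


def convert_by_segments(html):
    """Split before every '<p', rewrite essay-header segments, then rejoin."""
    parts = html.split("<p")
    # Restore the "<p" that split() consumed so each segment is intact.
    segments = [parts[0]] + ["<p" + part for part in parts[1:]]
    converted = []
    for segment in segments:
        if segment.startswith(OPEN_TAG):
            rest = segment[len(OPEN_TAG):]
            # Swap only the first closing tag; trailing text is left untouched.
            segment = "<h2>" + rest.replace("</p>", "</h2>", 1)
        converted.append(segment)
    return "".join(converted)


sample = ('<div> <p>Some text of lesser importance.</p> '
          '<p class="essay-header">Text to Make a Header</p>. '
          '<p>More less important text.</p>'
          '<p class="essay-header">Again with the Important Text.</p> </div>')

print(convert_by_segments(sample))
```

Segments that do not start with the essay-header opening tag pass through unchanged, so ordinary paragraphs and the surrounding div are preserved.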
