How do I convert HTML to valid XHTML?

Date: 2020-12-16 22:30:18

I have a string of HTML. In this example it looks like this:

<img src="somepic.jpg" someAtrib="1" >

I am trying to work out a piece of regex that will match the 'img' node and add a slash to the end of the node, so that it looks like this:

<img src="somepic.jpg" someAtrib="1" />

Essentially, the end goal is to ensure that the node is closed: unclosed nodes are valid in HTML but obviously not in XML. Are there any regex buffs out there who can help?

5 answers

#1


12  

Don't use a regular expression; use a dedicated parser. In JavaScript, create a document using DOMParser, then serialize it using XMLSerializer:

var doc = new DOMParser().parseFromString('<img src="foo">', 'text/html');
var result = new XMLSerializer().serializeToString(doc);
// result:
// <html xmlns="http://www.w3.org/1999/xhtml"><head></head><body> (no line break)
// <img src="foo" /></body></html>

#2


3  

You can create an XHTML document and import/adopt HTML elements into it. HTML strings can be parsed via the HTMLElement.innerHTML property, of course. The key point is to use the Document.importNode() or Document.adoptNode() method to convert HTML nodes into XHTML nodes:

var di = document.implementation;
var hd = di.createHTMLDocument();
var xd = di.createDocument('http://www.w3.org/1999/xhtml', 'html', null);
hd.body.innerHTML = '<img>';
var img = hd.body.firstElementChild;
var xb = xd.createElement('body');
xd.documentElement.appendChild(xb);
console.log('html doc:\n' + hd.documentElement.outerHTML + '\n');
console.log('xhtml doc:\n' + xd.documentElement.outerHTML + '\n');
img = xd.importNode(img); //or xd.adoptNode(img). Now img is a xhtml element
xb.appendChild(img);
console.log('xhtml doc after import/adopt img from html:\n' + xd.documentElement.outerHTML + '\n');

The output should be:

html doc:
<html><head></head><body><img></body></html>

xhtml doc:
<html xmlns="http://www.w3.org/1999/xhtml"><body></body></html>

xhtml doc after import/adopt img from html:
<html xmlns="http://www.w3.org/1999/xhtml"><body><img /></body></html>

Rob W's answer does not work in Chrome (at least version 29 and below), because DOMParser there does not support the 'text/html' type and XMLSerializer generates HTML syntax (not XHTML) for an HTML document.

#3


2  

In addition to Rob W's answer, you can extract the body content using a regex:

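var doc = new DOMParser().parseFromString('<img src="foo">', 'text/html');
var result = new XMLSerializer().serializeToString(doc);

// [\s\S]* also matches line breaks, which "." does not;
// reading the match directly avoids the legacy RegExp.$1 property.
var match = /<body[^>]*>([\s\S]*)<\/body>/i.exec(result);
if (match) {
    result = match[1];
}

// result:
// <img src="foo" />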

Note: parseFromString(htmlString, 'text/html'); throws an error in IE9, because the text/html MIME type is not supported there. It works in IE10 and IE11, though.
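Since 'text/html' support in DOMParser is reported above to be missing in some older browsers, a feature check before parsing may help. The following is only a sketch: the function names supportsHtmlDomParser and parseHtml are made up for illustration, and the fallback assumes document.implementation.createHTMLDocument is available, as used in answer #2.

```javascript
// Returns true when DOMParser exists and can parse 'text/html'.
function supportsHtmlDomParser() {
  if (typeof DOMParser === "undefined") {
    return false; // no DOM at all, e.g. plain Node.js
  }
  try {
    var doc = new DOMParser().parseFromString("<p>x</p>", "text/html");
    // Some older engines are reported to return null instead of throwing.
    return !!doc && !!doc.body && doc.body.childNodes.length === 1;
  } catch (e) {
    return false; // IE9 is reported to throw for 'text/html'
  }
}

// Hypothetical helper: parse with DOMParser when possible, otherwise
// fall back to innerHTML on a detached HTML document (browser only).
function parseHtml(html) {
  if (supportsHtmlDomParser()) {
    return new DOMParser().parseFromString(html, "text/html");
  }
  var doc = document.implementation.createHTMLDocument("");
  doc.body.innerHTML = html;
  return doc;
}
```

Either way, the returned document can then be passed to XMLSerializer as in the snippets above.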

#4


1  

This will do a pretty good job:

result = text.replace(/(<img\b[^<>]*[^<>\/])>/ig, "$1 />");

Addendum: In the (unlikely) event that your markup has attribute values containing angle brackets (a raw < there is not well-formed XML/XHTML, by the way), this one will do a slightly better job:

result = text.replace(/(<img\b(?:[^<>"'\/]+|'[^']*'|"[^"]*")*)>/ig, "$1 />");
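To see where the two patterns above differ, the sketch below runs both on a few sample strings. The helper name closeImgTags and the sample inputs are assumptions made for illustration; the two regular expressions are copied unchanged from this answer.

```javascript
// The two patterns from the answer, unchanged.
var simplePattern = /(<img\b[^<>]*[^<>\/])>/ig;
var quoteAwarePattern = /(<img\b(?:[^<>"'\/]+|'[^']*'|"[^"]*")*)>/ig;

// Hypothetical helper: apply one of the patterns to a string.
function closeImgTags(text, pattern) {
  return text.replace(pattern, "$1 />");
}

var samples = [
  '<img src="foo">',   // ordinary unclosed tag
  '<img src="foo" />', // already self-closed
  '<img>',             // no attributes at all
  '<img alt="a>b">'    // ">" inside a quoted attribute value
];

samples.forEach(function (sample) {
  console.log(sample);
  console.log("  simple:      " + closeImgTags(sample, simplePattern));
  console.log("  quote-aware: " + closeImgTags(sample, quoteAwarePattern));
});
```

Both patterns leave an already self-closed tag alone. The first pattern needs at least one character after `<img`, so it skips a bare `<img>`, and it cuts the tag short at a `>` inside a quoted value. The second pattern handles both of those cases, but it will not match a tag that has a `/` outside quotes, such as an unquoted URL.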

#5


0  

Why would you want to fix, in the browser DOM, an HTML document that is invalid as XHTML?

It has already been served and parsed, and you already have the DOM available. Any parsing error that an invalid or badly formed document would cause has already happened, and a regex applied to the DOM will not fix it.

Also, remember that almost all documents are parsed as HTML tag soup. If you can't fix the document on the server side, just ignore its validity/well-formedness on the client side.
