How do I use a regex to get the domain from a URL?

Time: 2022-08-23 10:37:00

I need to display a Word document on a web page. I am using a library named Docx4j to convert the .doc file to HTML, and this works fine. However, the hyperlinks come out in the format below.

To search on google go to this link [#?] HYPERLINK \"http://www.google.com/\" [#?][#?] google[#?] and type the text.

I'm able to convert it to

To search on google go to this link  (http://www.google.com) google and type the text.

using the below code

String myText = "To search on google go to this link [#?] HYPERLINK \"http://www.google.com/\" [#?][#?] google[#?] and type the text.";
System.out.println(myText);
String firstReplace = myText.replaceAll("\\[", "").replaceAll("\\]", "").replaceAll("#\\?", "");
System.out.println(firstReplace);
String secondReplace = firstReplace.replaceAll("HYPER\\S+\\s+\"", "(");
System.out.println(secondReplace);
String finalReplace = secondReplace.replaceAll("/*\".", ")");
System.out.println("\n" + finalReplace);

Can someone please provide me a regex to convert the above string to

To search on google go to this link google (http://www.google.com) and type the text.

--EDIT--

There are some links which show up as

[#?] HYPERLINK \"http://www.google.com/\" [#?][#?] google page[#?]

I should change them to

google page (http://www.google.com)

How do I do this?

2 answers

#1


2  

You can use capture groups and group references to move the word google, which comes after the parentheses, in front of them.

Replace whatever the following regex matches:

'(\([^)]*\))\s?(\w+)'

with the following:

'$2 $1'

You can use the String.replaceAll() method for this.

Elaboration:

The first capture group, (\([^)]*\)), matches the part between parentheses, including the parentheses themselves. [^)]* is a negated character class that matches any run of characters other than a closing parenthesis.

The second one, (\w+), matches the word after that part. \w+ matches one or more word characters (letters, digits, or underscore).
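
As a rough sketch of how this pattern and replacement might be applied in Java (the class and method names here are made up for illustration, and each backslash has to be doubled inside a Java string literal):

```java
// Hypothetical helper; the names are illustrative and not part of the original answer.
final class SwapLinkAndWord {
    // Group 1: a parenthesised chunk such as "(http://www.google.com)".
    // "\s?" skips one optional whitespace character.
    // Group 2: the single word that follows.
    static String swap(String text) {
        return text.replaceAll("(\\([^)]*\\))\\s?(\\w+)", "$2 $1");
    }

    public static void main(String[] args) {
        // The string produced by the question's own replace chain.
        String cleaned = "To search on google go to this link  (http://www.google.com) google and type the text.";
        System.out.println(swap(cleaned));
    }
}
```

Note that (\w+) captures only one word, so the "google page" case from the question's edit would come out as "google (http://www.google.com) page" with this pattern.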

#2


0  

Removing the [#?] markers as early as you do in your question means that you lose information you need to make the required text adjustments later. The basic template of your input is:

[#?] HYPERLINK *target* [#?] [#?] *clickable textual description of link* [#?]

So why don't you use those markers to your advantage?

Some regexp like this (NOTE: not tested, probably wrong, but just to give you the basic idea):

mystring.replaceAll("\\[#\\?\\] HYPERLINK (.*) \\[#\\?\\] \\[#\\?\\] (.*) \\[#\\?\\]", "$2 ($1)");
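
The pattern above expects a single space between the markers and before the last one, whereas the sample in the question has "[#?][#?]" with no space, quotes around the URL, and a trailing slash. A possible adjustment, offered only as a sketch against that one sample (the class name, method name, and the tolerant \s* handling are assumptions, not something the answer specifies):

```java
import java.util.regex.Pattern;

// Hypothetical helper; names and whitespace handling are assumptions for illustration.
final class HyperlinkFieldRewriter {
    // Group 1: the quoted link target, without an optional trailing slash.
    // Group 2: the clickable text between the last two [#?] markers.
    private static final Pattern FIELD = Pattern.compile(
        "\\[#\\?\\]\\s*HYPERLINK\\s+\"([^\"]*?)/?\"\\s*\\[#\\?\\]\\s*\\[#\\?\\]\\s*(.*?)\\s*\\[#\\?\\]");

    static String toTextWithUrl(String input) {
        return FIELD.matcher(input).replaceAll("$2 ($1)");
    }

    public static void main(String[] args) {
        String sentence = "To search on google go to this link [#?] HYPERLINK \"http://www.google.com/\" [#?][#?] google[#?] and type the text.";
        // prints: To search on google go to this link google (http://www.google.com) and type the text.
        System.out.println(toTextWithUrl(sentence));

        String multiWord = "[#?] HYPERLINK \"http://www.google.com/\" [#?][#?] google page[#?]";
        // prints: google page (http://www.google.com)
        System.out.println(toTextWithUrl(multiWord));
    }
}
```

The lazy quantifiers keep each match inside one field, so several links in the same string should be rewritten independently. Input that deviates from the template (for example, a target without quotes) is simply left unchanged.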

The above is designed to give you "google page (http://www.google.com)". But I would also question why you want to display it like that. Normally, for HTML web pages, you would want it to be <a href="http://www.google.com">google page</a>. To do that, just change the replacement string in the above code.
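
One way the anchor-tag variant might look, as a sketch only: it uses a marker pattern loosened to fit the sample in the question (an assumption), made-up class and method names, a deliberately minimal HTML escaping step, and the function-taking Matcher.replaceAll overload, which needs Java 9 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names, pattern details, and escaping rules are assumptions.
final class HyperlinkFieldToAnchor {
    // Group 1: quoted link target (optional trailing slash dropped).
    // Group 2: the clickable text between the last two [#?] markers.
    private static final Pattern FIELD = Pattern.compile(
        "\\[#\\?\\]\\s*HYPERLINK\\s+\"([^\"]*?)/?\"\\s*\\[#\\?\\]\\s*\\[#\\?\\]\\s*(.*?)\\s*\\[#\\?\\]");

    // Minimal escaping so the URL and text cannot break out of the generated markup.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String toAnchor(String input) {
        // quoteReplacement stops '$' or '\' in the link from being read as group references.
        return FIELD.matcher(input).replaceAll(m -> Matcher.quoteReplacement(
            "<a href=\"" + escapeHtml(m.group(1)) + "\">" + escapeHtml(m.group(2)) + "</a>"));
    }

    public static void main(String[] args) {
        String field = "[#?] HYPERLINK \"http://www.google.com/\" [#?][#?] google page[#?]";
        // prints: <a href="http://www.google.com">google page</a>
        System.out.println(toAnchor(field));
    }
}
```

If the escaping is not needed, the same result for this sample could be had by passing "<a href=\"$1\">$2</a>" as the replacement string to replaceAll.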
