Help using regular expressions in PHP (parsing Wikipedia markup)

Time: 2023-01-14 23:12:06

I have this bit of text that I want to remove from a page I am fetching from Wikipedia.

{{Historical populations|type=USA
| 1698|4937
| 1712|5840
| 1723|7248
| 1737|10664
| 1746|11717
| 1756|13046
| 1771|21863
| 1790|33131
| 1800|60515
| 1810|96373
| 1820|123706
| 1830|202589
| 1840|312710
| 1850|515547
| 1860|813669
| 1870|942292
| 1880|1206299
| 1890|1515301
| 1900|3437202
| 1910|4766883
| 1920|5620048
| 1930|6930446
| 1940|7454995
| 1950|7891957
| 1960|7781984
| 1970|7894862
| 1980|7071639
| 1990|7322564
| 2000|8008288
| 2008*|8363710
|footnote=Beginning 1900, figures are for consolidated city of five boroughs. Sources: 1698–1771,{{cite book|last=Greene and Harrington|first=|title=American Population Before the Federal Census of 1790|publisher=|location=New York|year=1932|isbn=|pages=}}, as cited in: {{cite book|last=Rosenwaike|first=Ira|title=Population History of New York City|publisher=Syracuse University Press|location=Syracuse, N.Y.|year=1972|isbn=0815621558|page=8}} 1790–1990,Gibson, Campbell.[http://www.census.gov/population/www/documentation/twps0027.html Population of the 100 Largest Cities and Other Urban Places in the United States:1790 to 1990], [[United States Census Bureau]], June 1998. Retrieved June 12, 2007. *2008 est[http://factfinder.census.gov/servlet/SAFFPopulation?_event=Search&geo_id=16000US3403940&_geoContext=01000US%7C04000US34%7C16000US3403940&_street=&_county=new+york+city&_cityTown=new+york+city&_state=04000US36&_zip=&_lang=en&_sse=on&ActiveGeoDiv=geoSelect&_useEV=&pctxt=fph&pgsl=160&_submenuId=population_0&ds_name=null&_ci_nbr=null&qr_name=null&reg=null%3Anull&_keyword=&_industry=Census Data for New York city, New York], [[United States Census Bureau]]. Retrieved June 12, 2007.
}}

I also want to keep the following part as plain text (but without the parts wrapped in "{{" and "}}"):

New York is the most populous city in the United States, with an estimated 2008 population of 8,363,710(up from 7.3 million in 1990). This amounts to about 40.0% of New York State's population and a similar percentage of the metropolitan regional population. Over the last decade the city's population has been increasing and demographers estimate New York's population will reach between 9.2 and 9.5 million by 2030.{{cite web |title=New York City Population Projections by Age/Sex and Borough, 2000-2030 |publisher=[[New York City Department of City Planning]] |month=December | year=2006 |url=http://www.nyc.gov/html/dcp/pdf/census/projections_report.pdf |format=PDF |accessdate=2008-09-01}} See also {{cite news |last=Roberts, Sam |title=By 2025, Planners See a Million New Stories in the Crowded City |publisher=New York Times |date=February 19, 2006 |url=http://www.nytimes.com/2006/02/19/nyregion/19population.html?ex=1298005200&en=c586d38abbd16541&ei=5090&partner=rssuserland&emc=rss |accessdate=2008-09-01}}

Thanks.

3 answers

#1


2  

This is the code I am currently using to clean a wiki page, for example this one:

http://en.wikipedia.org/wiki/Tel_Aviv (you can see the markup by clicking "edit this page")

I get this returned:

"and given way to its reputation as a "Mediterranean metropolis that never sleeps". Haaretz Editorial It is the country's financial capital and a major performing arts and business center. Tel Aviv's urban area is the Middle East's second biggest city economy, and is ranked 42nd among global cities by Foreign Policys 2008 Global Cities Index. It is also the most expensive city in the region, and 17th most expensive city in the world. The cost of living in Israel is high, with Tel Aviv being its most expensive city to live in. According to Mercer, a human resources consulting firm based in New York, as of 2008 Tel Aviv is the most expensive city in the Middle East and the 14th most expensive in the world. It falls just behind Singapore and Paris and just ahead of Sydney and Dublin in this respect. By comparison, New York City is 22nd."

That isn't correct; the expected result is:

Tel Aviv-Yafo (Hebrew: תֵּל־אָבִיב-יָפוֹ; Arabic: تل أبيب‎, Tall ʼAbīb), usually called Tel Aviv, is the second largest city in Israel, with an estimated population of 393,900. The city is situated on the Israeli Mediterranean coastline, with a land area of 51.8 square kilometres (20.0 sq mi). It is the largest and most populous city in the metropolitan area of Gush Dan, home to 3.15 million people as of 2008. The city is governed by the Tel Aviv-Yafo municipality, headed by Ron Huldai.

This is the PHP code:

function clean_wiki_text($text)
  {
    // first get rid of UGC HTML tags
    $text = strip_tags($text);

    // keep convert tag
    $text = preg_replace("/\{\{convert\|([^\|]+)\|([^\|]+)\|[^\}]+\}\}/", "$1$2", $text);

    // remove large blocks (treat as tags)
    $text = preg_replace("/(<![^>]+>)/", '', $text);
    $text = preg_replace('/\{\{\s?/', '<', $text);
    $text = str_replace('}}', ' />', $text);

    $text = str_replace('<! />', '', $text);

    // more wiki formatting
    $text = preg_replace("/'{2,6}/", '', $text);
    $text = preg_replace("/[=\s]+External [lL]inks[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+See [aA]lso[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+References[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+Notes[\s=]+/", '', $text);
    $text = preg_replace('/\{\{([^\}]+)\}\}/', '', $text);

    // drop page link text
    $text = preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', "$2", $text);
    // or keep it with preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', "$1 ($2)", $text);

    $text = preg_replace('/\(\[[^\]]+\]\)/', '', $text);
    $text = preg_replace('/\[\[([^:\]]+)\]\]/', "$1", $text);
    $text = preg_replace('/\*?\s?\[\[([^\]]+)\]\]/', '', $text);
    $text = preg_replace('/\*\s?\[([^\s]+)\s([^\]]+)\]/', "$2", $text);
    $text = preg_replace('/\n(\*+\s?)/', '', $text);
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    $text = preg_replace('/<ref[^>]?>[^>]+>/', '', $text);
    $text = preg_replace('/<cite[^>]?>[^>]+>/', '', $text);

    $text = preg_replace('/={2,}/', '', $text);
    $text = preg_replace('/{?class="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?width="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?height="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?style="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?rowspan="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?bgcolor="[^"]+"/', "", $text);

    $text = trim($text);

    $text = preg_replace('/\n\n/', "<br />\n<br />\n", $text);
    $text = preg_replace('/\r\n\r\n/', "<br />\r\n<br />\r\n", $text);
/*
    $config = array(
      'show-body-only' => true,
      'clean'          => false, 
      'wrap'           => 0, 
      'show-warnings'  => 0,
      'show-errors'    => 0,
      'enclose-block-text'   => false,
      'vertical-space' => true,
      'output-html'    => true
    );

    // Tidy
    $tidy = new tidy;
    $tidy->parseString($text, $config, 'utf8');
    $tidy->cleanRepair();

    $text = $tidy->value;
*/
    $extras = array(
  //  "/\((.*?)\)/is" => "",
      "/\[(.*?)\]/is" => ""
    );
    $text = preg_replace(array_keys($extras), array_values($extras), $text);

    $text = str_replace(" ,", ',', $text);
    $text = str_replace(", ", ',', $text);
    $text = str_replace(",", ', ', $text);
    $text = str_replace("(, ", '(', $text);
    $text = str_replace(";,", ',', $text);

    // lets keep it plain plain plain
    $text = strip_tags($text);
//    $text = preg_replace('/\s\s+/', ' ', $text);

    $text = str_replace("|-", '', $text);
    $text = str_replace("|}", '', $text);
    $text = str_replace("|", '', $text);
    $text = str_replace('()', '', $text);
    $text = str_replace('&nbsp;', ' ', $text);

    $text = trim($text);

    $text_arr = preg_split('/[\r\n]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    // collect paragraphs in an array; appending with [] to a string fails on PHP 7.1+
    $result = array();
    foreach ($text_arr as $paragraph) {
      if ( mb_strlen(trim($paragraph)) > 30 ) {
        $result[] = $paragraph;
      }
    }
    return $result;
  }

#2


1  

Just guessing here, but wouldn't it be easier and safer to use Wikipedia's markup library (bundled with MediaWiki), turn the markup into HTML, then parse that with whatever XML library you happen to be comfortable with?

API documentation can be found at http://svn.wikimedia.org/doc/ (in the Parser module) and it doesn't look very complicated. Basically, all you'd have to do is something like the following:

<?php

require_once '/path/to/mediawiki/Parser.php';
// also include whatever classes Parser depends on, or use MediaWiki's autoload
// mechanism if it has any

// retrieve the content of your page in $content

$parser = new Parser();
$html   = $parser->parse($content);

$simplexml = simplexml_load_string($html);

Now you have a very handy SimpleXML object to play with. Of course, this only works if Mediawiki's parser produces valid XML (which I bet it does).
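If the generated HTML turns out not to be well-formed XML, PHP's DOM extension is a more forgiving alternative to SimpleXML. The following is only a rough sketch under that assumption: the `$html` string is a made-up stand-in for whatever the parser returns, and dropping `<sup>` footnote markers is an illustrative choice, not something the question asks for.

```php
<?php
// Hypothetical HTML standing in for the parser output.
$html = '<p>Tel Aviv is a city in <a href="/wiki/Israel">Israel</a>.<sup>[1]</sup></p>'
      . '<table><tr><td>Population</td></tr></table>'
      . '<p>It lies on the coast.</p>';

// Collect libxml warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// loadHTML() tolerates markup that simplexml_load_string() would reject.
$doc->loadHTML($html);
libxml_clear_errors();

// Remove footnote markers; copy the node list to an array before deleting.
$xpath = new DOMXPath($doc);
foreach (iterator_to_array($xpath->query('//sup')) as $node) {
    $node->parentNode->removeChild($node);
}

// Keep only the text of <p> elements, which skips tables and similar blocks.
$paragraphs = array();
foreach ($doc->getElementsByTagName('p') as $p) {
    $paragraphs[] = trim($p->textContent);
}

print_r($paragraphs);
```

The sample is plain ASCII on purpose: `loadHTML()` does not assume UTF-8 unless the markup declares its encoding, so real Wikipedia content would need a charset hint first.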

Also, should Mediawiki include some kind of autoload mechanism, it would be easy to find it by looking for __autoload or spl_autoload_register in Mediawiki's codebase.

Hope it helps!

#3


0  

It's really hard to write a regex when only one example is provided. From my own experience cleaning Wikipedia pages, I know that other pages will very probably look a little different. A pattern that matches just your example is simply:

{{.+?}}\n

This only works if there is a newline after the part to be removed and if you specify DOTALL and MULTILINE. To match every pair of double curly brackets along with its contents, use:

{{[^}]+}}
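In PHP, DOTALL corresponds to the `s` pattern modifier and MULTILINE to `m`. Here is a minimal sketch of both patterns with `preg_replace`, assuming a shortened, made-up sample in `$wikitext` rather than the full markup from the question. Since `m` only changes how `^` and `$` behave, it is left out here.

```php
<?php
// Hypothetical sample, shortened from the kind of markup in the question.
$wikitext = "{{Historical populations|type=USA\n| 1698|4937\n}}\n"
          . "New York is the most populous city.{{cite web |title=Projections}} See also.";

// Pattern 1: lazy match up to the first "}}" followed by a newline.
// The "s" modifier (DOTALL) lets "." match newlines inside the template.
$first = preg_replace('/\{\{.+?\}\}\n/s', '', $wikitext);

// Pattern 2: "{{", then anything except "}", then "}}".
// A negated character class already matches newlines, so no modifier is needed.
$second = preg_replace('/\{\{[^}]+\}\}/', '', $wikitext);

echo $first, "\n";
echo $second, "\n";
```

On this sample the first pattern removes only the population table, because the citation is not followed by a newline, while the second removes both templates but leaves the newline that followed the table.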

You could try several passes, each removing another unwanted part. I doubt it is feasible to match everything you need with a single regex.
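One way to do this in passes: the population table in the question contains `{{cite book ...}}` templates nested inside the outer template, and a single `{{[^}]+}}` replacement would stop at the first inner `}}`. The sketch below repeatedly removes innermost templates until none are left. `strip_templates` is a hypothetical helper name and `$sample` is a shortened, made-up input.

```php
<?php
/**
 * Hypothetical helper: remove {{...}} templates, including nested ones,
 * by deleting innermost templates until a pass finds nothing to delete.
 */
function strip_templates($text)
{
    do {
        // Innermost template: "{{", any run of non-brace characters, "}}".
        $text = preg_replace('/\{\{[^{}]*\}\}/', '', $text, -1, $count);
    } while ($count > 0);

    return $text;
}

// Shortened, made-up sample with one template nested inside another.
$sample = "{{Historical populations|type=USA\n| 1698|4937\n"
        . "|footnote=Sources: {{cite book|title=Example}}, as cited.\n}}\n"
        . "New York is the most populous city.{{cite web |title=Projections}}";

echo strip_templates($sample);
```

This only handles balanced double braces. A template containing a stray single brace, such as wiki table markup (`{|`), would be left in place, so further passes for other markup would still be needed.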
