How to read values from numbers written as words?

Time: 2022-06-24 23:27:33

As we all know, numbers can be written either as numerals or spelled out by name. While there are plenty of examples that convert 123 into "one hundred twenty three", I could not find good examples of how to convert in the other direction.

Some of the caveats:

  1. cardinal/nominal or ordinal: "one" and "first"
  2. common spelling mistakes: "forty"/"fourty"
  3. hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred"
  4. separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot
  5. colloquialisms: "thirty-something"
  6. fractions: 'one third', 'two fifths'
  7. common names: 'a dozen', 'half'

There are probably more caveats that are not yet listed. Suppose the algorithm needs to be very robust, and even understand spelling mistakes.

What fields/papers/studies/algorithms should I read to learn how to write all this? Where is the information?

PS: My final parser should actually understand 3 different languages: English, Russian and Hebrew. More languages may be added at a later stage. Hebrew also has masculine/feminine numbers: "one man" and "one woman" use a different "one" ("ehad" and "ahat"). Russian also has some complexities of its own.

Google does a great job at this. For example:

http://www.google.com/search?q=two+thousand+and+one+hundred+plus+five+dozen+and+four+fifths+in+decimal

(the reverse is also possible: http://www.google.com/search?q=999999999999+in+english)

12 solutions

#1


43  

I was playing around with a PEG parser to do what you wanted (and may post that as a separate answer later) when I noticed that there's a very simple algorithm that does a remarkably good job with common forms of numbers in English, Spanish, and German, at the very least.

Working with English, for example, you need a dictionary that maps words to values in the obvious way:

"one" -> 1, "two" -> 2, ... "twenty" -> 20,
"dozen" -> 12, "score" -> 20, ...
"hundred" -> 100, "thousand" -> 1000, "million" -> 1000000

...and so forth

The algorithm is just:

total = 0
prior = null
for each word w
    v <- value(w) or next if no value defined
    prior <- case
        when prior is null: v
        when prior > v:     prior+v
        else                prior*v
    if w in {thousand,million,billion,trillion...}
        total <- total + prior
        prior <- null
total = total + prior unless prior is null

For example, this progresses as follows:

total    prior      v     unconsumed string
    0      _              four score and seven 
                    4     score and seven 
    0      4              
                   20     and seven 
    0     80      
                    _     seven 
    0     80      
                    7 
    0     87      
   87

total    prior      v     unconsumed string
    0        _            two million four hundred twelve thousand eight hundred seven
                    2     million four hundred twelve thousand eight hundred seven
    0        2
                  1000000 four hundred twelve thousand eight hundred seven
2000000      _
                    4     hundred twelve thousand eight hundred seven
2000000      4
                    100   twelve thousand eight hundred seven
2000000    400
                    12    thousand eight hundred seven
2000000    412
                    1000  eight hundred seven
2000000  412000
                    1000  eight hundred seven
2412000     _
                      8   hundred seven
2412000     8
                     100  seven
2412000   800
                     7
2412000   807
2412807

And so on. I'm not saying it's perfect, but for a quick and dirty approach it does quite well.
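
The pseudocode and traces above can be sketched in Python roughly as follows. This is a minimal sketch, not the answer author's own code: the `VALUES` table is a hypothetical, abbreviated dictionary, `GROUP_WORDS` is an assumed name for the set of words that flush a group into the total, and splitting on spaces and hyphens is a simplification.

```python
# Hypothetical, abbreviated dictionary; a real one would be much larger.
VALUES = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "twenty": 20, "thirty": 30, "forty": 40, "fourty": 40,
    "fifty": 50, "dozen": 12, "score": 20,
    "hundred": 100, "thousand": 1000, "million": 1000000,
}
# Words that close a group and flush it into the running total.
GROUP_WORDS = {"thousand", "million"}


def words_to_number(text):
    total = 0
    prior = None
    for word in text.lower().replace("-", " ").split():
        if word not in VALUES:
            continue  # skip words with no value, such as "and"
        v = VALUES[word]
        if prior is None:
            prior = v
        elif prior > v:
            prior += v   # "eighty" then "seven" -> 87
        else:
            prior *= v   # "four" then "score" -> 80
        if word in GROUP_WORDS:
            total += prior
            prior = None
    if prior is not None:
        total += prior
    return total


print(words_to_number("four score and seven"))
print(words_to_number("two million four hundred twelve thousand eight hundred seven"))
```

The two calls reproduce the traces above (87 and 2412807). Run-together forms such as "elevenhundred" are not handled here, because this sketch only splits on spaces and hyphens.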


Addressing your specific list on edit:

  1. cardinal/nominal or ordinal: "one" and "first" -- just put them in the dictionary
  2. common spelling mistakes: "fourty"/"forty" -- ditto
  3. hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred" -- works as is
  4. separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot -- just define "next word" to be the longest prefix that matches a defined word, or up to the next non-word character if none do, for a start
  5. colloquialisms: "thirty-something" -- works
  6. fractions: 'one third', 'two fifths' -- uh, not yet...
  7. common names: 'a dozen', 'half' -- works; you can even do things like "a half dozen"

Number 6 is the only one I don't have a ready answer for, and that's because of the ambiguity between ordinals and fractions (in English at least) added to the fact that my last cup of coffee was many hours ago.
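
The "longest prefix" idea suggested for separators could look something like the sketch below. It is only an illustration: `WORDS` is a hypothetical, tiny vocabulary (in practice it would be the keys of the value dictionary), and `tokenize` is an assumed helper name.

```python
import re

# Hypothetical vocabulary for illustration only.
WORDS = {"eleven", "hundred", "fifty", "two", "four", "fourteen", "forty"}
LETTERS = re.compile(r"[a-z]+")


def tokenize(text):
    text = text.lower()
    tokens = []
    i = 0
    while i < len(text):
        run = LETTERS.match(text, i)
        if run is None:
            i += 1  # space, hyphen or other separator
            continue
        # Longest vocabulary word that is a prefix of the remaining text.
        match = max((w for w in WORDS if text.startswith(w, i)),
                    key=len, default=None)
        if match is None:
            # No known word starts here: consume up to the next non-letter.
            match = run.group(0)
        tokens.append(match)
        i += len(match)
    return tokens


print(tokenize("elevenhundred fiftytwo"))
print(tokenize("eleven-hundred fifty-two"))
```

Both calls yield the same four tokens. Greedy longest-prefix matching can still split unknown words in surprising ways, so treat it as a starting point, as the answer itself says.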

#2


11  

It's not an easy issue, and I know of no library to do it. I might sit down and try to write something like this sometime. I'd do it in either Prolog, Java or Haskell, though. As far as I can see, there are several issues:

  • Tokenization: sometimes numbers are written "eleven hundred fifty two", but I've seen "elevenhundred fiftytwo" or "eleven-hundred-fifty-two" and whatnot. One would have to conduct a survey on what forms are actually in use. This might be especially tricky for Hebrew.

  • Spelling mistakes: that's not so hard. You have a limited number of words, and a bit of Levenshtein-distance magic should do the trick.

  • Alternate forms, like you already mentioned, exist. This includes ordinal/cardinal numbers, as well as forty/fourty and...

  • ... common names or commonly used phrases and NEs (named entities). Would you want to extract 30 from the Thirty Years War or 2 from World War II?

  • Roman numerals, too?

  • Colloquialisms, such as "thirty-something" and "three Euro and shrapnel", which I wouldn't know how to treat.

If you are interested in this, I could give it a shot this weekend. My idea is probably using UIMA and tokenizing with it, then going on to further tokenize/disambiguate and finally translate. There might be more issues, let's see if I can come up with some more interesting things.
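
As a rough illustration of the "Levenshtein-distance magic" mentioned in the spelling-mistakes bullet, here is a minimal sketch. The `VOCABULARY` list, the `correct` helper and the `max_distance` threshold are all assumptions made for the example, not part of any existing library.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


# Hypothetical vocabulary for illustration only.
VOCABULARY = ["one", "two", "three", "four", "forty", "fifty",
              "hundred", "thousand"]


def correct(word, max_distance=1):
    """Return the closest vocabulary word, or None if nothing is close enough."""
    best = min(VOCABULARY, key=lambda known: levenshtein(word, known))
    return best if levenshtein(word, best) <= max_distance else None


print(correct("fourty"))
print(correct("thousnd"))
```

Because the vocabulary of number words is small, a brute-force scan like this is probably fast enough. The threshold matters: too generous and ordinary words start turning into numbers.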

Sorry, this is not a real answer yet, just an extension to your question. I'll let you know if I find/write something.

By the way, if you are interested in the semantics of numerals, I just found an interesting paper by Friederike Moltmann, discussing some issues regarding the logic interpretation of numerals.

#3


10  

I have some code I wrote a while ago: text2num. This does some of what you want, except it does not handle ordinal numbers. I haven't actually used this code for anything, so it's largely untested!

#4


7  

Use the Python pattern-en library:

>>> from pattern.en import number
>>> number('two thousand fifty and a half')
2050.5

#5


5  

You should keep in mind that Europe and America count differently.

European standard:

One Thousand
One Million
One Thousand Millions (British also use Milliard)
One Billion
One Thousand Billions
One Trillion
One Thousand Trillions

Here is a small reference on it.

A simple way to see the difference is the following:

(American counting Trillion) == (European counting Billion)
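
One way a parser might account for this is to keep a separate magnitude table per counting system. The sketch below is only an assumption about how that could be organised; `SHORT_SCALE`, `LONG_SCALE` and `magnitude` are hypothetical names.

```python
# Short scale, as used in American English: each new name is 1000x the previous.
SHORT_SCALE = {"million": 10**6, "billion": 10**9, "trillion": 10**12}

# Long scale, as in the European list above: a billion is a million millions,
# and "milliard" names the thousand millions in between.
LONG_SCALE = {
    "million": 10**6,
    "milliard": 10**9,
    "billion": 10**12,
    "trillion": 10**18,
}


def magnitude(word, long_scale=False):
    """Look up a magnitude word in the chosen counting system."""
    table = LONG_SCALE if long_scale else SHORT_SCALE
    return table[word]


# The comparison from the answer: American trillion == European billion.
print(magnitude("trillion") == magnitude("billion", long_scale=True))
```

Which table applies cannot be read from the word alone, so the caller would have to decide it from the language or locale of the text.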

#6


4  

Ordinal numbers are not applicable because they can't be joined in meaningful ways with other numbers in language (...at least in English)

e.g. one hundred and first, eleven second, etc...

However, there is another English/American caveat with the word 'and'

i.e.

one hundred and one (English) one hundred one (American)

Also, the use of 'a' to mean one in English

a thousand = one thousand

...On a side note Google's calculator does an amazing job of this.

one hundred and three thousand times the speed of light

And even...

two thousand and one hundred plus a dozen

...wtf?!? a score plus a dozen in roman numerals
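
The two caveats about 'and' and 'a' could be handled by a small normalisation pass before parsing. This is a hedged sketch: `normalize` and the `MAGNITUDES` set are hypothetical, and always dropping "and" is a simplification that would be wrong for phrases like "two and a half".

```python
# Hypothetical set of words that "a"/"an" may stand in front of.
MAGNITUDES = {"hundred", "thousand", "million", "dozen", "score"}


def normalize(words):
    """Drop 'and'; turn 'a'/'an' into 'one' when a magnitude word follows."""
    result = []
    for i, word in enumerate(words):
        if word == "and":
            continue
        follows = words[i + 1] if i + 1 < len(words) else None
        if word in ("a", "an") and follows in MAGNITUDES:
            result.append("one")
        else:
            result.append(word)
    return result


print(normalize("one hundred and one".split()))
print(normalize("a thousand".split()))
```

After this pass the English and American spellings of 101 produce the same token list.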

#7


3  

Here is an extremely robust solution in Clojure.

AFAIK it is a unique implementation approach.

;----------------------------------------------------------------------
; numbers.clj
; written by: Mike Mattie codermattie@gmail.com
;----------------------------------------------------------------------
(ns operator.numbers
  (:use compojure.core)

  (:require
    [clojure.string     :as string] ))

(def number-word-table {
  "zero"          0
  "one"           1
  "two"           2
  "three"         3
  "four"          4
  "five"          5
  "six"           6
  "seven"         7
  "eight"         8
  "nine"          9
  "ten"           10
  "eleven"        11
  "twelve"        12
  "thirteen"      13
  "fourteen"      14
  "fifteen"       15
  "sixteen"       16
  "seventeen"     17
  "eighteen"      18
  "nineteen"      19
  "twenty"        20
  "thirty"        30
  "forty"         40
  "fourty"        40   ; common misspelling, kept so existing input still parses
  "fifty"         50
  "sixty"         60
  "seventy"       70
  "eighty"        80
  "ninety"        90
})

(def multiplier-word-table {
  "hundred"       100
  "thousand"      1000
})

(defn sum-words-to-number [ words ]
  (apply + (map (fn [ word ] (number-word-table word)) words)) )

; are you down with the sickness ?
(defn words-to-number [ words ]
  (let
    [ n           (count words)

      multipliers (filter (fn [x] (not (false? x))) (map-indexed
                                                      (fn [ i word ]
                                                        (if (contains? multiplier-word-table word)
                                                          (vector i (multiplier-word-table word))
                                                          false))
                                                      words) )

      x           (ref 0) ]

    (loop [ indices (reverse (conj (reverse multipliers) (vector n 1)))
            left    0
            combine + ]
      (let
        [ right (first indices) ]

        (dosync (alter x combine (* (if (> (- (first right) left) 0)
                                      (sum-words-to-number (subvec words left (first right)))
                                      1)
                                    (second right)) ))

        (when (> (count (rest indices)) 0)
          (recur (rest indices) (inc (first right))
            (if (= (inc (first right)) (first (second indices)))
              *
              +))) ) )
    @x ))

Here are some examples

(operator.numbers/words-to-number ["six" "thousand" "five" "hundred" "twenty" "two"])
(operator.numbers/words-to-number ["fifty" "seven" "hundred"])
(operator.numbers/words-to-number ["hundred"])

#8


2  

My LPC implementation of some of your requirements (American English only):

internal mapping inordinal = ([]);
internal mapping number = ([]);

#define Numbers ([\
    "zero"        : 0, \
    "one"         : 1, \
    "two"         : 2, \
    "three"       : 3, \
    "four"        : 4, \
    "five"        : 5, \
    "six"         : 6, \
    "seven"       : 7, \
    "eight"       : 8, \
    "nine"        : 9, \
    "ten"         : 10, \
    "eleven"      : 11, \
    "twelve"      : 12, \
    "thirteen"    : 13, \
    "fourteen"    : 14, \
    "fifteen"     : 15, \
    "sixteen"     : 16, \
    "seventeen"   : 17, \
    "eighteen"    : 18, \
    "nineteen"    : 19, \
    "twenty"      : 20, \
    "thirty"      : 30, \
    "forty"       : 40, \
    "fifty"       : 50, \
    "sixty"       : 60, \
    "seventy"     : 70, \
    "eighty"      : 80, \
    "ninety"      : 90, \
    "hundred"     : 100, \
    "thousand"    : 1000, \
    "million"     : 1000000, \
    "billion"     : 1000000000, \
])

#define Ordinals ([\
    "zeroth"        : 0, \
    "first"         : 1, \
    "second"        : 2, \
    "third"         : 3, \
    "fourth"        : 4, \
    "fifth"         : 5, \
    "sixth"         : 6, \
    "seventh"       : 7, \
    "eighth"        : 8, \
    "ninth"         : 9, \
    "tenth"         : 10, \
    "eleventh"      : 11, \
    "twelfth"       : 12, \
    "thirteenth"    : 13, \
    "fourteenth"    : 14, \
    "fifteenth"     : 15, \
    "sixteenth"     : 16, \
    "seventeenth"   : 17, \
    "eighteenth"    : 18, \
    "nineteenth"    : 19, \
    "twentieth"     : 20, \
    "thirtieth"     : 30, \
    "fortieth"      : 40, \
    "fiftieth"      : 50, \
    "sixtieth"      : 60, \
    "seventieth"    : 70, \
    "eightieth"     : 80, \
    "ninetieth"     : 90, \
    "hundredth"     : 100, \
    "thousandth"    : 1000, \
    "millionth"     : 1000000, \
    "billionth"     : 1000000000, \
])

varargs int denumerical(string num, status ordinal) {
    if(ordinal) {
        if(member(inordinal, num))
            return inordinal[num];
    } else {
        if(member(number, num))
            return number[num];
    }
    int sign = 1;
    int total = 0;
    int sub = 0;
    int value;
    string array parts = regexplode(num, " |-");
    if(sizeof(parts) >= 2 && parts[0] == "" && parts[1] == "-")
        sign = -1;
    for(int ix = 0, int iix = sizeof(parts); ix < iix; ix++) {
        string part = parts[ix];
        switch(part) {
        case "negative" :
        case "minus"    :
            sign = -1;
            continue;
        case ""         :
            continue;
        }
        if(ordinal && ix == iix - 1) {
            if(part[0] >= '0' && part[0] <= '9' && ends_with(part, "th"))
                value = to_int(part[..<3]);
            else if(member(Ordinals, part))
                value = Ordinals[part];
            else
                continue;
        } else {
            if(part[0] >= '0' && part[0] <= '9')
                value = to_int(part);
            else if(member(Numbers, part))
                value = Numbers[part];
            else
                continue;
        }
        if(value < 0) {
            sign = -1;
            value = - value;
        }
        if(value < 10) {
            if(sub >= 1000) {
                total += sub;
                sub = value;
            } else {
                sub += value;
            }
        } else if(value < 100) {
            if(sub < 10) {
                sub = 100 * sub + value;
            } else if(sub >= 1000) {
                total += sub;
                sub = value;
            } else {
                sub += value;  // add the tens, so "one hundred twenty" gives 120, not 2000
            }
        } else if(value < sub) {
            total += sub;
            sub = value;
        } else if(sub == 0) {
            sub = value;
        } else {
            sub *= value;
        }
    }
    total += sub;
    return sign * total;
}

#9


2  

Well, I was too late with an answer to this question, but I was working on a little test scenario that seems to have worked very well for me. I used a (simple, but ugly and large) regular expression to locate all the words for me. The expression is as follows:

(?<Value>(?:zero)|(?:one|first)|(?:two|second)|(?:three|third)|(?:four|fourth)|
(?:five|fifth)|(?:six|sixth)|(?:seven|seventh)|(?:eight|eighth)|(?:nine|ninth)|
(?:ten|tenth)|(?:eleven|eleventh)|(?:twelve|twelfth)|(?:thirteen|thirteenth)|
(?:fourteen|fourteenth)|(?:fifteen|fifteenth)|(?:sixteen|sixteenth)|
(?:seventeen|seventeenth)|(?:eighteen|eighteenth)|(?:nineteen|nineteenth)|
(?:twenty|twentieth)|(?:thirty|thirtieth)|(?:forty|fortieth)|(?:fifty|fiftieth)|
(?:sixty|sixtieth)|(?:seventy|seventieth)|(?:eighty|eightieth)|(?:ninety|ninetieth)|
(?<Magnitude>(?:hundred|hundredth)|(?:thousand|thousandth)|(?:million|millionth)|
(?:billion|billionth)))

Shown here with line breaks for formatting purposes.

Anyway, my method was to execute this RegEx with a library like PCRE, and then read back the named matches. It worked on all of the different examples listed in this question, minus the "One Half" types, as I didn't add them in, but as you can see, it wouldn't be hard to do so. This addresses a lot of issues. For example, it addresses the following items in the original question and other answers:

  1. cardinal/nominal or ordinal: "one" and "first"
  2. common spelling mistakes: "forty"/"fourty" (Note that it does not EXPLICITLY address this; that would be something you'd want to do before you passed the string to this parser. This parser sees this example as "FOUR"...)
  3. hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred"
  4. separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot
  5. colloquialisms: "thirty-something" (This also is not TOTALLY addressed, as what IS "something"? Well, this code finds this number as simply "30").

Now, rather than store this monster of a regular expression in your source, I was considering building this RegEx at runtime, using something like the following:

char *ones[] = {"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve",
  "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
char *tens[] = {"", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"};
char *ordinalones[] = { "", "first", "second", "third", "fourth", "fifth", "", "", "", "", "", "", "twelfth" };
char *ordinaltens[] = { "", "", "twentieth", "thirtieth", "fortieth", "fiftieth", "sixtieth", "seventieth", "eightieth", "ninetieth" };
and so on...

The easy part here is that we are only storing the words that matter. In the case of SIXTH, you'll notice that there isn't an entry for it, because it's just the normal number with TH tacked on... But ones like TWELVE need different attention.

Ok, so now we have the code to build our (ugly) RegEx; now we just execute it on our number strings.

One thing I would recommend is to filter out, or eat, the word "AND". It's not necessary, and only leads to other issues.

So, what you are going to want to do is set up a function that takes the named matches for "Magnitude", looks at all the possible magnitude values, and multiplies your current result by that magnitude. Then, you create a function that looks at the "Value" named matches, and returns an int (or whatever you are using), based on the value discovered there.

All VALUE matches are ADDED to your result, while magnitude matches multiply the result by the mag value. So, Two Hundred Fifty Thousand becomes "2", then "2 * 100", then "200 + 50", then "250 * 1000", ending up with 250000...
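
The add-and-multiply rule described above can be sketched in Python with named groups. This is an illustration under assumptions, not the author's vbScript or C code: the word tables are hypothetical and abbreviated, `alternation` and `accumulate` are made-up helper names, and unlike the expression shown earlier the two named groups here sit side by side instead of being nested.

```python
import re

# Hypothetical, abbreviated word tables; the regex is built from them at runtime.
VALUES = {"one": 1, "two": 2, "five": 5, "twenty": 20, "fifty": 50}
MAGNITUDES = {"hundred": 100, "thousand": 1000, "million": 1000000}


def alternation(words):
    # Longest words first, so a longer word is never shadowed by a shorter prefix.
    return "|".join(sorted(map(re.escape, words), key=len, reverse=True))


NUMBER_RE = re.compile(
    r"\b(?:(?P<Value>%s)|(?P<Magnitude>%s))\b"
    % (alternation(VALUES), alternation(MAGNITUDES)),
    re.IGNORECASE,
)


def accumulate(text):
    result = 0
    for m in NUMBER_RE.finditer(text):
        if m.group("Value"):
            result += VALUES[m.group("Value").lower()]        # VALUE: add
        else:
            result *= MAGNITUDES[m.group("Magnitude").lower()]  # magnitude: multiply
    return result


print(accumulate("Two Hundred Fifty Thousand"))
```

This reproduces the 250000 walk-through. Note that in this literal form every magnitude multiplies the whole accumulator, so "two thousand and one hundred" comes out as 200100 here; handling such inputs needs the kind of extra bookkeeping the answer alludes to (a separate running total per group, for instance).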

Just for fun, I wrote a vbScript version of this and it worked great with all the examples provided. Now, it doesn't support named matches, so I had to work a little harder to get the correct result, but I got it. Bottom line is, if it's a "VALUE" match, add it to your accumulator. If it's a magnitude match, multiply your accumulator by 100, 1000, 1000000, 1000000000, etc... This will provide you with some pretty amazing results, and all you have to do to adjust for things like "one half" is add them to your RegEx, put in a code marker for them, and handle them.

Well, I hope this post helps SOMEONE out there. If anyone wants, I can post the vbScript pseudo code that I used to test this with; however, it's not pretty code, and NOT production code.

If I may.. What is the final language this will be written in? C++, or something like a scripted language? Greg Hewgill's source will go a long way in helping understand how all of this comes together.

Let me know if I can be of any other help. Sorry, I only know English/American, so I can't help you with the other languages.

#10


0  

I was converting ordinal edition statements from early modern books (e.g. "2nd edition", "Editio quarta") to integers and needed support for ordinals 1-100 in English and ordinals 1-10 in a few Romance languages. Here's what I came up with in Python:

def get_data_mapping():
  data_mapping = {
    "1st": 1,
    "2nd": 2,
    "3rd": 3,

    "tenth": 10,
    "eleventh": 11,
    "twelfth": 12,
    "thirteenth": 13,
    "fourteenth": 14,
    "fifteenth": 15,
    "sixteenth": 16,
    "seventeenth": 17,
    "eighteenth": 18,
    "nineteenth": 19,
    "twentieth": 20,

    "new": 2,
    "newly": 2,
    "nova": 2,
    "nouvelle": 2,
    "altera": 2,
    "andere": 2,

    # latin
    "primus": 1,
    "secunda": 2,
    "tertia": 3,
    "quarta": 4,
    "quinta": 5,
    "sexta": 6,
    "septima": 7,
    "octava": 8,
    "nona": 9,
    "decima": 10,

    # italian
    "primo": 1,
    "secondo": 2,
    "terzo": 3,
    "quarto": 4,
    "quinto": 5,
    "sesto": 6,
    "settimo": 7,
    "ottavo": 8,
    "nono": 9,
    "decimo": 10,

    # french
    "premier": 1,
    "deuxième": 2,
    "troisième": 3,
    "quatrième": 4,
    "cinquième": 5,
    "sixième": 6,
    "septième": 7,
    "huitième": 8,
    "neuvième": 9,
    "dixième": 10,

    # spanish
    "primero": 1,
    "segundo": 2,
    "tercero": 3,
    "cuarto": 4,
    "quinto": 5,
    "sexto": 6,
    "septimo": 7,
    "octavo": 8,
    "noveno": 9,
    "decimo": 10
  }

  # create 4th, 5th, ... 19th (20th is created by the next loop)
  for i in range(16):
    data_mapping[str(4+i) + "th"] = 4+i

  # create 20th, 21st, 22nd, ... 99th
  for i in range(80):
    last_char = str(i)[-1]

    if last_char == "0":
      data_mapping[str(20+i) + "th"] = 20+i

    elif last_char == "1":
      data_mapping[str(20+i) + "st"] = 20+i

    elif last_char == "2":
      data_mapping[str(20+i) + "nd"] = 20+i

    elif last_char == "3":
      data_mapping[str(20+i) + "rd"] = 20+i

    else:
      data_mapping[str(20+i) + "th"] = 20+i

  ordinals = [
    "first", "second", "third", 
    "fourth", "fifth", "sixth", 
    "seventh", "eighth", "ninth"
  ]

  # create first, second ... ninth
  for c, i in enumerate(ordinals):
    data_mapping[i] = c+1

  # create twenty-first, twenty-second ... ninety-ninth
  for ci, i in enumerate([
    "twenty", "thirty", "forty", 
    "fifty", "sixty", "seventy", 
    "eighty", "ninety"
  ]):
    for cj, j in enumerate(ordinals):
      data_mapping[i + "-" + j] = 20 + (ci*10) + (cj+1)
    data_mapping[i.replace("y", "ieth")] = 20 + (ci*10)

  return data_mapping

#11


-1  

Try

  1. Open an HTTP Request to "http://www.google.com/search?q=" + number + "+in+decimal".

  2. Parse the result for your number.

  3. Cache the number / result pairs to lessen the requests over time.

#12


-2  

One place to start looking is the gnu get_date lib, which can parse just about any English textual date into a timestamp. While not exactly what you're looking for, their solution to a similar problem could provide a lot of useful clues.

#1


43  

I was playing around with a PEG parser to do what you wanted (and may post that as a separate answer later) when I noticed that there's a very simple algorithm that does a remarkably good job with common forms of numbers in English, Spanish, and German, at the very least.

当我注意到有一个非常简单的算法可以很好地处理英语,西班牙语和英语中的常见数字形式时,我正在玩一个PEG解析器来做你想做的事情(可能会在以后单独发布)。德国人,至少。

Working with English for example, you need a dictionary that maps words to values in the obvious way:

例如,使用英语,您需要一个以明显的方式将单词映射到值的字典:

"one" -> 1, "two" -> 2, ... "twenty" -> 20,
"dozen" -> 12, "score" -> 20, ...
"hundred" -> 100, "thousand" -> 1000, "million" -> 1000000

...and so forth

......等等

The algorithm is just:

算法只是:

total = 0
prior = null
for each word w
    v <- value(w) or next if no value defined
    prior <- case
        when prior is null:       v
        when prior > v:     prior+v
        else                prior*v
        else
    if w in {thousand,million,billion,trillion...}
        total <- total + prior
        prior <- null
total = total + prior unless prior is null

For example, this progresses as follows:

例如,这进展如下:

total    prior      v     unconsumed string
    0      _              four score and seven 
                    4     score and seven 
    0      4              
                   20     and seven 
    0     80      
                    _     seven 
    0     80      
                    7 
    0     87      
   87

total    prior      v     unconsumed string
    0        _            two million four hundred twelve thousand eight hundred seven
                    2     million four hundred twelve thousand eight hundred seven
    0        2
                  1000000 four hundred twelve thousand eight hundred seven
2000000      _
                    4     hundred twelve thousand eight hundred seven
2000000      4
                    100   twelve thousand eight hundred seven
2000000    400
                    12    thousand eight hundred seven
2000000    412
                    1000  eight hundred seven
2000000  412000
                    1000  eight hundred seven
2412000     _
                      8   hundred seven
2412000     8
                     100  seven
2412000   800
                     7
2412000   807
2412807

And so on. I'm not saying it's perfect, but for a quick and dirty it does quite well.

等等。我并不是说这是完美的,但是对于快速而肮脏的它来说它确实很好。


Addressing your specific list on edit:

在编辑时解决您的特定列表:

  1. cardinal/nominal or ordinal: "one" and "first" -- just put them in the dictionary
  2. 基数/名义或序数:“一”和“第一” - 只需将它们放入字典中

  3. english/british: "fourty"/"forty" -- ditto
  4. 英语/英语:“fourty”/“forty” - 同上

  5. hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred" -- works as is
  6. 数百/数千:2100 - >“二十一”,还有“二千一百” - 按原样工作

  7. separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot -- just define "next word" to be the longest prefix that matches a defined word, or up to the next non-word if none do, for a start
  8. 分隔符:“十一二百五十二”,还有“十一五十二”或“十一二百五十二”等等 - 只是将“下一个单词”定义为与定义的单词匹配的最长前缀,或者直到下一个单词如果没有,那就是非单词,一开始

  9. colloqialisms: "thirty-something" -- works
  10. colloqialisms:“三十多岁” - 作品

  11. fragments: 'one third', 'two fifths' -- uh, not yet...
  12. 碎片:“三分之一”,“五分之二” - 呃,还没......

  13. common names: 'a dozen', 'half' -- works; you can even do things like "a half dozen"
  14. 俗名:'打打','半' - 作品;你甚至可以做“半打”这样的事情

Number 6 is the only one I don't have a ready answer for, and that's because of the ambiguity between ordinals and fractions (in English at least) added to the fact that my last cup of coffee was many hours ago.

6号是唯一一个我没有准备好答案的人,这是因为序数和分数之间的模糊性(至少在英语中)增加了我的最后一杯咖啡在几个小时前的事实。

#2


11  

It's not an easy issue, and I know of no library to do it. I might sit down and try to write something like this sometime. I'd do it in either Prolog, Java or Haskell, though. As far as I can see, there are several issues:

这不是一个简单的问题,我知道没有图书馆可以做到这一点。我可能会坐下来尝试写一些这样的东西。不过,我会在Prolog,Java或Haskell中做到这一点。据我所知,有几个问题:

  • Tokenization: sometimes, numbers are written eleven hundred fifty two, but I've seen elevenhundred fiftytwo or eleven-hundred-fifty-two and whatnot. One would have to conduct a survey on what forms are actually in use. This might be especially tricky for Hebrew.
  • 标记化:有时,数字写成1125,但我已经看过十一五十二或十一点五十二以及诸如此类的东西。人们必须对实际使用的形式进行调查。这对希伯来语来说可能特别棘手。

  • Spelling mistakes: that's not so hard. You have a limited amount of words, and a bit of Levenshtein-distance magic should do the trick.
  • 拼写错误:这不是那么难。你的词数有限,而且一点Levenshtein距离法术应该可以解决问题。

  • Alternate forms, like you already mentioned, exist. This includes ordinal/cardinal numbers, as well as forty/fourty and...
  • 像您已经提到的替代形式存在。这包括序数/基数,以及四十四十和......

  • ... common names or commonly used phrases and NEs (named entities). Would you want to extract 30 from the Thirty Years War or 2 from World War II?
  • ...通用名称或常用短语和NE(命名实体)。你想从三十年战争中抽取30或从第二次世界大战中抽取2吗?

  • Roman numerals, too?
  • 罗马数字呢?

  • Colloqialisms, such as "thirty-something" and "three Euro and shrapnel", which I wouldn't know how to treat.
  • Colloqialisms,如“三十多岁”和“三欧元和弹片”,我不知道如何对待。

If you are interested in this, I could give it a shot this weekend. My idea is probably using UIMA and tokenizing with it, then going on to further tokenize/disambiguate and finally translate. There might be more issues, let's see if I can come up with some more interesting things.

如果你对此感兴趣,我可以在本周末试一试。我的想法可能是使用UIMA并使用它进行标记,然后继续进行标记化/消除歧义并最终翻译。可能会有更多问题,让我们看看我是否可以提出一些更有趣的事情。

Sorry, this is not a real answer yet, just an extension to your question. I'll let you know if I find/write something.

对不起,这还不是一个真正的答案,只是你问题的扩展。如果我发现/写东西,我会告诉你的。

By the way, if you are interested in the semantics of numerals, I just found an interesting paper by Friederike Moltmann, discussing some issues regarding the logic interpretation of numerals.

顺便说一句,如果你对数字的语义感兴趣,我刚刚发现了Friederike Moltmann的一篇有趣的论文,讨论了关于数字逻辑解释的一些问题。

#3


10  

I have some code I wrote a while ago: text2num. This does some of what you want, except it does not handle ordinal numbers. I haven't actually used this code for anything, so it's largely untested!

#4


7  

Use the Python pattern-en library:

>>> from pattern.en import number
>>> number('two thousand fifty and a half')
2050.5

#5


5  

You should keep in mind that Europe and America count differently.

European standard:

One Thousand
One Million
One Thousand Millions (British also use Milliard)
One Billion
One Thousand Billions
One Trillion
One Thousand Trillions

Here is a small reference on it.

A simple way to see the difference is the following:

(American counting Trillion) == (European counting Billion)
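
In a parser, this difference could be handled by selecting a magnitude table per convention. The following Python sketch assumes the usual short-scale and long-scale values; the table names and the `long_scale` flag are made up for illustration.

```python
# Short scale (American): each new name is 1000 times the previous one.
SHORT_SCALE = {
    "thousand": 10**3,
    "million": 10**6,
    "billion": 10**9,
    "trillion": 10**12,
}

# Long scale (traditional European): each new "-illion" is a million times
# the previous one, with "milliard" filling the gap at 10**9.
LONG_SCALE = {
    "thousand": 10**3,
    "million": 10**6,
    "milliard": 10**9,
    "billion": 10**12,
    "trillion": 10**18,
}


def magnitude(word: str, long_scale: bool = False) -> int:
    """Look up a magnitude word under the chosen counting convention."""
    table = LONG_SCALE if long_scale else SHORT_SCALE
    return table[word.lower()]


# The equality from the text: an American trillion is a European billion.
print(magnitude("trillion") == magnitude("billion", long_scale=True))  # True
```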

#6


4  

Ordinal numbers are not applicable because they can't be joined in meaningful ways with other numbers in language (...at least in English)

e.g. one hundred and first, eleven second, etc...

However, there is another English/American caveat with the word 'and'

i.e.

one hundred and one (English)
one hundred one (American)

Also, the use of 'a' to mean one in English

a thousand = one thousand

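Both caveats can be handled in a small normalization pass before the actual parsing. This is only a sketch; the function name and the decision to treat "an" like "a" are assumptions.

```python
import re
from typing import List


def normalize_tokens(text: str) -> List[str]:
    """Lowercase, split on whitespace and hyphens, drop "and", read "a"/"an" as "one"."""
    tokens = re.split(r"[\s-]+", text.lower().strip())
    normalized = []
    for token in tokens:
        if not token or token == "and":
            continue  # "one hundred and one" and "one hundred one" become identical
        if token in ("a", "an"):
            token = "one"  # "a thousand" is read as "one thousand"
        normalized.append(token)
    return normalized


print(normalize_tokens("one hundred and one"))  # ['one', 'hundred', 'one']
print(normalize_tokens("A thousand"))           # ['one', 'thousand']
```

Note that dropping "and" unconditionally is too blunt for phrases like "two and a half", where it separates the whole part from the fraction.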

...On a side note Google's calculator does an amazing job of this.

one hundred and three thousand times the speed of light

And even...

two thousand and one hundred plus a dozen

...wtf?!? a score plus a dozen in roman numerals

#7


3  

Here is an extremely robust solution in Clojure.

AFAIK it is a unique implementation approach.

;----------------------------------------------------------------------
; numbers.clj
; written by: Mike Mattie codermattie@gmail.com
;----------------------------------------------------------------------
; the code below uses only clojure.core, so no extra dependencies are needed
(ns operator.numbers)

(def number-word-table {
  "zero"          0
  "one"           1
  "two"           2
  "three"         3
  "four"          4
  "five"          5
  "six"           6
  "seven"         7
  "eight"         8
  "nine"          9
  "ten"           10
  "eleven"        11
  "twelve"        12
  "thirteen"      13
  "fourteen"      14
  "fifteen"       15
  "sixteen"       16
  "seventeen"     17
  "eighteen"      18
  "nineteen"      19
  "twenty"        20
  "thirty"        30
  "forty"         40
  "fourty"        40   ; common misspelling, accepted as well
  "fifty"         50
  "sixty"         60
  "seventy"       70
  "eighty"        80
  "ninety"        90
})

(def multiplier-word-table {
  "hundred"       100
  "thousand"      1000
})

(defn sum-words-to-number [ words ]
  (apply + (map (fn [ word ] (number-word-table word)) words)) )

; are you down with the sickness ?
(defn words-to-number [ words ]
  (let
    [ n           (count words)

      multipliers (filter (fn [x] (not (false? x))) (map-indexed
                                                      (fn [ i word ]
                                                        (if (contains? multiplier-word-table word)
                                                          (vector i (multiplier-word-table word))
                                                          false))
                                                      words) )

      x           (ref 0) ]

    (loop [ indices (reverse (conj (reverse multipliers) (vector n 1)))
            left    0
            combine + ]
      (let
        [ right (first indices) ]

        (dosync (alter x combine (* (if (> (- (first right) left) 0)
                                      (sum-words-to-number (subvec words left (first right)))
                                      1)
                                    (second right)) ))

        (when (> (count (rest indices)) 0)
          (recur (rest indices) (inc (first right))
            (if (= (inc (first right)) (first (second indices)))
              *
              +))) ) )
    @x ))

Here are some examples

(operator.numbers/words-to-number ["six" "thousand" "five" "hundred" "twenty" "two"]) ; => 6522
(operator.numbers/words-to-number ["fifty" "seven" "hundred"])                        ; => 5700
(operator.numbers/words-to-number ["hundred"])                                        ; => 100

#8


2  

My LPC implementation of some of your requirements (American English only):

internal mapping inordinal = ([]);
internal mapping number = ([]);

#define Numbers ([\
    "zero"        : 0, \
    "one"         : 1, \
    "two"         : 2, \
    "three"       : 3, \
    "four"        : 4, \
    "five"        : 5, \
    "six"         : 6, \
    "seven"       : 7, \
    "eight"       : 8, \
    "nine"        : 9, \
    "ten"         : 10, \
    "eleven"      : 11, \
    "twelve"      : 12, \
    "thirteen"    : 13, \
    "fourteen"    : 14, \
    "fifteen"     : 15, \
    "sixteen"     : 16, \
    "seventeen"   : 17, \
    "eighteen"    : 18, \
    "nineteen"    : 19, \
    "twenty"      : 20, \
    "thirty"      : 30, \
    "forty"       : 40, \
    "fifty"       : 50, \
    "sixty"       : 60, \
    "seventy"     : 70, \
    "eighty"      : 80, \
    "ninety"      : 90, \
    "hundred"     : 100, \
    "thousand"    : 1000, \
    "million"     : 1000000, \
    "billion"     : 1000000000, \
])

#define Ordinals ([\
    "zeroth"        : 0, \
    "first"         : 1, \
    "second"        : 2, \
    "third"         : 3, \
    "fourth"        : 4, \
    "fifth"         : 5, \
    "sixth"         : 6, \
    "seventh"       : 7, \
    "eighth"        : 8, \
    "ninth"         : 9, \
    "tenth"         : 10, \
    "eleventh"      : 11, \
    "twelfth"       : 12, \
    "thirteenth"    : 13, \
    "fourteenth"    : 14, \
    "fifteenth"     : 15, \
    "sixteenth"     : 16, \
    "seventeenth"   : 17, \
    "eighteenth"    : 18, \
    "nineteenth"    : 19, \
    "twentieth"     : 20, \
    "thirtieth"     : 30, \
    "fortieth"      : 40, \
    "fiftieth"      : 50, \
    "sixtieth"      : 60, \
    "seventieth"    : 70, \
    "eightieth"     : 80, \
    "ninetieth"     : 90, \
    "hundredth"     : 100, \
    "thousandth"    : 1000, \
    "millionth"     : 1000000, \
    "billionth"     : 1000000000, \
])

varargs int denumerical(string num, status ordinal) {
    if(ordinal) {
        if(member(inordinal, num))
            return inordinal[num];
    } else {
        if(member(number, num))
            return number[num];
    }
    int sign = 1;
    int total = 0;
    int sub = 0;
    int value;
    string array parts = regexplode(num, " |-");
    if(sizeof(parts) >= 2 && parts[0] == "" && parts[1] == "-")
        sign = -1;
    for(int ix = 0, int iix = sizeof(parts); ix < iix; ix++) {
        string part = parts[ix];
        switch(part) {
        case "negative" :
        case "minus"    :
            sign = -1;
            continue;
        case ""         :
            continue;
        }
        if(ordinal && ix == iix - 1) {
            if(part[0] >= '0' && part[0] <= '9' && ends_with(part, "th"))
                value = to_int(part[..<3]);
            else if(member(Ordinals, part))
                value = Ordinals[part];
            else
                continue;
        } else {
            if(part[0] >= '0' && part[0] <= '9')
                value = to_int(part);
            else if(member(Numbers, part))
                value = Numbers[part];
            else
                continue;
        }
        if(value < 0) {
            sign = -1;
            value = - value;
        }
        if(value < 10) {
            if(sub >= 1000) {
                total += sub;
                sub = value;
            } else {
                sub += value;
            }
        } else if(value < 100) {
            if(sub < 10) {
                sub = 100 * sub + value;
            } else if(sub >= 1000) {
                total += sub;
                sub = value;
            } else {
                sub *= value;
            }
        } else if(value < sub) {
            total += sub;
            sub = value;
        } else if(sub == 0) {
            sub = value;
        } else {
            sub *= value;
        }
    }
    total += sub;
    return sign * total;
}

#9


2  

Well, I'm late with an answer to this question, but I was working on a little test scenario that seems to have worked very well for me. I used a (simple, but ugly and large) regular expression to locate all the number words. The expression is as follows:

(?<Value>(?:zero)|(?:one|first)|(?:two|second)|(?:three|third)|(?:four|fourth)|
(?:five|fifth)|(?:six|sixth)|(?:seven|seventh)|(?:eight|eighth)|(?:nine|ninth)|
(?:ten|tenth)|(?:eleven|eleventh)|(?:twelve|twelfth)|(?:thirteen|thirteenth)|
(?:fourteen|fourteenth)|(?:fifteen|fifteenth)|(?:sixteen|sixteenth)|
(?:seventeen|seventeenth)|(?:eighteen|eighteenth)|(?:nineteen|nineteenth)|
(?:twenty|twentieth)|(?:thirty|thirtieth)|(?:forty|fortieth)|(?:fifty|fiftieth)|
(?:sixty|sixtieth)|(?:seventy|seventieth)|(?:eighty|eightieth)|(?:ninety|ninetieth)|
(?<Magnitude>(?:hundred|hundredth)|(?:thousand|thousandth)|(?:million|millionth)|
(?:billion|billionth)))

Shown here with line breaks for formatting purposes.

Anyway, my method was to execute this RegEx with a library like PCRE and then read back the named matches. It worked on all of the different examples listed in this question, minus the "One Half" types, which I didn't add; as you can see, it wouldn't be hard to do so. This addresses a lot of issues. For example, it addresses the following items from the original question and other answers:

  1. cardinal/nominal or ordinal: "one" and "first"

  2. common spelling mistakes: "forty"/"fourty" (Note that it does not EXPLICITLY address this; that would be something you'd want to do before you pass the string to this parser. This parser sees this example as "FOUR"...)

  3. hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred"

  4. separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot

  5. colloquialisms: "thirty-something" (This also is not TOTALLY addressed, as what IS "something"? Well, this code finds this number as simply "30".)

Now, rather than store this monster of a regular expression in your source, I was considering building this RegEx at runtime, using something like the following:

const char *ones[] = {"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve",
  "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
const char *tens[] = {"", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"};
/* an empty string means: the ordinal is just the cardinal with "th" appended */
const char *ordinalones[] = { "", "first", "second", "third", "fourth", "fifth", "", "", "eighth", "ninth", "", "", "twelfth" };
const char *ordinaltens[] = { "", "", "twentieth", "thirtieth", "fortieth", "fiftieth", "sixtieth", "seventieth", "eightieth", "ninetieth" };
/* and so on... */
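
As a rough illustration of building the expression at runtime, here is a Python sketch (not the PCRE/C code described above). The word lists are abbreviated to cardinals, the helper names are made up, and Python spells named groups `(?P<name>...)`. One detail worth noting: alternatives are tried in order, so the sketch sorts them longest first; otherwise "fourteen" would be matched as "four".

```python
import re

# Abbreviated word lists for illustration; ordinals would be added the same way.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
MAGNITUDES = ["hundred", "thousand", "million", "billion"]


def alternation(words):
    """Join words into a regex alternation, longest first so "fourteen" beats "four"."""
    ordered = sorted(words, key=len, reverse=True)
    return "|".join(re.escape(word) for word in ordered)


NUMBER_RE = re.compile(
    r"(?P<Magnitude>%s)|(?P<Value>%s)" % (alternation(MAGNITUDES), alternation(ONES + TENS)),
    re.IGNORECASE,
)


def find_number_words(text):
    """Return (group name, matched word) pairs in order of appearance."""
    return [(m.lastgroup, m.group().lower()) for m in NUMBER_RE.finditer(text)]


print(find_number_words("elevenhundred fiftytwo"))
# [('Value', 'eleven'), ('Magnitude', 'hundred'), ('Value', 'fifty'), ('Value', 'two')]
```

Because the pattern has no word boundaries, it copes with run-together forms such as "elevenhundred", but it will also find "one" inside "someone"; which trade-off is right depends on the input.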

The easy part here is that we are only storing the words that matter. In the case of SIXTH, you'll notice that there isn't an entry for it, because it's just its normal number with TH tacked on... But ones like TWELVE need different attention.

Ok, so now that we have the code to build our (ugly) RegEx, we just execute it on our number strings.

One thing I would recommend is to filter out, or eat, the word "AND". It's not necessary, and only leads to other issues.

So, what you are going to want to do is set up a function that takes the named matches for "Magnitude", looks at all the possible magnitude values, and multiplies your current result by that magnitude. Then, you create a function that looks at the "Value" named matches and returns an int (or whatever you are using), based on the value discovered there.

All VALUE matches are ADDED to your result, while magnitude matches multiply the result by the magnitude value. So, Two Hundred Fifty Thousand becomes "2", then "2 * 100", then "200 + 50", then "250 * 1000", ending up with 250000...
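
Here is a Python sketch of such an accumulator. It is a variant, not the vbScript version mentioned in this answer: applied literally, the add/multiply rule turns "two thousand one hundred" into ((2 * 1000) + 1) * 100 = 200100, so the sketch keeps a subtotal for the current group and only folds it into the total when a magnitude of a thousand or more is seen. The table and function names are assumptions.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

VALUES = {word: i for i, word in enumerate(ONES)}
VALUES.update({word: 10 * (i + 2) for i, word in enumerate(TENS)})
MAGNITUDES = {"hundred": 100, "thousand": 10**3, "million": 10**6, "billion": 10**9}


def words_to_int(tokens):
    """Convert a list of lowercase number words (with "and" already removed) to an int."""
    total = 0    # groups already closed by thousand/million/billion
    current = 0  # the group being built, e.g. "two hundred fifty"
    for token in tokens:
        if token in VALUES:
            current += VALUES[token]
        elif token == "hundred":
            current = max(current, 1) * 100  # a bare "hundred" counts as 100
        elif token in MAGNITUDES:
            total += max(current, 1) * MAGNITUDES[token]
            current = 0
        else:
            raise ValueError("not a number word: %r" % token)
    return total + current


print(words_to_int("two hundred fifty thousand".split()))  # 250000
print(words_to_int("two thousand one hundred".split()))    # 2100
print(words_to_int("twenty one hundred".split()))          # 2100
```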

Just for fun, I wrote a vbScript version of this and it worked great with all the examples provided. vbScript doesn't support named matches, so I had to work a little harder to get the correct result, but I got it. Bottom line: if it's a "VALUE" match, add it to your accumulator. If it's a magnitude match, multiply your accumulator by 100, 1000, 1000000, 1000000000, etc... This will provide you with some pretty amazing results, and all you have to do to handle things like "one half" is add them to your RegEx, put in a code marker for them, and handle them.

Well, I hope this post helps SOMEONE out there. If anyone wants, I can post the vbScript pseudo code that I used to test this; however, it's not pretty code, and NOT production code.

If I may.. What is the final language this will be written in? C++, or something like a scripting language? Greg Hewgill's source will go a long way in helping understand how all of this comes together.

Let me know if I can be of any other help. Sorry, I only know English/American, so I can't help you with the other languages.

#10


0  

I was converting ordinal edition statements from early modern books (e.g. "2nd edition", "Editio quarta") to integers and needed support for ordinals 1-100 in English and ordinals 1-10 in a few Romance languages. Here's what I came up with in Python:

def get_data_mapping():
  data_mapping = {
    "1st": 1,
    "2nd": 2,
    "3rd": 3,

    "tenth": 10,
    "eleventh": 11,
    "twelfth": 12,
    "thirteenth": 13,
    "fourteenth": 14,
    "fifteenth": 15,
    "sixteenth": 16,
    "seventeenth": 17,
    "eighteenth": 18,
    "nineteenth": 19,
    "twentieth": 20,

    "new": 2,
    "newly": 2,
    "nova": 2,
    "nouvelle": 2,
    "altera": 2,
    "andere": 2,

    # latin
    "primus": 1,
    "secunda": 2,
    "tertia": 3,
    "quarta": 4,
    "quinta": 5,
    "sexta": 6,
    "septima": 7,
    "octava": 8,
    "nona": 9,
    "decima": 10,

    # italian
    "primo": 1,
    "secondo": 2,
    "terzo": 3,
    "quarto": 4,
    "quinto": 5,
    "sesto": 6,
    "settimo": 7,
    "ottavo": 8,
    "nono": 9,
    "decimo": 10,

    # french
    "premier": 1,
    "deuxième": 2,
    "troisième": 3,
    "quatrième": 4,
    "cinquième": 5,
    "sixième": 6,
    "septième": 7,
    "huitième": 8,
    "neuvième": 9,
    "dixième": 10,

    # spanish
    "primero": 1,
    "segundo": 2,
    "tercero": 3,
    "cuarto": 4,
    "quinto": 5,
    "sexto": 6,
    "septimo": 7,
    "octavo": 8,
    "noveno": 9,
    "decimo": 10
  }

  # create 4th, 5th, ... 19th
  for i in range(16):
    data_mapping[str(4+i) + "th"] = 4+i

  # create 20th, 21st, ... 99th
  for i in range(80):
    last_char = str(i)[-1]

    if last_char == "0":
      data_mapping[str(20+i) + "th"] = 20+i

    elif last_char == "1":
      data_mapping[str(20+i) + "st"] = 20+i

    elif last_char == "2":
      data_mapping[str(20+i) + "nd"] = 20+i

    elif last_char == "3":
      data_mapping[str(20+i) + "rd"] = 20+i

    else:
      data_mapping[str(20+i) + "th"] = 20+i

  ordinals = [
    "first", "second", "third", 
    "fourth", "fifth", "sixth", 
    "seventh", "eighth", "ninth"
  ]

  # create first, second ... ninth
  for c, i in enumerate(ordinals):
    data_mapping[i] = c+1

  # create twenty-first, twenty-second ... ninety-ninth
  for ci, i in enumerate([
    "twenty", "thirty", "forty", 
    "fifty", "sixty", "seventy", 
    "eighty", "ninety"
  ]):
    for cj, j in enumerate(ordinals):
      data_mapping[i + "-" + j] = 20 + (ci*10) + (cj+1)
    data_mapping[i.replace("y", "ieth")] = 20 + (ci*10)

  return data_mapping
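
One way such a mapping might be applied to an edition statement is to scan its tokens and return the first one found in the mapping. The sketch below uses a tiny stand-in dictionary instead of the full result of get_data_mapping(), and the function name is hypothetical.

```python
import re

# Tiny stand-in for the dictionary returned by get_data_mapping().
SAMPLE_MAPPING = {"1st": 1, "2nd": 2, "3rd": 3, "second": 2, "quarta": 4, "troisième": 3}


def edition_number(statement, mapping):
    """Return the number for the first token found in mapping, or None."""
    # [\w-]+ keeps hyphenated forms such as "twenty-first" together
    for token in re.findall(r"[\w-]+", statement.lower()):
        if token in mapping:
            return mapping[token]
    return None


print(edition_number("2nd edition", SAMPLE_MAPPING))    # 2
print(edition_number("Editio quarta", SAMPLE_MAPPING))  # 4
```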

#11


-1  

Try

  1. Open an HTTP request to "http://www.google.com/search?q=" + number + "+in+decimal".

  2. Parse the result for your number.

  3. Cache the number/result pairs to lessen the requests over time.
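
The URL building and caching steps could look roughly like this in Python. The request and the parsing of the result page are deliberately left to a caller-supplied function (here called `resolve`, a hypothetical name), since the format of the result page is not something this sketch can assume.

```python
from urllib.parse import quote_plus


def build_query_url(number_text):
    """Build the search URL from step 1 for a spelled-out number."""
    return "http://www.google.com/search?q=" + quote_plus(number_text) + "+in+decimal"


class CachedResolver:
    """Wrap a resolver function so each distinct phrase is only resolved once."""

    def __init__(self, resolve):
        self._resolve = resolve  # hypothetical callable: phrase -> number (request + parsing)
        self._cache = {}

    def __call__(self, number_text):
        # Normalize case and spacing so equivalent phrases share one cache entry.
        key = " ".join(number_text.lower().split())
        if key not in self._cache:
            self._cache[key] = self._resolve(key)
        return self._cache[key]


print(build_query_url("two thousand"))
# http://www.google.com/search?q=two+thousand+in+decimal
```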

#12


-2  

One place to start looking is the GNU get_date library, which can parse just about any English textual date into a timestamp. While not exactly what you're looking for, its solution to a similar problem could provide a lot of useful clues.