Why doesn't this MySQL Double Metaphone function work properly?

Date: 2022-09-18 12:30:06

I am just learning about the Metaphone and Double Metaphone search algorithms, and I have a few questions. Via the Metaphone Wiki page, I found a couple of sources with implementations, a MySQL implementation in particular. I wanted to try it out on a test database of mine, so I first imported the metaphone.sql file (containing the double metaphone function) found here.

Right now, I have a table, country, that has a list of all countries in the 'name' column, e.g. 'Afghanistan', 'Albania', 'Algeria', etc. So, first, I wanted to actually create a new column in the table to store the Double Metaphone string of each country. I ran the following code:

UPDATE country SET NameDM = dm(name)
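
The UPDATE above assumes the NameDM column already exists. A minimal sketch of the full sequence might look like the following; the VARCHAR(55) width and the index name idx_country_namedm are assumptions made for illustration, not something taken from metaphone.sql:

```sql
-- Assumed column definition; adjust the width to whatever dm() returns in your setup.
ALTER TABLE country ADD COLUMN NameDM VARCHAR(55);

-- Populate the new column once, using the dm() function imported from metaphone.sql.
UPDATE country SET NameDM = dm(name);

-- Optional, hypothetical index name: lets equality lookups on NameDM use an index.
CREATE INDEX idx_country_namedm ON country (NameDM);
```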

Everything worked correctly. Afghanistan's metaphone string is 'AFKNSTN', Albania's is 'ALPN', Algeria's is 'ALKR;ALJR', etc. "Awesome," I thought.

However, when I tried to query the table, I got no results. Per the author of metaphone.sql, I adhered to the syntax of the following SQL statement:

SELECT Name FROM tblPeople WHERE dm(Name) = dm(@search)

So, I changed this code to the following:

SELECT * FROM country WHERE dm(name) = dm(@search)

Of course, I changed "@search" to whatever search term I was looking for, but I got 0 results after each and every SQL query.
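
For reference, one way to supply the search term without editing the statement is a MySQL user variable. This is only a sketch; 'Albania' is an arbitrary example value. A user variable that has never been assigned evaluates to NULL, and what dm() returns for a NULL argument depends on the implementation, so it is worth checking the function result on its own before comparing:

```sql
-- Assign the search term first; an unassigned user variable evaluates to NULL.
SET @search = 'Albania';

-- Inspect what the function produces for the term by itself.
SELECT @search AS term, dm(@search) AS term_dm;

-- Then run the comparison from the question.
SELECT * FROM country WHERE dm(name) = dm(@search);
```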

Could anyone explain this issue? Am I missing something important, or am I just plain misunderstanding the Metaphone algorithm?

Thank you!

3 Answers

#1


2  

Take a close look at the collation, character set, and encoding (these can be defined down to the column level). A collation defines how strings are compared, and each character set implies a default collation. Maybe your literal string has a different character set from the column, causing the string comparison to fail.

Even the output of the following statements may be revealing:

select name, length(name), char_length(name), @search, length(@search), char_length(@search) from tbl;

show variables like 'character%';

show create table tbl;
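
Along the same lines, MySQL's built-in CHARSET(), COLLATION(), and HEX() functions can show directly whether the stored value and the function result differ. A small sketch, assuming the country table and NameDM column from the question:

```sql
-- Character set and collation of the stored column vs. the dm() result.
SELECT CHARSET(NameDM)   AS stored_charset,
       COLLATION(NameDM) AS stored_collation,
       CHARSET(dm(name))   AS computed_charset,
       COLLATION(dm(name)) AS computed_collation
FROM country
LIMIT 1;

-- Byte-level view: stray whitespace or unexpected bytes show up in the hex dump.
SELECT name, NameDM, HEX(NameDM) AS stored_hex, HEX(dm(name)) AS computed_hex
FROM country
WHERE name = 'Albania';
```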

#2


3  

When comparing dm() outputs I use the following function to allow a further level of fuzziness. A direct equality check fails for a significant number of names, e.g. dm('smith') != dm('schmitt'), including common misspellings of my own name.

The function returns a match weighting between 0.0 and 1.0 (I hope), which lets me rank each returned row and keep the useful ones. A threshold of 0.3 is quite good for capturing odd pronunciations; 0.5 is more usual.

i.e. dmcompare(dm("boothroyd"), dm("boofreed")) = 0.3
dmcompare(dm("smith"), dm("scmitt")) = 0.5

Notice that this compares double metaphone strings, not the original strings. This is for performance reasons: my DB contains a column for the metaphone as well as the original string.

    CREATE FUNCTION `dmcompare`(leftValue VARCHAR(55), rightValue VARCHAR(55)) 
        RETURNS DECIMAL(2,1) 
    NO SQL
    BEGIN
    -- -------------------------------------------------------------------------------------
    -- Compare two (double) metaphone strings for potential similarity, i.e.
    --    dm("smith") != dm("schmitt")  :: "SM0;XMT" != "XMT;SMT"
    --    dmcompare(dm('smith'), dm('schmitt')) returns 0.5
    -- @author: P.Boothroyd
    -- @version: 0.9, 08/01/2013
    -- The values here can still be played with
    -- (c) GNU P L - feel free to share and adapt, but please acknowledge the original code
    -- -------------------------------------------------------------------------------------
        DECLARE leftPri, leftSec, rightPri, rightSec VARCHAR(55) DEFAULT '';
        DECLARE sepPos INT;
        DECLARE retValue DECIMAL(2,1);
        DECLARE partMatch BOOLEAN;

        -- Extract the metaphone tags
        SET sepPos = LOCATE(";", leftValue);
        IF sepPos = 0 THEN
            SET sepPos = LENGTH(leftValue) + 1;
        END IF;
        SET leftPri = LEFT(leftValue, sepPos - 1);
        SET leftSec = MID(leftValue, sepPos + 1, LENGTH( leftValue ) - sepPos);

        SET sepPos = LOCATE(";", rightValue);
        IF sepPos = 0 THEN
            SET sepPos = LENGTH(rightValue) + 1;
        END IF;
        SET rightPri = LEFT(rightValue, sepPos - 1);
        SET rightSec = MID(rightValue, sepPos + 1, LENGTH( rightValue ) - sepPos);

        -- Calculate likeness factor
        SET retValue = 0;
        SET partMatch = FALSE;
        -- Primaries equal 50% match
        IF leftPri = rightPri THEN
            SET retValue = retValue + 0.5;
            SET partMatch = TRUE;
        ELSE
            IF SOUNDEX(leftPri) = SOUNDEX(rightPri) THEN
                SET retValue = retValue + 0.3;
                SET partMatch = TRUE;
            END IF;
        END IF;
        -- Left secondary vs. right primary: 0.3, plus 0.2 from the nested SOUNDEX
        -- test, which always passes when the two strings are equal
        IF leftSec = rightPri THEN
            SET retValue = retValue + 0.3;
            SET partMatch = TRUE;
            IF SOUNDEX(leftSec) = SOUNDEX(rightPri) THEN
                SET retValue = retValue + 0.2;
                SET partMatch = TRUE;
            END IF;
        END IF;
        -- Left primary vs. right secondary: 0.3, plus 0.2 from the nested SOUNDEX
        -- test, which always passes when the two strings are equal
        IF leftPri = rightSec THEN
            SET retValue = retValue + 0.3;
            SET partMatch = TRUE;
            IF SOUNDEX(leftPri) = SOUNDEX(rightSec) THEN
                SET retValue = retValue + 0.2;
                SET partMatch = TRUE;
            END IF;
        END IF;
        -- Are the secondary values the same, or both empty?
        IF leftSec = rightSec THEN
            -- No secondaries ...
            IF leftSec = '' THEN
                -- If there is prior matching then no secondaries is 40%
                IF partMatch = TRUE THEN
                    SET retValue = retValue + 0.4;
                END IF;
            ELSE
                -- If the secondaries match then 50% match
                SET retValue = retValue + 0.5;
            END IF;
        ELSE
            IF SOUNDEX(leftSec) = SOUNDEX(rightSec) THEN
                IF leftSec = '' THEN
                    IF partMatch = TRUE THEN
                        SET retValue = retValue + 0.3;
                    END IF;
                END IF;
            END IF; 
        END IF;
        RETURN (retValue);
    END
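
As a usage sketch, the function could be used to rank rows against a search term. The table and column names (country, name, NameDM) are borrowed from the question, the 0.3 threshold is the value suggested above, and @search_dm is a hypothetical variable name:

```sql
-- Compute the search term's double metaphone once.
SET @search_dm = dm('Algeria');

-- Score every row against it, keep the plausible matches, best first.
SELECT name, score
FROM (
    SELECT name, dmcompare(NameDM, @search_dm) AS score
    FROM country
) AS ranked
WHERE score >= 0.3
ORDER BY score DESC;
```

When creating the function from the mysql command-line client, the semicolons inside the body require switching the statement delimiter with the DELIMITER command first.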

Please feel free to use the code, but please also credit the source, P.Boothroyd, with any usage, including adapted versions (changed values, etc.).

Cheers, Paul

#3


2  

SELECT * FROM country WHERE NameDM = dm(@search)

is probably what you want in the end, so you aren't computing the DM for every country every time you do a search. What you had looks like it should have worked, though. You can troubleshoot by doing:

SELECT dm('Albania')

... should get you ALPN. Now what do you get for...

SELECT * FROM country WHERE NameDM = 'ALPN'

?
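
The two checks can also be combined into a single query that puts the stored value next to a freshly computed one. A sketch, assuming the NameDM column from the question; the aliases recomputed and is_same are made up for illustration:

```sql
-- is_same is 1 where the stored metaphone equals the recomputed one, 0 where it differs.
SELECT name,
       NameDM,
       dm(name) AS recomputed,
       NameDM = dm(name) AS is_same
FROM country
WHERE name IN ('Afghanistan', 'Albania', 'Algeria');
```

If is_same is 1 for these rows but the search still returns nothing, the search term itself is the more likely culprit.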
