Matching records based on a person's name

Are there any tools or methods that can be used for matching by a person's name between two different data sources?

The systems have no other common information and the names have been entered differently in many cases.

Examples of non-exact matches:

King Jr., Martin Luther = King, Martin (exclude suffix)
Erving, Dr. J. = Erving, J. (exclude prefix)
Obama, Barak Hussein = Obama, Barak (exclude middle name)
Pufnstuf, H.R. = Pufnstuf, Haibane Renmei (match abbreviations)
Tankengine, Thomas = Tankengine, Tom (match common nicknames)
Flair, Rick "the Natureboy" = Flair, Natureboy (match on nickname)

5 solutions

#1

I had to use a variety of the techniques suggested. Thanks for pointing me in the right direction(s). Hopefully, the following will help someone else with this type of problem.

Removing excess characters

CREATE FUNCTION [dbo].[fn_StripCharacters]
(
    @String NVARCHAR(MAX), 
    @MatchExpression VARCHAR(255)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
    SET @MatchExpression =  '%['+@MatchExpression+']%'

    WHILE PatIndex(@MatchExpression, @String) > 0
        SET @String = Stuff(@String, PatIndex(@MatchExpression, @String), 1, '')

    RETURN @String

END

Usage:

-- Remove everything except letters, digits, and spaces
-- (a-z also matches A-Z under a case-insensitive collation).
SELECT dbo.fn_StripCharacters(@Value, '^a-z0-9 ')

Split name into parts

-- Splits one string on @sep and returns one row per piece.
-- Defined first because dbo.SplitTable below depends on it.
CREATE FUNCTION [dbo].[Split] (@sep char(1), @s varchar(8000))
RETURNS table
AS
RETURN (
    WITH Pieces(pn, start, stop) AS (
      SELECT 1, 1, CHARINDEX(@sep, @s)
      UNION ALL
      SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
      FROM Pieces
      WHERE stop > 0
    )
    SELECT pn,
      LTRIM(RTRIM(SUBSTRING(@s, start, 
             CASE WHEN stop > 0 
                  THEN stop-start 
                  ELSE 8000 
             END))) AS s
    FROM Pieces
  )
GO

-- Splits every value in @sList and returns one row per (ID, piece).
-- StringList is a user-defined table type with ID and Val columns;
-- its definition is not shown in this answer.
CREATE FUNCTION [dbo].[SplitTable] (@sep char(1), @sList StringList READONLY)
RETURNS @ResultList TABLE 
    (
        [ID] VARCHAR(MAX),
        [Val] VARCHAR(MAX)
    )
AS
BEGIN
    DECLARE @OuterCursor CURSOR
    DECLARE @ID VARCHAR(MAX)
    DECLARE @Val VARCHAR(MAX)

    -- FAST_FORWARD cursors are already forward-only and read-only.
    SET @OuterCursor = CURSOR FAST_FORWARD FOR SELECT ID, Val FROM @sList

    OPEN @OuterCursor

    FETCH NEXT FROM @OuterCursor INTO @ID, @Val

    WHILE (@@FETCH_STATUS = 0)
    BEGIN
        -- Keep only non-empty pieces of the current value.
        INSERT INTO @ResultList (ID, Val)
        SELECT @ID, split.s FROM dbo.Split(@sep, @Val) AS split
        WHERE LEN(split.s) > 0

        FETCH NEXT FROM @OuterCursor INTO @ID, @Val
    END

    CLOSE @OuterCursor
    DEALLOCATE @OuterCursor

    RETURN
END

Usage:

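-- Build the list of first names that contain non-letter characters
-- (spaces, hyphens, and so on), i.e. names that may need splitting.
DECLARE @NameList StringList 

INSERT INTO @NameList (ID, Val)
SELECT id, firstname FROM dbo.[User] u
WHERE PATINDEX('%[^a-z]%', u.FirstName) > 0 

-- Count the split pieces per user (the starting point for removing split duplicates).
SELECT u.ID, COUNT(*) AS PieceCount
FROM dbo.SplitTable(' ', @NameList) splitList
INNER JOIN dbo.[User] u
ON splitList.id = u.id
GROUP BY u.ID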

Common nicknames:

I created a table based on this list and used it to join on common name equivalents.
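
The same idea can be sketched outside the database. This is a minimal Python illustration, not the author's implementation: the nickname groups below are made up for the example (the list the answer refers to is not reproduced here), and the function names are hypothetical.

```python
# Hypothetical nickname groups; a real table would be loaded from a
# maintained list of common name equivalents.
NICKNAME_GROUPS = [
    {"thomas", "tom", "tommy"},
    {"william", "bill", "will"},
    {"richard", "rick", "dick"},
]

# Map every name to the set of names it is considered equivalent to.
EQUIVALENTS = {}
for group in NICKNAME_GROUPS:
    for name in group:
        EQUIVALENTS.setdefault(name, set()).update(group)


def same_first_name(a: str, b: str) -> bool:
    """True if the names are equal or belong to the same nickname group."""
    a, b = a.strip().lower(), b.strip().lower()
    return a == b or b in EQUIVALENTS.get(a, set())


def match_by_nickname(users, contacts):
    """Pair (id, first, last) rows whose last names are equal and whose
    first names are nickname-equivalent. Returns (user_id, contact_id) pairs."""
    return [
        (u_id, c_id)
        for u_id, u_first, u_last in users
        for c_id, c_first, c_last in contacts
        if u_last.lower() == c_last.lower() and same_first_name(u_first, c_first)
    ]
```

Calling `match_by_nickname([(1, "Thomas", "Tankengine")], [(7, "Tom", "Tankengine")])` pairs user 1 with contact 7, mirroring the join on name equivalents.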

Usage:

SELECT u.id
, u.FirstName
, u_nickname_maybe.Name AS MaybeNickname
, u.LastName
, c.ID AS ContactID
FROM dbo.[User] u 
INNER JOIN nickname u_nickname_match
ON u.FirstName = u_nickname_match.Name
INNER JOIN nickname u_nickname_maybe
ON u_nickname_match.relatedid = u_nickname_maybe.id
LEFT OUTER JOIN
(
    SELECT c.id, c.LastName, c.FirstName, 
         c_nickname_maybe.Name AS MaybeFirstName
    FROM dbo.Contact c
    INNER JOIN nickname c_nickname_match
    ON c.FirstName = c_nickname_match.Name
    INNER JOIN nickname c_nickname_maybe
    ON c_nickname_match.relatedid = c_nickname_maybe.id
    WHERE c_nickname_match.Name <> c_nickname_maybe.Name
) AS c
-- The original also required c.AccountHolderID = ah.ID; neither that column
-- nor the ah alias is defined in this snippet, so the condition is left out.
ON u_nickname_maybe.Name = c.MaybeFirstName AND u.LastName = c.LastName
WHERE u_nickname_match.Name <> u_nickname_maybe.Name

Similarity algorithms (Jaro-Winkler):

The article "Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server" shows how to install and use the SimMetrics library in SQL Server. The library measures the relative similarity between strings and includes numerous algorithms. I ended up mostly using Jaro-Winkler to match the names.

Usage:

SELECT
u.id AS UserID
,c.id AS ContactID
,u.FirstName
,c.FirstName 
,u.LastName
,c.LastName
,maxResult.CombinedScores
 from
(
    SELECT
      u.ID
    , 
        max(
            dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName))  
            * dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))
        ) AS CombinedScores
    FROM dbo.[User] u, dbo.[Contact] c
    WHERE u.ContactID IS NULL
    GROUP BY u.id
) AS maxResult
INNER JOIN dbo.[User] u
ON maxResult.id  = u.id
INNER JOIN dbo.[Contact] c
ON maxResult.CombinedScores = 
dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName)) 
* dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))
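
For readers without SimMetrics installed, the scoring used in the query above can be sketched in plain Python. This is a textbook Jaro-Winkler written for illustration, not the SimMetrics code, so its scores may differ slightly from that library; `best_contact` is a hypothetical helper that mirrors the "product of first-name and last-name scores" idea.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def best_contact(user, contacts):
    """Return (score, contact) for the contact with the highest product of
    first-name and last-name scores. Rows are (first_name, last_name)."""
    def combined(contact):
        return (jaro_winkler(user[0].lower(), contact[0].lower())
                * jaro_winkler(user[1].lower(), contact[1].lower()))
    return max(((combined(c), c) for c in contacts), key=lambda pair: pair[0])
```

Multiplying the two scores means a poor match on either name drags the combined score down, which is the same design choice the SQL makes.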

#2

It's a very complex problem, and there are a lot of expensive tools to do it correctly.
If you ever wondered why you can't check in on a flight as Tom, Dick or Harry (or Bill),
or why no-fly lists and terrorist watch lists don't work, consider:

(1) Muammar Qaddafi
(2) Mo'ammar Gadhafi
(3) Muammar Kaddafi
(4) Muammar Qadhafi
(5) Moammar El Kadhafi
(6) Muammar Gadafi
(7) Mu'ammar al-Qadafi
(8) Moamer El Kazzafi
(9) Moamar al-Gaddafi
(10) Mu'ammar Al Qathafi
(11) Muammar Al Qathafi
(12) Mo'ammar el-Gadhafi
(13) Moamar El Kadhafi
(14) Muammar al-Qadhafi
(15) Mu'ammar al-Qadhdhafi
(16) Mu'ammar Qadafi
(17) Moamar Gaddafi
(18) Mu'ammar Qadhdhafi
(19) Muammar Khaddafi
(20) Muammar al-Khaddafi
(21) Mu'amar al-Kadafi
(22) Muammar Ghaddafy
(23) Muammar Ghadafi
(24) Muammar Ghaddafi
(25) Muamar Kaddafi
(26) Muammar Quathafi
(27) Muammar Gheddafi
(28) Muamar Al-Kaddafi
(29) Moammar Khadafy
(30) Moammar Qudhafi
(31) Mu'ammar al-Qaddafi
(32) Mulazim Awwal Mu'ammar Muhammad Abu Minyar al-Qadhafi

And that's just official spellings - it doesn't include typos!

#3

I often employ soundex-type algorithms for this type of situation. Try the Double Metaphone algorithm. If you are using SQL Server, there is some source code to create a user defined function.

Because you have transposed data, you may want to normalize it a bit, e.g., remove all commas and then sort the resulting words by first letter. That will give you better matching potential. In the case where words have been added in the middle, it gets a bit tougher. You could consider breaking a name into words, checking with Double Metaphone whether there is a word in the other column that matches, and then collecting the overall count of matches vs. words, which will tell you how close the two columns are.
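
A rough Python sketch of that normalization and word count follows. It is an illustration under assumptions: the function names are hypothetical, and the `key` parameter is where a Double Metaphone encoder would be plugged in. No such encoder ships with the standard library, so the default key is simply the lowercased word.

```python
import re


def name_words(name: str) -> list[str]:
    """Drop commas and other punctuation, lowercase, and sort the words."""
    cleaned = re.sub(r"[^A-Za-z\s]", " ", name)
    return sorted(cleaned.lower().split())


def word_match_ratio(name_a: str, name_b: str, key=lambda w: w) -> float:
    """Fraction of words in the shorter name whose key also occurs in the
    other name. `key` maps a word to the value that is compared."""
    keys_a = [key(w) for w in name_words(name_a)]
    keys_b = [key(w) for w in name_words(name_b)]
    if not keys_a or not keys_b:
        return 0.0
    shorter, longer = sorted((keys_a, keys_b), key=len)
    remaining = list(longer)
    hits = 0
    for k in shorter:
        if k in remaining:
            remaining.remove(k)  # each word can be matched only once
            hits += 1
    return hits / len(shorter)
```

Dividing by the length of the shorter name is one possible choice: it keeps an extra middle name from lowering the score. Dividing by the longer name would penalize added words instead.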

I would also filter out common words like Dr., Mr., Ms., Mrs., etc., before doing the comparisons.
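
A small Python sketch of such a filter is shown here. The title list is illustrative and should be extended for real data; the suffix list (Jr., Sr., and so on) is an addition suggested by the examples in the question rather than by this answer.

```python
import re

# Illustrative lists only; extend them to suit the data.
TITLES = {"dr", "mr", "ms", "mrs", "prof"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def strip_titles(name: str) -> str:
    """Remove honorifics and generational suffixes, keeping word order."""
    kept = []
    for word in name.split():
        bare = re.sub(r"[^a-z]", "", word.lower())
        if bare in TITLES or bare in SUFFIXES:
            # Carry a trailing comma over to the previous word so that
            # "King Jr., Martin" becomes "King, Martin".
            if word.endswith(",") and kept and not kept[-1].endswith(","):
                kept[-1] += ","
            continue
        kept.append(word)
    return " ".join(kept)
```

Applied to the question's examples, `strip_titles("King Jr., Martin Luther")` gives `"King, Martin Luther"` and `strip_titles("Erving, Dr. J.")` gives `"Erving, J."`.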

#4

Here are some options:

Phonetic algorithms...

Soundex (http://en.wikipedia.org/wiki/Soundex)

Double Metaphone (http://en.wikipedia.org/wiki/Double_Metaphone)

Edit Distance (http://en.wikipedia.org/wiki/Levenshtein_distance)

Jaro-Winkler Distance (http://en.wikipedia.org/wiki/Jaro-Winkler_distance)
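
To give a feel for two of the options listed, here is a compact Python sketch of American Soundex and of Levenshtein edit distance. Both are textbook versions written for illustration; a production system would more likely use an existing, tested library.

```python
# Soundex digit for each consonant; vowels, y, h, and w have no digit.
_CODES = {}
for digit, letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, e.g. 'R163'."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete
                               current[j - 1] + 1,       # insert
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]
```

Soundex groups names that sound alike into the same code, while edit distance counts how many keystrokes separate two spellings, so the two tend to catch different kinds of mismatch.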

Another thing you could try would be to compare each word (splitting on space and maybe hyphen) with each word in the other name and see how many words match up. Maybe combine this with phonetic algorithms for more fuzzy matching. For a huge data set, you would want to index each word and match it with a name id. For abbreviation matching you could compare just the first letter. You probably want to ignore anything but letters when you compare words as well.
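
The word-by-word comparison, first-letter abbreviation matching, and word index described above might look like this in Python. All function names are hypothetical, and the sketch uses exact word equality where a phonetic key could be substituted.

```python
import re
from collections import defaultdict


def words(name: str) -> list[str]:
    """Split on anything that is not a letter (spaces, hyphens, dots, commas)."""
    return [w for w in re.split(r"[^a-z]+", name.lower()) if w]


def word_matches(a: str, b: str) -> bool:
    """Words are equal, or one is a single-letter initial of the other."""
    if a == b:
        return True
    return (len(a) == 1 or len(b) == 1) and a[0] == b[0]


def matching_words(name_a: str, name_b: str) -> int:
    """Number of words in name_a that pair up with a distinct word in name_b."""
    remaining = words(name_b)
    count = 0
    for wa in words(name_a):
        for wb in remaining:
            if word_matches(wa, wb):
                remaining.remove(wb)  # use each word of name_b only once
                count += 1
                break
    return count


def build_index(names: dict) -> dict:
    """Map each word to the set of name ids that contain it."""
    index = defaultdict(set)
    for name_id, name in names.items():
        for w in words(name):
            index[w].add(name_id)
    return index


def candidates(index: dict, name: str) -> set:
    """Ids of indexed names that share at least one whole word with `name`."""
    found = set()
    for w in words(name):
        found |= index.get(w, set())
    return found
```

With an index built once per data source, only the candidate ids returned for a name need the more expensive pairwise comparison, which matters for a huge data set.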

Many of the phonetic algorithms have open-source implementations and samples available online.

#5

Metaphone 3 is the third generation of the Metaphone algorithm. It increases the accuracy of phonetic encoding from the 89% of Double Metaphone to 98%, as tested against a database of the most common English words, and names and non-English words familiar in North America. This produces an extremely reliable phonetic encoding for American pronunciations.

Metaphone 3 was designed and developed by Lawrence Philips, who designed and developed the original Metaphone and Double Metaphone algorithms.
