How to remove/filter array elements based on repeated substrings in a PHP associative array?

Time: 2021-08-20 08:52:46

I want to remove entries whose titles are similar. For example, if I have "Rihanna - Work ft. <some other words>" and "Rihanna - Work", I want to keep only one of them. How can I remove such duplicates while still being able to search for Rihanna? See the JSON below, which contains similar titles:

In other words, I don't want multiple versions of the same song in my array. The sample JSON below should be filtered down to a single version:

    {
      "videos": [
        {
          "kind": "youtube#playlistItem",
          "etag": "\"gMxXHe-zinKdE9lTnzKu8vjcmDI/134M9maQodDR9PapI2tdE24XHdU\"",
          "id": "UExwWEExSXFCZ2VaUXpYOFh2Y0U0R0RscEFpTjAzczNGNi5EQUE1NTFDRjcwMDg0NEMz",
          "snippet": {
            "publishedAt": "2016-07-03T16:45:08.000Z",
            "channelId": "UCOb0YwX9e9SFbctQaSXkKGQ",
            "title": "Rihanna - Work ft. Drake (Audio)"
          },
          "shuffle_id": 88
        },
        {
          "kind": "youtube#playlistItem",
          "etag": "\"gMxXHe-zinKdE9lTnzKu8vjcmDI/Qeo1vUZh73p7gX3EFvVxRGbTxms\"",
          "id": "UExaOW5LbUs1dVVCcnN2Rld6ZDRWcFA0MHZ3NlZhLXZFeS5ENDU4Q0M4RDExNzM1Mjcy",
          "snippet": {
            "publishedAt": "2016-08-31T04:42:26.000Z",
            "channelId": "UC2mUsMtec7AOG9K-4ZlO7gA",
            "title": "Rihanna - Work (Explicit) ft. Drake",
            "description": "",
            "channelTitle": "Dickinson Kenneth",
            "playlistId": "PLZ9nKmK5uUBrsvFWzd4VpP40vw6Va-vEy",
            "position": 17
          },
          "shuffle_id": 219
        }
      ]
    }

2 solutions

#1


You could define a hash function that returns the same hash for similar song titles, and then make the song list unique based on that hash value.

Here is one possible hash function, with a short demo:

$hash1 = hashSongTitle('Rihanna - Work ft. Drake (Audio)');
$hash2 = hashSongTitle('Rihanna - Work (Explicit) ft. Drake');

echo $hash1 . "\n";
echo $hash2 . "\n";

$sameHash = ($hash1 === $hash2);

echo $sameHash ? 'are the same' : 'are not the same';

// Reduces a title to a canonical form so that variants of the same song compare equal.
function hashSongTitle($title)
{
    // get rid of noise words (extend this list as needed; the match is case-sensitive)
    $title = str_replace(array('(Explicit)', '(Audio)', '-'), '', $title);

    // collapse consecutive whitespace into a single space
    $title = preg_replace('#\s{2,}#', ' ', $title);

    // strip leading and trailing whitespace
    $title = trim($title);

    return $title;
}

This should print:

Rihanna Work ft. Drake
Rihanna Work ft. Drake
are the same

You can see it live here: http://sandbox.onlinephpfunctions.com/code/201b95cdc80f587a0ee377155c5fb6a49475bc89

Then you can store each song in an array keyed by that hash value, so that each hash (and therefore each song) appears only once.

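$uniqueSongList = array();

foreach ($songList as $song)
{
    // songs that share a hash overwrite each other, so the last one seen is kept
    $hash = hashSongTitle($song->title);
    $uniqueSongList[$hash] = $song;
}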
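
To show how this could be applied to the structure in the question, here is a minimal sketch. It assumes the JSON is decoded with json_decode($json, true) and that each title lives under snippet.title. The names normalizeTitle and uniqueVideosByTitle are hypothetical, and the embedded JSON is a trimmed copy of the sample. Unlike the loop above, this variant keeps the first video seen for each normalized title, drops any parenthesised or bracketed tag rather than a fixed list of noise words, and lower-cases the result.

```php
<?php
// Hypothetical helper: reduce a title to a canonical, comparable form.
function normalizeTitle(string $title): string
{
    // Drop tags such as "(Audio)" or "[Official Video]".
    $title = preg_replace('/\([^)]*\)|\[[^\]]*\]/', ' ', $title);
    // Treat dashes as separators, then collapse runs of whitespace.
    $title = str_replace('-', ' ', $title);
    $title = preg_replace('/\s+/', ' ', $title);
    return strtolower(trim($title));
}

// Hypothetical helper: keep the first video seen for each normalized title.
function uniqueVideosByTitle(array $videos): array
{
    $unique = [];
    foreach ($videos as $video) {
        $key = normalizeTitle($video['snippet']['title']);
        if (!isset($unique[$key])) {
            $unique[$key] = $video;
        }
    }
    // Re-index so the result encodes back to a JSON list, not an object.
    return array_values($unique);
}

// Trimmed copy of the sample data from the question.
$json = '{"videos":[
  {"snippet":{"title":"Rihanna - Work ft. Drake (Audio)"},"shuffle_id":88},
  {"snippet":{"title":"Rihanna - Work (Explicit) ft. Drake"},"shuffle_id":219}
]}';

$data = json_decode($json, true);
$data['videos'] = uniqueVideosByTitle($data['videos']);

echo count($data['videos']), "\n";
```

Both sample titles normalize to the same key, so only the entry with shuffle_id 88 remains. Titles that differ in word order or carry extra words outside brackets would still be treated as distinct by this approach.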

#2


You can measure similarity with the similar_text function and choose a threshold above which two (or more) titles count as similar enough to remove one of them (the shortest?).
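
As a rough sketch of that idea: similar_text fills its optional third argument with a similarity percentage, which can be compared against a threshold. The function name filterSimilarTitles, the 60% threshold, the lower-casing, and the extra "Adele - Hello" title are all assumptions made for illustration. When two titles are judged similar, this version keeps the longer one, following the suggestion to drop the shortest.

```php
<?php
// Hypothetical helper: drop titles that are too similar to one already kept.
function filterSimilarTitles(array $titles, float $threshold = 60.0): array
{
    $kept = [];
    foreach ($titles as $title) {
        $matchIndex = null;
        foreach ($kept as $i => $keptTitle) {
            // $percent is filled in by reference with a 0-100 similarity score.
            similar_text(strtolower($keptTitle), strtolower($title), $percent);
            if ($percent >= $threshold) {
                $matchIndex = $i;
                break;
            }
        }
        if ($matchIndex === null) {
            // Nothing similar has been kept yet.
            $kept[] = $title;
        } elseif (strlen($title) > strlen($kept[$matchIndex])) {
            // Similar to a kept title: retain the longer of the two.
            $kept[$matchIndex] = $title;
        }
    }
    return $kept;
}

$titles = [
    'Rihanna - Work ft. Drake (Audio)',
    'Rihanna - Work (Explicit) ft. Drake',
    'Adele - Hello',
];

$result = filterSimilarTitles($titles, 60.0);
print_r($result);
```

Each title is compared against every title already kept, so the cost grows quadratically with the list size. The threshold needs tuning against real data: too low merges different songs by the same artist, too high lets variants through.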

If you need more accurate results, meaning you care not only about the number of common letters but also about their order, then what you are looking for is the longest common substring problem, for which ready-made implementations exist. In that case, the threshold applies to the ratio largestSubstringLength/OriginalStringLength.
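
A minimal sketch of that ratio, using the standard dynamic-programming approach to the longest common substring. The function names are hypothetical, and since the answer does not say which string's length goes in the denominator, this sketch assumes the shorter of the two. Lengths are counted in bytes, so multibyte titles would need the mb_* functions instead.

```php
<?php
// Length of the longest run of characters that appears in both strings.
function longestCommonSubstringLength(string $a, string $b): int
{
    $lenA = strlen($a);
    $lenB = strlen($b);
    $best = 0;
    // $prev[$j] holds the length of the common suffix ending at a[i-2], b[j-1].
    $prev = array_fill(0, $lenB + 1, 0);
    for ($i = 1; $i <= $lenA; $i++) {
        $curr = array_fill(0, $lenB + 1, 0);
        for ($j = 1; $j <= $lenB; $j++) {
            if ($a[$i - 1] === $b[$j - 1]) {
                // Extend the common run that ended one character earlier.
                $curr[$j] = $prev[$j - 1] + 1;
                $best = max($best, $curr[$j]);
            }
        }
        $prev = $curr;
    }
    return $best;
}

// Assumption: the ratio is taken against the shorter string's length.
function substringRatio(string $a, string $b): float
{
    $shorter = min(strlen($a), strlen($b));
    if ($shorter === 0) {
        return 0.0;
    }
    return longestCommonSubstringLength($a, $b) / $shorter;
}

$a = 'Rihanna - Work ft. Drake (Audio)';
$b = 'Rihanna - Work (Explicit) ft. Drake';

printf("%d %.3f\n", longestCommonSubstringLength($a, $b), substringRatio($a, $b));
```

For the two sample titles the shared run is the prefix "Rihanna - Work ", so a tag inserted mid-title cuts the match short. A fairly low threshold, or normalizing the titles first, would be needed for this pair to count as duplicates.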
