How do I check whether a value already exists, to avoid duplicates?

Time: 2021-07-09 01:40:18

I've got a table of URLs and I don't want any duplicate URLs. How do I check to see if a given URL is already in the table using PHP/MySQL?

17 answers

#1


39  

If you don't want duplicates, you can do the following:

If multiple users can insert data into the DB, the method suggested by @Jeremy Ruten can lead to an error: after you have performed the check, someone else can insert similar data into the table.

#2


23  

To answer your initial question, the easiest way to check whether there is a duplicate is to run an SQL query against what you're trying to add!

For example, were you to want to check for the url http://www.example.com/ in the table links, then your query would look something like

SELECT * FROM links WHERE url = 'http://www.example.com/';

Your PHP code would look something like

$conn = mysql_connect('localhost', 'username', 'password');
if (!$conn)
{
    die('Could not connect to database');
}
if(!mysql_select_db('mydb', $conn))
{
    die('Could not select database mydb');
}

$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);

if (!$result)
{
    die('There was a problem executing the query');
}

$number_of_rows = mysql_num_rows($result);

if ($number_of_rows > 0)
{
    die('This URL already exists in the database');
}

I've written this out longhand here, with all the connecting to the database, etc. It's likely that you'll already have a connection to a database, so you should use that rather than starting a new connection (replace $conn in the mysql_query command and remove the stuff to do with mysql_connect and mysql_select_db).

Of course, there are other ways of connecting to the database, like PDO, or using an ORM, or similar, so if you're already using those, this answer may not be relevant (and it's probably a bit beyond the scope to give answers related to this here!)
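
As a rough sketch of the PDO route mentioned above (the mysql_* functions used in this answer were removed in PHP 7.0), the same existence check could be written with a prepared statement. The function name url_exists() is made up for this example; the links table and url column are the ones from the query above, and the DSN and credentials in the usage comment are placeholders.

```php
<?php
/**
 * Returns true if $url is already stored in the links table.
 * Hypothetical helper: assumes a table `links` with a column `url`.
 */
function url_exists(PDO $pdo, string $url): bool
{
    // A bound parameter keeps the URL out of the SQL text, so quotes in it
    // cannot break (or inject into) the query.
    $stmt = $pdo->prepare('SELECT 1 FROM links WHERE url = ? LIMIT 1');
    $stmt->execute([$url]);

    // fetchColumn() returns false when no row matched.
    return $stmt->fetchColumn() !== false;
}

// Usage (placeholder DSN and credentials):
// $pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'username', 'password');
// if (url_exists($pdo, 'http://www.example.com/')) { die('This URL already exists in the database'); }
```

As answer #1 points out, a check like this is still open to a race between the SELECT and a later INSERT, so it complements a unique key rather than replacing one.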

However, MySQL provides many ways to prevent this from happening in the first place.

Firstly, you can mark a field as "unique".

Let's say I have a table where I want to just store all the URLs that are linked to from my site, and the last time they were visited.

My definition might look something like this:-

CREATE TABLE links
(
    url VARCHAR(255) NOT NULL,
    last_visited TIMESTAMP
)

This would allow me to add the same URL over and over again, unless I wrote some PHP code similar to the above to stop this happening.

However, were my definition to change to

CREATE TABLE links
(
  url VARCHAR(255)  NOT NULL,
  last_visited TIMESTAMP,
  PRIMARY KEY (url)
)

Then this would make MySQL throw an error when I tried to insert the same value twice.

An example in PHP would be

$result = mysql_query("INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW())", $conn);

if (!$result)
{
    die('Could not Insert Row 1');
}

$result2 = mysql_query("INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW())", $conn);

if (!$result2)
{
    die('Could not Insert Row 2');
}

If you ran this, you'd find that on the first attempt, the script would die with the message 'Could not Insert Row 2'. However, on subsequent runs, it'd die with 'Could not Insert Row 1'.

This is because MySQL knows that the url is the Primary Key of the table. A Primary Key is a unique identifier for that row. Most of the time, it's useful to set the unique identifier for a row to be a number. This is because MySQL is quicker at looking up numbers than it is looking up text. Within MySQL, keys (and especially Primary Keys) are used to define relationships between two tables. For example, if we had a table for users, we could define it as

CREATE TABLE users (
  username VARCHAR(255)  NOT NULL,
  password VARCHAR(40) NOT NULL,
  PRIMARY KEY (username)
)

However, when we wanted to store information about a post the user had made, we'd have to store the username with that post to identify that the post belonged to that user.

I've already mentioned that MySQL is faster at looking up numbers than strings, so this would mean we'd be spending time looking up strings when we didn't have to.

To solve this, we can add an extra column, user_id, and make that the primary key (so when looking up the user record based on a post, we can find it quicker).

CREATE TABLE users (
  user_id INT(10)  NOT NULL AUTO_INCREMENT,
  username VARCHAR(255)  NOT NULL,
  password VARCHAR(40)  NOT NULL,
  PRIMARY KEY (`user_id`)
)

You'll notice that I've also added something new here - AUTO_INCREMENT. This basically allows us to let that field look after itself. Each time a new row is inserted, it adds 1 to the previous number, and stores that, so we don't have to worry about numbering, and can just let it do this itself.

So, with the above table, we can do something like

INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');

and then

INSERT INTO users (username, password) VALUES('User', '988881adc9fc3655077dc2d4d757d480b5ea0e11');

When we select the records from the database, we get the following:-

mysql> SELECT * FROM users;
+---------+----------+------------------------------------------+
| user_id | username | password                                 |
+---------+----------+------------------------------------------+
|       1 | Mez      | d3571ce95af4dc281f142add33384abc5e574671 |
|       2 | User     | 988881adc9fc3655077dc2d4d757d480b5ea0e11 |
+---------+----------+------------------------------------------+
2 rows in set (0.00 sec)

However, here - we have a problem - we can still add another user with the same username! Obviously, this is something we don't want to do!

mysql> SELECT * FROM users;
+---------+----------+------------------------------------------+
| user_id | username | password                                 |
+---------+----------+------------------------------------------+
|       1 | Mez      | d3571ce95af4dc281f142add33384abc5e574671 |
|       2 | User     | 988881adc9fc3655077dc2d4d757d480b5ea0e11 |
|       3 | Mez      | d3571ce95af4dc281f142add33384abc5e574671 |
+---------+----------+------------------------------------------+
3 rows in set (0.00 sec)

Let's change our table definition!

CREATE TABLE users (
  user_id INT(10)  NOT NULL AUTO_INCREMENT,
  username VARCHAR(255)  NOT NULL,
  password VARCHAR(40)  NOT NULL,
  PRIMARY KEY (user_id),
  UNIQUE KEY (username)
)

Let's see what happens when we now try to insert the same user twice.

mysql> INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');
ERROR 1062 (23000): Duplicate entry 'Mez' for key 'username'

Huzzah!! We now get an error when we try and insert the username for the second time. Using something like the above, we can detect this in PHP.

Now, let's go back to our links table, but with a new definition.

CREATE TABLE links
(
    link_id INT(10)  NOT NULL AUTO_INCREMENT,
    url VARCHAR(255)  NOT NULL,
    last_visited TIMESTAMP,
    PRIMARY KEY (link_id),
    UNIQUE KEY (url)
)

and let's insert http://www.example.com/ into the database.

INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());

If we try and insert it again....

ERROR 1062 (23000): Duplicate entry 'http://www.example.com/' for key 'url'

But what happens if we want to update the time it was last visited?

Well, we could do something complex with PHP, like so:-

$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);

if (!$result)
{
    die('There was a problem executing the query');
}

$number_of_rows = mysql_num_rows($result);

if ($number_of_rows > 0)
{
    $result = mysql_query("UPDATE links SET last_visited = NOW() WHERE url = 'http://www.example.com/'", $conn);

    if (!$result)
    {
        die('There was a problem updating the links table');
    }
}

Or, even grab the id of the row in the database and use that to update it.

$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);

if (!$result)
{
    die('There was a problem executing the query');
}

$number_of_rows = mysql_num_rows($result);

if ($number_of_rows > 0)
{
    $row = mysql_fetch_assoc($result);

    $result = mysql_query('UPDATE links SET last_visited = NOW() WHERE link_id = ' . intval($row['link_id']), $conn);

    if (!$result)
    {
        die('There was a problem updating the links table');
    }
}

But MySQL has a nice built-in feature called REPLACE INTO.

Let's see how it works.

mysql> SELECT * FROM links;
+---------+-------------------------+---------------------+
| link_id | url                     | last_visited        |
+---------+-------------------------+---------------------+
|       1 | http://www.example.com/ | 2011-08-19 23:48:03 |
+---------+-------------------------+---------------------+
1 row in set (0.00 sec)

mysql> INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());
ERROR 1062 (23000): Duplicate entry 'http://www.example.com/' for key 'url'
mysql> REPLACE INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());
Query OK, 2 rows affected (0.00 sec)

mysql> SELECT * FROM links;
+---------+-------------------------+---------------------+
| link_id | url                     | last_visited        |
+---------+-------------------------+---------------------+
|       2 | http://www.example.com/ | 2011-08-19 23:55:55 |
+---------+-------------------------+---------------------+
1 row in set (0.00 sec)

Notice that when using REPLACE INTO, it's updated the last_visited time, and not thrown an error!

This is because MySQL detects that you're attempting to replace a row. It knows the row that you want, as you've set url to be unique. MySQL figures out the row to replace by using the bit that you passed in that should be unique (in this case, the url). It has also changed the link_id, which is a bit unexpected! (In fact, I didn't realise this would happen until I just saw it happen!) The reason is that REPLACE deletes the old row and inserts a new one rather than updating in place, which is also why the statement reports 2 rows affected.

But what if you wanted to add a new URL? Well, REPLACE INTO will happily insert a new row if it can't find a matching unique row!

mysql> REPLACE INTO links (url, last_visited) VALUES ('http://www.*.com/', NOW());
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM links;
+---------+-------------------------------+---------------------+
| link_id | url                           | last_visited        |
+---------+-------------------------------+---------------------+
|       2 | http://www.example.com/       | 2011-08-20 00:00:07 |
|       3 | http://www.*.com/             | 2011-08-20 00:01:22 |
+---------+-------------------------------+---------------------+
2 rows in set (0.00 sec)

I hope this answers your question, and gives you a bit more information about how MySQL works!

#3


14  

Are you concerned purely about URLs that are the exact same string? If so, there is a lot of good advice in other answers. Or do you also have to worry about canonicalization?

For example: http://google.com and http://go%4fgle.com are the exact same URL, but would be allowed as duplicates by any of the database-only techniques. If this is an issue you should preprocess the URLs to resolve any character escape sequences.

Depending where the URLs are coming from you will also have to worry about parameters and whether they are significant in your application.
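
To illustrate the escape-sequence point, here is a minimal sketch in PHP that decodes percent-escapes in the host name and lowercases it before comparing. The helper name normalize_host() is made up for this example, and it deliberately ignores everything except the scheme and host.

```php
<?php
/**
 * Hypothetical helper: returns "scheme://host" with the host percent-decoded
 * and lowercased, or null if the URL has no scheme or host.
 */
function normalize_host(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }

    // %4f decodes to "O"; host names are case-insensitive, so lowercase afterwards.
    $host = strtolower(rawurldecode($parts['host']));

    return strtolower($parts['scheme']) . '://' . $host;
}
```

With this, normalize_host('http://go%4fgle.com') and normalize_host('http://google.com') give the same string, so the two URLs from the example above would collide on a unique key if the normalized form is what gets stored.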

#4


14  

First, prepare the database.

  • Domain names aren't case-sensitive, but you have to assume the rest of a URL is. (Not all web servers respect case in URLs, but most do, and you can't easily tell by looking.)
  • Assuming you need to store more than a domain name, use a case-sensitive collation.
  • If you decide to store the URL in two columns--one for the domain name and one for the resource locator--consider using a case-insensitive collation for the domain name, and a case-sensitive collation for the resource locator. If I were you, I'd test both ways (URL in one column vs. URL in two columns).
  • Put a UNIQUE constraint on the URL column. Or on the pair of columns, if you store the domain name and resource locator in separate columns, as UNIQUE (url, resource_locator).
  • Use a CHECK() constraint to keep encoded URLs out of the database. This CHECK() constraint is essential to keep bad data from coming in through a bulk copy or through the SQL shell. (Be aware that MySQL only enforces CHECK constraints from version 8.0.16 onward; earlier versions parse them and silently ignore them.)

Second, prepare the URL.

  • Domain names aren't case-sensitive. If you store the full URL in one column, lowercase the domain name on all URLs. But be aware that some languages have uppercase letters that have no lowercase equivalent.
  • Think about trimming trailing characters. For example, these two URLs from amazon.com point to the same product. You probably want to store the second version, not the first.

    http://www.amazon.com/Systemantics-Systems-Work-Especially-They/dp/070450331X/ref=sr_1_1?ie=UTF8&qid=1313583998&sr=8-1

    http://www.amazon.com/Systemantics-Systems-Work-Especially-They/dp/070450331X

  • Decode encoded URLs. (See PHP's urldecode() function. Note carefully its shortcomings, as described in that page's comments.) Personally, I'd rather handle these kinds of transformations in the database than in client code. That would involve revoking permissions on the tables and views, and allowing inserts and updates only through stored procedures; the stored procedures handle all the string operations that put the URL into a canonical form. But keep an eye on performance when you try that. CHECK() constraints (see above) are your safety net.

Third, if you're inserting only the URL, don't test for its existence first. Instead, try to insert and trap the error that you'll get if the value already exists. Testing and inserting hits the database twice for every new URL. Insert-and-trap just hits the database once. Note carefully that insert-and-trap isn't the same thing as insert-and-ignore-errors. Only one particular error means you violated the unique constraint; other errors mean there are other problems.
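
The insert-and-trap idea could look roughly like this in PHP with PDO. This is only a sketch under assumptions: the helper names are invented, the table is assumed to be links with a UNIQUE key on url, and the connection is assumed to use PDO::ERRMODE_EXCEPTION. The one fact it relies on is that MySQL reports a duplicate key as error number 1062, the same number shown in the ERROR 1062 (23000) messages in answer #2.

```php
<?php
// MySQL's error number for "Duplicate entry ... for key ..." (ER_DUP_ENTRY).
const MYSQL_ER_DUP_ENTRY = 1062;

/**
 * True only if the exception is MySQL's duplicate-key error.
 * PDOException::$errorInfo is [SQLSTATE, driver error code, driver message].
 */
function is_duplicate_key_error(PDOException $e): bool
{
    return isset($e->errorInfo[1]) && (int) $e->errorInfo[1] === MYSQL_ER_DUP_ENTRY;
}

/**
 * Hypothetical helper: tries to insert $url into links(url).
 * Returns true if a row was inserted, false if the URL was already there.
 * Assumes $pdo uses PDO::ERRMODE_EXCEPTION and that url has a UNIQUE key.
 */
function insert_url(PDO $pdo, string $url): bool
{
    $stmt = $pdo->prepare('INSERT INTO links (url) VALUES (?)');
    try {
        $stmt->execute([$url]);
        return true;
    } catch (PDOException $e) {
        if (is_duplicate_key_error($e)) {
            return false; // the one error that means "already exists"
        }
        throw $e; // anything else is a real problem; do not swallow it
    }
}
```

Checking the driver error number rather than only the SQLSTATE matters here, because SQLSTATE 23000 is shared with other integrity errors such as inserting NULL into a NOT NULL column.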

On the other hand, if you're inserting the URL along with some other data in the same row, you need to decide ahead of time whether you'll handle duplicate URLs by

  • deleting the old row and inserting a new one (See MySQL's REPLACE extension to SQL)
  • updating existing values (See ON DUPLICATE KEY UPDATE)
  • ignoring the issue
  • requiring the user to take further action

REPLACE eliminates the need to trap duplicate key errors, but it might have unfortunate side effects if there are foreign key references.

#5


13  

To guarantee uniqueness you need to add a unique constraint. Assuming your table name is "urls" and the column name is "url", you can add the unique constraint with this alter table command:

alter table urls add constraint unique_url unique (url);

The alter table will probably fail (who really knows with MySQL) if you've already got duplicate URLs in your table.

#6


6  

The simple SQL solutions require a unique field; the logic solutions do not.

You should normalize your URLs to ensure there is no duplication. PHP functions such as strtolower() and urldecode() or rawurldecode() can help with this.

Assumptions: Your table name is 'websites', the column name for your url is 'url', and the arbitrary data to be associated with the url is in the column 'data'.

Logic Solutions

SELECT COUNT(*) AS UrlResults FROM websites WHERE url='http://www.domain.com'

Test the previous query with if statements in SQL or PHP to ensure that it is 0 before you continue with an INSERT statement.

Simple SQL Statements

Scenario 1: Your db is a first come first serve table and you have no desire to have duplicate entries in the future.

ALTER TABLE websites ADD UNIQUE (url)

This will prevent any entries from being able to be entered in to the database if the url value already exists in that column.

Scenario 2: You want the most up to date information for each url and don't want to duplicate content. There are two solutions for this scenario. (These solutions also require 'url' to be unique so the solution in Scenario 1 will also need to be carried out.)

REPLACE INTO websites (url, data) VALUES ('http://www.domain.com', 'random data')

This will trigger a DELETE action if a row exists followed by an INSERT in all cases, so be careful with ON DELETE declarations.

INSERT INTO websites (url, data) VALUES ('http://www.domain.com', 'random data')
ON DUPLICATE KEY UPDATE data='random data'

This will trigger an UPDATE action if a row exists and an INSERT if it does not.

#7


4  

In considering a solution to this problem, you need to first define what a "duplicate URL" means for your project. This will determine how to canonicalize the URLs before adding them to the database.

There are at least two definitions:

  1. Two URLs are considered duplicates if they represent the same resource knowing nothing about the corresponding web service that generates the corresponding content. Some considerations include:
    • The scheme and domain name portion of the URLs are case-insensitive, so HTTP://WWW.*.COM/ is the same as http://www.*.com/.
    • If one URL specifies a port, but it is the conventional port for the scheme and they are otherwise equivalent, then they are the same ( http://www.*.com/ and http://www.*.com:80/).
    • If the parameters in the query string are simple rearrangements and the parameter names are all different, then they are the same; e.g. http://authority/?a=test&b=test and http://authority/?b=test&a=test. Note that http://authority/?a%5B%5D=test1&a%5B%5D=test2 is not the same, by this first definition of sameness, as http://authority/?a%5B%5D=test2&a%5B%5D=test1.
    • If the scheme is HTTP or HTTPS, then the hash portions of the URLs can be removed, as this portion of the URL is not sent to the web server.
    • A shortened IPv6 address can be expanded.
    • Append a trailing forward slash to the authority only if it is missing.
    • Unicode canonicalization changes the referenced resource; e.g. you can't conclude that http://google.com/?q=%C3%84 (%C3%84 represents 'Ä' in UTF-8) is the same as http://google.com/?q=A%CC%88 (%CC%88 represents U+0308, COMBINING DIAERESIS).
    • If the scheme is HTTP or HTTPS, 'www.' in one URL's authority can not simply be removed if the two URLs are otherwise equivalent, as the text of the domain name is sent as the value of the Host HTTP header, and some web servers use virtual hosts to send back different content based on this header. More generally, even if the domain names resolve to the same IP address, you can not conclude that the referenced resources are the same.
  2. Apply basic URL canonicalization (e.g. lower case the scheme and domain name, supply the default port, stable sort query parameters by parameter name, remove the hash portion in the case of HTTP and HTTPS, ...), and take into account knowledge of the web service. Maybe you will assume that all web services are smart enough to canonicalize Unicode input (Wikipedia is, for example), so you can apply Unicode Normalization Form Canonical Composition (NFC). You would strip 'www.' from all Stack Overflow URLs. You could use PostRank's postrank-uri code, ported to PHP, to remove all sorts of pieces of the URLs that are unnecessary (e.g. &utm_source=...).

Definition 1 leads to a stable solution (i.e. there is no further canonicalization that can be performed and the canonicalization of a URL will not change). Definition 2, which I think is what a human considers the definition of URL canonicalization, leads to a canonicalization routine that can yield different results at different moments in time.
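
A few of the Definition 1 rules can be sketched in PHP as follows. This is an illustration under assumptions rather than a complete canonicalizer: the function name canonicalize_url() is invented, only HTTP and HTTPS are handled, URLs with a login part are rejected instead of processed, and IPv6 expansion and percent-encoding are left out. Query parameters are reordered only when all parameter names are distinct, which is the case the first definition treats as equivalent.

```php
<?php
/**
 * Hypothetical sketch of part of "Definition 1" for HTTP(S) URLs.
 * Returns null for anything it does not handle.
 */
function canonicalize_url(string $url): ?string
{
    $p = parse_url($url);
    if ($p === false || !isset($p['scheme'], $p['host']) || isset($p['user'])) {
        return null;
    }
    $scheme = strtolower($p['scheme']);
    if ($scheme !== 'http' && $scheme !== 'https') {
        return null;
    }
    $host = strtolower($p['host']); // scheme and host are case-insensitive

    // Drop the port when it is the conventional one for the scheme.
    $defaultPort = $scheme === 'http' ? 80 : 443;
    $port = (isset($p['port']) && $p['port'] !== $defaultPort) ? ':' . $p['port'] : '';

    // A missing path becomes "/"; the path itself stays case-sensitive.
    $path = $p['path'] ?? '/';

    $query = '';
    if (isset($p['query']) && $p['query'] !== '') {
        $pairs = explode('&', $p['query']);
        $nameOf = fn (string $pair): string => explode('=', $pair, 2)[0];
        $names = array_map($nameOf, $pairs);
        // Only reorder when every parameter name is distinct; repeated names
        // (such as a[]=1&a[]=2) are order-sensitive and are left untouched.
        if (count(array_unique($names)) === count($names)) {
            usort($pairs, fn (string $a, string $b): int => strcmp($nameOf($a), $nameOf($b)));
        }
        $query = '?' . implode('&', $pairs);
    }

    // The fragment (hash portion) is never sent to the server, so it is dropped.
    return $scheme . '://' . $host . $port . $path . $query;
}
```

The canonical string, or its separate parts, is what would then be stored and compared; the original URL can be kept alongside it if it is needed later.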

Whichever definition you choose, I suggest that you use separate columns for the scheme, login, host, port, and path portions. This will allow you to use indexes intelligently. The columns for scheme and host can use a case-insensitive character collation (in MySQL, the collations whose names end in _ci), but the columns for the login and path need to use a binary, case-sensitive collation. Also, if you use Definition 2, you need to preserve the original scheme, authority, and path portions, as certain canonicalization rules might be added or removed from time to time.

EDIT: Here are example table definitions:

CREATE TABLE `urls1` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `scheme` VARCHAR(20) NOT NULL,
    `canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
    `canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci', /* the "ci" stands for case-insensitive. Also, we want 'utf8mb4_unicode_ci'
rather than 'utf8mb4_general_ci' because 'utf8mb4_general_ci' treats accented characters as equivalent. */
    `port` INT UNSIGNED,
    `canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',

    PRIMARY KEY (`id`),
    INDEX (`canonical_host`(10), `scheme`)
) ENGINE = 'InnoDB';


CREATE TABLE `urls2` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `canonical_scheme` VARCHAR(20) NOT NULL,
    `canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
    `canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
    `port` INT UNSIGNED,
    `canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',

    `orig_scheme` VARCHAR(20) NOT NULL, 
    `orig_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
    `orig_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
    `orig_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',

    PRIMARY KEY (`id`),
    INDEX (`canonical_host`(10), `canonical_scheme`),
    INDEX (`orig_host`(10), `orig_scheme`)
) ENGINE = 'InnoDB';

Table `urls1` is for storing canonical URLs according to definition 1. Table `urls2` is for storing canonical URLs according to definition 2.

Unfortunately you will not be able to specify a UNIQUE constraint on the tuple (`scheme`/`canonical_scheme`, `canonical_login`, `canonical_host`, `port`, `canonical_path`), as MySQL limits the length of InnoDB index keys (767 bytes when this was written; newer versions with the DYNAMIC row format allow 3072 bytes, which is still too short for these columns).

#8


2  

I don't know the syntax for MySQL, but all you need to do is wrap your INSERT in an IF statement that queries the table to see whether a record with the given URL exists; if it exists, don't insert a new record.

In MSSQL you can do this:

IF NOT EXISTS (SELECT 1 FROM YOURTABLE WHERE URL = 'URL')
INSERT INTO YOURTABLE (...) VALUES (...)

#9


1  

If you want to insert URLs into the table, but only those that don't exist already, you can add a UNIQUE constraint on the column and add IGNORE to your INSERT query so that you don't get an error.

Example: INSERT IGNORE INTO urls SET url = 'url-to-insert'

#10


1  

First things first. If you haven't already created the table, or you created a table but do not have data in it, then you need to add a unique constraint or a unique index. More information about choosing between an index and a constraint follows at the end of the post. But they both accomplish the same thing: enforcing that the column only contains unique values.

To create a table with a unique index on this column, you can use:

CREATE TABLE MyURLTable(
ID INTEGER NOT NULL AUTO_INCREMENT
,URL VARCHAR(512)
,PRIMARY KEY(ID)
,UNIQUE INDEX IDX_URL(URL)
);

If you just want a unique constraint, and no index on that table, you can use

CREATE TABLE MyURLTable(
ID INTEGER NOT NULL AUTO_INCREMENT
,URL VARCHAR(512)
,PRIMARY KEY(ID)
,CONSTRAINT UNIQUE UNIQUE_URL(URL)
);

Now, if you already have a table, and there is no data in it, then you can add the index or constraint to the table with one of the following pieces of code.

ALTER TABLE MyURLTable
ADD UNIQUE INDEX IDX_URL(URL);

ALTER TABLE MyURLTable
ADD CONSTRAINT UNIQUE UNIQUE_URL(URL);

Now, you may already have a table with some data in it. In that case, you may already have some duplicate data in it. You can try creating the constraint or index shown above, and it will fail if you already have duplicate data. If you don't have duplicate data, great; if you do, you'll have to remove the duplicates. You can see a list of URLs with duplicates using the following query.

SELECT URL,COUNT(*),MIN(ID) 
FROM MyURLTable
GROUP BY URL
HAVING COUNT(*) > 1;

To delete rows that are duplicates, and keep one, do the following:

DELETE RemoveRecords
FROM MyURLTable As RemoveRecords
LEFT JOIN 
(
SELECT MIN(ID) AS ID
FROM MyURLTable
GROUP BY URL
HAVING COUNT(*) > 1
UNION
SELECT MIN(ID) AS ID
FROM MyURLTable
GROUP BY URL
HAVING COUNT(*) = 1
) AS KeepRecords
ON RemoveRecords.ID = KeepRecords.ID
WHERE KeepRecords.ID IS NULL;

Now that you have deleted all the duplicate records, you can go ahead and create your index or constraint. Now, if you want to insert a value into your database, you should use something like:

INSERT IGNORE INTO MyURLTable(URL)
VALUES('http://www.example.com');

That will attempt to do the insert, and if it finds a duplicate, nothing will happen. Now, let's say you have other columns; you can do something like this.

INSERT INTO MyURLTable(URL,Visits) 
VALUES('http://www.example.com',1)
ON DUPLICATE KEY UPDATE Visits=Visits+1;

That will try to insert the value, and if it finds the URL already there, it will update the record by incrementing the visits counter instead. Of course, you can always do a plain old insert and handle the resulting error in your PHP code.

Now, as for whether you should use constraints or indexes, that depends on a lot of factors. Indexes make for faster lookups, so your performance will be better as the table gets bigger, but storing the index takes up extra space. Indexes also usually make inserts and updates take longer, because the index has to be updated too. However, since the value has to be looked up either way to enforce uniqueness, in this case it may be quicker to just have the index anyway. As with anything performance related, the answer is to try both options and profile the results to see which works best for your situation.
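
Going back to the ON DUPLICATE KEY UPDATE statement above, here is a minimal sketch of how it might be called from PHP. It assumes a PDO connection to MySQL and the MyURLTable(URL, Visits) table with a unique index on URL; recordVisit() and describeUpsert() are made-up names, not part of any library.

```php
<?php
// Sketch only. Assumes a PDO connection to MySQL and the MyURLTable(URL, Visits)
// table from this answer, with a unique index on URL. Function names are made up.

// Translate MySQL's affected-row count for INSERT ... ON DUPLICATE KEY UPDATE.
function describeUpsert(int $affectedRows): string
{
    switch ($affectedRows) {
        case 1:  return 'inserted';   // no existing row, a new one was added
        case 2:  return 'updated';    // duplicate key, the existing row was changed
        default: return 'unchanged';  // duplicate key, but nothing needed changing
    }
}

function recordVisit(PDO $pdo, string $url): string
{
    $stmt = $pdo->prepare(
        'INSERT INTO MyURLTable (URL, Visits) VALUES (:url, 1)
         ON DUPLICATE KEY UPDATE Visits = Visits + 1'
    );
    // The placeholder keeps the URL out of the SQL text, so quotes in it are harmless.
    $stmt->execute([':url' => $url]);

    return describeUpsert($stmt->rowCount());
}
```

With a connection in hand, calling recordVisit($pdo, 'http://www.example.com') would be expected to report an insert the first time and an update on later calls, all in a single round trip to the database.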

#11


0  

If you just want a yes or no answer this syntax should give you the best performance.

select if(exists (select url from urls where url = 'http://asdf.com'), 1, 0) from dual

#12


0  

If you just want to make sure there are no duplicates, then add a unique index to the url field. That way there is no need to explicitly check whether the url exists: just insert as normal, and if it is already there, the insert will fail with a duplicate key error.

#13


0  

The answer depends on whether you want to know when an attempt is made to enter a record with a duplicate field. If you don't care then use the "INSERT... ON DUPLICATE KEY" syntax as this will make your attempt quietly succeed without creating a duplicate.

If on the other hand you want to know when such an event happens and prevent it, then you should use a unique key constraint which will cause the attempted insert/update to fail with a meaningful error.

#14


0  

$url = "http://www.scroogle.com";

// Escape the value before embedding it in SQL.
// (The mysql_* extension was removed in PHP 7; mysqli and PDO offer prepared statements instead.)
$safe_url = mysql_real_escape_string($url);

$query  = "SELECT `id` FROM `urls` WHERE `url` = '$safe_url'";
$resultdb = mysql_query($query) or die(mysql_error());
list($idtemp) = mysql_fetch_array($resultdb);

if(empty($idtemp)) // if $idtemp is empty the url doesn't exist and we go ahead and insert it into the db.
{
   mysql_query("INSERT INTO urls (`url`) VALUES('$safe_url')") or die(mysql_error());
}else{
   //do something else if the url already exists in the DB
}

#15


0  

Make the column the primary key

#16


0  

You can locate (and remove) duplicates using a self-join. Your table has a URL column and also some PK. (We know that the PK is not the URL, because otherwise you would not be allowed to have duplicates.)

SELECT
    *
FROM
    yourTable a
JOIN
    yourTable b -- Join the same table
        ON b.[URL] = a.[URL] -- where the URL's match
        AND b.[PK] <> a.[PK] -- but the PK's are different

This will return all rows which have duplicated URLs.

Say, though, that you wanted to only select the duplicates and exclude the original.... Well you would need to decide what constitutes the original. For the purpose of this answer let's assume that the lowest PK is the "original"

All you need to do is add the following clause to the above query:

WHERE
    a.[PK] NOT IN (
        SELECT 
            TOP 1 c.[PK] -- Only grabbing the original!
        FROM
            yourTable c
        WHERE
            c.[URL] = a.[URL] -- has the same URL
        ORDER BY
            c.[PK] ASC) -- sort it by whatever your criterion is for "original"

Now you have a set of all non-original duplicated rows. You could easily execute a DELETE or whatever you like from this result set.

Note that this approach may be inefficient, in part because MySQL doesn't always handle IN well, but I understand from the OP that this is a one-off "clean up" of the table rather than a check that runs every time. Also note that the bracketed identifiers and TOP 1 above are SQL Server syntax and would need adapting for MySQL.

If you want to check at INSERT time whether or not a value already exists you can run something like this

SELECT 
    1
WHERE
    EXISTS (SELECT * FROM yourTable WHERE [URL] = 'testValue')

If you get a result then you can conclude the value already exists in your DB at least once.

#17


-1  

You could do this query:

SELECT url FROM urls WHERE url = 'http://asdf.com' LIMIT 1

Then check if mysql_num_rows() == 1 to see if it exists.
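
The same check can be sketched with PDO and a prepared statement, which avoids building the SQL string by hand. This is an illustration only: urlExists() is a hypothetical helper, and it assumes a PDO handle and the urls(url) table from the query above.

```php
<?php
// Sketch only: urlExists() is a made-up helper. It assumes an open PDO
// connection and a table urls(url), as in the query above.
function urlExists(PDO $pdo, string $url): bool
{
    // LIMIT 1 lets the server stop at the first match.
    $stmt = $pdo->prepare('SELECT 1 FROM urls WHERE url = ? LIMIT 1');
    $stmt->execute([$url]);

    // fetchColumn() returns false when the result set is empty.
    return $stmt->fetchColumn() !== false;
}
```

Keep in mind that the answer can be stale by the time you act on it: another client may insert the same URL between this check and your own INSERT, so a unique index is still worth having.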

#1


39  

If you don't want to have duplicates you can do following:

If multiple users can insert data into the DB, the method suggested by @Jeremy Ruten can lead to an error: after you have performed the check, someone else can insert the same data into the table.

#2


23  

To answer your initial question, the easiest way to check whether there is a duplicate is to run an SQL query against what you're trying to add!

For example, were you to want to check for the url http://www.example.com/ in the table links, then your query would look something like

SELECT * FROM links WHERE url = 'http://www.example.com/';

Your PHP code would look something like

$conn = mysql_connect('localhost', 'username', 'password');
if (!$conn)
{
    die('Could not connect to database');
}
if(!mysql_select_db('mydb', $conn))
{
    die('Could not select database mydb');
}

$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);

if (!$result)
{
    die('There was a problem executing the query');
}

$number_of_rows = mysql_num_rows($result);

if ($number_of_rows > 0)
{
    die('This URL already exists in the database');
}

I've written this out longhand here, with all the connecting to the database, etc. It's likely that you'll already have a connection to a database, so you should use that rather than starting a new connection (replace $conn in the mysql_query command and remove the stuff to do with mysql_connect and mysql_select_db)

Of course, there are other ways of connecting to the database, like PDO, or using an ORM, or similar, so if you're already using those, this answer may not be relevant (and it's probably a bit beyond the scope to give answers related to this here!)

However, MySQL provides many ways to prevent this from happening in the first place.

Firstly, you can mark a field as "unique".

Let's say I have a table where I just want to store all the URLs that are linked to from my site, and the last time they were visited.

My definition might look something like this:-

CREATE TABLE links
(
    url VARCHAR(255) NOT NULL,
    last_visited TIMESTAMP
)

This would allow me to add the same URL over and over again, unless I wrote some PHP code similar to the above to stop this happening.

However, were my definition to change to

CREATE TABLE links
(
  url VARCHAR(255)  NOT NULL,
  last_visited TIMESTAMP,
  PRIMARY KEY (url)
)

Then this would make mysql throw an error when I tried to insert the same value twice.

An example in PHP would be

$result = mysql_query("INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW())", $conn);

if (!$result)
{
    die('Could not Insert Row 1');
}

$result2 = mysql_query("INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW())", $conn);

if (!$result2)
{
    die('Could not Insert Row 2');
}

If you ran this, you'd find that on the first attempt, the script would die with the comment Could not Insert Row 2. However, on subsequent runs, it'd die with Could not Insert Row 1.

This is because MySQL knows that the url is the Primary Key of the table. A Primary Key is a unique identifier for that row. Most of the time, it's useful to set the unique identifier for a row to be a number, because MySQL is generally quicker at looking up numbers than text. Within MySQL, keys (and especially Primary Keys) are used to define relationships between two tables. For example, if we had a table for users, we could define it as

CREATE TABLE users (
  username VARCHAR(255)  NOT NULL,
  password VARCHAR(40) NOT NULL,
  PRIMARY KEY (username)
)

However, when we wanted to store information about a post the user had made, we'd have to store the username with that post to identify that the post belonged to that user.

I've already mentioned that MySQL is faster at looking up numbers than strings, so this would mean we'd be spending time looking up strings when we didn't have to.

To solve this, we can add an extra column, user_id, and make that the primary key (so when looking up the user record based on a post, we can find it quicker)

CREATE TABLE users (
  user_id INT(10)  NOT NULL AUTO_INCREMENT,
  username VARCHAR(255)  NOT NULL,
  password VARCHAR(40)  NOT NULL,
  PRIMARY KEY (`user_id`)
)

You'll notice that I've also added something new here - AUTO_INCREMENT. This basically allows us to let that field look after itself. Each time a new row is inserted, it adds 1 to the previous number, and stores that, so we don't have to worry about numbering, and can just let it do this itself.

So, with the above table, we can do something like

INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');

and then

INSERT INTO users (username, password) VALUES('User', '988881adc9fc3655077dc2d4d757d480b5ea0e11');

When we select the records from the database, we get the following:-

mysql> SELECT * FROM users;
+---------+----------+------------------------------------------+
| user_id | username | password                                 |
+---------+----------+------------------------------------------+
|       1 | Mez      | d3571ce95af4dc281f142add33384abc5e574671 |
|       2 | User     | 988881adc9fc3655077dc2d4d757d480b5ea0e11 |
+---------+----------+------------------------------------------+
2 rows in set (0.00 sec)

However, here - we have a problem - we can still add another user with the same username! Obviously, this is something we don't want to do!

mysql> SELECT * FROM users;
+---------+----------+------------------------------------------+
| user_id | username | password                                 |
+---------+----------+------------------------------------------+
|       1 | Mez      | d3571ce95af4dc281f142add33384abc5e574671 |
|       2 | User     | 988881adc9fc3655077dc2d4d757d480b5ea0e11 |
|       3 | Mez      | d3571ce95af4dc281f142add33384abc5e574671 |
+---------+----------+------------------------------------------+
3 rows in set (0.00 sec)

Let's change our table definition!

CREATE TABLE users (
  user_id INT(10)  NOT NULL AUTO_INCREMENT,
  username VARCHAR(255)  NOT NULL,
  password VARCHAR(40)  NOT NULL,
  PRIMARY KEY (user_id),
  UNIQUE KEY (username)
)

Let's see what happens when we now try to insert the same user twice.

mysql> INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');
ERROR 1062 (23000): Duplicate entry 'Mez' for key 'username'

Huzzah!! We now get an error when we try and insert the username for the second time. Using something like the above, we can detect this in PHP.

Now, let's go back to our links table, but with a new definition.

CREATE TABLE links
(
    link_id INT(10)  NOT NULL AUTO_INCREMENT,
    url VARCHAR(255)  NOT NULL,
    last_visited TIMESTAMP,
    PRIMARY KEY (link_id),
    UNIQUE KEY (url)
)

and let's insert "http://www.example.com" into the database.

INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());

If we try and insert it again....

ERROR 1062 (23000): Duplicate entry 'http://www.example.com/' for key 'url'

But what happens if we want to update the time it was last visited?

Well, we could do something complex with PHP, like so:-

$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);

if (!$result)
{
    die('There was a problem executing the query');
}

$number_of_rows = mysql_num_rows($result);

if ($number_of_rows > 0)
{
    $result = mysql_query("UPDATE links SET last_visited = NOW() WHERE url = 'http://www.example.com/'", $conn);

    if (!$result)
    {
        die('There was a problem updating the links table');
    }
}

Or, even grab the id of the row in the database and use that to update it.

$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);

if (!$result)
{
    die('There was a problem executing the query');
}

$number_of_rows = mysql_num_rows($result);

if ($number_of_rows > 0)
{
    $row = mysql_fetch_assoc($result);

    $result = mysql_query('UPDATE links SET last_visited = NOW() WHERE link_id = ' . intval($row['link_id']), $conn);

    if (!$result)
    {
        die('There was a problem updating the links table');
    }
}

But MySQL has a nice built-in feature called REPLACE INTO.

Let's see how it works.

mysql> SELECT * FROM links;
+---------+-------------------------+---------------------+
| link_id | url                     | last_visited        |
+---------+-------------------------+---------------------+
|       1 | http://www.example.com/ | 2011-08-19 23:48:03 |
+---------+-------------------------+---------------------+
1 row in set (0.00 sec)

mysql> INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());
ERROR 1062 (23000): Duplicate entry 'http://www.example.com/' for key 'url'
mysql> REPLACE INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());
Query OK, 2 rows affected (0.00 sec)

mysql> SELECT * FROM links;
+---------+-------------------------+---------------------+
| link_id | url                     | last_visited        |
+---------+-------------------------+---------------------+
|       2 | http://www.example.com/ | 2011-08-19 23:55:55 |
+---------+-------------------------+---------------------+
1 row in set (0.00 sec)

Notice that when using REPLACE INTO, it's updated the last_visited time, and not thrown an error!

This is because MySQL detects that you're attempting to replace a row. It knows which row you mean, as you've set url to be unique: it finds the existing row through the value that should be unique (in this case, the url). REPLACE INTO then deletes that row and inserts a new one, which is why the output above reports 2 rows affected and why the link_id has changed as well - which is a bit unexpected! (In fact, I didn't realise this would happen until I just saw it happen!)

But what if you wanted to add a new URL? Well, REPLACE INTO will happily insert a new row if it can't find a matching unique row!

mysql> REPLACE INTO links (url, last_visited) VALUES ('http://www.*.com/', NOW());
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM links;
+---------+-------------------------------+---------------------+
| link_id | url                           | last_visited        |
+---------+-------------------------------+---------------------+
|       2 | http://www.example.com/       | 2011-08-20 00:00:07 |
|       3 | http://www.*.com/             | 2011-08-20 00:01:22 |
+---------+-------------------------------+---------------------+
2 rows in set (0.00 sec)

I hope this answers your question, and gives you a bit more information about how MySQL works!

#3


14  

Are you concerned purely about URLs that are the exact same string? If so, there is a lot of good advice in other answers. Or do you also have to worry about canonicalization?

For example: http://google.com and http://go%4fgle.com are the exact same URL, but would be allowed as duplicates by any of the database-only techniques. If this is an issue, you should preprocess the URLs to resolve any character escape sequences.

Depending on where the URLs are coming from, you will also have to worry about parameters and whether they are significant in your application.
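
As a rough illustration of that preprocessing step, the sketch below decodes only the percent-escapes that are always safe to decode and lower-cases the scheme and host. All three function names are hypothetical, and the rules are deliberately conservative; your application may need more (or fewer) of them.

```php
<?php
// Sketch only; all function names here are hypothetical.

// Decode percent-escapes, but only for "unreserved" characters (letters, digits,
// '-', '.', '_', '~'), which never have a special meaning inside a URL.
function decodeUnreserved(string $url): string
{
    $result = preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        function (array $m): string {
            $char = chr(hexdec($m[1]));
            return preg_match('/^[A-Za-z0-9._~-]$/', $char) === 1
                ? $char
                : strtoupper($m[0]);   // keep the escape, just normalise its case
        },
        $url
    );
    return $result ?? $url;
}

// Lower-case the scheme and host, leaving any login, path and query untouched.
function lowercaseSchemeAndHost(string $url): string
{
    $result = preg_replace_callback(
        '~^([a-z][a-z0-9+.-]*://)((?:[^/?#@]*@)?)([^/?#]*)~i',
        function (array $m): string {
            return strtolower($m[1]) . $m[2] . strtolower($m[3]);
        },
        $url
    );
    return $result ?? $url;
}

function simplifyUrl(string $url): string
{
    return lowercaseSchemeAndHost(decodeUnreserved($url));
}
```

Note that %4f decodes to an upper-case 'O', so the example URL above only matches http://google.com once the host has been lower-cased as well. Escapes such as %2F are left alone on purpose, because decoding them would change how the URL is split into parts.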

#4


14  

First, prepare the database.

  • Domain names aren't case-sensitive, but you have to assume the rest of a URL is. (Not all web servers respect case in URLs, but most do, and you can't easily tell by looking.)
  • Assuming you need to store more than a domain name, use a case-sensitive collation.
  • If you decide to store the URL in two columns--one for the domain name and one for the resource locator--consider using a case-insensitive collation for the domain name, and a case-sensitive collation for the resource locator. If I were you, I'd test both ways (URL in one column vs. URL in two columns).
  • Put a UNIQUE constraint on the URL column. Or on the pair of columns, if you store the domain name and resource locator in separate columns, as UNIQUE (url, resource_locator).
  • Use a CHECK() constraint to keep encoded URLs out of the database. This CHECK() constraint is essential to keep bad data from coming in through a bulk copy or through the SQL shell. (MySQL only enforces CHECK constraints from version 8.0.16 onwards; earlier versions parse them and then ignore them.)

Second, prepare the URL.

  • Domain names aren't case-sensitive. If you store the full URL in one column, lowercase the domain name on all URLs. But be aware that some languages have uppercase letters that have no lowercase equivalent.
  • Think about trimming trailing characters. For example, these two URLs from amazon.com point to the same product. You probably want to store the second version, not the first.

    http://www.amazon.com/Systemantics-Systems-Work-Especially-They/dp/070450331X/ref=sr_1_1?ie=UTF8&qid=1313583998&sr=8-1

    http://www.amazon.com/Systemantics-Systems-Work-Especially-They/dp/070450331X

  • Decode encoded URLs. (See php's urldecode() function. Note carefully its shortcomings, as described in that page's comments.) Personally, I'd rather handle these kinds of transformations in the database rather than in client code. That would involve revoking permissions on the tables and views, and allowing inserts and updates only through stored procedures; the stored procedures handle all the string operations that put the URL into a canonical form. But keep an eye on performance when you try that. CHECK() constraints (see above) are your safety net.

Third, if you're inserting only the URL, don't test for its existence first. Instead, try to insert and trap the error that you'll get if the value already exists. Testing and inserting hits the database twice for every new URL. Insert-and-trap just hits the database once. Note carefully that insert-and-trap isn't the same thing as insert-and-ignore-errors. Only one particular error means you violated the unique constraint; other errors mean there are other problems.
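
A minimal sketch of insert-and-trap in PHP follows. It assumes a PDO connection to MySQL and a hypothetical table urls(url) with a UNIQUE constraint on url; the function and constant names are made up for illustration. The point is that only MySQL error 1062 is treated as "already there", and every other error is passed on.

```php
<?php
// Sketch only. Assumes a PDO connection to MySQL and a hypothetical table
// urls(url) with a UNIQUE constraint on url.

// MySQL's error number for a unique-key violation (ER_DUP_ENTRY).
const MYSQL_DUPLICATE_ENTRY = 1062;

function isDuplicateKeyError(PDOException $e): bool
{
    // errorInfo is [SQLSTATE, driver-specific error code, driver message].
    return isset($e->errorInfo[1])
        && (int) $e->errorInfo[1] === MYSQL_DUPLICATE_ENTRY;
}

// Returns true if the URL was stored, false if it was already in the table.
function insertUrl(PDO $pdo, string $url): bool
{
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    try {
        $stmt = $pdo->prepare('INSERT INTO urls (url) VALUES (?)');
        $stmt->execute([$url]);
        return true;
    } catch (PDOException $e) {
        if (isDuplicateKeyError($e)) {
            return false;   // the one error we expected
        }
        throw $e;           // anything else is a real problem
    }
}
```

Checking the driver error number rather than just the SQLSTATE matters here, because SQLSTATE 23000 covers other integrity violations too.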

On the other hand, if you're inserting the URL along with some other data in the same row, you need to decide ahead of time whether you'll handle duplicate urls by

  • deleting the old row and inserting a new one (See MySQL's REPLACE extension to SQL)
  • updating existing values (See ON DUPLICATE KEY UPDATE)
  • ignoring the issue
  • requiring the user to take further action

REPLACE eliminates the need to trap duplicate key errors, but it might have unfortunate side effects if there are foreign key references.

#5


13  

To guarantee uniqueness you need to add a unique constraint. Assuming your table name is "urls" and the column name is "url", you can add the unique constraint with this alter table command:

alter table urls add constraint unique_url unique (url);

The alter table will fail with a duplicate entry error if you've already got duplicate urls in your table.

#6


6  

The simple SQL solutions require a unique field; the logic solutions do not.

You should normalize your URLs to ensure there is no duplication. PHP functions such as strtolower() and urldecode() or rawurldecode() can help with this.

Assumptions: Your table name is 'websites', the column name for your url is 'url', and the arbitrary data to be associated with the url is in the column 'data'.

Logic Solutions

SELECT COUNT(*) AS UrlResults FROM websites WHERE url='http://www.domain.com'

Test the previous query with if statements in SQL or PHP to ensure that it is 0 before you continue with an INSERT statement.

Simple SQL Statements

Scenario 1: Your db is a first come first serve table and you have no desire to have duplicate entries in the future.

ALTER TABLE websites ADD UNIQUE (url)

This will prevent any entries from being able to be entered in to the database if the url value already exists in that column.

Scenario 2: You want the most up to date information for each url and don't want to duplicate content. There are two solutions for this scenario. (These solutions also require 'url' to be unique so the solution in Scenario 1 will also need to be carried out.)

REPLACE INTO websites (url, data) VALUES ('http://www.domain.com', 'random data')

This will trigger a DELETE action if a row exists followed by an INSERT in all cases, so be careful with ON DELETE declarations.

INSERT INTO websites (url, data) VALUES ('http://www.domain.com', 'random data')
ON DUPLICATE KEY UPDATE data='random data'

This will trigger an UPDATE action if a row exists and an INSERT if it does not.

#7


4  

In considering a solution to this problem, you need to first define what a "duplicate URL" means for your project. This will determine how to canonicalize the URLs before adding them to the database.

There are at least two definitions:

  1. Two URLs are considered duplicates if they represent the same resource knowing nothing about the corresponding web service that generates the corresponding content. Some considerations include:
    • The scheme and domain name portions of the URLs are case-insensitive, so HTTP://WWW.*.COM/ is the same as http://www.*.com/.
    • If one URL specifies a port, but it is the conventional port for the scheme and they are otherwise equivalent, then they are the same (http://www.*.com/ and http://www.*.com:80/).
    • If the parameters in the query string are simple rearrangements and the parameter names are all different, then they are the same; e.g. http://authority/?a=test&b=test and http://authority/?b=test&a=test. Note that http://authority/?a%5B%5D=test1&a%5B%5D=test2 is not the same, by this first definition of sameness, as http://authority/?a%5B%5D=test2&a%5B%5D=test1.
    • If the scheme is HTTP or HTTPS, then the hash portions of the URLs can be removed, as this portion of the URL is not sent to the web server.
    • A shortened IPv6 address can be expanded.
    • Append a trailing forward slash to the authority only if it is missing.
    • Unicode canonicalization changes the referenced resource; e.g. you can't conclude that http://google.com/?q=%C3%84 (%C3%84 represents 'Ä' in UTF-8) is the same as http://google.com/?q=A%CC%88 (%CC%88 represents U+0308, COMBINING DIAERESIS).
    • If the scheme is HTTP or HTTPS, 'www.' in one URL's authority cannot simply be removed if the two URLs are otherwise equivalent, as the text of the domain name is sent as the value of the Host HTTP header, and some web servers use virtual hosts to send back different content based on this header. More generally, even if the domain names resolve to the same IP address, you cannot conclude that the referenced resources are the same.
  2. Apply basic URL canonicalization (e.g. lower case the scheme and domain name, supply the default port, stable sort query parameters by parameter name, remove the hash portion in the case of HTTP and HTTPS, ...), and take into account knowledge of the web service. Maybe you will assume that all web services are smart enough to canonicalize Unicode input (Wikipedia is, for example), so you can apply Unicode Normalization Form Canonical Composition (NFC). You would strip 'www.' from all Stack Overflow URLs. You could use PostRank's postrank-uri code, ported to PHP, to remove all sorts of pieces of the URLs that are unnecessary (e.g. &utm_source=...).

Definition 1 leads to a stable solution (i.e. there is no further canonicalization that can be performed and the canonicalization of a URL will not change). Definition 2, which I think is what a human considers the definition of URL canonicalization, leads to a canonicalization routine that can yield different results at different moments in time.
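
To make Definition 1 a little more concrete, here is a rough sketch of some of its rules in PHP. canonicalizeUrl() is a hypothetical helper, not an existing library function, and it leaves out IPv6 expansion to stay short. It drops the conventional port, which is one of two equally valid ways to make the port comparable.

```php
<?php
// Sketch of some of the Definition 1 rules. canonicalizeUrl() is a hypothetical
// name, and IPv6 expansion is left out to keep the example short.
function canonicalizeUrl(string $url): string
{
    $p = parse_url($url);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }

    // Scheme and host are case-insensitive.
    $scheme = strtolower($p['scheme']);
    $host   = strtolower($p['host']);

    // Drop the port when it is the conventional one for the scheme.
    $defaultPorts = ['http' => 80, 'https' => 443];
    $port = $p['port'] ?? null;
    if ($port !== null && ($defaultPorts[$scheme] ?? null) === $port) {
        $port = null;
    }

    // The login part is case-sensitive, so it is copied as-is.
    $login = '';
    if (isset($p['user'])) {
        $login = $p['user'] . (isset($p['pass']) ? ':' . $p['pass'] : '') . '@';
    }

    // A bare authority gets a trailing slash.
    $path = $p['path'] ?? '/';

    // Sort query parameters by name; parameters that share a name keep their order.
    $query = '';
    if (isset($p['query']) && $p['query'] !== '') {
        $rows = [];
        foreach (explode('&', $p['query']) as $i => $pair) {
            $rows[] = [explode('=', $pair, 2)[0], $i, $pair];
        }
        usort($rows, function (array $a, array $b): int {
            return strcmp($a[0], $b[0]) ?: ($a[1] <=> $b[1]);
        });
        $query = '?' . implode('&', array_column($rows, 2));
    }

    // The fragment is never sent to an HTTP(S) server, so it is dropped there.
    $isWeb    = $scheme === 'http' || $scheme === 'https';
    $fragment = (!$isWeb && isset($p['fragment'])) ? '#' . $p['fragment'] : '';

    return $scheme . '://' . $login . $host
        . ($port !== null ? ':' . $port : '') . $path . $query . $fragment;
}
```

The original position is used as a tie-breaker in the sort so that repeated names such as a%5B%5D are never reordered, which matches the note in the list above.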

Whichever definition you choose, I suggest that you use separate columns for the scheme, login, host, port, and path portions. This will allow you to use indexes intelligently. The columns for scheme and host can use a case-insensitive character collation (one whose name ends in `_ci`), but the columns for the login and path need to use a binary, case-sensitive collation. Also, if you use Definition 2, you need to preserve the original scheme, authority, and path portions, as certain canonicalization rules might be added or removed from time to time.

EDIT: Here are example table definitions:

CREATE TABLE `urls1` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `scheme` VARCHAR(20) NOT NULL,
    `canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
    `canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci', /* the "ci" stands for case-insensitive. Also, we want 'utf8mb4_unicode_ci'
rather than 'utf8mb4_general_ci' because 'utf8mb4_general_ci' treats accented characters as equivalent. */
    `port` INT UNSIGNED,
    `canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',

    PRIMARY KEY (`id`),
    INDEX (`canonical_host`(10), `scheme`)
) ENGINE = 'InnoDB';


CREATE TABLE `urls2` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `canonical_scheme` VARCHAR(20) NOT NULL,
    `canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
    `canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
    `port` INT UNSIGNED,
    `canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',

    `orig_scheme` VARCHAR(20) NOT NULL, 
    `orig_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
    `orig_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
    `orig_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',

    PRIMARY KEY (`id`),
    INDEX (`canonical_host`(10), `canonical_scheme`),
    INDEX (`orig_host`(10), `orig_scheme`)
) ENGINE = 'InnoDB';

Table `urls1` is for storing canonical URLs according to definition 1. Table `urls2` is for storing canonical URLs according to definition 2.

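Unfortunately, you will not be able to specify a UNIQUE constraint on the tuple (`scheme`/`canonical_scheme`, `canonical_login`, `canonical_host`, `port`, `canonical_path`), because InnoDB limits the length of index keys: 767 bytes per column with the older COMPACT and REDUNDANT row formats, and 3072 bytes with DYNAMIC or COMPRESSED. A `VARCHAR(4096)` column in utf8mb4 can need up to 16384 bytes, so `canonical_path` alone exceeds either limit.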
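
One common workaround, which is not part of the answer above, is to store a fixed-length hash of the whole canonical URL and put the UNIQUE index on that instead. The following is only a sketch: the column name `url_hash` and the index name `uq_url_hash` are made up for illustration, and it assumes `urls1` is still empty when the column is added.

```sql
-- Hypothetical names: `url_hash`, `uq_url_hash`. Assumes `urls1` has no rows yet.
ALTER TABLE `urls1`
    ADD COLUMN `url_hash` BINARY(32) NOT NULL,
    ADD UNIQUE INDEX `uq_url_hash` (`url_hash`);

-- SHA2(..., 256) returns 64 hex characters; UNHEX packs them into 32 bytes.
INSERT INTO `urls1`
    (`scheme`, `canonical_login`, `canonical_host`, `port`, `canonical_path`, `url_hash`)
VALUES
    ('http', NULL, 'www.example.com', 80, '/',
     UNHEX(SHA2('http://www.example.com:80/', 256)));
```

Because the hash is compared byte for byte, the string that is hashed has to be built the same way every time (for example with the host already lower-cased), otherwise two spellings of the same URL would get different hashes.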

#8


2  

I don't know the syntax for MySQL, but all you need to do is wrap your INSERT in an IF statement that queries the table to see whether a record with the given URL exists. If it does, don't insert a new record.

In MSSQL you can do this:

IF NOT EXISTS (SELECT 1 FROM YOURTABLE WHERE URL = 'URL')
INSERT INTO YOURTABLE (...) VALUES (...)
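
MySQL only accepts the IF statement inside stored programs, so a rough MySQL counterpart of the statement above is a conditional INSERT ... SELECT. This is a sketch that assumes a table `urls` with a `url` column, as used in other answers here:

```sql
-- Inserts the row only when no row with the same url exists yet.
INSERT INTO urls (url)
SELECT 'http://www.example.com/'
FROM DUAL
WHERE NOT EXISTS (
    SELECT 1 FROM urls WHERE url = 'http://www.example.com/'
);
```

Without a UNIQUE index on `url`, two sessions running this at the same time could still both insert the row, so this check does not replace the constraint.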

#9


1  

If you want to insert URLs into the table, but only those that don't already exist, you can add a UNIQUE constraint on the column and add IGNORE to your INSERT query so that you don't get an error.

Example: INSERT IGNORE INTO urls SET url = 'url-to-insert'
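
Put together, the two steps might look like the sketch below. It assumes a table `urls` with a `url` column; the index name `uq_url` is made up for illustration.

```sql
-- Hypothetical index name `uq_url`; fails if the table already contains duplicates.
ALTER TABLE urls ADD UNIQUE INDEX uq_url (url);

INSERT IGNORE INTO urls SET url = 'url-to-insert';
SELECT ROW_COUNT();  -- 1 if the row was inserted, 0 if it was skipped as a duplicate
```

Keep in mind that IGNORE downgrades other errors of the statement to warnings as well, not only duplicate-key errors.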

#10


1  

First things first. If you haven't already created the table, or you created a table but do not have data in it, then you need to add a unique constraint or a unique index. More information about choosing between an index and a constraint follows at the end of the post. Both accomplish the same thing: enforcing that the column only contains unique values.

To create a table with a unique index on this column, you can use:

CREATE TABLE MyURLTable(
ID INTEGER NOT NULL AUTO_INCREMENT
,URL VARCHAR(512)
,PRIMARY KEY(ID)
,UNIQUE INDEX IDX_URL(URL)
);

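If you prefer to declare it as a unique constraint, you can use the following. Note that MySQL enforces a UNIQUE constraint by creating a unique index, so the result is effectively the same as above: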

CREATE TABLE MyURLTable(
ID INTEGER NOT NULL AUTO_INCREMENT
,URL VARCHAR(512)
,PRIMARY KEY(ID)
,CONSTRAINT UNIQUE UNIQUE_URL(URL)
);

Now, if you already have a table, and there is no data in it, then you can add the index or constraint to the table with one of the following pieces of code.

ALTER TABLE MyURLTable
ADD UNIQUE INDEX IDX_URL(URL);

ALTER TABLE MyURLTable
ADD CONSTRAINT UNIQUE UNIQUE_URL(URL);

Now, you may already have a table with some data in it. In that case, you may already have some duplicate data in it. You can try creating the constraint or index shown above, and it will fail if you already have duplicate data. If you don't have duplicate data, great; if you do, you'll have to remove the duplicates. You can see a list of URLs with duplicates using the following query.

SELECT URL,COUNT(*),MIN(ID) 
FROM MyURLTable
GROUP BY URL
HAVING COUNT(*) > 1;

To delete rows that are duplicates, and keep one, do the following:

DELETE RemoveRecords
FROM MyURLTable AS RemoveRecords
LEFT JOIN 
(
-- One row to keep per URL: the one with the lowest ID.
-- This covers both duplicated and non-duplicated URLs, and avoids selecting
-- a non-aggregated ID column, which ONLY_FULL_GROUP_BY would reject.
SELECT MIN(ID) AS ID
FROM MyURLTable
GROUP BY URL
) AS KeepRecords
ON RemoveRecords.ID = KeepRecords.ID
WHERE KeepRecords.ID IS NULL;
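
The same clean-up can also be written as a plain self-join, which some people find easier to read. This is an alternative sketch against the same `MyURLTable`; the aliases `Dup` and `Keep` are arbitrary.

```sql
-- Delete every row for which an earlier row (lower ID) with the same URL exists.
DELETE Dup
FROM MyURLTable AS Dup
JOIN MyURLTable AS Keep
  ON Keep.URL = Dup.URL   -- same URL
 AND Keep.ID < Dup.ID;    -- Keep is older, so Dup is a later copy
```

The row with the lowest ID for each URL never matches the join condition as `Dup`, so it is the one that survives.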

Now that you have deleted the duplicate records, you can go ahead and create your index or constraint. Now, if you want to insert a value into your database, you should use something like:

INSERT IGNORE INTO MyURLTable(URL)
VALUES('http://www.example.com');

That will attempt to do the insert, and if it finds a duplicate, nothing will happen. Now, let's say you have other columns; you can do something like this:

INSERT INTO MyURLTable(URL,Visits) 
VALUES('http://www.example.com',1)
ON DUPLICATE KEY UPDATE Visits=Visits+1;
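
To see both branches of that statement in action, here is a small self-contained sketch. The table `VisitDemo` is made up for illustration and mirrors `MyURLTable` with an added `Visits` column.

```sql
-- Hypothetical demo table; the UNIQUE index on URL is what triggers the UPDATE branch.
CREATE TABLE VisitDemo (
    ID INTEGER NOT NULL AUTO_INCREMENT,
    URL VARCHAR(512) NOT NULL,
    Visits INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (ID),
    UNIQUE INDEX IDX_URL (URL)
);

INSERT INTO VisitDemo (URL, Visits) VALUES ('http://www.example.com', 1)
ON DUPLICATE KEY UPDATE Visits = Visits + 1;
SELECT ROW_COUNT();  -- 1: a new row was inserted

INSERT INTO VisitDemo (URL, Visits) VALUES ('http://www.example.com', 1)
ON DUPLICATE KEY UPDATE Visits = Visits + 1;
SELECT ROW_COUNT();  -- 2: the existing row was updated, Visits is now 2
```

The affected-rows value is how the calling code can tell an insert from an update without running a separate SELECT first.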

That will try to insert the value, and if it finds the URL, it will update the record by incrementing the visits counter. Of course, you can always do a plain old insert and handle the resulting error in your PHP code. Now, as for whether you should use constraints or indexes: in MySQL a UNIQUE constraint is implemented as a unique index, so the two forms shown earlier behave the same. In general, indexes make for faster lookups, so your performance will be better as the table gets bigger, but storing the index will take up extra space. Indexes also usually make inserts and updates take longer, because the index has to be updated as well. However, since the value has to be looked up either way to enforce uniqueness, having the index is likely the quicker option in this case. As with anything performance related, the answer is to try the options and profile the results to see which works best for your situation.

#11


0  

If you just want a yes or no answer this syntax should give you the best performance.

select if(exists (select url from urls where url = 'http://asdf.com'), 1, 0) from dual

#12


0  

If you just want to make sure there are no duplicates, then add a unique index to the url field. That way there is no need to explicitly check whether the URL exists: just insert as normal, and if it is already there, the insert will fail with a duplicate key error.

#13


0  

The answer depends on whether you want to know when an attempt is made to enter a record with a duplicate field. If you don't care, then use the "INSERT ... ON DUPLICATE KEY UPDATE" syntax, as this will make your attempt quietly succeed without creating a duplicate.

If on the other hand you want to know when such an event happens and prevent it, then you should use a unique key constraint which will cause the attempted insert/update to fail with a meaningful error.

#14


0  

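$url = "http://www.scroogle.com";

// Escape the value before embedding it in SQL, to avoid SQL injection.
// Note: the mysql_* functions were removed in PHP 7; on current PHP, use mysqli or PDO prepared statements instead.
$safe_url = mysql_real_escape_string($url);

$query  = "SELECT `id` FROM `urls` WHERE `url` = '$safe_url'";
$resultdb = mysql_query($query) or die(mysql_error());
$row = mysql_fetch_array($resultdb); // false when no row matches

if ($row === false) // no matching row: the url doesn't exist, so we go ahead and insert it into the db.
{
   mysql_query("INSERT INTO `urls` (`url`) VALUES('$safe_url')") or die(mysql_error());
}else{
   //do something else if the url already exists in the DB
}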

#15


0  

Make the column the primary key

#16


0  

You can locate (and remove) duplicates using a self-join. Your table has some URL column and also some PK. (We know that the PK is not the URL, because otherwise you would not be allowed to have duplicates.)

SELECT
    *
FROM
    yourTable a
JOIN
    yourTable b -- Join the same table
        ON b.[URL] = a.[URL] -- where the URL's match
        AND b.[PK] <> a.[PK] -- but the PK's are different

This will return all rows which have duplicated URLs.

Say, though, that you wanted to select only the duplicates and exclude the original. You would need to decide what constitutes the original. For the purpose of this answer, let's assume that the row with the lowest PK is the "original".

All you need to do is add the following clause to the above query:

WHERE
    a.[PK] NOT IN (
        SELECT 
            TOP 1 c.[PK] -- Only grabbing the original!
        FROM
            yourTable c
        WHERE
            c.[URL] = a.[URL] -- has the same URL
        ORDER BY
            c.[PK] ASC) -- sort it by whatever your criterion is for "original"

Now you have a set of all non-original duplicated rows. You could easily execute a DELETE or whatever you like from this result set.

Note that this approach may be inefficient, in part because MySQL doesn't always handle IN well, but I understand from the OP that this is more of a one-off clean-up of the table than a check that runs all the time.
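
The queries in this answer use SQL Server syntax (square-bracket identifiers and TOP 1), which MySQL does not accept. A MySQL-flavoured sketch of the same idea, using the answer's placeholder names `yourTable`, `URL` and `PK`, could look like this:

```sql
-- Returns every row that is a later copy of some URL, i.e. all non-original duplicates.
SELECT a.*
FROM yourTable AS a
WHERE EXISTS (
    SELECT 1
    FROM yourTable AS o
    WHERE o.URL = a.URL   -- same URL
      AND o.PK < a.PK     -- an earlier row exists, so `a` is not the original
);
```

As in the answer, the row with the lowest PK is treated as the original; a different criterion would change the comparison in the inner WHERE clause.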

If you want to check at INSERT time whether or not a value already exists you can run something like this

SELECT 
    1
WHERE
    EXISTS (SELECT * FROM yourTable WHERE [URL] = 'testValue')

If you get a result, then you can conclude that the value already exists in your DB at least once.

#17


-1  

You could do this query:

SELECT url FROM urls WHERE url = 'http://asdf.com' LIMIT 1

Then check if mysql_num_rows() == 1 to see if it exists.
