Database: what is the best-performing way to query geolocation data?

Date: 2022-10-07 15:27:17

I have a MySQL database. I store homes in the database and perform literally just one query against it, but I need that query to be performed super fast: it returns all homes within a square box of geo latitude & longitude.

SELECT * FROM homes 
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???

What is the best way to store my geo data so that this query, which displays all homes within the geolocation box, runs the quickest?

Basically:

基本上:

  • Am I using the best SQL statement to perform this query the quickest?
  • Does any other method exist, maybe not even using a database, that would let me query homes within a boxed geolocation bounds as fast as possible?
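For reference, the `???` placeholders are usually computed from a center point and a box half-width. A minimal sketch of that arithmetic; the `bounding_box` helper and the ~69-miles-per-degree-of-latitude approximation are illustrative assumptions, not part of the question:

```python
import math

def bounding_box(lat, lng, half_width_miles):
    """Return (min_lat, max_lat, min_lng, max_lng) for a square box
    centered on (lat, lng). One degree of latitude is ~69 miles; a
    degree of longitude shrinks by cos(latitude)."""
    dlat = half_width_miles / 69.0
    dlng = half_width_miles / (69.0 * math.cos(math.radians(lat)))
    return (lat - dlat, lat + dlat, lng - dlng, lng + dlng)

min_lat, max_lat, min_lng, max_lng = bounding_box(36.12345, -120.54321, 3.0)
sql = ("SELECT * FROM homes "
       "WHERE geolat BETWEEN %s AND %s AND geolng BETWEEN %s AND %s")
params = (min_lat, max_lat, min_lng, max_lng)
```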

In case it helps, I've included my database table schema below:

CREATE TABLE IF NOT EXISTS `homes` (
  `home_id` int(10) unsigned NOT NULL auto_increment,
  `address` varchar(128) collate utf8_unicode_ci NOT NULL,
  `city` varchar(64) collate utf8_unicode_ci NOT NULL,
  `state` varchar(2) collate utf8_unicode_ci NOT NULL,
  `zip` mediumint(8) unsigned NOT NULL,
  `price` mediumint(8) unsigned NOT NULL,
  `sqft` smallint(5) unsigned NOT NULL,
  `year_built` smallint(5) unsigned NOT NULL,
  `geolat` decimal(10,6) default NULL,
  `geolng` decimal(10,6) default NULL,
  PRIMARY KEY  (`home_id`),
  KEY `geolat` (`geolat`),
  KEY `geolng` (`geolng`)
) ENGINE=InnoDB;

UPDATE


I understand spatial extensions will factor in the curvature of the earth, but I'm most interested in returning geo data the FASTEST. Unless these spatial database packages somehow return data faster, please don't recommend spatial extensions. Thanks.

UPDATE 2


Please note, no one below has truly answered the question. I'm really looking forward to any assistance I might receive. Thanks in advance.


11 Answers

#1 (12 votes)

There is a good paper on MySQL geolocation performance here.


EDIT: Pretty sure this uses a fixed radius. Also, I am not 100% certain the algorithm for calculating distance is the most advanced (i.e. it may "drill" straight through the Earth rather than measure along the surface).

What's significant is that the algorithm is cheap: it gives you a ballpark limit on the number of rows before you do the proper distance search.
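That two-phase approach — a cheap rectangular prefilter followed by an exact great-circle check on the survivors — can be sketched as follows. This is an illustrative Python stand-in, not the paper's own SQL; `homes_within` and the (lat, lng) tuple format are assumptions:

```python
import math

EARTH_RADIUS_MILES = 3959.0

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance over the surface (no 'drilling' through Earth)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def homes_within(homes, lat, lng, radius_miles):
    """Cheap bounding-box prefilter, then exact distance check on survivors."""
    dlat = radius_miles / 69.0
    dlng = radius_miles / (69.0 * math.cos(math.radians(lat)))
    candidates = [h for h in homes
                  if lat - dlat <= h[0] <= lat + dlat
                  and lng - dlng <= h[1] <= lng + dlng]
    return [h for h in candidates
            if haversine_miles(lat, lng, h[0], h[1]) <= radius_miles]
```

The prefilter is what the index accelerates; the haversine step only runs on the small candidate set.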

#2 (4 votes)

I had the same problem and wrote a three-part blog post. This was faster than the geo index.

Intro, Benchmark, SQL


#3 (2 votes)

If you really need performance, you can define bounding boxes for your data, map the pre-computed bounding boxes to your objects on insertion, and use them later for queries.

If the result sets are reasonably small, you could still do accuracy corrections in the application logic (easier to scale horizontally than a database) while still serving accurate results.

Take a look at Bret Slatkin's geobox.py, which contains great documentation for the approach.

I would still recommend checking out PostgreSQL and PostGIS in comparison to MySQL if you intend to do more complex queries in the foreseeable future.


#4 (1 vote)

The indices you are using are indeed B-tree indices and support the BETWEEN keyword in your query. This means the optimizer is able to use them to find the homes within your "box". It does not, however, mean that it will always use them: if you specify a range that contains too many "hits", the indices will not be used.

#5 (1 vote)

Here's a trick I've used with some success: create round-off regions. That is to say, if you have a location at 36.12345,-120.54321 and you want to group it with other locations within a half-mile (approximate) grid box, you can call its region 36.12x-120.54, and all other locations with the same round-off region will fall in the same box.

Obviously, that doesn't get you a clean radius, i.e. the location you're looking at may sit closer to one edge of its box than another. However, with this sort of set-up, it's easy enough to calculate the eight boxes that surround your main location's box. To wit:

[36.13x-120.55][36.13x-120.54][36.13x-120.53]
[36.12x-120.55][36.12x-120.54][36.12x-120.53]
[36.11x-120.55][36.11x-120.54][36.11x-120.53]

Pull all the locations with matching round-off labels and then, once you've got them out of the database, do your distance calculations to determine which ones to use.
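A sketch of that scheme; the `region_label`/`surrounding_labels` helpers are hypothetical names, and 0.01° of latitude is roughly 0.69 miles, so these boxes approximate the half-mile grid described above:

```python
def region_label(lat, lng):
    """Round to two decimals, matching the 36.12x-120.54 style above."""
    return f"{lat:.2f}x{lng:.2f}"

def surrounding_labels(lat, lng):
    """The 3x3 grid of labels: the home box plus its eight neighbours,
    rows north to south, columns west to east (as laid out above)."""
    base_lat, base_lng = round(lat, 2), round(lng, 2)
    return [f"{base_lat + i * 0.01:.2f}x{base_lng + j * 0.01:.2f}"
            for i in (1, 0, -1) for j in (-1, 0, 1)]
```

The nine labels then go into a `WHERE region IN (...)` query against an indexed label column.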

#6 (0 votes)

This looks pretty fast. My only concern is that it would use an index to get all the values within 3 miles of the latitude, then filter those for values within 3 miles of the longitude. If I understand how the underlying system works, only one index can be used per table, so either the index on lat or the one on lng goes unused.

If you had a large amount of data, it might speed things up to give every 1x1-mile square a unique logical ID, add a restriction to the SELECT such as (area="23234/34234" OR area="23235/34234" OR ...) for all the squares around your point, and force the database to use that index rather than the lat and lng ones. Then you'll be filtering far fewer square miles of data.
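That restriction could be generated along these lines. The "lat-mile/lng-mile" label format and the `area` column follow the answer, while the helper names and the simplification of ignoring the cos(latitude) shrink of longitude miles are illustrative assumptions:

```python
import math

def area_id(lat, lng):
    """Integer 1x1-mile grid cell for a point (1 degree of latitude is
    ~69 miles; longitude scaling is deliberately simplified here)."""
    return f"{math.floor(lat * 69)}/{math.floor(lng * 69)}"

def area_restriction(lat, lng):
    """OR-list over the cell containing the point plus its 8 neighbours,
    suitable for appending to the WHERE clause."""
    r, c = math.floor(lat * 69), math.floor(lng * 69)
    cells = [f"{r + i}/{c + j}" for i in (-1, 0, 1) for j in (-1, 0, 1)]
    return "(" + " OR ".join(f"area='{cell}'" for cell in cells) + ")"
```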

#7 (0 votes)

Homes? You probably won't even have ten thousand of them. Just use an in-memory index like an STRTree.
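STRTree implementations live in libraries like JTS (Java) or Shapely (Python). To illustrate how little machinery an in-memory index needs at this scale, here is a minimal hypothetical stand-in that sorts by latitude and binary-searches the band:

```python
import bisect

class LatIndex:
    """Tiny in-memory stand-in for a spatial index: homes sorted by
    latitude, binary search narrows the latitude band, then longitude
    is filtered. For a few thousand rows this is effectively instant."""

    def __init__(self, homes):
        # homes: iterable of (lat, lng, home_id) tuples
        self._homes = sorted(homes)
        self._lats = [h[0] for h in self._homes]

    def query_box(self, min_lat, max_lat, min_lng, max_lng):
        lo = bisect.bisect_left(self._lats, min_lat)
        hi = bisect.bisect_right(self._lats, max_lat)
        return [h for h in self._homes[lo:hi]
                if min_lng <= h[1] <= max_lng]
```

A real STRtree partitions on both axes, but for box queries over ~10k points the difference is negligible.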

#8 (0 votes)

Sticking with your current approach, there is one change you should make: rather than indexing geolat and geolng separately, you should have a composite index:

KEY `geolat_geolng` (`geolat`, `geolng`),

Currently your query will only be taking advantage of one of the two indexes.


#9 (0 votes)

A very good alternative is MongoDB with its Geospatial Indexing.


#10 (0 votes)

You might consider creating a separate table 'GeoLocations' with a primary key of ('geolat','geolng') and a column holding the home_id if that particular geolocation happens to have a home. This should allow the optimizer to search a range of geolocations that are sorted on disk for a list of home_ids. You could then join against your 'homes' table to find information about those home_ids.

CREATE TABLE IF NOT EXISTS `GeoLocations` (
  `geolat` decimal(10,6) NOT NULL,
  `geolng` decimal(10,6) NOT NULL,
  `home_id` int(10) NULL,
  PRIMARY KEY (`geolat`,`geolng`)
);

SELECT GL.home_id
FROM GeoLocations GL
INNER JOIN Homes H
  ON GL.home_id = H.home_id
WHERE GL.geolat BETWEEN @minLat AND @maxLat
  AND GL.geolng BETWEEN @minLng AND @maxLng

#11 (0 votes)

Since MySQL 5.7, MySQL can use spatial indexes together with functions like ST_Distance_Sphere() and ST_Contains(), which improves performance.
