How to efficiently calculate the distance between pairs of coordinates using data.table :=

Time: 2023-01-01 15:23:39

I want to find the most efficient (fastest) way to calculate the distances between pairs of latitude/longitude coordinates.

A not-so-efficient solution using sapply and spDistsN1{sp} has been presented elsewhere. I believe it could be made much faster by calling spDistsN1{sp} inside data.table with the := operator, but I haven't been able to get that to work. Any suggestions?

Here is a reproducible example:

# load libraries
  library(data.table)
  library(dplyr)
  library(sp)
  library(rgeos)
  library(UScensus2000tract)

# load data and create an Origin-Destination matrix
  data("oregon.tract")

# get centroids as a data.frame
  centroids <- as.data.frame(gCentroid(oregon.tract,byid=TRUE))

# Convert row names into first column
  setDT(centroids, keep.rownames = TRUE)[]

# create Origin-destination matrix
  orig <- centroids[1:754, ]
  dest <- centroids[2:755, ]
  odmatrix <- bind_cols(orig,dest)
  colnames(odmatrix) <- c("origi_id", "long_orig", "lat_orig", "dest_id", "long_dest", "lat_dest")

My failed attempt using data.table

odmatrix[ , dist_km := spDistsN1(as.matrix(long_orig, lat_orig), as.matrix(long_dest, lat_dest), longlat=T)]
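Two things in this call are worth checking. First, spDistsN1 takes a matrix of points and a single reference point, so it does not compute row-by-row pairwise distances in one call. Second, as.matrix(long_orig, lat_orig) does not build a two-column matrix: the second argument is passed through `...` and ignored. The base-R sketch below illustrates only the second point, using a few coordinates copied from the output further down; the variable names are made up for the illustration.

```r
# A few longitudes/latitudes taken from the example output
long <- c(-123.51, -123.67, -123.95)
lat  <- c(45.982, 46.113, 46.179)

# as.matrix() only uses its first argument; `lat` is silently dropped
m_wrong <- as.matrix(long, lat)
dim(m_wrong)   # 3 rows, 1 column (longitudes only)

# cbind() builds the two-column (longitude, latitude) matrix
m_right <- cbind(long, lat)
dim(m_right)   # 3 rows, 2 columns
```

Inside a data.table `j` expression the same applies: cbind(long_orig, lat_orig) or matrix(c(long_orig, lat_orig), ncol = 2) gives the two-column layout.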

Here is a solution that works (but probably less efficiently)

odmatrix$dist_km <- sapply(1:nrow(odmatrix),function(i)
  spDistsN1(as.matrix(odmatrix[i,2:3]),as.matrix(odmatrix[i,5:6]),longlat=T))

head(odmatrix)

>   origi_id long_orig lat_orig  dest_id long_dest lat_dest dist_km
>      (chr)     (dbl)    (dbl)    (chr)     (dbl)    (dbl)   (dbl)
> 1 oregon_0   -123.51   45.982 oregon_1   -123.67   46.113 19.0909
> 2 oregon_1   -123.67   46.113 oregon_2   -123.95   46.179 22.1689
> 3 oregon_2   -123.95   46.179 oregon_3   -123.79   46.187 11.9014
> 4 oregon_3   -123.79   46.187 oregon_4   -123.83   46.181  3.2123
> 5 oregon_4   -123.83   46.181 oregon_5   -123.85   46.182  1.4054
> 6 oregon_5   -123.85   46.182 oregon_6   -123.18   46.066 53.0709

2 Answers

#1


3  

I wrote my own version of geosphere::distHaversine so that it fits more naturally into a data.table := call; it might be of use here.

dt.haversine <- function(lat_from, lon_from, lat_to, lon_to, r = 6378137){
    radians <- pi/180
    lat_to <- lat_to * radians
    lat_from <- lat_from * radians
    lon_to <- lon_to * radians
    lon_from <- lon_from * radians
    dLat <- (lat_to - lat_from)
    dLon <- (lon_to - lon_from)
    a <- (sin(dLat/2)^2) + (cos(lat_from) * cos(lat_to)) * (sin(dLon/2)^2)
    return(2 * atan2(sqrt(a), sqrt(1 - a)) * r)
}
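As a quick sanity check of the function above, here is a small sketch that needs only base R. With the default r = 6378137 (an equatorial radius in metres), the result is in metres, so one degree of longitude along the equator should come out at r * pi / 180, roughly 111.3 km. The function body is repeated so the snippet runs on its own; the test points are arbitrary.

```r
# Same function as above, repeated so this snippet is self-contained
dt.haversine <- function(lat_from, lon_from, lat_to, lon_to, r = 6378137){
    radians <- pi/180
    lat_to <- lat_to * radians
    lat_from <- lat_from * radians
    lon_to <- lon_to * radians
    lon_from <- lon_from * radians
    dLat <- (lat_to - lat_from)
    dLon <- (lon_to - lon_from)
    a <- (sin(dLat/2)^2) + (cos(lat_from) * cos(lat_to)) * (sin(dLon/2)^2)
    return(2 * atan2(sqrt(a), sqrt(1 - a)) * r)
}

# One degree of longitude along the equator, in metres
d_m <- dt.haversine(0, 0, 0, 1)
d_m / 1000                     # in km, roughly 111.3

# The function is vectorised: equal-length inputs give one distance per pair
d_vec <- dt.haversine(c(0, 0), c(0, 0), c(0, 0), c(1, 2))
length(d_vec)                  # 2
```

Because every operation inside is vectorised, whole columns can be passed in a single call, and dividing by 1000 gives kilometres to match the dist_km column in the question.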

Here are some benchmarks of how it performs against the original geosphere::distHaversine and geosphere::distGeo:

dt1 <- copy(odmatrix); dt2 <- copy(odmatrix); dt3 <- copy(odmatrix)

library(microbenchmark)
library(geosphere)  # provides distHaversine() and distGeo()

microbenchmark(

    dtHaversine = {
        dt1[, dist := dt.haversine(lat_orig, long_orig, lat_dest, long_dest)]
    }   ,

    haversine = {
        dt2[ , dist := distHaversine(matrix(c(long_orig, lat_orig), ncol = 2), 
                                     matrix(c(long_dest, lat_dest), ncol = 2))]
    },

    geo = {
        dt3[ , dist := distGeo(matrix(c(long_orig, lat_orig), ncol = 2), 
                               matrix(c(long_dest, lat_dest), ncol = 2))]
    }
)

# Unit: microseconds
#         expr     min       lq     mean   median       uq      max neval
# dtHaversine 370.300 396.6210 434.5841 411.4305 463.9965  906.797   100
#   haversine 651.974 681.1745 776.6127 706.2760 731.3480 1505.765   100
#         geo 647.699 679.8285 743.4914 706.0465 742.1605 1272.310   100

Naturally, the results of the two techniques differ slightly because of how the distances are calculated: haversine assumes a spherical Earth, whereas distGeo computes the geodesic on an ellipsoid.

#2


6  

Thanks to @chinsoon12's comment I found a quite fast solution combining distGeo{geosphere} and data.table. On my laptop, the fast solution was more than 120 times faster than the alternative.

Let's make the data set larger to compare speed performance.

# Replicate the observations 1000 times
  odmatrix <- odmatrix[rep(seq_len(nrow(odmatrix)), 1000), ]

slow solution

system.time(
           odmatrix$dist_km <- sapply(1:nrow(odmatrix),function(i)
             spDistsN1(as.matrix(odmatrix[i,2:3]),as.matrix(odmatrix[i,5:6]),longlat=T)) 
            )

 >   user  system elapsed 
 >   222.17    0.08  222.84 

fast solution

# load library
  library(geosphere)

# convert the data.frame to a data.table
  setDT(odmatrix)

system.time(
            odmatrix[ , dist_km2 := distGeo(matrix(c(long_orig, lat_orig), ncol = 2), 
                                            matrix(c(long_dest, lat_dest), ncol = 2))/1000]
           )

>   user  system elapsed 
>   1.76    0.03    1.79 
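Much of the gap between the two timings plausibly comes from calling the distance function once per row instead of once on whole columns. The sketch below reproduces that pattern in base R only, under clearly stated assumptions: the coordinates are synthetic, and hav_km is a hypothetical haversine helper written for this illustration (it is not a function from sp or geosphere). It checks that both approaches return the same distances; the timings will vary by machine, so none are quoted.

```r
# Hypothetical helper for illustration: haversine distance in km
hav_km <- function(lat1, lon1, lat2, lon2, r = 6378.137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * atan2(sqrt(a), sqrt(1 - a))
}

# Synthetic origin-destination pairs (not the Oregon tract data)
set.seed(1)
n <- 2000
od <- data.frame(
  long_orig = runif(n, -124, -117), lat_orig = runif(n, 42, 46),
  long_dest = runif(n, -124, -117), lat_dest = runif(n, 42, 46)
)

# Row-by-row: one function call per pair, as in the sapply() approach
t_row <- system.time(
  d_row <- sapply(seq_len(n), function(i)
    hav_km(od$lat_orig[i], od$long_orig[i], od$lat_dest[i], od$long_dest[i]))
)

# Vectorised: a single call on whole columns
t_vec <- system.time(
  d_vec <- with(od, hav_km(lat_orig, long_orig, lat_dest, long_dest))
)

all.equal(d_row, d_vec)   # same distances either way
rbind(row_by_row = t_row, vectorised = t_vec)[, "elapsed"]
```

The data.table := call above follows the vectorised pattern: distGeo receives two full coordinate matrices and returns one distance per row.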
