A function to compare columns in a dataframe and report on the existing differences

I'm trying to write a function to compare the values of two columns (x and y) in every row of a dataframe. The function should check row by row whether the values are identical, allowing a specified tolerance z for each pair of values. identical() doesn't help because it doesn't allow small differences. Nor can I use all.equal(), because its "tolerance" parameter relates to the mean difference across all rows, as the following example demonstrates.

> df <- data.frame("x"=c(1,2,3,4,5), "y"=c(2,7,3,4,5))
> df$diff_x_y <- df$x-df$y
> df
  x y diff_x_y
1 1 2       -1
2 2 7       -5
3 3 3        0
4 4 4        0
5 5 5        0
> all.equal(df$x, df$y, scale=1,tolerance=4)
[1] TRUE
>

So this is what I've come up with so far:

fun <- function (x, y, z) 
{
  diff <- abs(x-y)            # absolute difference for each row
  within_tol <- diff <= z     # TRUE where the difference is within tolerance z
  return(summary(within_tol))
}

This works fine for the example dataframe from above:

> fun(df$x,df$y,1)
   Mode   FALSE    TRUE    NA's 
logical       1       4       0

Now I want the function to give me some information about the existing differences. I imagine something like this:

difference  frequency
1:10        4
11:100      30
101:1000    350

"difference" is supposed to define adjustable ranges for the differences, and "frequency" should show the number of rows whose difference falls in the corresponding range. Other suggestions for returning more detailed information about the differences are welcome. Note that my original dataframe contains about 2 million rows, some of which may differ significantly.

1 Answer

#1

The simplest way, IMHO, is to use cut:

df$diff.cat  <- cut(abs(df$x-df$y),breaks=c(0,1,10,100,1000),right = FALSE)

The right = FALSE argument makes the intervals closed on the left and open on the right, so each interval includes its lower (small) bound:

0 <= first interval < 1
1 <= second interval < 10 etc.
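
To make the boundary behaviour concrete, here is a small sketch comparing right = FALSE with the default. The sample values in d are made up for illustration and sit exactly on the break points; they are not from the question's data.

```r
# Hypothetical sample values sitting exactly on the break points
d <- c(0, 1, 10)
breaks <- c(0, 1, 10, 100, 1000)

# right = FALSE: intervals are [a, b), so a value equal to a break
# falls into the interval that starts at that break
left_closed <- cut(d, breaks = breaks, right = FALSE)
as.character(left_closed)

# Default (right = TRUE): intervals are (a, b], so 0 lies outside the
# first interval and becomes NA unless include.lowest = TRUE is set
right_closed <- cut(d, breaks = breaks)
as.character(right_closed)
```

Since identical rows have a difference of exactly 0, the left-closed form is the one that counts them in the first interval.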

You can adjust the intervals, of course. You can see the frequencies with

table(df$diff.cat)

so basically for:

df <- data.frame("x"=c(1,2,3,4,5), "y"=c(2,7,3,4,5))

table(cut(abs(df$x-df$y),breaks=c(0,1,10,100,1000),right = FALSE))

will give:

  [0,1)      [1,10)    [10,100) [100,1e+03) 
      3           2           0           0
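
If you want this wrapped up like the function in the question, something along these lines might work. This is only a sketch: the name diff_summary, the default breaks and the trailing Inf are my own assumptions, not part of the original answer. The Inf is there so that differences of 1000 or more get a bin of their own instead of turning into NA.

```r
# Hypothetical helper (name and defaults are illustrative only):
# tabulate absolute differences between two numeric vectors by range.
diff_summary <- function(x, y, breaks = c(0, 1, 10, 100, 1000, Inf)) {
  d <- abs(x - y)
  # right = FALSE gives left-closed intervals [a, b); ending the breaks
  # with Inf keeps very large differences from turning into NA
  bins <- cut(d, breaks = breaks, right = FALSE)
  # useNA = "ifany" adds an NA row when x or y contains missing values
  counts <- table(bins, useNA = "ifany")
  data.frame(difference = names(counts),
             frequency = as.vector(counts),
             stringsAsFactors = FALSE)
}

df <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 7, 3, 4, 5))
res <- diff_summary(df$x, df$y)
res
```

cut() and table() are both vectorised, so this should cope with a couple of million rows without an explicit loop; pass your own breaks to change the ranges.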
