How do I unit test relative performance?

Date: 2023-01-30 16:27:28

Given that I don't know at deployment time what kinds of systems my code will be running on, how do I write a performance benchmark that uses the potential of the system as its yardstick?

What I mean is that if a system is capable of running the piece of code 1000 times per second, I'd like the test to ensure that it comes as close to 1000 as possible. If it can only do 500, then that's the rate I'd like to compare it against.

If it helps in making the answer more specific, I'm using JUnit4.

Thank you.

4 Answers

#1


A test means you have a pass/fail threshold. For a performance test, this means: too slow and you fail; fast enough and you pass. If you fail, you start doing rework.

If you can't fail, then you're benchmarking, not actually testing.

When you talk about what a "system is capable of running", you have to define "capable". You could use any of a large number of hardware performance benchmarks: Whetstone, Dhrystone, etc., are popular. Or perhaps you have a database-intensive application, in which case you might want to look at the TPC benchmarks. Or perhaps you have a network-intensive application and want to use netperf, or a GUI-intensive application and want to use some kind of graphics benchmark.

Any of these give you some kind of "capability" measurement. Pick one or more. They're all good. Equally debatable. Equally biased toward your competitor and away from you.

Once you've run the benchmark, you can then run your software and see what the system actually does.

You could -- if you gather enough data -- establish some correlation between some benchmark numbers and your performance numbers. You'll see all kinds of variation based on workload, hardware configuration, OS version, virtual machine, DB server, etc.

With enough data from enough boxes with enough different configurations, you will eventually be able to develop a performance model that says "given this hardware, software, tuning parameters and configuration, I expect my software to do [X] transactions per second." That's a solid definition of "capable".
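
As a sketch of what such a model could look like in code (this is illustrative, not part of the original answer; the class name, the linear form of the model, and the 90% tolerance are all assumptions), you might fit a simple linear relation between a benchmark score and expected throughput:

// Hypothetical sketch of a performance model fitted offline from
// (benchmarkScore, observedTps) pairs collected across many machines,
// e.g. by least-squares regression. All names and numbers are made up.
public class CapabilityModel {

    private final double slope;
    private final double intercept;

    public CapabilityModel(double slope, double intercept) {
        this.slope = slope;
        this.intercept = intercept;
    }

    /** Expected transactions per second for a box with the given benchmark score. */
    public double expectedTps(double benchmarkScore) {
        return slope * benchmarkScore + intercept;
    }

    /** Pass if measured throughput reaches, say, 90% of what the model predicts. */
    public boolean meetsCapability(double benchmarkScore, double measuredTps) {
        return measuredTps >= 0.9 * expectedTps(benchmarkScore);
    }
}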

Once you have that model, you can then compare your software against the capability number. Until you have a very complete model, you don't really know which systems are even capable of running the piece of code 1000 times per second.

#2


I would not use unit testing for performance tests for a couple of reasons.

First, unit tests should not have dependencies on the surrounding system/code. Performance tests depend heavily on the hardware/OS, so it is hard to get uniform measures that will be usable on developer workstations, build servers, etc.

Second, unit tests should execute really fast. When you do performance tests, you usually want to have quite large data sets and repeat the runs a number of times in order to average the numbers, get rid of overhead, and so forth. These all work against the idea of fast tests.
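
If you still want to keep such tests in the same JUnit 4 code base without slowing down the fast unit-test run, one option is JUnit 4's categories, so the build can filter them out. A minimal sketch (the PerformanceTests marker interface and the test names are made-up examples):

import org.junit.Test;
import org.junit.experimental.categories.Category;

// Marker interface used only to tag slow tests; the name is a made-up example.
interface PerformanceTests {}

public class LargeDataSetTest {

    @Category(PerformanceTests.class)
    @Test
    public void processesLargeDataSetWithinBudget() {
        // long-running measurement with a large data set goes here
    }
}

The regular suite can then exclude the tagged tests, for example with the Categories runner and @Categories.ExcludeCategory, or the equivalent category filter in the build tool.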

#3


I agree with Brian when he says that unit tests are not the appropriate way to do performance testing. However, I put together a short example that could be used as an integration test to run on different system configurations/environments. Note that this is just to give an idea of what could be done in this regard; it does not provide results that are precise enough to back up any official statement about the performance of a system.

// The package name is a placeholder; the original snippet used a wildcard (com.*.samples.tests).
package com.example.samples.tests;

import static org.junit.Assert.*;

import org.junit.Test;

public class DoStuffPerformanceIT {

    @Test
    public void doStuffRuns500TimesPerSecond() {
        long maximumRunningTime = 1000; // measure for one second, in milliseconds
        long currentRunningTime = 0;
        int iterations = 0;

        do {
            long startTime = System.currentTimeMillis();

            // do stuff (the workload under test goes here)

            currentRunningTime += System.currentTimeMillis() - startTime;
            iterations++;
        }
        while (currentRunningTime <= maximumRunningTime);

        // Asserting an exact iteration count would make the test flaky;
        // require at least 500 iterations in the measured second instead.
        assertTrue("Expected at least 500 iterations, got " + iterations,
                iterations >= 500);
    }
}

#4


I do some time measurements on tests for code that is destined for a real-time system, where a correct answer that took too long to calculate is a failure.

All I do is plot the delta of the CPU time that the test took across recent builds. Note: CPU time, not real time. The actual value doesn't matter too much; what matters is how much it changed.
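
For the CPU-time measurement itself, one way to get it in Java (an illustrative sketch in keeping with the JUnit 4 context of the question; the original poster's setup sounds like an embedded toolchain) is the JMX thread bean:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuTimeProbe {

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isCurrentThreadCpuTimeSupported()) {
            throw new IllegalStateException("CPU time not supported on this JVM");
        }

        long start = threads.getCurrentThreadCpuTime(); // CPU time in nanoseconds

        // code under test goes here

        long cpuNanos = threads.getCurrentThreadCpuTime() - start;
        // Record one number per build and plot the series; the jumps between
        // builds are the interesting part, not the absolute value.
        System.out.println("cpu-time-ns=" + cpuNanos);
    }
}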

If I commit a change to an algorithm that significantly changes the run time for the test, I can easily zoom in on the specific changeset that caused it. What I really care about are these points of interest, not necessarily the absolute values. There are quite often many tradeoffs in a real-time system, and these can't always be represented to the test framework as a simple comparison.

Looking at absolute times and normalizing them seems reasonable at first, but in reality the conversion between your system and the target system will be non-linear. For instance, cache pressure, swap usage, disk speed, etc. on the target system may cause the test time to explode at a different threshold than on your system.

If you absolutely need a test that is accurate in this regard, duplicate the target system and use it as a test slave, in an environment similar to the one you expect it to be in.

In my case that may actually mean downloading firmware to a DSP, remotely power-cycling it, and reading the response from a serial port, or seeing no response because it crashed!

--jeffk++
