As a programmer who wants to write decent-performing code, I am very interested in understanding the architectures of CPUs and GPUs. However, unlike desktop and server CPUs, mobile CPU and GPU vendors tend to do very little architectural disclosure - a fact that we've been working hard to change over the past few years. Oftentimes all that's available are marketing slides with fuzzy performance claims. This situation frustrates me to no end personally. We've done quite a bit of low-level mobile CPU analysis at AnandTech in pursuit of understanding architectures where there is no publicly available documentation. In this spirit, I wrote a few synthetic tests to better understand the performance of current-gen ARM CPU cores without having to rely upon vendor-supplied information. For this article I'm focusing exclusively on floating point performance.

We will look at 5 CPU cores today: the ARM Cortex A9, ARM Cortex A15, Qualcomm Scorpion, Qualcomm Krait 200 and Qualcomm Krait 300. I wanted to test the instruction throughput of various floating point instructions. I wrote a simple benchmark consisting of a loop with a large number of iterations. The loop body consisted of many (say 20) floating point instructions with no data dependence between them. The tests were written in C++ with gcc NEON intrinsics where required, and I always checked the assembler output to verify that the generated assembly was as expected. There were no memory instructions inside the loop and thus memory performance was not an issue. There were minimal dependencies in the loop body.

I tested the performance of scalar addition, multiplication and multiply-accumulate for 32-bit and 64-bit floating point datatypes. All the tested ARM processors also support the NEON instruction set, which is a SIMD (single instruction multiple data) instruction set for ARM for integer and floating point operations. I tested the performance of 128-bit floating point NEON instructions for addition, multiplication and multiply-accumulate.

Apart from testing the throughput of individual instructions, I also wrote a test for the throughput of a program consisting of two types of instructions: scalar addition and scalar multiplication. The program consisted of an addition followed by a multiply, followed by another add, then another multiply and so on. There were no dependencies between the additions and the following multiplies.

You may be wondering about the reasoning behind this mixed test. Some CPU cores (such as AMD's K10 core) have two floating point units, but the two units may not be identical. For example, one floating point unit may only support addition while another may only support multiplication. Thus, if we only test the additions and multiplications separately, we will not see the peak throughput on such a machine. We perform the mixed test to identify such cases.

All the tests mentioned above measure the amount of time taken for a particular number of instructions, and thus we get the instructions executed per second. We also need to know the clock frequency to get the instructions executed per cycle.
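To make the methodology concrete, here is a minimal sketch of the mixed add/multiply test and the per-second to per-cycle conversion. It is not the original benchmark code: the function names, the choice of eight accumulators (the article mentions roughly 20 instructions per loop body), and the use of `volatile` to stop constant folding are all assumptions made for illustration. As the article stresses, the generated assembly still has to be inspected, since a compiler may vectorize or reorder the loop body.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: instructions per second divided by the clock frequency.
// freq_hz is assumed to be known and constant for the duration of the run.
double instructions_per_cycle(std::uint64_t instructions, double seconds,
                              double freq_hz) {
    assert(seconds > 0.0 && freq_hz > 0.0);
    return (static_cast<double>(instructions) / seconds) / freq_hz;
}

constexpr std::uint64_t kOpsPerIteration = 8;  // 4 adds + 4 multiplies below

// Mixed scalar test: add, multiply, add, multiply, ... Each operation uses its
// own accumulator, so no operation depends on another one in the same
// iteration. Returns the number of source-level floating point operations.
std::uint64_t run_mixed_add_mul(std::uint64_t iterations, float* sink) {
    // Read the operands through volatile so the compiler cannot fold them.
    volatile float v_inc = 0.5f, v_scale = 1.0f;
    const float inc = v_inc, scale = v_scale;

    float a0 = 0.0f, a1 = 1.0f, a2 = 2.0f, a3 = 3.0f;  // addition chains
    float m0 = 1.0f, m1 = 2.0f, m2 = 3.0f, m3 = 4.0f;  // multiplication chains

    for (std::uint64_t i = 0; i < iterations; ++i) {
        a0 += inc;  m0 *= scale;
        a1 += inc;  m1 *= scale;
        a2 += inc;  m2 *= scale;
        a3 += inc;  m3 *= scale;
    }

    // Publish the results so the loop is not removed as dead code.
    *sink = a0 + a1 + a2 + a3 + m0 + m1 + m2 + m3;
    return iterations * kOpsPerIteration;
}

// Times one run with a monotonic clock and converts it to instructions per
// cycle. No memory instructions are intended inside the timed loop itself.
double measure_mixed_ipc(std::uint64_t iterations, double freq_hz) {
    float sink = 0.0f;
    const auto start = std::chrono::steady_clock::now();
    const std::uint64_t ops = run_mixed_add_mul(iterations, &sink);
    const auto stop = std::chrono::steady_clock::now();
    volatile float keep = sink;  // keep the result observable
    (void)keep;
    const double seconds = std::chrono::duration<double>(stop - start).count();
    return instructions_per_cycle(ops, seconds, freq_hz);
}
```

A call such as `measure_mixed_ipc(100000000, 1.5e9)` would time one hundred million iterations and assume a 1.5 GHz clock; that frequency is a placeholder, and the real value has to come from the device under test. With only eight independent chains the loop may be bound by instruction latency rather than throughput on some cores, which is one reason a longer run of independent instructions is preferable.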