DrCCTProf: A Fine-grained Profiler for the ARM Ecosystem

An open-source project available via GitHub

SC'20 Best Paper Finalist

Introduction

ARM architectures, especially AArch64, are widely used in a variety of computing domains, ranging from SoCs (e.g., smartphones, IoT devices) to PCs (e.g., tablets, MacBooks) and even to supercomputers (e.g., Tianhe-3 and Fugaku). Unfortunately, moving to ARM is not a drop-in replacement for legacy software systems. The state of the ARM ecosystem -- compilers, operating systems, networking, I/O, among others -- has long been a source of concern, though it is now improving rapidly. Tool infrastructures are needed to help diagnose performance bottlenecks, pinpoint correctness bugs, and identify security vulnerabilities.

There are a number of tools available in the ARM ecosystem, which mainly fall into two categories: coarse-grained tools and fine-grained tools. Coarse-grained tools such as MAP, Perf, TAU, Scalasca, and HPCToolkit work on ARM; they sample hardware performance counters or instrument code at the source level or at compile time to monitor an execution with relatively low overhead. In contrast, fine-grained tools such as Valgrind, DynamoRIO, and Dyninst can monitor every instruction instance via extensive binary instrumentation. These two kinds of tools complement each other, and there is a place for both depending on the analysis in question.

Coarse-grained tools are well suited to identifying execution hotspots and parallel inefficiencies, among other tasks; fine-grained tools are valuable for gleaning deeper insights into an execution, for example, reuse-distance analysis, cache simulation, redundancy detection, and power and energy analysis. Furthermore, fine-grained analysis is also necessary for correctness tools such as data-race detection and record-and-replay. Unfortunately, the fine-grained tools in the ARM ecosystem's software stack are insufficient.

In this work, we develop DrCCTProf, a fine-grained analyzer for ARM architectures. Unlike existing fine-grained tools, DrCCTProf has the following features:
  * It collects the full calling context of every monitored instruction and maintains all call paths in a compact calling context tree (CCT).
  * It can optionally associate each memory access with the data object it touches, enabling data-centric analysis.
  * It scales to large multi-threaded and multi-process executions, with overheads that do not grow with parallelism.
  * It exposes simple APIs so client tools can build powerful analyses with minimal coding effort.

Implementation

Figure 1: Overview of DrCCTProf.

DrCCTProf uses DynamoRIO (maintained by Google) as its underlying binary instrumentation engine. As a framework, DynamoRIO exposes an interface for efficient and comprehensive binary manipulation at runtime. DynamoRIO supports multiple instruction sets, such as x86 and AArch64, and multiple operating systems, e.g., Windows, Linux, and Android. Furthermore, DynamoRIO supports inserting inlined assembly code into the binary, which avoids the cost of instrumentation via function calls.
Addressing ARM ISA Specifics
ARM provides the ldrex and strex instructions (ldxr and stxr on AArch64) to load a value from and store a value to memory exclusively. Compilers generate these two instructions as a pair to implement atomics and locks. The pair imposes a constraint: inserting instrumented memory accesses between them can clear the exclusive monitor, so the store-exclusive always fails and the retry loop never exits. To address this issue, DrCCTProf only allows instrumentation before ldrex and after strex; no memory access in between is monitored.
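
To make the constraint concrete, below is a minimal sketch (our own illustration, not DrCCTProf code) of the retry loop a compiler typically emits for an atomic increment on AArch64. Any instrumented memory access placed between the exclusive pair can clear the monitor and keep the loop spinning:

```c
/* Typical AArch64 exclusive-pair loop for an atomic increment, written as
   inline assembly for illustration.  An extra memory access inserted
   between ldxr and stxr can clear the exclusive monitor, making stxr fail
   on every iteration. */
static inline void atomic_inc(long *p)
{
    long tmp;
    int status;
    __asm__ volatile(
        "1: ldxr %0, [%2]\n"       /* load-exclusive: arm the monitor     */
        "   add  %0, %0, #1\n"
        "   stxr %w1, %0, [%2]\n"  /* store-exclusive: 0 on success       */
        "   cbnz %w1, 1b\n"        /* retry if the monitor was cleared    */
        : "=&r"(tmp), "=&r"(status)
        : "r"(p)
        : "memory");
}
```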

Optimizing with DynamoRIO
DynamoRIO supports two types of instrumentation: clean calls and inlined instructions. Clean calls support sophisticated user-defined analysis but incur high overhead due to the function invocations; inlined instructions are trickier to write but reduce the profiling overhead. DrCCTProf uses inlined instructions to pass the context and data handles to any client tool that queries them. Because client tools query these handles extensively, inlining substantially reduces DrCCTProf's overhead.
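
For contrast, the sketch below shows the clean-call style in a plain DynamoRIO client -- a hypothetical memory-reference counter of ours, not DrCCTProf's code. DrCCTProf instead emits a short inlined instruction sequence at each instrumentation site:

```c
#include <stdint.h>
#include "dr_api.h"

/* Global counter; not thread-safe, kept minimal for illustration. */
static uint64_t mem_ref_count;

/* The clean-call target runs in a full function context, which is what
   makes clean calls convenient but expensive. */
static void count_mem_ref(void)
{
    mem_ref_count++;
}

static dr_emit_flags_t
event_bb(void *drcontext, void *tag, instrlist_t *bb,
         bool for_trace, bool translating)
{
    for (instr_t *instr = instrlist_first(bb); instr != NULL;
         instr = instr_get_next(instr)) {
        if (instr_reads_memory(instr) || instr_writes_memory(instr))
            dr_insert_clean_call(drcontext, bb, instr,
                                 (void *)count_mem_ref,
                                 false /* no fpstate save */, 0);
    }
    return DR_EMIT_DEFAULT;
}

DR_EXPORT void
dr_client_main(client_id_t id, int argc, const char *argv[])
{
    dr_register_bb_event(event_bb);
}
```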

Handling Parallelism
DrCCTProf supports parallelism with both processes and threads. Each thread and process builds and manages its own CCT concurrently, and BNode (CCT node) allocation is lock-free. At the end of the execution, each thread/process writes its analysis results, including the CCT and metrics, into its own file. Thus, DrCCTProf can scale to highly parallel executions.
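
A rough sketch of how per-thread state can be kept lock-free in a DynamoRIO client follows; this is our illustration of the idea using DynamoRIO's drmgr extension, and DrCCTProf's actual data structures differ:

```c
#include "dr_api.h"
#include "drmgr.h"

/* Hypothetical CCT node; DrCCTProf's real node layout differs. */
typedef struct cct_node {
    struct cct_node *parent;
    /* children, instruction pointer, metrics ... */
} cct_node_t;

static int tls_idx; /* drmgr TLS slot holding each thread's CCT root */

static void event_thread_init(void *drcontext)
{
    /* Thread-private allocation: no lock needed on the hot path. */
    cct_node_t *root = dr_thread_alloc(drcontext, sizeof(cct_node_t));
    root->parent = NULL;
    drmgr_set_tls_field(drcontext, tls_idx, root);
}

static void event_thread_exit(void *drcontext)
{
    cct_node_t *root = drmgr_get_tls_field(drcontext, tls_idx);
    /* Write this thread's CCT and metrics to its own file here. */
    dr_thread_free(drcontext, root, sizeof(cct_node_t));
}

DR_EXPORT void
dr_client_main(client_id_t id, int argc, const char *argv[])
{
    drmgr_init();
    tls_idx = drmgr_register_tls_field();
    drmgr_register_thread_init_event(event_thread_init);
    drmgr_register_thread_exit_event(event_thread_exit);
}
```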

Offline Analysis and Visualization
As a fine-grained tool, DrCCTProf produces a large amount of profiling data, typically orders of magnitude more than coarse-grained tools. To facilitate the analysis, DrCCTProf employs an offline analysis component to (1) add program structure, such as loops and inlined functions, to the CCT, (2) aggregate the profiles from different threads and processes into a compact view, and (3) attribute binary instructions to source code.

Using DrCCTProf

DrCCTProf exposes a minimal set of APIs to client tools. A client tool typically follows this workflow:
  1. Initialize DrCCTProf and register the custom instrumentation functions to monitor all or a subset of instructions.
  2. Query the calling contexts and/or data objects for instructions of their choice.
  3. Correlate contexts involved in the analyzed problem (e.g., memory pairs for data use and reuse) and accumulate metrics related to them. Typically, a map is used, where the key is a 64-bit entity formed out of two 32-bit context handles and the value is any metric (see the key-packing sketch after this list).
  4. Output the map as a tailored CCT with metrics for DrCCTProf's offline analysis and visualization.
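
As an illustration of step 3, the helper below packs two 32-bit context handles into a single 64-bit map key. The type and function names are ours, not DrCCTProf's API:

```c
#include <stdint.h>

/* Illustrative only: pack a (use, reuse) pair of 32-bit context handles
   into one 64-bit key for a hash map that accumulates per-pair metrics. */
typedef int32_t context_handle_t;

static inline uint64_t
make_pair_key(context_handle_t use_ctxt, context_handle_t reuse_ctxt)
{
    return ((uint64_t)(uint32_t)use_ctxt << 32) | (uint32_t)reuse_ctxt;
}

/* Usage: pair_metric_map[make_pair_key(use, reuse)] += metric; */
```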

Overhead Analysis (overhead varies across ARM platforms)

We evaluate DrCCTProf on a high-end ARM cluster, using four nodes (128 cores in total) to measure the overhead on the NPB benchmark suite v3.3.1 with workload size C. NPB (the NAS Parallel Benchmarks), developed by NASA to evaluate the performance of parallel supercomputers, has both OpenMP and MPI versions. We evaluate both versions, compiled with gcc 7.5.0 at -O3 and using MPICH 3.3a2. While we evaluate all the NPB benchmarks, we show only the benchmarks common to the OpenMP and MPI versions. For the evaluation, we develop two simple client tools of DrCCTProf -- one collecting the calling context of every instruction and the other collecting the data object of every memory access.

Overhead Measurement
Table 1: Overhead of DrCCTProf.

The runtime overhead is computed as the execution time with DrCCTProf enabled divided by the native execution time. The memory overhead is computed as the peak memory consumption with DrCCTProf enabled divided by the peak memory consumption of the native execution; for MPI programs, we sum the peak memory consumption across all MPI processes. Table 1 shows the runtime and memory overhead of DrCCTProf.
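
Expressed as formulas (the notation is ours; for MPI runs we read the definition as summing peak memory in both the numerator and the denominator):

```latex
\mathrm{overhead_{time}} = \frac{T_{\mathrm{DrCCTProf}}}{T_{\mathrm{native}}},
\qquad
\mathrm{overhead_{mem}} =
\frac{\sum_{p} M^{\mathrm{peak}}_{\mathrm{DrCCTProf},\,p}}
     {\sum_{p} M^{\mathrm{peak}}_{\mathrm{native},\,p}}
```

where p ranges over MPI processes (a single term for OpenMP runs).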

For code-centric analysis, i.e., collecting the call path of each instruction, DrCCTProf incurs a median runtime overhead of 1.56× and 3.89× for the OpenMP and MPI benchmarks, respectively. It incurs a median memory overhead of 1.06× and 1.45× for the OpenMP and MPI benchmarks, respectively. One outlier is EP, where DrCCTProf incurs a 28× runtime overhead; the loop nests in EP contain many distinct basic blocks, which require more effort to build the CCT.

For data-centric analysis, i.e., associating each memory access with the data object it touches, DrCCTProf incurs a median runtime overhead of 1.89× and 4.13× for the OpenMP and MPI benchmarks, respectively. The median memory overhead is 9.04× and 3.09×, respectively. The high memory overhead is due to the shadow memory used for data-centric attribution.

A Use Case of DrCCTProf

We implement a client tool atop DrCCTProf that collects reuse distances to guide data-locality optimization. With the support of DrCCTProf, the client obtains the full calling contexts of both use and reuse instructions, as well as the data objects associated with the memory reuses. The client presents all reuse pairs with both code- and data-centric views in a GUI for further investigation. As a unique capability, the client can show an aggregate view of the data objects that account for most reuses. The client needs only ∼500 lines of code to enable this powerful reuse-distance analysis on ARM architectures.
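
The heart of such a client can be sketched as follows; this is our simplified illustration, not DrCCTProf's code, and it approximates reuse distance by the number of intervening accesses rather than the number of distinct lines touched. A shadow table remembers, for each cache line, the calling context and logical time of its previous access; each new access then yields a (use, reuse) context pair and a distance:

```c
#include <stdint.h>

typedef int32_t context_handle_t;

/* One shadow cell per (hashed) cache line; hash collisions and the
   zero-initialized first use are ignored for brevity. */
typedef struct {
    uint64_t line;              /* cache-line address of the last access */
    uint64_t last_time;         /* logical timestamp of the last access  */
    context_handle_t last_ctxt; /* calling context of the last access    */
} shadow_cell_t;

#define SHADOW_SLOTS (1u << 20)
static shadow_cell_t shadow[SHADOW_SLOTS];
static uint64_t logical_clock;

/* Invoked for every monitored memory access with its calling context
   (which a DrCCTProf client obtains through the framework). */
static void on_access(uint64_t addr, context_handle_t ctxt)
{
    uint64_t line = addr >> 6; /* 64-byte cache-line granularity */
    shadow_cell_t *cell = &shadow[line & (SHADOW_SLOTS - 1)];
    if (cell->line == line) {
        uint64_t distance = logical_clock - cell->last_time;
        /* Record the (cell->last_ctxt, ctxt) reuse pair with metric
           `distance`, e.g., via the 64-bit key shown earlier. */
        (void)distance;
    }
    cell->line = line;
    cell->last_time = logical_clock++;
    cell->last_ctxt = ctxt;
}
```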

Figure 2: A client tool of DrCCTProf that measures reuse distance for an OpenMP application.

Figure 2 presents the temporal reuse profiles in the GUI, which consists of three panes. The top pane shows the source code. The bottom-left pane shows the reuse pairs with both code- and data-centric attribution; clicking the line numbers to the left of the function names shows the corresponding source code in the top pane. The bottom-right pane shows the reuse-distance metrics -- the occurrences of temporal reuses in this case study. We annotate the snapshot of the GUI to facilitate our discussion.

One hot data object highlighted in the figure accounts for 5.8% of the total temporal reuses. Its allocation site, Allocate at line 1505 (a wrapper of malloc) in the call path, indicates that the involved array is z8n, as shown in the top pane. We investigate one of the top reuse pairs on z8n, which accounts for 3.6% of the total temporal reuses. As shown in the full calling context, the use is the memory access at line 1527 in the loop at line 1509, and the reuse is the memory access at line 1314 in the loop at line 1284 inside the function CalcFBHourglassForceForElems. The reuse distance is as high as 4×10^7 (not shown in the figure), indicating poor temporal locality. The client tool also reports similar reuses for the arrays x8n and y8n, allocated at lines 1503 and 1504, as shown in the source code pane. To enhance the temporal locality, we hoist the two loops (lines 1508 and 1283) into their least common ancestor in the call paths (i.e., the function CalcHourglassControlForElems) and fuse them. The optimization is safe: the two loops share the same bounds and stride, and the fusion does not change any dependence direction. This optimization yields a 1.28× speedup for the entire application.
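
Schematically, the transformation looks like this; the array and function names below are placeholders, not the application's actual code:

```c
#include <stddef.h>

/* Before: the two loops live in different callees of a common caller, so
   every element of a[] goes cold in the cache between its definition in
   produce() and its use in consume(). */
static void produce(double *a, size_t n) {
    for (size_t i = 0; i < n; ++i) a[i] = (double)i * 0.5;
}
static double consume(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += a[i];
    return sum;
}
static double caller_original(double *a, size_t n) {
    produce(a, n);        /* first loop, in one callee  */
    return consume(a, n); /* second loop, in another    */
}

/* After hoisting both loops into the common caller and fusing them, each
   a[i] is consumed while still hot in cache; legal here because the loops
   share bounds and stride and fusion adds no new dependences. */
static double caller_fused(double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        a[i] = (double)i * 0.5;
        sum += a[i];
    }
    return sum;
}
```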

Conclusions

This blog presents DrCCTProf, a framework for fine-grained call-path collection for binaries running on ARM-based clusters. DrCCTProf maintains all call paths of an execution in a compact and efficient calling context tree representation. The design is highly engineered to constrain both time and memory overheads, making it suitable for fine-grained analysis. Moreover, DrCCTProf has an option to associate each memory access instruction with the data object it touches. DrCCTProf is able to monitor large-scale multi-threaded and multi-process executions, and its overhead does not grow with parallel scaling. DrCCTProf exposes simple interfaces that allow client tools to incorporate code- and data-centric attribution. DrCCTProf outputs results in a compact profile aggregated across threads and processes, in a format consumable by a user-friendly GUI for hierarchical analysis. With an example client tool and a nontrivial optimization guided by it, we show that DrCCTProf can support powerful fine-grained analyses with minimal coding effort.