Code migration with formal verification for performance improvement of legacy code

Published on August 15, 2024


Domain-specific languages (DSLs) have long been a common alternative to general-purpose programming languages for implementing domain-specific optimizations and simplifying the programming of domain-specific hardware. For instance, the tremendous breakthroughs in AI applications seen in the past few years have been partly enabled by, and have in turn driven, developments in AI-specific hardware accelerators and programming frameworks such as PyTorch and TensorFlow.

Although DSLs have simplified AI application development, they have two inherent problems: (i) DSLs present a steep learning curve [1] for programmers who are otherwise well-versed in general-purpose programming languages such as C++ and Python, and (ii) domain-specific hardware, which is programmed via DSLs, does not support code written in general-purpose programming languages, turning existing applications into “legacy code” that cannot benefit from the latest advancements in domain-specific hardware (e.g., AI accelerators). Such legacy code creates a maintenance burden, not to mention the associated security risks, and it forgoes any opportunity to leverage those hardware advancements.


Porting legacy code to newer, popular DSLs has thus become an important business as well as research problem. The problem of transforming and compiling code from one language to another is called code transpilation. Because manual code transpilation is effort-intensive, cumbersome, error-prone, and expensive, several research efforts have been devoted to developing automated code transpilation approaches. Unfortunately, many of these approaches are either DSL-specific (and thus not extensible) [2, 3] or do not verify the generated code. Indeed, a key characteristic of a code transpilation tool is whether it ensures that the translated code is correct. We, at CodeMetal, are building systems to address the need for automated code transpilation across programming languages for different edge hardware. More importantly, we are developing fully automated systems that translate code from one language to another while ensuring both the correctness and the performance of the output code on various target hardware.


In this article, we cover prior research that we developed in collaboration with Prof. Alvin Cheung and his research group at the University of California, Berkeley.

Verified translation of C++ programs to AI-specific DSLs

The rapid pace of innovation in AI has led to the creation of AI-specific hardware accelerators as well as new software frameworks (e.g., PyTorch, TensorFlow) that enable programmers to leverage those accelerators. These frameworks also simplify AI application development and deployment. Given the popularity of Python, which pre-dated these frameworks, all of them provide Python-like programming languages that offer new APIs for various AI computations. These new languages present a steep learning curve, and, more importantly, legacy code (written in other languages) cannot benefit from the latest hardware accelerators until it is ported to these frameworks.

In this research effort, we address the problem of migrating legacy code to AI-specific DSLs (called tensor DSLs) such as NumPy, TensorFlow, and PyTorch by developing a framework called Tenspiler. In particular, Tenspiler treats code transpilation as a program synthesis and verification problem: the goal of the synthesis step is to search for a program in the target language, while the goal of the verification step is to verify that the found target program is semantically equivalent to the source program. Conceptually, this approach is simpler than the commonly used rule-based pattern-matching approach (used in source-to-source translation tools and compilers) because it does not require programming or maintaining translation rules. Instead, Tenspiler discovers the source-to-target translation by employing program synthesis. Specifically, Tenspiler uses verified lifting [4], a program synthesis technique that uses an intermediate representation (IR) to simplify both synthesis and verification.
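To make the transpilation problem concrete, here is a small hypothetical example (not drawn from the Tenspiler benchmarks): a legacy C++-style loop, shown as a comment, and a semantically equivalent NumPy program that a synthesis-based transpiler would aim to discover, checked against a reference loop on concrete inputs.

```python
import numpy as np

# Source (legacy C++, shown as a comment):
#   for (int i = 0; i < n; i++)
#       out[i] = a[i] * b[i] + c[i];

# A semantically equivalent NumPy program that synthesis-based
# transpilation would aim to find:
def fused_multiply_add(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
    return a * b + c

# Reference loop mirroring the C++ source, used to check
# semantic equivalence on concrete inputs.
def fused_multiply_add_loop(a, b, c):
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] * b[i] + c[i]
    return out

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
c = np.array([7.0, 8.0, 9.0])
assert np.array_equal(fused_multiply_add(a, b, c),
                      fused_multiply_add_loop(a, b, c))
```

The equivalence check here is only over concrete inputs; the point of the verification step described later is to establish it for all inputs.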

TensIR - Tenspiler’s IR for tensor operations 

While we could conceptually translate an input program in a source language directly into an output program in a target language via program synthesis, doing so has two drawbacks: 1) such a translator would be specific to the source and target languages, and 2) the grammar of the target language (which determines the set of possible programs in the target language, also called the search space) may allow programs of unbounded depth (e.g., via loops or recursion), leading to an unbounded search space. The solution adopted in the program-synthesis community is to develop an intermediate representation (IR) that captures the semantics of the target language while restricting the size of the search space.

In Tenspiler, we have developed an IR, called TensIR, that captures commonly used tensor operations (such as matrix multiplication and tensor-scalar operations) and types (e.g., scalars, vectors, and tensors) from popular tensor DSLs. More importantly, the grammar of TensIR restricts the search space of possible TensIR programs when translating input programs into tensor DSLs. We show the TensIR grammar below.

Fig 1. Grammar for TensIR
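To give a flavor of what such an IR looks like in practice, the sketch below encodes a few TensIR-style operators as a small Python AST together with an interpreter. The constructs and names here are illustrative only, not Tenspiler's actual implementation; the real grammar is the one shown in Fig 1.

```python
from dataclasses import dataclass
from typing import Union

# Illustrative (hypothetical) encoding of a few TensIR-style constructs.
@dataclass
class Var:
    name: str

@dataclass
class ScalarMul:          # tensor-scalar operation: s * t
    scalar: float
    tensor: "Expr"

@dataclass
class ElemAdd:            # elementwise tensor addition
    left: "Expr"
    right: "Expr"

Expr = Union[Var, ScalarMul, ElemAdd]

def interpret(expr: Expr, env: dict) -> list:
    """Evaluate a TensIR-style expression over Python lists."""
    if isinstance(expr, Var):
        return env[expr.name]
    if isinstance(expr, ScalarMul):
        return [expr.scalar * x for x in interpret(expr.tensor, env)]
    if isinstance(expr, ElemAdd):
        return [x + y for x, y in zip(interpret(expr.left, env),
                                      interpret(expr.right, env))]
    raise TypeError(expr)

# The IR tree for 2*a + b
prog = ElemAdd(ScalarMul(2.0, Var("a")), Var("b"))
assert interpret(prog, {"a": [1.0, 2.0], "b": [10.0, 20.0]}) == [12.0, 24.0]
```

Because every operator has a fixed meaning, equivalence between an IR tree and an input program can be checked mechanically, which is what makes synthesis over the IR tractable.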

Design of Tenspiler

The figure below shows the overall design of Tenspiler. In essence, it is divided into three main components: synthesis, verification, and code generation. We cover all three steps below.

Fig 2. Tenspiler Design

Program Synthesis - synthesizing TensIR program

Having discussed TensIR and its purpose, we now describe how our program synthesis step leverages it. The goal of Tenspiler's program synthesis step is to search for a TensIR program that is semantically equivalent to the input program in the source language. Technically, this semantically equivalent program in the target language is called a program summary (PS) in the field of program synthesis. In addition to the PS, our synthesis step also outputs loop invariants (INV) for the input program. Conceptually, loop invariants are Boolean expressions that capture relationships between the input and output variables of a loop and that hold at every iteration of the loop. INVs are required for our synthesis problem because the input programs contain loops, and loop invariants are needed to prove the equivalence of such programs.
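As a concrete, hypothetical illustration of a PS/INV pair, consider a simple vector-addition loop: the invariant states that after i iterations, the output prefix equals the elementwise sum of the input prefixes, and the PS states that the final output equals the full elementwise sum. The assertions below check both properties on concrete inputs.

```python
# Hypothetical source loop annotated with the PS and INV a synthesizer
# would need to find for it (not taken from the Tenspiler benchmarks).

def vec_add(a, b):
    out = []
    i = 0
    # INV: 0 <= i <= len(a)  and  out == [a[k] + b[k] for k in range(i)]
    while i < len(a):
        # The invariant holds at the start of every iteration:
        assert 0 <= i <= len(a)
        assert out == [a[k] + b[k] for k in range(i)]
        out.append(a[i] + b[i])
        i += 1
    # PS: out equals the elementwise sum of a and b
    assert out == [x + y for x, y in zip(a, b)]
    return out

assert vec_add([1, 2, 3], [4, 5, 6]) == [5, 7, 9]
```

In Tenspiler the invariant is a logical formula handed to a solver rather than a runtime assertion, but the role it plays is the same: it is the bridge that lets a loop be proven equivalent to a loop-free tensor expression.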

In Tenspiler, we formulate the search problem as a Syntax-Guided Synthesis problem [5] (SyGuS for short). The key idea in syntax-guided synthesis is that the syntax of the target language (TensIR in our case) guides the synthesis (i.e., search) of the target program. Our SyGuS problem is characterized by three parameters:

  1. The specification: it describes the program property that the synthesized TensIR program should satisfy; intuitively, it is the condition that checks whether the synthesized program is equivalent to the input program. In Tenspiler, we use verification conditions (VCs) as the specification for the synthesis phase, as VCs provide full guarantees (up to a bound) on the equivalence of the input program and the synthesized program. Intuitively, VCs are logical expressions that encode the semantics of the input program.
  2. The search space, which describes the set of possible TensIR programs. We use the TensIR grammar to define the search space. Conceptually, the search space imposes syntactic constraints on the structure of the output TensIR program.
  3. The search algorithm, which searches for the TensIR program. In Tenspiler, we use an enumerative algorithm, meaning it enumerates all possible programs up to a certain depth.
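The three ingredients above can be sketched in miniature: a toy grammar of elementwise expressions defines the search space, a set of input/output examples acts as a (bounded) specification, and a depth-limited enumerator is the search algorithm. Everything here is a simplified stand-in for TensIR and the real SyGuS machinery.

```python
import itertools

# Toy grammar (hypothetical):  E ::= a | b | E + E | E * E,
# interpreted elementwise over lists.

def enumerate_exprs(depth):
    """Yield all expressions of the grammar up to the given depth."""
    if depth == 0:
        yield from (("var", "a"), ("var", "b"))
        return
    yield from enumerate_exprs(depth - 1)
    for op in ("+", "*"):
        for l, r in itertools.product(list(enumerate_exprs(depth - 1)), repeat=2):
            yield (op, l, r)

def evaluate(expr, env):
    if expr[0] == "var":
        return env[expr[1]]
    op, l, r = expr
    lv, rv = evaluate(l, env), evaluate(r, env)
    return [x + y if op == "+" else x * y for x, y in zip(lv, rv)]

def synthesize(spec_io, max_depth=2):
    """Return the first candidate consistent with all input/output examples."""
    for expr in enumerate_exprs(max_depth):
        if all(evaluate(expr, env) == out for env, out in spec_io):
            return expr
    return None

# Examples behaviorally specifying "elementwise a*b + a":
examples = [({"a": [1, 2], "b": [3, 4]}, [4, 10]),
            ({"a": [2, 5], "b": [0, 1]}, [2, 10])]
found = synthesize(examples)
assert found is not None
assert evaluate(found, {"a": [3, 3], "b": [2, 2]}) == [9, 9]
```

Even this toy version shows why an unbounded grammar is a problem: the number of candidates grows explosively with depth, which is exactly what the restricted TensIR grammar and the optimizations described below are designed to control.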

Verification

Given the limitations of current SyGuS solvers, our synthesis step validates the PS and INV against the VC only for a bounded set of program states. In other words, the synthesis step alone does not guarantee full semantic equivalence between the input program and the synthesized TensIR program. To ensure complete semantic equivalence, we need a verification step. In Tenspiler, we leverage an existing SMT solver (CVC5) and ask it to find program states under which the candidate PS and INV violate the VC (i.e., a counterexample). If the solver cannot find such a counterexample, full semantic equivalence is established; otherwise, we add the found counterexample as a constraint and repeat the synthesis step.
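The synthesize-then-verify-then-refine loop described above can be sketched as follows. Here a brute-force checker over small integers stands in for the SMT solver (the real verifier reasons symbolically over all program states), and the candidate space is a hypothetical family of linear functions rather than TensIR programs.

```python
# Counterexample-guided synthesis loop (a simplified sketch; the real
# system uses CVC5 over verification conditions, not brute force).

def source_program(x):            # hypothetical input program
    return 2 * x + 2

def verify(candidate, bound=100):
    """Return a counterexample input, or None if none exists up to `bound`."""
    for x in range(-bound, bound):
        if candidate(x) != source_program(x):
            return x
    return None

def synthesize(examples):
    """Return the first linear candidate a*x + b consistent with the examples."""
    for a in range(5):
        for b in range(5):
            if all(a * x + b == y for x, y in examples):
                return lambda x, a=a, b=b: a * x + b
    return None

examples = [(0, source_program(0))]     # start from one bounded observation
while True:
    candidate = synthesize(examples)
    cex = verify(candidate)
    if cex is None:
        break                            # no counterexample: verified equivalent
    # Feed the counterexample back into the synthesis constraints and retry.
    examples.append((cex, source_program(cex)))

assert candidate(17) == source_program(17)
```

The loop terminates as soon as the verifier can no longer distinguish the candidate from the source program, which is the point at which Tenspiler declares the translation verified.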

Code generation

TensIR, as an IR, allows us to keep the program synthesis step tractable and to verify the semantic equivalence between the input and output programs. Given our goal of generating an output program in the target language, though, we implement a rule-based code generator that accepts the TensIR program as input and outputs a program in the target language. Tenspiler accepts input programs written in C/C++/Python and can generate output programs in six different AI-specific DSLs: NumPy, TensorFlow, PyTorch, Apple MLX, TPC-C, and Gemmini.
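Because the semantic heavy lifting is done at the TensIR level, the code generator itself can be a simple syntax-directed mapping. The sketch below shows the idea with per-DSL string templates for two operators; the rule tables and names are illustrative, not Tenspiler's actual rules.

```python
# Hypothetical rule-based code generation: each TensIR-style operator
# maps to a syntax template per target DSL.

RULES = {
    "numpy":   {"matmul": "np.matmul({0}, {1})",    "add": "np.add({0}, {1})"},
    "pytorch": {"matmul": "torch.matmul({0}, {1})", "add": "torch.add({0}, {1})"},
}

def codegen(ir, dsl):
    """Recursively emit target-DSL source text for a TensIR-style tree."""
    if isinstance(ir, str):               # a variable reference
        return ir
    op, *args = ir
    return RULES[dsl][op].format(*(codegen(a, dsl) for a in args))

ir = ("add", ("matmul", "A", "B"), "C")   # IR tree for A @ B + C
assert codegen(ir, "numpy") == "np.add(np.matmul(A, B), C)"
assert codegen(ir, "pytorch") == "torch.add(torch.matmul(A, B), C)"
```

Supporting a new DSL then amounts to adding one more rule table, which is why the rule-based step sits at the end of the pipeline rather than carrying the translation itself.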


Optimizations

Our initial experiments with our system revealed that even with a constrained search space, our solvers were unable to synthesize the target programs in a reasonable time. Consequently, to expedite the search, we implemented four different optimizations with two key objectives: (i) restricting the program states, and (ii) restricting the search space by analyzing the input program to rule out grammar rules that cannot apply to it. Together, these optimizations help Tenspiler keep the synthesis problem tractable. As we will see in our experimental results, Tenspiler solves all of the synthesis problems in under 15 minutes on a MacBook Pro with an Intel Core i5 processor.
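As a flavor of the second objective, the sketch below prunes grammar rules by scanning the source program's text for operator tokens; a rule whose operator cannot occur in the source is dropped before the search begins. This is a deliberate oversimplification of Tenspiler's actual static analysis, included only to show the shape of the idea.

```python
# Hypothetical pruning pass: drop grammar rules whose operator token
# does not appear anywhere in the source program.

GRAMMAR_RULES = {"add": "+", "mul": "*", "sub": "-", "matmul": "@"}

def prune_rules(source_code):
    """Keep only rules whose operator token appears in the source."""
    return {name for name, tok in GRAMMAR_RULES.items() if tok in source_code}

cpp_kernel = """
for (int i = 0; i < n; i++)
    out[i] = a[i] * b[i] + c[i];
"""
kept = prune_rules(cpp_kernel)
assert kept == {"add", "mul"}     # sub and matmul rules are pruned away
```

Cutting even one operator out of the grammar shrinks the candidate set multiplicatively at every depth, which is why such pruning has an outsized effect on synthesis time.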

Experimental evaluation

The aim of our experimental evaluation was to measure: 1) how many of the input programs Tenspiler can correctly translate to the six different AI DSLs within a timeout of 1 hour on a MacBook Pro, 2) how fast it can synthesize/translate those programs to those DSLs, and 3) how well the translated programs perform on the latest AI hardware; in other words, does the translated code show better performance?

Experimental setup

Input programs. We applied Tenspiler to translate 10 real-world C++ benchmarks to the six DSLs mentioned above. These benchmarks represent programs from different domains such as image processing (OpenCV) and AI (the C++ version of CodeLLaMA). In total, these benchmarks comprised 69 C++ programs. We show the benchmarks and their domains below.

Table 1. Benchmarks used for evaluation and their application domains

Hardware. We ran the synthesis and verification steps of all 69 programs on a 2-GHz MacBook Pro containing a 4-core Intel Core i5 processor. To measure the performance of the translated programs, we ran them on different types of devices, ranging from a general-purpose CPU to the latest AI accelerators. The exact combinations of target DSLs and the devices used to execute those DSL programs are shown below.

Table 2. DSLs and the hardware devices used to evaluate generated code in those DSLs


Baseline. The baseline for comparing the performance of the translated programs was the performance of the input C++ programs, compiled with gcc 8.3.0 -O3 and run on an Intel Xeon 8380 CPU.

Synthesis config. During the synthesis phase, we applied all the optimizations mentioned earlier with a 1 hour timeout.

Experimental results

Overall, Tenspiler could translate all of the 69 C++ programs into equivalent DSL programs for different target DSLs.

Synthesis timings

The figure below shows the average synthesis time for the 10 input benchmarks.

Overall, Tenspiler synthesized the correct target DSL programs in under 15 minutes on average.

Fig 3. Tenspiler's average synthesis time for different benchmarks

Performance of the generated target programs

Improved performance is one of the key benefits of migrating code to newer hardware devices. Some of the devices we used for our evaluation are accelerators (called the device), which are typically attached to the host (generally a CPU) via an interface such as PCIe. In a typical execution with an accelerator, the data required by the device is transferred from the host to the device's memory, and the result is transferred back from the device to the host.

Considering this execution model, we measured the performance improvement of the migrated code in two ways:

  1. Kernel execution time: the amount of time taken to execute the migrated code on the accelerator device, excluding any data-copy overhead. In other words, this measurement ignores the time taken to copy the necessary data to and from the device.
  2. End-to-end execution time: the amount of time taken by the complete program end to end. It includes the time taken by the host to copy the necessary data to the accelerator device, the time taken by the accelerator to execute the migrated code, and the time taken to copy the output of the migrated code back to the host.

Note that for the CPU-only setup, we measure both times by treating the CPU as the accelerator device.
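The two measurements can be sketched with a toy host/device model, where list copies stand in for PCIe transfers and a Python loop stands in for the kernel; none of this reflects Tenspiler's actual measurement harness, only the bookkeeping.

```python
import time

# Toy model of the two timings: copies model host<->device transfers,
# the comprehension models the kernel (all hypothetical stand-ins).

def run(a, b):
    t0 = time.perf_counter()
    dev_a, dev_b = list(a), list(b)                   # host -> device copy
    t1 = time.perf_counter()
    dev_out = [x + y for x, y in zip(dev_a, dev_b)]   # kernel execution
    t2 = time.perf_counter()
    out = list(dev_out)                               # device -> host copy
    t3 = time.perf_counter()
    kernel_time = t2 - t1
    end_to_end_time = t3 - t0       # includes both transfer directions
    return out, kernel_time, end_to_end_time

out, kt, e2e = run(range(1000), range(1000))
assert e2e >= kt and out[2] == 4
```

By construction, end-to-end time is never smaller than kernel time, which is why accounting for transfers can only shrink the reported speedups, as the results below show.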

Fig 4. Execution speedup achieved by Tenspiler synthesized kernels on different hardware

As the chart shows, migrating C++ programs to DSLs such as TensorFlow, PyTorch, TPC-C, and NumPy, and running them on AI accelerators such as the NVIDIA V100 and Intel's Gaudi2, or even on a general-purpose Intel Xeon 8380 CPU, delivered considerable performance improvements over the baseline (up to 557X). This improvement highlights the performance benefit offered by these devices and the value of the migration.

Fig 5. Overall performance improvement achieved by Tenspiler-synthesized benchmarks on different hardware

Accounting for the data-transfer overhead of the accelerator devices changes the performance-improvement picture. Specifically, the improvements offered by the accelerators are no longer as dramatic as 557X, but rather more modest, such as up to 15X for TPC-C on Intel's Gaudi2 accelerator. CPU-based devices such as the Intel Xeon 8380 or Apple's M1 Pro, on the other hand, offer better improvements (up to 17X) in this setup, as these devices are not constrained by the memory-bandwidth limitations faced by the accelerator devices.

Conclusion

All in all, the results show that Tenspiler could migrate all 69 C++ programs (belonging to different domains) to different domain-specific languages for tensor operations, and more importantly, it also formally verified that the generated DSL programs are semantically-equivalent to the input C++ programs.

Finally, the performance evaluation of Tenspiler showed that it could migrate the input programs in a reasonably short period of time (less than 15 minutes). More importantly, the migrated programs enjoy considerable performance improvements (from 15X to 557X) when executed on the latest AI accelerators or even commodity CPU platforms.



Next steps

In this article, we presented our work, in collaboration with the University of California, Berkeley, that applies a formal approach (aka a symbolic approach) to the problem of migrating legacy code to newer hardware devices and their programming languages. Over the course of this work, however, we realized that Tenspiler's synthesis time is directly related to the complexity of the input programs, and that even aggressive search-space pruning optimizations might not be enough to keep the synthesis tractable.

In the next article, we will cover an application of recent AI advancements to this problem and analyze if and how AI can assist with code transpilation. AI systems, however, are probabilistic: they do not guarantee the correctness of their output the way our current formal approach does. We will then see whether we can combine AI with a formal-reasoning-based approach (aka a neuro-symbolic approach) to build a practical code transpilation system that addresses Tenspiler's synthesis limitations while still ensuring the correctness of the generated code.


References

[1] Jeff Gray, Kathleen Fisher, Charles Consel, Gabor Karsai, Marjan Mernik, and Juha-Pekka Tolvanen. 2008. DSLs: the good, the bad, and the ugly. In Companion to the 23rd ACM SIGPLAN conference on Object-oriented programming systems languages and applications (OOPSLA Companion '08).

[2] Maaz Bin Safeer Ahmad, Jonathan Ragan-Kelley, Alvin Cheung, and Shoaib Kamil. Automatically translating image processing libraries to Halide. ACM Trans. Graph., 38(6), Nov. 2019. doi:10.1145/3355089.3356549.

[3] José Wesley de Souza Magalhães, Jackson Woodruff, Elizabeth Polgreen, and Michael F. P. O’Boyle. C2TACO: Lifting tensor code to taco. In Proceedings of the 22nd ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, GPCE 2023, page 42–56, New York, NY, USA, 2023. Association for Computing Machinery. doi: 10.1145/3624007.3624053.

[4] Sahil Bhatia, Sumer Kohli, Sanjit A. Seshia, and Alvin Cheung. Building Code Transpilers for Domain-Specific Languages Using Program Synthesis. In Karim Ali and Guido Salvaneschi, editors, 37th European Conference on Object-Oriented Programming (ECOOP 2023), volume 263 of Leibniz International Proceedings in Informatics (LIPIcs), pages 38:1–38:30, Dagstuhl, Germany, 2023. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ECOOP.2023.38, doi:10.4230/LIPIcs.ECOOP.2023.38.

[5] Rajeev Alur, Rastislav Bodik, Garvit Juniwal, Milo M. K. Martin, Mukund Raghothaman, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. Syntax-guided synthesis. In 2013 Formal Methods in Computer-Aided Design, pages 1–8, 2013. doi:10.1109/FMCAD.2013.6679385.

[6] Maaz Bin Safeer Ahmad, Jonathan Ragan-Kelley, Alvin Cheung, and Shoaib Kamil. Automatically translating image processing libraries to Halide. ACM Trans. Graph., 38(6), Nov. 2019. doi:10.1145/3355089.3356549.

[7] L. Susan Blackford, Antoine Petitet, Roldan Pozo, Karin Remington, R. Clint Whaley, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, et al. An updated set of basic linear algebra subprograms (BLAS). ACM Transactions on Mathematical Software, 28(2):135–151, 2002.

[8] Mazen AR Saghir. Application-specific instruction-set architectures for embedded DSP applications. Citeseer, 1998.

[9] Sunbeom So and Hakjoo Oh. Synthesizing imperative programs from examples guided by static analysis. In International Static Analysis Symposium, pages 364–381. Springer, 2017.

[10] Christopher D Rosin. Stepping stones to inductive synthesis of low-level looping programs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2362–2370, 2019.