\begin{abstract}
Contemporary GPUs offer the potential of substantial performance improvements over general purpose computational devices -- from an average 2.5X speedup over CPUs on 14 standard ``throughput computing'' benchmarks~\cite{Lee10} to 80X on data parallel workloads~\cite{Che09}. However, the commonly used frameworks for general purpose GPU (GPGPU) programming, OpenCL~\cite{Muns10} and NVIDIA's CUDA~\cite{NvidCU}, require highly specialized low level programming and performance tuning knowledge to achieve these improvements. The desire to bring these performance benefits to a wider audience has motivated a flurry of recent work on GPU acceleration of high level languages. However, these projects typically (1) are intimately tied to a specific source language, (2) expose a constrained or non-idiomatic programming model, and (3) suffer from long GPU compile times that inhibit rapid prototyping.

Our system, Parakeet, provides an intelligent runtime library and JIT compiler for array-oriented subsets of existing high level languages. Parakeet includes the following components: (i) a language-agnostic front end for translating programs from source languages to Parakeet's array-oriented intermediate language (IL); (ii) an IL interpreter that, upon reaching an array operator, can quickly synthesize and execute a data parallel GPU program implementing that operator; and (iii) a library interface to the source language's interpreter, allowing all of the source language's standard features and tools to be used.

We evaluate Parakeet on two standard benchmarks: Black-Scholes option pricing and K-Means clustering. We compare high level array-oriented implementations to hand-written, tuned GPU versions from the CUDA SDK~\cite{NvidSD} and the Rodinia GPU benchmark suite~\cite{Che09}. From source code that is orders of magnitude shorter, Parakeet executes these benchmarks only slightly slower than the hand-optimized code.
\end{abstract}

\section{Introduction}
\label{Intro}

The ubiquity and performance potential of modern GPUs have led to much interest in enabling their use for the execution of general-purpose programs. Unfortunately, the two widely used GPGPU frameworks -- OpenCL \cite{Muns10} and NVIDIA's CUDA \cite{NvidCU} -- require the programmer to have extensive knowledge of low level architectural details in order to fully harness this performance potential. Hence, many recent projects attempt to lower the barrier to entry for GPU programming by allowing the use of high level languages~\cite{Cata11,Chaf11,Chak11,Main10,Sven08,Tard06}.

% Enable programmers to write GPGPU programs in productivity language
GPUs are able to achieve impressive performance because they have been highly specialized for data parallel workloads. Array languages are thus well suited for high level GPU programming due to their idiomatic preference for data parallel array operations over explicit loops (``collection-oriented''~\cite{Sip91} programming). We observe that many of the above projects are structured around the use of data parallel array operators such as map and reduce.

% Heart of Parakeet - an interpreter for our high level IL w/ array ops
In this paper, we present Parakeet, an intelligent runtime for executing high level array programs on GPUs.
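To make this style concrete, the following NumPy fragment (our own illustrative sketch, not code drawn from Parakeet or its benchmarks) contrasts an explicit loop with the equivalent whole-array formulation built from map-like elementwise operations and a reduction -- the kind of program a data parallel runtime such as Parakeet is designed to accelerate:
\begin{verbatim}
import numpy as np

# Explicit loop: one element at a time.
def dist_loop(xs, ys):
    total = 0.0
    for i in range(len(xs)):
        total += (xs[i] - ys[i]) ** 2
    return total ** 0.5

# Array-oriented: elementwise (map-like)
# operations plus a reduction (sum); this
# data parallel style is what a GPU
# backend can target directly.
def dist_array(xs, ys):
    return np.sqrt(np.sum((xs - ys) ** 2))
\end{verbatim}
Both functions compute the Euclidean distance between two vectors; only the second exposes its parallelism through array operators rather than hiding it inside a sequential loop.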
Parakeet is neither a new programming language nor is it tied to any specific language. The Parakeet library is designed to accelerate the array-oriented constructs of existing dynamic languages. A key design goal is to harness the elegance and parallelizability of functional array operators while sacrificing as little programmer convenience as possible. Specifically, Parakeet supports the use of array language semantics such as scalar promotion (see the short sketch at the end of this section), as well as the restricted use of mutable state.

% Q front-end, agnosticism
We have implemented our first Parakeet front end for Q \cite{Borr08}, a descendant of APL that is widely used in financial computing. Q is particularly amenable to GPU acceleration by Parakeet because its use of array operators is so extensive -- its idiomatic style makes sparser use of loops than even Matlab \cite{Moler80} or Python's NumPy extensions \cite{dubo96}. We are also developing a front end for NumPy.

% Bullet-point, drive-home contributions
The main contributions of this paper are the following:
\begin{itemize}
\item A detailed analysis of the constraints imposed by graphics hardware and their implications for the implementation of high level languages.
\item A demonstration that fully dynamic GPU compilation of high level languages can be realized with minimal overhead, and a discussion of the opportunities this opens up.
\item An intermediate language that is sufficiently expressive to capture significant subsets of high level languages, yet restricted to constructs that allow for compilation to efficient GPU programs.
\item A system in which programmers can write complex algorithms in existing high level languages and have them automatically parallelized into efficient GPU programs.
\end{itemize}

\begin{figure*}[t!bh]
\begin{center}
\leavevmode
\includegraphics[scale=0.6, trim=10pt 180pt 10pt 120pt]{Pipeline.pdf}
\end{center}
\caption{Overview of Parakeet}
\label{fig:overview}
\end{figure*}
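As an illustration of the array language semantics mentioned above, the following NumPy fragment is a minimal sketch of scalar promotion (again our own example, not code from Parakeet): a scalar operand is implicitly promoted to match the shape of an array operand, so the operation applies elementwise without an explicit loop.
\begin{verbatim}
import numpy as np

prices = np.array([10.0, 12.5, 9.75])

# Scalar promotion: the scalar 1.08 is
# implicitly broadcast to the shape of
# `prices`, giving an elementwise product.
with_tax = prices * 1.08

# Promotion also applies in comparisons:
# `prices > 10.0` is an elementwise mask.
discounted = np.where(prices > 10.0,
                      prices * 0.9, prices)
\end{verbatim}
Parakeet aims to preserve exactly this kind of idiomatic usage while mapping it onto data parallel GPU code.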