ALPyNA is a loop parallelisation framework that applies classical loop parallelisation techniques as popularised by Allen and Kennedy [5]. ALPyNA generates and JIT compiles GPU kernels from linear loop nests written in imperative Python.
Loop domain sizes are determined for each instance of the executing loop nest by using runtime introspection. An analytical cost model [2] is used to determine the optimal device (CPU or GPU) to generate and JIT compile code for. The cost model requires a very short profiling period at installation time before the first execution.
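As an illustration of the kind of input described above, here is a minimal sketch of a loop nest written in imperative Python over NumPy arrays. The function name scale_rows and its arguments are hypothetical and are not taken from ALPyNA's sources. The point is only that the loop bounds come from the array shape, so the loop domain is known when the function is called rather than when it is written.

```python
import numpy as np

def scale_rows(a, b, out):
    """Hypothetical loop nest: out[i, j] = a[i, j] * b[j]."""
    # The loop domain (rows x cols) is read from the argument at call time,
    # so it can differ on every invocation of the loop nest.
    rows, cols = a.shape
    for i in range(rows):
        for j in range(cols):
            # No iteration reads a value written by another iteration,
            # so the iterations are independent of each other.
            out[i, j] = a[i, j] * b[j]
    return out

a = np.arange(6, dtype=np.float64).reshape(2, 3)
b = np.array([1.0, 2.0, 3.0])
out = scale_rows(a, b, np.zeros_like(a))
```

Calling the same function with arrays of a different shape changes the iteration space without changing the source code.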
ALPyNA has been tested with Python v3.5.
ALPyNA has been tested on Ubuntu and Debian Linux, with CUDA drivers. However, it should be able to run on any system with the following dependencies installed.
- CUDA (see NVIDIA installation instructions).
- Python virtual-environment
- Numpy
- Numba
- Create a new virtual environment.

      python3 -m venv <venv-root>/alpyna-virt-env

  and switch to the virtual environment.

      source <venv-root>/alpyna-virt-env/bin/activate

- If required, install the astor Python module.

      pip install astor

- Enable Numba to locate CUDA library paths. (On my Debian installation:)

      export NUMBAPRO_NVVM=/usr/local/cuda/nvvm/lib64/libnvvm.so
      export NUMBAPRO_LIBDEVICE=/usr/local/cuda/nvvm/libdevice
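
Before running ALPyNA it can help to confirm that the two variables are set in the current shell and point at paths that exist. The following is a small sketch using only the standard library. The helper name check_cuda_env is hypothetical, and it checks only that the paths exist, not that Numba can actually load the libraries.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("NUMBAPRO_NVVM", "NUMBAPRO_LIBDEVICE")

def check_cuda_env(environ=os.environ):
    """Return {variable: status}; status is 'ok', 'unset' or 'missing path'."""
    status = {}
    for name in REQUIRED_VARS:
        path = environ.get(name)
        if not path:
            status[name] = "unset"
        elif not os.path.exists(path):
            status[name] = "missing path"
        else:
            status[name] = "ok"
    return status

if __name__ == "__main__":
    for name, state in check_cuda_env().items():
        print(f"{name}: {state}")
```

Any result other than ok for either variable suggests re-checking the export lines for the local CUDA installation.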
The ALPyNA cost-model (ACM) requires a short one-time profiling run on each hardware set-up.
Add the NVIDIA GPU hardware characteristics to the hw_param map in src/Hardware/cuda_gpu.py and set gpu_model to the model in your hardware setup.

    gpu_model = 'gtx-1060'
    hw_parm = {
        'gtx-1060': {
            "sm": 9,
            "ws": 32,
            "wsched": 4
        },
        'titan-xp': {
            "sm": 30,
            "ws": 32,
            "wsched": 4
        }
    }

The variables are:
- Number of Streaming Multiprocessors (sm)
- Warp Size (ws)
- Number of warp schedulers within each Streaming Multiprocessor (wsched)
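
When adding an entry for a new GPU, a quick check that all three keys are present and hold positive integers can catch typos before the profiling run. This is a sketch only: validate_hw_entry is a hypothetical helper, not part of ALPyNA, and it assumes that each entry needs exactly the three keys listed above.

```python
REQUIRED_KEYS = ("sm", "ws", "wsched")

def validate_hw_entry(entry):
    """Return a list of problems found in one GPU entry (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = entry.get(key)
        if value is None:
            problems.append(f"missing key: {key}")
        elif not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            problems.append(f"{key} must be a positive integer, got {value!r}")
    return problems

# Entry copied from the map above.
gtx_1060 = {"sm": 9, "ws": 32, "wsched": 4}
# Deliberately incomplete entry, to show what the check reports.
incomplete = {"sm": 30, "ws": 32}

print(validate_hw_entry(gtx_1060))
print(validate_hw_entry(incomplete))
```
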
Bug: Also change the line that sets self.hw in the constructor of class GPU_Exec_Cost(), which is currently hardcoded to the model of the GPU. Fixing this is on the list of things to do.
The hardware parameters for the CPU and the GPU should be passed into the profiling tool within the
source Static_Profile_Setup.py.

    if __name__ == '__main__':
        cpu_param = CPU_Param(800, 8192)
        gpu_param = CUDA_GPU_Param(1500, 1536, 9 * 2, 8192)
        _init_profile(cpu_param, gpu_param)

- CPU_Param uses two parameters:
  - CPU single core maximum frequency
  - Last level cache size
- CUDA_GPU_Param uses four parameters:
  - GPU single core maximum frequency
  - Last level cache size
  - Cache ratio: the number of L1 GPU caches that share the last level cache within the GPU.
  - Data transfer bandwidth, given in units of MiBps. This was calculated offline using nvprof.
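
The data transfer bandwidth parameter is expressed in MiBps. As a sketch of the unit conversion, assuming the size and duration of a host-to-device copy have been read off a profiler report, the value can be computed as below. The helper name bandwidth_mibps and the measurement figures are illustrative and do not come from the original profiling.

```python
MIB = 1024 * 1024  # bytes in one mebibyte

def bandwidth_mibps(bytes_transferred, seconds):
    """Convert a measured transfer (bytes, seconds) to MiB per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_transferred / MIB / seconds

# Hypothetical measurement: 256 MiB copied in 0.03125 s.
example = bandwidth_mibps(256 * MIB, 0.03125)
print(example)  # 8192.0
```
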
Installation-time profiling is run with the command python3 Static_Profile_Setup.py.
This generates a file, .alpyna_profile.json, which is used in all subsequent execution runs of ALPyNA.
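
To confirm that the profiling step produced a usable file, the profile can be opened and parsed as JSON. The sketch below makes no assumption about the keys inside the file. The helper name load_profile is hypothetical, and the location of the file is passed in explicitly because this document does not state which directory the file is written to.

```python
import json
from pathlib import Path

def load_profile(path):
    """Return the parsed profile, or None if the file does not exist."""
    profile_path = Path(path)
    if not profile_path.is_file():
        return None
    with profile_path.open(mode="r") as fd:
        return json.load(fd)

if __name__ == "__main__":
    # Assumption: the file sits in the current working directory.
    profile = load_profile(".alpyna_profile.json")
    print("profile found" if profile is not None else "profile missing")
```
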
The static_analyse() function (in Static_Analysis_Driver.py) is called once per program invocation. It takes as parameters all the source code intended to be analysed for GPU/CPU code generation and JIT compilation. It statically analyses the code and generates skeletal kernels and in-memory data structures to be used by ALPyNA's runtime.
The static_analyse() function returns a Python module. From this point onwards, the programmer can look up any required function on that module by name. For example, the test harness looks like the following code:

    def init_test_harness(filename):
        with open(filename, mode="r") as fd:
            _opts = flt_util.ALPyNA_Options()
            _opts.read_config()
            code = fd.read()
            _as_mod = parloop.static_analyse(code, _opts)
            return _as_mod
        return None

    if __name__ == '__main__':
        alp_obj = init_test_harness("tests.py")
        # e.g. Create numpy arrays arg_1 and arg_2
        alp_obj.my_loopy_func(arg_1, arg_2)

- Automate probing of hardware characteristics both for installation profile generation and for normal execution.
- Add support for OpenCL GPUs.

1. Dejice Jacob. 2020. Opportunistic acceleration of array-centric Python computation in heterogeneous environments. PhD thesis (University of Glasgow), February 16, 2021, UK. doi: 10.5525/gla.thesis.82011
2. Dejice Jacob, Phil Trinder, and Jeremy Singer. 2020. Pricing Python Parallelism: a Dynamic Language Cost Model for Heterogeneous Platforms. In Proceedings of the 16th ACM SIGPLAN International Symposium on Dynamic Languages (DLS ’20), November 17, 2020, Virtual, USA. doi: 10.1145/3426422.3426979
3. Dejice Jacob, Phil Trinder, and Jeremy Singer. 2019. Python Programmers Have GPUs too: Automatic Python Loop Parallelization with Staged Dependence Analysis. In Proceedings of the 15th ACM SIGPLAN International Symposium on Dynamic Languages (DLS ’19), October 20, 2019, Athens, Greece, 42-54. doi: 10.1145/3359619.3359743
4. Dejice Jacob and Jeremy Singer. 2019. ALPyNA: acceleration of loops in Python for novel architectures. In Proceedings of the 6th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY 2019). ACM, New York, NY, USA, 25-34. doi: 10.1145/3315454.3329956
5. Ken Kennedy and John R. Allen. 2001. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach.