djichthys/alpyna

ALPyNA - Automatic Loop Parallelisation in Python for Novel Architectures

ALPyNA is a loop parallelisation framework that applies classical loop parallelisation techniques, as popularised by Allen and Kennedy [5]. It generates and JIT-compiles GPU kernels from linear loop nests written in imperative Python.
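As a rough illustration of the distinction that dependence-based parallelisation relies on, the sketch below contrasts a loop whose iterations are independent with one that carries a dependence between iterations. The function names are hypothetical and plain Python lists stand in for NumPy arrays; this is not ALPyNA code.

```python
def scale(a, b, factor):
    # Each iteration writes b[i] from a[i] only. No iteration reads a value
    # written by another, so the iterations could run in any order.
    for i in range(len(a)):
        b[i] = a[i] * factor


def prefix_sum(a):
    # Iteration i reads a[i - 1], which iteration i - 1 has just written.
    # This loop-carried dependence forces the iterations to run in order.
    for i in range(1, len(a)):
        a[i] = a[i] + a[i - 1]


src = [1, 2, 3, 4]
dst = [0] * len(src)
scale(src, dst, 10)   # dst becomes [10, 20, 30, 40]
prefix_sum(src)       # src becomes [1, 3, 6, 10]
```

Only the first kind of loop is a straightforward candidate for mapping iterations onto independent GPU threads.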

Loop domain sizes are determined for each instance of an executing loop nest using runtime introspection. An analytical cost model [2] then selects the device (CPU or GPU) for which code is generated and JIT-compiled. The cost model requires a very short profiling period at installation time, before the first execution.

ALPyNA has been tested with Python v3.5.


Installation

Prerequisites

ALPyNA has been tested on Ubuntu and Debian Linux with CUDA drivers. It should, however, be able to run wherever the following dependencies are installed.

  1. CUDA (see the NVIDIA installation instructions).
  2. Python virtual environment
  3. NumPy
  4. Numba

Install a virtual environment

  1. Create a new virtual environment with python3 -m venv <venv-root>/alpyna-virt-env, then switch to it with source <venv-root>/alpyna-virt-env/bin/activate.
  2. If required, install the astor Python module: pip install astor
  3. Enable Numba to locate the CUDA library paths. On the author's Debian installation:
    export NUMBAPRO_NVVM=/usr/local/cuda/nvvm/lib64/libnvvm.so 
    export NUMBAPRO_LIBDEVICE=/usr/local/cuda/nvvm/libdevice
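A quick way to confirm that the two variables are visible to Python, and that they point at existing paths, is sketched below. The helper name check_cuda_env is hypothetical; only the variable names come from the export lines above.

```python
import os

# Variable names taken from the export lines above; the values are
# installation-specific.
REQUIRED_VARS = ("NUMBAPRO_NVVM", "NUMBAPRO_LIBDEVICE")


def check_cuda_env(environ=None):
    """Return a dict mapping each variable to 'ok', 'unset' or 'missing path'."""
    environ = os.environ if environ is None else environ
    status = {}
    for name in REQUIRED_VARS:
        value = environ.get(name)
        if not value:
            status[name] = "unset"
        elif not os.path.exists(value):
            status[name] = "missing path"
        else:
            status[name] = "ok"
    return status


if __name__ == "__main__":
    for name, state in check_cuda_env().items():
        print("{}: {}".format(name, state))
```

Run it inside the activated virtual environment, in the same shell where the export lines were executed.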
    

Install time profiling

The ALPyNA cost-model (ACM) requires a short one-time profiling run on each hardware set-up.

Add the NVIDIA GPU hardware characteristics to the hw_param map in src/Hardware/cuda_gpu.py and set gpu_model to the model in your hardware set-up.

gpu_model = 'gtx-1060'
hw_parm = {
    'gtx-1060': {
        "sm": 9,
        "ws": 32,
        "wsched": 4
    },
    'titan-xp': {
        "sm": 30,
        "ws": 32,
        "wsched": 4
    }
}

The parameters are:

  • Number of Streaming Multiprocessors (sm)
  • Warp Size (ws)
  • Number of warp schedulers within each Streaming Multiprocessor (wsched)
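When adding an entry for a new GPU, a small sanity check can catch a missing or mistyped key early. The following is a minimal sketch; validate_gpu_entry is a hypothetical helper and is not part of ALPyNA.

```python
# Keys taken from the parameter list above.
REQUIRED_KEYS = ("sm", "ws", "wsched")


def validate_gpu_entry(entry):
    """Return a list of problems found in one hardware entry (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in entry:
            problems.append("missing key: " + key)
        elif not isinstance(entry[key], int) or entry[key] <= 0:
            problems.append("not a positive integer: " + key)
    return problems


# Values copied from the 'gtx-1060' entry above.
gtx_1060 = {"sm": 9, "ws": 32, "wsched": 4}
print(validate_gpu_entry(gtx_1060))  # []
```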

Known bug: the self.hw line in the constructor of class GPU_Exec_Cost() is currently hardcoded to a GPU model, so change it to match your GPU as well. Fixing this is on the list of things to do.

The hardware parameters for the CPU and the GPU should be passed to the profiling tool in the source file Static_Profile_Setup.py.

if __name__ == '__main__':
    cpu_param = CPU_Param(800, 8192)
    gpu_param = CUDA_GPU_Param(1500, 1536, 9 * 2, 8192)
    _init_profile(cpu_param, gpu_param)

  1. CPU_Param takes two parameters:
    1. CPU single-core maximum frequency
    2. Last-level cache size
  2. CUDA_GPU_Param takes four parameters:
    1. GPU single-core maximum frequency
    2. Last-level cache size
    3. Cache ratio: the number of L1 GPU caches that share the last-level cache within the GPU.
    4. Data transfer bandwidth, in MiB/s. This was calculated offline using nvprof.
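To give a feel for the bandwidth parameter, the sketch below estimates how long a copy would take at a given bandwidth in MiB/s. The helper name is hypothetical, the estimate ignores latency and other overheads, and it assumes the last argument of CUDA_GPU_Param in the example (8192) is the bandwidth, as the parameter order above suggests. It is not how the ACM itself uses the value.

```python
MIB = 1024 * 1024  # bytes in one MiB


def transfer_time_seconds(num_bytes, bandwidth_mib_per_s):
    """Idealised copy time: payload size divided by bandwidth."""
    return num_bytes / (bandwidth_mib_per_s * MIB)


# A 1024 x 1024 array of 8-byte floats occupies 8 MiB.
array_bytes = 1024 * 1024 * 8
# With the 8192 MiB/s figure from the example above:
print(transfer_time_seconds(array_bytes, 8192))  # 0.0009765625
```

At that bandwidth the 8 MiB array takes roughly one millisecond to move, which is the kind of overhead a GPU kernel has to amortise.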

Installation-time profiling is run with the command python3 Static_Profile_Setup.py. This generates a file, .alpyna_profile.json, which is used in all subsequent execution runs of ALPyNA.
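Before running ALPyNA it can be useful to confirm that the profile file exists and parses as JSON. The sketch below does only that. The helper name load_profile is hypothetical, the default path assumes the file sits in the current working directory (its location is not stated here), and no assumption is made about the keys it contains.

```python
import json
import os


def load_profile(path=".alpyna_profile.json"):
    """Return the parsed profile, or None if the file is absent or not valid JSON."""
    if not os.path.isfile(path):
        return None
    try:
        with open(path, mode="r") as fd:
            return json.load(fd)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


if __name__ == "__main__":
    profile = load_profile()
    print("profile found" if profile is not None else "profile missing or unreadable")
```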

Execution

The static_analyse() function (in Static_Analysis_Driver.py) is called once per program invocation. It takes as parameters all the source code intended to be analysed for GPU/CPU code generation and JIT compilation. It statically analyses the code and generates skeletal kernels and in-memory data structures to be used by ALPyNA's runtime.

The static_analyse() function returns a Python module. From this point onwards, the programmer calls each analysed function by name as an attribute of that module. For example, the test harness looks like the following code:

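# flt_util and parloop are ALPyNA modules; their imports are omitted in this excerpt.
def init_test_harness(filename):
    # Read the source code that ALPyNA should analyse.
    with open(filename, mode="r") as fd:
        code = fd.read()

    # Load ALPyNA's options, then run the one-off static analysis.
    _opts = flt_util.ALPyNA_Options()
    _opts.read_config()
    return parloop.static_analyse(code, _opts)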

if __name__ == '__main__':
    alp_obj = init_test_harness("tests.py")
    # e.g. Create numpy arrays arg_1 and arg_2
    alp_obj.my_loopy_func(arg_1, arg_2)
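For context, the file passed to the harness would contain ordinary imperative loop nests. The following is a hypothetical sketch of what a tests.py defining my_loopy_func might look like. The loop body is invented for illustration, nested lists stand in for NumPy arrays so the check runs without dependencies, and which indexing forms ALPyNA accepts is not covered here.

```python
def my_loopy_func(arg_1, arg_2):
    # Element-wise update over a 2-D domain; every (i, j) iteration is
    # independent of the others.
    for i in range(len(arg_1)):
        for j in range(len(arg_1[i])):
            arg_2[i][j] = arg_1[i][j] * 2 + 1


# Plain-Python check of the sequential behaviour.
a = [[0, 1], [2, 3]]
b = [[0, 0], [0, 0]]
my_loopy_func(a, b)
print(b)  # [[1, 3], [5, 7]]
```

Running the function directly like this gives a sequential reference result to compare against whatever the generated code produces.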

To do

  1. Automate probing of hardware characteristics both for installation profile generation and for normal execution.
  2. Add support for OpenCL GPUs.

References

  1. Dejice Jacob. 2020. Opportunistic acceleration of array-centric Python computation in heterogeneous environments. PhD thesis (University of Glasgow), February 16, 2021, UK. doi: 10.5525/gla.thesis.82011.

  2. Dejice Jacob, Phil Trinder, and Jeremy Singer. 2020. Pricing Python Parallelism: a Dynamic Language Cost Model for Heterogeneous Platforms. In Proceedings of the 16th ACM SIGPLAN International Symposium on Dynamic Languages (DLS ’20), November 17, 2020, Virtual, USA, doi: 10.1145/3426422.3426979.

  3. Dejice Jacob, Phil Trinder, and Jeremy Singer. 2019. Python Programmers Have GPUs too: Automatic Python Loop Parallelization with Staged Dependence Analysis. In Proceedings of the 15th ACM SIGPLAN International Symposium on Dynamic Languages (DLS ’19), October 20, 2019, Athens, Greece, 42-54. doi: 10.1145/3359619.3359743.

  4. Dejice Jacob and Jeremy Singer. 2019. ALPyNA: acceleration of loops in Python for novel architectures. In Proceedings of the 6th ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY 2019). ACM, New York, NY, USA, 25-34. doi: 10.1145/3315454.3329956.

  5. Ken Kennedy and John R. Allen. 2001. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach.

License

GPL-3.0 (LICENSE.txt). The repository also contains COPYING.LESSER, for which no license was identified.
