Abstract:
In the last decade, processor technology has advanced enormously with the
invention of the multicore processor, which delivers more processing cycles
at lower power consumption. As a result, most clusters and supercomputers
are built from nodes containing multicore CPUs and low-power specialized
coprocessors such as GPUs and the Xeon Phi. Traditionally, parallel
machines were divided into shared memory machines, where processors
communicate through memory shared between processes, and distributed
memory machines, where processors communicate through message passing
over a network. With clusters of multicore nodes, often equipped with
coprocessors, software developers need to build software that properly
utilizes the underlying resources. This requires combining the shared and
distributed memory programming techniques, an approach known as hybrid
parallelism. In this work we have added hybrid parallelism support to
MPJ Express.
MPJ Express is an implementation of the mpiJava bindings. The previous
release of MPJ Express (v0.38) supported either the pure shared memory
model (multicore mode) or the distributed memory model (cluster mode). We
have added a new communication device, named the hybrid device, which
takes advantage of both the multicore and cluster modes. This new device
allows MPJ Express to exploit hybrid parallelism seamlessly and
transparently to the user. The hybrid device enables both existing and new
applications built on MPJ Express to exploit hybrid parallelism, since it
requires no application rewriting effort. In addition, the cost of the
MPJ Express buffering layer is evaluated and compared with the performance
numbers of other Java MPI libraries.
The performance evaluation reveals that the hybrid communication device,
without any modifications at the application level, helps parallel
applications achieve better speedups and scalability by exploiting
multicore architectures. Moreover, the cost incurred by buffering, and its
impact on overall software performance, is quantified. Competitive
performance is observed: the hybrid device improves application performance
and achieves up to 90% of the theoretical bandwidth available in
point-to-point, collective communication, and application benchmarks.
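
To illustrate this transparency, the following is a minimal sketch of an
MPJ Express program written against the mpiJava bindings; the class name
HelloHybrid is illustrative, and the key point is that the communication
device is selected at launch time rather than in the source code:

    import mpi.MPI;

    public class HelloHybrid {
        public static void main(String[] args) throws Exception {
            // Initialize the MPJ Express runtime with the launch arguments.
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank(); // this process's identifier
            int size = MPI.COMM_WORLD.Size(); // total number of processes
            System.out.println("Hello from process " + rank + " of " + size);
            // Shut down the runtime cleanly.
            MPI.Finalize();
        }
    }

The same compiled class can then be launched under the multicore, cluster,
or hybrid device through the mpjrun script, for example
mpjrun.sh -np 4 -dev hybdev HelloHybrid; the device name shown here is an
assumption for illustration, and intra-node messages would use shared
memory while inter-node messages travel over the network.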