June 22–26, 2014
Leipzig, Germany

Presentation Details

Name: Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences
Time: Wednesday, June 25, 2014
12:00 pm - 12:30 pm
Room:   Hall 3
CCL - Congress Center Leipzig
Speaker:   Hari Subramoni, Ohio State University
Abstract:   The Dynamic Connected (DC) InfiniBand transport protocol has recently been introduced by Mellanox to address several shortcomings of the older Reliable Connection (RC), eXtended Reliable Connection (XRC), and Unreliable Datagram (UD) transport protocols. DC aims to support all of the features provided by RC (such as RDMA, atomics, and hardware reliability) while allowing processes to communicate with any remote process with just one DC queue pair (QP), as UD does. In this paper, we present the salient features of the new DC protocol, including its connection and communication models. We design new verbs-level collective benchmarks to study the behavior of the new DC transport and understand the performance/memory trade-offs it presents. We then use this knowledge to propose multiple designs for MPI over DC. We evaluate an implementation of our design in the MVAPICH2 MPI library using standard MPI benchmarks and applications. To the best of our knowledge, this is the first such design of an MPI library over the new DC transport. Our experimental results at the microbenchmark level show that the DC-based design in MVAPICH2 delivers 42% and 43% improvement in latency for large-message All-to-one exchanges over XRC and RC, respectively. DC-based designs also give 20% and 8% improvement for small-message One-to-all exchanges over XRC and RC, respectively. For the All-to-all communication pattern, DC delivers performance comparable to RC/XRC while outperforming them in memory consumption. At the application level, for NAMD on 620 processes, the DC-based designs in MVAPICH2 outperform designs based on RC, XRC, and UD by 22%, 10%, and 13%, respectively, in execution time. With DL-POLY, DC outperforms RC and XRC by 75% and 30%, respectively, in total completion time while delivering performance similar to UD.
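
The memory side of the trade-off mentioned in the abstract can be illustrated with a rough count of queue pairs per process in a fully connected job. The sketch below is a simplified model, not the paper's method: it assumes RC needs one QP per remote process, XRC needs one send QP per remote node, and DC needs a small fixed number of initiator QPs. The function names and the figure of 20 processes per node are hypothetical; only the process count of 620 comes from the abstract.

```python
def rc_qps_per_process(n_procs):
    """RC (assumed model): one connected QP for every other process."""
    return n_procs - 1


def xrc_qps_per_process(n_procs, procs_per_node):
    """XRC (assumed model): one send QP for every remote node."""
    n_nodes = -(-n_procs // procs_per_node)  # ceiling division
    return n_nodes - 1


def dc_qps_per_process(n_initiators=1):
    """DC (assumed model): a fixed number of initiator QPs, independent of job size."""
    return n_initiators


if __name__ == "__main__":
    n_procs = 620         # process count of the NAMD run cited in the abstract
    procs_per_node = 20   # hypothetical value, chosen only for illustration
    print("RC :", rc_qps_per_process(n_procs))
    print("XRC:", xrc_qps_per_process(n_procs, procs_per_node))
    print("DC :", dc_qps_per_process())
```

Under these assumptions the RC count grows linearly with the number of processes and the XRC count with the number of nodes, while the DC count stays constant. Real QP usage also depends on how many connections an MPI library opens on demand.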

Hari Subramoni, Khaled Hamidouche, Akshay Venkatesh, Sourav Chakraborty & Dhabaleswar Panda, Ohio State University