Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication

Published in ICPP'23: 52nd International Conference on Parallel Processing, 2023

Writing large amounts of data concurrently to stable storage is a typical I/O pattern of many HPC workflows. This pattern introduces high I/O overheads and results in increased storage space utilization, especially for workflows that need to capture the evolution of data structures with high frequency as checkpoints. In this context, many applications, such as graph pattern matching, perform sparse updates to large data structures between checkpoints. For these applications, incremental checkpointing techniques that save only the differences from one checkpoint to another can dramatically reduce the checkpoint sizes, I/O bottlenecks, and storage space utilization. However, such techniques are not without challenges: it is non-trivial to transparently determine what data has changed since a previous checkpoint and assemble the differences in a compact fashion that does not result in excessive metadata. State-of-the-art data reduction techniques (e.g., compression and de-duplication) have significant limitations when applied to modern HPC applications that leverage GPUs: they are slow at detecting the differences, generate a large amount of metadata to keep track of the differences, and ignore crucial spatiotemporal checkpoint data redundancy. This paper addresses these challenges by proposing a Merkle tree-based incremental checkpointing method that exploits GPUs' high memory bandwidth and massive parallelism. Experimental results at scale show a significant reduction of the I/O overhead and space utilization of checkpointing compared with state-of-the-art incremental checkpointing and compression techniques.
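To make the idea of Merkle tree-based change detection concrete, the following is a minimal CPU-only sketch, not the paper's GPU implementation. It assumes fixed-size chunks, SHA-256 hashes, and two buffers of equal length; the chunk size and the names `build_tree`, `changed_chunks`, and `incremental_checkpoint` are hypothetical and chosen only for illustration.

```python
import hashlib

CHUNK = 4  # hypothetical chunk size in bytes, kept tiny for illustration


def _h(data):
    return hashlib.sha256(data).digest()


def build_tree(buf, chunk=CHUNK):
    """Return the tree as a list of levels: leaves first, root level last."""
    level = [_h(buf[i:i + chunk]) for i in range(0, len(buf), chunk)] or [_h(b"")]
    levels = [level]
    while len(level) > 1:
        # Each parent hashes the concatenation of its (up to two) children.
        level = [_h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def changed_chunks(old, new):
    """Find changed leaf indices by descending only into differing subtrees."""
    if len(old[0]) != len(new[0]):
        raise ValueError("this sketch assumes buffers of equal length")
    frontier = [0]  # start at the root
    for depth in range(len(new) - 1, 0, -1):
        nxt = []
        for idx in frontier:
            if old[depth][idx] != new[depth][idx]:
                for child in (2 * idx, 2 * idx + 1):
                    if child < len(new[depth - 1]):
                        nxt.append(child)
        frontier = nxt
    return [i for i in frontier if old[0][i] != new[0][i]]


def incremental_checkpoint(old_buf, new_buf, chunk=CHUNK):
    """Map each changed chunk index to its new contents."""
    diff = changed_chunks(build_tree(old_buf, chunk), build_tree(new_buf, chunk))
    return {i: bytes(new_buf[i * chunk:(i + 1) * chunk]) for i in diff}
```

With a sparse update, unchanged subtrees are skipped as soon as their parent hashes match, so the stored difference contains only the touched chunks plus their indices.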

Recommended citation: Tan, N., Luettgau, J., Marquez, J., Teranishi, K., Morales, N., Bhowmick, S., ... & Nicolae, B. (2023, August). Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication. In ICPP'23: 52nd International Conference on Parallel Processing.

Runtime Steering of Molecular Dynamics Simulations Through In Situ Analysis and Annotation of Collective Variables

Published in PASC '23: Proceedings of the Platform for Advanced Scientific Computing Conference, 2023

This paper targets one of the most common simulations on petascale and, very likely, on exascale machines: molecular dynamics (MD) simulations studying the (classical) time evolution of a molecular system at atomic resolution. Specifically, this work addresses the data challenges of MD simulations at exascale through (1) the creation of a data analysis method based on a suite of advanced collective variables (CVs) selected for annotation of structural molecular properties and capturing rare conformational events at runtime, (2) the definition of an in situ framework to automatically identify the frames where the rare events occur during an MD simulation, and (3) the integration of both method and framework into two MD workflows for the study of early termination or termination and restart of a benchmark molecular system for protein folding, the Fs peptide system (Ace-A_5(AAARA)_3A-NME), using Summit. The approach achieves faster exploration of the conformational space compared to extensive ensemble simulations. Specifically, our in situ framework with early termination alone achieves 99.6% coverage of the reference conformational space for the Fs peptide with just 60% of the MD steps otherwise used for a traditional execution of the MD simulation. Annotation-based restart allows us to cover 94.6% of the conformational space while running just 50% of the overall MD steps.

Recommended citation: Caino-Lores, S., Cuendet, M., Marquez, J., Kots, E., Estrada, T., Deelman, E., ... & Taufer, M. (2023, June). Runtime Steering of Molecular Dynamics Simulations Through In Situ Analysis and Annotation of Collective Variables. In Proceedings of the Platform for Advanced Scientific Computing Conference (pp. 1-11).

Computational and Communication Infrastructure Challenges for Resilient Cloud Services

Published in Computers, 2022

Fault tolerance and the availability of applications, computing infrastructure, and communications systems during unexpected events are critical in cloud environments. The microservices architecture, and the technologies that it uses, should be able to maintain acceptable service levels in the face of adverse circumstances. In this paper, we discuss the challenges faced by cloud infrastructure in relation to providing resilience to applications. Based on this analysis, we present our approach for a software platform based on a microservices architecture, as well as the resilience mechanisms to mitigate the impact of infrastructure failures on the availability of applications. We demonstrate the capacity of our platform to provide resilience to analytics applications, minimizing service interruptions and keeping acceptable response times.

Recommended citation: Martinez, H. F., Mondragon, O. H., Rubio, H. A., & Marquez, J. (2022). Computational and Communication Infrastructure Challenges for Resilient Cloud Services. Computers, 11(8), 118.

An Intelligent Approach to Resource Allocation on Heterogeneous Cloud Infrastructures

Published in Applied Sciences, 2021

Cloud computing systems are rapidly evolving toward multicloud architectures supported on heterogeneous hardware. Cloud service providers are widely offering different types of storage infrastructures and multi-NUMA architecture servers. Existing cloud resource allocation solutions do not comprehensively consider this heterogeneous infrastructure. In this study, we present a novel approach comprised of a hierarchical framework based on genetic programming to solve problems related to data placement and virtual machine allocation for analytics applications running on heterogeneous hardware with a variety of storage types and nonuniform memory access. Our approach optimizes data placement using the Hadoop File System on heterogeneous storage devices on multicloud systems. It guarantees the efficient allocation of virtual machines on physical machines with multiple NUMA (nonuniform memory access) domains by minimizing contention between workloads. We prove that our solutions for data placement and virtual machine allocation outperform other state-of-the-art approaches.

Recommended citation: Marquez, J.; Mondragon, O.H.; Gonzalez, J.D. An Intelligent Approach to Resource Allocation on Heterogeneous Cloud Infrastructures. Appl. Sci. 2021, 11, 9940.

Performance comparison: Virtual machines and containers running artificial intelligence applications

Published in International Conference on Information Technology & Systems, 2021

With the continuous growth of data that can be valuable for companies and scientific research, cloud computing has emerged as a technology that can serve the many applications that need the right level of computing power and ubiquitous access to it. Cloud computing is built on virtualization, which has evolved to provide users with features from which they can benefit. There are different types of virtualization, and each has its own way of carrying out some processes and of managing computational resources. In this paper, we compare the performance of virtual machines and containers, specifically an OpenStack instance against Docker and Singularity containers. The application used to measure performance is a real artificial intelligence application. We present the obtained results and discuss them.

Recommended citation: Marquez J.D., Castillo M. (2021) Performance Comparison: Virtual Machines and Containers Running Artificial Intelligence Applications. In: Rocha Á., Ferrás C., López-López P.C., Guarda T. (eds) Information Technology and Systems. ICITS 2021. Advances in Intelligent Systems and Computing, vol 1330. Springer

Heterogeneity-aware data placement in Hybrid Clouds

Published in International Conference on Cloud Computing, 2019

In next-generation cloud computing clusters, the performance of data-intensive applications will be limited, among other factors, by disk data transfer rates. To mitigate these performance impacts, cloud systems offering hierarchical storage architectures are becoming commonplace. The Hadoop Distributed File System (HDFS) offers a collection of storage policies that exploit different storage types such as RAM_DISK, SSD, HDD, and ARCHIVE. However, developing algorithms that leverage heterogeneous storage through efficient data placement has been challenging. This work presents an intelligent algorithm based on genetic programming that finds the optimal mapping of input datasets to storage types on a Hadoop file system.
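As a rough illustration of searching for a dataset-to-storage-type mapping, the sketch below uses a plain genetic algorithm with elitism, which is a simpler stand-in for the genetic programming approach of the paper and does not guarantee an optimal mapping. Only the four storage type names come from the abstract; the bandwidth and capacity figures, the cost model, and the names `cost` and `evolve` are assumptions made for this example.

```python
import random

# Hypothetical figures (bandwidth in MB/s, capacity in GB); not taken from the paper.
STORAGE = {
    "RAM_DISK": {"bw": 5000.0, "cap": 8},
    "SSD": {"bw": 500.0, "cap": 64},
    "HDD": {"bw": 150.0, "cap": 1024},
    "ARCHIVE": {"bw": 50.0, "cap": 4096},
}
TYPES = list(STORAGE)


def cost(mapping, sizes_gb, reads):
    """Assumed cost: total read time in seconds plus a penalty for overfull tiers."""
    used = {t: 0.0 for t in TYPES}
    total = 0.0
    for name, tier in mapping.items():
        used[tier] += sizes_gb[name]
        total += reads[name] * sizes_gb[name] * 1024 / STORAGE[tier]["bw"]
    overflow = sum(max(0.0, used[t] - STORAGE[t]["cap"]) for t in TYPES)
    return total + 1e6 * overflow


def evolve(sizes_gb, reads, generations=50, pop_size=20, seed=0):
    """Evolve mappings {dataset: storage type}; the best half survives each round."""
    rng = random.Random(seed)
    names = list(sizes_gb)

    def fitness(mapping):
        return cost(mapping, sizes_gb, reads)

    # Seed the population with an all-HDD baseline plus random mappings.
    pop = [{n: "HDD" for n in names}]
    pop += [{n: rng.choice(TYPES) for n in names} for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Uniform crossover followed by a single-gene mutation.
            child = {n: rng.choice((a[n], b[n])) for n in names}
            child[rng.choice(names)] = rng.choice(TYPES)
            children.append(child)
        pop = survivors + children
    return min(pop, key=fitness)
```

Because survivors are carried over unchanged and the baseline is in the initial population, the returned mapping is never worse than placing everything on HDD under this assumed cost model.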

Recommended citation: Marquez J.D., Gonzalez J.D., Mondragon O.H. (2019) Heterogeneity-Aware Data Placement in Hybrid Clouds. In: Da Silva D., Wang Q., Zhang LJ. (eds) Cloud Computing – CLOUD 2019. CLOUD 2019. Lecture Notes in Computer Science, vol 11513. Springer

IoT in education: Integration of objects with virtual academic communities

Published in New advances in information systems and technologies, 2016

The Internet of Things (IoT) is a new concept that allows objects to be connected to the Internet. This connectivity allows the emergence of new forms of interaction between objects and people. In educational environments, the IoT could be applied to improve teaching and learning experiences. This paper proposes a new architecture for integrating objects available in educational environments with virtual academic communities (VAC). This new architecture is based on the paradigm of layered architectures and architectural styles such as REST. The proposed architecture consists of four layers: hardware/communications, messaging, services, and application. The proposed architecture was tested through the implementation of a case study focused on practical classes of a typical digital electronics course.

Recommended citation: Marquez, J., Villanueva, J., Garcia, A. and Solarte, Z., 2016. IoT in Education: Integration of Objects with Virtual Academic Communities. New Advances in Information Systems and Technologies vol. 1, 444, pp. 201-212.

Architecture for integrating real objects with virtual academic communities

Published in 2015 Fifth International Conference on e-Learning (econf), 2015

Recommended citation: V. J. A. Villanueva, F. J. D. Marquez, A. Z. M. Solarte and A. G. Dávalos, "Architecture for Integrating Real Objects with Virtual Academic Communities," 2015 Fifth International Conference on e-Learning (econf), Manama, Bahrain, 2015, pp. 385-391. doi: 10.1109/ECONF.2015.74