Task-Level Checkpointing System for Task-Based Parallel Workflows

Citation Data: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), ISSN: 1611-3349, Vol: 13835 LNCS, Page: 251-262

Publication Year: 2023

Citations: 2
Usage: 0
Captures: 1
Mentions: 0
Social Media: 0

Metrics Details

Citations: 2
- Citation Indexes: 2
Captures: 1
- Readers: 1

Conference Paper Description

Scientific applications are large and complex; task-based programming models are a popular approach to developing them because they are easy to program, handle complex workflows, and distribute the workload across large infrastructures. In these environments, both hardware and software can fail for a myriad of reasons: application logic, system software, memory, network, or disk. Re-executing a failed application can take hours, days, or even weeks, dragging out the research. This article proposes a recovery system for dynamic task-based models that reduces the re-execution time of failed runs. The design encapsulates the automatic checkpointing of the execution in a checkpointing manager, leveraging different mechanisms that can be arbitrarily defined and tuned to fit the needs of each performance. Additionally, it offers an API call to establish snapshots of the execution from the application code. Experiments on a prototype implementation reached a speedup of 1.9× after re-execution and showed no overhead on the execution time of successful first runs for specific applications.
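
The general idea of task-level checkpointing described above can be illustrated with a minimal sketch. This is not the authors' implementation or API: `CheckpointManager`, its `run` method, the `square` task, and the `workflow` driver are all hypothetical names, and the sketch assumes the simplest possible policy (persist every finished task's result to disk). A re-executed workflow then skips tasks whose results were already saved.

```python
import os
import pickle
import tempfile


class CheckpointManager:
    """Hypothetical task-level checkpoint store (illustrative, not the paper's API)."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, task_id):
        return os.path.join(self.directory, f"{task_id}.ckpt")

    def run(self, task_id, func, *args):
        """Return the saved result for task_id, or execute the task and save it."""
        path = self._path(task_id)
        if os.path.exists(path):
            # The task finished in an earlier run: recover its result.
            with open(path, "rb") as fh:
                return pickle.load(fh)
        result = func(*args)
        # Write to a temporary file, then rename, so a crash during the
        # write does not leave a partial checkpoint under the final name.
        tmp = path + ".tmp"
        with open(tmp, "wb") as fh:
            pickle.dump(result, fh)
        os.replace(tmp, path)
        return result


calls = []  # records which tasks were actually executed


def square(x):
    calls.append(x)
    return x * x


def workflow(manager, fail_at=None):
    """Sum the squares of 0..3, optionally failing before task `fail_at`."""
    total = 0
    for i in range(4):
        if i == fail_at:
            raise RuntimeError("simulated failure")
        total += manager.run(f"square-{i}", square, i)
    return total


ckpt_dir = tempfile.mkdtemp()

# First run fails before task 2; tasks 0 and 1 are already checkpointed.
try:
    workflow(CheckpointManager(ckpt_dir), fail_at=2)
except RuntimeError:
    pass
first_run_calls = list(calls)

# Re-execution: tasks 0 and 1 are recovered, only tasks 2 and 3 are executed.
result = workflow(CheckpointManager(ckpt_dir))
```

In this sketch the second run executes only the two tasks that had not completed, which is the source of the re-execution savings the abstract refers to. A real system would also need the tunable checkpointing policies and the application-level snapshot call the abstract mentions, neither of which is modeled here.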

Bibliographic Details

DOI: 10.1007/978-3-031-31209-0_19

URL ID: http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=85161451529&origin=inward; http://dx.doi.org/10.1007/978-3-031-31209-0_19; https://link.springer.com/10.1007/978-3-031-31209-0_19; https://dx.doi.org/10.1007/978-3-031-31209-0_19; https://link.springer.com/chapter/10.1007/978-3-031-31209-0_19

AUTHOR(S)

Pere Vergés; Francesc Lordan; Jorge Ejarque; Rosa M. Badia

PUBLISHER(S)

Springer Science and Business Media LLC

TAG(S)

Mathematics; Computer Science
