High Performance Computing and Data Infrastructures
2nd year of course - First semester
Attendance not mandatory
- 6 CFU
- 48 hours
- English
- Trieste
- Compulsory
- Standard teaching
- Oral Exam
- SSD INF/01
- Advanced concepts and skills
(knowledge and understanding): The class aims to deepen students’ knowledge of storage, management, and access to HPC and data infrastructures. Concepts related to HPC, data centres, large storage capacity, Big Data, and data interoperability in the scientific research scenario will be presented.
(applying knowledge and understanding) The topics discussed in the class will be applied to specific domains. Access to and hands-on experience with the ORFEO data center will allow students to apply the concepts and tools presented to manage an HPC and data infrastructure. Moreover, using tools such as UML, XML/XSD, and persistent identifiers, together with concepts such as the FAIR principles, students will design an application to manage, curate, and access data.
(making judgments) The concept of data-intensive science and the associated tools provided to the students will guide them in integrating their own domain data resources into a shared data-publishing scenario.
(communication skills) In a scientific environment where data are distributed, large, generated by various projects, and require HPC resources to be processed and analysed, the student will be able to choose among data structures and access and management alternatives, justifying her/his technological choices.
(learning skills) The lessons will be given in an interdisciplinary context. The student will autonomously apply the learned concepts to her/his own specific research domain.
No specific prerequisite is required, but it is strongly recommended to take the HPC and Cloud Computing course beforehand. Programming and the usage/management of database systems are the foundations on which the class content is built; hence, a basic understanding of these topics will be helpful.
This course is an introduction to computational infrastructures for high-performance computing and data management. It is a follow-up of the HPC and Cloud Computing course, revisiting some of the concepts discussed there from the infrastructure point of view.
It will give details on how to plan, install, and manage computational infrastructures, both for providing HPC resources and for data infrastructures; the two are very often strictly interrelated.
The course will also present tools and methods to provide access to such infrastructures and to guarantee data resource interoperability.
No specific book will be followed. Some reference books are hereafter listed. Further details will be provided by means of notes and references to auxiliary materials.
- "High Performance Computing in Clouds: Moving HPC Applications to a Scalable and Cost-Effective Environment", Edson Borin, Lúcia Maria A. Drummond, Jean-Luc Gaudiot et al. (Eds.), 1st ed., 2023
- "UML Database Modeling Workbook", Michael Blaha
- "Python and HDF5 - Unlocking Scientific Data", Andrew Collette
- "Reference Model for an Open Archival Information System (OAIS)" recommendation by Consultative Committee for Space Data Systems (CCSDS)
- Lectures and examples will be provided to students through a web-accessible solution (yet to be defined).
The course is structured in two main parts:
Part 1: HPC and data infrastructure elements (16 hours)
Topics discussed include:
- Definition and recap of HPC concepts and data infrastructures
- HPC/data infrastructure hardware and software components
- Challenges in managing HPC/data infrastructures
- HPC/data infrastructures on the cloud
Part 2: Methods and tools for data management and interoperability on FAIR data infrastructures (32 hours)
This part will consist of an introduction to Big Data, Open Data, and FAIR principles applied to data infrastructures. There will be two main blocks: data and metadata models and structures, and data resource interoperability and access.
Data and Metadata Models and Structures will discuss data models, their definition, and their design. The Unified Modeling Language (UML) will be used to model the data, and object-relational mapping (ORM) will map the data to relational database schemas. Various data structure formats will be presented, including common tabular formats and hierarchical structures for images. These structures will be shown applied in common standards such as CSV, XSD, JSON, and HDF5, and these file formats will be used in various practical examples of metadata query-ability.
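As a toy illustration of the tabular vs. hierarchical distinction and of metadata query-ability, the sketch below (standard-library Python only; all field names and values are invented for the example) serializes the same records as CSV and as JSON and runs a simple metadata query:

```python
# Hypothetical dataset records, serialized both as tabular CSV and as a
# hierarchical JSON document, then queried by a metadata field.
# Field names ("dataset", "instrument", "size_gb") are illustrative only.
import csv
import io
import json

records = [
    {"dataset": "sky_survey_01", "instrument": "telescope_A", "size_gb": 120},
    {"dataset": "sky_survey_02", "instrument": "telescope_B", "size_gb": 340},
]

# Tabular form: CSV with a header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["dataset", "instrument", "size_gb"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# Hierarchical form: the same records nested inside a JSON document.
json_text = json.dumps({"datasets": records}, indent=2)

# Metadata query-ability: select datasets larger than 200 GB.
big = [r["dataset"] for r in json.loads(json_text)["datasets"]
       if r["size_gb"] > 200]
print(big)  # prints ['sky_survey_02']
```

The same filter over the CSV form would require parsing the header and casting `size_gb` back to a number, which is one reason richer, typed containers such as HDF5 are preferred for large scientific data.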
Data Resource Interoperability and Access will discuss interoperability, starting with the role and types of persistent identifiers. In addition, the data curation lessons will cover data preservation concepts and the usage of standardized vocabularies, ontologies, and semantics. To conclude the section, a practical use of metadata models for discovery and multimodal research will be shown.
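The role persistent identifiers and controlled vocabularies play in discovery can be sketched in a few lines; the DOI-like strings, catalogue entries, and the tiny vocabulary below are purely hypothetical:

```python
# A minimal discovery sketch over invented metadata records. Each record
# carries a persistent identifier (a DOI-style string, illustrative only)
# and keywords drawn from a small controlled vocabulary.
VOCABULARY = {"astronomy", "genomics", "climate"}

catalog = [
    {"pid": "doi:10.1234/example.001", "title": "Galaxy catalogue",
     "keywords": ["astronomy"]},
    {"pid": "doi:10.1234/example.002", "title": "Ocean temperatures",
     "keywords": ["climate"]},
]

def discover(keyword):
    """Return the PIDs of records tagged with a controlled-vocabulary keyword."""
    if keyword not in VOCABULARY:
        raise ValueError(f"'{keyword}' is not in the controlled vocabulary")
    return [r["pid"] for r in catalog if keyword in r["keywords"]]

print(discover("climate"))  # prints ['doi:10.1234/example.002']
```

Returning PIDs rather than titles or file paths is the point: the identifier stays resolvable even when the underlying data moves, which is what makes federated discovery workable.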
In this part of the course, participants will learn about various cloud storage tools and platforms for adequate data storage, backup, and management. Participants will start by understanding classical storage systems, comparing the most common ones and seeing where they can be deployed, with their advantages and disadvantages. Then, the notion of distributed storage systems, which are key to any cloud platform, will be introduced. Ceph and MinIO, as popular open-source cloud storage platforms, will be discussed in detail. Furthermore, notions of cloud computing, such as containers and container orchestration with simple tools, will be introduced, since all the solutions above rely on such technologies. Finally, backup strategies will be covered.
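One building block behind such backup strategies, checksum-based integrity verification, can be sketched as follows; the object keys and data are invented, and the flat key namespace merely mimics an object-store bucket:

```python
# A minimal sketch of checksum-based backup verification. Object stores
# such as Ceph or MinIO expose a digest per object, so a backup can be
# verified by comparing digests instead of re-reading full objects.
# Keys and payloads below are illustrative only.
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 digest of an object's bytes."""
    return hashlib.sha256(data).hexdigest()

# "Primary" copy: flat key -> object bytes, as in a bucket.
primary = {
    "results/run1.dat": b"temperature,12.5\n",
    "results/run2.dat": b"temperature,13.1\n",
}

# Backup copy, with one object silently corrupted.
backup = dict(primary)
backup["results/run2.dat"] = b"temperature,99.9\n"

# Verification pass: flag every object whose digests disagree.
corrupted = [key for key in primary
             if checksum(primary[key]) != checksum(backup[key])]
print(corrupted)  # prints ['results/run2.dat']
```

In a real deployment the digests would come from the storage platform itself rather than being recomputed client-side, but the comparison logic is the same.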
The course will be delivered through a combination of lectures, hands-on exercises, and case studies. A significant portion of the hours will be dedicated to hands-on sessions. Some external seminars on specific topics and case studies could also be given.
Lecture notes/viewgraphs, software, and services used during the lessons will usually be provided through a web-accessible git repository. Non-attending students are kindly requested to contact the lecturers to agree on how to prepare for and take the final test.
Knowledge verification will consist of preparing and presenting an HPC and data management project covering the content of the lectures.
The exam will be evaluated according to the following criteria:
- completeness of the project with respect to all the course contents
- critical thinking on the pros and cons of the data management solutions adopted in the project, highlighting criticalities, problems, and peculiarities of the analysed use case
- degree of understanding of the theoretical and practical aspects of the subject
- clarity of the exposition of the project
If feasible, the preparation and presentation of a partial or full demo of the project is encouraged, but not mandatory. At presentation time there will be a discussion with Q&A.
This course explores topics closely related to one or more goals of the United Nations 2030 Agenda for Sustainable Development (SDGs).