POSSIBILITIES OF DISTRIBUTED SYSTEMS IN GRID AND CLOUD
Constantine the Philosopher University in Nitra (SLOVAKIA)
About this paper:
Conference name: 9th International Conference on Education and New Learning Technologies
Dates: 3-5 July, 2017
Location: Barcelona, Spain
Abstract:
Parallel data processing, parallel computing, cloud computing and grid systems are very popular terms nowadays. Most users are familiar with public cloud systems, mainly with Software as a Service offerings such as Dropbox and Gmail. To inspect the principles of cloud and grid systems more deeply, we need a platform that allows us to do so. The situation is even harder when we want to teach students the basic principles of grid and cloud systems, because many systems developed for training purposes are already outdated and thus not suitable for use. We deal with the possibilities of training and teaching grid and cloud systems, together with the basic principles of parallel and distributed computing, in close relation to the research of our department. This research is mainly focused on data mining and web log mining. We describe the current possibilities and options for teaching grid and cloud computing. Grid and cloud computing begin with parallel computing. There are many languages and programming interfaces for parallel computing, but it seems that the easiest way for students is to use the Message Passing Interface (MPI) as implemented by the MPICH libraries and middleware. Our experience shows that, for demonstration purposes and for understanding the principle of inter-process communication, executing a simple parallel program in a well-known environment is the best option. After the student understands this principle, we can move to a lower level and try to write simple parallel code. We can choose from many programming languages but, again, our best experience is with Python. The underlying architecture for building the MPI environment is a set of Raspberry Pi minicomputers connected to a computer network. With this setup, we can quickly create and customize the environment for each student and start to build a Beowulf cluster. The student quickly understands what a cluster is and how to set it up.
We use Raspberry Pi computers because they allow a trial-and-error approach: we do not need to reinstall or reconfigure the whole computer system in the classroom; we simply swap the SD card and start over. The next step in building knowledge about distributed systems is to use and understand the job scheduling used in grid systems. We describe our experience with, and methodology of, using the HTCondor job scheduler. Another step in teaching and learning distributed systems is to build a larger cluster using the Apache Hadoop implementation in a regular classroom. We use Raspberry Pi computers in this setup again. As we have several groups of students, we need a clean install every time we start with a new group. Students must cooperate with each other to create a three-node cluster. There are not many code examples showing how to use Hadoop; most users can find only the well-known word count example. As our department's research is focused on data mining and web log mining, we deal with processing large amounts of text log files. These log files come from different sources, and the first stage in analyzing them is preprocessing and data cleaning. Students see that learning distributed systems is not only theoretical, but that they can engage with the research carried out at the department. The overall aim of this paper is to summarize the current state of the art in the domain of grid and cloud systems and to share our experience in interconnecting the teaching of grid and cloud systems with research in data mining.
Keywords:
Distributed systems, Apache Hadoop, Data mining.
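
The abstract names log preprocessing and data cleaning as the first stage of web log mining, and notes that Hadoop examples rarely go beyond word counting. The following is a minimal sketch of how such a cleaning step might be phrased as map, shuffle and reduce phases. It is not the authors' code and does not use Hadoop itself: it emulates the three phases in plain Python on a single machine. The Apache Common Log Format, the cleaning rules, the made-up sample records and all function names are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Assumption: input lines follow the Apache Common Log Format.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical cleaning rule: requests for static files are treated as noise.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


def map_line(line):
    """Map phase: emit (path, 1) for a clean page request, nothing otherwise."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return []  # malformed record is dropped
    if match.group("method") != "GET" or match.group("status") != "200":
        return []  # keep only successful page views
    # Strip the query string and normalise case before filtering.
    path = match.group("path").split("?", 1)[0].lower()
    if path.endswith(STATIC_SUFFIXES):
        return []
    return [(path, 1)]


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce phase: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_page_hits(lines):
    """Run the three phases in sequence over an iterable of log lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return reduce_counts(shuffle(pairs))


# Made-up sample records, for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [03/Jul/2017:10:00:00 +0200] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [03/Jul/2017:10:00:01 +0200] "GET /style.css HTTP/1.1" 200 512',
    '10.0.0.1 - - [03/Jul/2017:10:00:02 +0200] "GET /index.html?ref=a HTTP/1.1" 200 1024',
    '10.0.0.3 - - [03/Jul/2017:10:00:03 +0200] "GET /missing HTTP/1.1" 404 0',
    'this line is not a valid log record',
    '10.0.0.4 - - [03/Jul/2017:10:00:04 +0200] "GET /courses HTTP/1.1" 200 2048',
]

if __name__ == "__main__":
    print(count_page_hits(SAMPLE_LOG))
```

On a real cluster the map and reduce functions would run on different nodes and the framework would perform the shuffle; keeping the phases as separate functions here is meant to make that later step easier to see.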