DATAPOLISH: THE DEVELOPMENT OF A CLOUD-BASED SERVICE TO SUPPORT THE CLEANING OF DATASETS
Technological University of Dublin (IRELAND)
About this paper:
Appears in: EDULEARN24 Proceedings
Publication year: 2024
Pages: 2824-2829
ISBN: 978-84-09-62938-1
ISSN: 2340-1117
doi: 10.21125/edulearn.2024.0762
Conference name: 16th International Conference on Education and New Learning Technologies
Dates: 1-3 July, 2024
Location: Palma, Spain
Abstract:
As part of a Master’s programme at one of Ireland’s largest universities, students may either complete an individual dissertation project or work in a team of 4-6 students to develop a large, industry-ready software system that addresses a social need. In Semester 1 of the 2023-2024 academic year, a team of five students developed DataPolish, a cloud-based data-cleaning web application intended to improve the quality and dependability of data. As defined by the DAMA-DMBOK, data quality is characterised by metrics such as accuracy, completeness, reliability, relevance, and consistency (Cupoli et al., 2014). The challenge is to allow users to improve data quality in a robust and user-friendly manner.

The system was designed after investigating existing products and identifying specific target users, namely data science students and working professionals. The key features of the system include the following:
- Uploading data to the system.
- Navigating to the data preview page to verify that the data was uploaded correctly.
- Performing a pivot operation on the data and exporting the result from the preview page.
- Identifying the column with the highest number of missing values.
- Investigating potential correlations between numeric variables within the dataset.
- Assessing the overall quality of the data.
- Identifying and removing duplicate entries.
- Cleaning the data based on insights obtained from the data dashboard.
- Exporting the final, cleaned data.
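
Several of the checks listed above (missing values per column, duplicate removal, correlation between numeric columns, and an overall quality score) can be illustrated with a minimal sketch in Python using only the standard library. This is not the DataPolish implementation: the function names, the set of strings treated as missing, and the use of cell completeness as a stand-in for overall quality are all assumptions made for illustration.

```python
import csv
import io
import math

# Assumed markers for a missing cell; a real tool would likely make this configurable.
MISSING = {"", "na", "n/a", "null", "none"}


def load_rows(csv_text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def is_missing(value):
    """Treat None and the assumed marker strings as missing."""
    return value is None or value.strip().lower() in MISSING


def missing_counts(rows):
    """Count missing cells per column."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (1 if is_missing(value) else 0)
    return counts


def column_with_most_missing(rows):
    """Return the column name with the highest number of missing cells."""
    counts = missing_counts(rows)
    return max(counts, key=counts.get)


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def completeness(rows):
    """Fraction of non-missing cells (hypothetical proxy for overall quality)."""
    total = sum(len(row) for row in rows)
    missing = sum(missing_counts(rows).values())
    return 1.0 if total == 0 else (total - missing) / total


def pearson(rows, col_x, col_y):
    """Pearson correlation over rows where both columns are present.

    Assumes both columns are numeric, have at least two complete pairs,
    and are not constant.
    """
    pairs = [(float(r[col_x]), float(r[col_y])) for r in rows
             if not is_missing(r[col_x]) and not is_missing(r[col_y])]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)
```

A typical sequence under these assumptions would be to call `load_rows` on the uploaded file's contents, inspect `column_with_most_missing` and `completeness`, apply `drop_duplicates`, and then write the remaining rows back out with `csv.DictWriter`.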

A range of evaluation approaches was undertaken, including usability testing, accessibility testing, a cognitive walkthrough, the think-aloud protocol, and expert feedback. The feedback indicated that the system delivers a data-cleaning tool that meets the needs of data science students and professionals, and that it is a user-centric product that drives strategic improvements addressing user demands and future trends in data science.
Keywords:
Data Science, Data Analysis, Automated Processing.