TRANSFORMING A LARGE YELP DATA SET: TECHNIQUES FOR DATA PREPROCESSING, ANALYTICS, AND VISUALIZATION
Marist College (UNITED STATES)
About this paper:
Appears in: INTED2020 Proceedings
Publication year: 2020
Pages: 1239-1244
ISBN: 978-84-09-17939-8
ISSN: 2340-1079
doi: 10.21125/inted.2020.0423
Conference name: 14th International Technology, Education and Development Conference
Dates: 2-4 March, 2020
Location: Valencia, Spain
Abstract:
Data mining and analytics are thriving fields that seek to extract meaning from the immense collections of data generated by daily online activity. This collaborative research experience, involving undergraduate university students and a faculty mentor, focuses on data acquisition, preprocessing, and cleansing, as well as analytical and visualization techniques, using a large data set acquired as part of the Yelp Dataset Challenge. Yelp is a directory and forum for crowd-sourced reviews of businesses. The data set, obtained from Yelp as collections of JavaScript Object Notation (JSON) objects, includes files on businesses, check-ins, reviews, tips, and users. In total, the collection included information on 85,901 businesses, 61,049 check-ins, 2,685,067 reviews, 648,902 tips, and 686,556 users.

The data files were first converted into comma-separated values (CSV) files using a Python script and then imported into Microsoft Excel to view and sort the records and fields. During the preprocessing phase, the data were reformatted to resolve inconsistencies across the data sets, for example by removing null values and deleting extraneous data, which yielded a clean data set for subsequent evaluation and analysis.
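
The conversion and null-removal steps described above can be illustrated with a minimal Python sketch. It is not the script used in this work. It assumes that each input line holds one JSON object, and the function name `json_lines_to_csv`, the field names, and the sample records are all hypothetical.

```python
import csv
import io
import json


def json_lines_to_csv(json_lines, csv_file, fields):
    """Write one CSV row per JSON object and return the number of rows kept.

    Assumption: every non-blank input line is a single JSON object.
    """
    # extrasaction="ignore" drops keys that are not listed in `fields`,
    # which is one simple way to delete extraneous data.
    writer = csv.DictWriter(csv_file, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    kept = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Skip records in which a requested field is missing or null.
        if any(record.get(field) is None for field in fields):
            continue
        writer.writerow(record)
        kept += 1
    return kept


# Hypothetical sample records; these are not taken from the Yelp data set.
sample = [
    '{"business_id": "b1", "name": "Cafe A", "stars": 4.5, "extra": "x"}',
    '{"business_id": "b2", "name": null, "stars": 3.0}',
    '{"business_id": "b3", "name": "Diner C", "stars": 2.0}',
]
out = io.StringIO()
kept = json_lines_to_csv(sample, out, ["business_id", "name", "stars"])
print(kept)
print(out.getvalue())
```

Passing an open file in place of `io.StringIO()` would write the rows to disk. Whether to discard a record with a null field or to fill in a default value is a design choice that depends on the analysis; this sketch simply discards it.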

The data were modeled using an entity-relationship diagram (ERD) and were analyzed statistically in R, using RStudio, and visually in Tableau. RStudio, an open-source integrated development environment for the R programming language, was used to discover correlations and evaluate regression equations for selected variables in the data set. The data were also imported into Tableau, an interactive software suite for data visualization, which was used to produce intuitive and aesthetically appealing charts, graphs, and dashboards of the variables. Together, the RStudio and Tableau results reveal trends embedded in the data.
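
The correlation and regression analyses mentioned above were carried out in R. As a rough illustration of the same two statistics, the following Python sketch computes a Pearson correlation coefficient and a least-squares regression line. The function names `pearson_r` and `fit_line` are hypothetical, and the input values are made-up numbers, not results from the Yelp data.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative values; they are not drawn from the Yelp data set.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
r = pearson_r(xs, ys)
slope, intercept = fit_line(xs, ys)
print(r, slope, intercept)
```

Both functions assume that the inputs have the same length and that neither variable is constant; a constant input would cause a division by zero.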
Keywords:
Data preprocessing, data analytics, data visualization, undergraduate research experience.