BUILDING DATA PIPELINES USING PYTHON: A HANDS-ON JOURNEY IN WEB SCRAPING, ANALYSIS, AND VISUALIZATION
1 Pace University (UNITED STATES)
2 College of Technology, CUNY (UNITED STATES)
About this paper:
Appears in: ICERI2025 Proceedings
Publication year: 2025
Pages: 5133-5139
ISBN: 978-84-09-78706-7
ISSN: 2340-1095
doi: 10.21125/iceri.2025.1445
Conference name: 18th annual International Conference of Education, Research and Innovation
Dates: 10-12 November, 2025
Location: Seville, Spain
Abstract:
In today’s data-centric world, the ability to extract, clean, analyze, and visualize real-world data is foundational in data analytics. As part of a hybrid-format academic course that alternated weeks of in-person and online instruction, we completed two major hands-on projects that developed practical skills in web scraping, data preprocessing, and visualization using Python. The hybrid learning environment encouraged technical experimentation, peer collaboration, and iterative learning.
The first project, Basketball Data Scraping and Visualization, focused on sourcing historical NBA game statistics (2020–2024) from Basketball-Reference.com. Using the Python libraries requests, Beautiful Soup, pandas, and matplotlib, we extracted structured HTML tables, cleaned the data into usable DataFrames, and created over 25 visualizations. These included bar charts, scatter plots, and line graphs that illustrated trends in team performance, scoring averages, seasonal fluctuations, and win-loss ratios across franchises. We refined the visual output through advanced styling, layout adjustments, and labeling techniques to ensure clarity and readability.
The second project, Hacker News Web Scraper and Community Insights, addressed a platform with inconsistent HTML structure and minimal markup. We built a scalable multi-page scraper that extracts post titles, ranks, vote counts, and comment counts. This required handling common scraping challenges, including HTTP 403 errors, bot detection, and irregular DOM elements, which we mitigated through adaptive scraping logic, dynamic headers, and error-handling routines. We visualized the extracted data with both seaborn and matplotlib, presenting key engagement metrics through histograms, box plots, pie charts, and time-series graphs. Seaborn’s high-level syntax facilitated rapid visualization, while matplotlib offered granular control for polishing presentation elements.
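The mitigation for HTTP 403 responses mentioned above (dynamic headers plus error handling) can be sketched roughly as follows. This is a minimal illustration and not the scraper built in the course: the function name `fetch_with_retries`, the `USER_AGENTS` strings, and the retry and backoff values are all assumptions. The HTTP call is passed in as `get`, so the sketch runs without network access; `requests.get` has a compatible shape.

```python
import time
from itertools import cycle

# Hypothetical User-Agent strings, for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) pipeline-demo/0.1",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) pipeline-demo/0.1",
]


def fetch_with_retries(url, get, max_attempts=3, backoff=1.0, sleep=time.sleep):
    """Return the first response whose status is not 403, or raise.

    `get` is any callable shaped like `requests.get(url, headers=...)`
    that returns an object with a `status_code` attribute.
    """
    agents = cycle(USER_AGENTS)
    for attempt in range(1, max_attempts + 1):
        # Send a different User-Agent header on each attempt.
        headers = {"User-Agent": next(agents)}
        response = get(url, headers=headers)
        if response.status_code != 403:
            return response
        if attempt < max_attempts:
            # Exponential backoff between attempts: backoff, 2*backoff, ...
            sleep(backoff * 2 ** (attempt - 1))
    raise RuntimeError(f"{url} still returned 403 after {max_attempts} attempts")
```

With the requests library installed, a call would look like `fetch_with_retries(url, requests.get)`. Rotating headers does not defeat every form of bot detection, and a site's terms of use and robots.txt should be checked before scraping.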
Both projects followed a unified pipeline: data acquisition, transformation, analysis, and presentation. We gained experience not only in technical scripting but also in designing scalable workflows and interpreting complex datasets. The contrast between the two projects, structured sports data versus a dynamic social platform, challenged us to adapt tools and strategies while preserving analytical rigor. The hybrid course structure further enhanced our learning by pairing independent development in online settings with collaborative troubleshooting during live sessions. This dual approach simulated real-world, distributed team environments, reinforcing self-directed learning and group problem-solving. Ultimately, these projects helped solidify our proficiency with essential data science tools and practices. Working extensively with Python libraries such as Beautiful Soup, pandas, requests, matplotlib, and seaborn deepened our understanding of the end-to-end data pipeline. More importantly, the experience strengthened our analytical thinking, error-handling strategies, and visual storytelling skills. These outcomes form a strong foundation for future work in areas such as real-time scraping, dashboard development, and applied machine learning.
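The four pipeline stages named above (acquisition, transformation, analysis, presentation) can be pictured as four small functions chained together. The sketch below is an assumption-laden stand-in, not the authors' code: to stay dependency-free it uses the standard library's `html.parser` where the projects used Beautiful Soup, `statistics.mean` where they used pandas, and a plain-text report where they used matplotlib and seaborn. The table in `SAMPLE_HTML` is made up and contains no real game data.

```python
from html.parser import HTMLParser
from statistics import mean

# Made-up sample table; not real game data.
SAMPLE_HTML = """
<table>
  <tr><th>team</th><th>points</th></tr>
  <tr><td>A</td><td>110</td></tr>
  <tr><td>B</td><td>98</td></tr>
  <tr><td>A</td><td>104</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags) is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def acquire():
    """Acquisition: a real run would download the page here."""
    return SAMPLE_HTML


def transform(html):
    """Transformation: turn the table into typed records."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    records = [dict(zip(header, row)) for row in body]
    for record in records:
        record["points"] = int(record["points"])
    return records


def analyze(records):
    """Analysis: average points per team."""
    by_team = {}
    for record in records:
        by_team.setdefault(record["team"], []).append(record["points"])
    return {team: mean(points) for team, points in by_team.items()}


def present(averages):
    """Presentation: a text report standing in for a chart."""
    return "\n".join(f"{team}: {avg:.1f}" for team, avg in sorted(averages.items()))


report = present(analyze(transform(acquire())))
print(report)
```

Keeping each stage as a separate function means the acquisition step can be swapped (a different site, a cached file) without touching the analysis or presentation code, which is one way the same pipeline shape could serve both projects.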
Keywords:
Web Scraping, Beautiful Soup, Python, Basketball-Reference, Hacker News, Data Visualization, Pandas, Matplotlib, Seaborn, HTML Parsing, Sports Analytics, Technology Trends, Community Engagement, Data Mining, Real-Time Analytics, Hybrid Learning, Automation, Exploratory Data Analysis.