J. Zaldumbide1, M. San Andrés2

1Escuela Politécnica Nacional (ECUADOR)
2Universidad Tecnológica Israel (ECUADOR)
Scientific and technological research produces a large volume and variety of data, which has been growing exponentially and has become increasingly difficult to analyze and manage. Universities need to organize these data to extract information and knowledge. For instance, a large amount of data concerning scientific research is periodically generated by universities as the main evidence of their activities. Moreover, universities need to measure their scientific research output to improve their position in rankings and, consequently, their prestige and international recognition. For this reason, different methods are used to evaluate research processes. For example, the impact factor is the main indicator used to evaluate research, but it is not effective because it does not reflect the research context and depends on the subject area and document type. Another technique, developed by Shieh, must be installed locally and is not a real-time solution. Given this situation, this article presents a data lake architecture that supports reporting and the creation of near-real-time indicators using open source tools. This new approach is a highly scalable architecture that handles large amounts of data in terms of volume, variety, and velocity. The architecture can ingest around 6000 JSON fields in less than one hour and register them in a NoSQL database. As an example of the information processed, it was possible to show the types of documents published by Ecuadorian researchers. In addition, the solution can collect data from other repositories, and any university can deploy and implement it.
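The ingestion step described above, flattening JSON publication records into fields before registering them in a NoSQL store and deriving indicators such as document-type counts, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `flatten` helper, the sample records, and the field names (`type`, `affiliation.country`) are all hypothetical stand-ins for real repository metadata.

```python
from collections import Counter

def flatten(record, prefix=""):
    """Flatten a nested JSON-like dict into dot-separated field paths,
    as a NoSQL document store might index them."""
    fields = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the field path.
            fields.update(flatten(value, path + "."))
        else:
            fields[path] = value
    return fields

# Hypothetical publication records, loosely modeled on repository metadata.
records = [
    {"title": "A", "type": "Article", "affiliation": {"country": "Ecuador"}},
    {"title": "B", "type": "Conference Paper", "affiliation": {"country": "Ecuador"}},
    {"title": "C", "type": "Article", "affiliation": {"country": "Ecuador"}},
]

# Flatten every record, then tally document types as a sample indicator.
flat = [flatten(r) for r in records]
by_type = Counter(f["type"] for f in flat)
print(by_type)  # Counter({'Article': 2, 'Conference Paper': 1})
```

In a full deployment the flattened fields would be inserted into the NoSQL database rather than held in memory, and the records would be fetched periodically from scientific repositories to keep the indicators near real time.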