Implementation of the Web Scraping as Extract-Transform-Load (ETL) Module in the Data Warehouse Simulator
Abstract
Data and information are the main assets of an organization. When data grows rapidly, the demand for information grows with it. Organizations must anticipate this growth, because managing it properly can increase their profits. Storing such data, however, requires an architecture different from that of a conventional database: the Data Warehouse. One of the important Data Warehouse components is the Extract-Transform-Load (ETL) module, whose purpose is to collect, filter, process, and combine relevant data from various sources for storage in the data warehouse. In this paper, we propose a simple ETL model that uses web scraping for data retrieval. Web scraping is a technique for collecting data from web sites. Its reliability in collecting data and its accuracy in sorting data make it a suitable model for the ETL process. Some adjustments are still needed to obtain the desired data, in particular accurate selection of HTML elements and knowledge of the exact location of the desired data within a page.
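The idea of web scraping as an ETL module can be illustrated with a minimal sketch. It is not the implementation described in this paper: the sample HTML, the table layout, the `sales` table, and the function names `extract`, `transform`, and `load` are all assumptions made for illustration, and only the Python standard library is used. The page content is embedded as a string so that the example runs without network access.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page content; a real run would download this from a web site.
SAMPLE_HTML = """
<table id="sales">
  <tr><th>Product</th><th>Units</th></tr>
  <tr><td> Widget </td><td>1,200</td></tr>
  <tr><td>Gadget</td><td>n/a</td></tr>
  <tr><td>Gizmo</td><td>35</td></tr>
</table>
"""


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def extract(html):
    """Extract: locate the desired HTML elements and return raw cell text."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def transform(rows):
    """Transform: trim whitespace, parse numbers, drop rows that do not parse."""
    clean = []
    for row in rows:
        if len(row) != 2:
            continue
        name = row[0].strip()
        units = row[1].strip().replace(",", "")
        if name and units.isdigit():
            clean.append((name, int(units)))
    return clean


def load(records, conn):
    """Load: store the cleaned records in a (stand-in) warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (product TEXT, units INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SAMPLE_HTML)), conn)
print(conn.execute("SELECT product, units FROM sales ORDER BY product").fetchall())
# prints [('Gizmo', 35), ('Widget', 1200)]
```

The row with `n/a` is discarded in the transform step, which reflects the point made above: the quality of the loaded data depends on selecting the right HTML elements and on filtering values that do not match the expected form.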