ETL Cluster
Problem: There is a lot of data in the existing databases. It needs to be ETLed to the ST cloud instance.
Solution: I currently have a Raspberry Pi 3 connected to the NMBG network, running my WDIETL code. The code works as follows (a rough Python sketch follows the steps):
launch the script, providing a directory and the number of point_ids to upload
read a JSON file with a list of point_ids to upload; point_ids marked as “imported” are skipped
connect to the NM_Aquifer DB and extract location info
upload the location/thing to ST if it doesn’t already exist
extract records from the WaterLevelContinuousPressure table
add a datastream, sensor, and observed properties
if the datastream already exists, get how many observations it has
when extracting the records from NM_Aquifer, skip the n records already in the datastream
add the observations
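Roughly, that per-point_id flow looks like the sketch below. This is a hypothetical reconstruction, not the actual WDIETL code: the ST endpoint, the NM_Aquifer connection string, the table/column names, and all helper functions are placeholders invented for illustration.

```python
# Rough sketch of the per-point_id flow above -- not the actual WDIETL code.
# The ST base URL, the NM_Aquifer connection string, the table/column names
# (Location, DateMeasured, DepthToWaterBGS), and all helper names are
# placeholders/assumptions; integer @iot.ids are also assumed.
import json
import re

import pyodbc
import requests

ST_BASE = "https://example.org/FROST-Server/v1.1"   # placeholder ST endpoint
NM_AQUIFER = "DSN=NM_Aquifer"                        # placeholder connection string


def load_point_ids(path):
    """Read the JSON control file, skipping point_ids already marked 'imported'."""
    with open(path) as f:
        items = json.load(f)
    return [i["point_id"] for i in items if i.get("status") != "imported"]


def get_or_create(entity, name, payload):
    """Return the @iot.id of an ST entity with this name, creating it only if missing."""
    r = requests.get(f"{ST_BASE}/{entity}", params={"$filter": f"name eq '{name}'"})
    r.raise_for_status()
    hits = r.json().get("value", [])
    if hits:
        return hits[0]["@iot.id"]
    r = requests.post(f"{ST_BASE}/{entity}", json=payload)
    r.raise_for_status()
    # the new entity's URL (e.g. .../Things(42)) comes back in the Location header
    return int(re.search(r"\((\d+)\)", r.headers["Location"]).group(1))


def observation_count(datastream_id):
    """Ask ST how many observations the datastream already holds."""
    r = requests.get(f"{ST_BASE}/Datastreams({datastream_id})/Observations",
                     params={"$count": "true", "$top": "0"})
    r.raise_for_status()
    return r.json()["@iot.count"]


def etl_point(point_id):
    with pyodbc.connect(NM_AQUIFER) as conn:
        cur = conn.cursor()

        # location info from NM_Aquifer (table/column names assumed)
        cur.execute("SELECT Latitude, Longitude FROM Location WHERE PointID = ?", point_id)
        lat, lon = cur.fetchone()

        # Thing + Location in ST, created only if they don't already exist
        thing_id = get_or_create("Things", point_id, {
            "name": point_id,
            "description": "NM_Aquifer monitoring point",
            "Locations": [{"name": point_id, "description": "well location",
                           "encodingType": "application/geo+json",
                           "location": {"type": "Point", "coordinates": [lon, lat]}}],
        })

        # Datastream with its Sensor and ObservedProperty, attached to the Thing
        ds_name = f"{point_id} continuous pressure"
        ds_id = get_or_create("Datastreams", ds_name, {
            "name": ds_name, "description": "continuous pressure water levels",
            "observationType": "http://www.opengis.net/def/observationType/OGC-OM/2.0/OM_Measurement",
            "unitOfMeasurement": {"name": "foot", "symbol": "ft", "definition": ""},
            "Thing": {"@iot.id": thing_id},
            "Sensor": {"name": "pressure transducer", "description": "",
                       "encodingType": "application/pdf", "metadata": ""},
            "ObservedProperty": {"name": "depth to water", "description": "", "definition": ""},
        })

        # skip the n records already in the datastream, then add the rest
        n = observation_count(ds_id)
        cur.execute("SELECT DateMeasured, DepthToWaterBGS "
                    "FROM WaterLevelContinuousPressure "
                    "WHERE PointID = ? ORDER BY DateMeasured", point_id)
        for i, (stamp, value) in enumerate(cur.fetchall()):
            if i < n:
                continue
            requests.post(f"{ST_BASE}/Observations", json={
                "phenomenonTime": stamp.isoformat(),
                "result": value,
                "Datastream": {"@iot.id": ds_id},
            }).raise_for_status()
```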
Some datastreams have ~15-20K observations, and uploading 15-20K observations takes ~60-90 minutes (roughly 3-6 observations per second).
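At that rate the one-POST-per-observation pattern is presumably the bottleneck. If the ST instance supports the SensorThings dataArray/CreateObservations extension (an assumption about the server that would need to be verified), batching the uploads would look roughly like this:

```python
# Hedged sketch: batch observation upload via the SensorThings dataArray
# extension. Server support for CreateObservations is an assumption.
import requests

ST_BASE = "https://example.org/FROST-Server/v1.1"  # placeholder ST endpoint


def post_observations_batch(datastream_id, rows, batch_size=500):
    """rows: (iso_timestamp, result) pairs already extracted from NM_Aquifer."""
    rows = list(rows)
    for i in range(0, len(rows), batch_size):
        chunk = rows[i:i + batch_size]
        payload = [{
            "Datastream": {"@iot.id": datastream_id},
            "components": ["phenomenonTime", "result"],
            "dataArray@iot.count": len(chunk),
            "dataArray": [[t, v] for t, v in chunk],
        }]
        requests.post(f"{ST_BASE}/CreateObservations", json=payload).raise_for_status()
```

Even 500-observation batches would turn a 15-20K upload from thousands of requests into a few dozen.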
Multiple instances could be set up to run WDIETL. Some task management via Celery has been explored and implemented on a single computer; I have yet to set up multiple computers to run workers.
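For reference, here is a minimal sketch of how the per-point_id work could be farmed out to Celery workers on several machines. The broker URL, module names, and retry policy are assumptions, not the current single-computer setup:

```python
# Hypothetical Celery layout for running WDIETL uploads on multiple machines.
# Broker URL, module names, and retry policy are placeholders.
from celery import Celery

from wdietl_sketch import etl_point, load_point_ids  # hypothetical module holding the sketch above

app = Celery("wdietl", broker="redis://broker-host:6379/0")  # assumed Redis broker


@app.task(bind=True, max_retries=3, default_retry_delay=300)
def upload_point(self, point_id):
    """One task per point_id; each worker machine runs: celery -A wdietl_tasks worker"""
    try:
        etl_point(point_id)
    except Exception as exc:
        # network/DB hiccups are expected on a long-running job -- retry later
        raise self.retry(exc=exc)


if __name__ == "__main__":
    # enqueue every pending point_id; any number of Raspberry Pis (or other
    # boxes) pointed at the same broker will drain the queue
    for pid in load_point_ids("point_ids.json"):
        upload_point.delay(pid)
```

Each additional machine then only needs network access to the broker, NM_Aquifer, and ST.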
The big question is how much effort to put into this system. Is this the best approach for getting data into the database?
How do we manage the ETL process? It's going to be both sporadic and protracted.