ETL Cluster

Problem: There is a large amount of data in the existing databases that needs to be ETLed to the ST cloud instance.


Solution: I currently have a Raspberry Pi 3 connected to the NMBG network running my WDIETL code. The code works as follows (rough Python sketches of the main pieces follow the list):

  1. Launch the script, providing a directory and the number of point_ids to upload.

  2. Read a JSON file with a list of point_ids to upload; point_ids marked as “imported” are skipped.

  3. Connect to the NM_Aquifer DB and extract the location info.

  4. Upload the Location/Thing to ST if it doesn’t already exist.

  5. Extract the records from the WaterLevelContinuousPressure table.

  6. Add a Datastream, a Sensor, and the ObservedProperties.

    1. If the Datastream already exists, get how many observations it has.

    2. When extracting the records from NM_Aquifer, skip the n records already in the Datastream.

  7. Add the Observations.
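
A rough sketch of the driver portion (steps 1-4). The ST endpoint URL, the point_ids.json filename and its layout, and the entity fields are illustrative assumptions, not the actual WDIETL code; the NM_Aquifer connection itself (step 3, e.g. via pyodbc) is shown in the next sketch.

```python
import json
import sys

import requests

# Placeholder ST endpoint; the real instance URL is configured elsewhere.
ST_URL = "https://st.example.org/FROST-Server/v1.1"


def load_pending_point_ids(directory, limit):
    """Step 2: read the bookkeeping JSON and skip point_ids already marked 'imported'."""
    with open(f"{directory}/point_ids.json") as f:
        points = json.load(f)  # assumed shape: {"<point_id>": "imported" | "pending", ...}
    return [pid for pid, status in points.items() if status != "imported"][:limit]


def get_or_create_thing(session, point_id, lat, lon):
    """Step 4: upload the Location/Thing to ST only if it doesn't already exist."""
    resp = session.get(f"{ST_URL}/Things", params={"$filter": f"name eq '{point_id}'"})
    existing = resp.json()["value"]
    if existing:
        return existing[0]["@iot.id"]

    payload = {
        "name": point_id,
        "description": "NM_Aquifer monitoring point",
        "Locations": [{
            "name": point_id,
            "description": "Well location",
            "encodingType": "application/geo+json",
            "location": {"type": "Point", "coordinates": [lon, lat]},
        }],
    }
    resp = session.post(f"{ST_URL}/Things", json=payload)
    # SensorThings returns the new entity's URL in the Location header, e.g. .../Things(42)
    return int(resp.headers["Location"].split("(")[-1].rstrip(")"))


if __name__ == "__main__":
    directory, limit = sys.argv[1], int(sys.argv[2])  # step 1: directory + number of point_ids
    point_ids = load_pending_point_ids(directory, limit)
```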
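And a sketch of the resume logic (step 6) and the observation upload (step 7), continuing from the sketch above (same ST_URL and requests session). The column names and the pyodbc-style cursor are assumptions; the key idea is to ask ST how many observations the Datastream already holds and skip that many NM_Aquifer rows.

```python
def existing_observation_count(session, datastream_id):
    """Step 6.1: ask ST how many observations the Datastream already has."""
    resp = session.get(
        f"{ST_URL}/Datastreams({datastream_id})/Observations",
        params={"$count": "true", "$top": 1},
    )
    return resp.json()["@iot.count"]


def upload_observations(session, cursor, point_id, datastream_id):
    """Steps 5-7: pull pressure records from NM_Aquifer and post only the new ones."""
    n_existing = existing_observation_count(session, datastream_id)

    # Column names are assumed; the real WaterLevelContinuousPressure schema may differ.
    cursor.execute(
        "SELECT DateMeasured, DepthToWaterBGS "
        "FROM WaterLevelContinuousPressure WHERE PointID = ? ORDER BY DateMeasured",
        point_id,
    )
    for i, row in enumerate(cursor):
        if i < n_existing:
            continue  # step 6.2: this record is already in the Datastream
        session.post(
            f"{ST_URL}/Observations",
            json={
                "phenomenonTime": row.DateMeasured.isoformat(),
                "result": float(row.DepthToWaterBGS),
                "Datastream": {"@iot.id": datastream_id},
            },
        )
```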

Some datastreams have ~15-20K observations, and uploading that many takes ~60-90 minutes (roughly 3-6 observations per second).
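
That rate is consistent with one HTTP request per observation. If the ST instance implements the optional SensorThings Data Array extension (an assumption about the server, and not part of the current WDIETL code), the observations in step 7 could instead be posted in batches, which may cut the upload time substantially. A sketch, using the same ST_URL and session as above:

```python
def post_observation_batch(session, datastream_id, rows, batch_size=1000):
    """Post (timestamp, value) rows in chunks via the Data Array extension."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        body = [{
            "Datastream": {"@iot.id": datastream_id},
            "components": ["phenomenonTime", "result"],
            "dataArray": [[t.isoformat(), value] for t, value in chunk],
        }]
        session.post(f"{ST_URL}/CreateObservations", json=body)
```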


Multiple instances could be set up to run WDIETL. Some task management via Celery has been explored and implemented on a single computer, but I have yet to set up multiple computers to run workers.
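
For reference, a minimal sketch of what the distributed setup could look like, assuming a Redis broker and a `process_point_id` entry point in WDIETL (both assumptions): any machine pointed at the same broker can run a worker and share the queue of point_ids.

```python
from celery import Celery

from wdietl import process_point_id  # hypothetical import; adjust to the real WDIETL layout

# Assumed broker URL; every worker machine must be able to reach it.
app = Celery("wdietl_tasks", broker="redis://broker.example.local:6379/0")


@app.task(bind=True, max_retries=3)
def etl_point(self, point_id):
    """Run the WDIETL pipeline (steps 3-7) for a single point_id."""
    try:
        process_point_id(point_id)
    except Exception as exc:
        # Retry transient failures (network hiccups, ST timeouts) after a delay.
        raise self.retry(exc=exc, countdown=60)


def enqueue(point_ids):
    """Fan the pending point_ids out to however many workers are running."""
    for pid in point_ids:
        etl_point.delay(pid)
```

Each additional computer would just run `celery -A wdietl_tasks worker` against the shared broker.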

The big question is how much effort to put into this system. Is this the best approach to getting data into the database?

How do we manage the ETL process? It's going to be both sporadic and protracted.