This figure schematically depicts the pipelines by which data is served to a consumer.

1) XLS, CSV, and SQL datasets first go through a minimal transformation (possibly none if common data standards are followed?) and are added to CKAN. XLS and CSV files can be uploaded to the CKAN datastore, but SQL data will likely stay in the host database. Datasets in the CKAN Icebox are not for automatic public consumption but are added to a SensorThings (ST) ETL queue. The ST ETL process runs periodically (nightly) to pop datasets from the queue, transform them, and upload them to an ST instance (in the case of NMBG, a GCP database and server). The ST ETL process is managed via a web-based dashboard that supports queue management, import prioritization, manually starting and stopping processes, visualizing activity, etc. The datasets are accessible to the consumer via an API Management layer.
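
Below is a minimal sketch of what the nightly ST ETL worker could look like, assuming a simple in-memory queue, CSV inputs, and a SensorThings v1.1 instance; the endpoint URL, Datastream IDs, and the timestamp/value column names are illustrative placeholders, not the actual NMBG configuration.

```python
# Sketch of the nightly ST ETL worker: pop queued datasets, transform rows,
# and upload them as SensorThings Observations. The endpoint, IDs, and
# column names are assumptions for illustration.
import csv
import requests

ST_BASE = "https://st-instance.example.com/v1.1"  # hypothetical ST endpoint

def etl_dataset(csv_path: str, datastream_id: int) -> None:
    """Read one queued CSV dataset and push each row as an ST Observation."""
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            observation = {
                "phenomenonTime": row["timestamp"],        # column name assumed
                "result": float(row["value"]),             # column name assumed
                "Datastream": {"@iot.id": datastream_id},
            }
            resp = requests.post(f"{ST_BASE}/Observations", json=observation)
            resp.raise_for_status()

def run_queue(queue: list[tuple[str, int]]) -> None:
    """Pop queued (csv_path, datastream_id) entries until the queue is empty."""
    while queue:
        csv_path, ds_id = queue.pop(0)
        etl_dataset(csv_path, ds_id)
```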

2) The data source is already a REST API service offered by the institution.
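
One way such a source could surface in this architecture is as a catalog entry. The sketch below registers the institution's endpoint as a CKAN resource via CKAN's standard action API, so it is findable alongside the other pipelines; the CKAN URL, API key, dataset name, and agency URL are placeholders.

```python
# Sketch: register an institution's existing REST endpoint as a CKAN resource.
# All URLs, names, and credentials below are illustrative placeholders.
import requests

CKAN_URL = "https://catalog.example.org"   # hypothetical CKAN instance
API_KEY = "xxxx-xxxx"                      # placeholder credential

resp = requests.post(
    f"{CKAN_URL}/api/3/action/resource_create",
    headers={"Authorization": API_KEY},
    json={
        "package_id": "agency-streamflow",  # assumed dataset name
        "name": "Agency streamflow REST API",
        "url": "https://api.agency.example.gov/v1/streamflow",  # institution's service
        "format": "API",
    },
)
resp.raise_for_status()
```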

3) The data source is a mature REST API (preferably ST) service offered by the agency that already conforms to a standard specification. These datasets are findable via CKAN and accessible directly from the agency (potentially through its own API layer). There are two options for aggregating such a dataset with other datasets: 1) WDI-side or 2) consumer-side. In the WDI-side case, WDI provides an aggregated API proxy via the API Management layer. In the consumer-side case, the consumer is responsible for aggregating the data manually or via an app.
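
For the consumer-side option, aggregation might look like the following sketch, which queries two hypothetical SensorThings endpoints and merges their Observations into one time-ordered series; the agency URLs are illustrative.

```python
# Sketch of consumer-side aggregation across two standard ST services.
# The endpoint URLs are illustrative placeholders.
import requests

ENDPOINTS = [
    "https://st.agency-a.example.gov/v1.1",  # hypothetical
    "https://st.agency-b.example.gov/v1.1",  # hypothetical
]

def fetch_observations(base: str, top: int = 100) -> list[dict]:
    """Pull recent Observations from one ST service (OData-style query params)."""
    resp = requests.get(
        f"{base}/Observations",
        params={"$orderby": "phenomenonTime desc", "$top": top},
    )
    resp.raise_for_status()
    return resp.json()["value"]

# Aggregate across services and re-sort into a single time-ordered series.
merged = sorted(
    (obs for base in ENDPOINTS for obs in fetch_observations(base)),
    key=lambda obs: obs["phenomenonTime"],
)
```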

4) The data source is highly heterogeneous and does not easily conform to any of the existing services. These datasets are put into the CKAN Icebox and then released to the public. At some later date, these datasets could be served via a WDI service.
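
A sketch of this Icebox flow follows, modeling the Icebox simply as CKAN's private-dataset flag (an assumption) with a later release step; the CKAN URL, API key, and organization name are placeholders.

```python
# Sketch: park a heterogeneous dataset in the Icebox as a private CKAN
# package, then flip it public when released. Names and credentials are
# illustrative placeholders.
import requests

CKAN_URL = "https://catalog.example.org"   # hypothetical CKAN instance
API_KEY = "xxxx-xxxx"                      # placeholder credential

def create_icebox_package(name: str, org: str) -> None:
    """Create the dataset as private, i.e. not yet publicly consumable."""
    resp = requests.post(
        f"{CKAN_URL}/api/3/action/package_create",
        headers={"Authorization": API_KEY},
        json={"name": name, "owner_org": org, "private": True},
    )
    resp.raise_for_status()

def release_package(name: str) -> None:
    """Release the dataset to the public at some later date."""
    resp = requests.post(
        f"{CKAN_URL}/api/3/action/package_patch",
        headers={"Authorization": API_KEY},
        json={"id": name, "private": False},
    )
    resp.raise_for_status()
```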
