Cyber-Infrastructure Plan: July 2021

This document describes the current cyber-infrastructure plan for the New Mexico Water Data Initiative (NMWDI). It is intended for IT staff of the Water Data Act directing agencies, and for others who wish to participate in sharing data through the NMWDI. This is a draft plan that continues to evolve as the NMWDI team learns how agencies will incorporate this work. Please send any feedback to Stacy.Timmons@nmt.edu.

Software Stack
Table 1 outlines the conceptual software stack used by NMWDI. The Application layer provides graphical user interfaces (GUIs) to the data and is the primary interface most users will use. The API management layer provides two main features: 1) granular access control to the APIs and 2) on-the-fly transformations of the data, e.g. unit conversions. The FROST layer is a collection of FROST servers loaded with data from the data aggregation layer. Finally, the data collection layer exists to automate and streamline the process of moving data from sensors and field notes to databases.

Table 1. NMWDI conceptual software stack.

LAYER | DESCRIPTION
APPLICATION | GUI interfaces to the APIs (i.e. dashboards, apps for use on mobile device, or web maps)
API MANAGEMENT | Control access to APIs, perform on-the-fly transformations e.g. unit conversion.
FROST | Collection of FROST servers storing and sharing data in SensorThings format
DATA AGGREGATION | Extract and Load (EL) datasets into a “Data Lake”
DATA COLLECTION | Automated data collection and send directly to cloud/database

Although the software stack is based on existing practices and open source software, resources are required to design, implement and maintain its various components.

Data Aggregation/Integration

The NMWDI cyber-infrastructure requires multiple components in order to address the large range in technical capacity among data providers, the numerous data formats and vocabularies, and the various use cases of water datasets. Development and maintenance of this infrastructure requires only a small investment from each agency. A full-stack developer at each agency would greatly increase the speed and efficiency of integrating agency datasets with NMWDI. All the software used and recommended by NMWDI is open source, freely available and supported by an active global community. The software can be run on agency hardware or in the cloud. Cloud computing costs for hosting a FROST server are approximately $30/month.

A primary goal of NMWDI is to provide FAIR access to New Mexico’s water data. NMWDI, with guidance from IOW[1] and USGS, has chosen the OGC SensorThings specification for sharing location-based time series data. Instead of building our own implementation of the SensorThings specification, we opted to use the existing open source project FRaunhofer Opensource SensorThings-Server, aka FROST. FROST is the first complete, open-source implementation of the OGC SensorThings API Part 1: Sensing. FROST consists of a PostGIS database backend and an Apache Tomcat web frontend. FROST is a dockerized application, making it simple to deploy and manage.
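To illustrate how a client might read data from a FROST server, the following is a minimal sketch that builds SensorThings request URLs. The base URL and the function names are assumptions made for illustration and do not refer to an actual NMWDI endpoint; `$filter`, `$orderby`, `$top` and `$expand` are standard SensorThings query options.

```python
from urllib.parse import quote

# Hypothetical base URL; replace with the address of a real FROST server.
BASE_URL = "https://frost.example.org/FROST-Server/v1.1"


def observations_url(datastream_id: int, since: str, top: int = 100) -> str:
    """Build a URL for the newest observations of one Datastream since a given time."""
    options = {
        "$filter": f"phenomenonTime ge {since}",
        "$orderby": "phenomenonTime desc",
        "$top": str(top),
    }
    # Percent-encode the option values (spaces, colons) but keep the $-prefixed keys readable.
    query = "&".join(f"{key}={quote(value)}" for key, value in options.items())
    return f"{BASE_URL}/Datastreams({datastream_id})/Observations?{query}"


def things_with_locations_url() -> str:
    """Build a URL that lists Things together with their Locations."""
    return f"{BASE_URL}/Things?$expand=Locations"
```

A URL returned by `observations_url(7, "2021-01-01T00:00:00Z", top=5)` could be fetched with any HTTP client; the response is JSON.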

With access to a FROST server, an agency only needs to address how their data is added to a FROST server. Large datasets that are stored in a database and updated regularly need to be added to FROST via an automated process with little to no human interaction. NMWDI’s ByteFlow platform provides the infrastructure for agencies to automate, orchestrate and monitor their data integration pipelines.

NMWDI’s ByteFlow platform is based on two established open source projects, Airbyte and Airflow. Airbyte is used as a basic EL tool; data are Extracted from the source and Loaded unaltered to NMWDI’s BigQuery data lake (see Figure 1). Airbyte provides a standard workflow for onboarding data that offers significant robustness and flexibility. Airflow is used to monitor and orchestrate the various services that are required to move the data automatically from the provider to the consumer-facing services (i.e. UI and API).
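The Extract and Load idea can be sketched in a few lines of plain Python. This is not the ByteFlow implementation: an in-memory SQLite database stands in for the data lake, and the sample data, table name and function names are all hypothetical. The point it illustrates is that rows are stored unaltered, tagged only with their source and load time, so that any transformation can happen later.

```python
import csv
import io
import json
import sqlite3
from datetime import datetime, timezone


def extract(csv_text: str) -> list[dict]:
    """Extract: read every row exactly as the provider wrote it."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def load(conn: sqlite3.Connection, source: str, rows: list[dict]) -> int:
    """Load: store each raw row as JSON, tagged with its source and load time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_records (source TEXT, loaded_at TEXT, payload TEXT)"
    )
    loaded_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO raw_records VALUES (?, ?, ?)",
        [(source, loaded_at, json.dumps(row)) for row in rows],
    )
    conn.commit()
    return len(rows)


# Hypothetical sample of provider data; the column names are illustrative only.
SAMPLE = "well_id,measured_at,depth_ft\nW-1,2021-06-01,102.4\nW-1,2021-07-01,103.1\n"

conn = sqlite3.connect(":memory:")  # stand-in for the data lake
count = load(conn, "example_provider", extract(SAMPLE))
```

Keeping the load step free of transformations means a provider's connector only needs to know how to read the source; unit handling and vocabulary mapping can be applied downstream.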

Figure 1. Example of the current data pipeline for groundwater levels, with NMBGMR, EBID and PVACD included as data providers for demonstration purposes. The workflow uses the ByteFlow platform: Airflow triggers Airbyte to move data through BigQuery and Cloud Composer to the FROST server, where they are served in SensorThings API format.

 

The ByteFlow platform is cloud based and accessible by any authorized user. One of the hopes of NMWDI is that over time this existing infrastructure can be used by the major water data producers in the state to streamline and automate their own data integration procedures. For example, an agency may want to make a new dataset available via SensorThings. Instead of starting from scratch, a full-stack developer at the agency could collaborate with NMWDI and contribute her own simple EL code to the ByteFlow platform. She would not need to worry about provisioning the necessary resources or developing her own workflow, as the infrastructure and processes are already in place.

If agencies are unable to send data to NMWDI infrastructure, another option is to deploy the IOW HubKit. HubKit is a dockerized service collection that combines a FROST Server and database with a simple wizard tool. HubKit allows users to map csv files or Excel worksheets to the SensorThings data model, and to ingest these file-based data sources into their own FROST Server for sharing with NMWDI. Using HubKit involves hosting existing csv or Excel files (or exporting such files from an existing data management system) in a local or web-accessible location, and formatting them so that there are columns identifying data collection locations, parameters, result values, units, etc. The data provider then uses the HubKit wizard to specify which columns correspond to which elements of the SensorThings data model. HubKit can also be configured to periodically check whether the files have been updated with new data. In this way, a participating agency can publish its data in the SensorThings format without needing to cede data management or storage to the NMWDI. However, this option does require the agency to manage the software, install it on IT infrastructure that it owns, and expose the SensorThings API endpoint as a publicly accessible URL.
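The column-mapping step described above can be illustrated with a short sketch. This is not HubKit's configuration format or code; the mapping keys, the column names and the function name are hypothetical. It only shows the general idea of declaring which column feeds which element of the SensorThings data model, and it produces partial entity fragments rather than complete SensorThings entities.

```python
import csv
import io

# Hypothetical mapping from SensorThings elements to a provider's column names.
COLUMN_MAP = {
    "location": "site_name",
    "parameter": "param",
    "result": "value",
    "unit": "units",
    "time": "sample_date",
}


def row_to_entities(row: dict, column_map: dict) -> dict:
    """Translate one tabular row into SensorThings-style entity fragments."""
    return {
        "Thing": {"name": row[column_map["location"]]},
        "ObservedProperty": {"name": row[column_map["parameter"]]},
        "Datastream": {"unitOfMeasurement": {"name": row[column_map["unit"]]}},
        "Observation": {
            "phenomenonTime": row[column_map["time"]],
            "result": float(row[column_map["result"]]),
        },
    }


# Hypothetical one-row file in the layout described by COLUMN_MAP.
SAMPLE = (
    "site_name,param,value,units,sample_date\n"
    "Well 12,Depth to water,45.2,ft,2021-05-03T00:00:00Z\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
entities = [row_to_entities(row, COLUMN_MAP) for row in rows]
```

Because the mapping is data rather than code, a different file layout would only require a different `COLUMN_MAP`.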

Table 2 (below) lists the data integration scenarios addressed by the NMWDI cyber-infrastructure.

Table 2. Data integration scenarios. Many of the steps of this federated process leave full ownership and responsibility with the data provider. Steps in green bold text are handled fully by NMWDI or by the data provider in close collaboration with NMWDI.

Scenario: Data provider has raster data or non-time series data that cannot easily be mapped to SensorThings, that will not be updated
Solution A:
  1. Create or log in on CKAN account
  2. Add link to Custom API on CKAN complete with metadata and tags

Scenario: Data provider has a time-series or discrete sample dataset in csv, txt or Excel format. The dataset is historical, static, and will not be updated
Solution A:
  1. Create or log in on CKAN account
  2. Add link to Custom API on CKAN complete with metadata and tags

Scenario: Data provider has a time-series dataset in csv, txt or Excel format. The data is updated on a semi-regular basis
Solution A:
  1. Create or log in on CKAN account
  2. Add link to Custom API on CKAN complete with metadata and tags
  3. Adapt existing WDI CKAN-to-Airbyte Connector
Solution B:
  1. Set up the IOW HubKit
  2. Share the URL of the HubKit SensorThings endpoint with NMWDI
  3. Periodically add data to the files HubKit monitors

Scenario: Agency has a time-series dataset in a database but currently does not have an API
Solution A:
  1. Manually export csv “reports” from database
  2. Create or log in on CKAN account
  3. Add link to Custom API on CKAN complete with metadata and tags
  4. Adapt existing WDI CKAN-to-Airbyte connector
Solution B:
  1. Set up the IOW HubKit
  2. Share the URL of the HubKit SensorThings endpoint with NMWDI
  3. Periodically add data to the files HubKit monitors

Scenario: Agency has a time-series dataset in a database and has an API but is not compliant with SensorThings specification
Solution A:
  1. Create or log in on CKAN account
  2. Add link to Custom API on CKAN complete with metadata and tags
  3. Create Custom API-to-Airbyte connector

Scenario: Agency has a time-series dataset stored and shared via an instance of FROST
Solution A: No additional work required by the data provider! Just tell NMWDI where to find it!
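The scenarios in Table 2 can also be read as a small decision procedure. The sketch below is one possible encoding, offered only as an illustration: the `Dataset` fields and the `solutions` function are hypothetical names, the step text is copied from Table 2, and any combination that the table does not list raises an error rather than guessing.

```python
from dataclasses import dataclass

CKAN_STEPS = [
    "Create or log in on CKAN account",
    "Add link to Custom API on CKAN complete with metadata and tags",
]
HUBKIT_STEPS = [
    "Set up the IOW HubKit",
    "Share the URL of the HubKit SensorThings endpoint with NMWDI",
    "Periodically add data to the files HubKit monitors",
]
ADAPT_CONNECTOR = "Adapt existing WDI CKAN-to-Airbyte connector"


@dataclass
class Dataset:
    """Hypothetical description of a provider's dataset."""
    time_series: bool      # False for raster or other non-time-series data
    storage: str           # "files", "database" or "frost"
    updated: bool = False  # True if new data arrive over time
    has_api: bool = False  # True if a non-SensorThings API already exists


def solutions(ds: Dataset) -> tuple[list[str], list[str]]:
    """Return (Solution A steps, Solution B steps); Solution B may be empty."""
    if ds.storage == "frost":
        return ["Tell NMWDI where to find the FROST instance"], []
    if ds.storage == "files" and not ds.updated:
        return CKAN_STEPS, []
    if ds.time_series and ds.storage == "files":
        return CKAN_STEPS + [ADAPT_CONNECTOR], HUBKIT_STEPS
    if ds.time_series and ds.storage == "database" and not ds.has_api:
        export = ["Manually export csv “reports” from database"]
        return export + CKAN_STEPS + [ADAPT_CONNECTOR], HUBKIT_STEPS
    if ds.time_series and ds.storage == "database":
        return CKAN_STEPS + ["Create Custom API-to-Airbyte connector"], []
    raise ValueError("combination not covered by Table 2")
```

For example, `solutions(Dataset(time_series=True, storage="files", updated=True))` returns the three CKAN and connector steps as Solution A and the three HubKit steps as Solution B.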

 

API Management

The proliferation of APIs for sharing NM water data requires that NMWDI use an API management platform. The API management platform provides two main benefits: 1) granular access control to the APIs and 2) on-the-fly transformations of the data.

Data that is added to NMWDI via the ByteFlow process will already be standardized and consistent; therefore, any on-the-fly transformations will be minor or nonexistent. However, there will be numerous datasets that NMWDI does not directly control, which are shared via mature and stable APIs but still require transformation. Two examples are the US Bureau of Reclamation (BOR) RISE API and the US Geological Survey (USGS) Groundwater Levels SensorThings API. In both cases the APIs are stable and mature but are not directly interoperable with the NMWDI APIs. In the case of BOR, the RISE API uses a data model that is similar to, but different from, SensorThings. In the case of USGS, the data is in the SensorThings format but is not reported the same way as in NMWDI SensorThings: USGS uses groundwater “elevation” and NMWDI uses “depth to water below ground surface”. Proxying these existing Federal APIs via the NMWDI API management platform to make them interoperable with NMWDI’s APIs greatly reduces the burden of developing applications that integrate and visualize the wide variety of water-related state and federal datasets.
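The elevation-versus-depth mismatch is a good example of an on-the-fly transformation. The following minimal sketch assumes that the ground surface elevation at the well is known and shares the unit and vertical datum of the reported water elevation. The function names and sample values are hypothetical, and the feet-to-meters step is included only to illustrate a unit conversion, not to state which units either API uses.

```python
METERS_PER_FOOT = 0.3048  # exact, by definition of the international foot


def elevation_to_depth(water_elevation: float, surface_elevation: float) -> float:
    """Convert groundwater elevation to depth to water below ground surface.

    Both inputs must share the same unit and vertical datum.
    """
    return surface_elevation - water_elevation


def feet_to_meters(value_ft: float) -> float:
    """Convert a length in feet to meters."""
    return value_ft * METERS_PER_FOOT


def transform_observation(obs: dict, surface_elevation_ft: float) -> dict:
    """Return a copy of an elevation observation re-expressed as depth in meters."""
    depth_ft = elevation_to_depth(obs["result"], surface_elevation_ft)
    return {**obs, "result": round(feet_to_meters(depth_ft), 3)}


# Hypothetical observation: water level at 4900 ft elevation, ground surface at 5000 ft.
obs = {"phenomenonTime": "2021-07-01T00:00:00Z", "result": 4900.0}
converted = transform_observation(obs, surface_elevation_ft=5000.0)
```

The original observation is left unchanged, which suits a proxy that must pass the upstream response through untouched for clients that request it.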

Numerous open source API management platforms, on-premises or cloud based, currently exist. NMED has used Apigee with some success and has suggested that other agencies could “Apigee”-back on the existing NMED subscription. NMWDI will continue to explore the best and most cost-effective options for an API management platform.

 


[1] IOW, or Internet of Water, is a project of Duke University that assists Federal, State, and local agencies and NGOs in the United States in implementing data management practices enabling FAIR (findable, accessible, interoperable, reusable) data. IOW provides substantial in-kind consulting, stakeholder engagement, and technology development support to the NMWDI team.