rDataFusion: A Project-Specific Multi-Data Fusion Tool for Discovering, Integrating, and Visualizing Heterogeneous Long-term Data Sets

Ifeanyi H. Nwigboji1, Marguerite Mauritz-Tozer1, Sergio A. Vargas-Zesati1, Craig E. Tweedie1

1System Ecology Lab, University of Texas at El Paso, El Paso, Texas

To understand ecosystem change over a range of spatial and temporal scales and levels of biological organization and interaction, multiple streams of ecological data need to be collected, integrated, and analyzed. However, due to the size and complexity of these data streams and many other challenges (e.g., personnel turnover, methodological changes, and gaps in observing records), managing, analyzing, sharing, and visualizing these data has posed a significant challenge. To address these challenges, we are developing a multi-data fusion tool called rDataFusion, which is capable of aggregating heterogeneous data sets collected from a range of automated and semi-automated sensors and manual observations over a decade-long period. rDataFusion is being developed using the free, open-source R package Shiny. When completed, rDataFusion will integrate and filter data from two instrument nodes and different data streams that include micro-meteorological variables (e.g., temperature, relative humidity), soil conditions (e.g., temperature and soil moisture), and ecosystem trace gas and energy fluxes. After initial compilation and filtering, users can visualize data in near real-time to check that all sensors are running properly and/or apply preliminary flags to data that are deemed out of range or otherwise problematic. They can also add and edit field metadata. rDataFusion currently supports exploratory data analysis through quality control and quality assurance processes, allowing users to identify missing values and outliers and to gap-fill. Future goals are to incorporate machine learning to filter and flag unusual data based on the alignment of related sensors, gap-fill missing or problematic data, visualize data to allow for preliminary summaries and interpretations, and compare data across time or by site.
The overarching goal is to develop a customizable, open-source multi-data fusion tool that improves researchers' capacity to aggregate different streams of data from a single intensive site, facilitates data management, sharing, and analysis, and serves as a template for other research groups with similar challenges.