Sixteen Core phenology sites were selected based on the locations of long-term air temperature and vegetation data at reference stands, and the distribution of study sites was augmented with a few additional sites. At these Core sites, air temperature was measured year-round, and plant phenology and insect and bird abundances were measured during spring from 2009 to 2014.
Additional sensors were placed at 40 Core Bird sites and 124 Auxiliary Bird sites. These locations were stratified across elevation, forest type, and distance to roads to ensure that the full environmental gradient was sampled, with a minimum distance of 300 m between sampling points.
The high-resolution temperature data (15- and 20-minute intervals) were evaluated and questionable data were flagged. Before hourly averages were calculated, flagged data were removed and missing values were estimated using regression relationships with other temperature data from the Andrews Forest for that period.
Script name: WaTeR.py (used to create entities 2 and 4)
This script was designed to flag, clean, average by hour, and fill data from air temperature sensors deployed at the HJ Andrews.
A revised Python program was used to flag the raw data in entity 5. This program also cleans and fills the data, but the cleaned and filled data are not provided; a researcher can download the Python program from Bitbucket, download the raw data from entity 5, and run the program. A visualization program is also available.
https://bitbucket.org/account/user/hjandrews/projects/PHEN
Datasets: The WaTeR.py script creates folders: flagged, cleaned, reference, and filled. These contain the output files produced in the processing steps described below.
Requires: Python, SciPy and NumPy
Note: This program was written for data loggers started during June (daylight saving time). Since ONSET loggers use the computer clock for time stamps, raw times are in PDT, not PST. Hourly averaging shifts times to PST and matches the reference file format (where the hour represents the average of temperatures in the preceding hour). If original logger start dates are not in PDT, there is a line of code (currently line 313) that can be turned on.
Settings: The script may be run on a single file or on a folder; the file or folder must be in the same directory as the script. Enter the file or folder name (e.g., INPUTFOLDER="Folder") and comment out the unused line (e.g., #INPUTFILE). Input file names should contain the site name, as this name is retained through processing. Date limits should be specified under the '#Date limits' heading; these form the bookends within which the program will attempt to fill gaps using the available reference files. Reference files are stored together in a folder (e.g., REFERENCE_DIR = "RS data for PC sites") and are labeled with *reformatted* to distinguish them from reference files that have not been modified to match the required input format. All sites are added to the reference folder as they are run, so the script should be run twice if the sites are not already in the reference folder and you want them available as reference files for other sites run in the same batch. Site files with *cleaned*, *filled*, or *flagged* in their file names will not be used, to avoid processing already-processed data. Reference files must be in the correct format and include *reformatted* in their file names or they will not be used. These labels are consistent with the output files from this script as well as with the script that converts reference data downloaded from the Andrews website (convert_reference_data.py).
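For illustration, a minimal sketch of this settings block is shown below. INPUTFOLDER, INPUTFILE, and REFERENCE_DIR appear in the script as described above; the date-limit variable names and all example values are hypothetical.

# Illustrative settings only; date-limit variable names and example values are hypothetical.
# Run on a whole folder OR a single file; comment out the line you are not using.
INPUTFOLDER = "Folder"            # folder located in the same directory as the script
# INPUTFILE = "PC01_airtemp.csv"  # hypothetical single-file alternative; name contains the site

# Date limits: bookends between which the script attempts to fill gaps
FILL_START = "2009-06-01"         # hypothetical variable name
FILL_END = "2014-09-30"           # hypothetical variable name

# Folder of reference files; only files labeled *reformatted* will be used
REFERENCE_DIR = "RS data for PC sites"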
Description: This script flags, prunes, averages (by hour), and fills air temperature data as detailed below. This script was applied to the raw data used to create entities 2 and 4.
Step 1: Flagging (Original time step); Output file – (input file name)_flagged_00-0000.csv, where 00-0000 is the month-year of the last data point
Flagging identifies, for each line (date/time) entry, the conditions removed in Step 2: extreme, air_past, air, jump, and nodata.
Step 2: Pruning (Original time step); Output file – (input file name)_cleaned_00-0000.csv
Pruning removes lines flagged as extreme, air_past, air, jump, or nodata.
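As a rough illustration of the flag-then-prune pattern in Steps 1 and 2 (this is not the WaTeR.py code itself; the column name, thresholds, and use of pandas are assumptions, and the air_past and air checks are omitted):

import pandas as pd

def flag_and_prune(df, temp_col="airtemp"):
    """Flag questionable rows, then drop them (sketch of Steps 1-2)."""
    t = df[temp_col]
    flags = pd.DataFrame(index=df.index)
    flags["extreme"] = (t < -30) | (t > 45)   # hypothetical plausible-range limits
    flags["jump"] = t.diff().abs() > 10       # hypothetical step-change threshold
    flags["nodata"] = t.isna()
    flagged = df.join(flags)                  # Step 1 output: original rows plus flag columns
    cleaned = df[~flags.any(axis=1)]          # Step 2 output: flagged rows removed
    return flagged, cleaned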
Step 3: Averaging (Hourly time step); Output file – (input file name)_00-0000_reformatted.csv
Averaging uses only values remaining after pruning. The number of values used to calculate each average is included as a new column. Notes: The command that saves this output file includes the path to the reference folder; if that folder is changed, it should also be changed in this section (or a new folder will be made containing the files, but they will not be used for filling). Averaging follows the convention used for Andrews weather stations, where the hour represents the average of temperatures in the preceding hour. The output is in PST (while all previous outputs are in PDT, matching the raw input).
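A minimal sketch of hourly averaging under the preceding-hour convention, assuming the cleaned data are a pandas DataFrame with a "datetime" column in PDT (the column names and the use of pandas are assumptions, not the WaTeR.py implementation):

import pandas as pd

def hourly_average(cleaned, temp_col="airtemp"):
    """Average to the hour, stamping each hour with the end of its interval (Step 3 sketch)."""
    s = cleaned.set_index("datetime")[temp_col]
    s.index = s.index - pd.Timedelta(hours=1)   # shift PDT to PST
    hourly = s.resample("1h", closed="right", label="right").agg(["mean", "count"])
    hourly.columns = [temp_col, "num_values"]   # number of values used in each average
    return hourly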
Step 4: Filling (Hourly time step); Output file – (input file name)_filled_00-0000.csv
The script uses the cleaned data and compares the remaining entries to the reference files (see Settings). This is done as a linear regression of the cleaned data against each reference file; the output includes the R2 value, which can be found in the text file corresponding to the input file name. Prior to filling, the script creates placeholder hours bounded by the date range specified under "Date limits", which is the range in which filling is attempted; these placeholder values are set at 1000 degrees. The script then fills the missing (1000 degree) data by moving sequentially through the reference data in order of fit (R2). The linear regression equation is used to adjust the reference value for that data point, and the adjusted value replaces the 1000 degree placeholder. The reference file used to fill each temperature value is listed in a neighboring column. If all reference files are examined and no data are found to replace the missing value placeholder, the placeholder is retained; thus 1000 degrees should be treated as "no data".
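The filling logic can be sketched as follows, assuming the target and reference data are hourly pandas Series on matching PST timestamps. This is an illustration of the approach described above, not the WaTeR.py code; scipy.stats.linregress stands in for whatever regression routine the script actually uses, and the function and variable names are hypothetical.

import pandas as pd
from scipy.stats import linregress

NODATA = 1000.0   # placeholder value meaning "no data"

def fill_from_references(target, references):
    """Fill NODATA hours in `target` using the best-fitting reference series first (Step 4 sketch)."""
    fits = []
    for name, ref in references.items():
        both = pd.concat([target, ref], axis=1, keys=["t", "r"]).dropna()
        both = both[(both["t"] != NODATA) & (both["r"] != NODATA)]
        if len(both) < 2:
            continue
        fit = linregress(both["r"], both["t"])
        fits.append((fit.rvalue ** 2, name, fit.slope, fit.intercept))
    filled = target.copy()
    source = pd.Series("", index=target.index)              # which reference filled each hour
    for r2, name, slope, intercept in sorted(fits, reverse=True):   # best fit (highest R2) first
        est = (slope * references[name] + intercept).reindex(filled.index)
        usable = (filled == NODATA) & est.notna()
        filled[usable] = est[usable]
        source[usable] = name
    return filled, source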
Step 5: Max, min, mean (Daily time step); Output file – (input file name)_daily_00-0000.csv
The script ignores 1000 degree data and calculates daily max, min and mean temperature values from the filled dataset. The number of records (hours) used in the calculation is listed in the column 'count.'
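A sketch of the daily summary, assuming the filled data are an hourly pandas Series in which 1000 still marks missing hours (again an illustration of the step, not the script's own code):

import pandas as pd

def daily_summary(filled, nodata=1000.0):
    """Daily max, min, mean, and count of hours used, ignoring placeholder values (Step 5 sketch)."""
    valid = filled.where(filled != nodata)        # treat 1000 degree placeholders as missing
    return valid.resample("1D").agg(["max", "min", "mean", "count"])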
Figures: Data points over time, with flagged (and cleaned) data shown in red; these are saved as PDFs in the "flagged" folder.
After the sensors were downloaded, the high-resolution data were put through a series of programs for quality control and for filling missing values before the hourly averages were generated (entities 2 and 4). A major quality control concern was detecting when the sensors were buried by snow, because temperatures would then represent the snowpack rather than the air. When burial was detected, data were filled using the regression relationships with other sensors. Regressions were calculated using the best fit with other sensors during periods when full data were available.
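The documentation does not state how burial was detected, so the following is only a hedged illustration of one common heuristic (a damped diurnal range near 0 degrees C); the thresholds and function name are hypothetical, and this is not the project's detection algorithm.

import pandas as pd

def possible_snow_burial(hourly, range_max=1.0, mean_min=-1.0, mean_max=1.0):
    """Flag days whose temperatures look like snowpack rather than air (illustrative heuristic only)."""
    daily = hourly.resample("1D").agg(["min", "max", "mean"])
    damped = (daily["max"] - daily["min"]) < range_max      # little diurnal variation
    near_zero = daily["mean"].between(mean_min, mean_max)   # hovering near 0 degrees C
    return damped & near_zero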
Note: Spikes in data associated with direct light impacting the sensors were not evaluated as part of the QC programs.
Note: For Phenology Core sites, additional manual QA/QC was conducted to evaluate and correct snow flags and temperature spikes, resulting in data filling or reverting to the original data, as needed.
For these entities (1-4), quality assurance and quality control were conducted on all temperature data collected. All data were averaged into hourly segments and run through a Python script to identify and flag impossible values, periods of missing data, and periods when sensors were buried by snow. Data were further checked via manual QA/QC, and values were compared to those from nearby temperature stations to identify any erroneous snow flags (i.e., data flagged as snow burial when there was no snow at that site), as well as temperature spikes, missing data, and other questionable values not identified by automated QA/QC.
Raw data for entity 5 were flagged but not filled using the hja_hobo_clean Python programs. These data were checked for burial by snow, for extreme values and jumps, and for the influence of high/extreme light intensity.
Raw data for entity 6 were flagged but not filled using the GCE Toolbox workflow, based on the flagging algorithms in the hja_hobo_clean workflow. Like entity 5, these data were checked for burial by snow, for extreme values and jumps, and for the influence of high/extreme light intensity. Rather than having a separate data column for each flag, as in entity 5, these data use an aggregated flagging system for each data variable (temperature and light).
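To illustrate the difference between the two flagging layouts, the sketch below collapses separate boolean flag columns (the entity 5 style) into a single flag string per variable (the entity 6 style). The flag codes, column names, and function are hypothetical and do not reproduce the GCE Toolbox's own conventions.

import pandas as pd

FLAG_CODES = {"snow": "S", "extreme": "E", "jump": "J", "light": "L"}  # hypothetical codes

def aggregate_flags(df, flag_codes=FLAG_CODES):
    """Collapse one boolean column per flag into a single flag string per row (e.g. 'SJ')."""
    return df[list(flag_codes)].apply(
        lambda row: "".join(code for col, code in flag_codes.items() if row[col]),
        axis=1,
    )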