DataPro setup and workflow

Initial setup

It's good to settle on a standard directory structure if you can. On Linux, most things go in a bin directory; you can do the same on Windows. It's recommended not to put any space characters ' ' in folder names and the like, as I don't know whether there are ill consequences. Be conservative: use descriptive names, keep things well organized, and use the underscore character in place of spaces. Logical places you might put the utilities:

/var/site/bin/
/home/site/bin/
c:\data\bin\


Then, a suggested directory structure for the sites themselves (a shell sketch for creating this layout follows the list):

/var/site/$site_name - root directory of the site
/var/site/$research_area/$site_name - another logical location for the root directory of the site.
/var/site/$site_name/config/ - location of the configuration files used by datapro and related utilities for processing the data
/var/site/$site_name/raw/ - location for the raw data (I personally think a good practice is to keep the original data from LoggerNet / the logger etc. here)
/var/site/$site_name/outputs/ - location for the automatically processed data: raw data that has been analyzed by datapro
/var/site/$site_name/web/ - location for a shorter time series, like the last 3 weeks, suitable for a web page.
/var/site/$site_name/qc/ - location for the automated qa/qc log
/var/site/$site_name/error/ - location for datapro error logs
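
A minimal shell sketch for creating that layout (assuming bash for the brace expansion; the site name here is just an example):

site_name=ss_met     # example site name
mkdir -p /var/site/$site_name/{config,raw,outputs,web,qc,error}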

To run:

python /var/site/bin/csv_utilities/datapro.py --key_file=/var/site/$site_name/config/keyfile.txt

Then, watch for error messages. Depending on your system there may be some Python libraries that need to be installed. If you use Windows, the Enthought Python Distribution is a nice all-encompassing choice, or Cygwin, for making sure you have all of the right things on the computer.

If you have any questions just ask Bob Busey or Ross Spicer.

Initial site setup workflow

  1. create the initial directory structure listed above.
  2. I usually don't use the site creator, though I certainly could. More often I start with an existing site's configuration files: copy them to a new config directory and edit them to reflect the new site.
  3. the file that controls QA/QC parameters, file names, any functions to be applied, etc. is the parameter csv file. It's often easiest to edit this on a computer with a graphical spreadsheet program like Excel or LibreOffice to get things right. The header information can often be copied & pasted straight from the raw .dat file.
  4. next, edit the .txt key file to reflect where the files and directories are all located.
  5. run datapro as shown above.
  6. if it works, awesome. if it doesn't, make note of the errors and try to address them.
  7. once it's running I do a couple more things. ln -s is used to create symlinks for a few other items. I maintain an automatically generated diagnostics web page at http://ngeedata.iarc.uaf.edu/data/data.html. It expects battery voltage to be placed in 'battery.csv' and internal panel temperature to be in 'paneltemp.csv'; if you don't use those names you can use ln -s to create a symlink (see the example after this list). I will then also make symlinks so that the processed data is available over the internet without making a second full copy of the data.
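
For example, if the processed battery voltage and panel temperature landed in files with different names (the source file names below are made up for illustration), a couple of symlinks keep the diagnostics page happy:

cd /var/site/$site_name/outputs/
ln -s batt_volt.csv battery.csv       # made-up source file name
ln -s panel_temp.csv paneltemp.csv    # made-up source file name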

Version 2 of this workflow for fully automated sites

Here is the code for DataPro plus some other information:
DataPro_and_related_utilities

Grab these items

cd /var/site/lib
git clone https://github.com/rwspicer/csv_utilities.git
git clone https://github.com/frankohanlon/bash_utils.git
git clone https://github.com/frankohanlon/python_utils.git

All of this definitely works in Linux / Mac OS X. I think DataPro runs in Windows, but a lot of the helper utilities are more Linux / OS X specific, simply because that is where all of this runs. It probably wouldn't be too hard to get it running in Windows if you wanted to. I have some other utilities, too. Check out these:

  • bin.zip (ask Bob for access to this if you want to implement)

Roughly speaking, I use crontab for scheduling. I keep the schedule as a file in the bin directory called 'main_cron'. To load it into the crontab (from the command line):

$ crontab main_cron
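
To double-check that the schedule actually loaded, you can list the installed crontab:

$ crontab -l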

That loads it up. If you look at main_cron (which is a text file) you'll see that line 16 is the schedule for running the bash script 'cron_04min.sh'.

6      *      *    *     * (/var/site/bin/cron_04min.sh) > /var/site/logs/ND_4min_log.txt

If you open 'cron_04min.sh' you'll see a few things. I use the information on lines 2 & 3 to track how long it takes for this script to run:

CRONSTART=`date +%s%N`
CRONTIME=$(date +'"%G-%m-%d %H:00:00",')

Those basically use the program date to output a couple of quantities. The first is the starting time for this batch file, in nanoseconds since the epoch (the division by 1000000000 later converts the difference to seconds). The second is the standard CSI-format date/time. You'll see I also track station processing time. In some sense it's a meaningless quantity, but in another, I've caught problems early when I've noticed the total processing time deviating from its usual value. When Ross rewrote DataPro he also made it way more efficient, so this became less important. So these lines are for the cron_04min total run time, and there are some other commands that track individual batch files.

Line 4 identifies one of the utility scripts and assigns its location to a variable name. This particular one reviews the latest output from the instruments and then tallies up how many bad data points there are. I'd like to track this in a more sophisticated fashion at some point but haven't gotten to it yet:

STATUS=/home/bbusey/working_files/bin/utilities/status_update.py

Lines 7-13 are the ones in cron_04min.sh that handle the example site, ss_met. Line 7 is just the tracking starter again:

START=`date +%s%N`

Line 8 triggers another batch file, which is full of things related to ss_met.

~/working_files/bin/process_ss_met.sh

I've been trying to get a bit more structured with file naming, so most of the bash scripts that are run from the crontab start with the prefix 'cron', and here the processing bash script for the super-site met station has the prefix 'process'.
Here are lines 9-13. They run after 'process_ss_met.sh' has completed.

END=`date +%s%N`
ELAPSED=`echo "scale=8; ($END - $START) / 1000000000" | bc`
temp1=$CRONTIME$ELAPSED
echo $temp1 >> /home/bbusey/working_files/global_status/outputs/ss_met_process.csv
echo "Super Site Met: "$ELAPSED

END dumps the current time (again in nanoseconds) into a variable.
ELAPSED is a bit of math using the program bc; the total run time for the process script is computed here and converted to seconds.
The line where we set the temp1 variable to a couple of quantities is just putting the elapsed time into that TOA5 CSI format: column 1 is the date/time and column 2 is the total run time in seconds. The echo $temp1 >> line puts this quantity into a text file for logging and later visualization on the diagnostic page.
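
So a line appended to ss_met_process.csv ends up looking something like this (the elapsed time here is made up for illustration):

"2018-10-19 12:00:00",3.52718412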

So, everything else in that file is essentially just extra. Next I wanted to step through 'process_ss_met.sh'; that first cron script is basically the wrapper for it. The first eleven lines once again initialize some variables, in this case paths to directories and locations of important scripts. I put them at the top so that when things change from time to time there is less search & replace:

#!/bin/sh
SHELL=/bin/bash
START=`date +%s%N`
CRONTIME=$(date +'"%G-%m-%d %H:00:00",')
DATAPRO="/home/bbusey/working_files/bin/csv_utilities/datapro.py"
GLOM="/home/bbusey/working_files/bin/utilities/glom_together.py"
TRACK_DELAY="/home/bbusey/working_files/bin/utilities/track_delay.py"
DATA_CLIPPER="/home/bbusey/working_files/bin/dataclip_wrapper"
STATUS="/home/bbusey/working_files/bin/utilities/status_update.py"
ROOT_DIR="/home/bbusey/working_files/ss_met"

Lines 16-23 in 'process_ss_met.sh' are basically DataPro running through the data in several different data tables. Here's one:

python $DATAPRO --key_file=$ROOT_DIR/config/ss-met_AT_key_2012-07.txt
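
The rest of that block is the same call repeated with different key files; just as a sketch (the additional key file names here are hypothetical), it amounts to something like:

# hypothetical key file names -- the real script spells out each call on its own line
for KEY in ss-met_AT_key_2012-07.txt ss-met_soil_key_2012-07.txt ss-met_rad_key_2012-07.txt
do
    python $DATAPRO --key_file=$ROOT_DIR/config/$KEY
done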

With DataPro there are two important files: this key file and the csv parameter file. If you pop open this key file you'll see that it roughly covers:

  • where the input data file is (I call the *.dat file from LoggerNet the raw data). So, this is the raw data straight from LoggerNet.
  • the location & file name of the csv parameter file. The parameter file basically maps the many columns of the raw data to a series of output file names (after processing). It also includes coefficients for a number of functions that can be applied to the data, as well as some QA thresholds.
  • the directory where this processed data should be dumped into.
  • whether it is a CR10X array-based logger or a table-based logger is also specified here. Either works, but if you have a CR10X outputting julian days etc. then you'll need to specify the Array ID in the key file.

With the processing complete, the first of the additional diagnostic parameters is computed. I have a utility called TRACK_DELAY which does what it sounds like: it looks at the current time and at the last date in the output file and computes the difference. The quantity is tracked like a bunch of the others and also shows up on the diagnostic web page. I think this is the one that I watch the most.

python $TRACK_DELAY --infile=$ROOT_DIR/outputs/battery.csv --timezone=Etc/GMT+0

So, once DataPro has run through the raw data from each of the .dat files logged by this logger, there is a bit more data management that takes place. On line 27 you'll see:

$DATA_CLIPPER $ROOT_DIR/ _3wks.csv 4 2016

Data clipper trims the output data to a header plus, in this case, 2016 lines of data, roughly the most recent 3 weeks (for 15-minute data that works out to 3 weeks x 7 days x 24 hours x 4 records per hour = 2016 lines). So, the directory structure is basically:

ss_met/config == location of the configuration files, the key file and csv parameter files
ss_met/raw == the location of the raw .dat from LoggerNet
ss_met/outputs == the location of all of the processed data for ss_met
ss_met/web == the trimmed data after running the $DATA_CLIPPER routine
ss_met/web/one/ == the files in here are just the very last line of data from the ss_met/outputs/ files. A subset of these are used by the diagnostics web page.

Lines 32 to 52 run some reformatting utilities. Basically, they group common variables together for display on a web page.

python $GLOM $ROOT_DIR/config/ss-met-glom_at.txt

Once again there is a key file here for the glom utility and a corresponding csv parameter file.

Lines 55 & 56 transfer the processed data to the IARC web page for public viewing.
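
Those transfer lines aren't reproduced here, but they do something in the spirit of the following (the user, host, and destination path are placeholders):

rsync -a $ROOT_DIR/web/ username@webserver.example.edu:/var/www/html/data/ss_met/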

Line 58 runs a short utility which goes through all of the data files and counts up all of the bad data points. Tracking this over time, if I'm paying attention, I can see when the automated QA/QC captures something funky.

python $STATUS $ROOT_DIR/outputs/ $ROOT_DIR/outputs/data_status.csv

Finally, lines 60-64 wrap up the processing bash script, computing the total time again as was done for the cron job:

END=`date +%s%N`
ELAPSED=`echo "scale=8; ($END - $START) / 1000000000" | bc`
echo $ELAPSED
temp1=$CRONTIME$ELAPSED
echo $temp1 >> $ROOT_DIR/outputs/site_process.csv

I think that's basically all of the processing that takes place. I didn't mention it above, but there are a number of data manipulation functions built into DataPro: unit conversions if you want them, thermistor resistance to temperature, that sort of thing. And it's pretty easy to add more if the ones that are there don't suffice (and if you do, then we can add them back into the main DataPro archive). Check out bin/csv_utilities/csv_lib/equations.py
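
A quick way to see which functions are already defined (assuming the csv_utilities checkout lives under /var/site/bin as in the run command earlier):

grep "def " /var/site/bin/csv_utilities/csv_lib/equations.py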


So, that's basically the processing component. In the web_site.zip below I included all the configuration files etc. for the super-site met as a fully reproducible example if you want, or you could also check out the files listed on the wiki.


Moving on to the web page part, here is the javascript / css / html for getting the diagnostic page up and running:

  • web_site.zip​ (see Bob if you want to try this out)

data.html is this page:
http://ngeedata.iarc.uaf.edu/data/data.html

I pretty much use a bunch of javascript to put the page together using the contents of all of those outputted data files from earlier. The files in ss_met/web/one/ are read in and populate the html table, among other things. As far as how the page comes together, check out web_site/diag.json. Here's the entry for the ss_met site again:

	{
		"StationName": "ss_met",
		"StationDescriptiveName": "SuperSite Met Station",
		"StationAltName": "ss_met",
		"DataFileNameBase": "ss_met/outputs/",
		"last_data_file": "/web/one/battery_1.csv",
		"timezone": "Etc/GMT+0"
	},

There are echoes in this configuration of the old EE Internet gp.py that was used on the old Bullen Point web page setup:
http://ine.uaf.edu/werc/projects/bullen/stations.html


I got started with this whole DataPro enterprise trying to get old CR10X array-based data files to be shown using the gp.py Python stuff that EE Internet wrote. So, with data.html, the key piece of the puzzle which starts everything running is down at the bottom of the page, the javascript on lines 85-92:

<script type="text/javascript">
    function init() { 
        make_diag_table();
        data_toggle('ss_met/web/delay_3wks.csv','Oldness @ Super Site Met');
    }

</script>


The init() function is found in:

web_site/js/interface_to_dygraph_diagpage.js

Check out line 17 of that file for the function make_diag_table(). That's pretty much where the magic happens. You'll see that there are a few hard-coded files that the javascript looks for. Some sites don't have all of these, or the battery voltage file might have a different name; I use symlinks on Linux to make sure there is still an appropriate file there.


I think that's the gist of everything. The last piece of the puzzle I guess is:

web_site/ss_met_summary.html

I write that file by hand (well, more accurately, I do a file save-as from an existing one and then edit it to reflect the new station being worked on). The file name ties into diag.json: it is the "StationAltName" from diag.json plus the suffix "_summary.html".
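
In practice that means, for a hypothetical new station called new_station, something like:

cp web_site/ss_met_summary.html web_site/new_station_summary.html
# then edit the copy by hand and add a matching "StationAltName": "new_station" entry to web_site/diag.json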