Quarterly Metrics
=================

.. contents::
   :depth: 3

Phase 2, Quarter 1
------------------

=================== ============== ==============
Metric              Q1 Value       Q2 Value**
=================== ============== ==============
Start Date          2014-10-01     2015-01-01
End Date            2014-12-31     2015-03-31 **
Data Downloads      13,833         9,915
Data Uploads        516            22
Data Additions*     53,353         2,278
Num Member Nodes    26             26
Num ITK Tools       6              6
CN Uptime           99.9996%       99.9996%
Num data files      157,500        160,942
Data Total Bytes    1,450,075 MiB  1,609,166 MiB
Num metadata files  212,668        216,770
=================== ============== ==============

\* *Data Additions* indicates the number of new data objects added to
Member Nodes as recorded by the Coordinating Nodes through the
synchronization process. This number differs from the number of Data
Uploads because Member Nodes may choose to add content to their data
repository through mechanisms other than the DataONE service interfaces.
By definition, this is necessary for Tier 1 Member Nodes, as they do not
support the ability to write content through the DataONE service
interfaces (supported by Tier 3 and higher).

\** Preliminary metrics, covering the period from 2015-01-01 through
2015-02-27 at about 09:30 ET.


Gathering Metrics for Monthly and Quarterly Reports
---------------------------------------------------

The metrics to be reported on are listed in the Project Execution Plan
and are summarized in the `Phase 2 Metrics Worksheet`_.
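Most of these metrics are gathered by issuing Solr queries against the CN
REST endpoints; in the sections below, each raw query also appears
percent-encoded as a request URL. As a minimal sketch of that encoding
step (the ``urlencode`` helper is illustrative, not part of any DataONE
tooling):

```shell
# Illustrative helper, not part of DataONE tooling: percent-encode a raw
# solr query for use in a request URL. Unreserved characters pass
# through; everything else (space, ':', '[', ']', ...) becomes %XX.
urlencode() {
  local s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:$i:1}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;                # unreserved: copy as-is
      *) printf -v c '%%%02X' "'$c"; out+="$c" ;;  # everything else: %XX
    esac
  done
  printf '%s\n' "$out"
}

urlencode "formatType:DATA AND event:read"
# formatType%3ADATA%20AND%20event%3Aread
```

This encoder also escapes parentheses, which the URLs in this document
leave literal; both forms are accepted by the endpoint.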
For Phase 2, Q1 the metrics to be reported by CI are::

  Data Downloads
  Data Uploads
  Number of Member Nodes
  Number of tools in Investigator Toolkit
  Uptime of CNs
  Number of data files available
  Total size of data files available
  Number of metadata files available

Additions for Phase 2, Q2 include::

  Search Events
  Number of users that enter queries only
  Number of users that access the data repositories


Data Downloads
~~~~~~~~~~~~~~

Interpreted as the number of "get" requests (logged as ``read`` events)
made for objects with an object format classified as DATA.

The log aggregation solr end point is::

  https://cn.dataone.org/cn/v1/query/logsolr/select

A query to retrieve the number of read events for format type of DATA
is::

  formatType:DATA AND event:read

A query to retrieve the number of read events for format type of DATA
over the last 91 days is::

  formatType:DATA AND event:read AND dateLogged:[NOW-91DAY TO NOW]

A query to retrieve the number of read events for format type of DATA
within a specific time range is::

  formatType:DATA AND event:read AND dateLogged:[2014-10-01T00:00:00.000Z TO 2014-12-31T23:59:59.999Z]

It is also necessary to ignore requests from systems internal to DataONE
such as the Coordinating Nodes.
This can be done by filtering out the IP addresses of the CNs
(128.111.54.80, 160.36.13.150 and 64.106.40.6) in the query by adding
the clause::

  AND -ipAddress:(128.111.54.80 OR 160.36.13.150 OR 64.106.40.6)

The complete query would then be::

  formatType:DATA AND event:read AND dateLogged:[2014-10-01T00:00:00.000Z TO 2014-12-31T23:59:59.999Z] AND -ipAddress:(128.111.54.80 OR 160.36.13.150 OR 64.106.40.6)

Expressed as a URL::

  https://cn.dataone.org/cn/v1/query/logsolr/select?q=formatType%3ADATA%20AND%20event%3Aread%20AND%20dateLogged%3A%5B2014-10-01T00%3A00%3A00.000Z%20TO%202014-12-31T23%3A59%3A59.999Z%5D%20AND%20-ipAddress%3A(128.111.54.80%20OR%20160.36.13.150%20OR%2064.106.40.6)

CURL::

  curl -k -s "https://cn.dataone.org/cn/v1/query/logsolr/select?q=formatType%3ADATA%20AND%20event%3Aread%20AND%20dateLogged%3A%5B2014-10-01T00%3A00%3A00.000Z%20TO%202014-12-31T23%3A59%3A59.999Z%5D%20AND%20-ipAddress%3A(128.111.54.80%20OR%20160.36.13.150%20OR%2064.106.40.6)" \
    | xml sel -t -m "//result" -v "@numFound" -n


Data Uploads
~~~~~~~~~~~~

Interpreted as the number of "create" requests (logged as ``create``
events) made for objects with an object format classified as DATA.
Analyzing these log events indicates the number of data uploads made
through the DataONE service interfaces. Member Nodes may also have
alternative mechanisms for populating their data repositories; content
added in this manner is not reflected in the logs, though it can be
determined by querying DataONE for content that was newly added over the
time period in question.
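The gap between these two counts is visible in the summary table above:
for Q1, synchronization recorded 53,353 new data objects while only 516
create events were logged through the service interfaces. A quick sketch
of the relationship, using those Q1 values:

```shell
# Q1 values from the summary table above.
ADDITIONS=53353   # new DATA objects recorded via CN synchronization
UPLOADS=516       # create events logged through the service interfaces

# Objects that arrived through node-local mechanisms (e.g. Tier 1 nodes
# that cannot accept writes through the DataONE interfaces):
OTHER=$(( ADDITIONS - UPLOADS ))
echo "$OTHER"
# 52837
```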
A query to retrieve the number of create events for format type of DATA
within a specific time range, excluding the CNs (and so determine
uploads made through the DataONE service interfaces)::

  formatType:DATA AND event:create AND dateLogged:[2014-10-01T00:00:00.000Z TO 2014-12-31T23:59:59.999Z] AND -ipAddress:(128.111.54.80 OR 160.36.13.150 OR 64.106.40.6)

Alternatively, a query to retrieve the number of new DATA objects added
within a specific time range (querying against the query/solr search end
point)::

  dateUploaded:[2014-10-01T00:00:00.000Z TO 2014-12-31T23:59:59.999Z] AND formatType:DATA

URL for log records::

  https://cn.dataone.org/cn/v1/query/logsolr/select?q=formatType%3ADATA%20AND%20event%3Acreate%20AND%20dateLogged%3A%5B2014-10-01T00%3A00%3A00.000Z%20TO%202014-12-31T23%3A59%3A59.999Z%5D

URL for new objects (with optional grouping by node)::

  https://cn.dataone.org/cn/v1/query/solr/?q=dateUploaded:%5B2014-10-01T00:00:00.000Z+TO+2014-12-31T23:59:59.999Z%5D+AND+formatType:DATA&facet=true&facet.field=datasource&rows=0

CURL for log records::

  curl -k -s "https://cn.dataone.org/cn/v1/query/logsolr/select?q=formatType%3ADATA%20AND%20event%3Acreate%20AND%20dateLogged%3A%5B2014-10-01T00%3A00%3A00.000Z%20TO%202014-12-31T23%3A59%3A59.999Z%5D" \
    | xml sel -t -m "//result" -v "@numFound" -n

CURL for new content::

  curl -k -s "https://cn.dataone.org/cn/v1/query/solr/?q=dateUploaded:%5B2014-10-01T00:00:00.000Z+TO+2014-12-31T23:59:59.999Z%5D+AND+formatType:DATA&facet=true&facet.field=datasource&rows=0" \
    | xml sel -t -m "//result" -v "@numFound" -n


Number of Member Nodes
~~~~~~~~~~~~~~~~~~~~~~

This is the number of node entries of type "mn" in the node list
reported by the CN.

CURL::

  curl -k -s "https://cn.dataone.org/cn/v1/node" \
    | xml sel -t -m "//node[@type='mn']" -v "identifier" -n | wc -l


Number of tools in Investigator Toolkit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This number needs to be manually counted and should include packages
intended for public use.
- Java libclient
- Python libclient
- Python CLI
- ONEMercury
- ONE-R
- Member Node Dashboard


Uptime of CNs
~~~~~~~~~~~~~

Currently this is recorded manually, with the assumption that operations
started on July 1, 2012 at 12:00 UTC. Uptime is calculated as the
percentage of the total duration that systems have been operational.
This is::

  uptime = operationalPeriod / totalPeriod
         = (totalPeriod - downTime) / totalPeriod

There have been two downtime events logged:

- 30 seconds: configuration error during upgrade
- 5 minutes: undetected switch failure

Thus uptime can be calculated on the command line (BSD ``date`` syntax,
as on OS X) with::

  EVENTS=(30 300)
  T0=$(date -j -u -f %Y-%m-%d-%H-%M-%S 2012-07-01-12-00-00 +%s)
  T1=$(date -j -u -f %Y-%m-%d-%H-%M-%S 2014-12-31-23-59-59 +%s)
  PERIOD=$( bc <<< "$T1-$T0" )
  DOWNTIME=$( IFS="+"; bc <<< "${EVENTS[*]}" )
  UPTIME=$( bc <<< "scale=5;100.0*($PERIOD-$DOWNTIME)/$PERIOD" )
  echo $UPTIME


Number of Data Files
~~~~~~~~~~~~~~~~~~~~

The number of data files can be determined by querying the search index
for the number of objects with formatType of DATA. Note that to get an
actual total, the request needs to be authenticated as a CN or
equivalent trusted entity.

.. note::

   ``curl`` on OS X does not work with client certificates. It is
   necessary to install a new version or run the command from a Linux
   system.
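The commands in this and the following sections read the count from the
``numFound`` attribute of the Solr response. When ``xml`` (XMLStarlet)
is not available, a plain ``sed`` extraction works as well; a sketch
against a sample response (the XML snippet is illustrative, with the
count set to the Q1 table value):

```shell
# Sample solr response snippet (illustrative); real responses come from
# the query/solr endpoint.
RESPONSE='<response><result name="response" numFound="157500" start="0"></result></response>'

# Pull the numFound attribute with sed instead of xmlstarlet.
NUM=$(printf '%s' "$RESPONSE" | sed -n 's/.*numFound="\([0-9]*\)".*/\1/p')
echo "$NUM"
# 157500
```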
Solr query::

  formatType:DATA

URL::

  https://cn.dataone.org/cn/v1/query/solr/?q=formatType:DATA&rows=0

Command line for public content::

  curl -k -s "https://cn.dataone.org/cn/v1/query/solr/?q=formatType:DATA&rows=0" \
    | xml sel -t -m "//result[@name='response']" -v "@numFound" -n

Command line for all content::

  curl -k -s --cert cnode.pem "https://cn.dataone.org/cn/v1/query/solr/?q=formatType:DATA&rows=0" \
    | xml sel -t -m "//result[@name='response']" -v "@numFound" -n


Total Size of Data Files
~~~~~~~~~~~~~~~~~~~~~~~~

The total size of data files can be determined by querying the search
index for the number of objects with formatType of DATA and including a
request for summary statistics on the size field. Note that to get an
actual total, the request needs to be authenticated as a CN or
equivalent trusted entity.

Solr query::

  formatType:DATA&rows=0&stats=true&stats.field=size

URL::

  https://cn.dataone.org/cn/v1/query/solr/?q=formatType:DATA&rows=0&stats=true&stats.field=size

Command line::

  URL="https://cn.dataone.org/cn/v1/query/solr/?q=formatType:DATA&rows=0&stats=true&stats.field=size"
  BYTES=$(curl -k -s --cert cnode.pem "${URL}" \
    | xml sel -t -m "//lst[@name='size']" -v "double[@name='sum']" -n)
  python -c "print('%.0f MiB' % ($BYTES / 1048576))"


Number of Metadata Files
~~~~~~~~~~~~~~~~~~~~~~~~

The number of metadata files can be determined by querying the search
index for the number of objects with formatType of METADATA. Note that
to get an actual total, the request needs to be authenticated as a CN or
equivalent trusted entity.
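Returning to the size statistic from the previous section: the ``sum``
value is reported in bytes and converted to MiB for the summary table.
An awk-based alternative to the Python one-liner (a sketch; the byte
count here is an illustrative value, not a reported figure):

```shell
BYTES=1610612736   # illustrative: 1.5 GiB expressed in bytes
# 1 MiB = 1048576 bytes; round to the nearest whole MiB as in the table.
MIB=$(awk -v b="$BYTES" 'BEGIN { printf "%.0f", b / 1048576 }')
echo "${MIB} MiB"
# 1536 MiB
```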
Solr query::

  formatType:METADATA

URL::

  https://cn.dataone.org/cn/v1/query/solr/?q=formatType:METADATA&rows=0

Command line::

  curl -k -s --cert cnode.pem "https://cn.dataone.org/cn/v1/query/solr/?q=formatType:METADATA&rows=0" \
    | xml sel -t -m "//result[@name='response']" -v "@numFound" -n


Search Events
~~~~~~~~~~~~~

There are two common entry points for searching the CNs: one is through
the ONEMercury interface, the other is through the search index using
the REST query or search calls. Since the ONEMercury interface also uses
the REST search interface, it should be possible to use the log of those
actions as a complete representation of search events.


Number of users that enter queries only
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Number of users that access the data repositories
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Defining a "User"
-----------------


Verifying Log Record Counts
---------------------------

The following procedure is used to independently verify the content of
the logsolr index. In summary:

1. Obtain a list of Member Node Base URLs.
2. Retrieve log records from each Member Node for the specified period.
3. Load log records into an analysis environment that supports the
   necessary queries.
4. Perform queries to retrieve stats comparable to those provided by the
   logsolr index.


1. Member Node Base URLs
~~~~~~~~~~~~~~~~~~~~~~~~

::

  d1listnodes -b "https://cn.dataone.org/cn"

  urn:node:KNB         https://knb.ecoinformatics.org/knb/d1/mn           Knowledge Network for Biocomplexity
  urn:node:ESA         https://data.esa.org/esa/d1/mn                     ESA Data Registry
  urn:node:SANPARKS    https://dataknp.sanparks.org/sanparks/d1/mn        SANParks Data Repository
  urn:node:USGSCSAS    http://mercury-ops2.ornl.gov/clearinghouse/mn      USGS Core Sciences Clearinghouse
  urn:node:ORNLDAAC    http://mercury-ops2.ornl.gov/ornldaac/mn           ORNL DAAC
  urn:node:LTER        https://tropical.lternet.edu/knb/d1/mn             LTER Network Member Node
  urn:node:CDL         https://merritt.cdlib.org:8084/knb/d1/mn           Merritt Repository
  urn:node:PISCO       https://data.piscoweb.org/catalog/d1/mn            PISCO MN
  urn:node:ONEShare    https://oneshare.unm.edu/knb/d1/mn                 ONEShare Repository
  urn:node:mnORC1      https://mn-orc-1.dataone.org/knb/d1/mn             DataONE ORC Dedicated Replica Server
  urn:node:mnUNM1      https://mn-unm-1.dataone.org/knb/d1/mn             DataONE UNM Dedicated Replica Server
  urn:node:mnUCSB1     https://mn-ucsb-1.dataone.org/knb/d1/mn            DataONE UCSB Dedicated Replica Server
  urn:node:TFRI        https://metacat.tfri.gov.tw/tfri/d1/mn             TFRI Data Catalog
  urn:node:USANPN      https://mynpn.usanpn.org/knb/d1/mn                 USA National Phenology Network
  urn:node:SEAD        http://seadva.d2i.indiana.edu:8081/sead/rest/mn    SEAD Virtual Archive
  urn:node:GOA         https://goa.nceas.ucsb.edu/goa/d1/mn               Gulf of Alaska Data Portal
  urn:node:KUBI        https://bidataone.nhm.ku.edu/mn                    University of Kansas - Biodiversity Institute
  urn:node:LTER_EUROPE https://data.lter-europe.net/knb/d1/mn             LTER Europe Member Node
  urn:node:DRYAD       https://datadryad.org/mn                           Dryad Digital Repository
  urn:node:CLOEBIRD    https://dataone.ornith.cornell.edu/metacat/d1/mn   Cornell Lab of Ornithology - eBird
  urn:node:EDACGSTORE  https://gstore.unm.edu/dataone/                    EDAC Gstore Repository
  urn:node:IOE         https://data.rcg.montana.edu/catalog/d1/mn         Montana IoE Data Repository
  urn:node:US_MPC      https://dataone-prod.pop.umn.edu/mn                Minnesota Population Center
  urn:node:EDORA       http://mercury-ops2.ornl.gov/EDORA_MN/mn           Environmental Data for the Oak Ridge Area (EDORA)
  urn:node:RGD         http://mercury-ops2.ornl.gov/RGD_MN/mn             Regional and Global biogeochemical dynamics Data (RGD)
  urn:node:GLEON       https://poseidon.limnology.wisc.edu/metacat/d1/mn  GLEON Data Repository


2. Retrieve Log Records
~~~~~~~~~~~~~~~~~~~~~~~

For each node::

  export NODE="https://knb.ecoinformatics.org/knb/d1/mn"
  ./d1logrecords -c ../.dataone/cnode.pem \
    -C 999999 \
    -X - \
    -B 2014-10-01T00:00:00.000+00:00 \
    -D 2014-12-31T23:59:59.000+00:00 > KNB.xml


3. Put Records in an Analysis Environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The "analysis environment" used was XMLStarlet. First wrap the logEntry
elements in each file in a top level Log element::

  #!/bin/bash
  LOGRECORDS=$1
  echo "<Log>" > tmpxml
  cat $LOGRECORDS >> tmpxml
  echo "</Log>" >> tmpxml
  mv tmpxml $LOGRECORDS

Now count the read and create events. Note, however, that the DATA
restriction cannot be applied here, as that requires a join with another
data resource::

  #!/bin/bash
  LOGRECORDS=$1
  nread=$(cat $LOGRECORDS | xml sel -t -m "//Log" \
    -v "count(logEntry[event/text()='read'])" -n)
  ncreate=$(cat $LOGRECORDS | xml sel -t -m "//Log" \
    -v "count(logEntry[event/text()='create'])" -n)
  echo $LOGRECORDS $nread $ncreate

Note that the CNs can be excluded from the counts by using::

  count(logEntry[event/text()='read' \
    and not(contains('128.111.54.80 160.36.13.150 64.106.40.6', ipAddress))])

With the results (nodes not listed returned 0 for both read and
create)::

  NODE      Read    Create
  CDL       33980   16805
  CLOEBIRD  68      0
  DRYAD     101327  0
  EDAC      0       0
  ESA       814     0
  GOA       1388    1
  KNB       109733  19
  LTER      163151  100
  ONEShare  49      0
  PISCO     1995    1
  TFRI      4185    29
  USANPN    23      0
  US_MPC    2597    1032
  mnORC1    1145    28
  mnUCSB1   1235    35
  mnUNM1    1169    29

Equivalent values using logsolr would be found using the query::

  nodeId:urn\:node\:CDL AND event:read AND dateLogged:[2014-10-01T00:00:00.000Z TO 2014-12-31T23:59:59.000Z]

URL::

  https://cn.dataone.org/cn/v1/query/logsolr/select?q=nodeId%3Aurn%5C%3Anode%5C%3ACDL%20AND%20event%3Aread%20AND%20dateLogged%3A%5B2014-10-01T00%3A00%3A00.000Z%20TO%202014-12-31T23%3A59%3A59.000Z%5D

Script::

  #!/bin/bash
  FNAME=$(basename "$1")
  NID="${FNAME##*.}"
  nread=$(curl -k -s "https://cn.dataone.org/cn/v1/query/logsolr/select?q=nodeId%3Aurn%5C%3Anode%5C%3A${NID}%20AND%20event%3Aread%20AND%20dateLogged%3A%5B2014-10-01T00%3A00%3A00.000Z%20TO%202014-12-31T23%3A59%3A59.000Z%5D" \
    | xml sel -t -m "//result" -v "@numFound" -n)
  ncreate=$(curl -k -s "https://cn.dataone.org/cn/v1/query/logsolr/select?q=nodeId%3Aurn%5C%3Anode%5C%3A${NID}%20AND%20event%3Acreate%20AND%20dateLogged%3A%5B2014-10-01T00%3A00%3A00.000Z%20TO%202014-12-31T23%3A59%3A59.000Z%5D" \
    | xml sel -t -m "//result" -v "@numFound" -n)
  echo $FNAME $nread $ncreate


Scripts
-------

.. include:: metrics_report.sh
   :literal:


References
----------

.. _Phase 2 Metrics Worksheet: https://docs.google.com/spreadsheets/d/1bRUyK7Xat88ywDkfa5Py03ytMxxTVsn86KCUbir1TlI/edit#gid=0