Member Nodes
============

:Revision History:

  - 20091115 Vieglais.  General edits
  - 20091110 Vieglais.  Pulled together content from various sources

.. contents::

Introduction
------------

This document is an amalgamation of various sources including the original
DataONE proposal, notes taken at various VDC meetings, the draft document
"DataONE Member Nodes: Characteristics and Processes" by Sandusky and Vision
(initiated at the Virtual Data Center coordination meeting held in Durham, NC
August 11-13, 2009) and other miscellaneous notes. 

This document summarizes information about DataONE Member Nodes (MN),
including why organizations like libraries, research centers, universities,
museums, non-profit organizations, and governments may wish to participate by
hosting a DataONE MN; characteristics of MNs; roles and responsibilities for
MNs in the DataONE context; a MN directory and implementation schedule; and
processes for hosting a MN.

Why DataONE Includes Member Nodes 
---------------------------------

DataONE MNs are an essential part of the DataONE architecture and provide a
number of benefits. MNs are:

- A key interface to DataONE users and stakeholders

  MNs provide a direct means of engagement between organizations and the
  larger DataONE enterprise at a variety of scales and levels of commitment,
  ranging from moderately scaled servers housed in laboratories to more
  complex data archives hosted by scientific collaboratives. MNs may be large
  and complex, on the scale of discipline-specific data repositories, or
  relatively simple, such as a discipline-agnostic repository hosted by an
  academic research library. Organizations hosting MNs become part of the
  DataONE community and are welcome to participate in the International Users
  Group or contribute to one or more of the technical or community engagement
  working groups.

- Fundamental to supporting access, discovery, and re-use of scientific data

  MNs are the primary site of user interaction with DataONE services.
  Scientists, students, and citizens interact with MNs through software tools
  that utilize standardized interfaces. This supports many different usage
  scenarios, such as data and metadata management and replication, analysis
  and modeling, and scientific workflow systems. Data and metadata requested
  from DataONE – as the result of a query, for example – may be provided from
  any of the MNs that hold a copy of that data and metadata. Replicating
  multiple copies of data across geographic locations and a range of
  technologies is important for the long-term (many decades to centuries)
  preservation and curation of scientific data.

- Essential for the replication and persistence of data 

  DataONE is designed to be a scalable system distributed across six
  continents and composed of a relatively large number of MNs, potentially
  numbering in the hundreds, which are hosted by a diverse set of independent
  organizations, including research libraries, non-profit and for-profit
  research centers, governmental bodies, non-governmental organizations,
  museums, colleges and universities, and non-profit organizations. Together,
  the MNs in DataONE hold enough geographically distributed copies of data
  and its associated metadata to support its persistence. Data and metadata
  at MNs will be subject to ongoing automated inspection to ensure the content
  remains stable and in a viable format.

- Key to sustainability

  MNs are part of DataONE’s sustainability strategy: by providing a relatively
  small fraction of the technical and organizational capacity needed for
  trustworthy stewardship of scientific data, each independent organization
  contributes to the overall sustainability of DataONE. DataONE thus avoids
  over-reliance on any one type of organization or on organizations
  concentrated in a limited geographic area.

- Essential to scalability


- Key to community engagement
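
The automated inspection mentioned above under replication and persistence
could take many forms. The following is a minimal sketch of one common
approach, comparing a recorded checksum against each replica. The node names,
the sample content, and the choice of SHA-256 are illustrative assumptions and
are not taken from the DataONE design.

.. code-block:: python

    import hashlib


    def checksum(content: bytes, algorithm: str = "sha256") -> str:
        """Return the hex digest of ``content`` for the named hashlib algorithm."""
        return hashlib.new(algorithm, content).hexdigest()


    def find_damaged_replicas(expected: str, replicas: dict) -> list:
        """Return the names of nodes whose copy no longer matches ``expected``."""
        return sorted(
            node for node, content in replicas.items()
            if checksum(content) != expected
        )


    # Hypothetical object and replica holders, for illustration only.
    original = b"site,year,count\nA,2009,42\n"
    recorded = checksum(original)  # stored when the object is first registered
    replicas = {
        "mn-a": original,
        "mn-b": original,
        "mn-c": b"site,year,count\nA,2009,4\n",  # simulated corruption
    }
    damaged = find_damaged_replicas(recorded, replicas)
    print(damaged)  # ['mn-c']

A node flagged this way could then be refreshed from any replica that still
matches the recorded checksum.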


Organizations are motivated to host MNs for diverse reasons.

- Serve organization needs (data preservation/availability)

While many important data repositories have been built to serve particular
scientific communities (e.g., the Virtual Observatory; the Long Term
Ecological Research Network; the National Snow and Ice Data Center), the
long-term sustainability, range of services, and discoverability of the
holdings of any single repository are limited by the resources available to
it. The DataONE architecture allows individual repositories like these to
also serve as DataONE MNs and thus take advantage of unique and leading-edge
DataONE services, such as robust data preservation and curation, wide
geographic distribution, and federated access and discovery.


- Serve range of user needs

- Protect, preserve, recognize, 

- Be a leader

For many organizations, hosting a DataONE MN is a relatively simple way to be
involved in the long-term preservation of humankind’s scientific record.
Academic libraries, for example, have moved from being place-based services
reliant on large collections of physical materials to providing locally
optimized services that are dominated by digitized information and that
prioritize unique, local collections and faculty expertise. Providing
long-term, discipline-agnostic access to scientific data is a natural
progression for libraries, and it gives DataONE’s organizational ecosystem a
stable species, supporting DataONE’s sustainability goals.


- Reduce organizational costs

Providing robust digital preservation and curation services for scientific
data is expensive, and sufficient resources do not exist to provide these
kinds of services at every museum, library, or research center. An
organization can host a MN optimized for interaction with its community for
data deposit, management, and access without needing the resources to
provide complete and robust solutions for all aspects of the data curation
challenge. Instead, each MN can rely upon centrally defined and managed
services for trustworthy data and metadata replication, federated access,
provenance, etc.


- Deliver more capabilities

- Leverage expertise

Characteristics of Member Nodes
-------------------------------

An organization wishing to host a MN must meet certain minimum requirements in
order to be considered for membership in DataONE. This section summarizes sets
of minimum and ideal MN characteristics.

Minimum Requirements
~~~~~~~~~~~~~~~~~~~~

A prospective MN:

- Must (1) currently host data and metadata collections, (2) be willing to
  provide computing and storage capacity for data and metadata generated by
  the larger DataONE community, or both

- Must have basic cyberinfrastructure capabilities (e.g., server, storage,
  network / communications, physical and administrative security, capacity to
  administer the cyberinfrastructure)

- Must serve a community of practice (e.g., a campus, a set of data users, a
  discipline or sub-discipline)

- Must demonstrate its institutional commitment to creating and maintaining a
  MN

- Must provide a key point of contact

- Must have the ability to contribute resources (e.g., people to engage in
  DataONE user or working groups or design, implementation, testing
  activities; cyberinfrastructure, including storage space, etc.; data and
  metadata from one or more disciplines appropriate for inclusion in DataONE;
  direct financial support; etc.)

- Must have or be willing to adopt explicit policies on data access and use
  that are consistent with DataONE policies

- Must agree to replicate other MN data and metadata


Ideal Characteristics
~~~~~~~~~~~~~~~~~~~~~

A prospective MN:

- Should provide robust user authentication services consistent with DataONE
  standards

- Should use or adopt accepted standards for metadata and for globally unique
  identifiers (GUIDs) for data and metadata

- Should be willing to contribute resources, whether financial or in-kind, to
  participate in DataONE

- Should have or adopt well-defined and well-documented data management
  policies and procedures consistent with DataONE policies and procedures

- Should have a record of providing expertise and leadership in a geographic
  or disciplinary community

- Should be willing to integrate elements of the DataONE Investigator Toolkit
  into its MN

Types of Member Nodes
~~~~~~~~~~~~~~~~~~~~~

[Distinguish between member nodes that are primary sources of scientific data
(Dryad, DAAC, LTER sites) and those that are simply providing additional
storage and greater geographic distribution (think: LOCKSS box) or collocated
with an Institutional Repository (based on DSpace at a University library); I
think Ryan Scherle referred to this as “blind data space” in his meeting
notes. There may be other types, but I can’t think of any.]


Member Node Services
~~~~~~~~~~~~~~~~~~~~


Potential Member Nodes (Year 1 Focus)
-------------------------------------

This section lists some potential Member Nodes, mostly drawn from the proposal
and from meeting notes.


From the proposal text:

.. pull-quote::

  As part of the sustainability of DataNetONE, the storage of data sets will be
  distributed across Member Nodes. One copy will typically reside at the
  originating Member Node, and replicas will be created at two or more other
  locations, such as other Member Nodes, the Coordinating Nodes, commercial
  providers such as Amazon S3, and the rapidly evolving world of cloud storage
  such as the planned Google Science Storage capability. Member Nodes will
  include the **broad array of science stakeholders, including University
  libraries, research networks like the Long Term Ecological Research Network
  and the Organization of Biological Field Stations, synthesis centers like
  NCEAS and NESCent, government agencies like the USGS and NASA, and emerging
  environmental observatories like NEON, WATERS, and OOI**. Many institutions have
  a substantial investment in existing data management infrastructure, so the
  requirements for participating as a Member Node will be specified as a set of
  Service Interfaces that must be implemented at each Member Node. This allows
  the nodes to either utilize their existing infrastructure by providing the
  required Service Interface implementations, or they can deploy the Member Node
  software stack provided by DataNetONE to create these services. Thus, each
  node may provide a different subset of the services depending on the needs of
  their clients and their existing infrastructure. Also, each Member Node will
  be scaled according to the needs of its client base, and will contribute
  differing levels of resources to the DataNetONE collaboration. Smaller nodes
  such as an individual field research station may only provide modest storage
  resources, but DataNetONE gains significantly in the aggregate from many such
  nodes. Larger initiatives such as observatories and agencies will bring
  substantial resources and will benefit from the coordination with the rest of
  the science community that DataNetONE provides. DataNetONE envisions twelve or
  more Member Nodes throughout the world by year 3 of the project and
  anticipates accelerated growth thereafter.
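
The replication pattern described in the proposal text (one copy at the
originating Member Node plus replicas at two or more other locations) can be
pictured with a small selection routine. This is only a sketch: the ``Node``
fields, the region labels, and the preference for regions not yet used are
assumptions made for illustration, not part of the proposal.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass(frozen=True)
    class Node:
        """Hypothetical description of a node that can hold replicas."""
        name: str
        region: str
        free_gb: int


    def choose_replica_targets(origin, candidates, size_gb, copies=2):
        """Pick ``copies`` nodes other than ``origin`` with enough free space.

        Nodes in regions not already holding a copy are preferred, and ties
        are broken by the amount of free storage.
        """
        eligible = sorted(
            (n for n in candidates
             if n.name != origin.name and n.free_gb >= size_gb),
            key=lambda n: -n.free_gb,
        )
        used_regions = {origin.region}
        chosen = []
        # First pass: spread copies across regions that hold no copy yet.
        for node in eligible:
            if len(chosen) < copies and node.region not in used_regions:
                chosen.append(node)
                used_regions.add(node.region)
        # Second pass: fill any remaining slots from the other eligible nodes.
        for node in eligible:
            if len(chosen) < copies and node not in chosen:
                chosen.append(node)
        if len(chosen) < copies:
            raise ValueError("not enough eligible nodes for the requested copies")
        return chosen


    origin = Node("field-station", "us-west", 50)
    candidates = [
        origin,
        Node("library", "us-west", 500),
        Node("observatory", "us-east", 2000),
        Node("agency", "europe", 1000),
        Node("small-lab", "asia", 1),
    ]
    targets = choose_replica_targets(origin, candidates, size_gb=10)
    print([n.name for n in targets])  # ['observatory', 'agency']

In this example the small node is skipped only because it lacks space for
this particular object; it could still hold smaller objects, which matches the
proposal's point that modest nodes contribute in the aggregate.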


Notes from 20090112 Meeting
---------------------------

The discussion focused on Metacat_, Mercury_, and DSpace_ as potential member node types.


Mercury
~~~~~~~

(a) ID: free-text string, internal, based on hashed metadata; at dataset
    level uses a DOI (for ORNL DAAC). NBII/LTER records are completely
    replaced on each update

(b) Query syntax: HTTP GET

    - assumes client is web page 

    - or as RSS query 

    - internal API could be exposed as a web service 

    - expose Lucene index directly to queries 

    - free text 

    - some boolean combinations allowed, mostly ANDing values together; query
      is key/value pairs in URL

(c) CRUD: none to external clients

(d) harvest:

    - OAI-PMH, ftp push or pull, configured centrally, can harvest from SOAP,
      can crawl a web site

    - has internal parsers for metadata extraction from content, some
      transformation/embedding of metadata into record

    - no harvest of data

(e) replication: none; OAI-PMH could be leveraged; Google site index

(f) federated identity and authentication: metadata editor has
    internal user database; used Apache Pluggable authentication

(g) access control: metadata is open, data is controlled by submitter
    (external to Mercury)

(h) object retrieval type: for metadata, get the XML; can get an XSLT
    transform

(i) metadata standards: support multiple, map to custom federation schema
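
The notes describe Mercury queries as HTTP GET requests whose key/value pairs
are carried in the URL and mostly ANDed together. A minimal sketch of building
such a URL follows. The base URL and the parameter names ``keyword`` and
``source`` are hypothetical and do not reflect Mercury's actual interface.

.. code-block:: python

    from urllib.parse import urlencode

    # Hypothetical endpoint; the real Mercury query URL is not given in the notes.
    BASE_URL = "http://mercury.example.org/search"


    def build_query_url(base_url: str, **terms: str) -> str:
        """Encode each key/value pair as one URL query parameter.

        Per the notes, the server mostly ANDs the supplied values together,
        so adding a term narrows the result set.
        """
        return f"{base_url}?{urlencode(sorted(terms.items()))}"


    url = build_query_url(BASE_URL, keyword="soil moisture", source="ORNL DAAC")
    print(url)
    # http://mercury.example.org/search?keyword=soil+moisture&source=ORNL+DAAC

Sorting the terms is only a convenience that makes the generated URL
deterministic; it has no bearing on how the server combines them.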


Metacat
~~~~~~~

(a) authority/namespace/id/rev --> LSID

    - full versioning of metadata and data

    - harvest and replication are ID-aware, only transfer changed items

(b) Query syntax: EarthGrid Query spec and web service SOAP API,
    fully boolean combination queries, namespace aware, language neutral
    client libraries; also supports DiGIR/SRB/GEON

(c) CRUD: EarthGrid Get/Put/Delete web service API; internal API;
    client language libraries

(d) harvest: custom harvest protocol based on GUID; sites
    registration protocol; own XML schema; metadata and data

(e) replication: custom replication; supports event-based triggers,
    timed synchronization; data and metadata separately

(f) federated identity and authentication: distributed LDAP with
    referrals

(g) access control: ACL (rwx permissions on data and metadata)

(h) object retrieval type: data (binary blob); metadata format is at
    client request, defaults to XML, can use XSLT to get other formats

(i) metadata standards: support any XML, client mapping in query
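
Item (a) maps the authority, namespace, id, and revision parts of a Metacat
identifier onto an LSID, and notes that harvest and replication transfer only
changed items. The sketch below shows one way this could look, using the
standard ``urn:lsid:authority:namespace:object:revision`` layout. Treating the
revision as an integer, the ``example.org`` authority, and the comparison
logic are assumptions for illustration, not Metacat's implementation.

.. code-block:: python

    from typing import NamedTuple


    class ObjectId(NamedTuple):
        authority: str
        namespace: str
        object_id: str
        revision: int  # assumed numeric so that revisions can be compared

        def to_lsid(self) -> str:
            return (f"urn:lsid:{self.authority}:{self.namespace}:"
                    f"{self.object_id}:{self.revision}")


    def parse_lsid(lsid: str) -> ObjectId:
        """Split a revisioned LSID into its four identifying parts."""
        parts = lsid.split(":")
        if len(parts) != 6 or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
            raise ValueError(f"not a revisioned LSID: {lsid!r}")
        return ObjectId(parts[2], parts[3], parts[4], int(parts[5]))


    def items_to_transfer(remote, local):
        """Return remote LSIDs that are new, or newer than the local revision."""
        newest_local = {}
        for lsid in local:
            oid = parse_lsid(lsid)
            key = oid[:3]  # identity without the revision
            newest_local[key] = max(oid.revision, newest_local.get(key, -1))
        return [
            lsid for lsid in remote
            if parse_lsid(lsid).revision > newest_local.get(parse_lsid(lsid)[:3], -1)
        ]


    local = ["urn:lsid:example.org:demo:10:1", "urn:lsid:example.org:demo:11:2"]
    remote = [
        "urn:lsid:example.org:demo:10:2",  # newer revision
        "urn:lsid:example.org:demo:11:2",  # unchanged
        "urn:lsid:example.org:demo:12:1",  # not held locally
    ]
    changed = items_to_transfer(remote, local)
    print(changed)

Only the first and third remote identifiers are returned, since the second is
already held locally at the same revision.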


DSpace
~~~~~~

(a) Handle 

    - assigned to latest version of data objects, revision history is metadata

    - Store publication DOI as well to access bibliographic metadata

(b) Query syntax: DSpace exposes an OpenURL interface, searching by
    conversion to Lucene query; PREMIS metadata fields; want to support
    SRU/SRW (maybe in a few months)

(c) CRUD: internal Java API, not exposed externally in Dryad

(d) harvest: OAI-PMH

(e) replication: LOCKSS

(f) federated identity and authentication: epersons module?  no user
    id in Dryad yet (internal accounts); PINs for modifications; could
    use LDAP or PAM but don't now

(g) access control: can embargo data;  metadata is always public

(h) object retrieval type: intermediate XML can be transformed; data
    is bitstream; mime type provided in metadata

(i) metadata standards: Dublin Core universal, can store fields from
    multiple standards, but Dryad doesn't accept these external formats
    (e.g., EML, PREMIS, Darwin Core)
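
Both the Mercury and DSpace notes list OAI-PMH as a harvest mechanism. As a
rough illustration, an incremental harvest request could be assembled as shown
below. The verb and argument names (``ListRecords``, ``metadataPrefix``,
``from``, ``set``, ``resumptionToken``) come from the OAI-PMH protocol, while
the repository base URL is a hypothetical placeholder.

.. code-block:: python

    from urllib.parse import urlencode

    # Hypothetical repository endpoint, for illustration only.
    BASE_URL = "http://repository.example.org/oai/request"


    def list_records_url(base_url, metadata_prefix="oai_dc",
                         from_date=None, set_spec=None):
        """Build an OAI-PMH ListRecords request URL.

        ``from_date`` limits the harvest to records changed on or after that
        date, which is how a harvester avoids re-fetching everything.
        """
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
        if set_spec:
            params["set"] = set_spec
        return f"{base_url}?{urlencode(params)}"


    def resume_url(base_url, token):
        """Build the follow-up request for the next page of a large result.

        In OAI-PMH the resumptionToken is an exclusive argument, so it is sent
        with the verb and nothing else.
        """
        params = {"verb": "ListRecords", "resumptionToken": token}
        return f"{base_url}?{urlencode(params)}"


    first = list_records_url(BASE_URL, from_date="2009-01-01")
    print(first)
    print(resume_url(BASE_URL, "abc123"))

The ``oai_dc`` prefix requests unqualified Dublin Core, which fits the note
that Dublin Core is the universally supported format in DSpace.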


.. _Metacat: http://knb.ecoinformatics.org/software/metacat/

.. _Mercury: http://mercury.ornl.gov/

.. _DSpace: http://www.dspace.org/