Member Nodes
============

:Revision History:
  - 20091115 Vieglais. General edits
  - 20091110 Vieglais. Pulled together content from various sources

.. contents::

Introduction
------------

This document is an amalgamation of various sources including the original
DataONE proposal, notes taken at various VDC meetings, the draft document
"DataONE Member Nodes: Characteristics and Processes" by Sandusky and Vision
(initiated at the Virtual Data Center coordination meeting held in Durham, NC
August 11-13, 2009) and other miscellaneous notes.

This document summarizes information about DataONE Member Nodes (MN),
including why organizations like libraries, research centers, universities,
museums, non-profit organizations, and governments may wish to participate by
hosting a DataONE MN; characteristics of MNs; roles and responsibilities for
MNs in the DataONE context; a MN directory and implementation schedule; and
processes for hosting a MN.

Why DataONE Includes Member Nodes
---------------------------------

DataONE MNs are an essential part of the DataONE architecture and provide a
number of benefits. MNs are:

- A key interface to DataONE users and stakeholders

  MNs provide a direct means of engagement between organizations and the
  larger DataONE enterprise at a variety of scales and levels of commitment,
  ranging from moderately scaled servers housed in laboratories to more
  complex data archives hosted by scientific collaboratives. MNs may be large
  and complex, on the scale of discipline-specific data repositories, or
  relatively simple, such as a discipline-agnostic repository hosted by an
  academic research library. Organizations hosting MNs become part of the
  DataONE community and are welcome to participate in the International Users
  Group or contribute to one or more of the technical or community engagement
  working groups.

- Fundamental to supporting access, discovery, and re-use of scientific data

  MNs are the primary site of user interaction with DataONE services.
  Scientists, students, and citizens interact with MNs through software tools
  that utilize standardized interfaces. This supports many different usage
  scenarios, such as data and metadata management and replication, analysis
  and modeling, and scientific workflow systems. Data and metadata requested
  from DataONE (as the result of a query, for example) may be provided from
  any of the MNs that hold a copy of that data and metadata. Replication of
  multiple copies of data, distributed geographically and across a range of
  technologies, is important for the long-term (ranging from many decades to
  centuries) preservation and curation of scientific data.

- Essential for the replication and persistence of data

  DataONE is designed to be a scalable system distributed across six
  continents, composed of a relatively large number of MNs (potentially
  numbering in the hundreds), which are hosted by a diverse set of independent
  organizations, including research libraries, non- and for-profit research
  centers, governmental bodies, non-governmental organizations, museums,
  colleges and universities, and non-profit organizations. The set of MNs in
  DataONE holds a sufficient number of geographically distributed copies of
  data and its associated metadata. Data and metadata at MNs will be subject
  to ongoing automated inspection to ensure the content remains stable and in
  a viable format.

- Key to sustainability

  MNs are part of DataONE’s sustainability strategy: by providing a relatively
  small fraction of the technical and organizational capacity needed for
  trustworthy stewardship of scientific data, each independent organization
  contributes to the overall sustainability of DataONE. DataONE thus avoids
  over-reliance upon certain types of organizations or organizations located
  within a limited geographic scope.

- Essential to scalability

- Key to community engagement

Organizations are motivated to host MNs for diverse reasons.
- Serve organization needs (data preservation/availability)

  While many important data repositories have been built to serve particular
  scientific communities (e.g., the Virtual Observatory; the Long Term
  Ecological Research Network; the National Snow and Ice Data Center), the
  long-term sustainability, range of services, and discoverability of the
  holdings of any single repository are limited by the resources available to
  it as an individual repository. The DataONE architecture allows individual
  repositories like these to also serve as DataONE MNs and thus take advantage
  of unique and leading-edge DataONE services, such as DataONE’s robust data
  preservation and curation services, wide geographic distribution, federated
  access and discovery services, etc.

- Serve range of user needs

- Protect, preserve, recognize,

- Be a leader

  For many organizations, hosting a DataONE MN is a relatively simple way to
  be involved in the long-term preservation of humankind’s scientific record.
  Academic libraries, for example, have moved from being place-based services
  reliant on large collections of physical materials to providing locally
  optimized services which are now dominated by digitized information
  prioritizing unique, local collections and faculty expertise. Moving into
  providing long-term, discipline-agnostic access to scientific data is a
  natural progression for libraries that simultaneously provides a stable
  species in DataONE’s organizational ecosystem, supporting DataONE’s
  sustainability goals.

- Reduce organizational costs

  Providing robust digital preservation and curation services for scientific
  data is expensive, and sufficient resources do not exist to provide these
  kinds of services at every museum, library, or research center. An
  organization can host a MN optimized for interaction with its community for
  data deposit, management, and access while avoiding the need for resources
  to provide complete and robust solutions for all aspects of the data
  curation challenge.
  Instead, each MN can rely upon centrally defined and managed services for
  trustworthy data and metadata replication, federated access, provenance,
  etc.

- Deliver more capabilities

- Leverage expertise

Characteristics of Member Nodes
-------------------------------

An organization wishing to host a MN must meet certain minimum requirements in
order to be considered for membership in DataONE. This section summarizes sets
of minimum and ideal MN characteristics.

Minimum Requirements
~~~~~~~~~~~~~~~~~~~~

A prospective MN:

- Must (1) have and currently host data and metadata collections and/or (2)
  be willing to provide computing and storage capacity for data and metadata
  generated by the larger DataONE community

- Must have basic cyberinfrastructure capabilities (e.g., server, storage,
  network / communications, physical and administrative security, capacity to
  administer the cyberinfrastructure)

- Must serve a community of practice (e.g., a campus, a set of data users, a
  discipline or sub-discipline)

- Must demonstrate its institutional commitment to creating and maintaining a
  MN

- Must provide a key point of contact

- Must have the ability to contribute resources (e.g., people to engage in
  DataONE user or working groups or design, implementation, testing
  activities; cyberinfrastructure, including storage space, etc.; data and
  metadata from one or more disciplines appropriate for inclusion in DataONE;
  direct financial support; etc.)
- Must have or be willing to adopt explicit policies on data access and use
  that are consistent with DataONE policies

- Must agree to replicate other MN data and metadata

Ideal Characteristics
~~~~~~~~~~~~~~~~~~~~~

A prospective MN:

- Should provide robust user authentication services consistent with DataONE
  standards

- Should use or adopt accepted standards for metadata and globally unique
  identifiers for data and metadata (GUIDs)

- Should be willing to contribute resources, whether financial or in-kind, to
  participate in DataONE

- Should have or adopt well-defined and well-documented data management
  policies and procedures consistent with DataONE policies and procedures

- Should have a record of providing expertise and leadership in a geographic
  or disciplinary community

- Should be willing to integrate elements of the DataONE Investigator Toolkit
  into its MN

Types of Member Nodes
~~~~~~~~~~~~~~~~~~~~~

[Distinguish between member nodes that are primary sources of scientific data
(Dryad, DAAC, LTER sites) and those that are simply providing additional
storage and greater geographic distribution (think: LOCKSS box) or collocated
with an Institutional Repository (based on DSpace at a University library); I
think Ryan Scherle referred to this as “blind data space” in his meeting
notes. There may be other types, but I can’t think of any.]

Member Node Services
~~~~~~~~~~~~~~~~~~~~

Potential Member Nodes (Year 1 Focus)
-------------------------------------

List of some potential Member Nodes, mostly drawn from the proposal and from
meeting notes.

From the proposal text:

.. pull-quote::

   As part of the sustainability of DataNetONE, the storage of data sets will
   be distributed across Member Nodes.
   One copy will typically reside at the originating Member Node, and replicas
   will be created at two or more other locations, such as other Member Nodes,
   the Coordinating Nodes, commercial providers such as Amazon S3, and the
   rapidly evolving world of cloud storage such as the planned Google Science
   Storage capability. Member Nodes will include the **broad array of science
   stakeholders, including University libraries, research networks like the
   Long Term Ecological Research Network and the Organization of Biological
   Field Stations, synthesis centers like NCEAS and NESCent, government
   agencies like the USGS and NASA, and emerging environmental observatories
   like NEON, WATERS, and OOI**.

   Many institutions have a substantial investment in existing data management
   infrastructure, so the requirements for participating as a Member Node will
   be specified as a set of Service Interfaces that must be implemented at
   each Member Node. This allows the nodes to either utilize their existing
   infrastructure by providing the required Service Interface implementations,
   or they can deploy the Member Node software stack provided by DataNetONE to
   create these services. Thus, each node may provide a different subset of
   the services depending on the needs of their clients and their existing
   infrastructure.

   Also, each Member Node will be scaled according to the needs of its client
   base, and will contribute differing levels of resources to the DataNetONE
   collaboration. Smaller nodes such as an individual field research station
   may only provide modest storage resources, but DataNetONE gains
   significantly in the aggregate from many such nodes. Larger initiatives
   such as observatories and agencies will bring substantial resources and
   will benefit from the coordination with the rest of the science community
   that DataNetONE provides. DataNetONE envisions twelve or more Member Nodes
   throughout the world by year 3 of the project and anticipates accelerated
   growth thereafter.
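
The replica placement described in the proposal text above (one copy at the
originating Member Node, replicas at two or more other locations) can be
illustrated with a minimal Python sketch. The function name
``choose_replica_targets``, the node identifiers, and the ``min_replicas``
parameter are hypothetical and are not part of any DataONE interface; an
actual placement policy would presumably also weigh geography, capacity, and
node policies.

```python
def choose_replica_targets(origin, candidates, min_replicas=2):
    """Pick replica locations for a data set whose first copy is at `origin`.

    Hypothetical illustration only: returns the first `min_replicas`
    distinct candidate locations that are not the originating node.
    """
    # The originating node already holds one copy, so it is never a target.
    seen = {origin}
    others = []
    for node in candidates:
        if node not in seen:  # skip the origin and duplicate entries
            seen.add(node)
            others.append(node)
    if len(others) < min_replicas:
        raise ValueError(
            f"need {min_replicas} replica locations, only {len(others)} available"
        )
    return others[:min_replicas]


# Hypothetical node identifiers, for illustration only.
targets = choose_replica_targets(
    "mn-field-station",
    ["mn-field-station", "mn-library", "cn-1", "mn-observatory"],
)
print(targets)
```

Raising an error when too few locations are available makes the "two or more
other locations" condition explicit instead of silently under-replicating.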
Notes from 20090112 Meeting
---------------------------

This meeting focused on Metacat_, Mercury_, and DSpace_ as potential member
node types.

Mercury
~~~~~~~

(a) ID: free text string; internal, based on hashed metadata; at dataset level
    use a DOI (for ORNL DAAC). NBII/LTER records are totally replaced on each
    update

(b) Query syntax: http GET

    - assumes client is web page
    - or as RSS query
    - internal API could be exposed as a web service
    - expose Lucene index directly to queries
    - free text
    - some boolean combinations allowed, mostly ANDing values together (query
      is key/value pairs in URL)

(c) CRUD: none to external clients

(d) harvest:

    - OAI-PMH, ftp push or pull, configured centrally, can harvest from SOAP,
      can crawl a web site
    - has internal parsers for metadata extraction from content, some
      transformation/embedding of metadata into record
    - no harvest of data

(e) replication: none; OAI-PMH could be leveraged; Google site index

(f) federated identity and authentication: metadata editor has internal user
    database; used Apache Pluggable authentication

(g) access control: metadata is open, data is controlled by submitter
    (external to Mercury)

(h) object retrieval type: metadata: get the XML; can get an XSLT transform

(i) metadata standards: support multiple, map to custom federation schema

Metacat
~~~~~~~

(a) authority/namespace/id/rev --> LSID

    - full versioning of metadata and data
    - harvest and replication are ID-aware, only transfer changed items

(b) Query syntax: EarthGrid Query spec and web service SOAP API, fully boolean
    combination queries, namespace aware, language neutral client libraries;
    also supports DiGIR/SRB/GEON

(c) CRUD: EarthGrid Get/Put/Delete web service API; internal API; client
    language libraries

(d) harvest: custom harvest protocol based on GUID; sites registration
    protocol; own XML schema; metadata and data

(e) replication: custom replication; supports event-based triggers, timed
    synchronization; data and metadata separately

(f) federated
    identity and authentication: distributed LDAP with referrals

(g) access control: ACL (rwx permissions on data and metadata)

(h) object retrieval type: data (binary blob); metadata format is at client
    request, defaults to XML, can use XSLT to get other formats

(i) metadata standards: support any XML, client mapping in query

DSpace
~~~~~~

(a) Handle

    - assigned to latest version of data objects, revision history is metadata
    - Store publication DOI as well to access bibliographic metadata

(b) Query syntax: DSpace exposes an OpenURL interface, searching by conversion
    to Lucene query; PREMIS metadata fields; want to support SRU/SRW (maybe in
    a few months)

(c) CRUD: internal Java API, not exposed externally in Dryad

(d) harvest: OAI-PMH

(e) replication: LOCKSS

(f) federated identity and authentication: epersons module? no user id in
    Dryad yet (internal accounts); PINs for modifications; could use LDAP or
    PAM but don't now

(g) access control: can embargo data; metadata is always public

(h) object retrieval type: intermediate XML can be transformed; data is
    bitstream; mime type provided in metadata

(i) metadata standards: Dublin Core universal, can store fields from multiple
    standards, but Dryad doesn't accept these external formats (e.g., EML,
    PREMIS, Darwin Core)

.. _Metacat: http://knb.ecoinformatics.org/software/metacat/
.. _Mercury: http://mercury.ornl.gov/
.. _DSpace: http://www.dspace.org/
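
Both Mercury and DSpace are noted above as harvesting via OAI-PMH. As a rough
illustration of what such a harvest request looks like, the following Python
sketch builds a ``ListRecords`` request URL. The base URL is a placeholder and
not an endpoint of any system discussed here, and the helper name
``list_records_url`` is hypothetical; ``verb``, ``metadataPrefix``, and
``resumptionToken`` are standard OAI-PMH request arguments, and ``oai_dc`` is
the unqualified Dublin Core prefix that OAI-PMH repositories are required to
support.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (illustrative helper)."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: it is sent
        # with the verb only, to fetch the next chunk of a partial response.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Placeholder endpoint, for illustration only.
url = list_records_url("http://repository.example.org/oai")
print(url)
```

A harvester would issue the first request with a metadata prefix, then keep
requesting with each returned resumption token until the repository returns
an empty one.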