# InvenioRDM OCFL Case Study

###### tags: `OCFL`

## Roadmap

- InvenioRDM started its first OCFL Sprint on 15/11/21 and built a test repository from which files could be output via the RDM API. An OCFL snapshot was subsequently generated using the Brown University ocfl-java-http client.
- OCFL Sprint-2 started on 6/12/21. It developed the OCFL-Core Python client and completed the Invenio design for aggregating repository records and generating an OCFL root automatically, based on the Sprint-1 assets and using Simeon Warner's validator from https://github.com/zimeon/ocfl-py
- OCFLCore is available at: https://github.com/inveniosoftware/ocflcore
- There is now a preliminary OCFL export module: https://github.com/inveniosoftware/invenio-ocfl

Sprint-2 also generated an OCFL community extension proposal for support of JSONSchema, and updated this case study for OCFL GitHub.

## InvenioRDM OCFL Case Study

### Sprint-1 Progress

An InvenioRDM instance was constructed at https://ocfl.dev.data-futures.org/ using records from multiple existing *hasdai* Invenio repositories.

Extracted metadata and content were then written into an OCFL root using the Brown ocfl-java-http client (https://github.com/Brown-University-Library/ocfl-java-http). The result is available at https://github.com/data-futures/ocfl-test-data and was used as a validation fixture.

Using ocfl-java-http allowed rapid generation of a compliant OCFL root and allowed us to identify a work-plan to automate InvenioRDM snapshots into a preservable OCFL structure.

Records will initially be structured with all Invenio versions mapping to a single OCFL object. This avoids duplicating data within versions of a record - see https://github.com/OCFL/spec/issues/363

Inventories, checksums and boilerplate were all generated by Brown's Java client. We employ SHA-512: the native Invenio MD5 is verified when a file is retrieved from RDM, then the SHA-512 is calculated, and this in turn is verified by the OCFL API.

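
The checksum hand-off described above can be sketched in a few lines of Python. This is only an illustration of the verify-MD5-then-compute-SHA-512 order, not the code used by ocfl-java-http or InvenioRDM. The function name `verify_md5_then_sha512` and the sample payload are hypothetical, and the sketch assumes the reported checksum is a hex MD5 that may carry an `md5:` prefix.

```python
import hashlib


def verify_md5_then_sha512(data: bytes, reported_checksum: str) -> str:
    """Verify data against a reported MD5, then return its SHA-512 hex digest.

    Assumption: reported_checksum is a hex MD5, optionally prefixed "md5:".
    """
    expected_md5 = reported_checksum.split(":", 1)[-1].lower()
    actual_md5 = hashlib.md5(data).hexdigest()
    if actual_md5 != expected_md5:
        raise ValueError(f"MD5 mismatch: expected {expected_md5}, got {actual_md5}")
    # Only content that passed the MD5 check is digested for the OCFL inventory.
    return hashlib.sha512(data).hexdigest()


# Hypothetical payload standing in for a file retrieved from the RDM API.
payload = b"example file content"
reported = "md5:" + hashlib.md5(payload).hexdigest()
sha512_digest = verify_md5_then_sha512(payload, reported)
print(sha512_digest)
```

Checking the MD5 first means a corrupted download is rejected before a SHA-512 is ever recorded for it, so the stronger digest is only ever computed over content that matched what the repository reported.
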
### Next Steps and Issues addressed during Sprint-2

1. In order to ensure an internally consistent OCFL representation of a repository's content, we believe that the schemata in use must be preserved in the OCFL. We will propose a community extension detailing this use case.
2. The JSON metadata exported from the RDM API requires some customization to make it more suitable for long-term preservation. For example, the API returns link URLs which are unlikely to be valid in a future epoch and are therefore inappropriate for long-term preservation. Further testing is required to ensure that a complete set of metadata is preserved.
3. Fields containing values from controlled vocabularies are included in InvenioRDM's JSON export in both reference and de-referenced forms. This avoids the need to export CVs explicitly, although explicit export will be considered in future development. Example of the 'languages' field as exported:

    ```
    "languages": [
      {
        "id": "eng",
        "title": {
          "en": "English"
        }
      }
    ],
    ```

4. Since the main focus is long-term preservation of repository contents, record ownership information is preserved as part of the exported metadata, but is not necessarily expected to contribute to access control policies when such an archive is reused. In the case of restoring a backup, the information stored would at least allow records to be grouped / re-claimed in another epoch on another platform.
5. OCFL exports will include all 'published' as well as unpublished records (with that status preserved), together with any access restriction information. However, it is expected that policies enacted by the administrators using the archive will determine actual control of access at that time.
6. The metadata deposited in the OCFL export will be checked for completeness during February 2022, for potential inclusion in the March InvenioRDM release.
7. Further consideration will be given to storing and referencing controlled vocabularies, with a view to making the OCFL content as complete as possible.

ENDS