Data infrastructure at The Franklin
Instruments at the Franklin have been procured from a wide variety of manufacturers and produce a vast range of file types, reflecting the different experimental techniques practiced at the institute. The generated data, which is expected to reach dozens of terabytes per day by 2022, needs to be safely archived, processed automatically, and catalogued to facilitate later retrieval. To accomplish these goals, the Franklin relies on both private and public cloud resources, combined with storage on a Ceph object store, as well as dedicated software packages and web apps.
Our main source of computing resources is an OpenStack deployment hosted by UKRI-STFC, which allows us to spin up virtual machines (VMs) running Linux. These VMs are used primarily for:
- Scicat: our data catalogue.
- Guacamole: clientless remote desktop. New VNC connections can be created on demand after spawning new (GPU-powered) VMs with Docker containers designed for specific tasks, such as Cryo-EM data analysis using CCPEM and Relion. Access is provided through Fedid LDAP and ORCID OpenID Connect authentication.
- Gitlab-CI-runner: to support our software development on STFC’s Gitlab instance.
- SFTP: a file server connected to STFC’s CephFS fileshares.
- Grafana: provides analytics and interactive visualization of all relevant datastreams within the Franklin.
Ceph object store
UKRI-STFC provides the Franklin with an object store that will primarily be used for archiving the data recorded by the various instruments. The object store is currently capable of handling dozens of petabytes of files, while offering fault tolerance by replicating the data across disks. The buckets that will hold the datafiles (objects) are accessible both indirectly through Scicat and directly through the S3 API, which is widely supported by file transfer clients and software libraries.
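As an illustration of direct access through the S3 API, the following is a minimal Python sketch of archiving a single file to a bucket. It is not the Franklin's actual archiving code: the function names `sha256_of` and `archive_file`, the bucket and prefix layout, and the choice to store a checksum as object metadata are all assumptions made for this example. The only external call it relies on is `upload_file(Filename, Bucket, Key, ExtraArgs=...)`, as offered by the boto3 S3 client.

```python
import hashlib
from pathlib import Path
from typing import Any


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_file(client: Any, path: Path, bucket: str, prefix: str = "") -> str:
    """Upload one file to an S3 bucket and return the object key that was used.

    `client` is assumed to expose boto3's S3 client method
    upload_file(Filename, Bucket, Key, ExtraArgs=...).
    The key layout "<prefix>/<file name>" is a hypothetical convention.
    """
    prefix = prefix.strip("/")
    key = f"{prefix}/{path.name}" if prefix else path.name
    client.upload_file(
        str(path),
        bucket,
        key,
        # Store the checksum as user metadata so a later download can be verified.
        ExtraArgs={"Metadata": {"sha256": sha256_of(path)}},
    )
    return key
```

With the third-party boto3 package installed, a suitable client could be created with `boto3.client("s3", endpoint_url=...)`, where the endpoint and credentials are those issued for the object store. Passing the client in as an argument keeps the function easy to exercise against a stand-in object during testing.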
Confronted with many different file formats and closed-source instrument software, we cannot feasibly write custom software for each instrument to trigger data archiving, analysis and cataloguing. Instead, we are developing the RFI-File-Monitor, an extensible software package written in Python that will monitor the directories the data is written to and kick off a user-defined pipeline of operations that processes each file accordingly: copy to Ceph, register in Scicat, automatic data processing on Kubernetes, etc.
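The idea of monitoring a directory and running a user-defined pipeline on each new file can be sketched in a few lines of Python. This is not the RFI-File-Monitor's API: the class `DirectoryPoller`, its `poll` method, and the rule that a file counts as complete once its size is unchanged between two consecutive scans are all assumptions made for this illustration, using only the standard library.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Set

# An operation receives the path of a completed file and may raise on failure.
Operation = Callable[[Path], None]


class DirectoryPoller:
    """Scan a directory and run a pipeline on files whose size has stopped changing."""

    def __init__(self, directory: Path, pipeline: Iterable[Operation]) -> None:
        self.directory = Path(directory)
        self.pipeline: List[Operation] = list(pipeline)
        self._last_size: Dict[Path, int] = {}  # size observed on the previous scan
        self._done: Set[Path] = set()          # files that already went through the pipeline

    def poll(self) -> List[Path]:
        """Scan once and return the files that were processed during this scan."""
        processed: List[Path] = []
        for path in sorted(self.directory.iterdir()):
            if not path.is_file() or path in self._done:
                continue
            size = path.stat().st_size
            if self._last_size.get(path) == size:
                # Size unchanged since the previous scan: assume the instrument
                # has finished writing and run the operations in order.
                for operation in self.pipeline:
                    operation(path)
                # Only reached if every operation succeeded, so a failed file
                # is retried on the next scan.
                self._done.add(path)
                processed.append(path)
            else:
                self._last_size[path] = size
        return processed
```

A pipeline here is simply an ordered list of callables, for example one that copies the file to Ceph followed by one that registers it in Scicat, and `poll` would be called on a timer. Polling is chosen in this sketch because it behaves the same on local and network file systems; an event-based watcher would be a reasonable alternative.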