Open Datasets at the Franklin

The Rosalind Franklin Institute hosts synthetic cryo-electron microscopy (cryo-EM) datasets produced by work at the Institute and by collaborators in the CCP-EM group at the UKRI-STFC Scientific Computing Department and at The Alan Turing Institute. All datasets are licensed under CC BY 4.0 to allow reuse. They are hosted on the Rosalind Franklin Institute’s Globus collection, named “Rosalind Franklin Institute Echo AI parakeet”, and can be downloaded after registering on Globus with a Google or ORCID account.

What is a synthetic dataset?

A synthetic dataset is generated by simulating the cryo-electron tomography/microscopy data collection process. To generate synthetic datasets we use Parakeet, software developed by James Parkhurst et al. at the Rosalind Franklin Institute. Parakeet is designed to let users explore the impact of experimental parameters on tomographic reconstructions by simulating, in silico, both the sample and its imaging in a transmission electron microscope. It also includes utilities for the reconstruction and analysis of tilt series.

A slice from a Parakeet-generated synthetic tilt series of apoferritin
Synthetic micrograph from a Roodmus-generated single particle analysis cryo-EM dataset containing SARS-CoV-2 spike proteins. Particle positions are labelled using green boxes.

Whilst Parakeet simulates synthetic tomography datasets, Roodmus, software that uses Parakeet to generate synthetic single particle analysis cryo-EM datasets, has been developed by Joel Greer from CCP-EM and Maarten Joosten from TU Delft. Roodmus uses the simulation functionality of Parakeet and adds functionality for comparing ground truth information with metadata generated during reconstructions.
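To illustrate what comparing ground truth with reconstruction metadata can look like, the sketch below matches picked particle positions against known ground-truth positions and reports precision and recall. It is a minimal illustration only and is not Roodmus's implementation or API: the function name match_particles, the greedy nearest-neighbour matching and the radius parameter are all assumptions made for this example.

```python
import math


def match_particles(truth, picked, radius):
    """Greedily match picked (x, y) positions to ground-truth positions.

    Illustrative only: each ground-truth particle can be matched at most
    once, and a pick counts as correct if it lies within `radius` of an
    unmatched ground-truth position. Returns (precision, recall).
    """
    unmatched = list(truth)
    true_positives = 0
    for px, py in picked:
        best_index, best_distance = None, radius
        for i, (tx, ty) in enumerate(unmatched):
            distance = math.hypot(px - tx, py - ty)
            if distance <= best_distance:
                best_index, best_distance = i, distance
        if best_index is not None:
            unmatched.pop(best_index)  # each true particle is used once
            true_positives += 1
    precision = true_positives / len(picked) if picked else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

Because the positions in a synthetic micrograph are known exactly, metrics of this kind can be computed without manual annotation. Greedy matching is the simplest choice here; an optimal assignment would be more robust for densely packed particles.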

For the cryo-ET datasets, the molecular dynamics package GROMACS was used to generate atomistic trajectories, which Parakeet then used to generate synthetic datasets at various timepoints.

How to use synthetic datasets

Datasets are provided as zipped files to facilitate single-click downloading. They are organised by whether they are suitable for single particle analysis (SPA) or sub-tomogram averaging (STA), and then sorted into directories by the project they were created for. As a result, datasets that are useful for a given task or study are grouped together. Each dataset is accompanied by a README file that explains what the dataset contains and why it was created.
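As a minimal sketch of a first look at a downloaded dataset, the Python below lists the README and YAML files inside a zipped dataset without extracting it. The function names and the assumption that README files have names beginning with "readme" are hypothetical; the actual layout of each archive is described in its own README.

```python
import zipfile
from pathlib import PurePosixPath


def summarise_archive(zip_path):
    """Return the README and YAML files found inside a dataset archive."""
    readmes, configs = [], []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            filename = PurePosixPath(name).name.lower()
            # Assumption: README files are named "README", "README.md", etc.
            if filename.startswith("readme"):
                readmes.append(name)
            elif filename.endswith((".yaml", ".yml")):
                configs.append(name)
    return {"readme": sorted(readmes), "yaml": sorted(configs)}


def read_member(zip_path, member):
    """Read one text file from the archive without extracting everything."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(member).decode("utf-8", errors="replace")
```

For example, calling summarise_archive on a downloaded archive (the path is whatever you saved it as) and then passing the first "readme" entry to read_member prints the dataset description before any large image files are unpacked.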

Datasets can be reused as they are, or you can customise or extend them by installing Parakeet and editing the YAML metadata file(s) that specify the parameters used to create each individual tilt series or micrograph. Given a YAML file, Parakeet can reproduce an existing image or, if the file has been customised, create a new one.
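One way to customise a configuration without editing it by hand is to load it into a nested dictionary, merge in a small set of overrides, and write the result to a new YAML file. The sketch below shows only the merge step in plain Python. The key names (microscope, beam, electrons_per_angstrom, scan, num_images) are illustrative assumptions and should be checked against the YAML file shipped with the dataset; reading and writing the file itself could be done with a YAML library such as PyYAML (yaml.safe_load and yaml.safe_dump).

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of `config` with nested `overrides` merged in."""
    merged = copy.deepcopy(config)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling keys in the nested section are preserved.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical stand-in for a configuration loaded from a dataset's YAML file.
original = {
    "microscope": {"beam": {"electrons_per_angstrom": 30}},
    "scan": {"num_images": 41},
}

# Change one parameter and leave everything else as it was.
customised = apply_overrides(
    original, {"microscope": {"beam": {"electrons_per_angstrom": 45}}}
)
```

Keeping the original dictionary untouched makes it easy to record exactly which parameters differ between the published dataset and a customised simulation.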

More instructions on how to use Parakeet can be found at https://rosalindfranklininstitute.github.io/parakeet/index.html.

Schematic showing the organisation of the Rosalind Franklin Institute Echo AI parakeet Globus collection

This work was supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/W006022/1, particularly the “AI for Science” theme within that grant, and by The Alan Turing Institute.
