Four years ago we, the ScienceCloud group at DeIC, launched the scientific data-management service data.deic.dk.
It was a first attempt at building such a service out of open-source components (FreeBSD, ZFS, Apache, ownCloud/Nextcloud), and it involved engaging with similar projects in other countries (Switzerland, Australia, Germany, Holland) and with open-source environments (Nextcloud, CS3).
Today this service hosts about 100 TB of data for 1000+ researchers, is very busy and is still growing in popularity. It is a one-machine setup under quite heavy load, so to scale further we decided early on to launch an R&D project with the aim of creating a horizontally scalable data-management service with additional research features.
The result was sciencedata.dk, which we opened last year for testing. Our plan was to expose adventurous researchers to new features and collect feedback for building a truly national scientific data store, endorsed by all stakeholders, from scientists to university management, libraries, national databases and archives.
We’ve been doing just that and I’d like to extend my thanks to all the wonderful and dedicated researchers from DTU and SDU for taking the time to test the service and provide feedback.
The next step will be to use all this feedback to create a hardened and secure production service. But the plan doesn’t stop here of course: We’re convinced that a national scientific data store must be seamlessly coupled with scientific data processing facilities.
That was indeed the intention behind the now defunct NSAS project, which we supported. It was also the intention behind our own R&D project compute.deic.dk, which is in an abandoned state due to lack of time and resources.
So, what is the problem here? We have a research data-management prototype but no coupling to computing resources. How hard can it be?
1. In my judgment, it is technically rather hard – but far from impossible.
2. Somebody needs to do it – and we simply have a hard time attracting people with adequate skills.
3. If, by computing resources, we mean existing national, large-scale, DeIC-funded HPC resources, it cannot be done by us alone, but needs the active technical participation of one or more of these resources. We’ve found that hard too, but will keep trying.
4. More generally, coupling large-scale data-management and computing resources needs interest and agreement both at the national level (the university managements, the ministry) and (somewhat relatedly) within our own management. Both have also been hard to obtain.
- I still personally believe there is a very strong case for a science cloud for Danish research – allowing processing of data from a multitude of sources on large-scale computing facilities.
- I also believe we’ve shown the feasibility of building the all-important data-management component of such a cloud with open-source ingredients.
- Given the lack of high-level consensus, we’ll proceed by building a hardened data service and a compute service prototype.
Stay tuned, and if you happen to be a skilled, infrastructure-interested data scientist, do let us hear from you.