Sciences’ Data Challenge: How One University is Solving the Collaboration Disconnect

To adequately support the enormous volumes of research data produced every day by science and research institutes throughout the world, there is an urgent demand for storage technology that allows easy access to and transfer of research-based digital data. Raw computing power is growing much more rapidly than bandwidth, I/O, or storage capacity, and this discrepancy is only exacerbated by the challenges associated with analyzing research data.

Shawn McKee is a Research Scientist in the Physics Department at the University of Michigan and Director of the ATLAS Great Lakes Tier 2 Center — a computing center that provides computation, storage, middleware, and networking for High Energy Physics for the ATLAS experiment. He is also the founding director of the newly established Center for Network and Storage-Enabled Collaborative Computational Science at U Michigan.

We had a chance to talk with McKee about the work and goals that are his focus at the Center. It was an incredible opportunity to hear more about the work that is happening at the forefront of computational science, and I wanted to share some of his words with you. Here are excerpts from a long interview from earlier this year.

The Data Challenge: Large, Diverse, and Distributed

Shawn McKee: At the Center for Network and Storage-Enabled Collaborative Computational Science, we’re trying to figure out what are the best practices, what are the key challenges, and possible technological answers for groups of scientists who work on distributed, large, or diverse data.

One question that all scientists are asking these days is, “How can we effectively do our science when we have this big challenge of managing all of this information and access to storage?”

The Center was partially motivated by a project that we’re working on called OSiRIS, for which we were awarded one of the four National Science Foundation’s CC*DNI (Campus Cyberinfrastructure – Data, Networking, and Innovation) DIBBS (Data Infrastructure Building Blocks) awards last year. The grant awarded was based on the idea of creating a distributed storage infrastructure that spans many institutions while serving many science domains.

Distributed Data Makes Big Data More Complicated

Shawn McKee: One of the issues that we always have to deal with is transferring data between locations. It comes from one place and then it has to be kept someplace else. Then the data needs to be transformed, analyzed, filtered, sorted, and so on. New things need to be created with the data. And you have to keep track of everything that you’re doing to the data with that process. You have to really understand exactly what was done to extract the scientific results that you may have found.

It’s challenging enough when you have the amount of data that you can move between, say, two workstations or two laptops. But as you get to larger and larger amounts of data, adding more and more laptops or workstations doesn’t scale well.

Here Comes The Next Challenge

Here comes the next challenge – you need to collaborate with people on your team and most often they are not in the same department. You might have a colleague down the hall that you work with, but it’s also likely that you have a couple of colleagues at other institutions, maybe even across the country.

Trying to work together under those conditions is tricky because where you’re storing the data is frequently a bottleneck for computing on that data. When you have to do some sort of process on it, there’s a certain amount of time that it takes to transfer the data. If it’s a few gigabytes, that’s not much, as you can move it around pretty quickly. But a few terabytes? Now that depends on what kind of networking infrastructure you have and how fast your disks are. When you get to tens of terabytes and even beyond into petabytes, all sorts of challenges emerge in efficiently using the networks to move things around.
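
To put rough numbers on those scales, the arithmetic can be sketched as below. This is a minimal illustration rather than a measurement: the function names, the example data sizes, the link speeds, and the assumption that a link sustains its full nominal rate (decimal units, no protocol or disk overhead) are all illustrative choices, not figures from the interview.

```python
def transfer_time_seconds(size_bytes: float, link_gbps: float,
                          efficiency: float = 1.0) -> float:
    """Best-case time to move size_bytes over a link of link_gbps.

    efficiency is an assumed fraction (0 < efficiency <= 1) of the
    nominal link rate that is actually sustained end to end.
    """
    if link_gbps <= 0 or not (0 < efficiency <= 1):
        raise ValueError("link_gbps must be > 0 and 0 < efficiency <= 1")
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9 * efficiency)


def format_duration(seconds: float) -> str:
    """Render a duration in the most readable unit."""
    if seconds < 60:
        return f"{seconds:.1f} s"
    if seconds < 3600:
        return f"{seconds / 60:.1f} min"
    if seconds < 86400:
        return f"{seconds / 3600:.1f} h"
    return f"{seconds / 86400:.1f} days"


if __name__ == "__main__":
    # Hypothetical dataset sizes in decimal units (1 GB = 1e9 bytes).
    sizes = {"5 GB": 5e9, "5 TB": 5e12, "50 TB": 50e12, "1 PB": 1e15}
    # Hypothetical nominal link speeds in gigabits per second.
    links_gbps = [1, 10, 100]

    for label, size in sizes.items():
        times = ", ".join(
            f"{gbps} Gbps: {format_duration(transfer_time_seconds(size, gbps))}"
            for gbps in links_gbps
        )
        print(f"{label:>6} -> {times}")
```

Even under these idealized assumptions, a petabyte takes close to a day on a 100 Gbps link, and any shortfall in sustained throughput (modeled here by the assumed `efficiency` parameter) stretches that proportionally.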

A First Disconnect Between Storage Systems and the Networks They Connect To

Shawn McKee: Another big challenge is that there’s frequently a disconnect between the storage end systems and the networks they’re connected to. You can get 100Gbps network adapters, plug them in, and run memory-to-memory tests at ~100Gbps between two end systems. The results show 99.9 Gbps as expected. Then you test the storage systems and tune them up to achieve 50, 60, or even 100 gigabits a second on the storage system. However, when you try to go from the storage system through that 100Gbps card that you tested across the network, to another 100Gbps card and then into another storage system, what we often find is that, even if all of them tested out fairly high, when you put that whole chain together you get something significantly less than the lowest measured bottleneck.

There are little mismatches along that whole sequence going from the disk (or even the flash devices) into the bus of the computer, into the network cards, through the driver stack, out on the network, through TCP with its packets and encapsulation, all the way over to the other end, where the data are received, reorganized, and then sent back to the bus and onto the disks. Those little interchange points don’t always sync up efficiently. So we see little bursts and lags and stops and drops, and some of those get amplified because of the behavior of the network.
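
One way to see how such mismatches can pull a chain below its slowest component is a toy model. The sketch below is purely illustrative and is not how the Center measures or models its systems: the two-stage pipeline, the per-tick burst patterns, the buffer sizes, and the function names are all assumptions made for the example. A bursty stage (standing in for storage) hands data to a steady stage (standing in for the network) through a small buffer, and any burst that does not fit in the buffer is lost capacity for that tick.

```python
def average_rate(pattern: list[int]) -> float:
    """Average units per tick a stage can move when tested on its own."""
    return sum(pattern) / len(pattern)


def chained_throughput(producer_pattern: list[int],
                       consumer_pattern: list[int],
                       buffer_capacity: int,
                       ticks: int) -> float:
    """Units per tick delivered when two stages share a bounded buffer.

    Each tick the producer offers producer_pattern[t] units but can only
    hand over what fits in the buffer; the consumer then drains up to
    consumer_pattern[t] units from the buffer. Unused capacity in a tick
    does not carry over to the next tick.
    """
    buffered = 0
    delivered = 0
    for t in range(ticks):
        offered = producer_pattern[t % len(producer_pattern)]
        accepted = min(offered, buffer_capacity - buffered)
        buffered += accepted

        drain = min(consumer_pattern[t % len(consumer_pattern)], buffered)
        buffered -= drain
        delivered += drain
    return delivered / ticks


if __name__ == "__main__":
    bursty_storage = [2, 0]   # hypothetical: 2 units, then idle
    steady_network = [1, 1]   # hypothetical: 1 unit every tick

    print("storage alone:", average_rate(bursty_storage))
    print("network alone:", average_rate(steady_network))
    for capacity in (1, 2):
        rate = chained_throughput(bursty_storage, steady_network,
                                  capacity, ticks=1000)
        print(f"chained, buffer={capacity}:", rate)
```

In this toy model both stages average one unit per tick in isolation, yet with a one-unit buffer the chain delivers only half that, and a two-unit buffer restores the full rate. Real systems have many more interchange points and far less regular behavior, so this only illustrates the mechanism, not its magnitude.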

A Second Disconnect Between Code and the Infrastructure That Runs It

One of the challenges I see in the scientific field, which is different from other industries, is that the scientists who know the process that they need to go through with the data are often not programmers. They don’t necessarily know the most efficient ways to do things. So what can end up happening is that they code something that does the process according to how they imagine it, but it turns out to be a very poor match to how the infrastructure itself is put together.

Scientists want to focus on science, but they end up having to learn a lot about the infrastructure.

A lot of time is spent trying to figure out why code is behaving so badly and why they can’t get through the data they need to.

Can Transparent, Collaborative Infrastructure Exist?

Shawn McKee: At the center we are building a single storage infrastructure that is distributed across our three Michigan universities and we’re tying it together with software-defined networking. We’re trying to use the network programmatically to help us manage how users get to their data, independently of which site they’re located at, and how the storage infrastructure itself communicates and moves information around. Our goal is to give any science user at any one of the institutions the same level of access and the same visibility of the data that they’re working on together as a group.

We’re trying to build an infrastructure that’s transparent so that those users can directly see, access, change, and modify the data. Our goal is to try to create this kind of infrastructure, show that it works for multiple different science domains, and, hopefully, show that it is scalable beyond the size of one region. Right now we’re within the state of Michigan, but we’d like to stretch this out further. The National Science Foundation would like to see solutions that can scale to a national level, because these challenges — how to manage and work with all of this data — are widespread.

One Solution Across All Science Domains

The vision that we have with OSiRIS (and that’s obviously my bias) is to see an infrastructure in place that would work for all researchers, so that the institution wouldn’t have to provide a customized solution for each science domain. Instead, they could have something they can scale and adjust to everyone’s needs. From an ‘economies of scale’ perspective, you’d have one solution to maintain and support. Yet from the science domain user’s perspective, it has to have enough customizability to meet various needs. It would need to have a lot of storage capacity and good streaming performance so that once we start pulling data out or pushing data in, it moves very fast between that infrastructure and whatever resources are sourcing or sinking the data. The storage infrastructure would have to support the computing needs, and the infrastructure needs to be robust and resilient.

If you’re a bioinformatics person, you don’t want to have to solve puzzles about storage systems and networks and data transfer tools – you just want to do your science.

The bottom line for scientists is that they can finally do what they want to do rather than deal with technical issues.

Further Reading

To learn more about Shawn McKee and the work of the Center for Network and Storage-Enabled Collaborative Computational Science at the University of Michigan, visit http://micde.umich.edu/centers/cnseccs/

To learn about how the University of Michigan uses SanDisk® Fusion ioMemory™ PCIe application accelerators to help them share massive volumes of data between the CERN Laboratory in Geneva, Switzerland and 100 computing centers around the world, read the case study here.