In my first blog in this series on understanding Machine Learning and building successful implementations, I talked about data access patterns in the machine learning pipeline. Understanding those access patterns helps data scientists decide on the right storage infrastructure for their projects. Data infrastructure is what makes Machine Learning possible. Yet once you get started, there are critical data challenges you need to address first:
- Quality
- Sparsity
- Integrity
Let’s take an in-depth look at each of these and at how your team can start overcoming them:
1. Quality
Many data scientists want to leverage data from external sources. Yet often there is no quality control or guarantee on how the original data was captured. Can you trust the accuracy of the external data?
Here’s an example for illustration. Sensors on buoys floating in the ocean collect data on the ocean’s temperature. When a sensor could not take a temperature reading, it recorded the value 999. In addition, the year was recorded with only two digits before the year 2000; after 2000, it was recorded with four. You need to understand the quality of the data and how to prepare it. In this case, the scientists analyzing the buoy data could use the median, mean, min, and max to visualize the raw data, catch these recording errors, and cleanse the data accordingly.
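To make this concrete, here is a minimal sketch of that kind of sanity check in pandas. The column names and values are made up for illustration, not taken from a real buoy dataset:

```python
# Hypothetical sketch: profiling raw buoy readings to catch sentinel values
# such as 999 and two-digit years before any analysis.
import pandas as pd

readings = pd.DataFrame({
    "year": [98, 99, 2001, 2002],               # two-digit years before 2000
    "temperature_c": [14.2, 999, 15.1, 14.8],   # 999 = sensor could not read
})

# A simple profile: min/max/mean/median immediately expose the 999 sentinel.
print(readings["temperature_c"].agg(["min", "max", "mean", "median"]))

# Cleanse: treat the sentinel as missing and normalize years to four digits.
readings["temperature_c"] = readings["temperature_c"].replace(999, float("nan"))
readings.loc[readings["year"] < 100, "year"] += 1900
```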
2. Sparsity
Sparsity in this context applies to metadata. Very often metadata fields are not complete: some are filled and some are left blank. If the data is generated from a single source, this may be due to a lack of discipline or knowledge on the part of whoever records it. However, when data comes from diverse sources without a standard definition of metadata, each dataset may have completely different fields. So when you combine them, the fields that are populated may not correspond.
Currently, there is no industry standard for what metadata to capture. Yet metadata is as important as the data itself. When you have the same type of data with different metadata fields populated, how do you correlate and filter it?
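As a quick illustration of what that sparsity looks like in practice, here is a hypothetical sketch of two sources that populate different metadata fields; the field names are invented for the example:

```python
# Hypothetical sketch: two sources describe the same kind of reading but
# populate different metadata fields, so the combined table ends up sparse.
import pandas as pd

source_a = pd.DataFrame({
    "temperature_c": [14.2, 15.1],
    "buoy_id": ["A-17", "A-17"],         # source A records the buoy ID...
})
source_b = pd.DataFrame({
    "temperature_c": [13.9, 14.4],
    "capture_interval_min": [3, 3],      # ...source B records the interval instead
})

combined = pd.concat([source_a, source_b], ignore_index=True)
print(combined)
# Fields populated by only one source come back as NaN; the share of missing
# values per column is a quick measure of metadata sparsity.
print(combined.isna().mean())
```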
If we go back to the buoy example, the original buoys collected water temperature every ten minutes, while newer buoys collect it every three minutes. The only way to correlate the data is through metadata disclosing when, and how often, each reading was captured. When scientists do historical analysis, they need that metadata to be able to adjust their models accordingly.
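Here is a rough sketch, with invented names and intervals, of how that capture-interval metadata could be used to bring old and new buoy readings onto a common time grid for comparison:

```python
# Hypothetical sketch: resampling 10-minute and 3-minute buoy readings onto a
# common hourly grid so they can be compared in a historical analysis.
import pandas as pd

def hourly_means(readings: pd.DataFrame) -> pd.Series:
    """Average temperature per hour, regardless of the original capture interval."""
    return (readings.set_index("timestamp")["temperature_c"]
                    .resample("1h")
                    .mean())

old_buoy = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=12, freq="10min"),
    "temperature_c": 14.0,
})
new_buoy = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=40, freq="3min"),
    "temperature_c": 14.5,
})

print(hourly_means(old_buoy))
print(hourly_means(new_buoy))
```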
3. Integrity
Data integrity is an assurance of data accuracy and consistency. The chain of data custody is critical to proving that data has not been compromised as it moves through pipelines and between locations. When the capture and ingestion of the data are under your control, you can validate its integrity relatively easily. However, when collaborating with others, validation becomes difficult. External data carries no security certificate from the time it was generated, so you cannot ensure the data was recorded exactly as intended, nor that what you receive is exactly what was originally recorded.
There are some interesting concepts around IoT data and blockchain; however, until such concepts are widely adopted, data integrity depends on a combination of security technologies and policies. For example, since data can be compromised at rest or in transit, data transferred over the network should use HTTPS, and data should be encrypted at rest. Access control, on the other hand, should be policy driven to avoid human error.
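One small example of what a chain-of-custody check can look like in practice: the producer of a dataset publishes a cryptographic digest alongside it, and the consumer recomputes that digest after transfer. The file name and workflow below are illustrative assumptions, not a prescribed implementation:

```python
# Hypothetical sketch: verify a dataset arrived unchanged by recomputing the
# SHA-256 digest the producer published alongside it.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "…digest published by the data producer…"  # placeholder value
if sha256_of("buoy_readings_2024.csv") != expected:
    raise ValueError("Dataset hash mismatch: integrity cannot be verified")
```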
How to Get Started
Data quality, sparsity, and integrity directly impact final model accuracy and are some of the biggest challenges facing machine learning today. Organizations that have clear data definitions and policies, and that explore industry-specific data standards, will benefit in both short-term and long-term projects.
If you’re not there yet, your organization should start by defining its own data collection policy and metadata format, and by applying standard security techniques. Data quality and sparsity go hand in hand: as a next step, set up metadata policies and make sure the descriptive metadata captured can be used to verify the validity of the data. Lastly, to ensure data integrity, digital certificates can be applied at data generation, SSL/TLS should be enforced during transfer, and encryption should always be enabled at rest.
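For the metadata policy piece, a minimal sketch of ingestion-time validation might look like the following. The required fields are examples only, not an industry standard:

```python
# Hypothetical sketch: enforce a minimal metadata policy at ingestion time so
# records missing required fields are rejected rather than silently stored.
REQUIRED_METADATA = {"buoy_id", "timestamp", "capture_interval_min", "sensor_model"}

def validate_metadata(record: dict) -> list[str]:
    """Return the required metadata fields that are missing or empty."""
    return sorted(
        field for field in REQUIRED_METADATA
        if record.get(field) in (None, "")
    )

record = {"buoy_id": "A-17", "timestamp": "2024-01-01T00:00:00Z"}
missing = validate_metadata(record)
if missing:
    print(f"Rejecting record, missing metadata: {missing}")
```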
Secure Data Collaboration
If you are in an industry that needs to constantly exchange data with outside organizations, it is best to open source your data and metadata formats, as open standards will reach further than most proprietary ones. Better yet, you can initiate an industry-wide open standards committee and invite others to participate and contribute. One good example is Open Targets (https://www.opentargets.org/), a “public-private partnership that uses human genetics and genomics data for systematic drug target identification and prioritization.”[1]
Research data ecosystems in particular have grown highly complex, with collaborators from inside and outside the organization needing fast access to data and a way to simplify data management. The challenges of Machine Learning are many; starting your project with the right data and infrastructure is the first step.
[1] From the Open Targets website.