Open-Source Data's R&D Promise and Pitfall: Why Data Quality Still Matters

Open datasets accelerate R&D innovation, but inconsistent quality threatens reliability. A partnership approach that combines open and proprietary data delivers trustworthy results.
Dec. 3, 2025

3 Key Highlights

1. The Opportunity: Open datasets have accelerated early-stage research, enabling rapid prototyping of catalysts, materials and molecular properties without generating data from scratch.

2. The Challenge: Inconsistent formats, missing metadata and variable quality in open datasets compromise model reliability and reproducibility; algorithms are only as reliable as their data.

3. The Solution: Coexistence of open and proprietary data sources can maximize innovation speed while ensuring the precision and quality control needed for dependable R&D outcomes.

The reliability of data can make or break an innovation. Whether designing a new catalyst, scaling a reaction from lab to plant or ensuring safe plant operations, trustworthy data underpins each decision. Over the past decade, open-source datasets have become more common in chemistry and materials research, lowering barriers to experimentation and fueling rapid innovation.

The promise? Faster discovery and broader collaboration.

The pitfall? Not all data is created equal.

Open datasets often aggregate information from many sources. Inconsistent formats, missing metadata on how results were obtained or variations in computational methods can compromise model accuracy and reproducibility. At the end of the day, an algorithm is only as reliable as the data fed into it. This raises a critical question: How should open-source and proprietary data approaches coexist to maximize innovation and ensure precision?

The Promise of Open-Source Data in R&D

Open data has undeniably accelerated early-stage research. Public repositories like the Materials Project or Catalysis-Hub have made thousands of molecular structures, catalytic surfaces and experimental results freely available. This enables researchers to explore ideas that would be impractical if they had to generate all data from scratch. One recent report noted that open datasets and pre-trained models led to “rapid progress” in predicting molecular properties, catalyst performance and metal-organic framework behaviors. The result is a flourishing of innovation at the conceptual stage: new catalysts proposed, novel materials screened and property prediction models prototyped, all fueled by readily accessible data.

The Perils of Inconsistency and Data Variability

If open-source datasets are the rocket fuel for innovation, data inconsistency is the turbulence that can knock that rocket off course. The very nature of open data, aggregated from literature or contributed by diverse sources, means it can suffer from variability in quality and context. A public catalyst database, for example, might include structures optimized with different theoretical methods, and those subtle differences can lead to significant changes in predicted performance.

Real-world studies underscore these issues. In a recent assessment of aqueous solubility data (a key property for pharmaceuticals and process design), researchers found that “Existing solubility datasets show variance in the chemical space of compounds covered, measurement methods, experimental conditions, but also in the non-standard representations, size and accessibility of data.”

These inconsistencies, whether from experimental noise or reporting errors, directly hinder the development of reliable predictive models: a model trained on inconsistent data learns an inconsistent reality.
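In practice, the first line of defense is a consistency audit before any training happens. The sketch below shows one minimal version of that idea, assuming two hypothetical open solubility exports with the column names indicated in the comments; the file names, units and 0.5 log-unit disagreement threshold are illustrative only.

```python
# Minimal consistency audit across two hypothetical open solubility exports:
# harmonize units, join on a canonical identifier and flag disagreements.
import numpy as np
import pandas as pd

# Assumed exports from two different open repositories (names are placeholders)
a = pd.read_csv("source_a_solubility.csv")  # columns: inchikey, logS (log10 mol/L)
b = pd.read_csv("source_b_solubility.csv")  # columns: inchikey, solubility_mg_per_L, mol_weight

# Convert source B onto the same scale as source A (log10 mol/L)
b["logS"] = np.log10(b["solubility_mg_per_L"] / 1000.0 / b["mol_weight"])

# Join on a canonical structure identifier and compare reported values
merged = a.merge(b[["inchikey", "logS"]], on="inchikey", suffixes=("_a", "_b"))
merged["disagreement"] = (merged["logS_a"] - merged["logS_b"]).abs()

# Flag compounds where the two sources differ by more than 0.5 log units
suspect = merged[merged["disagreement"] > 0.5]
print(f"{len(suspect)} of {len(merged)} overlapping compounds disagree by >0.5 log units")
```

An audit like this does not fix the underlying variability, but it shows where the variability lives before a model quietly learns it.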

The variability problem doesn’t imply that open data is unusable. Rather, it means that using it requires care. This is a crucial consideration as AI-powered materials simulation tools proliferate. If an AI model’s training data comes from dozens of open sources with inconsistent standards, engineers are right to question whether its predictions can be trusted for billion-dollar process decisions. Data reliability, or lack thereof, becomes the Achilles’ heel.

Proprietary Data: The Gold Standard for Precision

In contrast to the wild west of open data, proprietary datasets in industry are typically curated under much tighter controls. Companies in the chemical and process industries have long recognized that data is a strategic asset. They generate high-quality proprietary data from carefully designed experiments, pilot plant runs or high-fidelity simulations. The result is an internal “single source of truth” that their engineers trust when scaling up a process or making safety decisions. Such consistency yields data that models can rely on for accurate predictions.

There’s a reason companies guard these proprietary datasets closely. Data quality translates to risk mitigation. When you’re designing a chemical reactor that will operate continuously at high temperature, you need absolute confidence in the reaction kinetics data. Lives and capital equipment are on the line.

Organizations feel safer leveraging internal, verified information for critical decisions. In many cases, they will use open-source models or tools but validate them against proprietary data before deployment. This highlights a symbiosis: open data and tools might be used in the early exploratory phase, but proprietary data grounds the project in reality before execution.
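What that validation step can look like in code is sketched below: score an openly available, pre-trained property model against an internal, verified dataset and promote it only if it clears the in-house accuracy bar. The column names and the 2 kJ/mol threshold are illustrative assumptions, not figures from any specific product or standard.

```python
# Validation gate: an open, pre-trained model must meet an in-house accuracy
# requirement on proprietary measurements before it enters the design workflow.
import pandas as pd
from sklearn.metrics import mean_absolute_error


def passes_internal_validation(open_model, internal_csv: str,
                               threshold_kj_mol: float = 2.0) -> bool:
    """Return True only if the open model is accurate enough on internal data."""
    internal = pd.read_csv(internal_csv)    # proprietary measurements
    y_true = internal["energy_kj_mol"]
    X = internal.drop(columns=["structure_id", "energy_kj_mol"])

    y_pred = open_model.predict(X)          # any estimator with a predict() method
    mae = mean_absolute_error(y_true, y_pred)
    print(f"MAE vs. proprietary measurements: {mae:.2f} kJ/mol")
    return mae <= threshold_kj_mol
```

The exact metric and threshold will differ by application; the point is that the gate is explicit, documented and enforced before deployment.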

Bridging the Gap: Open and Proprietary Coexistence

Rather than viewing open and proprietary data approaches as an either/or choice, leading R&D teams treat them as complementary. Open data is leveraged to accelerate the early stages—hypothesis generation, initial screening and model development—where speed and breadth matter more than absolute precision. Proprietary data enters later to refine, validate and scale up those innovations with confidence. This phased strategy is increasingly common across process industries:

  • Catalyst Screening: Researchers can train an AI model on a large open dataset of catalytic surfaces and reaction energetics to identify promising catalyst candidates. The model might not predict exact rates perfectly, but it narrows the field dramatically. Then, the top candidates are tested in-house under real-world conditions, generating proprietary performance data. Those proprietary results feed back into updated models, ensuring the next round of predictions aligns with reality (a simple version of this feedback loop is sketched after this list).

  • Materials Discovery: Open databases (for example, containing thousands of polymer or alloy compositions) allow a broad search for materials with desired properties. Companies use this to shortlist materials that might meet their requirements for, say, a corrosion-resistant coating. Subsequent testing of those materials yields a curated dataset of what actually works, which is then used in proprietary property prediction workflows to fine-tune the choice.

  • Property Prediction and Process Design: For routine properties like reaction energies or diffusion coefficients, engineers may turn to public compilations to get initial estimates. But when designing the actual process, they often rely on internally measured values or a vetted proprietary database to plug into simulations. The open-source values serve as starting points or to fill gaps, while proprietary values are used in final design calculations to ensure safety factors are correctly applied.
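A simple version of the screen-then-refine feedback loop from the catalyst example might look like the sketch below: train a broad screening model on open data, shortlist candidates for in-house testing, then re-train with proprietary results weighted more heavily. The dataset names, descriptor columns and the tenfold weighting are assumptions for illustration, not a prescribed workflow.

```python
# Phased strategy in miniature: open data for breadth, proprietary data to anchor.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

open_df = pd.read_csv("open_catalyst_screen.csv")      # broad, mixed-quality open data
prop_df = pd.read_csv("inhouse_catalyst_results.csv")  # small, curated proprietary data

features = ["descriptor_1", "descriptor_2", "descriptor_3"]
target = "activity"

# Stage 1: broad screening model built only on open data
screen_model = RandomForestRegressor(n_estimators=300, random_state=0)
screen_model.fit(open_df[features], open_df[target])
candidates = (
    open_df.assign(predicted=screen_model.predict(open_df[features]))
    .nlargest(20, "predicted")                         # shortlist for in-house testing
)

# Stage 2: after testing, re-train on the combined data, trusting in-house
# measurements roughly 10x more than aggregated open entries
combined = pd.concat([open_df, prop_df], ignore_index=True)
weights = np.concatenate([np.ones(len(open_df)), 10.0 * np.ones(len(prop_df))])
refined_model = RandomForestRegressor(n_estimators=300, random_state=0)
refined_model.fit(combined[features], combined[target], sample_weight=weights)
```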

Toward a Data-Driven Future: Integration and Trust

A noteworthy trend in chemical R&D is the move toward integrating computational and experimental workflows more tightly than ever. The rise of “digital twin” models of chemical processes and even autonomous labs means that data is constantly cycling between the virtual and physical realms. In a self-driving lab scenario, an AI model might propose experimental runs based on simulation data. The experiments then generate new data, which updates the model in turn. For this cycle to succeed, the data must be both abundant (where open data helps) and reliable (where quality control and proprietary verification help). This is pushing industry-wide efforts to break down data silos and adopt common standards.
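As a schematic of that cycle, the sketch below uses a simple uncertainty-driven loop: a model trained on the data gathered so far proposes the conditions it is least certain about, the (here simulated) experiments return results, and the model is updated. The candidate pool and experiment function are stand-ins for real lab hardware and data, not a description of any particular autonomous lab.

```python
# Closed simulation-experiment loop in miniature (active learning).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
candidate_conditions = rng.uniform(300, 600, size=(200, 1))  # e.g., temperatures in K


def run_experiment(conditions):
    """Placeholder for a real (or autonomously executed) lab measurement."""
    return np.sin(conditions / 50.0).ravel() + rng.normal(0, 0.05, len(conditions))


# Seed the loop with a handful of initial measurements
X = candidate_conditions[:5]
y = run_experiment(X)

for cycle in range(5):
    model = GaussianProcessRegressor().fit(X, y)
    # Propose the conditions the model is least certain about
    _, std = model.predict(candidate_conditions, return_std=True)
    next_idx = np.argsort(std)[-3:]                          # three most uncertain candidates
    X_new = candidate_conditions[next_idx]
    y_new = run_experiment(X_new)                            # new data from the "lab"
    X, y = np.vstack([X, X_new]), np.concatenate([y, y_new])
```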

Standardization efforts are emerging, sometimes led by consortia or government agencies, to define how chemical data should be reported and exchanged. From file formats for molecular structures to ontologies that precisely label experimental conditions, these standards help ensure that when one organization’s open data is used by another’s proprietary model (or vice versa), the integration is smooth.
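To make the idea concrete, the sketch below shows one hypothetical shape such a standardized record could take, carrying explicit units, method metadata and provenance alongside the value itself. The field names and values are illustrative and not drawn from any published standard or ontology.

```python
# Hypothetical standardized record: units, method and provenance travel with the value.
from dataclasses import dataclass, asdict
import json


@dataclass
class SolubilityRecord:
    inchikey: str          # canonical structure identifier
    value: float           # numeric result
    unit: str              # e.g., "log10(mol/L)"
    temperature_k: float   # experimental condition
    method: str            # e.g., "shake-flask" or "DFT/PBE-D3"
    source: str            # provenance: lab notebook ID, DOI or dataset name


record = SolubilityRecord(
    inchikey="XXXXXXXXXXXXXX-XXXXXXXXXX-X",  # placeholder identifier
    value=-1.0,                              # illustrative value only
    unit="log10(mol/L)",
    temperature_k=298.15,
    method="shake-flask",
    source="doi:10.xxxx/placeholder",
)
print(json.dumps(asdict(record), indent=2))
```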

All of this underscores a key point: open-source and proprietary methods are not opposites; they are partners. Open data and tools spark the engine of innovation, and proprietary data and expertise steer that engine to its destination safely.

Complementary Paths to Innovation

As the chemical process industries charge forward into the era of AI and digitalization, the most successful R&D organizations will be those that adeptly balance openness with rigor. Open-source datasets and models will continue to broaden horizons, and proprietary data will remain the bedrock on which real-world decisions are made. The path to the future is not one or the other, but a synergy of both. By recognizing the strengths of each approach, we ensure that our AI predictions and simulations are grounded in reality and that our real-world experiments are augmented by the vast context that global data sharing provides.

In practical terms, this means investing in data infrastructure and culture. Companies will benefit from contributing to open standards and even sharing non-competitive portions of their data, because an ecosystem of better data helps everyone's models. Likewise, researchers should be diligent in curating and documenting the data they publish openly, treating it with the same care they would give internal data. The community will thank them and likely reuse their contributions widely. When open and proprietary datasets coexist and complement each other, chemical process R&D can achieve the best of both worlds: the agility and innovation that come from openness, and the precision and trust that come from proprietary rigor.

About the Author

Joshua Young

Senior Application Scientist

Joshua Young is a Senior Application Scientist at the U.S. branch of Japanese AI startup Matlantis, which uses atomic-scale, AI-based simulation to accelerate materials discovery worldwide. He specializes in computational materials research and drives adoption of the Matlantis platform. With expertise in materials informatics, density functional theory (DFT) and ab initio molecular dynamics (AIMD), he designs and probes materials across scales for energy and nanoelectronics applications. At Matlantis, Young helps clients integrate AI-powered atomistic simulations into R&D, from optimizing catalysts and designing battery electrodes to exploring complex polymers, and translate cutting-edge computational modeling into industrial impact.
