Every year, the EU announces that billions and billions of data are now “open” again, but this is not gold. At least not in the form of nicely minted gold coins, but in gold dust and nuggets found in the muddy banks of chilly rivers. There is no rush for it, because panning out its value requires many hours of hard work. Our goal is to automate this work to make open data usable at scale, even in trustworthy AI solutions.
In his presentation, Daniel compared the current state of open data (including governmental open data and scientific open data) to a thrift store. You can often find bargains, or historical data that would be impossible to source from data vendors, but on a strictly as-is basis, without a catalogue, service, or guarantee. Working with open data therefore requires careful reprocessing, validation, and in many cases frequent re-validation. The value of open data is often over-estimated: it is never a finished product, and often it cannot even be downloaded, so it requires further investment before it becomes valuable. However, because most open data comes from the governmental sector, you can tap into information sources where no market alternative exists. In some cases open data may be a cheaper substitute for market vendors, but often it is an exclusive source of information for which no market vendor exists at all.
The practices related to the exploitation of open data are not only relevant in an open data context: they are good data ingestion and procurement practices for any third-party data, and in large organizations, for any cross-departmental data. (See the blog post: The Data Sisyphus.)
Case Study: Belgian Drought/Flood Risk Awareness, Financial Capacity & Hydrology, a complex integration of various open data sources.
In the second part of the presentation, Daniel talked about our modern data observatory concept. We have reviewed about 80 functioning and already defunct international data collection programs. Data observatories, like the Copernicus Observatory, are permanent infrastructures for recording domain-specific data, such as information on alternative fuels, on homelessness, or on the European music business. In our assessment, most of the observatories recognized or endorsed by the EU, OECD, or UNESCO use obsolete technology and do not build on recent advances in data science. Reprex, our start-up, offers an open-source, open-data-based alternative for building largely automated data observatories. We believe that human judgement is needed in data curation, but processing, documentation, and validation are best done by computers.
Finally, he presented a few development directions for our open-source software, mentioning our work within the rOpenGov community. This part of the presentation was originally meant to open the way for a half-day open data workshop, but due to the pandemic, the physical part of the conference and the workshops were not held.
The presentation largely covered the topics of our Data & Lyrics blog post: Open Data—The New Gold Without the Rush.
See the presentation slides here.