WP2. Quantitative Data Collection and Cleaning

Led by the University of Cambridge Computer Laboratory, WP2 builds a platform for data collection, cleaning, analysis and monitoring, exploring innovative open data tools. Specifically, WP2 is building a scalable, fault-tolerant data management and analytics framework, including a data storage system that forms part of a robust and reliable data management system. All of this requires a solid computer system architecture.

The data collected range from announcement-level public procurement data, company-level financial accounts and ownership information, and public organisation-level financial and administrative data to country-level asset declaration data, covering over 30 European countries.

Data are collected, cleaned and linked automatically from public websites and from archived data sites. The process will run daily (or less frequently) and includes extracting entities from the raw data and building a relational database. In addition, network-type data will be maintained in a graph database.
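The pipeline described above can be illustrated with a minimal sketch. It is not the WP2 implementation: the sample notices, the table layout, and the function names (normalise, entity_id, build_database, build_graph) are all assumptions made for illustration, and an in-memory SQLite database and a plain adjacency map stand in for the relational and graph databases.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw announcements, as they might look after scraping.
RAW_NOTICES = [
    {"notice_id": "N1", "buyer": "City of  Exampleton", "supplier": "Acme Ltd", "value_eur": 120000.0},
    {"notice_id": "N2", "buyer": "city of exampleton", "supplier": "Beta GmbH", "value_eur": 45000.0},
    {"notice_id": "N3", "buyer": "Example Ministry", "supplier": "ACME Ltd", "value_eur": 300000.0},
]


def normalise(name):
    """Cleaning step (assumed): collapse whitespace and case so names link up."""
    return " ".join(name.split()).lower()


def entity_id(conn, name):
    """Return the id of an entity, inserting it first if it is new."""
    key = normalise(name)
    conn.execute("INSERT OR IGNORE INTO entity (name) VALUES (?)", (key,))
    return conn.execute("SELECT id FROM entity WHERE name = ?", (key,)).fetchone()[0]


def build_database(notices):
    """Extract entities from raw notices and store them relationally."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE contract (
            notice_id   TEXT PRIMARY KEY,
            buyer_id    INTEGER REFERENCES entity(id),
            supplier_id INTEGER REFERENCES entity(id),
            value_eur   REAL
        );
        """
    )
    for notice in notices:
        buyer = entity_id(conn, notice["buyer"])
        supplier = entity_id(conn, notice["supplier"])
        conn.execute(
            "INSERT INTO contract VALUES (?, ?, ?, ?)",
            (notice["notice_id"], buyer, supplier, notice["value_eur"]),
        )
    return conn


def build_graph(conn):
    """Derive an undirected buyer-supplier network from the contract table."""
    graph = defaultdict(set)
    rows = conn.execute(
        """
        SELECT b.name, s.name
        FROM contract c
        JOIN entity b ON b.id = c.buyer_id
        JOIN entity s ON s.id = c.supplier_id
        """
    )
    for buyer, supplier in rows:
        graph[buyer].add(supplier)
        graph[supplier].add(buyer)
    return graph


conn = build_database(RAW_NOTICES)
graph = build_graph(conn)
```

In this sketch the differently spelled buyer and supplier names are linked to the same entity by the normalisation step, and the graph is derived from the relational tables rather than collected separately. A production system would need far more careful entity matching than lower-casing and whitespace folding.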

The research focus of WP2 is the efficient indexing of data in relatively large-scale data processing. NoSQL databases and efficient graph data structures and storage will be explored.

Another research question concerns complex data collection, which has to extract a large amount of procurement knowledge from various data sources in many countries. To improve reusability and automation, WP2 will build an expert knowledge base to drive automatic data extraction, including a procurement knowledge mapping tool.
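One way to picture such a mapping tool is as a set of per-source rules that translate source-specific field names into a common schema. The sketch below is only an illustration under that assumption: the source identifiers, the field names, and the map_record function are hypothetical and are not taken from the project.

```python
# Hypothetical mapping rules: for each source, source field name -> common field name.
FIELD_MAPPINGS = {
    "source_a": {
        "Buyer name": "buyer",
        "Winner": "supplier",
        "Contract value": "value",
    },
    "source_b": {
        "Auftraggeber": "buyer",
        "Auftragnehmer": "supplier",
        "Auftragswert": "value",
    },
}


def map_record(source, raw):
    """Translate one raw record into the common schema.

    Returns the mapped record plus the list of source fields that have no
    rule yet, so that missing knowledge can be added to the rule set.
    """
    mapping = FIELD_MAPPINGS[source]
    mapped = {}
    unmapped = []
    for field, value in raw.items():
        target = mapping.get(field)
        if target is None:
            unmapped.append(field)
        else:
            mapped[target] = value
    return mapped, unmapped


record, missing = map_record(
    "source_b",
    {"Auftraggeber": "Stadt X", "Auftragswert": "1000", "Los": "2"},
)
```

Keeping the rules as data rather than code is what makes them reusable: supporting a new source means adding a mapping table, not writing a new extractor. Reporting unmapped fields shows where the rule set still has gaps.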

Overall, WP2 is a focal point of the DIGIWHIST project and provides a tool for detecting corruption and budget-deficit risks, with the aim of improving the efficiency and transparency of the use of public resources.


Lead Researcher