Quantitative Data Collection and Cleaning: Cleaned and Structured Databases

Jan Hrubý, Tomáš Pošepný, Jakub Krafka, Tomáš Mrázek, Marek Mikeš and Jiří Skuhrovec

Collecting raw public procurement data is about understanding the source systems: what data they offer and how that data can be obtained (more details in our publication on raw data). To create a structured database, we also need to understand the data itself and store it according to a data template designed to best support the analytical work in other deliverables of the DIGIWHIST project.

The system for data collection and transformation has been designed to process data in several stages so that we can go back a step at any time without losing any information. This enables us to identify the exact point in the process where errors appear and fix them without having to repeat full (and costly) data extraction processes. In short, it makes the whole development-validation cycle more efficient.
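The staged design can be illustrated with a minimal sketch. It is not the DIGIWHIST implementation: the names `StagedStore`, `run_stage`, `parse` and `clean`, the record layout, and the in-memory dictionary standing in for persistent storage are all assumptions made for illustration. The point it shows is that every stage's output is kept, so a later stage can be re-run from the stored output of the previous one.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class StagedStore:
    """Hypothetical store that keeps the output of every stage separately."""

    def __init__(self) -> None:
        self._stages: Dict[str, List[Record]] = {}

    def save(self, stage: str, records: List[Record]) -> None:
        # Copy records so later stages cannot mutate earlier outputs.
        self._stages[stage] = [dict(r) for r in records]

    def load(self, stage: str) -> List[Record]:
        return [dict(r) for r in self._stages[stage]]


def run_stage(store: StagedStore, source: str, target: str,
              transform: Callable[[Record], Record]) -> None:
    """Read the stored output of `source`, transform it, store it as `target`."""
    store.save(target, [transform(r) for r in store.load(source)])


def parse(raw: Record) -> Record:
    # Stage 2 (parsing): split a "key=value;key=value" body into fields,
    # keeping the values exactly as published.
    parts = str(raw["body"]).split(";")
    return dict(part.split("=", 1) for part in parts if part)


def clean(parsed: Record) -> Record:
    # Stage 3 (cleaning): normalise whitespace and number formats.
    return {
        "title": str(parsed["title"]).strip(),
        "price": float(str(parsed["price"]).replace(",", ".")),
    }


store = StagedStore()
# Stage 1 (raw data) is saved once; the expensive extraction is not repeated.
store.save("raw", [{"body": "title= Road repair ;price=1200,50"}])
run_stage(store, "raw", "parsed", parse)
run_stage(store, "parsed", "clean", clean)

# If the cleaning rules change, only the third stage needs to run again.
run_stage(store, "parsed", "clean", clean)
```

Because the raw and parsed outputs stay untouched, an error found in the cleaned data can be traced back stage by stage and fixed where it first appears.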

This publication describes how we treat data in the second (parsing) and third (cleaning) stages.

 

