The stakes of dealing with complex data

The new stakes of data science have been widely covered in the media, for instance in The Guardian.

Short presentation of the content of the knowledge base

This knowledge base aims to answer a set of simple questions that researchers ask themselves when dealing with data-intensive research. It is intended for beginners as well as for experts. For each part, the Governance Analytics team provides the basics as well as more advanced material.

We begin by presenting the Main tools that researchers can use. We then present the Databases accessible to researchers.

The Main tools allow researchers to extract and process data. The first step is a Data extraction and cleaning phase: Data extraction from the web often has to be followed by a Data cleaning procedure. The subsequent procedure depends on the nature of the data, textual or numerical.
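As a minimal sketch of what a cleaning step after web extraction can look like, the Python example below strips markup from an HTML fragment, drops script and style content, and normalizes whitespace. The function name clean_fragment and the sample fragment are hypothetical and chosen for illustration only. Real pages usually need more care (encodings, tables, boilerplate removal), and joining text nodes with a space can split a word that is broken up by inline tags.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, skipping script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_fragment(raw_html):
    """Return the visible text of raw_html with whitespace collapsed.

    Hypothetical helper for illustration; not a full cleaning pipeline.
    """
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any buffered text
    text = " ".join(parser.parts)
    # Collapse runs of whitespace (including non-breaking spaces) to one space.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "<p>GDP&nbsp;growth:  <b>2.5%</b></p>\n<script>var x=1;</script>"
    print(clean_fragment(sample))
```

The cleaned string can then be passed on to whichever textual or numerical procedure suits the data.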

Textual data can be the object of a Textual analysis. The researcher can also turn to the Use of semantic technologies to annotate and search data. Numerical data call for Classification and analysis. After a phase of Classification, the researcher may wish to perform Clustering, Forecasting, Regression Analysis, and Knowledge discovery.

Numerical data, or textual data converted into numerical information, can then be the subject of an Econometric analysis. For researchers who are new to the field, we provide a short Introduction to econometrics, followed by an extensive set of Resources for econometrics for more advanced users.
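To give beginners a feel for the kind of computation an econometric analysis starts from, here is a minimal sketch of a simple ordinary least squares (OLS) regression with one explanatory variable, written in plain Python. The function name ols_fit and the toy numbers are illustrative assumptions, not material from the knowledge base. Applied work would normally rely on a dedicated statistics package and would also report standard errors and diagnostics.

```python
def ols_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Hypothetical helper for illustration; returns (intercept, slope).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x must not be constant")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


if __name__ == "__main__":
    # Toy data lying exactly on the line y = 1 + 2x.
    intercept, slope = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
    print(intercept, slope)
```

The slope is the ratio of the covariance of x and y to the variance of x, which is the closed-form OLS solution in the one-variable case.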

The second main purpose of this knowledge base is to provide beginner and advanced researchers with a set of Databases. We do not aim to be exhaustive, and we welcome your input to expand this knowledge base.

The recent evolution of data science has been marked by the expansion of Open data, that is, openly accessible data. Such data can come either from Government data or from NGO data. These databases can also be connected through Linked open data and linked repositories.

We also present more traditional datasets from a wide range of data providers. These traditional datasets have recently been challenged by the rise of long-run Historical datasets. Such data often rest on the exploitation of physical archives, of which we present the most interesting examples and possibilities.

We conclude by presenting comparable Knowledge bases in the world, which may help researchers extend their search for information on data-driven research.

This knowledge base is a Work in progress. The Governance Analytics team welcomes your input to make it more user-friendly and to incorporate any data tools or databases you think would be of particular interest for this project. Please don’t hesitate to write to the Governance Analytics team.