Abstract:
Following the Open Government Data (OGD) initiatives, governments and agencies around the world have been making raw data freely available to the public through their OGD portals and websites. However, despite its many benefits to society, the full potential of OGD has not yet been realized, owing to the lack of efficient mechanisms for interlinking and integrating large datasets from different sources. Among the various causes relating to data quality, interoperability and dissemination mechanisms, two main issues stand out. First, raw datasets are published in heterogeneous formats such as PDF, Excel and CSV, which are not readily machine-readable or directly queryable using current web technologies. Second, datasets are distributed across, and locked within, the databases (portals) of different agencies, which use a variety of technologies and schemas (vocabularies). Currently, these challenges are being tackled with Semantic Web (Linked Data) technologies and standards such as RDF, URIs and SPARQL, which allow highly interoperable, machine-readable datasets to be published.
In this study, we argue that key players in the Ethiopian OGD ecosystem, such as CSA (Central Statistics Agency), should start releasing their data in 4- or 5-star RDF format in addition to their traditional dissemination mechanisms. Moreover, we propose a customized methodology for publishing Ethiopian Linked Open Government Data (LOGD), consisting of (i) a demonstration of the Linked Open Data (LOD) publication steps using pilot statistical datasets, and (ii) a core ontology model, based on the RDF Data Cube Vocabulary and geographic areas, for interlinking and integrating Ethiopian statistical datasets both domestically and globally. In the process, eight datasets (five statistical datasets and three administrative-division datasets) from CSA, MOE and MOH were modeled and converted to LOD format. The methodology was evaluated by presenting three generic use-case scenarios that show the limitations of existing OGD platforms in integrating the selected datasets. Our findings show that these limitations can be addressed by converting the datasets into LOD format and by publishing them using standard (controlled) vocabularies, customized where necessary, together with SPARQL services.