Open Science & Compliance Overview
FAIR Principles • CC0 Data • MIT Licensed Code • Open Access • Versioned Releases • RDF/OWL Standards • SPARQL Endpoint • Reproducible Pipeline
FAIR Principles
SemRepo adheres to the FAIR data principles:
- Findable – All releases are published with persistent identifiers and are indexed via Zenodo, GitHub, and the project website.
- Accessible – Data is openly available as RDF dumps and through a public SPARQL endpoint that requires no authentication.
- Interoperable – The dataset uses Semantic Web standards (RDF, OWL, SPARQL, VoID) and links to external scholarly knowledge graphs.
- Reusable – Data and pipelines are openly licensed and versioned to support reuse, replication, and extension.
Accessibility & Interoperability
SemRepo provides multiple access methods to support diverse use cases:
- RDF data dumps for full dataset download
- SPARQL endpoint for structured querying
- Standardized ontologies (RDF/OWL) for semantic consistency
- Interlinking with external scholarly knowledge graphs for integration into broader ecosystems
These design choices are intended to keep the dataset usable in both academic and industrial applications.
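As a minimal sketch of the SPARQL access path, the Python below builds a standard SPARQL 1.1 Protocol GET request and reads the JSON result format, using only the standard library. The endpoint URL is a placeholder (this overview does not state the real address), and the query is deliberately generic (it lists classes present in the graph) because SemRepo's vocabulary is not described here. The function names are illustrative and not part of any SemRepo tooling.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint URL: the real address is not given in this overview.
ENDPOINT = "https://example.org/semrepo/sparql"

# Generic query that is valid against any RDF graph: ten distinct classes.
QUERY = """
SELECT DISTINCT ?class WHERE {
  ?s a ?class .
} LIMIT 10
"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def extract_values(results: dict, variable: str) -> list:
    """Collect the values bound to one variable in a SPARQL JSON result."""
    return [
        row[variable]["value"]
        for row in results["results"]["bindings"]
        if variable in row
    ]


def run_query(endpoint: str, query: str) -> dict:
    """Send the query and decode the JSON response (needs network access)."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With a real endpoint address substituted for the placeholder, `extract_values(run_query(ENDPOINT, QUERY), "class")` would return the class IRIs as a list of strings.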
Reusability & Versioning
SemRepo follows a versioned release strategy with periodic updates (approximately twice per year, depending on upstream data availability).
Each release includes:
- A complete dataset snapshot
- Metadata and provenance information
- Fully reproducible construction pipelines
This ensures that results can be independently verified and extended by the research community.
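One way a downstream user might make an analysis traceable to a specific release is to record which snapshot file was used together with its checksum. The sketch below is an assumption about user-side practice, not a description of SemRepo's own pipeline: the release label and file name are hypothetical, and the overview does not say whether checksums are published with each release.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large RDF dumps need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_snapshot(path: Path, release: str) -> dict:
    """Return a small provenance record for a downloaded snapshot file.

    `release` is whatever label the user assigns (hypothetical here);
    it is stored as given and not validated.
    """
    return {
        "release": release,
        "file": path.name,
        "bytes": path.stat().st_size,
        "sha256": sha256_of(path),
    }
```

Storing this record next to published results lets others confirm that they are working from a byte-identical snapshot.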
Licensing
SemRepo is released under open licenses to maximize reuse and transparency:
- Data: Creative Commons CC0 (public domain dedication)
- Code & Pipeline: MIT License
This licensing model permits both commercial and academic reuse of the data and software. CC0 places no conditions on the data, and the MIT License requires only that the copyright and license notice be retained in copies of the code.
Ethical Considerations
SemRepo is built exclusively from publicly available software repositories and scholarly metadata sources.
However, we acknowledge that it inherits structural biases and coverage limitations from upstream sources such as GitHub and linked scholarly knowledge graphs, including uneven representation across languages, regions, and research communities. We provide provenance tracking and regular updates so that users can interpret and use the dataset responsibly.