Add Zenodo as Data Source for Commons Quantification #249

@Goziee-git

Problem

Zenodo is a major repository for open access research outputs with 5.5M+ records, but is not currently included in our commons quantification project. Adding Zenodo would significantly expand our coverage of Creative Commons licensed content, particularly in academic and research domains.

Description

Implement data collection from Zenodo using their REST API to gather license information for quantifying the commons. This involves:

  • Fetching records with structured license metadata
  • Classifying Creative Commons and other open licenses
  • Generating reports by year, resource type, and language
  • Handling API rate limiting and pagination
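The classification step could look roughly like the following. This is a minimal sketch, assuming license identifiers of the form quoted in this issue (e.g. "cc-by-4.0"). The bucket names, the `classify_license` helper, and the contents of `OTHER_OPEN_LICENSES` are illustrative assumptions, not an agreed taxonomy for the project.

```python
# Hypothetical, deliberately short list of non-CC open license ids.
# A real implementation would need a reviewed, complete mapping.
OTHER_OPEN_LICENSES = {"mit", "apache-2.0", "gpl-3.0"}


def classify_license(license_id):
    """Map a license identifier to a coarse bucket for reporting."""
    if not license_id:
        return "unknown"
    normalized = license_id.strip().lower()
    # CC0 is checked first so it is not lumped in with the "cc-" licenses.
    if normalized.startswith("cc0"):
        return "cc0"
    if normalized.startswith("cc-"):
        return "creative-commons"
    if normalized in OTHER_OPEN_LICENSES:
        return "other-open"
    return "other"
```

Keeping the classifier a pure function of the identifier string makes it easy to unit test without calling the API.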

Zenodo Useful Links

Official Documentation

API Endpoints

  • Base URL: https://zenodo.org/api
  • Records Search: https://zenodo.org/api/records
  • Single Record: https://zenodo.org/api/records/{id}
  • Communities: https://zenodo.org/api/communities

Technical Details

Query Strategy

GET https://zenodo.org/api/records?q=*&size=100&page=1&sort=bestmatch

Parameters:

  • q: Query string (use * for all records)
  • size: Records per page (an implementation choice: 300 is planned, while the example request above uses 100)
  • page: Page number for pagination
  • sort: Sorting method (bestmatch recommended)
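A paginated fetch with basic rate-limit handling might be sketched as below. It uses only the parameters listed above. The helper names (`build_search_url`, `fetch_json`, `iter_records`), the one-second pause, the exponential backoff on HTTP 429, and the assumption that results arrive under `hits.hits` in the JSON response are choices made for this sketch and should be checked against the Zenodo documentation.

```python
import json
import time
import urllib.error
import urllib.request
from urllib.parse import urlencode

RECORDS_URL = "https://zenodo.org/api/records"


def build_search_url(page, size=100, query="*", sort="bestmatch"):
    """Build the search URL for one page of results."""
    params = {"q": query, "size": size, "page": page, "sort": sort}
    return f"{RECORDS_URL}?{urlencode(params)}"


def fetch_json(url, retries=3, backoff=5.0):
    """GET a URL and decode JSON, backing off when rate limited (HTTP 429)."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                return json.load(response)
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a rate limit, or the final failure.
            if err.code != 429 or attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)


def iter_records(fetch=fetch_json, size=100, max_pages=None, pause=1.0,
                 sleep=time.sleep):
    """Yield records page by page until an empty page or max_pages is hit."""
    page = 1
    while max_pages is None or page <= max_pages:
        payload = fetch(build_search_url(page, size))
        hits = payload.get("hits", {}).get("hits", [])
        if not hits:
            break
        yield from hits
        page += 1
        sleep(pause)  # stay polite between requests
```

Passing `fetch` and `sleep` as arguments lets the pagination logic be tested with a stub instead of live requests, for example `list(iter_records(fetch=my_stub, sleep=lambda s: None))`.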

API Types Available

  1. REST API (Recommended)

    • Format: JSON
    • Authentication: None required for public records
    • Structured license data: metadata.license.id
  2. OAI-PMH (Not recommended)

    • Format: XML Dublin Core
    • License parsing is unreliable because license information appears only in free-text fields (dc:rights)

Key Metadata Fields

  • License: metadata.license.id (structured, e.g., "cc-by-4.0")
  • Access Rights: metadata.access_right ("open", "restricted", "embargoed", "closed")
  • Publication Date: metadata.publication_date (ISO format)
  • Resource Type: metadata.resource_type.title
  • Language: metadata.language (ISO codes)
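To show how these fields could feed the planned reports, here is a rough sketch that pulls them out of one record and counts licenses per year. The field paths are the ones listed above. The helper names (`extract_fields`, `count_by`) and the sample records are made up for illustration, and real records may omit any of these fields, so every lookup falls back to None.

```python
from collections import Counter


def extract_fields(record):
    """Pull the report-relevant fields out of one record dict."""
    meta = record.get("metadata") or {}
    license_info = meta.get("license") or {}
    resource_type = meta.get("resource_type") or {}
    publication_date = meta.get("publication_date") or ""
    return {
        "license": license_info.get("id"),
        "access_right": meta.get("access_right"),
        # ISO dates start with the four-digit year.
        "year": publication_date[:4] or None,
        "resource_type": resource_type.get("title"),
        "language": meta.get("language"),
    }


def count_by(records, key):
    """Count (license, <key>) pairs, e.g. key="year" or "resource_type"."""
    counts = Counter()
    for record in records:
        fields = extract_fields(record)
        counts[(fields["license"], fields[key])] += 1
    return counts


# Made-up sample records, shaped after the field paths listed above.
sample = [
    {"metadata": {"license": {"id": "cc-by-4.0"},
                  "publication_date": "2021-03-15",
                  "resource_type": {"title": "Dataset"},
                  "language": "eng"}},
    {"metadata": {"license": {"id": "cc-by-4.0"},
                  "publication_date": "2021-11-02",
                  "resource_type": {"title": "Journal article"}}},
    {"metadata": {"publication_date": "2020-01-01"}},
]
by_year = count_by(sample, "year")
```

The same `count_by` call with "resource_type" or "language" as the key would produce the other two breakdowns.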

Implementation

  • I would be interested in implementing this feature.
