Automatic extraction of the license from the file: an easy way to complete your deposit

Written by Agnès Magron

The feature that extracts metadata from a PDF file now also detects legal notices about distribution (Creative Commons licenses and copyright statements) and uses them to complete the deposit form automatically.

The application, which already extracts authors, titles, abstracts, journal titles, and references to ANR projects (see our previous post), now retrieves the license statements present in the main file. As with the other metadata, the goal is to populate the corresponding field of the form automatically.
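To give a rough idea of what detecting a license statement in extracted text can look like, here is a minimal sketch in Python. It is not the extraction tool used by HAL: the function name `find_cc_license`, the regular expression, and the output label format are all assumptions made for illustration. The sketch only recognizes Creative Commons license URLs and ignores CC0, copyright statements, and licenses written out in words.

```python
import re

# Illustrative pattern (an assumption, not HAL's actual rule): matches
# Creative Commons license URLs such as
# https://creativecommons.org/licenses/by-nc/4.0/
CC_URL = re.compile(
    r"creativecommons\.org/licenses/(by(?:-nc)?(?:-nd|-sa)?)/(\d\.\d)",
    re.IGNORECASE,
)


def find_cc_license(text):
    """Return a label such as 'CC-BY-NC-4.0', or None if no CC URL is found."""
    match = CC_URL.search(text)
    if match is None:
        return None
    terms, version = match.groups()
    return f"CC-{terms.upper()}-{version}"


if __name__ == "__main__":
    sample = (
        "This article is distributed under the terms of "
        "https://creativecommons.org/licenses/by-nc/4.0/"
    )
    print(find_cc_license(sample))
```

A real extractor has to cope with far messier input (line breaks inside URLs, license names without a URL, publisher-specific wording), which is why a dedicated text-mining tool is used rather than a single pattern.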


Screenshot of the section of the deposit form where the license extracted from the PDF file is filled in.

What is the goal? First and foremost, to make the deposit form easier to fill in, and thereby to encourage the enrichment of HAL with complete data. The dissemination license is often overlooked: as of December 4, 2024, only 363,504 deposits with files had this metadata field completed.

Display conditions for dissemination and reuse

Completing this information clarifies the conditions under which a publication deposited in HAL may be copied, distributed, modified, or reused, and spells out the usage rights attached to the document. If the author has chosen a Creative Commons license, it encourages sharing of the work within the rules the author has set.

This metadata is particularly easy for digital tools such as search engines, applications, and databases to identify and understand. If it contains a Creative Commons license, these tools can integrate the publication into distribution or reuse workflows that are compatible with the defined rights.

The importance of this metadata is particularly evident in HAL’s deposit suggestion service. This service relies on Creative Commons licenses to meet a key requirement: a publication is only suggested in the deposit suggestion interface if it is openly distributable on HAL. If the publisher does not provide the license in the metadata associated with the DOI, HAL cannot make the suggestion (see our previous post on the suggestion service).
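The eligibility rule described above can be sketched as a simple filter. This is a simplified illustration under stated assumptions, not the suggestion service's actual code: the record layout (a dictionary with a hypothetical `license_urls` key), the function name `is_suggestable`, and the sample DOIs are invented for the example, and the real service may apply further checks.

```python
# Hypothetical marker for an open license URL (assumption for this sketch).
OPEN_LICENSE_MARKER = "creativecommons.org/licenses/"


def is_suggestable(record):
    """Return True when the record's metadata carries a Creative Commons license URL."""
    urls = record.get("license_urls", [])
    return any(OPEN_LICENSE_MARKER in url.lower() for url in urls)


if __name__ == "__main__":
    # Made-up records standing in for publisher metadata attached to a DOI.
    candidates = [
        {"doi": "10.0000/example.1",
         "license_urls": ["https://creativecommons.org/licenses/by/4.0/"]},
        {"doi": "10.0000/example.2", "license_urls": []},
        {"doi": "10.0000/example.3"},  # publisher provided no license at all
    ]
    suggested = [c["doi"] for c in candidates if is_suggestable(c)]
    print(suggested)
```

The second and third records are skipped for the same reason: without a license in the metadata, nothing tells the service that the publication may be openly distributed.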

Completing this metadata in an open archive like HAL is therefore essential for encouraging the sharing of research while protecting the authors’ rights.

Part of the Equipex+ HALiance Project

The extraction of this metadata from PDF files is part of the Equipex+ HALiance project, specifically within Work Package 3. This work package aims to retrieve metadata and identifiers from deposited files and automatically enrich the HAL database. CCSD collaborates with Science-Miner, a company that develops open-source tools for exploring scientific texts, to achieve this goal.
