The SoFAIR project aims to simplify the identification of software cited in scientific publications in order to better describe and promote these tools, which are essential to research progress. The project’s ambition is to develop a machine learning-assisted workflow that can be integrated into existing infrastructures. The system is currently being tested as a proof of concept on the HAL-Inria portal.
Have you ever mentioned software in one of your publications? If so, was this software described and archived? If not, or if you are unsure, you will undoubtedly be interested in the SoFAIR project.
The project is based on the observation that much research software does not comply with the FAIR principles (Findable, Accessible, Interoperable, Reusable) because mechanisms for identifying, linking and archiving it are lacking.
Even when the software used is publicly available, it is difficult to locate, cite or reuse if it is only mentioned in the body of a publication, or even in a footnote.
A workflow for managing the lifecycle of software resources
The SoFAIR project team aims to provide and deploy an integrated solution for managing the lifecycle of software resources.
This solution builds on GROBID, an application already familiar to HAL users: it extracts metadata from PDF files, such as bibliographic references or funding project codes.
SoFAIR’s focus is on identifying and extracting references to software in PDF files stored in open institutional repositories such as HAL-Inria.
Software lifecycle (adapted from the illustration Software Assets Lifecycle – SoFAIR Project)
After extraction and enrichment, the next phase involves disambiguating and validating the discovered mentions. An important step in the workflow is therefore to contact the publication’s authors so that they can validate the identified and enriched information.
Software mentions identified in a submitted publication are displayed on the deposit record. These are only visible to the publication’s authors after they have logged in.

Once validated, the deposit can be enriched with explicit links to the software. These validated software references are then visible to everyone.
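The visibility rule described above can be illustrated with a minimal sketch. It assumes a simplified data model: the `SoftwareMention` class, the `visible_mentions` function and the author identifiers are hypothetical names chosen for illustration and do not reflect the actual HAL implementation.

```python
from dataclasses import dataclass


@dataclass
class SoftwareMention:
    """Hypothetical record for a software mention found in a deposit."""
    name: str
    validated: bool = False  # set to True once an author has confirmed it


def visible_mentions(mentions, viewer, authors):
    """Return the mentions a given viewer may see.

    Assumption: `viewer` is the identifier of a logged-in user, or None
    for an anonymous visitor; `authors` is the set of author identifiers.
    """
    is_author = viewer is not None and viewer in authors
    # Validated mentions are public; unvalidated ones are author-only.
    return [m for m in mentions if m.validated or is_author]
```

In this sketch, an anonymous visitor only sees mentions that have been validated, whereas a logged-in author also sees those still awaiting validation.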
If the software has already been archived in Software Heritage and has an SWHID identifier, it will be referenced in the repository. Otherwise, the workflow sends a request to Software Heritage to register the resource, which is then permanently archived and assigned a permanent identifier; HAL-Inria is then notified.
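The decision described above can be sketched as follows. This is only an illustration: it checks the core SWHID syntax (`swh:1:`, an object type, then 40 hexadecimal digits, without optional qualifiers), and the function names and the returned labels `"reference"` and `"request_archival"` are hypothetical, not part of the SoFAIR or Software Heritage interfaces.

```python
import re

# Core SWHID syntax: swh:1:<object type>:<40 lowercase hex digits>.
# Object types: cnt (content), dir (directory), rev (revision),
# rel (release), snp (snapshot). Qualifiers are not handled here.
SWHID_RE = re.compile(r"swh:1:(cnt|dir|rev|rel|snp):[0-9a-f]{40}")


def is_core_swhid(value):
    """Return True if `value` is a syntactically valid core SWHID."""
    return SWHID_RE.fullmatch(value) is not None


def next_step(swhid):
    """Choose the workflow branch (hypothetical labels).

    A known, well-formed SWHID means the software can be referenced
    directly; otherwise an archival request would be needed.
    """
    if swhid and is_core_swhid(swhid):
        return "reference"
    return "request_archival"
```

A syntactic check of this kind does not confirm that the object actually exists in the archive; it only separates mentions that already carry an identifier from those that still need one.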
Funded by the ANR, the SoFAIR project brings together teams from Inria, the Open University, the Brno University of Technology, the Institute of Literary Research of the Polish Academy of Sciences (IBL-PAN) and the European Bioinformatics Institute.
The system, which will be deployed in HAL-Inria from mid-December onwards, is a proof of concept for institutional repositories. The project also includes a EuropePMC demonstrator for life sciences and a digital humanities case study (with links to the DARIAH and EOSC infrastructures). This workflow depends on the availability and sustainability of an external database.
The development of this service is part of the Equipex+ HALiance project.
