The visibility of open repositories was the theme of a meeting organized by the Couperin consortium on November 20th.
The presentations by open access content aggregators such as BASE and CORE, and by the ScanR search engine, showed the essential place of HAL as a data source for the dissemination and impact of French scientific publications.
Bénédicte Kuntziger of the CCSD was invited to present the referencing and visibility of HAL. A previous post has already covered the main sites on which HAL is referenced (the list is kept up to date in the documentation): besides BASE, CORE and ScanR already mentioned, HAL is harvested by Google Scholar, OpenAIRE, Isidore for the humanities and social sciences (SHS), and DART-Europe for theses.
Significant work is being done on the visibility of life science submissions in PubMed, Europe PMC and, through a partnership with INSERM, PubMed Central. Submissions in economics are referenced in RePEc. Software source code deposits are referenced in Software Heritage.
This post addresses more specifically the technical aspects that allow machines to access the contents of HAL.
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol for collecting metadata. It is based on client-server communication; here, the server is HAL. HAL can be harvested as a whole or in separate sets, called OAI sets. The sets offered by HAL are organized by document type, scientific field and collection.
For example, a “client” may collect only theses, or only the collection of a specific laboratory.
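To make the idea of selective harvesting concrete, here is a minimal sketch of how a client could build an OAI-PMH request limited to one set. The verb and parameter names (ListRecords, ListSets, metadataPrefix, set) come from the OAI-PMH protocol itself. The endpoint URL and the set name "type:THESE" are assumptions used for illustration; the actual values should be checked in the HAL documentation or obtained from a ListSets request.

```python
from urllib.parse import urlencode

# Assumption: base URL of the HAL OAI-PMH endpoint (check the HAL documentation).
OAI_BASE_URL = "https://api.archives-ouvertes.fr/oai/hal/"


def build_list_records_url(metadata_prefix="oai_dc", oai_set=None,
                           base_url=OAI_BASE_URL):
    """Build an OAI-PMH ListRecords URL, optionally restricted to one OAI set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if oai_set is not None:
        # The "set" argument limits harvesting to a single OAI set.
        params["set"] = oai_set
    return base_url + "?" + urlencode(params)


# Request listing the sets exposed by the server.
list_sets_url = OAI_BASE_URL + "?" + urlencode({"verb": "ListSets"})

# Hypothetical set name for theses; real names come from the ListSets response.
theses_url = build_list_records_url(oai_set="type:THESE")
print(theses_url)
```

A harvester would then fetch this URL, parse the XML response, and follow any resumptionToken it contains until the whole set has been retrieved.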
APIs (Application Programming Interfaces) are interfaces that allow machine-to-machine communication. From a query, any website can display a list of deposits. AuréHAL data (affiliations, authors, disciplines, journals, projects, metadata lists, etc.) is also available via APIs, which widens the possibilities for exploiting content.
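As an illustration of "from a query, a list of deposits", the sketch below builds a search request URL that a website could call to retrieve deposits as JSON. It is only a sketch under stated assumptions: the base URL, the Solr-style parameters (q, fl, rows, wt) and the field names title_s and uri_s are not taken from this post and should be verified against the HAL API documentation.

```python
from urllib.parse import urlencode

# Assumption: base URL of the HAL search API (check the HAL API documentation).
SEARCH_BASE_URL = "https://api.archives-ouvertes.fr/search/"


def build_search_url(query, fields=("title_s", "uri_s"), rows=10):
    """Build a search URL returning `rows` deposits matching `query` as JSON.

    The field names in `fields` are assumed examples, not a confirmed schema.
    """
    params = {
        "q": query,               # free-text query
        "fl": ",".join(fields),   # fields to return for each deposit
        "rows": rows,             # maximum number of results
        "wt": "json",             # response format
    }
    return SEARCH_BASE_URL + "?" + urlencode(params)


search_url = build_search_url("open access", rows=5)
print(search_url)
```

The website would fetch this URL, decode the JSON response and render the returned titles and links as its list of deposits.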
Another way to expose the data is to offer it in the structured RDF (Resource Description Framework) format, the data model used for the semantic web. HAL metadata is available in this format on the portal data.archives-ouvertes.fr. The portal is new, and we have so far received little feedback on the use of HAL data in RDF format.
Search Engine Optimization
Search Engine Optimization (SEO) is a set of techniques designed to improve the visibility of a web page in search results. Most web users only look at the first results and rarely go beyond the first page, so optimization is an important issue. Work has been carried out on the metadata embedded in the source code of the document pages: this metadata allows better identification of the documents by Google Scholar, and also by other tools such as Zotero.
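To show what "metadata in the source code of the pages" can look like from a tool's point of view, here is a small sketch that extracts bibliographic meta tags from an HTML page. It assumes the pages use the citation_* meta tags described in Google Scholar's inclusion guidelines; the post does not say which tags HAL emits, and the sample HTML below is invented for the example.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.startswith("citation_") and content is not None:
            # A tag such as citation_author may be repeated, so keep a list.
            self.metadata.setdefault(name, []).append(content)


# Invented sample page, not an actual HAL page.
SAMPLE_HTML = """
<html><head>
<meta name="citation_title" content="An example title">
<meta name="citation_author" content="Doe, Jane">
<meta name="citation_author" content="Martin, Paul">
<meta name="viewport" content="width=device-width">
</head><body></body></html>
"""

parser = CitationMetaParser()
parser.feed(SAMPLE_HTML)
citation_metadata = parser.metadata
print(citation_metadata)
```

Tags that do not describe the document, such as the viewport tag in the sample, are ignored, which is roughly how a reference manager separates bibliographic metadata from the rest of the page.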
In addition, to improve SEO in Google and Google Scholar, and at the request of their services, the URLs giving access to documents have been deduplicated. A submission can be reachable from several URLs (portal, collection), which increases its visibility but makes it harder for Google’s algorithm to identify the source of a document. Since this year, the file URL provided to Google’s robots is that of the portal used for the submission. This change also benefits other search engines.
Uniform URL structure
Each submission has an identifier that appears in its URL, in the format https://hal.archives-ouvertes.fr/hal-XXXXXXXX. The URL of the main file always follows the format https://hal.archives-ouvertes.fr/hal-XXXXXXXX/document. For example, the URL https://hal.archives-ouvertes.fr/hal-01917105/document gives access to the file of hal-01917105.
This is how Episciences automatically ‘finds’ the URL of the main document.
By the same principle, a program or robot can predict the URLs of the metadata export formats of a document.
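The URL scheme above can be sketched in a few lines. The record and /document patterns are the ones given in this post; the identifier check ("hal-" followed by digits) and the export suffix used in the example ("bibtex") are assumptions, since the post does not list the export format names.

```python
import re

HAL_BASE_URL = "https://hal.archives-ouvertes.fr"

# Assumption: an identifier is "hal-" followed by digits, as in hal-01917105.
HAL_ID_PATTERN = re.compile(r"^hal-\d+$")


def record_url(hal_id):
    """Return the URL of the submission page for a HAL identifier."""
    if not HAL_ID_PATTERN.match(hal_id):
        raise ValueError(f"unexpected HAL identifier: {hal_id!r}")
    return f"{HAL_BASE_URL}/{hal_id}"


def main_file_url(hal_id):
    """Return the URL of the main file, following the /document pattern."""
    return record_url(hal_id) + "/document"


def export_url(hal_id, export_format):
    """Return a predicted metadata export URL.

    The suffix passed as `export_format` is hypothetical and must be
    checked against the HAL documentation.
    """
    return f"{record_url(hal_id)}/{export_format}"


print(main_file_url("hal-01917105"))
print(export_url("hal-01917105", "bibtex"))  # hypothetical export suffix
```

Because the pattern is fixed, a client only needs to store the identifier and can derive the other URLs on demand.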
But where do the users come from?
According to consultation statistics, most users arrive on a HAL page via a search engine, with Google at the top. In 2018, the statistics record more than 3 million visits via Google. We also note an increase in 2018 in consultations from social networks; Facebook leads these but remains very far behind Google.
As can be seen, the CCSD relies on standardized formats, standards and protocols to facilitate access to HAL publications. In particular, it follows the work done within COAR (Confederation of Open Access Repositories), an international association that brings together the open repository community.
As co-organizer of the next COAR meeting, the CCSD will also be pleased to welcome its partners in Lyon in May 2019 for the COAR general assembly and annual congress.