Consultation statistics: Simplifying bot filtering

Written by Benoît Legouy

The CCSD has observed that identifying the traces left on HAL by bots and web crawlers is becoming increasingly complex. The current filtering system, which flags traces of more than 100 consultations per day from the same machine, does not reliably eliminate bot activity. From 2025 onwards, this threshold will no longer be taken into account.

Production of HAL statistics since 2019

In 2019, the CCSD launched its Kibana platform, which allows portal administrators to query a dedicated database for HAL documents and their consultations. The needs identified during the preliminary survey focused mainly on the general statistical survey of university libraries (eSGBU), which administrators complete annually.

Initial dashboards were implemented to provide general key indicators, followed by more specific reports tailored to the eSGBU survey. A working group formed jointly by the board of the users club CasuHAL and the CCSD quickly took charge of these dashboards, which are now managed by CasuHAL.

In order to facilitate access to these resources and benefit all users, new simplified databases have been created, allowing consultation figures to be integrated directly into the HAL interface. Document metrics were introduced with the interface update at the end of 2022, followed by dashboards accessible to depositors, collection managers and portal administrators.

Handling bots in HAL statistics

Since the launch of HAL’s Kibana platform, statistics have been extracted from server logs using the Ezpaarse software developed by Inist. This software excludes bots listed in the COUNTER registry: the corresponding log entries are not included in HAL’s statistical databases.

In collaboration with the Inist team, the CCSD experimented with an Ezpaarse feature to filter out bots not listed in the COUNTER registry. This feature, developed at the request of the CCSD, allowed log entries left by machines that accessed more than 100 documents in a single day to be flagged. These entries remained in the database, but could be filtered out of the figures displayed.
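The flagging rule described above can be sketched in a few lines of Python. This is only an illustration of the principle, not the Ezpaarse implementation: the entry fields (`ip`, `date`, `doc_id`), the function names, and the choice to count every log entry as one consultation are assumptions made for the example.

```python
from collections import Counter

DAILY_THRESHOLD = 100  # "more than 100 ... in a single day", as stated in the text


def flag_heavy_machines(entries, threshold=DAILY_THRESHOLD):
    """Return copies of the entries with an added boolean 'flagged' field.

    Hypothetical entry format: {"ip": str, "date": "YYYY-MM-DD", "doc_id": str}.
    An entry is flagged when its machine (identified by IP, an assumption)
    produced strictly more than `threshold` entries on that day.
    """
    per_machine_day = Counter((e["ip"], e["date"]) for e in entries)
    return [
        {**e, "flagged": per_machine_day[(e["ip"], e["date"])] > threshold}
        for e in entries
    ]


def displayed_count(flagged_entries, exclude_flagged=True):
    """Count consultations, optionally leaving out flagged entries.

    Flagged entries stay in the data; they are only hidden at display time.
    """
    return sum(
        1 for e in flagged_entries if not (exclude_flagged and e["flagged"])
    )
```

Because flagged entries are kept rather than deleted, the same data can produce both a filtered and an unfiltered figure, which matches the behaviour described for the dashboards.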

After several years of use, we found that this feature did not always successfully exclude the targeted machines. More importantly, we noticed changes in the way bots interacted with HAL, sometimes in a malicious way. In particular, web crawlers have evolved and now often include mechanisms to bypass maximum visit limits, typically by spreading their activity across multiple machines. There are also other techniques to evade machine identification by IP address.
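A small, self-contained Python sketch shows why a per-machine daily threshold is easy to evade. The figures are invented for illustration (1,000 requests, 20 addresses), the IP addresses come from reserved documentation ranges, and identifying a machine by its IP address is an assumption of the example.

```python
from collections import Counter

THRESHOLD = 100  # per-machine daily limit discussed in the text


def machines_over_threshold(requests, threshold=THRESHOLD):
    """Return the (ip, date) pairs with strictly more than `threshold` requests.

    `requests` is a list of (ip, date) tuples, a hypothetical simplified log.
    """
    counts = Counter(requests)
    return {key for key, n in counts.items() if n > threshold}


DAY = "2024-05-01"

# A crawler sending 1,000 requests from a single address is caught.
single_machine = [("203.0.113.1", DAY)] * 1000

# The same 1,000 requests spread over 20 addresses (50 each) stay under the limit.
distributed = [(f"198.51.100.{i % 20}", DAY) for i in range(1000)]

caught_single = machines_over_threshold(single_machine)
caught_distributed = machines_over_threshold(distributed)
```

In this sketch `caught_single` contains the one offending address, while `caught_distributed` is empty even though the total traffic is identical.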

These findings led us to conclude that tagging certain log entries was not only inaccurate but also misleading, as it gave the impression that filtered figures were free of bot activity. Several administrators were surprised by the sharp increase in queries in 2024, despite no change in the filtering techniques used. Our review confirms a real increase in HAL traffic, but with our current tools we cannot estimate the proportion of bot activity – just as we could not in 2023 and previous years, despite the filtering measures in place.

What changes in 2025

Recognising the problems caused by this ineffective filter, and lacking the resources to improve it, we have decided to remove the marking of visits over 100 in a single day from the statistics database. In line with widely accepted standards, particularly those used by most other platforms, we will only exclude bots listed in the COUNTER registry. This is in addition to the ad-hoc filters implemented by the CCSD system administration team, which is responsible for mitigating both malicious and non-malicious activity that overloads the HAL infrastructure.

For consistency, the 2024 consultation figures currently displayed on the interface will remain unchanged and will therefore be fully comparable with previous years, including for the 2024 eSGBU survey, as they were produced using the same methodologies. The Kibana dashboards will also remain unchanged, but from the 2025 data onwards the bot marking filter will no longer have any effect.

Maintaining HAL’s statistical platform requires year-round operations, often made more challenging by the large volume of data handled. Implementing mechanisms to distinguish human users from automated visits requires significant resources – an effort that, to our knowledge, few platforms undertake. As we have often emphasised, access data for a document or collection should not be interpreted as a direct reflection of human user interest.
