Bad Bots, Good Bots: Filtering Robots to Reduce Their Impact

Written by Agnès Magron

Building on the trend observed in 2024, the year 2025 saw a significant increase in bot traffic. The infrastructures of HAL and Episciences were affected by particularly aggressive bot requests. In response, the CCSD implemented filtering measures to mitigate the impact of these bots while ensuring that legitimate users and useful bots (e.g. those employed for scientific indexing, archiving, and partner services) were not adversely affected. The challenge lies in preserving the availability and performance of services while preventing resource saturation, all the while maintaining the openness of the infrastructure, which is essential for disseminating and making content visible.

Bot activity has always been intense on HAL; search engines, content aggregators and indexing tools play a crucial role in content discovery and dissemination. However, like many other open infrastructures (e.g., Wikipedia, arXiv, RePEc or DOAJ), HAL has observed a concerning trend in recent years: bot-generated traffic has increased significantly, and bot behaviour has become more aggressive.

‘Aggressive bots’ are those that disregard standard web crawler conventions: they send an excessive number of requests, fail to identify themselves clearly, or ignore crawling guidelines (as set out in the robots.txt file). Whether or not they are linked to artificial intelligence applications, these bots flood servers with requests, consume energy resources, and degrade the experience of legitimate users by slowing down response times. For example, users may encounter an HTTP 429 “Too Many Requests” error after performing a search. Temperature alerts on machines during traffic spikes have also been reported.
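As an illustration of the guidelines mentioned above, a robots.txt file declares which paths crawlers may visit and can request a delay between visits. The paths and bot name below are hypothetical examples, not HAL’s actual rules:

```
# Ask well-behaved crawlers to pause between requests
# and stay out of an expensive search endpoint
# (Crawl-delay is a widely recognised, non-standard directive)
User-agent: *
Crawl-delay: 10
Disallow: /search

# Exclude a specific crawler entirely (hypothetical name)
User-agent: BadBot
Disallow: /
```

Aggressive bots, by definition, simply ignore these declarations, which is why they must be filtered at the infrastructure level instead.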

These practices demand heightened vigilance and significant effort from the infrastructure management team. The core challenge is that a bot’s intentions cannot be determined directly. The task, therefore, is to identify aggressive bots, develop strategies to mitigate the impact of their activity, and then test and implement those strategies.

Balancing performance, accessibility and user satisfaction

The goal is to strike a delicate balance between performance, accessibility and user satisfaction, whether those users are human or machine. Without intervention, there is a risk of system failures and wasted resources in terms of energy and computing power. On the other hand, overly strict measures—such as systematic checks to distinguish humans from bots (e.g., CAPTCHAs)—can frustrate legitimate users.

Aggressive bot traffic also skews consultation statistics. The infrastructure includes 16 machines dedicated solely to storing access logs, and the volume recorded in 2025 was twice that of 2024. Since these machines operate continuously, a critical question arises: how can their usage be rationalised? Removing bot logs saves storage space but reduces consultation figures, which can in turn cause frustration and dissatisfaction. Yet if most logs are generated by bots, are the consultation statistics even meaningful?

The infrastructure team has primarily focused its work on the HAProxy platform, implemented as part of the Equipex+ HALiance project. HAProxy acts as an intelligent intermediary between users (or bots) and the servers hosting the applications. It can, for instance, be configured to detect unusually frequent requests (e.g. a bot sending 100 requests per second) and slow them down or block them. On Tuesday 17 February, more than 4 million bot requests were intercepted out of a total of 17 million; these were generally search requests, which are costly for the HAL indexer. The filters improve response times, bringing page load times below one second.
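The kind of rate limiting described above is typically done in HAProxy with a stick table that tracks the request rate per client IP. The following is a minimal sketch under assumed values; the actual CCSD configuration, backend names and thresholds are not public and are invented here for illustration:

```
frontend fe_hal
    bind :443 ssl crt /etc/haproxy/certs/site.pem

    # Track each client IP and its HTTP request rate over a 10s window
    stick-table type ip size 1m expire 10m store http_req_rate(10s)
    http-request track-sc0 src

    # Reject clients exceeding ~100 requests/second (hypothetical threshold:
    # 1000 requests over the 10-second window) with HTTP 429
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 1000 }

    default_backend be_app
```

Returning 429 rather than silently dropping connections lets well-behaved clients back off, while the stick table keeps state in memory so legitimate, low-rate traffic is never challenged.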

Dealing with bots task group

Sharing information and collaborating with other infrastructures is essential to advancing knowledge and practices. Since 2025, the CCSD has participated in the Dealing With Bots Task Group, a working group established by the Confederation of Open Access Repositories (COAR) following a survey. The group aims to study the sudden rise in malicious bot traffic and its impact on repositories. The first outcome of this initiative is a website launched in January, which provides repository managers with resources to help them develop context-specific strategies.

A key finding is “that there is no silver bullet solution to this problem”. Repositories must strike a delicate balance between protecting their operations from unscrupulous actors and maintaining their core mission of providing open access to legitimate users and machines.

In 2026, HAL’s strategy is to test Anubis — an open-source utility already implemented by other platforms — and optimise the separation of identified bot traffic from legitimate traffic.
