How to get rid of bots that consult your deposits …

Written by Agnès Magron

Those who explore the consultation or download statistics for their submissions in HAL are sometimes surprised to see traffic peaks that seem inexplicable. A sudden surge in the popularity of the topic covered, or a link to the article from a highly visited website, can explain them. Most often, however, the correct explanation is the activity of bots, programs that browse the web automatically, the best known being the one created and used by Google.

Provide reliable indicators

Removing the accesses generated by these programs from the statistics is a major challenge in obtaining data that reflect human consultation of the resources available on HAL. If they go undetected, bots inflate the consultation figures, and the resulting data are unusable.

Providing reliable and stable indicators is a priority for the CCSD, which has included the redesign of the statistical module in its roadmap.

For this redesign, the CCSD has chosen to base its developments on the ezPAARSE application, free and open-source software that exploits, analyzes and enriches resource access logs. For several months, in partnership with the ezPAARSE team, the CCSD has carried out the developments required to use this application within HAL.

ezPAARSE and the current version of HAL manage bot detection in a so-called « static » way. The bots are filtered with the help of a pre-established list of known bots provided by the COUNTER standard. This detection is unfortunately not enough because some bots slip through the cracks.
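To make the idea of static filtering concrete, here is a minimal sketch in Python. It is not the ezPAARSE or HAL implementation: the function name `is_known_bot` is hypothetical, and the four patterns are illustrative stand-ins for the much longer list of known bots, which we assume here is matched against the user-agent string of each request.

```python
import re

# Illustrative patterns only (an assumption for this sketch); the real
# pre-established list of known bots is much longer.
KNOWN_BOT_PATTERNS = [r"googlebot", r"bingbot", r"crawler", r"spider"]

# One case-insensitive regular expression built from all the patterns.
_BOT_RE = re.compile("|".join(KNOWN_BOT_PATTERNS), re.IGNORECASE)


def is_known_bot(user_agent: str) -> bool:
    """Return True if the user-agent string matches a listed bot pattern."""
    return bool(_BOT_RE.search(user_agent or ""))
```

A program that announces itself with a user agent absent from the list, for example `python-requests/2.31`, is not caught by such a check, which is why a static list alone lets some bots through.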

It is therefore necessary to supplement this initial list with data based on user behaviour. This is what we call dynamic filtering, which aims to continually improve our list of bots. The CCSD and ezPAARSE have therefore worked together on a module that detects whether a user reaches a daily consultation threshold. If a user exceeds this threshold, their behaviour is labelled “abnormal” and their future visits are annotated as coming from a “potential bot”. The result of this partnership will benefit the whole community of ezPAARSE users.
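The threshold logic described above can be sketched as follows. This is only an illustration under stated assumptions, not the module developed with ezPAARSE: the function name `annotate_potential_bots`, the identification of a user by a `host` string, and the threshold value of 100 are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

DAILY_THRESHOLD = 100  # hypothetical value, chosen only for this sketch


def annotate_potential_bots(events, threshold=DAILY_THRESHOLD):
    """Annotate each consultation with a "potential bot" flag.

    `events` is an iterable of (host, timestamp) pairs in chronological
    order. Returns a list of (host, timestamp, potential_bot) tuples.
    """
    daily_counts = defaultdict(int)  # (host, day) -> consultations so far
    flagged = set()                  # hosts whose behaviour was "abnormal"
    annotated = []
    for host, ts in events:
        # A visit is annotated only if the host was flagged beforehand,
        # so the flag applies to future visits.
        annotated.append((host, ts, host in flagged))
        key = (host, ts.date())
        daily_counts[key] += 1
        if daily_counts[key] > threshold:
            flagged.add(host)
    return annotated
```

In this sketch a flagged host stays flagged on the following days, which is one possible reading of "future visits"; the counts themselves restart every day.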

Processing chain

Dynamic bot filtering is applied to a corpus of logs built up in several steps, summarized in the diagram below:

 

All traces left by users (humans and machines alike) on HAL are analyzed to extract those that concern consultations and downloads. This eliminates about 65% of the logs. From what remains, we reject the logs identified as coming from bots on the COUNTER list (a variable volume) and the duplicates caused by double clicks. Finally, the dynamic detection module is applied to the remaining volume to improve subsequent detection. At the end of the whole process, the relevant corpus is estimated at about 10% of the total number of logs.
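The first three steps of this chain can be sketched in Python as a series of filters that also records how many logs survive each step. This is a simplified illustration, not the actual HAL processing: the record fields (`host`, `resource`, `action`, `user_agent`, `time`), the function name `run_chain`, the substring markers standing in for the COUNTER list, and the 30-second double-click window are all assumptions made for the example. The dynamic detection step is left out.

```python
from datetime import datetime, timedelta

DOUBLE_CLICK_WINDOW = timedelta(seconds=30)       # assumed window
KNOWN_BOT_MARKERS = ("bot", "crawler", "spider")  # illustrative stand-in


def run_chain(logs):
    """Filter `logs` (dicts sorted by "time") and count survivors per step."""
    counts = {"total": len(logs)}

    # Step 1: keep only consultations and downloads.
    kept = [r for r in logs if r["action"] in ("consultation", "download")]
    counts["consultations_downloads"] = len(kept)

    # Step 2: reject logs whose user agent matches a known bot marker.
    kept = [r for r in kept
            if not any(m in r["user_agent"].lower() for m in KNOWN_BOT_MARKERS)]
    counts["after_known_bots"] = len(kept)

    # Step 3: drop double clicks, i.e. the same host repeating the same
    # action on the same resource within the window of its previous click.
    last_seen = {}
    deduped = []
    for r in kept:
        key = (r["host"], r["resource"], r["action"])
        previous = last_seen.get(key)
        if previous is None or r["time"] - previous > DOUBLE_CLICK_WINDOW:
            deduped.append(r)
        last_seen[key] = r["time"]
    counts["after_double_clicks"] = len(deduped)

    return deduped, counts
```

Keeping the per-step counts makes it possible to check proportions such as those quoted above against a real corpus.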

The next step in the redesign of the HAL statistical module will be to develop data visualization tools that are simple to use and can be adapted to individual needs.