A. Dorca Josa1, E. Santamaría Pérez2, J.A. Morán Moreno2

1Universitat d'Andorra (ANDORRA)
2Universitat Oberta de Catalunya (SPAIN)
Biometric identification systems based on Keystroke Dynamics have been around for almost forty years, and there has been considerable interest in identifying people by this behavioral trait. Keystroke Dynamics focuses on the particular way a person types on a keyboard.

The objective of the proposed research is to determine how well the identity of users of online resources, such as e-learning environments, can be established when context features are taken into account. The presented research focuses on free text: users were never told what to type, how, or when. This particular field of Keystroke Dynamics has not been studied as thoroughly as the fixed-text alternative, where a plethora of methods have been tried.

The proposed method rests on the hypothesis that the position of a particular letter, or combination of letters, within a word is highly important. Other studies have used digraphs and/or trigraphs without taking into account whether these letter combinations occurred at the beginning, the middle, or the end of a word.
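As a minimal illustration of what position-aware letter combinations might look like, the sketch below tags each digraph of a word with its position. The function name `positional_digraphs` and the labels `"start"`, `"middle"`, and `"end"` are assumptions made for this example, not names taken from the described method.

```python
def positional_digraphs(word):
    """Return (digraph, position) pairs for a word.

    Illustrative assumption: the first digraph is labeled "start", the last
    one "end", and all others "middle". In a two-letter word the single
    digraph is labeled "start".
    """
    pairs = []
    last = len(word) - 2  # index of the final digraph
    for i in range(len(word) - 1):
        if i == 0:
            position = "start"
        elif i == last:
            position = "end"
        else:
            position = "middle"
        pairs.append((word[i:i + 2], position))
    return pairs


tagged = positional_digraphs("there")
```

With this tagging, the digraph "th" at the start of "there" is kept separate from the same digraph at the end of "with", which is the distinction that plain digraph approaches do not make.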

A group of 60 users was analyzed, with close to 2000 messages in total. These messages were sent to the forum modules of the Moodle LCMS (Learning Content Management System) at the University of Andorra. The proposed technique was applied transparently, in a way that did not affect the users' normal behavior.

The template of the user is built using the context of the written words and the latency between successive keystrokes, something that has not been previously attempted. Other contextual features, such as word length, the minimum number of words needed, or the repetition of words, have also been studied to determine which ones best help ascertain the identity of a user.

Logical trees are used to store the collected samples in a way that allows for the context to be preserved. The distance between a test sample and a template is obtained by using a combination of the Chebyshev distance measurement, simple statistical methods, and context features like the depth at which the word was found in the logical tree.
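One possible reading of this storage scheme is a prefix tree in which each node holds the latencies observed when its letter is typed after the preceding path of letters. The sketch below is an assumption-laden illustration: the `Node` class, the `insert` and `mean_latencies` helpers, and the use of per-transition means are all hypothetical choices, and it applies only the Chebyshev distance (the largest absolute difference between two vectors), not the full combination of measures described above.

```python
from statistics import mean


class Node:
    """Hypothetical tree node: one letter reached along a word's path."""

    def __init__(self):
        self.children = {}
        self.latencies = []  # latencies (ms) observed when arriving at this node


def insert(root, word, latencies):
    """Store a non-empty word; latencies[i] is the delay between word[i] and word[i+1]."""
    node = root.children.setdefault(word[0], Node())
    for ch, lat in zip(word[1:], latencies):
        node = node.children.setdefault(ch, Node())
        node.latencies.append(lat)


def mean_latencies(root, word):
    """Return per-transition mean latencies along the word's path, or None if absent."""
    node = root.children.get(word[0])
    if node is None:
        return None
    means = []
    for ch in word[1:]:
        node = node.children.get(ch)
        if node is None or not node.latencies:
            return None
        means.append(mean(node.latencies))
    return means


def chebyshev(a, b):
    """Chebyshev distance: the largest absolute coordinate difference."""
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


root = Node()
insert(root, "the", [110, 90])
insert(root, "the", [130, 110])
insert(root, "then", [120, 100, 150])  # shares the t-h-e path with "the"

template = mean_latencies(root, "the")
distance = chebyshev(template, [125, 140])
```

Because words sharing a prefix share a path, a latency is always recorded together with the letters that preceded it and the depth at which it occurred, which is one way the context could be preserved.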

Thirty different test sets, randomly chosen from the available pool of users and messages, were used to test the proposed method. Messages were split 30%/70% between testing and model building. A test message was attributed to the user whose template lay at the minimum distance from it.
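The evaluation procedure can be sketched as a random 70/30 split followed by nearest-template identification. This is only an illustration under stated assumptions: `split_messages`, `identify`, the toy latency vectors, and the user names are hypothetical, and a plain Chebyshev distance stands in for the combined distance used in the study.

```python
import random


def split_messages(messages, test_fraction=0.3, seed=0):
    """Shuffle messages and return (train, test) with roughly test_fraction held out."""
    rng = random.Random(seed)
    shuffled = list(messages)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def chebyshev(a, b):
    """Stand-in distance: the largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def identify(sample, templates, distance):
    """Return the user whose template is closest to the sample."""
    return min(templates, key=lambda user: distance(sample, templates[user]))


train, test = split_messages(range(10))

# Toy templates: mean latencies (ms) for the same two transitions per user.
templates = {"alice": [120, 100], "bob": [200, 180]}
predicted = identify([125, 105], templates, chebyshev)
```

Repeating the split with different seeds would give a series of independent test sets, analogous to the thirty used here.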

The best results were achieved when messages were tested against a random subset of 40 users (from the 60 available). A mean FRR (False Rejection Rate) of 0.00951% and a mean FAR (False Acceptance Rate) of 0.00025% were achieved. This corresponds to roughly 99% of messages being correctly identified. When the same tests were performed against templates built using only digraphs and trigraphs, without considering context features, the effectiveness of the method decreased to 84%. The proposed method benefits from the fact that having less information, but of much better quality, greatly improves the results.
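For reference, mean FAR and FRR in an identification setting can be computed per user and then averaged. The sketch below uses one common convention, which is an assumption here since the exact definitions used in the study are not stated: for each user, FRR is the share of that user's messages attributed to someone else, and FAR is the share of other users' messages attributed to that user. The function name `mean_far_frr` is hypothetical.

```python
def mean_far_frr(true_users, predicted_users):
    """Return (mean FAR, mean FRR) as fractions, averaged over users."""
    pairs = list(zip(true_users, predicted_users))
    far_values, frr_values = [], []
    for user in sorted(set(true_users)):
        genuine = [pred for true, pred in pairs if true == user]
        impostor = [pred for true, pred in pairs if true != user]
        # Genuine messages wrongly attributed to another user.
        frr_values.append(sum(pred != user for pred in genuine) / len(genuine))
        # Other users' messages wrongly attributed to this user.
        far_values.append(
            sum(pred == user for pred in impostor) / len(impostor) if impostor else 0.0
        )
    return sum(far_values) / len(far_values), sum(frr_values) / len(frr_values)


far, frr = mean_far_frr(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

In this toy case one of user a's two messages is attributed to user b, so user a contributes an FRR of 0.5 and user b a FAR of 0.5, while the other two per-user rates are zero.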

The results of the proposed research should help determine whether Keystroke Dynamics and the proposed method are sufficient to identify users, with a good enough level of certainty, from the content they create. If so, the method could be used to ensure that a user is not impersonated by another, in authentication schemes, or even to help determine the authorship of different parts of a document written by more than one user.