An evaluation measurement in automatic text classification for authorship attribution

Antonio Rico Sulayes


Published: Jul 7, 2016

Keywords:

classification systems, evaluation measurements, authorship attribution

Antonio Rico Sulayes

Universidad de las Américas Puebla (Puebla, Mexico)

Abstract

In authorship attribution, the task of correctly assigning an anonymized document to an author within a predefined set of subjects, the research literature has used various measurements to evaluate classification systems. As this article discusses, some of these measurements can differ diametrically. For research purposes, the evaluation of an automatic text classification system, such as one used for authorship attribution, may report a number of different performance measurements. However, some of the figures used previously are either too optimistic or lack generalizability. In addition to these issues, law-oriented research has pointed out the importance of having an error rate for the legal admissibility not only of this type of text classification task but of any piece of potential evidence in general. Given these circumstances, this paper proposes the use of a single measurement in authorship attribution and discusses the implications of using this figure instead of others presented by researchers. It also explains the importance of presenting this measurement alongside other relevant experimental settings, such as the number of categories (authors, in this context). The discussion is supported by a set of authorship attribution experiments that use data from users of crime-related social media.
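The abstract does not name the proposed measurement, so the following is only a minimal sketch of the general point about reporting an error rate together with the number of candidate authors. It assumes plain accuracy as the performance figure and uniform random guessing as the reference point. The function names and the toy labels are hypothetical and are not taken from the article's experiments.

```python
def accuracy(true_authors, predicted_authors):
    """Fraction of documents assigned to their actual author."""
    if len(true_authors) != len(predicted_authors):
        raise ValueError("label lists must have the same length")
    if not true_authors:
        raise ValueError("at least one document is required")
    correct = sum(t == p for t, p in zip(true_authors, predicted_authors))
    return correct / len(true_authors)


def error_rate(true_authors, predicted_authors):
    """Fraction of documents assigned to the wrong author."""
    return 1.0 - accuracy(true_authors, predicted_authors)


def chance_accuracy(true_authors):
    """Accuracy expected from uniform random guessing among the authors."""
    return 1.0 / len(set(true_authors))


# Hypothetical toy data: 8 documents written by 4 candidate authors.
true_authors = ["A", "A", "B", "B", "C", "C", "D", "D"]
predicted = ["A", "B", "B", "B", "C", "A", "D", "D"]

print("authors:", len(set(true_authors)))
print("accuracy:", accuracy(true_authors, predicted))
print("error rate:", error_rate(true_authors, predicted))
print("chance accuracy:", chance_accuracy(true_authors))
```

Under these assumptions, the same accuracy figure reads very differently with 2 authors than with 20, because the chance level falls as the number of categories grows. This is one reason to report the number of authors next to the error rate.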


How to cite

Rico Sulayes, A. (2016). An evaluation measurement in automatic text classification for authorship attribution. Ingenio Magno, 6(2), 62-74. Retrieved from http://revistas.ustatunja.edu.co/index.php/ingeniomagno/article/view/1093

Issue

Vol. 6 (2015): Ingenio Magno Vol. 6-2

Section

Articles Vol. 6-2


