{"title":"An Automatic Bayesian Classification System for File Format Selection","authors":"Roman Graf, Sergiu Gordea, Heather M. Ryan","volume":102,"journal":"International Journal of Computer and Information Engineering","pagesStart":1477,"pagesEnd":1483,"ISSN":"1307-6892","URL":"https:\/\/publications.waset.org\/pdf\/10001774","abstract":"This paper presents an approach for classifying an unstructured format description in order to identify file formats. The main contribution of this work is the use of data mining techniques to support file format selection from just an unstructured text description that comprises the most important format features for a particular organisation. Subsequently, the file format identification method employs a file format classifier and associated configurations to support digital preservation experts with an estimate of the required file format. Our goal is to make use of a format specification knowledge base aggregated from different Web sources in order to select a file format for a particular institution. Using the naive Bayes method, the decision support system recommends a file format to an expert for their institution. The proposed methods facilitate the selection of a file format and support the quality of a digital preservation process. The presented approach is meant to facilitate decision making for the preservation of digital content in libraries and archives using domain expert knowledge and specifications of file formats. To facilitate decision-making, the aggregated information about the file formats is presented as a file format vocabulary that comprises the most common terms that are characteristic of all researched formats. The goal is to suggest a particular file format based on this vocabulary for analysis by an expert. 
The sample file format calculation and its results, including probabilities, are presented in the evaluation section.","references":"[1] P. Ayris, R. Davies, R. McLeod, R. Miao, H. Shenton, and P. Wheatley. The LIFE2 final project report. Final project report, LIFE Project, London, UK, 2008.\r\n[2] D. Tarrant, S. Hitchcock, and L. Carr. Where the semantic web and web 2.0 meet format risk management: P2 registry. International Journal of Digital Curation, 6(1):165\u2013182, 2011.\r\n[3] S. Gordea, A. Lindley, and R. Graf. Computing recommendations for long term data accessibility basing on open knowledge and linked data. Joint proceedings of the RecSys 2011 Workshops Decisions@RecSys\u201911 and UCERSTI 2, 811:51\u201358, November 2011.\r\n[4] R. Graf and S. Gordea. Aggregating a knowledge base of file formats from linked open data. Proceedings of the 9th International Conference on Preservation of Digital Objects, poster:292\u2013293, October 2012.\r\n[5] R. Graf and S. Gordea. A risk analysis of file formats for preservation planning. In Proceedings of the 10th International Conference on Preservation of Digital Objects (iPres2013), pages 177\u2013186, Lisbon, Portugal, Sep 2013. Biblioteca Nacional de Portugal, Lisboa.\r\n[6] R. Graf, S. Gordea, and H. Ryan. A model for format endangerment analysis using fuzzy logic. In Proceedings of the 11th International Conference on Digital Preservation (iPres2014), pages 160\u2013168, Melbourne, Australia, Oct 2014. State Library of Victoria, Melbourne.\r\n[7] D. Heckerman. Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1(1):79\u2013119, 1997.\r\n[8] J. Hunter and S. Choudhury. Panic: an integrated approach to the preservation of composite digital objects using semantic web services. International Journal on Digital Libraries, 6(2):174\u2013183, September 2006.\r\n[9] A. N. Jackson. 
Formats over time: Exploring UK web history. Proceedings of the 9th International Conference on Preservation of Digital Objects, pages 155\u2013158, October 2012.\r\n[10] G. W. Lawrence, W. R. Kehoe, O. Y. Rieger, W. H. Walters, and A. R. Kenney. Risk management of digital information: A file format investigation. June 2000.\r\n[11] D. Pearson and C. Webb. Defining file format obsolescence: A risky journey. The International Journal of Digital Curation, 3(1):89\u2013106, July 2008.\r\n[12] S. Vermaaten, B. Lavoie, and P. Caplan. Identifying threats to successful digital preservation: the SPOT model for risk assessment. D-Lib Magazine, 18(9\/10), September 2012.\r\n[13] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. McLachlan, A. Ng, B. Liu, P. Yu, Z.-H. Zhou, M. Steinbach, D. Hand, and D. Steinberg. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1\u201337, 2008.\r\n[14] R. Zacharski. A Programmer\u2019s Guide to Data Mining: The Ancient Art of the Numerati. 2012.\r\n[15] H. Zhang. The Optimality of Naive Bayes. In V. Barr and Z. Markov, editors, FLAIRS Conference. AAAI Press, 2004.","publisher":"World Academy of Science, Engineering and Technology","index":"Open Science Index 102, 2015"}