{"title":"Data Preprocessing for Supervised Leaning","authors":"S. B. Kotsiantis, D. Kanellopoulos, P. E. Pintelas","volume":12,"journal":"International Journal of Computer and Information Engineering","pagesStart":4104,"pagesEnd":4110,"ISSN":"1307-6892","URL":"https:\/\/publications.waset.org\/pdf\/14136","abstract":"Many factors affect the success of Machine Learning\r\n(ML) on a given task. The representation and quality of the instance\r\ndata is first and foremost. If there is much irrelevant and redundant\r\ninformation present or noisy and unreliable data, then knowledge\r\ndiscovery during the training phase is more difficult. It is well known\r\nthat data preparation and filtering steps take considerable amount of\r\nprocessing time in ML problems. Data pre-processing includes data\r\ncleaning, normalization, transformation, feature extraction and\r\nselection, etc. The product of data pre-processing is the final training\r\nset. It would be nice if a single sequence of data pre-processing\r\nalgorithms had the best performance for each data set but this is not\r\nhappened. Thus, we present the most well know algorithms for each\r\nstep of data pre-processing so that one achieves the best performance\r\nfor their data set.","references":"[1] Bauer, K.W., Alsing, S.G., Greene, K.A., 2000. Feature screening using\r\nsignal-to-noise ratios. Neurocomputing 31, 29-44.\r\n[2] M. Boulle. Khiops: A Statistical Discretization Method of Continuous\r\nAttributes. Machine Learning 55:1 (2004) 53-69\r\n[3] Breunig M. M., Kriegel H.-P., Ng R. T., Sander J.: \u00d4\u00c7\u00ffLOF: Identifying\r\nDensity-Based Local Outliers-, Proc. ACM SIGMOD Int. Conf. On\r\nManagement of Data (SIGMOD 2000), Dallas, TX, 2000, pp. 93-104.\r\n[4] Brodley, C.E. and Friedl, M.A. (1999) \"Identifying Mislabeled Training\r\nData\", AIR, Volume 11, pages 131-167.\r\n[5] Bruha and F. Franek: Comparison of various routines for unknown\r\nattribute value processing: covering paradigm. International Journal of\r\nPattern Recognition and Artificial Intelligence, 10, 8 (1996), 939-955\r\n[6] J.R. Cano, F. Herrera, M. Lozano. Strategies for Scaling Up\r\nEvolutionary Instance Reduction Algorithms for Data Mining. In: L.C.\r\nJain, A. Ghosh (Eds.) Evolutionary Computation in Data Mining,\r\nSpringer, 2005, 21-39\r\n[7] C. Cardie. Using decision trees to improve cased-based learning. In\r\nProceedings of the First International Conference on Knowledge\r\nDiscovery and Data Mining. AAAI Press, 1995.\r\n[8] M. Dash, H. Liu, Feature Selection for Classification, Intelligent Data\r\nAnalysis 1 (1997) 131-156.\r\n[9] S. Das. Filters, wrappers and a boosting-based hybrid for feature\r\nselection. Proc. of the 8th International Conference on Machine\r\nLearning, 2001.\r\n[10] T. Elomaa, J. Rousu. Efficient multisplitting revisited: Optimapreserving\r\nelimination of partition candidates. Data Mining and\r\nKnowledge Discovery 8:2 (2004) 97-126\r\n[11] Fayyad U., and Irani K. (1993). Multi-interval discretization of\r\ncontinuous-valued attributes for classification learning. In Proc. of the\r\nThirteenth Int. Joint Conference on Artificial Intelligence, 1022-1027.\r\n[12] Friedman, J.H. 1997. Data mining and statistics: What-s the connection?\r\nProceedings of the 29th Symposium on the Interface Between Computer\r\nScience and Statistics.\r\n[13] Marek Grochowski, Norbert Jankowski: Comparison of Instance\r\nSelection Algorithms II. Results and Comments. ICAISC 2004a: 580-\r\n585.\r\n[14] Jerzy W. 
Grzymala-Busse and Ming Hu, A Comparison of Several\r\nApproaches to Missing Attribute Values in Data Mining, LNAI 2005,\r\npp. 378\u2212385, 2001.\r\n[15] Isabelle Guyon, Andr\u00e9 Elisseeff; An Introduction to Variable and\r\nFeature Selection, JMLR Special Issue on Variable and Feature\r\nSelection, 3(Mar):1157--1182, 2003.\r\n[16] Hernandez, M.A.; Stolfo, S.J.: Real-World Data is Dirty: Data Cleansing\r\nand the Merge\/Purge Problem. Data Mining and Knowledge Discovery\r\n2(1):9-37, 1998.\r\n[17] Hall, M. (2000). Correlation-based feature selection for discrete and\r\nnumeric class machine learning. Proceedings of the Seventeenth\r\nInternational Conference on Machine Learning (pp. 359-366).\r\n[18] K. M. Ho, and P. D. Scott. Reducing Decision Tree Fragmentation\r\nThrough Attribute Value Grouping: A Comparative Study, in Intelligent\r\nData Analysis Journal, 4(1), pp.1-20, 2000.\r\n[19] Hu, Y.-J., & Kibler, D. (1996). Generation of attributes for learning\r\nalgorithms. Proc. 13th International Conference on Machine Learning.\r\n[20] J. Hua, Z. Xiong, J. Lowey, E. Suh, E.R. Dougherty. Optimal number of\r\nfeatures as a function of sample size for various classification rules.\r\nBioinformatics 21 (2005) 1509-1515\r\n[21] Norbert Jankowski, Marek Grochowski: Comparison of Instances\r\nSelection Algorithms I. Algorithms Survey. ICAISC 2004b: 598-603.\r\n[22] Knorr E. M., Ng R. T.: \u00d4\u00c7\u00ffA Unified Notion of Outliers: Properties and\r\nComputation-, Proc. 4th Int. Conf. on Knowledge Discovery and Data\r\nMining (KDD-97), Newport Beach, CA, 1997, pp. 219-222.\r\n[23] R. Kohavi and M. Sahami. Error-based and entropy-based discretisation\r\nof continuous features. In Proceedings of the Second International\r\nConference on Knowledge Discovery and Data Mining. AAAI Press,\r\n1996.\r\n[24] Kononenko, I., Simec, E., and Robnik-Sikonja, M.(1997).Overcoming\r\nthe myopia of inductive learning algorithms with RELIEFF. Applied\r\nIntelligence, 7: 39-55.\r\n[25] S. B. Kotsiantis, P. E. Pintelas (2004), Hybrid Feature Selection instead\r\nof Ensembles of Classifiers in Medical Decision Support, Proceedings of\r\nInformation Processing and Management of Uncertainty in Knowledge-\r\nBased Systems, July 4-9, Perugia - Italy, pp. 269-276.\r\n[26] Kubat, M. and Matwin, S., 'Addressing the Curse of Imbalanced Data\r\nSets: One Sided Sampling', in the Proceedings of the Fourteenth\r\nInternational Conference on Machine Learning, pp. 179-186, 1997.\r\n[27] Lakshminarayan K., S. Harp & T. Samad, Imputation of Missing Data in\r\nIndustrial Databases, Applied Intelligence 11, 259-275 (1999).\r\n[28] Langley, P., Selection of relevant features in machine learning. In:\r\nProceedings of the AAAI Fall Symposium on Relevance, 1-5, 1994.\r\n[29] P. Langley and S. Sage. Induction of selective Bayesian classifiers. In\r\nProc. of 10th Conference on Uncertainty in Artificial Intelligence,\r\nSeattle, 1994.\r\n[30] Ling, C. and Li, C., 'Data Mining for Direct Marketing: Problems and\r\nSolutions', Proceedings of KDD-98.\r\n[31] Liu, H. and Setiono, R., A probabilistic approach to feature selection\u00d4\u00c7\u00f6a\r\nfilter solution. Proc. of International Conference on ML, 319-327, 1996.\r\n[32] H. Liu and R. Setiono. Some Issues on scalable feature selection. Expert\r\nSystems and Applications, 15 (1998) 333-339. Pergamon.\r\n[33] Liu, H. and H. Metoda (Eds), Instance Selection and Constructive Data\r\nMining, Kluwer, Boston, MA, 2001\r\n[34] H. Liu, F. Hussain, C. Lim, M. Dash. 
Discretization: An Enabling\r\nTechnique. Data Mining and Knowledge Discovery 6:4 (2002) 393-423.\r\n[35] Maas W. (1994). Efficient agnostic PAC-learning with simple\r\nhypotheses. Proc. of the 7th ACM Conf. on Computational Learning\r\nTheory, 67-75.\r\n[36] Markovitch S. & Rosenstein D. (2002), Feature Generation Using\r\nGeneral Constructor Functions, Machine Learning, 49, 59-98, 2002.\r\n[37] Oates, T. and Jensen, D. 1997. The effects of training set size on\r\ndecision tree complexity. In ML: Proc. of the 14th Intern. Conf., pp.\r\n254-262.\r\n[38] Pfahringer B. (1995). Compression-based discretization of continuous\r\nattributes. Proc. of the 12th International Conference on Machine\r\nLearning.\r\n[39] S. Piramuthu. Evaluating feature selection methods for learning in data\r\nmining applications. European Journal of Operational Research 156:2\r\n(2004) 483-494\r\n[40] Pyle, D., 1999. Data Preparation for Data Mining. Morgan Kaufmann\r\nPublishers, Los Altos, CA.\r\n[41] Quinlan J.R. (1993), C4.5: Programs for Machine Learning, Morgan\r\nKaufmann, Los Altos, California.\r\n[42] Reinartz T., A Unifying View on Instance Selection, Data Mining and\r\nKnowledge Discovery, 6, 191-210, 2002, Kluwer Academic Publishers.\r\n[43] Rocke, D. M. and Woodruff, D. L. (1996) \"Identification of Outliers in\r\nMultivariate Data,\" Journal of the American Statistical Association, 91,\r\n1047-1061.\r\n[44] Setiono, R., Liu, H., 1997. Neural-network feature selector. IEEE Trans.\r\nNeural Networks 8 (3), 654-662.\r\n[45] M. Singh and G. M. Provan. Efficient learning of selective Bayesian\r\nnetwork classifiers. In Machine Learning: Proceedings of the Thirteenth\r\nInternational Conference on Machine Learning. Morgan Kaufmann,\r\n1996.\r\n[46] Somol, P., Pudil, P., Novovicova, J., Paclik, P., 1999. Adaptive floating\r\nsearch methods in feature selection. Pattern Recognition Lett. 20\r\n(11\/13), 1157-1163.\r\n[47] P. Somol, P. Pudil. Feature Selection Toolbox. Pattern Recognition 35\r\n(2002) 2749-2759.\r\n[48] C. M. Teng. Correcting noisy data. In Proc. 16th International Conf. on\r\nMachine Learning, pages 239-248. San Francisco, 1999.\r\n[49] Yang J, Honavar V. Feature subset selection using a genetic algorithm.\r\nIEEE Int Systems and their Applications 1998; 13(2): 44-49.\r\n[50] Yu and Liu (2003), Proceedings of the Twentieth International\r\nConference on Machine Learning (ICML-2003), Washington DC.\r\n[51] Zheng (2000), Constructing X-of-N Attributes for Decision Tree\r\nLearning, Machine Learning, 40, 35-75, 2000, Kluwer Academic\r\nPublishers.","publisher":"World Academy of Science, Engineering and Technology","index":"Open Science Index 12, 2007"}