<!DOCTYPE html> <html lang="en" dir="ltr"> <head> <!-- Google tag (gtag.js) --> <script async src="https://www.googletagmanager.com/gtag/js?id=G-P63WKM1TM1"></script> <script> window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'G-P63WKM1TM1'); </script> <!-- Yandex.Metrika counter --> <script type="text/javascript" > (function(m,e,t,r,i,k,a){m[i]=m[i]||function(){(m[i].a=m[i].a||[]).push(arguments)}; m[i].l=1*new Date(); for (var j = 0; j < document.scripts.length; j++) {if (document.scripts[j].src === r) { return; }} k=e.createElement(t),a=e.getElementsByTagName(t)[0],k.async=1,k.src=r,a.parentNode.insertBefore(k,a)}) (window, document, "script", "https://mc.yandex.ru/metrika/tag.js", "ym"); ym(55165297, "init", { clickmap:false, trackLinks:true, accurateTrackBounce:true, webvisor:false }); </script> <noscript><div><img src="https://mc.yandex.ru/watch/55165297" style="position:absolute; left:-9999px;" alt="" /></div></noscript> <!-- /Yandex.Metrika counter --> <!-- Matomo --> <!-- End Matomo Code --> <title>Web Data Scraping Technology Using Term Frequency Inverse Document Frequency to Enhance the Big Data Quality on Sentiment Analysis</title> <meta name="description" content="Web Data Scraping Technology Using Term Frequency Inverse Document Frequency to Enhance the Big Data Quality on Sentiment Analysis"> <meta name="keywords" content="Counter vectorization, Convolutional Neural Network, Crawler, data technology, Long Short-Term Memory, LSTM, Web Scraping, sentiment analysis."> <meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, user-scalable=no"> <meta charset="utf-8"> <meta name="citation_title" content="Web Data Scraping Technology Using Term Frequency Inverse Document Frequency to Enhance the Big Data Quality on Sentiment Analysis"> <meta name="citation_author" content="Sangita Pokhrel"> <meta name="citation_author" content="Nalinda 
Somasiri"> <meta name="citation_author" content="Rebecca Jeyavadhanam"> <meta name="citation_author" content="Swathi Ganesan"> <meta name="citation_publication_date" content="2023/11/24"> <meta name="citation_journal_title" content="International Journal of Electrical and Computer Engineering"> <meta name="citation_volume" content="17"> <meta name="citation_issue" content="11"> <meta name="citation_firstpage" content="300"> <meta name="citation_lastpage" content="307"> <meta name="citation_pdf_url" content="https://publications.waset.org/10013373/pdf"> <link href="https://cdn.waset.org/favicon.ico" type="image/x-icon" rel="shortcut icon"> <link href="https://cdn.waset.org/static/plugins/bootstrap-4.2.1/css/bootstrap.min.css" rel="stylesheet"> <link href="https://cdn.waset.org/static/plugins/fontawesome/css/all.min.css" rel="stylesheet"> <link href="https://cdn.waset.org/static/css/site.css?v=150220211555" rel="stylesheet"> </head> <body> <header> <div class="container"> <nav class="navbar navbar-expand-lg navbar-light"> <a class="navbar-brand" href="https://waset.org"> <img src="https://cdn.waset.org/static/images/wasetc.png" alt="Open Science Research Excellence" title="Open Science Research Excellence" /> </a> <button class="d-block d-lg-none navbar-toggler ml-auto" type="button" data-toggle="collapse" data-target="#navbarMenu" aria-controls="navbarMenu" aria-expanded="false" aria-label="Toggle navigation"> <span class="navbar-toggler-icon"></span> </button> <div class="w-100"> <div class="d-none d-lg-flex flex-row-reverse"> <form method="get" action="https://waset.org/search" class="form-inline my-2 my-lg-0"> <input class="form-control mr-sm-2" type="search" placeholder="Search Conferences" value="" name="q" aria-label="Search"> <button class="btn btn-light my-2 my-sm-0" type="submit"><i class="fas fa-search"></i></button> </form> </div> <div class="collapse navbar-collapse mt-1" id="navbarMenu"> <ul class="navbar-nav ml-auto align-items-center" 
id="mainNavMenu"> <li class="nav-item"> <a class="nav-link" href="https://waset.org/conferences" title="Conferences in 2024/2025/2026">Conferences</a> </li> <li class="nav-item"> <a class="nav-link" href="https://waset.org/disciplines" title="Disciplines">Disciplines</a> </li> <li class="nav-item"> <a class="nav-link" href="https://waset.org/committees" rel="nofollow">Committees</a> </li> <li class="nav-item dropdown"> <a class="nav-link dropdown-toggle" href="#" id="navbarDropdownPublications" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false"> Publications </a> <div class="dropdown-menu" aria-labelledby="navbarDropdownPublications"> <a class="dropdown-item" href="https://publications.waset.org/abstracts">Abstracts</a> <a class="dropdown-item" href="https://publications.waset.org">Periodicals</a> <a class="dropdown-item" href="https://publications.waset.org/archive">Archive</a> </div> </li> <li class="nav-item"> <a class="nav-link" href="https://waset.org/page/support" title="Support">Support</a> </li> </ul> </div> </div> </nav> </div> </header> <main> <div class="container mt-4"> <div class="row"> <div class="col-md-9 mx-auto"> <form method="get" action="https://publications.waset.org/search"> <div id="custom-search-input"> <div class="input-group"> <i class="fas fa-search"></i> <input type="text" class="search-query" name="q" placeholder="Author, Title, Abstract, Keywords" value=""> <input type="submit" class="btn_search" value="Search"> </div> </div> </form> </div> </div> <div class="row mt-3"> <div class="col-sm-3"> <div class="card"> <div class="card-body"><strong>Commenced</strong> in January 2007</div> </div> </div> <div class="col-sm-3"> <div class="card"> <div class="card-body"><strong>Frequency:</strong> Monthly</div> </div> </div> <div class="col-sm-3"> <div class="card"> <div class="card-body"><strong>Edition:</strong> International</div> </div> </div> <div class="col-sm-3"> <div class="card"> <div 
class="card-body"><strong>Paper Count:</strong> 33093</div> </div> </div> </div> <div class="card publication-listing mt-3 mb-3"> <h5 class="card-header" style="font-size:.9rem">Web Data Scraping Technology Using Term Frequency Inverse Document Frequency to Enhance the Big Data Quality on Sentiment Analysis</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/search?q=Sangita%20Pokhrel">Sangita Pokhrel</a>, <a href="https://publications.waset.org/search?q=Nalinda%20Somasiri"> Nalinda Somasiri</a>, <a href="https://publications.waset.org/search?q=Rebecca%20Jeyavadhanam"> Rebecca Jeyavadhanam</a>, <a href="https://publications.waset.org/search?q=Swathi%20Ganesan"> Swathi Ganesan</a> </p> <p class="card-text"><strong>Abstract:</strong></p> <p>Tourism is a booming industry with huge future potential for global wealth and employment. Vast amounts of data are generated on social media sites every day, creating numerous opportunities to bring more insights to decision-makers. Integrating big data technology into the tourism industry will allow companies to determine where their customers have been and what they like. This information can then be used by businesses, such as those that manage visitor centres or hotels, and tourists can get a clear idea of places before visiting. From a technical perspective, we apply natural language processing to analyse the sentiment features of online reviews from tourists, and we then provide an enhanced long short-term memory (LSTM) framework for sentiment feature extraction from travel reviews. To evaluate the effectiveness of our methodology, we constructed a web review database using a crawler and web scraping techniques for experimental validation. The review sentences were first classified with the VADER and RoBERTa models to obtain the polarity of the reviews. 
In this paper, we study feature extraction methods, such as Count Vectorization and Term Frequency – Inverse Document Frequency (TF-IDF) Vectorization, and implement a Convolutional Neural Network (CNN) classifier for sentiment analysis to decide whether a tourist’s attitude towards a destination is positive, negative, or neutral based on the review text posted online. After pre-processing and cleaning the dataset, the CNN algorithm achieved an accuracy of 96.12% for positive and negative sentiment analysis. </p> <iframe src="https://publications.waset.org/10013373.pdf" style="width:100%; height:400px;" frameborder="0"></iframe> <p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/search?q=Counter%20vectorization" title="Counter vectorization">Counter vectorization</a>, <a href="https://publications.waset.org/search?q=Convolutional%20Neural%20Network" title=" Convolutional Neural Network"> Convolutional Neural Network</a>, <a href="https://publications.waset.org/search?q=Crawler" title=" Crawler"> Crawler</a>, <a href="https://publications.waset.org/search?q=data%20technology" title=" data technology"> data technology</a>, <a href="https://publications.waset.org/search?q=Long%20Short-Term%20Memory" title=" Long Short-Term Memory"> Long Short-Term Memory</a>, <a href="https://publications.waset.org/search?q=LSTM" title=" LSTM"> LSTM</a>, <a href="https://publications.waset.org/search?q=Web%20Scraping" title=" Web Scraping"> Web Scraping</a>, <a href="https://publications.waset.org/search?q=sentiment%20analysis." 
title=" sentiment analysis."> sentiment analysis.</a> </p> <a href="https://publications.waset.org/10013373/web-data-scraping-technology-using-term-frequency-inverse-document-frequency-to-enhance-the-big-data-quality-on-sentiment-analysis" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/10013373/apa" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">APA</a> <a href="https://publications.waset.org/10013373/bibtex" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">BibTeX</a> <a href="https://publications.waset.org/10013373/chicago" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">Chicago</a> <a href="https://publications.waset.org/10013373/endnote" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">EndNote</a> <a href="https://publications.waset.org/10013373/harvard" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">Harvard</a> <a href="https://publications.waset.org/10013373/json" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">JSON</a> <a href="https://publications.waset.org/10013373/mla" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">MLA</a> <a href="https://publications.waset.org/10013373/ris" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">RIS</a> <a href="https://publications.waset.org/10013373/xml" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">XML</a> <a href="https://publications.waset.org/10013373/iso690" target="_blank" rel="nofollow" class="btn btn-primary btn-sm">ISO 690</a> <a href="https://publications.waset.org/10013373.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">175</span> </span> 
</div> </div> </div> </main> <footer> <div id="infolinks" class="pt-3 pb-2"> <div class="container"> <div style="background-color:#f5f5f5;" class="p-3"> <div class="row"> <div class="col-md-2"> <ul class="list-unstyled"> About <li><a href="https://waset.org/page/support">About Us</a></li> <li><a href="https://waset.org/page/support#legal-information">Legal</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/WASET-16th-foundational-anniversary.pdf">WASET celebrates its 16th foundational anniversary</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Account <li><a href="https://waset.org/profile">My Account</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Explore <li><a href="https://waset.org/disciplines">Disciplines</a></li> <li><a href="https://waset.org/conferences">Conferences</a></li> <li><a href="https://waset.org/conference-programs">Conference Program</a></li> <li><a href="https://waset.org/committees">Committees</a></li> <li><a href="https://publications.waset.org">Publications</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Research <li><a href="https://publications.waset.org/abstracts">Abstracts</a></li> <li><a 
href="https://publications.waset.org">Periodicals</a></li> <li><a href="https://publications.waset.org/archive">Archive</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Open Science <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Open-Science-Philosophy.pdf">Open Science Philosophy</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Open-Science-Award.pdf">Open Science Award</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Open-Society-Open-Science-and-Open-Innovation.pdf">Open Innovation</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Postdoctoral-Fellowship-Award.pdf">Postdoctoral Fellowship Award</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Scholarly-Research-Review.pdf">Scholarly Research Review</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Support <li><a href="https://waset.org/page/support">Support</a></li> <li><a href="https://waset.org/profile/messages/create">Contact Us</a></li> <li><a href="https://waset.org/profile/messages/create">Report Abuse</a></li> </ul> </div> </div> </div> </div> </div> <div class="container text-center"> <hr style="margin-top:0;margin-bottom:.3rem;"> <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" class="text-muted small">Creative Commons Attribution 4.0 International License</a> <div id="copy" class="mt-2">© 2024 World Academy of Science, Engineering and Technology</div> </div> </footer> <a href="javascript:" id="return-to-top"><i class="fas fa-arrow-up"></i></a> <div class="modal" id="modal-template"> <div class="modal-dialog"> <div class="modal-content"> <div class="row m-0 mt-1"> <div class="col-md-12"> <button type="button" class="close" data-dismiss="modal" aria-label="Close"><span aria-hidden="true">×</span></button> 
</div> </div> <div class="modal-body"></div> </div> </div> </div> <script src="https://cdn.waset.org/static/plugins/jquery-3.3.1.min.js"></script> <script src="https://cdn.waset.org/static/plugins/bootstrap-4.2.1/js/bootstrap.bundle.min.js"></script> <script src="https://cdn.waset.org/static/js/site.js?v=150220211556"></script> <script> jQuery(document).ready(function() { /*jQuery.get("https://publications.waset.org/xhr/user-menu", function (response) { jQuery('#mainNavMenu').append(response); });*/ jQuery.get({ url: "https://publications.waset.org/xhr/user-menu", cache: false }).then(function(response){ jQuery('#mainNavMenu').append(response); }); }); </script> </body> </html>