Search results for: document processing
class="dropdown-item" href="https://publications.waset.org/abstracts">Abstracts</a> <a class="dropdown-item" href="https://publications.waset.org">Periodicals</a> <a class="dropdown-item" href="https://publications.waset.org/archive">Archive</a> </div> </li> <li class="nav-item"> <a class="nav-link" href="https://waset.org/page/support" title="Support">Support</a> </li> </ul> </div> </div> </nav> </div> </header> <main> <div class="container mt-4"> <div class="row"> <div class="col-md-9 mx-auto"> <form method="get" action="https://publications.waset.org/abstracts/search"> <div id="custom-search-input"> <div class="input-group"> <i class="fas fa-search"></i> <input type="text" class="search-query" name="q" placeholder="Author, Title, Abstract, Keywords" value="document processing"> <input type="submit" class="btn_search" value="Search"> </div> </div> </form> </div> </div> <div class="row mt-3"> <div class="col-sm-3"> <div class="card"> <div class="card-body"><strong>Commenced</strong> in January 2007</div> </div> </div> <div class="col-sm-3"> <div class="card"> <div class="card-body"><strong>Frequency:</strong> Monthly</div> </div> </div> <div class="col-sm-3"> <div class="card"> <div class="card-body"><strong>Edition:</strong> International</div> </div> </div> <div class="col-sm-3"> <div class="card"> <div class="card-body"><strong>Paper Count:</strong> 4389</div> </div> </div> </div> <h1 class="mt-3 mb-3 text-center" style="font-size:1.6rem;">Search results for: document processing</h1> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4389</span> Incremental Learning of Independent Topic Analysis</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Takahiro%20Nishigaki">Takahiro Nishigaki</a>, <a href="https://publications.waset.org/abstracts/search?q=Katsumi%20Nitta"> Katsumi Nitta</a>, <a href="https://publications.waset.org/abstracts/search?q=Takashi%20Onoda"> Takashi Onoda</a> </p> <p class="card-text"><strong>Abstract:</strong></p> In this paper, we present a method of applying Independent Topic Analysis (ITA) to increasing the number of document data. The number of document data has been increasing since the spread of the Internet. ITA was presented as one method to analyze the document data. ITA is a method for extracting the independent topics from the document data by using the Independent Component Analysis (ICA). ICA is a technique in the signal processing; however, it is difficult to apply the ITA to increasing number of document data. Because ITA must use the all document data so temporal and spatial cost is very high. Therefore, we present Incremental ITA which extracts the independent topics from increasing number of document data. Incremental ITA is a method of updating the independent topics when the document data is added after extracted the independent topics from a just previous the data. In addition, Incremental ITA updates the independent topics when the document data is added. And we show the result applied Incremental ITA to benchmark datasets. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=text%20mining" title="text mining">text mining</a>, <a href="https://publications.waset.org/abstracts/search?q=topic%20extraction" title=" topic extraction"> topic extraction</a>, <a href="https://publications.waset.org/abstracts/search?q=independent" title=" independent"> independent</a>, <a href="https://publications.waset.org/abstracts/search?q=incremental" title=" incremental"> incremental</a>, <a href="https://publications.waset.org/abstracts/search?q=independent%20component%20analysis" title=" independent component analysis"> independent component analysis</a> </p> <a href="https://publications.waset.org/abstracts/58971/incremental-learning-of-independent-topic-analysis" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/58971.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">309</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4388</span> Model-Based Field Extraction from Different Class of Administrative Documents</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Jinen%20Daghrir">Jinen Daghrir</a>, <a href="https://publications.waset.org/abstracts/search?q=Anis%20Kricha"> Anis Kricha</a>, <a href="https://publications.waset.org/abstracts/search?q=Karim%20Kalti"> Karim Kalti</a> </p> <p class="card-text"><strong>Abstract:</strong></p> The amount of incoming administrative documents is massive and manually processing these documents is a costly task especially on the timescale. In fact, this problem has led an important amount of research and development in the context of automatically extracting fields from administrative documents, in order to reduce the charges and to increase the citizen satisfaction in administrations. In this matter, we introduce an administrative document understanding system. Given a document in which a user has to select fields that have to be retrieved from a document class, a document model is automatically built. A document model is represented by an attributed relational graph (ARG) where nodes represent fields to extract, and edges represent the relation between them. Both of vertices and edges are attached with some feature vectors. When another document arrives to the system, the layout objects are extracted and an ARG is generated. The fields extraction is translated into a problem of matching two ARGs which relies mainly on the comparison of the spatial relationships between layout objects. Experimental results yield accuracy rates from 75% to 100% tested on eight document classes. Our proposed method has a good performance knowing that the document model is constructed using only one single document. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=administrative%20document%20understanding" title="administrative document understanding">administrative document understanding</a>, <a href="https://publications.waset.org/abstracts/search?q=logical%20labelling" title=" logical labelling"> logical labelling</a>, <a href="https://publications.waset.org/abstracts/search?q=logical%20layout%20analysis" title=" logical layout analysis"> logical layout analysis</a>, <a href="https://publications.waset.org/abstracts/search?q=fields%20extraction%20from%20administrative%20documents" title=" fields extraction from administrative documents"> fields extraction from administrative documents</a> </p> <a href="https://publications.waset.org/abstracts/89264/model-based-field-extraction-from-different-class-of-administrative-documents" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/89264.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">213</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4387</span> Investigation of Topic Modeling-Based Semi-Supervised Interpretable Document Classifier</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Dasom%20Kim">Dasom Kim</a>, <a href="https://publications.waset.org/abstracts/search?q=William%20Xiu%20Shun%20Wong"> William Xiu Shun Wong</a>, <a href="https://publications.waset.org/abstracts/search?q=Yoonjin%20Hyun"> Yoonjin Hyun</a>, <a href="https://publications.waset.org/abstracts/search?q=Donghoon%20Lee"> Donghoon Lee</a>, <a href="https://publications.waset.org/abstracts/search?q=Minji%20Paek"> Minji Paek</a>, <a href="https://publications.waset.org/abstracts/search?q=Sungho%20Byun"> Sungho Byun</a>, <a href="https://publications.waset.org/abstracts/search?q=Namgyu%20Kim"> Namgyu Kim</a> </p> <p class="card-text"><strong>Abstract:</strong></p> There have been many researches on document classification for classifying voluminous documents automatically. Through document classification, we can assign a specific category to each unlabeled document on the basis of various machine learning algorithms. However, providing labeled documents manually requires considerable time and effort. To overcome the limitations, the semi-supervised learning which uses unlabeled document as well as labeled documents has been invented. However, traditional document classifiers, regardless of supervised or semi-supervised ones, cannot sufficiently explain the reason or the process of the classification. Thus, in this paper, we proposed a methodology to visualize major topics and class components of each document. We believe that our methodology for visualizing topics and classes of each document can enhance the reliability and explanatory power of document classifiers. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=data%20mining" title="data mining">data mining</a>, <a href="https://publications.waset.org/abstracts/search?q=document%20classifier" title=" document classifier"> document classifier</a>, <a href="https://publications.waset.org/abstracts/search?q=text%20mining" title=" text mining"> text mining</a>, <a href="https://publications.waset.org/abstracts/search?q=topic%20modeling" title=" topic modeling"> topic modeling</a> </p> <a href="https://publications.waset.org/abstracts/48985/investigation-of-topic-modeling-based-semi-supervised-interpretable-document-classifier" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/48985.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">402</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4386</span> DocPro: A Framework for Processing Semantic and Layout Information in Business Documents</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Ming-Jen%20Huang">Ming-Jen Huang</a>, <a href="https://publications.waset.org/abstracts/search?q=Chun-Fang%20Huang"> Chun-Fang Huang</a>, <a href="https://publications.waset.org/abstracts/search?q=Chiching%20Wei"> Chiching Wei</a> </p> <p class="card-text"><strong>Abstract:</strong></p> With the recent advance of the deep neural network, we observe new applications of NLP (natural language processing) and CV (computer vision) powered by deep neural networks for processing business documents. However, creating a real-world document processing system needs to integrate several NLP and CV tasks, rather than treating them separately. There is a need to have a unified approach for processing documents containing textual and graphical elements with rich formats, diverse layout arrangement, and distinct semantics. In this paper, a framework that fulfills this unified approach is presented. The framework includes a representation model definition for holding the information generated by various tasks and specifications defining the coordination between these tasks. The framework is a blueprint for building a system that can process documents with rich formats, styles, and multiple types of elements. The flexible and lightweight design of the framework can help build a system for diverse business scenarios, such as contract monitoring and reviewing. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=document%20processing" title="document processing">document processing</a>, <a href="https://publications.waset.org/abstracts/search?q=framework" title=" framework"> framework</a>, <a href="https://publications.waset.org/abstracts/search?q=formal%20definition" title=" formal definition"> formal definition</a>, <a href="https://publications.waset.org/abstracts/search?q=machine%20learning" title=" machine learning"> machine learning</a> </p> <a href="https://publications.waset.org/abstracts/126703/docpro-a-framework-for-processing-semantic-and-layout-information-in-business-documents" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/126703.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">214</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4385</span> A Similarity Measure for Classification and Clustering in Image Based Medical and Text Based Banking Applications</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=K.%20P.%20Sandesh">K. P. Sandesh</a>, <a href="https://publications.waset.org/abstracts/search?q=M.%20H.%20Suman"> M. H. Suman</a> </p> <p class="card-text"><strong>Abstract:</strong></p> Text processing plays an important role in information retrieval, data-mining, and web search. Measuring the similarity between the documents is an important operation in the text processing field. In this project, a new similarity measure is proposed. To compute the similarity between two documents with respect to a feature the proposed measure takes the following three cases into account: (1) The feature appears in both documents; (2) The feature appears in only one document and; (3) The feature appears in none of the documents. The proposed measure is extended to gauge the similarity between two sets of documents. The effectiveness of our measure is evaluated on several real-world data sets for text classification and clustering problems, especially in banking and health sectors. The results show that the performance obtained by the proposed measure is better than that achieved by the other measures. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=document%20classification" title="document classification">document classification</a>, <a href="https://publications.waset.org/abstracts/search?q=document%20clustering" title=" document clustering"> document clustering</a>, <a href="https://publications.waset.org/abstracts/search?q=entropy" title=" entropy"> entropy</a>, <a href="https://publications.waset.org/abstracts/search?q=accuracy" title=" accuracy"> accuracy</a>, <a href="https://publications.waset.org/abstracts/search?q=classifiers" title=" classifiers"> classifiers</a>, <a href="https://publications.waset.org/abstracts/search?q=clustering%20algorithms" title=" clustering algorithms"> clustering algorithms</a> </p> <a href="https://publications.waset.org/abstracts/22708/a-similarity-measure-for-classification-and-clustering-in-image-based-medical-and-text-based-banking-applications" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/22708.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">518</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4384</span> Visual Template Detection and Compositional Automatic Regular Expression Generation for Business Invoice Extraction</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Anthony%20Proschka">Anthony Proschka</a>, <a href="https://publications.waset.org/abstracts/search?q=Deepak%20Mishra"> Deepak Mishra</a>, <a href="https://publications.waset.org/abstracts/search?q=Merlyn%20Ramanan"> Merlyn Ramanan</a>, <a href="https://publications.waset.org/abstracts/search?q=Zurab%20Baratashvili"> Zurab Baratashvili</a> </p> <p class="card-text"><strong>Abstract:</strong></p> Small and medium-sized businesses receive over 160 billion invoices every year. Since these documents exhibit many subtle differences in layout and text, extracting structured fields such as sender name, amount, and VAT rate from them automatically is an open research question. In this paper, existing work in template-based document extraction is extended, and a system is devised that is able to reliably extract all required fields for up to 70% of all documents in the data set, more than any other previously reported method. The approaches are described for 1) detecting through visual features which template a given document belongs to, 2) automatically generating extraction rules for a given new template by composing regular expressions from multiple components, and 3) computing confidence scores that indicate the accuracy of the automatic extractions. The system can generate templates with as little as one training sample and only requires the ground truth field values instead of detailed annotations such as bounding boxes that are hard to obtain. The system is deployed and used inside a commercial accounting software. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=data%20mining" title="data mining">data mining</a>, <a href="https://publications.waset.org/abstracts/search?q=information%20retrieval" title=" information retrieval"> information retrieval</a>, <a href="https://publications.waset.org/abstracts/search?q=business" title=" business"> business</a>, <a href="https://publications.waset.org/abstracts/search?q=feature%20extraction" title=" feature extraction"> feature extraction</a>, <a href="https://publications.waset.org/abstracts/search?q=layout" title=" layout"> layout</a>, <a href="https://publications.waset.org/abstracts/search?q=business%20data%20processing" title=" business data processing"> business data processing</a>, <a href="https://publications.waset.org/abstracts/search?q=document%20handling" title=" document handling"> document handling</a>, <a href="https://publications.waset.org/abstracts/search?q=end-user%20trained%20information%20extraction" title=" end-user trained information extraction"> end-user trained information extraction</a>, <a href="https://publications.waset.org/abstracts/search?q=document%20archiving" title=" document archiving"> document archiving</a>, <a href="https://publications.waset.org/abstracts/search?q=scanned%20business%20documents" title=" scanned business documents"> scanned business documents</a>, <a href="https://publications.waset.org/abstracts/search?q=automated%20document%20processing" title=" automated document processing"> automated document processing</a>, <a href="https://publications.waset.org/abstracts/search?q=F1-measure" title=" F1-measure"> F1-measure</a>, <a href="https://publications.waset.org/abstracts/search?q=commercial%20accounting%20software" title=" commercial accounting software"> commercial accounting software</a> </p> <a href="https://publications.waset.org/abstracts/128370/visual-template-detection-and-compositional-automatic-regular-expression-generation-for-business-invoice-extraction" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/128370.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">130</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4383</span> Distorted Document Images Dataset for Text Detection and Recognition</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Ilia%20Zharikov">Ilia Zharikov</a>, <a href="https://publications.waset.org/abstracts/search?q=Philipp%20Nikitin"> Philipp Nikitin</a>, <a href="https://publications.waset.org/abstracts/search?q=Ilia%20Vasiliev"> Ilia Vasiliev</a>, <a href="https://publications.waset.org/abstracts/search?q=Vladimir%20Dokholyan"> Vladimir Dokholyan</a> </p> <p class="card-text"><strong>Abstract:</strong></p> With the increasing popularity of document analysis and recognition systems, text detection (TD) and optical character recognition (OCR) in document images become challenging tasks. However, according to our best knowledge, no publicly available datasets for these particular problems exist. In this paper, we introduce a Distorted Document Images dataset (DDI-100) and provide a detailed analysis of the DDI-100 in its current state. 
[4382] Degraded Document Analysis and Extraction of Original Text Document: An Approach without Optical Character Recognition
Authors: L. Hamsaveni, Navya Prakash, Suresha
Abstract: Document image analysis recognizes text and graphics in documents acquired as images. This paper adopts an approach to degraded document image analysis that works without Optical Character Recognition (OCR). The technique involves document-imaging methods such as image fusing and Speeded-Up Robust Features (SURF) detection to identify and extract the degraded regions from a set of document images and obtain an original document with complete information. If the captured document image is skewed, it has to be straightened (deskewed) before further processing. The YCbCr image format is used as a tool for converting grayscale images to the RGB format. The presented algorithm is tested on various types of degraded documents: printed documents, handwritten documents, old script documents, and handwritten image sketches in documents. The purpose of this research is to obtain an original document from a given set of degraded documents of the same source.
Keywords: grayscale image format, image fusing, RGB image format, SURF detection, YCbCr image format
Procedia: https://publications.waset.org/abstracts/64187/degraded-document-analysis-and-extraction-of-original-text-document-an-approach-without-optical-character-recognition | PDF: https://publications.waset.org/abstracts/64187.pdf | Downloads: 377
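The registration step implied by image fusing can be sketched with local-feature matching. ORB is used below as a freely available stand-in for SURF, which in OpenCV requires the contrib build; the input file names are placeholders.

```python
# Matching local features between two captures of the same document,
# a precursor to aligning and fusing them.
import cv2

img1 = cv2.imread("capture_a.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("capture_b.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance suits ORB's binary descriptors; crossCheck keeps
# only mutual best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} candidate correspondences for image fusing")
```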
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=grayscale%20image%20format" title="grayscale image format">grayscale image format</a>, <a href="https://publications.waset.org/abstracts/search?q=image%20fusing" title=" image fusing"> image fusing</a>, <a href="https://publications.waset.org/abstracts/search?q=RGB%20image%20format" title=" RGB image format"> RGB image format</a>, <a href="https://publications.waset.org/abstracts/search?q=SURF%20detection" title=" SURF detection"> SURF detection</a>, <a href="https://publications.waset.org/abstracts/search?q=YCbCr%20image%20format" title=" YCbCr image format"> YCbCr image format</a> </p> <a href="https://publications.waset.org/abstracts/64187/degraded-document-analysis-and-extraction-of-original-text-document-an-approach-without-optical-character-recognition" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/64187.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">377</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4381</span> Binarization and Recognition of Characters from Historical Degraded Documents</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Bency%20Jacob">Bency Jacob</a>, <a href="https://publications.waset.org/abstracts/search?q=S.B.%20Waykar"> S.B. Waykar</a> </p> <p class="card-text"><strong>Abstract:</strong></p> Degradations in historical document images appear due to aging of the documents. It is very difficult to understand and retrieve text from badly degraded documents as there is variation between the document foreground and background. Thresholding of such document images either result in broken characters or detection of false texts. Numerous algorithms exist that can separate text and background efficiently in the textual regions of the document; but portions of background are mistaken as text in areas that hardly contain any text. This paper presents a way to overcome these problems by a robust binarization technique that recovers the text from a severely degraded document images and thereby increases the accuracy of optical character recognition systems. The proposed document recovery algorithm efficiently removes degradations from document images. Here we are using the ostus method ,local thresholding and global thresholding and after the binarization training and recognizing the characters in the degraded documents. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=binarization" title="binarization">binarization</a>, <a href="https://publications.waset.org/abstracts/search?q=denoising" title=" denoising"> denoising</a>, <a href="https://publications.waset.org/abstracts/search?q=global%20thresholding" title=" global thresholding"> global thresholding</a>, <a href="https://publications.waset.org/abstracts/search?q=local%20thresholding" title=" local thresholding"> local thresholding</a>, <a href="https://publications.waset.org/abstracts/search?q=thresholding" title=" thresholding"> thresholding</a> </p> <a href="https://publications.waset.org/abstracts/33322/binarization-and-recognition-of-characters-from-historical-degraded-documents" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/33322.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">344</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4380</span> Adaptation of Hough Transform Algorithm for Text Document Skew Angle Detection</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Kayode%20A.%20Olaniyi">Kayode A. Olaniyi</a>, <a href="https://publications.waset.org/abstracts/search?q=Olabanji%20F.%20Omotoye"> Olabanji F. Omotoye</a>, <a href="https://publications.waset.org/abstracts/search?q=Adeola%20A.%20Ogunleye"> Adeola A. Ogunleye</a> </p> <p class="card-text"><strong>Abstract:</strong></p> The skew detection and correction form an important part of digital document analysis. This is because uncompensated skew can deteriorate document features and can complicate further document image processing steps. Efficient text document analysis and digitization can rarely be achieved when a document is skewed even at a small angle. Once the documents have been digitized through the scanning system and binarization also achieved, document skew correction is required before further image analysis. Research efforts have been put in this area with algorithms developed to eliminate document skew. Skew angle correction algorithms can be compared based on performance criteria. Most important performance criteria are accuracy of skew angle detection, range of skew angle for detection, speed of processing the image, computational complexity and consequently memory space used. The standard Hough Transform has successfully been implemented for text documentation skew angle estimation application. However, the standard Hough Transform algorithm level of accuracy depends largely on how much fine the step size for the angle used. This consequently consumes more time and memory space for increase accuracy and, especially where number of pixels is considerable large. Whenever the Hough transform is used, there is always a tradeoff between accuracy and speed. So a more efficient solution is needed that optimizes space as well as time. In this paper, an improved Hough transform (HT) technique that optimizes space as well as time to robustly detect document skew is presented. The modified algorithm of Hough Transform presents solution to the contradiction between the memory space, running time and accuracy. 
[4379] Towards Law Data Labelling Using Topic Modelling
Authors: Daniel Pinheiro Da Silva Junior, Aline Paes, Daniel De Oliveira, Christiano Lacerda Ghuerren, Marcio Duran
Abstract: The Courts of Accounts are institutions responsible for overseeing Public Administration expenses and pointing out irregularities in them. They face a high demand for processes to be analyzed, whose decisions must be grounded in the applicable laws. Among the large number of existing processes, many cases report similar subjects, so previous decisions on already-analyzed processes can serve as precedents for current processes that refer to similar topics. Identifying similar topics is thus an open yet essential task for identifying similarities between processes. Since the actual number of topics is considerably large, identifying them with a purely manual approach is tedious and error-prone. This paper presents a tool based on machine learning and natural language processing that assists in building a labeled dataset. The tool relies on topic modeling with Latent Dirichlet Allocation to find the topics underlying a document, followed by the Jensen-Shannon distance metric to generate a probability of similarity between document pairs. Furthermore, in a case study with a corpus of decisions of the Rio de Janeiro State Court of Accounts, it was noted that data pre-processing plays an essential role in modeling relevant topics. The combination of topic modeling and a distance metric computed over the documents' topic representations also proved useful in constructing a labeled base of similar and non-similar document pairs.
Keywords: courts of accounts, data labelling, document similarity, topic modeling
Procedia: https://publications.waset.org/abstracts/121281/towards-law-data-labelling-using-topic-modelling | PDF: https://publications.waset.org/abstracts/121281.pdf | Downloads: 179
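The LDA-plus-Jensen-Shannon pipeline is easy to sketch with scikit-learn and SciPy. The corpus, the topic count, and the conversion of distance into a similarity score are illustrative assumptions.

```python
# Per-document topic mixtures via LDA, then Jensen-Shannon distance
# between mixtures as a similarity signal for candidate pairs.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.spatial.distance import jensenshannon

decisions = [
    "irregularity in public works contract payment",
    "contract payment audit public works",
    "pension calculation of retired personnel",
]

counts = CountVectorizer().fit_transform(decisions)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)        # per-document topic distributions

# Jensen-Shannon distance is 0 for identical mixtures, 1 for disjoint ones;
# 1 - distance serves here as a similarity score for labeling pairs.
for i, j in [(0, 1), (0, 2)]:
    d = jensenshannon(theta[i], theta[j], base=2)
    print(f"docs {i},{j}: similarity ~ {1 - d:.2f}")
```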
[4378] Adaptation of Projection Profile Algorithm for Skewed Handwritten Text Line Detection
Authors: Kayode A. Olaniyi, Tola M. Osifeko, Adeola A. Ogunleye
Abstract: Text line segmentation is an important step in document image processing. It is a labeling process that assigns the same label, using a distance-metric probability, to spatially aligned units. Text line detection techniques have been implemented successfully mainly for printed documents, but processing handwritten text, especially in unconstrained documents, remains a key problem: unconstrained handwritten text lines are often not uniformly skewed, the spaces between lines may not be obvious, and matters are complicated by the nature of handwriting and by the overlapping ascenders and descenders of some characters. Text line detection and segmentation therefore represent a leading challenge in handwritten document image processing. Detection methods that rely on the traditional global projection profile of the text document cannot efficiently confront the problem of variable skew angles between different text lines, so formulating a horizontal line as a separator is often not effective. This paper presents a technique for segmenting a handwritten document into distinct lines of text. The proposed algorithm starts by partitioning the text image across its width into vertical strips of about 5% each. In each strip, the histogram of horizontal runs is projected, working under the assumption that text within a single strip is almost parallel. A sliding window moves through the first vertical strip on the left side of the page and identifies each new minimum corresponding to a valley in the projection profile. Each valley represents the starting point of an orientation line, and the ending point is the minimum on the projection profile of the next vertical strip. The derived text lines traverse around any obstructing handwritten connected components by associating them with either the line above or the line below; this association decision is made from the probability obtained by a distance metric. The technique outperforms the global projection profile for text line segmentation and is robust enough to handle skewed documents and documents with lines running into each other.
Keywords: connected-component, projection-profile, segmentation, text-line
Procedia: https://publications.waset.org/abstracts/102464/adaptation-of-projection-profile-algorithm-for-skewed-handwritten-text-line-detection | PDF: https://publications.waset.org/abstracts/102464.pdf | Downloads: 124
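The strip-wise projection step can be sketched in a few lines of NumPy. Valley linking across strips and the distance-metric decision for obstructing components are omitted here.

```python
# Split the page into vertical strips (~5% of the width each), project
# horizontal ink runs per strip, and take profile valleys as candidate
# separators between handwritten text lines.
import numpy as np

def strip_valleys(binary, n_strips=20):
    """binary: 2-D array, 1 = ink. Returns valley rows per vertical strip."""
    h, w = binary.shape
    edges = np.linspace(0, w, n_strips + 1, dtype=int)
    valleys = []
    for s in range(n_strips):
        profile = binary[:, edges[s]:edges[s + 1]].sum(axis=1)
        # rows with (locally) no ink are candidate gaps between text lines
        valleys.append(np.where(profile == 0)[0])
    return valleys

page = np.zeros((100, 200), dtype=np.uint8)
page[20:28, 5:195] = 1            # first handwritten line (synthetic)
page[60:70, 5:195] = 1            # second line
gaps = strip_valleys(page)
print("gap rows in first strip:", gaps[0][:5], "...")
```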
[4377] A Proposed Framework for Software Redocumentation Using Distributed Data Processing Techniques and Ontology
Authors: Laila Khaled Almawaldi, Hiew Khai Hang, Sugumaran A. l. Nallusamy
Abstract: Legacy systems are crucial for organizations, but their intricacy and lack of documentation pose challenges for maintenance and enhancement. Redocumentation of legacy systems, the automatic or semi-automatic creation of documentation for software lacking sufficient records, is vital for enhancing system understandability, maintainability, and knowledge transfer. However, existing redocumentation methods need improvement in data-processing performance and document-generation efficiency, which stems from the necessity of efficiently handling the extensive and complex code of legacy systems. This paper proposes a method for semi-automatic legacy system redocumentation using semantic parallel processing and ontology. Leveraging parallel processing and ontology addresses the current challenges by distributing the workload and creating documentation with logically interconnected data. The paper outlines the challenges in legacy system redocumentation and suggests a redocumentation method based on parallel processing and ontology for improved efficiency and effectiveness.
Keywords: legacy systems, redocumentation, big data analysis, parallel processing
Procedia: https://publications.waset.org/abstracts/185855/a-proposed-framework-for-software-redocumentation-using-distributed-data-processing-techniques-and-ontology | PDF: https://publications.waset.org/abstracts/185855.pdf | Downloads: 45
[4376] Using Closed Frequent Itemsets for Hierarchical Document Clustering
Authors: Cheng-Jhe Lee, Chiun-Chieh Hsu
Abstract: Due to the rapid development of the Internet and the increased availability of digital documents, excessive information on the Internet has led to an information-overflow problem. To solve this problem for effective information retrieval, document clustering has become a popular research topic in text mining. Clustering is the unsupervised classification of data items into groups without the need for training data. Many conventional document clustering methods perform inefficiently on large document collections because they were originally designed for relational databases; they are therefore impractical for real-world document clustering and require special handling for high dimensionality and high volume. We adopt FIHC (Frequent Itemset-based Hierarchical Clustering), a hierarchical clustering method developed for document clustering whose intuition is that the documents in each cluster share some common words. FIHC uses such words to cluster documents and builds a hierarchical topic tree. In this paper, we combine the FIHC algorithm with an ontology to address the semantic problem and mine the meaning behind the words in documents. Furthermore, we use closed frequent itemsets instead of all frequent itemsets, which increases efficiency and scalability. The experimental results show that our method is more accurate than well-known document clustering algorithms.
Keywords: FIHC, documents clustering, ontology, closed frequent itemset
Procedia: https://publications.waset.org/abstracts/41381/using-closed-frequent-itemsets-for-hierarchical-document-clustering | PDF: https://publications.waset.org/abstracts/41381.pdf | Downloads: 399
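Closed frequent itemset mining, the building block this method relies on, can be shown with a brute-force toy. Real miners (CHARM, FP-growth variants) scale far better, and the four-document corpus is invented for illustration.

```python
# Mine closed frequent word sets from documents represented as word sets.
from itertools import combinations

docs = [
    {"apple", "fruit", "price"},
    {"apple", "fruit", "vitamin"},
    {"stock", "price", "market"},
    {"stock", "market", "fruit"},
]
min_support = 2

# 1. Collect all frequent itemsets with their supporting document sets.
vocab = sorted(set().union(*docs))
frequent = {}
for size in range(1, len(vocab) + 1):
    for items in combinations(vocab, size):
        cover = frozenset(i for i, d in enumerate(docs) if set(items) <= d)
        if len(cover) >= min_support:
            frequent[frozenset(items)] = cover

# 2. Keep only closed ones: no strict superset with the same document cover.
closed = [s for s, c in frequent.items()
          if not any(s < t and c == frequent[t] for t in frequent)]
print([set(s) for s in closed])
```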
[4375] Application of Signature Verification Models for Document Recognition
Authors: Boris M. Fedorov, Liudmila P. Goncharenko, Sergey A. Sybachin, Natalia A. Mamedova, Ekaterina V. Makarenkova, Saule Rakhimova
Abstract: In modern economic conditions, the question of whether a signature on a digital document can be recognized correctly, in order to verify an expression of will or confirm a certain operation, is highly relevant. Additional processing complexity arises from the dynamic variability of each individual's signature, as well as from the way the information is processed, because a signature constitutes biometric data. The article discusses the use of artificial intelligence models to improve the quality of signature confirmation in document recognition. Several possible options for applying such models are analyzed. The results of the study show that the authenticity of a signature can be determined correctly even on small samples.
Keywords: signature recognition, biometric data, artificial intelligence, neural networks
Procedia: https://publications.waset.org/abstracts/131387/application-of-signature-verification-models-for-document-recognition | PDF: https://publications.waset.org/abstracts/131387.pdf | Downloads: 148
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=signature%20recognition" title="signature recognition">signature recognition</a>, <a href="https://publications.waset.org/abstracts/search?q=biometric%20data" title=" biometric data"> biometric data</a>, <a href="https://publications.waset.org/abstracts/search?q=artificial%20intelligence" title=" artificial intelligence"> artificial intelligence</a>, <a href="https://publications.waset.org/abstracts/search?q=neural%20networks" title=" neural networks"> neural networks</a> </p> <a href="https://publications.waset.org/abstracts/131387/application-of-signature-verification-models-for-document-recognition" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/131387.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">148</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4374</span> A Proposed Approach for Emotion Lexicon Enrichment</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Amr%20Mansour%20Mohsen">Amr Mansour Mohsen</a>, <a href="https://publications.waset.org/abstracts/search?q=Hesham%20Ahmed%20Hassan"> Hesham Ahmed Hassan</a>, <a href="https://publications.waset.org/abstracts/search?q=Amira%20M.%20Idrees"> Amira M. Idrees</a> </p> <p class="card-text"><strong>Abstract:</strong></p> Document Analysis is an important research field that aims to gather the information by analyzing the data in documents. As one of the important targets for many fields is to understand what people actually want, sentimental analysis field has been one of the vital fields that are tightly related to the document analysis. This research focuses on analyzing text documents to classify each document according to its opinion. The aim of this research is to detect the emotions from text documents based on enriching the lexicon with adapting their content based on semantic patterns extraction. The proposed approach has been presented, and different experiments are applied by different perspectives to reveal the positive impact of the proposed approach on the classification results. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=document%20analysis" title="document analysis">document analysis</a>, <a href="https://publications.waset.org/abstracts/search?q=sentimental%20analysis" title=" sentimental analysis"> sentimental analysis</a>, <a href="https://publications.waset.org/abstracts/search?q=emotion%20detection" title=" emotion detection"> emotion detection</a>, <a href="https://publications.waset.org/abstracts/search?q=WEKA%20tool" title=" WEKA tool"> WEKA tool</a>, <a href="https://publications.waset.org/abstracts/search?q=NRC%20lexicon" title=" NRC lexicon"> NRC lexicon</a> </p> <a href="https://publications.waset.org/abstracts/41465/a-proposed-approach-for-emotion-lexicon-enrichment" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/41465.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">442</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4373</span> Resume Ranking Using Custom Word2vec and Rule-Based Natural Language Processing Techniques</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Subodh%20Chandra%20Shakya">Subodh Chandra Shakya</a>, <a href="https://publications.waset.org/abstracts/search?q=Rajendra%20Sapkota"> Rajendra Sapkota</a>, <a href="https://publications.waset.org/abstracts/search?q=Aakash%20Tamang"> Aakash Tamang</a>, <a href="https://publications.waset.org/abstracts/search?q=Shushant%20Pudasaini"> Shushant Pudasaini</a>, <a href="https://publications.waset.org/abstracts/search?q=Sujan%20Adhikari"> Sujan Adhikari</a>, <a href="https://publications.waset.org/abstracts/search?q=Sajjan%20Adhikari"> Sajjan Adhikari</a> </p> <p class="card-text"><strong>Abstract:</strong></p> Lots of efforts have been made in order to measure the semantic similarity between the text corpora in the documents. Techniques have been evolved to measure the similarity of two documents. One such state-of-art technique in the field of Natural Language Processing (NLP) is word to vector models, which converts the words into their word-embedding and measures the similarity between the vectors. We found this to be quite useful for the task of resume ranking. So, this research paper is the implementation of the word2vec model along with other Natural Language Processing techniques in order to rank the resumes for the particular job description so as to automate the process of hiring. The research paper proposes the system and the findings that were made during the process of building the system. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=chunking" title="chunking">chunking</a>, <a href="https://publications.waset.org/abstracts/search?q=document%20similarity" title=" document similarity"> document similarity</a>, <a href="https://publications.waset.org/abstracts/search?q=information%20extraction" title=" information extraction"> information extraction</a>, <a href="https://publications.waset.org/abstracts/search?q=natural%20language%20processing" title=" natural language processing"> natural language processing</a>, <a href="https://publications.waset.org/abstracts/search?q=word2vec" title=" word2vec"> word2vec</a>, <a href="https://publications.waset.org/abstracts/search?q=word%20embedding" title=" word embedding"> word embedding</a> </p> <a href="https://publications.waset.org/abstracts/129534/resume-ranking-using-custom-word2vec-and-rule-based-natural-language-processing-techniques" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/129534.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">158</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4372</span> Improving the Performance of Requisition Document Online System for Royal Thai Army by Using Time Series Model </h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=D.%20Prangchumpol">D. Prangchumpol</a> </p> <p class="card-text"><strong>Abstract:</strong></p> This research presents a forecasting method of requisition document demands for Military units by using Exponential Smoothing methods to analyze data. The data used in the forecast is an actual data requisition document of The Adjutant General Department. The results of the forecasting model to forecast the requisition of the document found that Holt–Winters’ trend and seasonality method of α=0.1, β=0, γ=0 is appropriate and matches for requisition of documents. In addition, the researcher has developed a requisition online system to improve the performance of requisition documents of The Adjutant General Department, and also ensuring that the operation can be checked. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=requisition" title="requisition">requisition</a>, <a href="https://publications.waset.org/abstracts/search?q=holt%E2%80%93winters" title=" holt–winters"> holt–winters</a>, <a href="https://publications.waset.org/abstracts/search?q=time%20series" title=" time series"> time series</a>, <a href="https://publications.waset.org/abstracts/search?q=royal%20thai%20army" title=" royal thai army"> royal thai army</a> </p> <a href="https://publications.waset.org/abstracts/1503/improving-the-performance-of-requisition-document-online-system-for-royal-thai-army-by-using-time-series-model" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/1503.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">308</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4371</span> A Newspapers Expectations Indicator from Web Scraping</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Pilar%20Rey%20del%20Castillo">Pilar Rey del Castillo</a> </p> <p class="card-text"><strong>Abstract:</strong></p> This document describes the building of an average indicator of the general sentiments about the future exposed in the newspapers in Spain. The raw data are collected through the scraping of the Digital Periodical and Newspaper Library website. Basic tools of natural language processing are later applied to the collected information to evaluate the sentiment strength of each word in the texts using a polarized dictionary. The last step consists of summarizing these sentiments to produce daily indices. The results are a first insight into the applicability of these techniques to produce periodic sentiment indicators. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=natural%20language%20processing" title="natural language processing">natural language processing</a>, <a href="https://publications.waset.org/abstracts/search?q=periodic%20indicator" title=" periodic indicator"> periodic indicator</a>, <a href="https://publications.waset.org/abstracts/search?q=sentiment%20analysis" title=" sentiment analysis"> sentiment analysis</a>, <a href="https://publications.waset.org/abstracts/search?q=web%20scraping" title=" web scraping"> web scraping</a> </p> <a href="https://publications.waset.org/abstracts/143267/a-newspapers-expectations-indicator-from-web-scraping" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/143267.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">133</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4370</span> Hindi Speech Synthesis by Concatenation of Recognized Hand Written Devnagri Script Using Support Vector Machines Classifier</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Saurabh%20Farkya">Saurabh Farkya</a>, <a href="https://publications.waset.org/abstracts/search?q=Govinda%20Surampudi"> Govinda Surampudi</a> </p> <p class="card-text"><strong>Abstract:</strong></p> Optical Character Recognition is one of the current major research areas. This paper is focussed on recognition of Devanagari script and its sound generation. This Paper consists of two parts. First, Optical Character Recognition of Devnagari handwritten Script. Second, speech synthesis of the recognized text. This paper shows an implementation of support vector machines for the purpose of Devnagari Script recognition. The Support Vector Machines was trained with Multi Domain features; Transform Domain and Spatial Domain or Structural Domain feature. Transform Domain includes the wavelet feature of the character. Structural Domain consists of Distance Profile feature and Gradient feature. The Segmentation of the text document has been done in 3 levels-Line Segmentation, Word Segmentation, and Character Segmentation. The pre-processing of the characters has been done with the help of various Morphological operations-Otsu's Algorithm, Erosion, Dilation, Filtration and Thinning techniques. The Algorithm was tested on the self-prepared database, a collection of various handwriting. Further, Unicode was used to convert recognized Devnagari text into understandable computer document. The document so obtained is an array of codes which was used to generate digitized text and to synthesize Hindi speech. Phonemes from the self-prepared database were used to generate the speech of the scanned document using concatenation technique. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=Character%20Recognition%20%28OCR%29" title="Character Recognition (OCR)">Character Recognition (OCR)</a>, <a href="https://publications.waset.org/abstracts/search?q=Text%20to%20Speech%20%28TTS%29" title=" Text to Speech (TTS)"> Text to Speech (TTS)</a>, <a href="https://publications.waset.org/abstracts/search?q=Support%20Vector%20Machines%20%28SVM%29" title=" Support Vector Machines (SVM)"> Support Vector Machines (SVM)</a>, <a href="https://publications.waset.org/abstracts/search?q=Library%20of%20Support%20Vector%20Machines%20%28LIBSVM%29" title=" Library of Support Vector Machines (LIBSVM)"> Library of Support Vector Machines (LIBSVM)</a> </p> <a href="https://publications.waset.org/abstracts/19232/hindi-speech-synthesis-by-concatenation-of-recognized-hand-written-devnagri-script-using-support-vector-machines-classifier" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/19232.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">499</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4369</span> Off-Topic Text Detection System Using a Hybrid Model</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Usama%20Shahid">Usama Shahid</a> </p> <p class="card-text"><strong>Abstract:</strong></p> Be it written documents, news columns, or students' essays, verifying the content can be a time-consuming task. Apart from the spelling and grammar mistakes, the proofreader is also supposed to verify whether the content included in the essay or document is relevant or not. The irrelevant content in any document or essay is referred to as off-topic text and in this paper, we will address the problem of off-topic text detection from a document using machine learning techniques. Our study aims to identify the off-topic content from a document using Echo state network model and we will also compare data with other models. The previous study uses Convolutional Neural Networks and TFIDF to detect off-topic text. We will rearrange the existing datasets and take new classifiers along with new word embeddings and implement them on existing and new datasets in order to compare the results with the previously existing CNN model. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=off%20topic" title="off topic">off topic</a>, <a href="https://publications.waset.org/abstracts/search?q=text%20detection" title=" text detection"> text detection</a>, <a href="https://publications.waset.org/abstracts/search?q=eco%20state%20network" title=" eco state network"> eco state network</a>, <a href="https://publications.waset.org/abstracts/search?q=machine%20learning" title=" machine learning"> machine learning</a> </p> <a href="https://publications.waset.org/abstracts/160685/off-topic-text-detection-system-using-a-hybrid-model" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/160685.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">85</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4368</span> Document-level Sentiment Analysis: An Exploratory Case Study of Low-resource Language Urdu</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Ammarah%20Irum">Ammarah Irum</a>, <a href="https://publications.waset.org/abstracts/search?q=Muhammad%20Ali%20Tahir"> Muhammad Ali Tahir</a> </p> <p class="card-text"><strong>Abstract:</strong></p> Document-level sentiment analysis in Urdu is a challenging Natural Language Processing (NLP) task due to the difficulty of working with lengthy texts in a language with constrained resources. Deep learning models, which are complex neural network architectures, are well-suited to text-based applications in addition to data formats like audio, image, and video. To investigate the potential of deep learning for Urdu sentiment analysis, we implemented five different deep learning models, including Bidirectional Long Short Term Memory (BiLSTM), Convolutional Neural Network (CNN), Convolutional Neural Network with Bidirectional Long Short Term Memory (CNN-BiLSTM), and Bidirectional Encoder Representation from Transformer (BERT). In this study, we developed a hybrid deep learning model called BiLSTM-Single Layer Multi Filter Convolutional Neural Network (BiLSTM-SLMFCNN) by fusing BiLSTM and CNN architecture. The proposed and baseline techniques are applied on Urdu Customer Support data set and IMDB Urdu movie review data set by using pre-trained Urdu word embedding that are suitable for sentiment analysis at the document level. Results of these techniques are evaluated and our proposed model outperforms all other deep learning techniques for Urdu sentiment analysis. BiLSTM-SLMFCNN outperformed the baseline deep learning models and achieved 83%, 79%, 83% and 94% accuracy on small, medium and large sized IMDB Urdu movie review data set and Urdu Customer Support data set respectively. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=urdu%20sentiment%20analysis" title="urdu sentiment analysis">urdu sentiment analysis</a>, <a href="https://publications.waset.org/abstracts/search?q=deep%20learning" title=" deep learning"> deep learning</a>, <a href="https://publications.waset.org/abstracts/search?q=natural%20language%20processing" title=" natural language processing"> natural language processing</a>, <a href="https://publications.waset.org/abstracts/search?q=opinion%20mining" title=" opinion mining"> opinion mining</a>, <a href="https://publications.waset.org/abstracts/search?q=low-resource%20language" title=" low-resource language"> low-resource language</a> </p> <a href="https://publications.waset.org/abstracts/172973/document-level-sentiment-analysis-an-exploratory-case-study-of-low-resource-language-urdu" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/172973.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">72</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4367</span> A Methodology for Automatic Diversification of Document Categories</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Dasom%20Kim">Dasom Kim</a>, <a href="https://publications.waset.org/abstracts/search?q=Chen%20Liu"> Chen Liu</a>, <a href="https://publications.waset.org/abstracts/search?q=Myungsu%20Lim"> Myungsu Lim</a>, <a href="https://publications.waset.org/abstracts/search?q=Su-Hyeon%20Jeon"> Su-Hyeon Jeon</a>, <a href="https://publications.waset.org/abstracts/search?q=ByeoungKug%20Jeon"> ByeoungKug Jeon</a>, <a href="https://publications.waset.org/abstracts/search?q=Kee-Young%20Kwahk"> Kee-Young Kwahk</a>, <a href="https://publications.waset.org/abstracts/search?q=Namgyu%20Kim"> Namgyu Kim</a> </p> <p class="card-text"><strong>Abstract:</strong></p> Recently, numerous documents including unstructured data and text have been created due to the rapid increase in the usage of social media and the Internet. Each document is usually provided with a specific category for the convenience of the users. In the past, the categorization was performed manually. However, in the case of manual categorization, not only can the accuracy of the categorization be not guaranteed but the categorization also requires a large amount of time and huge costs. Many studies have been conducted towards the automatic creation of categories to solve the limitations of manual categorization. Unfortunately, most of these methods cannot be applied to categorizing complex documents with multiple topics because the methods work by assuming that one document can be categorized into one category only. In order to overcome this limitation, some studies have attempted to categorize each document into multiple categories. However, they are also limited in that their learning process involves training using a multi-categorized document set. These methods therefore cannot be applied to multi-categorization of most documents unless multi-categorized training sets are provided. 
To overcome traditional multi-categorization algorithms' requirement for a multi-categorized training set, we previously proposed a methodology that can extend the category of a single-categorized document to multiple categories by analyzing the relationships among categories, topics, and documents. In this paper, we design a survey-based verification scenario for estimating the accuracy of our automatic categorization methodology. <p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=big%20data%20analysis" title="big data analysis">big data analysis</a>, <a href="https://publications.waset.org/abstracts/search?q=document%20classification" title=" document classification"> document classification</a>, <a href="https://publications.waset.org/abstracts/search?q=multi-category" title=" multi-category"> multi-category</a>, <a href="https://publications.waset.org/abstracts/search?q=text%20mining" title=" text mining"> text mining</a>, <a href="https://publications.waset.org/abstracts/search?q=topic%20analysis" title=" topic analysis"> topic analysis</a> </p> <a href="https://publications.waset.org/abstracts/36754/a-methodology-for-automatic-diversification-of-document-categories" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/36754.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">272</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4366</span> Neural Graph Matching for Modification Similarity Applied to Electronic Document Comparison</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Po-Fang%20Hsu">Po-Fang Hsu</a>, <a href="https://publications.waset.org/abstracts/search?q=Chiching%20Wei"> Chiching Wei</a> </p> <p class="card-text"><strong>Abstract:</strong></p> In this paper, we present a novel neural graph matching approach applied to document comparison. Document comparison is a common task in the legal and financial industries. In some cases, the most important differences may be the addition or omission of words, sentences, clauses, or paragraphs. However, comparison is challenging when the whole editing process has not been recorded or traced. Under such temporal uncertainty, we explore the potential of our approach to approximate an accurate comparison and establish which element blocks are related to one another by editing. We first apply a document layout analysis that combines traditional and modern techniques to segment layouts into blocks of various types. We then transform the problem into one of layout graph matching with textual awareness. Graph matching is a long-studied problem with a broad range of applications; however, unlike previous work focusing on visual images or structural layout, we also bring textual features into our model to adapt it to this domain. Specifically, starting from the electronic document, we introduce an encoder to handle the visual presentation decoded from the PDF.
Additionally, because modifications can make document layout analysis inconsistent across the versions being compared, and blocks can be merged or split, our neural graph approach adopts the Sinkhorn divergence, which addresses both issues through many-to-many block matching. We demonstrate this on two categories of layouts, legal agreements and scientific articles, collected from our real-case datasets. <p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=document%20comparison" title="document comparison">document comparison</a>, <a href="https://publications.waset.org/abstracts/search?q=graph%20matching" title=" graph matching"> graph matching</a>, <a href="https://publications.waset.org/abstracts/search?q=graph%20neural%20network" title=" graph neural network"> graph neural network</a>, <a href="https://publications.waset.org/abstracts/search?q=modification%20similarity" title=" modification similarity"> modification similarity</a>, <a href="https://publications.waset.org/abstracts/search?q=multi-modal" title=" multi-modal"> multi-modal</a> </p> <a href="https://publications.waset.org/abstracts/141898/neural-graph-matching-for-modification-similarity-applied-to-electronic-document-comparison" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/141898.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">179</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4365</span> Use of Interpretable Evolved Search Query Classifiers for Sinhala Documents</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Prasanna%20Haddela">Prasanna Haddela</a> </p> <p class="card-text"><strong>Abstract:</strong></p> Document analysis is a well-matured yet still active research field, partly as a result of the intricate nature of building computational tools, but also due to the inherent problems arising from the variety and complexity of human languages. Breaking down language barriers is vital in enabling access to a number of recent technologies. This paper investigates the application of document classification methods to new Sinhalese datasets. This language is geographically isolated and rich in many of its own unique features. We examine the interpretability of the classification models, with a particular focus on the use of evolved Lucene search queries, generated by a Genetic Algorithm (GA), as a method of document classification. We compare the accuracy and interpretability of these search queries with those of other popular classifiers. The results are promising and are roughly in line with previous work on English-language datasets.
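<p class="card-text">A toy sketch of the evolved-query idea under strong simplifications: plain keyword matching stands in for Lucene query evaluation, the corpus is four one-line "documents", and the population size, mutation scheme, and fitness function are illustrative choices rather than the paper's:</p> <pre><code class="language-python">
# Sketch: evolve a keyword query (a set of terms) so that documents
# matching any of its terms coincide with the positive class.
import random

random.seed(0)
docs = [("cricket match score sri lanka", 1),
        ("national cricket team wins", 1),
        ("election results announced", 0),
        ("parliament passes new budget", 0)]
vocab = sorted({w for text, _ in docs for w in text.split()})

def matches(text, query):
    # A document "matches" if it shares at least one term with the query.
    return not query.isdisjoint(text.split())

def fitness(query):
    # F1-style score of "document matches query" against the label.
    tp = sum(1 for t, y in docs if y and matches(t, query))
    fp = sum(1 for t, y in docs if not y and matches(t, query))
    fn = sum(1 for t, y in docs if y and not matches(t, query))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def mutate(query):
    q = set(query)
    q.symmetric_difference_update({random.choice(vocab)})  # flip one term
    return q or {random.choice(vocab)}

population = [{random.choice(vocab)} for _ in range(20)]
for _ in range(50):  # generations: keep the best, refill with mutants
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    population = parents + [mutate(random.choice(parents)) for _ in range(10)]

best = max(population, key=fitness)
print("evolved query terms:", best, "fitness:", fitness(best))
</code></pre>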
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=evolved%20search%20queries" title="evolved search queries">evolved search queries</a>, <a href="https://publications.waset.org/abstracts/search?q=Sinhala%20document%20classification" title=" Sinhala document classification"> Sinhala document classification</a>, <a href="https://publications.waset.org/abstracts/search?q=Lucene%20Sinhala%20analyzer" title=" Lucene Sinhala analyzer"> Lucene Sinhala analyzer</a>, <a href="https://publications.waset.org/abstracts/search?q=interpretable%20text%20classification" title=" interpretable text classification"> interpretable text classification</a>, <a href="https://publications.waset.org/abstracts/search?q=genetic%20algorithm" title=" genetic algorithm"> genetic algorithm</a> </p> <a href="https://publications.waset.org/abstracts/126324/use-of-interpretable-evolved-search-query-classifiers-for-sinhala-documents" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/126324.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">114</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4364</span> DCDNet: Lightweight Document Corner Detection Network Based on Attention Mechanism</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Kun%20Xu">Kun Xu</a>, <a href="https://publications.waset.org/abstracts/search?q=Yuan%20Xu"> Yuan Xu</a>, <a href="https://publications.waset.org/abstracts/search?q=Jia%20Qiao"> Jia Qiao</a> </p> <p class="card-text"><strong>Abstract:</strong></p> The document detection plays an important role in optical character recognition and text analysis. Because the traditional detection methods have weak generalization ability, and deep neural network has complex structure and large number of parameters, which cannot be well applied in mobile devices, this paper proposes a lightweight Document Corner Detection Network (DCDNet). DCDNet is a two-stage architecture. The first stage with Encoder-Decoder structure adopts depthwise separable convolution to greatly reduce the network parameters. After introducing the Feature Attention Union (FAU) module, the second stage enhances the feature information of spatial and channel dim and adaptively adjusts the size of receptive field to enhance the feature expression ability of the model. Aiming at solving the problem of the large difference in the number of pixel distribution between corner and non-corner, Weighted Binary Cross Entropy Loss (WBCE Loss) is proposed to define corner detection problem as a classification problem to make the training process more efficient. In order to make up for the lack of Dataset of document corner detection, a Dataset containing 6620 images named Document Corner Detection Dataset (DCDD) is made. Experimental results show that the proposed method can obtain fast, stable and accurate detection results on DCDD. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=document%20detection" title="document detection">document detection</a>, <a href="https://publications.waset.org/abstracts/search?q=corner%20detection" title=" corner detection"> corner detection</a>, <a href="https://publications.waset.org/abstracts/search?q=attention%20mechanism" title=" attention mechanism"> attention mechanism</a>, <a href="https://publications.waset.org/abstracts/search?q=lightweight" title=" lightweight"> lightweight</a> </p> <a href="https://publications.waset.org/abstracts/152145/dcdnet-lightweight-document-corner-detection-network-based-on-attention-mechanism" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/152145.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">354</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4363</span> An Experiential Learning of Ontology-Based Multi-document Summarization by Removal Summarization Techniques</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Pranjali%20Avinash%20Yadav-Deshmukh">Pranjali Avinash Yadav-Deshmukh</a> </p> <p class="card-text"><strong>Abstract:</strong></p> Remarkable development of the Internet along with the new technological innovation, such as high-speed systems and affordable large storage space have led to a tremendous increase in the amount and accessibility to digital records. For any person, studying of all these data is tremendously time intensive, so there is a great need to access effective multi-document summarization (MDS) systems, which can successfully reduce details found in several records into a short, understandable summary or conclusion. For semantic representation of textual details in ontology area, as a theoretical design, our system provides a significant structure. The stability of using the ontology in fixing multi-document summarization problems in the sector of catastrophe control is finding its recommended design. Saliency ranking is usually allocated to each phrase and phrases are rated according to the ranking, then the top rated phrases are chosen as the conclusion. With regards to the conclusion quality, wide tests on a selection of media announcements are appropriate for “Jammu Kashmir Overflow in 2014” records. Ontology centered multi-document summarization methods using “NLP centered extraction” outshine other baselines. Our participation in recommended component is to implement the details removal methods (NLP) to enhance the results. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=disaster%20management" title="disaster management">disaster management</a>, <a href="https://publications.waset.org/abstracts/search?q=extraction%20technique" title=" extraction technique"> extraction technique</a>, <a href="https://publications.waset.org/abstracts/search?q=k-means" title=" k-means"> k-means</a>, <a href="https://publications.waset.org/abstracts/search?q=multi-document%20summarization" title=" multi-document summarization"> multi-document summarization</a>, <a href="https://publications.waset.org/abstracts/search?q=NLP" title=" NLP"> NLP</a>, <a href="https://publications.waset.org/abstracts/search?q=ontology" title=" ontology"> ontology</a>, <a href="https://publications.waset.org/abstracts/search?q=sentence%20extraction" title=" sentence extraction"> sentence extraction</a> </p> <a href="https://publications.waset.org/abstracts/32426/an-experiential-learning-of-ontology-based-multi-document-summarization-by-removal-summarization-techniques" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/32426.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">386</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4362</span> Efficient Layout-Aware Pretraining for Multimodal Form Understanding</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Armineh%20Nourbakhsh">Armineh Nourbakhsh</a>, <a href="https://publications.waset.org/abstracts/search?q=Sameena%20Shah"> Sameena Shah</a>, <a href="https://publications.waset.org/abstracts/search?q=Carolyn%20Rose"> Carolyn Rose</a> </p> <p class="card-text"><strong>Abstract:</strong></p> Layout-aware language models have been used to create multimodal representations for documents that are in image form, achieving relatively high accuracy in document understanding tasks. However, the large number of parameters in the resulting models makes building and using them prohibitive without access to high-performing processing units with large memory capacity. We propose an alternative approach that can create efficient representations without the need for a neural visual backbone. This leads to an 80% reduction in the number of parameters compared to the smallest SOTA model, widely expanding applicability. In addition, our layout embeddings are pre-trained on spatial and visual cues alone and only fused with text embeddings in downstream tasks, which can facilitate applicability to low-resource of multi-lingual domains. Despite using 2.5% of training data, we show competitive performance on two form understanding tasks: semantic labeling and link prediction. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=layout%20understanding" title="layout understanding">layout understanding</a>, <a href="https://publications.waset.org/abstracts/search?q=form%20understanding" title=" form understanding"> form understanding</a>, <a href="https://publications.waset.org/abstracts/search?q=multimodal%20document%20understanding" title=" multimodal document understanding"> multimodal document understanding</a>, <a href="https://publications.waset.org/abstracts/search?q=bias-augmented%20attention" title=" bias-augmented attention"> bias-augmented attention</a> </p> <a href="https://publications.waset.org/abstracts/147955/efficient-layout-aware-pretraining-for-multimodal-form-understanding" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/147955.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">148</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4361</span> A U-Net Based Architecture for Fast and Accurate Diagram Extraction</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Revoti%20Prasad%20Bora">Revoti Prasad Bora</a>, <a href="https://publications.waset.org/abstracts/search?q=Saurabh%20Yadav"> Saurabh Yadav</a>, <a href="https://publications.waset.org/abstracts/search?q=Nikita%20Katyal"> Nikita Katyal</a> </p> <p class="card-text"><strong>Abstract:</strong></p> In the context of educational data mining, the use case of extracting information from images containing both text and diagrams is of high importance. Hence, document analysis requires the extraction of diagrams from such images and processes the text and diagrams separately. To the author’s best knowledge, none among plenty of approaches for extracting tables, figures, etc., suffice the need for real-time processing with high accuracy as needed in multiple applications. In the education domain, diagrams can be of varied characteristics viz. line-based i.e. geometric diagrams, chemical bonds, mathematical formulas, etc. There are two broad categories of approaches that try to solve similar problems viz. traditional computer vision based approaches and deep learning approaches. The traditional computer vision based approaches mainly leverage connected components and distance transform based processing and hence perform well in very limited scenarios. The existing deep learning approaches either leverage YOLO or faster-RCNN architectures. These approaches suffer from a performance-accuracy tradeoff. This paper proposes a U-Net based architecture that formulates the diagram extraction as a segmentation problem. The proposed method provides similar accuracy with a much faster extraction time as compared to the mentioned state-of-the-art approaches. Further, the segmentation mask in this approach allows the extraction of diagrams of irregular shapes. 
<p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=computer%20vision" title="computer vision">computer vision</a>, <a href="https://publications.waset.org/abstracts/search?q=deep-learning" title=" deep-learning"> deep-learning</a>, <a href="https://publications.waset.org/abstracts/search?q=educational%20data%20mining" title=" educational data mining"> educational data mining</a>, <a href="https://publications.waset.org/abstracts/search?q=faster-RCNN" title=" faster-RCNN"> faster-RCNN</a>, <a href="https://publications.waset.org/abstracts/search?q=figure%20extraction" title=" figure extraction"> figure extraction</a>, <a href="https://publications.waset.org/abstracts/search?q=image%20segmentation" title=" image segmentation"> image segmentation</a>, <a href="https://publications.waset.org/abstracts/search?q=real-time%20document%20analysis" title=" real-time document analysis"> real-time document analysis</a>, <a href="https://publications.waset.org/abstracts/search?q=text%20extraction" title=" text extraction"> text extraction</a>, <a href="https://publications.waset.org/abstracts/search?q=U-Net" title=" U-Net"> U-Net</a>, <a href="https://publications.waset.org/abstracts/search?q=YOLO" title=" YOLO"> YOLO</a> </p> <a href="https://publications.waset.org/abstracts/148396/a-u-net-based-architecture-for-fast-and-accurate-diagram-extraction" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/148396.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">137</span> </span> </div> </div> <div class="card paper-listing mb-3 mt-3"> <h5 class="card-header" style="font-size:.9rem"><span class="badge badge-info">4360</span> Semantic Indexing Improvement for Textual Documents: Contribution of Classification by Fuzzy Association Rules</h5> <div class="card-body"> <p class="card-text"><strong>Authors:</strong> <a href="https://publications.waset.org/abstracts/search?q=Mohsen%20Maraoui">Mohsen Maraoui</a> </p> <p class="card-text"><strong>Abstract:</strong></p> In the aim of natural language processing applications improvement, such as information retrieval, machine translation, lexical disambiguation, we focus on statistical approach to semantic indexing for multilingual text documents based on conceptual network formalism. We propose to use this formalism as an indexing language to represent the descriptive concepts and their weighting. These concepts represent the content of the document. Our contribution is based on two steps. In the first step, we propose the extraction of index terms using the multilingual lexical resource Euro WordNet (EWN). In the second step, we pass from the representation of index terms to the representation of index concepts through conceptual network formalism. This network is generated using the EWN resource and pass by a classification step based on association rules model (in attempt to discover the non-taxonomic relations or contextual relations between the concepts of a document). These relations are latent relations buried in the text and carried by the semantic context of the co-occurrence of concepts in the document. Our proposed indexing approach can be applied to text documents in various languages because it is based on a linguistic method adapted to the language through a multilingual thesaurus. 
Next, we apply the same statistical process, regardless of the language, to extract the significant concepts and their associated weights. We show that the proposed indexing approach provides encouraging results. <p class="card-text"><strong>Keywords:</strong> <a href="https://publications.waset.org/abstracts/search?q=concept%20extraction" title="concept extraction">concept extraction</a>, <a href="https://publications.waset.org/abstracts/search?q=conceptual%20network%20formalism" title=" conceptual network formalism"> conceptual network formalism</a>, <a href="https://publications.waset.org/abstracts/search?q=fuzzy%20association%20rules" title=" fuzzy association rules"> fuzzy association rules</a>, <a href="https://publications.waset.org/abstracts/search?q=multilingual%20thesaurus" title=" multilingual thesaurus"> multilingual thesaurus</a>, <a href="https://publications.waset.org/abstracts/search?q=semantic%20indexing" title=" semantic indexing"> semantic indexing</a> </p> <a href="https://publications.waset.org/abstracts/98854/semantic-indexing-improvement-for-textual-documents-contribution-of-classification-by-fuzzy-association-rules" class="btn btn-primary btn-sm">Procedia</a> <a href="https://publications.waset.org/abstracts/98854.pdf" target="_blank" class="btn btn-primary btn-sm">PDF</a> <span class="bg-info text-light px-1 py-1 float-right rounded"> Downloads <span class="badge badge-light">141</span> </span>
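<p class="card-text">A small sketch of the co-occurrence-based relation discovery behind the classification step, in pure Python; crisp support and confidence are computed here, whereas the paper's fuzzy association rules would weight graded concept memberships:</p> <pre><code class="language-python">
# Sketch: mining contextual relations between concepts from co-occurrence.
from itertools import combinations

# Each "document" is the set of concepts it was indexed with.
docs = [{"bank", "finance", "loan"},
        {"bank", "river", "water"},
        {"finance", "loan", "interest"},
        {"bank", "finance", "interest"}]

def support(itemset):
    # Fraction of documents containing every concept in the itemset.
    return sum(1 for d in docs if itemset.issubset(d)) / len(docs)

concepts = sorted(set().union(*docs))
for a, b in combinations(concepts, 2):
    s = support({a, b})
    if s >= 0.5:  # minimum support threshold
        conf_ab = s / support({a})   # confidence of the rule a -> b
        print(f"{a} -> {b}: support={s:.2f}, confidence={conf_ab:.2f}")
</code></pre>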
class="row"> <div class="col-md-2"> <ul class="list-unstyled"> About <li><a href="https://waset.org/page/support">About Us</a></li> <li><a href="https://waset.org/page/support#legal-information">Legal</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/WASET-16th-foundational-anniversary.pdf">WASET celebrates its 16th foundational anniversary</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Account <li><a href="https://waset.org/profile">My Account</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Explore <li><a href="https://waset.org/disciplines">Disciplines</a></li> <li><a href="https://waset.org/conferences">Conferences</a></li> <li><a href="https://waset.org/conference-programs">Conference Program</a></li> <li><a href="https://waset.org/committees">Committees</a></li> <li><a href="https://publications.waset.org">Publications</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Research <li><a href="https://publications.waset.org/abstracts">Abstracts</a></li> <li><a href="https://publications.waset.org">Periodicals</a></li> <li><a href="https://publications.waset.org/archive">Archive</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Open Science <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Open-Science-Philosophy.pdf">Open Science Philosophy</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Open-Science-Award.pdf">Open Science Award</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Open-Society-Open-Science-and-Open-Innovation.pdf">Open Innovation</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Postdoctoral-Fellowship-Award.pdf">Postdoctoral Fellowship Award</a></li> <li><a target="_blank" rel="nofollow" href="https://publications.waset.org/static/files/Scholarly-Research-Review.pdf">Scholarly Research Review</a></li> </ul> </div> <div class="col-md-2"> <ul class="list-unstyled"> Support <li><a href="https://waset.org/page/support">Support</a></li> <li><a href="https://waset.org/profile/messages/create">Contact Us</a></li> <li><a href="https://waset.org/profile/messages/create">Report Abuse</a></li> </ul> </div> </div> </div> </div> </div> <div class="container text-center"> <hr style="margin-top:0;margin-bottom:.3rem;"> <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" class="text-muted small">Creative Commons Attribution 4.0 International License</a> <div id="copy" class="mt-2">© 2024 World Academy of Science, Engineering and Technology</div> </div> </footer> <a href="javascript:" id="return-to-top"><i class="fas fa-arrow-up"></i></a> <div class="modal" id="modal-template"> <div class="modal-dialog"> <div class="modal-content"> <div class="row m-0 mt-1"> <div class="col-md-12"> <button type="button" class="close" data-dismiss="modal" aria-label="Close"><span aria-hidden="true">×</span></button> </div> </div> <div class="modal-body"></div> </div> </div> </div> <script src="https://cdn.waset.org/static/plugins/jquery-3.3.1.min.js"></script> <script src="https://cdn.waset.org/static/plugins/bootstrap-4.2.1/js/bootstrap.bundle.min.js"></script> <script src="https://cdn.waset.org/static/js/site.js?v=150220211556"></script> <script> jQuery(document).ready(function() { /*jQuery.get("https://publications.waset.org/xhr/user-menu", function (response) { 
jQuery('#mainNavMenu').append(response); });*/ jQuery.get({ url: "https://publications.waset.org/xhr/user-menu", cache: false }).then(function(response){ jQuery('#mainNavMenu').append(response); }); }); </script> </body> </html>