Showing 1–10 of 10 results for author: Mak, B
arXiv:2405.00980
cs.CL cs.CV
A Hong Kong Sign Language Corpus Collected from Sign-interpreted TV News
Authors: Zhe Niu, Ronglai Zuo, Brian Mak, Fangyun Wei
This paper introduces TVB-HKSL-News, a new Hong Kong sign language (HKSL) dataset collected from a TV news program over a period of 7 months. The dataset is collected to enrich resources for HKSL and support research in large-vocabulary continuous sign language recognition (SLR) and translation (SLT). It consists of 16.07 hours of sign videos of two signers with a vocabulary of 6,515 glosses (for SLR) and 2,850 Chinese characters or 18K Chinese words (for SLT). One signer has 11.66 hours of sign videos and the other has 4.41 hours. One objective in building the dataset is to support the investigation of how well large-vocabulary continuous sign language recognition/translation can be done for a single signer given a (relatively) large amount of his/her training data, which could potentially lead to the development of new modeling methods.
Besides, most parts of the data collection pipeline are automated with little human intervention; we believe that our collection method can be scaled up to collect more sign language data easily for SLT in the future for any sign languages if such sign-interpreted videos are available.
arXiv:2401.05336
cs.CV
Towards Online Continuous Sign Language Recognition and Translation
Authors: Ronglai Zuo, Fangyun Wei, Brian Mak
Submitted 22 September, 2024; v1 submitted 10 January, 2024; originally announced January 2024.
Comments: Accepted to EMNLP 2024
Research on continuous sign language recognition (CSLR) is essential to bridge the communication gap between deaf and hearing individuals. Numerous previous studies have trained their models using the connectionist temporal classification (CTC) loss. During inference, these CTC-based models generally require the entire sign video as input to make predictions, a process known as offline recognition, which suffers from high latency and substantial memory usage. In this work, we take the first step towards online CSLR. Our approach consists of three phases: 1) developing a sign dictionary; 2) training an isolated sign language recognition model on the dictionary; and 3) employing a sliding window approach on the input sign sequence, feeding each sign clip to the optimized model for online recognition. Additionally, our online recognition model can be extended to support online translation by integrating a gloss-to-text network and can enhance the performance of any offline model. With these extensions, our online approach achieves new state-of-the-art performance on three popular benchmarks across various task settings.
arXiv:2401.04730
cs.CV
A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars
Authors: Ronglai Zuo, Fangyun Wei, Zenggui Chen, Brian Mak, Jiaolong Yang, Xin Tong
The objective of this paper is to develop a functional system for translating spoken languages into sign languages, referred to as Spoken2Sign translation. The Spoken2Sign task is orthogonal and complementary to traditional sign language to spoken language (Sign2Spoken) translation. To enable Spoken2Sign translation, we present a simple baseline consisting of three steps: 1) creating a gloss-video dictionary using existing Sign2Spoken benchmarks; 2) estimating a 3D sign for each sign video in the dictionary; 3) training a Spoken2Sign model, which is composed of a Text2Gloss translator, a sign connector, and a rendering module, with the aid of the yielded gloss-3D sign dictionary. The translation results are then displayed through a sign avatar. As far as we know, we are the first to present the Spoken2Sign task in an output format of 3D signs. In addition to its capability of Spoken2Sign translation, we also demonstrate that two by-products of our approach, 3D keypoint augmentation and multi-view understanding, can assist in keypoint-based sign language understanding.
Submitted 3 July, 2024; v1 submitted 9 January, 2024; originally announced January 2024.
Comments: Accepted by ECCV 2024
arXiv:2303.12080
Sign languages are visual languages which convey information by signers' handshape, facial expression, body movement, and so forth. Due to the inherent restriction of combinations of these visual ingredients, there exist a significant number of visually indistinguishable signs (VISigns) in sign languages, which limits the recognition capacity of vision neural networks. To mitigate the problem, we propose the Natural Language-Assisted Sign Language Recognition (NLA-SLR) framework, which exploits semantic information contained in glosses (sign labels). First, for VISigns with similar semantic meanings, we propose language-aware label smoothing by generating soft labels for each training sign whose smoothing weights are computed from the normalized semantic similarities among the glosses to ease training. Second, for VISigns with distinct semantic meanings, we present an inter-modality mixup technique which blends vision and gloss features to further maximize the separability of different signs under the supervision of blended labels. Besides, we also introduce a novel backbone, video-keypoint network, which not only models both RGB videos and human body keypoints but also derives knowledge from sign videos of different temporal receptive fields.
Submitted 21 March, 2023; originally announced March 2023.
Comments: Accepted by CVPR 2023. Codes are available at
arXiv:2303.00502
cs.SD cs.CV eess.AS
On the Audio-visual Synchronization for Lip-to-Speech Synthesis
Authors: Zhe Niu, Brian Mak
Most lip-to-speech (LTS) synthesis models are trained and evaluated under the assumption that the audio-video pairs in the dataset are perfectly synchronized. In this work, we show that the commonly used audio-visual datasets, such as GRID, TCD-TIMIT, and Lip2Wav, can have data asynchrony issues. Training lip-to-speech with such datasets may further cause the model asynchrony issue -- that is, the generated speech and the input video are out of sync. To address these asynchrony issues, we propose a synchronized lip-to-speech (SLTS) model with an automatic synchronization mechanism (ASM) to correct data asynchrony and penalize model asynchrony. We further demonstrate the limitation of the commonly adopted evaluation metrics for LTS with asynchronous test data and introduce an audio alignment frontend before the metrics sensitive to time alignment for better evaluation.
Submitted 1 March, 2023; originally announced March 2023.
Most deep-learning-based continuous sign language recognition (CSLR) models share a similar backbone consisting of a visual module, a sequential module, and an alignment module. However, due to limited training samples, a connectionist temporal classification loss may not train such CSLR backbones sufficiently. In this work, we propose three auxiliary tasks to enhance the CSLR backbones. The first task enhances the visual module, which is sensitive to the insufficient training problem, from the perspective of consistency. Specifically, since the information of sign languages is mainly included in signers' facial expressions and hand movements, a keypoint-guided spatial attention module is developed to enforce the visual module to focus on informative regions, i.e., spatial attention consistency. Second, noticing that both the output features of the visual and sequential modules represent the same sentence, to better exploit the backbone's power, a sentence embedding consistency constraint is imposed between the visual and sequential modules to enhance the representation power of both features. We name the CSLR model trained with the above auxiliary tasks as consistency-enhanced CSLR, which performs well on signer-dependent datasets in which all signers appear during both training and testing. To make it more robust for the signer-independent setting, a signer removal module based on feature disentanglement is further proposed to remove signer information from the backbone. Extensive ablation studies are conducted to validate the effectiveness of these auxiliary tasks.
More remarkably, with a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks, PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily.
arXiv:2212.13023
cs.CV
Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal
Authors: Ronglai Zuo, Brian Mak
Submitted 11 January, 2024; v1 submitted 26 December, 2022; originally announced December 2022.
Comments: Accepted by ACM TOMM
Sign languages are visual languages using manual articulations and non-manual elements to convey information. For sign language recognition and translation, the majority of existing approaches directly encode RGB videos into hidden representations. RGB videos, however, are raw signals with substantial visual redundancy, leading the encoder to overlook the key information for sign language understanding. To mitigate this problem and better incorporate domain knowledge, such as handshape and body movement, we introduce a dual visual encoder containing two separate streams to model both the raw videos and the keypoint sequences generated by an off-the-shelf keypoint estimator. To make the two streams interact with each other, we explore a variety of techniques, including bidirectional lateral connection, sign pyramid network with auxiliary supervision, and frame-level self-distillation. The resulting model is called TwoStream-SLR, which is competent for sign language recognition (SLR). TwoStream-SLR is extended to a sign language translation (SLT) model, TwoStream-SLT, by simply attaching an extra translation network.
arXiv:2211.01367
cs.CV
Two-Stream Network for Sign Language Recognition and Translation
Authors: Yutong Chen, Ronglai Zuo, Fangyun Wei, Yu Wu, Shujie Liu, Brian Mak
Submitted 22 March, 2023; v1 submitted 2 November, 2022; originally announced November 2022.
Comments: Accepted by NeurIPS 2022. Code and models are available at:
arXiv:2008.08567
cs.CL cs.LG
Transformer based Multilingual document Embedding model
Authors: Wei Li, Brian Mak
One of the current state-of-the-art multilingual document embedding model LASER is based on the bidirectional LSTM neural machine translation model. This paper presents a transformer-based sentence/document embedding model, T-LASER, which makes three significant improvements. Firstly, the BiLSTM layers is replaced by the attention-based transformer layers, which is more capable of learning sequential patterns in longer texts. Secondly, due to the absence of recurrence, T-LASER enables faster parallel computations in the encoder to generate the text embedding. Thirdly, we augment the NMT translation loss function with an additional novel distance constraint loss. This distance constraint loss would further bring the embeddings of parallel sentences close together in the vector space; we call the T-LASER model trained with distance constraint, cT-LASER.
Submitted 20 August, 2020; v1 submitted 19 August, 2020; originally announced August 2020.
This paper investigates a cross-lingual document embedding method that improves the current Neural machine Translation framework based Document Vector (NTDV or simply NV). NV is developed with a self-attention mechanism under the neural machine translation (NMT) framework. In NV, each pair of parallel documents in different languages are projected to the same shared layer in the model. However, the pair of NV embeddings are not guaranteed to be similar. This paper further adds a distance constraint to the training objective function of NV so that the two embeddings of a parallel document are required to be as close as possible. The new method will be called constrained NV (cNV). In a cross-lingual document classification task, the new cNV performs as well as NV and outperforms other published studies that require forward-pass decoding.
arXiv:1807.11057
cs.CL
NMT-based Cross-lingual Document Embeddings
Authors: Wei Li, Brian Mak
Submitted 19 August, 2020; v1 submitted 29 July, 2018; originally announced July 2018.
In many natural language processing (NLP) tasks, a document is commonly modeled as a bag of words using the term frequency-inverse document frequency (TF-IDF) vector. One major shortcoming of the frequency-based TF-IDF feature vector is that it ignores word orders that carry syntactic and semantic relationships among the words in a document, and they can be important in some NLP tasks such as genre classification. This paper proposes a novel distributed vector representation of a document: a simple recurrent-neural-network language model (RNN-LM) or a long short-term memory RNN language model (LSTM-LM) is first created from all documents in a task; some of the LM parameters are then adapted by each document, and the adapted parameters are vectorized to represent the document. The new document vectors are labeled as DV-RNN and DV-LSTM respectively. We believe that our new document vectors can capture some high-level sequential information in the documents, which other current document representations fail to capture. The new document vectors were evaluated in the genre classification of documents in three corpora: the Brown Corpus, the BNC Baby Corpus and an artificially created Penn Treebank dataset.
Their classification performances are compared with the performance of TF-IDF vector and the state-of-the-art distributed memory model of paragraph vector (PV-DM).
arXiv:1611.00196
cs.CL
Recurrent Neural Network Language Model Adaptation Derived Document Vector
Authors: Wei Li, Brian Kan Wing Mak
Submitted 1 November, 2016; originally announced November 2016.
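The abstract for arXiv:1611.00196 notes that a bag-of-words TF-IDF vector ignores word order. The following is a minimal sketch of that limitation only, not of the paper's DV-RNN or DV-LSTM method. The function name `tf_idf_vectors`, the whitespace tokenizer, the example sentences, and the particular IDF variant (`log(N / df) + 1`) are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one TF-IDF mapping (term -> weight) per document.

    Hypothetical helper for illustration: lowercases, splits on
    whitespace, and uses idf = log(N / df) + 1 (one common variant).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


docs = [
    "the dog bit the man",
    "the man bit the dog",   # same words as above, different order
    "a cat sat on the mat",
]
vectors = tf_idf_vectors(docs)

# Word order is lost: the first two documents get identical vectors
# even though they describe different events.
print(vectors[0] == vectors[1])
print(vectors[0] == vectors[2])
```

The first comparison is true and the second false, because the representation depends only on term counts. A sequence model such as a language model sees the two orderings as different inputs, which is the motivation the abstract gives for deriving document vectors from adapted LM parameters.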