Abstract
ʻŌlelo Hawaiʻi remains underrepresented in modern language technologies, lacking essential Natural Language Processing (NLP) tools such as spell-checking and semantic search that depend on understanding word meanings and sentence structure. Addressing this gap requires a system capable of generating computationally meaningful representations of ʻŌlelo Hawaiʻi words and sentences that can support core NLP tasks in low-resource settings.
In response, we developed a benchmarking system designed to evaluate NLP models trained on ʻŌlelo Hawaiʻi text. This system is grounded in two key datasets: a collection of raw ʻŌlelo Hawaiʻi sentences drawn from textbooks, short stories, and other teaching materials, and a curated corpus of part-of-speech-tagged sentences created through human-in-the-loop annotation. We explored both traditional and modern approaches to word representation, including Word2Vec and transformer-based architectures such as Bidirectional Encoder Representations from Transformers (BERT). All models were trained on the raw corpus and evaluated on two downstream tasks: part-of-speech tagging and language modeling.
Model performance was measured using F1 score and accuracy for part-of-speech tagging, and perplexity for models trained with a language modeling objective. These benchmarking results enable direct comparison between traditional and transformer-based approaches. By analyzing model behavior on the downstream tasks, we identify which architectures show the most promise for advancing ʻŌlelo Hawaiʻi NLP capabilities.
Our findings establish a performance baseline and offer guidance for selecting and developing future models that support culturally aligned language technologies, helping to bridge the technological gap and paving the way toward a comprehensive ʻŌlelo Hawaiʻi language model.