caslabs · research note

Developing Encoder-Based Language Models for Future Technology to Support ʻŌlelo Hawaiʻi

Project code: haw-lm

Abstract

ʻŌlelo Hawaiʻi remains underrepresented in modern language technologies, lacking essential Natural Language Processing (NLP) tools such as spell-checking and semantic search that depend on understanding word meanings and sentence structure. Addressing this gap requires a system capable of generating computationally meaningful representations of ʻŌlelo Hawaiʻi words and sentences that can support core NLP tasks in low-resource settings.

In response, we developed a benchmarking system designed to evaluate NLP models trained on ʻŌlelo Hawaiʻi text. This system is grounded in two key datasets: a collection of raw ʻŌlelo Hawaiʻi sentences drawn from textbooks, short stories, and other teaching materials, and a curated corpus of part-of-speech-tagged sentences created through human-in-the-loop annotation. We explored both traditional and modern approaches to word representation, including Word2Vec and transformer-based architectures such as Bidirectional Encoder Representations from Transformers (BERT). All models were trained on the raw corpus and evaluated on downstream part-of-speech tagging and language modeling tasks.
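
As a concrete illustration of the traditional baseline, the sketch below trains a skip-gram Word2Vec model on a raw sentence corpus with gensim. The file name, whitespace tokenization, and hyperparameters are all illustrative assumptions, not the setup used in this study.

    from gensim.models import Word2Vec

    # Load the raw sentence corpus; one sentence per line and the file name
    # "raw_corpus.txt" are assumptions for this sketch.
    with open("raw_corpus.txt", encoding="utf-8") as f:
        # Naive whitespace tokenization; .split() at least keeps the
        # ʻokina (ʻ) attached to its word, which any real pipeline must preserve.
        sentences = [line.split() for line in f if line.strip()]

    # Skip-gram Word2Vec (gensim 4.x API); hyperparameters are illustrative.
    model = Word2Vec(
        sentences=sentences,
        vector_size=100,  # embedding dimensionality
        window=5,         # context window size
        min_count=2,      # drop tokens seen fewer than twice
        sg=1,             # 1 = skip-gram, 0 = CBOW
        epochs=10,
    )

    # Nearest neighbors in embedding space give a quick qualitative check
    # (raises KeyError if the word never made it into the vocabulary).
    print(model.wv.most_similar("aloha", topn=5))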

Model performance was evaluated using accuracy and macro-averaged F1 for part-of-speech tagging, and perplexity for models trained with a language modeling objective. These benchmarking results enable direct comparison between traditional and transformer-based approaches. By analyzing model behavior on the downstream tasks, we identify which architectures demonstrate the most promise for advancing ʻŌlelo Hawaiʻi NLP capabilities.
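
For reference, these metrics reduce to a few standard library calls. The sketch below uses scikit-learn for accuracy and macro-F1 over token-level tags, and computes perplexity as the exponential of the mean per-token negative log-likelihood; all values are toy numbers, not results from the study.

    import math
    from sklearn.metrics import accuracy_score, f1_score

    # Token-level gold tags vs. predictions (toy values for illustration).
    gold = ["NOUN", "VERB", "DET", "NOUN"]
    pred = ["NOUN", "VERB", "DET", "VERB"]

    accuracy = accuracy_score(gold, pred)                # fraction of tokens correct
    macro_f1 = f1_score(gold, pred, average="macro")     # unweighted mean of per-tag F1

    # Perplexity is the exponential of the mean per-token negative
    # log-likelihood; 2.9 nats here is a made-up cross-entropy loss.
    perplexity = math.exp(2.9)

    print(f"accuracy={accuracy:.2f}  macro-F1={macro_f1:.2f}  ppl={perplexity:.1f}")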

Our findings establish a performance baseline and offer guidance for selecting and developing future models that support culturally aligned language technologies, helping to bridge the technological gap and pave the way toward a comprehensive ʻŌlelo Hawaiʻi language model.

Keywords

ʻŌlelo Hawaiʻi, NLP, encoder-based models, Word2Vec, BERT, transformers, part-of-speech tagging, language modeling, low-resource, benchmarking.

Datasets

  • Raw sentence corpus (textbooks, stories, teaching materials)
  • Curated POS-tagged corpus (human-in-the-loop)

Methods

  • Traditional embeddings: Word2Vec
  • Transformers: BERT (encoder-only; see the configuration sketch after this list)
  • Tasks: POS tagging, language modeling
  • Metrics: Accuracy, Macro-F1, Perplexity
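
To make the encoder-only setup concrete, the sketch below configures a small BERT for masked-language-model pretraining from scratch with the Hugging Face transformers library. Every size here (vocabulary, hidden width, depth) is an illustrative guess for a low-resource setting, not the configuration used in this work.

    from transformers import BertConfig, BertForMaskedLM

    # A small encoder-only BERT built from scratch for masked-LM pretraining.
    config = BertConfig(
        vocab_size=8000,              # set by the tokenizer trained on the corpus
        hidden_size=256,
        num_hidden_layers=4,
        num_attention_heads=4,
        intermediate_size=1024,
        max_position_embeddings=128,  # maximum sequence length
    )
    model = BertForMaskedLM(config)

    # Sanity check: a model this size stays in the single-digit millions of
    # parameters, which is plausible to train on a small corpus.
    print(f"parameters: {model.num_parameters():,}")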

Contact