Research

Multilingual Multimodal Multiagent Systems for Localization

Published By
Aziz Ulak
Sercan Arik

Abstract

Localization involves adapting content, products, services, and user experiences to resonate with target cultural, legal, and business nuances, ensuring relevance and authenticity across diverse markets and use cases. This includes adjusting multimedia content, messaging, product features, and even interfaces and strategies to meet regional preferences, compliance requirements, and consumer behaviors. Ollang tackles this critical challenge with a multiagent system specifically designed to handle multimodal and multilingual modeling complexities. Ollang's multiagent system-based solution reduced translation errors by >60%. We provide an overview of the design and implementation of Ollang's agentic human-grade localization framework. We explain the key characteristics of agents in the system, value propositions, and customization granularity, and present benchmark results demonstrating significant improvements over strong foundation-model baselines.

Introduction

In today's globalized world, the demand for accurate and culturally appropriate localization of multimedia content, especially for text, video, and audio modalities, has never been higher. Ollang addresses this need with its multiagent architecture, designed to deliver human-grade foreign-language subtitles, documents, and audio, even in challenging scenarios involving noise, complex reasoning, poor pronunciation, or unique contextual elements.

Multiagent Systems

An agent in a multiagent system broadly refers to a modular, autonomous unit within a larger architecture that leverages generative AI capabilities to perform specific, domain-oriented tasks. It is characterized by its ability to dynamically process inputs, call functions, and interact with other components or agents within the system. Additionally, an agent can maintain its own memory to store contextual information, adapt to evolving scenarios, and provide continuity across tasks.

Key Characteristics of an Agent

  • Utilizing Generative AI: The agent utilizes generative AI models (e.g., LLMs, diffusion models) for understanding, generating, or transforming content.
  • Task-Specific Functionality: Each agent is designed to solve a well-defined problem or contribute a specific capability, such as analyzing video content, detecting differences between the same content in two different languages, adapting content for cultural nuances, aligning text sentiment with a target audience, etc.
  • Function Calling: Agents have access to external functions (e.g., APIs, libraries, databases) to enhance their processing, enabling actions like fetching external data, performing web search, invoking speech or video generation systems, or looking up information from databases.
  • Memory: Agents maintain memory to ensure consistency and coherence across interactions. This allows the agent to track context, user preferences, or intermediate steps.
  • Autonomy and Coordination: Agents can collaborate with other agents, forming a network of specialized components that collectively solve complex problems (a minimal agent sketch follows this list).
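To make these characteristics concrete, below is a minimal sketch of such an agent abstraction in Python. The names (Agent, Tool, run) and the simple "CALL tool: argument" convention are illustrative assumptions, not Ollang's actual interfaces; the point is only how a generative backend, a task-specific role, function calling, and memory fit together.

```python
# Minimal sketch of an agent with the characteristics listed above:
# a generative backend, a task-specific role, function calling, and memory.
# These names and the "CALL tool: argument" convention are illustrative only.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Tool:
    """An external function the agent may call (API, database lookup, etc.)."""
    name: str
    description: str
    fn: Callable[[str], str]


@dataclass
class Agent:
    role: str                                        # task-specific functionality
    llm: Callable[[str], str]                        # generative model backend
    tools: Dict[str, Tool] = field(default_factory=dict)
    memory: List[str] = field(default_factory=list)  # context kept across tasks

    def run(self, task: str) -> str:
        # Build a prompt from the agent's role, recent memory, and the new task.
        context = "\n".join(self.memory[-5:])
        prompt = f"Role: {self.role}\nContext:\n{context}\nTask: {task}"
        output = self.llm(prompt)

        # Toy function-calling convention: the model answers "CALL <tool>: <arg>".
        if output.startswith("CALL "):
            tool_name, _, argument = output[5:].partition(": ")
            if tool_name in self.tools:
                output = self.tools[tool_name].fn(argument)

        self.memory.append(f"{task} -> {output}")    # persist for continuity
        return output
```

In a production system each agent would wrap a real model endpoint and a richer tool-calling protocol, but the separation of role, tools, and memory stays the same.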

Example Agents for Multimedia Localization

  1. Video Understanding Agent: Analyzes video content by performing scene segmentation, object detection, and speech-to-text transcription using computer vision APIs, AI transcription, and memory systems to track recurring elements across videos.
  2. Cultural Adaptation Agent: Modifies content to match target audience cultural norms by leveraging translation models, cultural databases, and sentiment analysis to prevent cultural misunderstandings while adjusting language and imagery accordingly.
  3. Sentiment Alignment Agent: Adapts content's emotional tone using fine-tuned LLM outputs and sentiment analysis APIs, maintaining consistent tone across communications through memory-based tracking.
  4. Data Transformation Agent: Converts and structures data between formats using parsing libraries and AI models, while maintaining consistent data mapping rules through memory systems.
  5. Multimodal Judge Agent: Ensures output fidelity to source content by evaluating across modalities using generative AI and specialized evaluation functions, storing validation strategies for consistent assessment.
  6. Customization Agent: Optimizes system performance by dynamically adjusting workflows and parameters to meet varying customer needs and task-specific requirements.
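As a rough illustration of how such agents could be chained for a single subtitle localization pass, here is a hypothetical orchestration sketch; the sequencing, agent keys, and prompts are assumptions for illustration, not the production pipeline.

```python
# Hypothetical orchestration of the agents above for one subtitle batch.
# The sequencing, agent keys, and prompts are assumptions for illustration.

def localize_subtitles(video_path: str, target_language: str, agents: dict) -> str:
    # 1. Understand the source: scenes, speakers, and a raw transcript.
    transcript = agents["video_understanding"].run(
        f"Transcribe and segment {video_path}")

    # 2. Adapt wording and references for the target culture.
    adapted = agents["cultural_adaptation"].run(
        f"Adapt for a {target_language} audience:\n{transcript}")

    # 3. Align the emotional tone with the source material.
    toned = agents["sentiment_alignment"].run(
        f"Match the tone of the source:\n{adapted}")

    # 4. Convert to the delivery format (e.g., SRT cues).
    subtitles = agents["data_transformation"].run(
        f"Format as SRT:\n{toned}")

    # 5. Judge fidelity across modalities; revise once if the judge rejects.
    verdict = agents["multimodal_judge"].run(
        f"Compare against {video_path}:\n{subtitles}")
    if verdict.startswith("REJECT"):
        subtitles = agents["customization"].run(
            f"Revise per this feedback:\n{verdict}\n{subtitles}")
    return subtitles
```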

Value Propositions of Agentic Systems

  • Enhanced Capabilities through Modularization: By breaking down complex problems into smaller, manageable tasks with specific prompts, multimodal LLMs can operate more effectively, yielding higher accuracy and robustness. This enables more effective handling of long-context modeling and complex reasoning challenges, which are among the key bottlenecks of modern multimodal LLMs.
  • Multimodal Modeling: Agentic systems excel at processing and integrating different modalities such as documents, videos, and audio, leading to a more comprehensive understanding. The strengths of different models can be effectively utilized, as some are superior at understanding or generating particular modalities.
  • Feedback Loop Mechanisms: Incorporating critic, correction, and error detection agents creates a feedback loop that enhances overall system performance by identifying and mitigating issues early. This in effect mimics how human-in-the-loop systems operate on top of standard multimodal LLM usage (a sketch of such a loop follows this list).
  • Dynamic Routing: A routing system enables the selection of the best available models for specific tasks and inputs, optimizing performance across different languages and modalities. This yields a superior accuracy-cost-latency tradeoff for users.
  • Modularity and Control: Agentic systems offer a modular, interpretable, and controllable architecture, allowing for easier detection and resolution of issues and facilitating customization and scaling. Users can see what each agent is doing and can visualize and build insights from intermediate processed outputs.
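Below is a minimal sketch of the critic/correction feedback loop mentioned above, assuming a generator agent and a critic agent that expose the simple run() interface sketched earlier; the "OK"/revision convention is a placeholder for a real acceptance criterion.

```python
# Sketch of a critic/correction feedback loop. "generator" and "critic" are
# assumed to expose the simple run() interface sketched earlier; the "OK"
# convention is a placeholder for a real acceptance criterion.
MAX_ROUNDS = 3

def generate_with_feedback(generator, critic, task: str) -> str:
    draft = generator.run(task)
    for _ in range(MAX_ROUNDS):
        review = critic.run(f"Find errors in this output for '{task}':\n{draft}")
        if review.strip().upper() == "OK":
            break  # the critic found no remaining issues
        draft = generator.run(f"Revise to address this feedback:\n{review}\n{draft}")
    return draft
```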

Obtaining State-of-the-Art Multilingual Multimodal Multiagent Systems

  • Multiagent Topology and Design: Choosing which agents to employ and adjusting the structure and interactions between them to optimize for specific use cases.
  • Prompts of Agents: Customizing the instructions and examples provided to agents to refine their outputs, with automated optimization processes that use performance on target tasks.
  • Decoding with Multipath Reasoning: Including mechanisms specifically designed to improve reasoning, along with mechanisms for aggregating effective solution paths.
  • Routing Model: Implementing or tuning models (e.g., fine-tuned from a small LLM) that direct tasks to the most appropriate agents or models (a routing sketch follows this list).
  • LLM Customization: Modifying the weights of language models, particularly those strong at translation, to enhance performance for specific languages or domains while reflecting the agentic design.
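As a simple illustration of the routing idea, the sketch below maps (task, language) pairs to backend models and falls back to a small classifier call; the route table, model names, and fallback rule are placeholders, not Ollang's actual routing policy.

```python
# Sketch of a lightweight router mapping (task, language) pairs to backend
# models, with a small classifier call as a stand-in for a fine-tuned routing
# LLM. The route table and model names are placeholders, not an actual policy.
from typing import Callable, Dict

ROUTES: Dict[str, str] = {
    "subtitle_translation:zh": "strong-multilingual-model",
    "subtitle_translation:de": "fast-general-model",
    "audio_transcription:*": "speech-specialist-model",
}

def route(task_type: str, language: str, classify: Callable[[str], str]) -> str:
    """Return the backend model name for a (task, language) pair."""
    key = f"{task_type}:{language}"
    if key in ROUTES:
        return ROUTES[key]            # exact (task, language) match wins
    wildcard = f"{task_type}:*"
    if wildcard in ROUTES:
        return ROUTES[wildcard]       # task-level default
    # Otherwise fall back to asking the small routing model directly.
    return classify(f"Best model for task={task_type}, language={language}")
```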

Accuracy Results

The table below presents accuracy results for subtitle localization across different systems, evaluated on three key metrics:

  • Localization Accuracy assesses the correctness of translated content compared to the reference.
  • Subtitle Segmentation examines the readability and synchronization of subtitles by analyzing their division.
  • Formatting evaluates adherence to Netflix's formatting standards, including line lengths and the number of lines (a sketch of such a check appears after the table).

Together, these metrics provide a comprehensive evaluation of the systems' performance.

Language            | Ollang Multiagent System (Ours) | GPT-4o | Gemini 1.5 Pro
Chinese (Mandarin)  | 92.9                            | 74.1   | 76.2
German              | 98.9                            | 87.9   | 91.1

We evaluated subtitle translation workflows using services built on foundational models, focusing on performance in specific target languages. Gemini 1.5 Pro and GPT-4o were used as benchmarks due to their advanced capabilities and strong representation of cutting-edge AI models.
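As one concrete example of how the Formatting metric can be operationalized, the sketch below checks subtitle events against the commonly cited Netflix limits of 42 characters per line and at most two lines per event; the exact thresholds and scoring used in our evaluation may differ by language.

```python
# Sketch of a formatting check in the spirit of the Formatting metric above,
# assuming the commonly cited Netflix limits of 42 characters per line and at
# most two lines per subtitle event; actual thresholds may differ by language.
MAX_CHARS_PER_LINE = 42
MAX_LINES_PER_EVENT = 2

def formatting_score(events: list[list[str]]) -> float:
    """Fraction of subtitle events whose lines satisfy both limits."""
    if not events:
        return 0.0
    compliant = sum(
        1 for lines in events
        if len(lines) <= MAX_LINES_PER_EVENT
        and all(len(line) <= MAX_CHARS_PER_LINE for line in lines)
    )
    return compliant / len(events)

# Example: one compliant event and one with too many lines -> 0.5
print(formatting_score([["Hello there.", "How are you?"],
                        ["Line one", "Line two", "Line three"]]))
```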

Conclusion

While localization serves as a compelling initial demonstration of Ollang's multiagent system's capabilities, it represents only a fraction of its potential. Many further possibilities open up by expanding into use cases such as intelligent content search and moderation, answering questions about business insights, and altering multimodal content with human control. The true power lies in the underlying agentic framework, which offers several key value propositions.

By modularizing complex tasks into smaller, manageable units with specific prompts, our system achieves enhanced capabilities, yielding higher accuracy and robustness, particularly in handling long context and complex reasoning, which are critical bottlenecks for current multimodal LLMs. This modularity also enables effective multimodal modeling, seamlessly integrating diverse data streams such as documents, videos, and audio while leveraging the strengths of specialized models for each modality. Furthermore, incorporating feedback loop mechanisms with critic, correction, and error detection agents creates a dynamic system that continuously learns and improves, mirroring the benefits of human-in-the-loop processes. Dynamic routing optimizes performance across languages and modalities by selecting the most suitable model (among hundreds of candidates on the market) for each task, offering a superior accuracy-cost-latency tradeoff. Ultimately, this modular, interpretable, and controllable architecture simplifies issue detection, resolution, customization, and scaling, providing users with unprecedented insight into intermediate outputs.

The success we are already seeing with localization is just the beginning. Ollang's agentic systems will revolutionize how enterprises operate by offering a powerful, adaptable framework for a wide range of complex challenges, with the potential to unlock significant efficiencies and innovation across diverse applications to help enterprises reach their global addressable markets.

Appendix

Appendix 1:

Eval agent configuration:

LLM: claude-3-5-sonnet-20241022 (latest)

Temperature: 0.0
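For reference, a sketch of how this configuration could be invoked through the Anthropic Python SDK is shown below; the evaluation prompt is a placeholder, not the rubric used in our benchmark.

```python
# Sketch of invoking the eval agent configuration above via the Anthropic
# Python SDK; the evaluation prompt is a placeholder, not the benchmark rubric.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    temperature=0.0,  # deterministic scoring, per the configuration above
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Score this subtitle translation against the reference on "
                   "localization accuracy, segmentation, and formatting: ...",
    }],
)
print(response.content[0].text)
```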

Appendix 2: Detailed Analysis of Eval Results

MultiAgent EVAL - Raw Results