EXAONE 4.5 is LG's first open-weight vision-language model, combining a 1.2B-parameter vision encoder with a 32B-parameter language model for industrial applications. It supports a 256K-token context length and outperforms larger models on mathematical reasoning and document understanding benchmarks, including surpassing…
Core Question
How can an effective open-weight vision-language model be developed for industrial applications by integrating visual processing capabilities into a large language model architecture?
Core Method
The approach trains a 1.2B-parameter vision encoder from scratch and integrates it with the EXAONE 4.0 32B language model using a two-stage pre-training pipeline for cross-modal alignment. The model employs Grouped Query Attention, 2D RoPE for vision encoding, and embeds context extension into supervised fine-tuning to achieve a 256K-token context length. Multi-stage offline preference optimization and reinforcement learning enhance reasoning capabilities across text and vision domains.
Key components
- A 1.2B-parameter vision encoder was trained from scratch to avoid performance degradation from visual token reduction.
- Grouped Query Attention (GQA) is employed in both the vision encoder and the language model for computational efficiency.
- 2D RoPE is used for vision encoding while 1D RoPE is maintained for language processing to optimize cross-modal performance.
- The K-EXAONE tokenizer provides enhanced multilingual support, particularly for Korean language processing.
License terms (listed alongside the key components in the source analysis)
- The license agreement governs use of the EXAONE AI Model between Licensee and LG Management Development Institute Co., Ltd.
- Users agree to the terms by downloading, installing, copying, or using the Model.
- The Agreement constitutes a binding legal contract between Licensee and Licensor.
- Users who do not agree to all terms must not download, install, copy, or use the Model.
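The distinction between 1D RoPE (text) and 2D RoPE (image patches) mentioned above can be illustrated with a minimal sketch. This is not the EXAONE 4.5 implementation: the summary gives no formulation, so the code assumes one common variant in which the first half of a head vector is rotated by the patch's row index and the second half by its column index. The function names `rope_1d`, `rope_2d`, and `dot`, the `base` value of 10000, and the toy vectors are all hypothetical choices for illustration.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by pos * base**(-2i/d).

    This is the standard 1D rotary embedding applied to one head vector.
    """
    d = len(vec)
    assert d % 2 == 0, "vector length must be even"
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Assumed 2D variant: first half encodes the row, second half the column."""
    d = len(vec)
    assert d % 4 == 0, "vector length must be divisible by 4"
    half = d // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)


def dot(a, b):
    """Plain dot product, standing in for an unscaled attention score."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    # Toy query/key vectors (hypothetical values, head dimension 8).
    q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 3.0]
    k = [1.0, 0.2, -0.3, 0.7, -1.2, 0.4, 0.9, -0.6]

    # Both pairs of patches are separated by the same (row, col) offset (2, 3),
    # so the two scores should agree up to floating-point error.
    near = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
    far = dot(rope_2d(q, 13, 25), rope_2d(k, 11, 22))
    print(near, far)
```

The property worth noting is that the score between two rotated vectors depends only on their relative row and column offset, not on absolute patch positions, which is one plausible reason a vision encoder would prefer a 2D scheme while the language model keeps a 1D scheme over token order.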
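Claim Verification
Reproducibility Assessment
Low reproducibility (0%)
Missing reproduction details
- Training hyperparameters (learning rate, batch size, number of epochs, optimizer configuration, learning-rate schedule, etc.)
- Training dataset details (vision encoder training data, VLM training data, data scale and sources)
- Training infrastructure specifications (hardware type, number of GPUs/TPUs, training duration)
- Specific model architecture parameters (number of layers, hidden dimension, number of attention heads, GQA configuration)
- The exact value of the maximum image resolution
- Details of how the vision encoder is connected to the language model
- Image and text preprocessing steps
- Random seed settings
- Training procedure details (whether training is multi-stage, curriculum learning strategy, etc.)
- The specific prompt templates used for each benchmark
Limitations (as stated by the authors)
The paper does not explicitly list any limitations.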
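This analysis was automatically generated by the PDF Reading Assistant. It is for reference only and does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may be biased. For the original paper, see arXiv.
Analysis time: 2026-04-22T07:38:00+00:00 · Data source: Paper Collector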