Beijing, China – Chinese tech company Alibaba has announced the launch of Qwen2.5-VL, an upgraded version of its earlier vision-language model, Qwen2-VL.
According to the company, the multimodal model is open source, available in sizes ranging from 3 billion to 72 billion parameters, and includes both base and instruction-tuned variants.
The Qwen2.5-VL-72B-Instruct model is also accessible on the Qwen Chat platform, while the entire Qwen2.5-VL series is hosted on Hugging Face and Alibaba’s ModelScope.
In terms of capabilities, Qwen2.5-VL can interpret complex visual elements, including texts, diagrams, charts, graphics, and layouts within images. It can also understand videos longer than an hour and answer video-related questions while accurately pinpointing the relevant segments down to the exact second.
In addition, the model can produce structured outputs, such as JSON, enabling the automatic extraction and organisation of data from invoices, forms, and tables. This capability streamlines processes in the finance and legal sectors.
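For illustration, the following minimal sketch shows how such a JSON-extraction prompt might look when the model is loaded through the Hugging Face transformers integration published with the Qwen2.5-VL series; the invoice file, the requested fields, and the prompt wording are hypothetical placeholders rather than part of Alibaba’s announcement.

```python
# Minimal sketch: prompting Qwen2.5-VL for JSON output from an invoice image.
# Assumes the Hugging Face transformers integration of Qwen2.5-VL and the
# qwen-vl-utils helper package; "invoice.png" and the field names are hypothetical.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "invoice.png"},
        {"type": "text", "text": "Extract the invoice number, date, and total "
                                 "amount. Respond with a JSON object only."},
    ],
}]

# Build the chat prompt and collect the image inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens (the model's JSON answer).
output_ids = model.generate(**inputs, max_new_tokens=256)
answer_ids = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```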
Meanwhile, Qwen2.5-VL can also function as a visual agent that executes tasks on computers and mobile devices, such as checking the weather or booking flights, by reasoning about what is on screen and directing the appropriate tools.
In particular, the flagship model Qwen2.5-VL-72B-Instruct has been evaluated on a series of benchmarks covering domains and tasks including document and diagram reading, general visual question answering, college-level math, video understanding, and visual agent tasks.
To this end, researchers improved the model’s multimodal capabilities by implementing dynamic resolution and frame-rate training for enhanced video understanding. They also introduced a streamlined visual encoder that integrates Window Attention into a dynamic Vision Transformer (ViT) framework to accelerate both training and inference.
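To give a rough sense of what window attention means in this context, the sketch below restricts self-attention to local blocks of a token grid instead of the full image; it is a simplified, generic illustration of the technique, not Alibaba’s actual encoder code, and all dimensions are illustrative.

```python
# Illustrative sketch of windowed self-attention over a grid of vision tokens.
# Attention is computed inside each non-overlapping window rather than across
# the full token grid, so cost scales with window size, not image size.
# This is a generic illustration, not Qwen2.5-VL's actual encoder implementation.
import torch
import torch.nn.functional as F

def window_attention(tokens, window=8, num_heads=4):
    """tokens: (H, W, C) grid of patch embeddings; H and W divisible by window."""
    H, W, C = tokens.shape
    head_dim = C // num_heads
    # Partition the grid into windows: (num_windows, window*window, C).
    x = tokens.reshape(H // window, window, W // window, window, C)
    x = x.permute(0, 2, 1, 3, 4).reshape(-1, window * window, C)
    # Q/K/V projections omitted for brevity; attend over the raw window tokens.
    q = k = v = x.reshape(x.shape[0], x.shape[1], num_heads, head_dim).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2).reshape(x.shape[0], x.shape[1], C)

# Example: a 32x32 grid of 64-dim patch tokens -> 16 windows of 64 tokens each.
print(window_attention(torch.randn(32, 32, 64)).shape)  # torch.Size([16, 64, 64])
```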
These innovations position the model as a versatile solution for diverse multimodal applications across various fields.
Apart from these developments, Alibaba has also launched the latest version of the Qwen large language model, known as Qwen2.5-1M. This open-source iteration is distinguished by its long-context capability, handling inputs of up to 1 million tokens.
Included in the release are two instruction-tuned models, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, with 7 billion and 14 billion parameters respectively. These models have been made available on Hugging Face.
The company has also released on GitHub a corresponding inference framework optimised for long-context processing. This framework is tailored to help developers deploy the Qwen2.5-1M series more cost-effectively.
By leveraging techniques such as length extrapolation and sparse attention, the framework can process 1-million-token inputs at speeds 3 to 7 times faster than traditional approaches, offering a practical foundation for applications that require efficient long-context processing.
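In practice, an application built on such a deployment might simply send the full document alongside a question, as in the sketch below; this assumes the served model exposes an OpenAI-compatible endpoint, and the local URL, file name, and prompt are placeholders rather than details from the announcement.

```python
# Sketch of querying a locally served Qwen2.5-1M model over an OpenAI-compatible
# API. Assumes the long-context inference framework is already serving the model
# at this placeholder URL; adjust base_url and the model name for your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Load a very long document (potentially hundreds of thousands of tokens)
# and ask a question that requires reading the whole context.
with open("annual_report.txt", "r", encoding="utf-8") as f:
    long_document = f.read()

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct-1M",
    messages=[
        {"role": "user",
         "content": long_document + "\n\nSummarise the key risk factors mentioned above."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```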
Recently, Alibaba also announced Qwen2.5-Max, a next-generation AI model that the company claims surpasses several leading AI systems on key performance benchmarks. This latest model is now accessible to developers via Alibaba Cloud services and Alibaba’s conversational AI platform, Qwen Chat.