@MeouSker77 (Contributor) commented Sep 20, 2024

Description

Add basic support and optimizations for Qwen2-VL.

1. Why the change?

2. User API changes

This model requires transformers 4.45:

pip install transformers==4.45
pip install qwen_vl_utils
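
A quick way to confirm the matching transformers version is active (a minimal sketch; only the version-prefix check below is assumed):

import transformers
# Qwen2-VL support requires transformers 4.45; older versions cannot load this model
assert transformers.__version__.startswith("4.45"), transformers.__version__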

A simple example:

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

import time
import torch
from ipex_llm import optimize_model

model_path = "Qwen/Qwen2-VL-7B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(model_path)

model = optimize_model(model, modules_to_not_convert=["visual"])    # defaults to 'sym_int4' quantization
model = model.float().eval()    # use .float() for better output, use .half() for better speed
# print(model)

model = model.to('xpu')

# Default processor (no pixel limits):
# processor = AutoProcessor.from_pretrained(model_path)

# The model's default range for visual tokens per image is 4-16384. Set min_pixels and
# max_pixels to bound it (e.g. a 256-1280 token range) and balance speed against memory usage.
min_pixels = 256*28*28
max_pixels = 1280*28*28
processor = AutoProcessor.from_pretrained(model_path, min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to('xpu')

with torch.inference_mode():
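    # run generation three times: the first pass typically includes XPU warm-up
    # overhead, so the later timings are more representative of steady-state speed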
    for i in range(3):
        st = time.time()
        # Inference: Generation of the output
        generated_ids = model.generate(**inputs, max_new_tokens=32)
        et = time.time()
        print(et - st)
        generated_ids_trimmed = [
            out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        print(output_text)
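
As the comments in the script note, .half() trades a little output quality for speed. A minimal variant of the setup lines under that trade-off (a sketch; the rest of the script is unchanged, and the default 'sym_int4' quantization still applies):

model = Qwen2VLForConditionalGeneration.from_pretrained(model_path)
model = optimize_model(model, modules_to_not_convert=["visual"])    # language model still quantized to 'sym_int4'
model = model.half().eval()    # fp16 for the non-quantized parts: faster, slightly worse output
model = model.to('xpu')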

3. Summary of the change

4. How to test?

  • N/A
  • Unit test: Please manually trigger the PR Validation here by inputting the PR number (e.g., 1234), and paste your action link here once it has finished successfully.
  • Application test
  • Document test
  • ...

@MeouSker77 MeouSker77 requested a review from rnwang04 September 20, 2024 09:20
@rnwang04 (Contributor) left a comment

LGTM

@MeouSker77 MeouSker77 merged commit 9239fd4 into intel:main Sep 20, 2024
1 check passed
@MeouSker77 MeouSker77 deleted the add-qwen2-optimization branch September 20, 2024 09:23