Merged
Changes from 4 commits
62 changes: 37 additions & 25 deletions docs/mddocs/Quickstart/install_windows_gpu.md
@@ -123,48 +123,51 @@ To monitor your GPU's performance and status (e.g. memory consumption, utilizati

## A Quick Example

Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat) model, a 1.8 billion parameter LLM for this demonstration. Follow the steps below to setup and run the model, and observe how it responds to a prompt "What is AI?".
Now let's play with a real LLM. We'll be using the [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) model, a 1.5 billion parameter LLM, for this demonstration. Follow the steps below to set up and run the model, and observe how it responds to the prompt "What is AI?".

- Step 1: Follow [Runtime Configurations Section](#step-1-runtime-configurations) above to prepare your runtime environment.

- Step 2: Install additional package required for Qwen-1.8B-Chat to conduct:

```cmd
pip install tiktoken transformers_stream_generator einops
```

- Step 3: Create code file. IPEX-LLM supports loading model from Hugging Face or ModelScope. Please choose according to your requirements.
- Step 2: Create a code file. IPEX-LLM supports loading models from Hugging Face or ModelScope. Please choose according to your requirements.

- For **loading model from Hugging Face**:

Create a new file named `demo.py` and insert the code snippet below to run [Qwen-1.8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat) model with IPEX-LLM optimizations.
Create a new file named `demo.py` and insert the code snippet below to run [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) model with IPEX-LLM optimizations.

```python
# Copy/Paste the contents to a new file demo.py
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, GenerationConfig
generation_config = GenerationConfig(use_cache=True)

print('Now start loading Tokenizer and optimizing Model...')
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
trust_remote_code=True)

# Load Model using ipex-llm and load it to GPU
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
load_in_4bit=True,
cpu_embedding=True,
trust_remote_code=True)
model = model.to('xpu')
print('Successfully loaded Tokenizer and optimized Model!')

# Format the prompt
# You can adjust the prompt for your own model; the chat template used here
# follows https://huggingface.co/Qwen/Qwen2-1.5B-Instruct (see the sketch after this step)
question = "What is AI?"
prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": question}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate predicted tokens
with torch.inference_mode():
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
input_ids = tokenizer.encode(text, return_tensors="pt").to('xpu')

print('--------------------------------------Note-----------------------------------------')
print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
@@ -185,7 +188,7 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
do_sample=False,
max_new_tokens=32,
generation_config=generation_config).cpu()
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
output_str = tokenizer.decode(output[0], skip_special_tokens=False)
print(output_str)
```
- For **loading model from ModelScope**:
@@ -195,7 +198,7 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
pip install modelscope==1.11.0
```

Create a new file named `demo.py` and insert the code snippet below to run [Qwen-1.8B-Chat](https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary) model with IPEX-LLM optimizations.
Create a new file named `demo.py` and insert the code snippet below to run [Qwen2-1.5B-Instruct](https://www.modelscope.cn/models/qwen/Qwen2-1.5B-Instruct/summary) model with IPEX-LLM optimizations.

```python

@@ -207,11 +210,11 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
generation_config = GenerationConfig(use_cache=True)

print('Now start loading Tokenizer and optimizing Model...')
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
trust_remote_code=True)

# Load Model using ipex-llm and load it to GPU
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
load_in_4bit=True,
cpu_embedding=True,
trust_remote_code=True,
@@ -220,13 +223,22 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
print('Successfully loaded Tokenizer and optimized Model!')

# Format the prompt
# You can adjust the prompt for your own model; the chat template used here
# follows https://huggingface.co/Qwen/Qwen2-1.5B-Instruct (see the sketch after this step)
question = "What is AI?"
prompt = "user: {prompt}\n\nassistant:".format(prompt=question)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": question}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate predicted tokens
with torch.inference_mode():
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

input_ids = tokenizer.encode(text, return_tensors="pt").to('xpu')
print('--------------------------------------Note-----------------------------------------')
print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |')
@@ -246,7 +258,7 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
do_sample=False,
max_new_tokens=32,
generation_config=generation_config).cpu()
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
output_str = tokenizer.decode(output[0], skip_special_tokens=False)
print(output_str)
```
> **Note**:
@@ -257,7 +269,7 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg
> When running LLMs on Intel iGPUs with limited memory size, we recommend setting `cpu_embedding=True` in the `from_pretrained` function.
> This will allow the memory-intensive embedding layer to utilize the CPU instead of GPU.
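
Both snippets above build the prompt with `tokenizer.apply_chat_template` instead of a hand-written `user:`/`assistant:` string. If you are curious what text the template actually produces, the standalone sketch below prints it. It is only an inspection aid, not part of `demo.py`; the template itself ships with the Qwen2-1.5B-Instruct tokenizer, so the commented output is approximate.

```python
# Standalone sketch: print the prompt text generated by the Qwen2 chat template.
# The template is defined by the tokenizer itself, so the commented output below
# is only an approximation of what you should see.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is AI?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# Expected to look roughly like:
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# What is AI?<|im_end|>
# <|im_start|>assistant
```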

- Step 4. Run `demo.py` within the activated Python environment using the following command:
- Step 3. Run `demo.py` within the activated Python environment using the following command:

```cmd
python demo.py
@@ -269,7 +281,7 @@ Example output on a system equipped with an Intel Core Ultra 5 125H CPU and Inte
```
user: What is AI?

assistant: AI stands for Artificial Intelligence, which refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition,
assistant: AI, or artificial intelligence, refers to the simulation of human intelligence in machines that are programmed to think and act like humans. It involves the development of algorithms,
```
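
A practical follow-up to the example above: `from_pretrained(..., load_in_4bit=True)` repeats the 4-bit conversion every time `demo.py` starts. If your installed ipex-llm version provides the low-bit save/load helpers, you can convert once and reload the converted weights on later runs. Below is a minimal sketch, assuming `save_low_bit`/`load_low_bit` are available and using an illustrative local path.

```python
# Sketch only: convert the model to 4-bit once, save it, and reload it on later runs.
# Assumes save_low_bit/load_low_bit exist in your installed ipex-llm version;
# "./qwen2-1.5b-int4" is just an illustrative local directory.
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

save_dir = "./qwen2-1.5b-int4"

# One-time conversion and save
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model.save_low_bit(save_dir)
AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
                              trust_remote_code=True).save_pretrained(save_dir)

# Later runs: skip the conversion and load the saved low-bit weights directly
model = AutoModelForCausalLM.load_low_bit(save_dir, trust_remote_code=True).to('xpu')
tokenizer = AutoTokenizer.from_pretrained(save_dir, trust_remote_code=True)
```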

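If `demo.py` reports that no XPU device is found, a quick sanity check in the same Python environment can confirm whether PyTorch sees the Intel GPU at all before you dig into the tips below. This sketch assumes `intel_extension_for_pytorch` (installed alongside ipex-llm's XPU option) exposes the `xpu` backend:

```python
# Sanity-check sketch: confirm the Intel GPU is visible to PyTorch.
# Assumes intel_extension_for_pytorch with XPU support is installed in this environment.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (importing registers the 'xpu' backend)

print("XPU available:", torch.xpu.is_available())
if torch.xpu.is_available():
    print("Device name:", torch.xpu.get_device_name(0))
```
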
## Tips & Troubleshooting