|
18 | 18 | "This notebook shows how to train a Qwen 2.5 7B model to play 2048. It will demonstrate how to set up a multi-turn agent, how to train it, and how to evaluate it.\n",
|
19 | 19 | "\n",
|
20 | 20 | "Completions will be logged to OpenPipe, and metrics will be logged to Weights & Biases.\n",
|
21 |
| - "\n", |
22 |
| - "You will learn how to construct an [agentic environment](#Agentic-Environment), how to define a [rollout](#Defining-a-Rollout), and how to run a [training loop](#Training-Loop)." |
| 21 | + "\n ", |
| 22 | + "You will learn how to construct an [agentic environment](#Environment), how to define a [rollout](#Rollout), and how to run a [training loop](#Loop)." |
23 | 23 | ]
|
24 | 24 | },
|
25 | 25 | {
|
|
84 | 84 | },
|
85 | 85 | {
|
86 | 86 | "cell_type": "markdown",
|
87 |
| - "metadata": {}, |
| 87 | + "metadata": { |
| 88 | + "tags": [ |
| 89 | + "environment" |
| 90 | + ] |
| 91 | + }, |
88 | 92 | "source": [
|
89 | 93 | "### Agentic Environment\n",
|
| 94 | + "<a name=\"Environment\"></a>\n", |
90 | 95 | "\n",
|
91 | 96 | "ART allows your agent to learn by interacting with its environment. In this example, we'll create an environment in which the agent can play 2048.\n",
|
92 | 97 | "\n",
|
|
313 | 318 | },
|
314 | 319 | {
|
315 | 320 | "cell_type": "markdown",
|
316 |
| - "metadata": {}, |
| 321 | + "metadata": { |
| 322 | + "tags": [ |
| 323 | + "rollout" |
| 324 | + ] |
| 325 | + }, |
317 | 326 | "source": [
|
318 | 327 | "### Defining a Rollout\n",
|
| 328 | + "<a name=\"Rollout\"></a>\n", |
319 | 329 | "\n",
|
320 | 330 | "A rollout is a single episode of an agent performing its task. It generates one or more trajectories, which are lists of messages and choices.\n",
|
321 | 331 | "\n",
|
|
459 | 469 | },
|
460 | 470 | {
|
461 | 471 | "cell_type": "markdown",
|
462 |
| - "metadata": {}, |
| 472 | + "metadata": { |
| 473 | + "tags": [ |
| 474 | + "loop" |
| 475 | + ] |
| 476 | + }, |
463 | 477 | "source": [
|
| 478 | + "<a name=\"Loop\"></a>\n", |
464 | 479 | "### Training Loop\n",
|
465 | 480 | "\n",
|
466 | 481 | "The training loop is where the magic happens. For each of the 500 iterations defined below, the rollout function will be called 18 times in parallel. This means that 18 games will be played at once. Each game will produce a trajectory, which will be used to update the model.\n",
|
|
503 | 518 | "\n",
|
504 | 519 | "\n",
|
505 | 520 | "Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [GitHub](https://github.com/openpipe/art).\n",
|
506 |
| - "</div>\n", |
507 |
| - "\n", |
508 |
| - "<a href=\"https://art.openpipe.ai/\"><img src=\"https://github.com/openpipe/art/raw/notebooks/assets/Header_separator.png\" height=\"5\"></a></a>" |
| 521 | + "</div>\n" |
509 | 522 | ]
|
510 | 523 | }
|
511 | 524 | ],
|
|
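The "Training Loop" cell touched by this diff describes 500 iterations, each calling the rollout function 18 times in parallel. As a rough illustration of that shape only, here is a minimal `asyncio` sketch. The names `fake_rollout`, `gather_step`, and `train` are hypothetical stand-ins and are not taken from the notebook or from ART's API. The trajectory is a plain dict with a random reward, and the model update is left as a comment.

```python
import asyncio
import random

NUM_ITERATIONS = 500     # iteration count quoted in the notebook text
ROLLOUTS_PER_STEP = 18   # parallel games per iteration quoted in the notebook text


async def fake_rollout(step: int, game_index: int) -> dict:
    """Hypothetical stand-in for the notebook's rollout: one game, one trajectory."""
    await asyncio.sleep(0)  # yield to the event loop, as a real model call would
    return {"step": step, "game": game_index, "reward": random.random()}


async def gather_step(step: int, n: int = ROLLOUTS_PER_STEP) -> list[dict]:
    """Run n rollouts concurrently and return their trajectories in order."""
    return await asyncio.gather(*(fake_rollout(step, i) for i in range(n)))


async def train(num_iterations: int = NUM_ITERATIONS) -> int:
    """Loop over iterations and return the total number of trajectories collected."""
    total = 0
    for step in range(num_iterations):
        trajectories = await gather_step(step)
        # Assumption: a real loop would hand `trajectories` to the trainer here.
        total += len(trajectories)
    return total
```

Running `asyncio.run(train())` in a script plays 500 × 18 placeholder games. Inside a notebook, where an event loop is already running, `await train()` would be used instead.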