Commit 9fd91cf

XiaohanZhangCMU, xiaohanzhan-db, and mvpatel2000 authored
Validation (#1027)
* add validation script
* update
* change token count function
* reorganize cells
* Add unit tests
* Add a printout for CPT
* update question
* Add questions
* Fix lints
* update format
* update
* nb source
* add validation script
* update
* change token count function
* reorganize cells
* Add unit tests
* Add a printout for CPT
* update question
* Add questions
* Fix lints
* update format
* update
* nb source
* Remove license insert for validation notebook
* Add validation utils
* Minor cleanups (#858)
* nits
* logger
* add log
* lint
* update utils/__init__.py to include extra validation functions
* update notebook
* update
* update
* Read UC delta table (#773)
* initial commit
* use databricks-sql to read delta table and convert to json
* update
* update
* update
* add mocked unittest
* Fix lints
* update
* update
* restructure code
* Add timer for optimizing
* Add db-connect
* add wrapper
* update
* add install dbconnect
* update
* update
* patch dbconnect to allow multiple return formats
* update
* add arrow
* use compression
* clean up
* Add cluster rt check
* Fix lints
* remove patch.py for CI
* update
* update
* updat
* update
* fix tests
* fix lint
* update
* update
* Add more tests
* update
* update
* update
* change to download_json
* update
* fix lints
* Add decompressed option for arrow
* format json to jsonl
* Add comments
* Make cf_collect_type global option
* fix comments
* fix lints
* fix comments
* Fix lints
* change to use workspaceclient
* Add CPT support
* Rewire method assignment logic
* Fix bug in stripping https
* Add tests for rewired method assignment logic
* Fix lints
* Fix lints
* Removed logger set_level
* Remove pyspark. It conflicts with databricks-connect
* Update the comment
* skip cluster version check when cluster_id is serverless
* Add use_serverless flag
* update tests with use_serverless flag
* Fix lints
---------
Co-authored-by: Xiaohan Zhang <[email protected]>
* Add download remote function to util
* update
* remove fused layernorm (#859)
* update
* update
* update
* update
* update
* update
* update
* update
* update
* Remove hardcoded combined.jsonl with a flag (#861)
* Remove hardcoded combined.jsonl with a flag
* update
* change output_json_path output_json_folder
---------
Co-authored-by: Xiaohan Zhang <[email protected]>
* bump (#828)
* Add dask and dataframe_to_mds
* update
* update
* update
* update
* Add notebook
* update
* update
* remove script and tests, keep notebook
* update
* update
* update
* update
* Always initialize dist (#864)
* fix dev
* lint
* remove gpu
* updated notebook
* remove scripts keep notebook
* update notebook. rephrase.
* update
* Add response tokens
* update
* update
* Disable MDSWrite, return token counts
* Change plot settings
* update notebook
* update
* update notebook
* update
* update notebook
* update pip install link
* Change done file location
* Create the dest folder
* update notebook
* update
---------
Co-authored-by: Xiaohan Zhang <[email protected]>
Co-authored-by: xiaohanzhan-db <xiaohanzhan-db>
Co-authored-by: Mihir Patel <[email protected]>
1 parent 5090e13 commit 9fd91cf

1 file changed, +163 -21 lines changed


notebooks/validate_and_tokenize_data.ipynb

Lines changed: 163 additions & 21 deletions
@@ -4,7 +4,10 @@
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "f275a21b-47d4-472c-972b-e2a84a597db2",
 "showTitle": false,
@@ -54,7 +57,10 @@
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "3d08a21c-9f5a-4ad2-af85-e016335cc53d",
 "showTitle": false,
@@ -173,6 +179,7 @@
 "import re\n",
 "import json\n",
 "import tempfile\n",
+"import random\n",
 "import numpy as np\n",
 "import pandas as pd \n",
 "from collections import defaultdict\n",
@@ -193,7 +200,10 @@
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "3a513cdd-967d-4a87-b56f-340053fa79cd",
 "showTitle": false,
@@ -208,7 +218,10 @@
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "cfebdfdf-b87c-4a77-b97c-4697566a55fa",
 "showTitle": false,
@@ -255,6 +268,29 @@
 {
 "cell_type": "code",
 "execution_count": null,
+"metadata": {
+"application/vnd.databricks.v1+cell": {
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
+"inputWidgets": {},
+"nuid": "0d1f2e9e-db40-41fd-a6b9-bb4757db08b0",
+"showTitle": false,
+"title": ""
+}
+},
+"outputs": [],
+"source": [
+"# Make sure you have write access to the ``home`` directory\n",
+"home = os.path.join('/local_disk0', 'ift')\n",
+"os.makedirs(home, exist_ok=True)\n",
+"os.chdir(home)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 0,
 "metadata": {
 "application/vnd.databricks.v1+cell": {
 "cellMetadata": {
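The cell added in this hunk moves the notebook's working directory to a scratch location under `/local_disk0` and asks the reader to confirm write access. A minimal sketch of that check with a fallback, assuming the system temp directory is an acceptable substitute when `/local_disk0` is unavailable. The `pick_home` helper and the fallback are hypothetical and not part of the commit.

```python
import os
import tempfile


def pick_home(preferred: str, subdir: str = "ift") -> str:
    """Return a writable working directory, falling back to the system temp dir.

    `pick_home` is a hypothetical helper for illustration, not notebook code.
    """
    for base in (preferred, tempfile.gettempdir()):
        candidate = os.path.join(base, subdir)
        try:
            os.makedirs(candidate, exist_ok=True)
        except OSError:
            continue  # base is missing or read-only; try the next candidate
        if os.access(candidate, os.W_OK):
            return candidate
    raise RuntimeError("no writable working directory found")


home = pick_home("/local_disk0")
print(home)
```

The notebook itself calls `os.chdir(home)` afterwards; the sketch leaves the current directory unchanged so it can be run without side effects beyond creating the folder.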
@@ -271,22 +307,26 @@
 "source": [
 "FT_API_args = Namespace(\n",
 "    model= 'mosaicml/mpt-7b', # Other examples: 'EleutherAI/gpt-neox-20b',\n",
-"    train_data_path= 'main.streaming.random_large_table', # Other examples: 'tatsu-lab/alpaca/train', # '/Volumes/main/mosaic_hackathon/managed-volume/IFT/train.jsonl' # 'mosaicml/dolly_hhrlhf/train'\n",
+"    train_data_path= 'mosaicml/dolly_hhrlhf/train', # Other examples: '/path/to/train.jsonl', 'catalog.schema.table'\n",
 "    task_type='INSTRUCTION_FINETUNE',\n",
 "    training_duration=3,\n",
 "    context_length=2048,\n",
 ")\n",
 "\n",
-"temporary_jsonl_data_path = '/Volumes/main/mosaic_hackathon/managed-volume/IFT/ft_data_11Jan24_3/train'\n",
-"os.environ['HF_DATASETS_CACHE'] = '/tmp/'\n",
-"os.makedirs(temporary_jsonl_data_path, exist_ok=True)"
+"temporary_jsonl_data_path = os.path.join(home, 'ft_data_11Jan24_3/train')\n",
+"os.environ['HF_DATASETS_CACHE'] = os.path.join(home, 'hf_cache')\n",
+"os.makedirs(temporary_jsonl_data_path, exist_ok=True)\n",
+"os.makedirs(os.environ['HF_DATASETS_CACHE'], exist_ok=True)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "39c45005-1a77-4162-b9e4-bd8df6f5ec69",
 "showTitle": false,
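The comments on `train_data_path` in this hunk list three shapes of input: a Hugging Face dataset path such as `'mosaicml/dolly_hhrlhf/train'`, a local `.jsonl` file, and a three-part `catalog.schema.table` name. A rough sketch of how such strings could be told apart follows. The `classify_train_data_path` helper and its rules are assumptions for illustration only; the validation utilities touched by this commit may decide differently. `Namespace` is taken from `argparse` here, which is also an assumption about the notebook's import.

```python
import re
from argparse import Namespace


def classify_train_data_path(path: str) -> str:
    """Guess which kind of source a train_data_path string refers to.

    Hypothetical heuristic; not the logic used by the validation utilities.
    """
    if path.endswith(".jsonl"):
        return "jsonl_file"
    # Three dot-separated identifiers and no slashes, e.g. catalog.schema.table
    if re.fullmatch(r"\w+\.\w+\.\w+", path):
        return "delta_table"
    # Relative '<org>/<dataset>/<split>' is assumed to be a Hugging Face dataset
    if not path.startswith("/") and len(path.split("/")) == 3:
        return "hf_dataset"
    return "unknown"


args = Namespace(
    train_data_path="mosaicml/dolly_hhrlhf/train",
    task_type="INSTRUCTION_FINETUNE",
)
print(classify_train_data_path(args.train_data_path))
```

Checking the `.jsonl` suffix first keeps a relative file path with three components from being mistaken for a dataset name.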
@@ -362,7 +402,10 @@
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "06d46367-bd32-473a-9f16-1b34a8dd9356",
 "showTitle": false,
@@ -377,7 +420,10 @@
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "1a28320a-a2a1-4f3c-a0cd-ad6045a24f64",
 "showTitle": false,
@@ -467,7 +513,10 @@
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "9713a0ce-80f4-4187-b10b-4223b17fe4c1",
 "showTitle": false,
@@ -506,7 +555,10 @@
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "7249e9e6-1ea7-4fc9-8959-8a17d62a9fb4",
 "showTitle": false,
@@ -547,7 +599,10 @@
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "6699f47f-9b53-47da-95c0-b862c5826d0a",
 "showTitle": false,
@@ -562,7 +617,10 @@
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "dd37fdce-62d0-493e-bfa9-d823634b2a0d",
 "showTitle": false,
@@ -624,12 +682,66 @@
 "source": [
 "FT_API_args = Namespace(\n",
 "    model= 'mosaicml/mpt-7b',\n",
-"    train_data_path= '/Volumes/main/mosaic_hackathon/managed-volume/ABT',\n",
+"    train_data_path= os.path.join(home, 'ABT'), # this is the path to your collection of txt files\n",
 "    task_type='CONTINUED_PRETRAIN',\n",
 "    training_duration=3,\n",
-"    context_length=2048,\n",
+"    context_length=8,\n",
 ")\n",
-"temporary_mds_output_path = '/Volumes/main/mosaic_hackathon/managed-volume/{your_username}/mds_data_11Jan24_5'"
+"temporary_mds_output_path = os.path.join(home, 'mds_data_11Jan24_5')"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"application/vnd.databricks.v1+cell": {
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
+"inputWidgets": {},
+"nuid": "fc2e4e8b-7700-47c4-bb21-ae4c389f39a2",
+"showTitle": false,
+"title": ""
+}
+},
+"source": [
+"Generate a synthetic dataset. Replace train_data_path with your raw data path in practice."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 0,
+"metadata": {
+"application/vnd.databricks.v1+cell": {
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
+"inputWidgets": {},
+"nuid": "10f08422-5091-4e64-b3f7-54928584cd60",
+"showTitle": false,
+"title": ""
+}
+},
+"outputs": [],
+"source": [
+"def generate_synthetic_dataset(folder_path, num_files=128):\n",
+"    \"\"\"Generate a synthetic dataset of text files with random words.\"\"\"\n",
+"    def generate_random_words(num_words=50):\n",
+"        words = [\"apple\", \"banana\", \"cherry\", \"date\", \"elderberry\", \"fig\", \"grape\", \"honeydew\", \"kiwi\", \"lemon\", \"mango\", \"nectarine\", \"orange\", \"papaya\", \"quince\", \"raspberry\", \"strawberry\", \"tangerine\", \"ugli\", \"vanilla\", \"watermelon\", \"xigua\", \"yam\", \"zucchini\"]\n",
+"        return ' '.join(random.choice(words) for _ in range(num_words))\n",
+"\n",
+"    if not os.path.exists(folder_path):\n",
+"        os.makedirs(folder_path)\n",
+"    \n",
+"    for i in range(num_files):\n",
+"        file_path = os.path.join(folder_path, f\"file_{i}.txt\")\n",
+"        with open(file_path, 'w') as file:\n",
+"            file.write(generate_random_words())\n",
+"\n",
+"    print(f\"Generated {num_files} files in '{folder_path}'.\")\n",
+"\n",
+"generate_synthetic_dataset(FT_API_args.train_data_path)"
 ]
 },
 {
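The `generate_synthetic_dataset` cell added in this hunk draws words from the global `random` state, so each run writes different files. A small sketch of a reproducible variant, assuming a fixed seed is wanted for testing. The `generate_seeded_dataset` name, the `seed` parameter, and the shortened word list are hypothetical and not part of the commit.

```python
import os
import random
import tempfile

# Shortened word list for the sketch; the notebook uses a longer one.
WORDS = ["apple", "banana", "cherry", "date", "elderberry", "fig"]


def generate_seeded_dataset(folder_path, num_files=4, num_words=50, seed=0):
    """Write num_files text files of random words; the same seed gives the same files."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    os.makedirs(folder_path, exist_ok=True)
    paths = []
    for i in range(num_files):
        file_path = os.path.join(folder_path, f"file_{i}.txt")
        with open(file_path, "w") as f:
            f.write(" ".join(rng.choice(WORDS) for _ in range(num_words)))
        paths.append(file_path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    paths = generate_seeded_dataset(os.path.join(tmp, "ABT"))
    word_counts = []
    for p in paths:
        with open(p) as f:
            word_counts.append(len(f.read().split()))
    print(len(paths), word_counts)
```

Returning the file paths instead of only printing a summary makes the result easy to assert on in a unit test.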
@@ -656,7 +768,10 @@
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "c21e7d1b-db34-4e5d-b6d9-190dc75170d3",
 "showTitle": false,
@@ -688,6 +803,27 @@
 {
 "cell_type": "code",
 "execution_count": null,
+"metadata": {
+"application/vnd.databricks.v1+cell": {
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
+"inputWidgets": {},
+"nuid": "f5aea2a8-db29-40c9-8ed2-b6a1d032e7ab",
+"showTitle": false,
+"title": ""
+}
+},
+"outputs": [],
+"source": [
+"import os\n",
+"os.makedirs(temporary_mds_output_path, exist_ok=True)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 0,
 "metadata": {
 "application/vnd.databricks.v1+cell": {
 "cellMetadata": {
@@ -734,7 +870,10 @@
 "cell_type": "markdown",
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "298eb990-9160-4e1b-958f-33dd2c11b54b",
 "showTitle": false,
@@ -776,7 +915,10 @@
 "execution_count": null,
 "metadata": {
 "application/vnd.databricks.v1+cell": {
-"cellMetadata": {},
+"cellMetadata": {
+"byteLimit": 2048000,
+"rowLimit": 10000
+},
 "inputWidgets": {},
 "nuid": "e123669c-2f77-4d66-93eb-04efd546f39f",
 "showTitle": false,
