Commit a9218d6

Authored by XiaohanZhangCMU, xiaohanzhan-db, and mvpatel2000
Validation (#875)
* add validation script
* update
* change token count function
* reorganize cells
* Add unit tests
* Add a printout for CPT
* update question
* Add questions
* Fix lints
* update format
* update
* nb source
* add validation script
* update
* change token count function
* reorganize cells
* Add unit tests
* Add a printout for CPT
* update question
* Add questions
* Fix lints
* update format
* update
* nb source
* Remove license insert for validation notebook
* Add validation utils
* Minor cleanups (#858)
* nits
* logger
* add log
* lint
* update utils/__init__.py to include extra validation functions
* update notebook
* update
* update
* Read UC delta table (#773)
* initial commit
* use databricks-sql to read delta table and convert to json
* update
* update
* update
* add mocked unittest
* Fix lints
* update
* update
* restructure code
* Add timer for optimizing
* Add db-connect
* add wrapper
* update
* add install dbconnect
* update
* update
* patch dbconnect to allow multiple return formats
* update
* add arrow
* use compression
* clean up
* Add cluster rt check
* Fix lints
* remove patch.py for CI
* update
* update
* updat
* update
* fix tests
* fix lint
* update
* update
* Add more tests
* update
* update
* update
* change to download_json
* update
* fix lints
* Add decompressed option for arrow
* format json to jsonl
* Add comments
* Make cf_collect_type global option
* fix comments
* fix lints
* fix comments
* Fix lints
* change to use workspaceclient
* Add CPT support
* Rewire method assignment logic
* Fix bug in stripping https
* Add tests for rewired method assignment logic
* Fix lints
* Fix lints
* Removed logger set_level
* Remove pyspark. It conflicts with databricks-connect
* Update the comment
* skip cluster version check when cluster_id is serverless
* Add use_serverless flag
* update tests with use_serverless flag
* Fix lints

---------
Co-authored-by: Xiaohan Zhang <[email protected]>

* Add download remote function to util
* update
* remove fused layernorm (#859)
* update
* update
* update
* update
* update
* update
* update
* update
* update
* Remove hardcoded combined.jsonl with a flag (#861)
* Remove hardcoded combined.jsonl with a flag
* update
* change output_json_path output_json_folder

---------
Co-authored-by: Xiaohan Zhang <[email protected]>

* bump (#828)
* Add dask and dataframe_to_mds
* update
* update
* update
* update
* Add notebook
* update
* update
* remove script and tests, keep notebook
* update
* update
* update
* update
* Always initialize dist (#864)
* fix dev
* lint
* remove gpu
* updated notebook
* remove scripts keep notebook
* update notebook. rephrase.
* update
* Add response tokens
* update

---------
Co-authored-by: Xiaohan Zhang <[email protected]>
Co-authored-by: xiaohanzhan-db <xiaohanzhan-db>
Co-authored-by: Mihir Patel <[email protected]>
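Several of the helpers this PR ships (e.g. `is_uc_delta_table`, used by the "Read UC delta table (#773)" work) decide behavior based on whether a dataset path names a Unity Catalog Delta table. A minimal sketch of what such a check might look like, assuming the three-level `catalog.schema.table` naming convention; the function body here is illustrative, not the actual llm-foundry implementation:

```python
def is_uc_delta_table_sketch(name: str) -> bool:
    """Illustrative only: Unity Catalog tables use a three-level
    catalog.schema.table namespace, unlike file paths or HF dataset ids."""
    parts = name.split('.')
    # All three segments must be non-empty and must not look like path pieces.
    return len(parts) == 3 and all(p and '/' not in p for p in parts)

print(is_uc_delta_table_sketch('main.default.chat_logs'))  # True
print(is_uc_delta_table_sketch('mosaicml/instruct-v3'))    # False
```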
1 parent: f1fa63c · commit: a9218d6

File tree

4 files changed: +1334 −1300 lines

llmfoundry/data/finetuning/tasks.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -434,7 +434,7 @@ def dataset_mapper(example: Dict):

     detected_cpu_count = os.cpu_count() or 1
     detected_cpus_with_margin = detected_cpu_count - 8
-    num_cpus_to_use = max(1, detected_cpus_with_margin)
+    num_cpus_to_use = detected_cpu_count  # Hack for Valiation instead of max(1, detected_cpus_with_margin)

     columns_to_remove = list(dataset[0].keys())
     tokenized_dataset = dataset.map(
```
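The change pins the tokenization worker count to every detected CPU instead of leaving headroom for other processes. A sketch of the two policies side by side (the function name and `margin` parameter are illustrative, not part of llm-foundry):

```python
import os

def pick_num_cpus(use_all: bool = True, margin: int = 8) -> int:
    # os.cpu_count() may return None, hence the `or 1` fallback.
    detected_cpu_count = os.cpu_count() or 1
    if use_all:
        # The "hack for validation" path from the diff: use every detected CPU.
        return detected_cpu_count
    # The original policy: leave `margin` CPUs free, but never drop below 1.
    return max(1, detected_cpu_count - margin)

print(pick_num_cpus())            # all detected CPUs
print(pick_num_cpus(use_all=False))
```

Note the `max(1, ...)` guard: on machines with 8 or fewer cores the margin policy would otherwise request zero (or negative) workers.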

llmfoundry/utils/__init__.py

Lines changed: 11 additions & 6 deletions
```diff
@@ -13,12 +13,17 @@
     update_batch_size_info)
 from llmfoundry.utils.model_download_utils import (
     download_from_cache_server, download_from_hf_hub)
-
-from llmfoundry.utils.validation_utils import (
-    create_om_cfg, token_counts_and_validation, token_counts,
-    check_HF_datasets, is_hf_dataset_path, is_uc_delta_table,
-    pandas_processing_fn, integrity_check, convert_text_to_mds,
-    parse_args, _args_str, plot_hist, dataframe_to_mds)
+from llmfoundry.utils.validation_utils import (_args_str, check_HF_datasets,
+                                               convert_text_to_mds,
+                                               create_om_cfg,
+                                               dataframe_to_mds,
+                                               integrity_check,
+                                               is_hf_dataset_path,
+                                               is_uc_delta_table,
+                                               pandas_processing_fn,
+                                               parse_args, plot_hist,
+                                               token_counts,
+                                               token_counts_and_validation)

 except ImportError as e:
     raise ImportError(
```
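This import block sits inside a `try:` whose `except ImportError` (visible in the trailing context lines) re-raises with guidance, so a missing optional dependency fails with an actionable message rather than a bare traceback. The general pattern, sketched generically; `import_optional` and the hint wording are illustrative, not llm-foundry API:

```python
import importlib
from types import ModuleType

def import_optional(module: str, hint: str) -> ModuleType:
    """Import `module`, turning a missing optional dependency into an
    ImportError that tells the user how to fix their environment."""
    try:
        return importlib.import_module(module)
    except ImportError as e:
        # Chain with `from e` so the original failure stays in the traceback.
        raise ImportError(f'Missing optional dependency {module!r}; {hint}') from e

json_mod = import_optional('json', 'it ships with the standard library')
try:
    import_optional('not_a_real_package_xyz', 'install the relevant extras')
except ImportError as err:
    print(err)
```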

0 commit comments