Hello All,
I am testing the pipeline on TrackML data, following the Quick Start guide. After some effort I was able to run the pipeline partially. Here are some observations:

- The default `pipeline_test.yaml` is missing, so I used `pipeline_quickstart.yaml` instead (`$ traintrack configs/pipeline_quickstart.yaml`).
- Wandb has to be installed explicitly.
The Processing stage was successful, but I am having issues with the Embedding stage. I have the CPU-only build of PyTorch, and during validation `build_edges()` is called; as I understand it, FAISS should be used on CPU and FRNN on GPU. Yet even on CPU, FRNN is imported, and it was never installed in the first place (it is a CUDA-only library).
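For context, since FRNN is CUDA-only, I would have expected `build_edges()` in utils.py to guard its import behind a device check, roughly like this (my assumption of the intended behaviour, not the repository's actual code; the branch bodies are placeholders):

```python
# Sketch only: the device guard I would have expected in utils.py's
# build_edges(). The signature matches the call in embedding_base.py;
# the branch bodies are placeholders, not the repository's code.
import torch

def build_edges(query, database, indices=None, r_max=None, k_max=None):
    if torch.cuda.is_available():
        import frnn  # CUDA-only library: import it only when a GPU is present
        ...  # GPU path: FRNN fixed-radius nearest-neighbour search
    else:
        ...  # CPU path: FAISS-based search (sketched at the end of this post)
```

The full error output: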
```
Running from top with args: ['/home/adeel/anaconda/envs/exatrkx-test/bin/traintrack', 'configs/pipeline_quickstart.yaml']
run_pipeline.start()...
Namespace(inference=False, pipeline_config='configs/pipeline_quickstart.yaml', run_stage=False, slurm=False, verbose=False)
current stage: {'set': 'Embedding', 'name': 'LayerlessEmbedding', 'config': 'train_quickstart_embedding.yaml', 'batch_config': 'batch_gpu_default.yaml', 'resume_id': None}
Running stage, with args, and model library: LightningModules
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
wandb: Currently logged in as: aakram (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.12.10
wandb: Syncing run dashing-armadillo-5
wandb: ⭐️ View project at https://wandb.ai/aakram/CodaEmbeddingStudy
wandb: 🚀 View run at https://wandb.ai/aakram/CodaEmbeddingStudy/runs/24kbmduf
wandb: Run data is saved locally in /home/adeel/current/data_sets/exatrkx-test/lightning_models/lightning_checkpoints/wandb/run-20220211_205202-24kbmduf
wandb: Run `wandb offline` to turn off syncing.
| Name | Type | Params
---------------------------------------
0 | network | Sequential | 3.2 M
---------------------------------------
3.2 M Trainable params
0 Non-trainable params
3.2 M Total params
12.726 Total estimated model params size (MB)
Validation sanity check: 0it [00:00, ?it/s]/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/torch_geometric/deprecation.py:13: UserWarning: 'data.DataLoader' is deprecated, use 'loader.DataLoader' instead
warnings.warn(out)
/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:132: UserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/adeel/anaconda/envs/exatrkx-test/bin/traintrack", line 8, in <module>
sys.exit(main())
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/traintrack/command_line_pipe.py", line 71, in main
run_pipeline.start(run_args)
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/traintrack/run_pipeline.py", line 152, in start
run_stage(**config)
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/traintrack/utils/data_utils.py", line 51, in wrapped
return dFxn(*cp, **dp)
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/traintrack/run_pipeline.py", line 81, in run_stage
train_stage(model, model_config)
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/traintrack/run_pipeline.py", line 96, in train_stage
trainer.fit(model)
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
self._call_and_handle_interrupt(
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1311, in _run_train
self._run_sanity_check(self.lightning_module)
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1375, in _run_sanity_check
self._evaluation_loop.run()
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
output = self._evaluation_step(batch, batch_idx, dataloader_idx)
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 217, in _evaluation_step
output = self.trainer.accelerator.validation_step(step_kwargs)
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 239, in validation_step
return self.training_type_plugin.validation_step(*step_kwargs.values())
File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 219, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/home/adeel/current/5_exatrkx/exatrkx-hsf/Pipelines/TrackML_Example/LightningModules/Embedding/embedding_base.py", line 301, in validation_step
outputs = self.shared_evaluation(
File "/home/adeel/current/5_exatrkx/exatrkx-hsf/Pipelines/TrackML_Example/LightningModules/Embedding/embedding_base.py", line 257, in shared_evaluation
e_spatial = build_edges(spatial, spatial, indices=None, r_max=knn_radius, k_max=knn_num)
File "/home/adeel/current/5_exatrkx/exatrkx-hsf/Pipelines/TrackML_Example/LightningModules/Embedding/utils.py", line 173, in build_edges
import frnn
ModuleNotFoundError: No module named 'frnn'
wandb: Waiting for W&B process to finish, PID 21911... (failed 1). Press ctrl-c to abort syncing.
wandb:
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Synced dashing-armadillo-5: https://wandb.ai/aakram/CodaEmbeddingStudy/runs/24kbmduf
wandb: Find logs at: /home/adeel/current/data_sets/exatrkx-test/lightning_models/lightning_checkpoints/wandb/run-20220211_205202-24kbmduf/logs/debug.log
wandb:
```
I hope this report helps in fixing the problem; in the meantime, I will try to find a solution myself.
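In case it is useful, the workaround I have in mind is a FAISS-only CPU path: an exact kNN search followed by the radius cut, roughly as below. This is my own sketch, not code from the repository; `build_edges_faiss` is a hypothetical name, and the parameters mirror the `build_edges()` call in the traceback above.

```python
# My own sketch of a CPU fallback for build_edges(): exact kNN with FAISS,
# then a radius cut. build_edges_faiss is a hypothetical name; the parameter
# names mirror the build_edges() call in the traceback above.
import faiss
import numpy as np
import torch

def build_edges_faiss(query, database, r_max, k_max):
    """Return a (2, num_edges) LongTensor of (query, database) index pairs
    whose embedded distance is below r_max, with at most k_max neighbours
    per query point. Self-pairs are not removed in this sketch."""
    db = database.detach().cpu().numpy().astype(np.float32)
    q = query.detach().cpu().numpy().astype(np.float32)

    index = faiss.IndexFlatL2(db.shape[1])   # exact L2 search on CPU
    index.add(db)
    dists, idxs = index.search(q, k_max)     # IndexFlatL2 returns *squared* L2

    # Keep only neighbours inside the radius (hence r_max**2, since the
    # reported distances are squared)
    row, col = np.nonzero(dists < r_max ** 2)
    edges = np.stack([row, idxs[row, col]])
    return torch.as_tensor(edges, dtype=torch.long)

# Quick self-test on random embeddings
if __name__ == "__main__":
    emb = torch.randn(1000, 8)
    edges = build_edges_faiss(emb, emb, r_max=0.5, k_max=16)
    print(edges.shape)  # torch.Size([2, num_edges])
```

Since `IndexFlatL2` reports squared L2 distances, the radius cut is applied as `r_max**2`; for large events an approximate index (e.g. `IndexIVFFlat`) might be needed for speed.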