Running the TrackML_example with TrainTrack #20

@a-akram

Description

Hello All,

I am testing the pipeline on TrackML data, following the Quick Start guide. After some effort I was able to run the pipeline partially. Here are some observations:

  • First, the default pipeline_test.yaml is missing, so I used pipeline_quickstart.yaml instead ($ traintrack configs/pipeline_quickstart.yaml)
  • wandb has to be installed explicitly

The Processing stage was successful, but I am having issues with the Embedding stage. I have the CPU version of PyTorch; during validation, when build_edges() is called, FAISS should be used on CPU and FRNN on GPU. However, FRNN is called even on CPU, although it was never imported in the first place. The error is as follows:

Running from top with args: ['/home/adeel/anaconda/envs/exatrkx-test/bin/traintrack', 'configs/pipeline_quickstart.yaml']
run_pipeline.start()...
Namespace(inference=False, pipeline_config='configs/pipeline_quickstart.yaml', run_stage=False, slurm=False, verbose=False)
current stage:  {'set': 'Embedding', 'name': 'LayerlessEmbedding', 'config': 'train_quickstart_embedding.yaml', 'batch_config': 'batch_gpu_default.yaml', 'resume_id': None}
Running stage, with args, and model library: LightningModules
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
wandb: Currently logged in as: aakram (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.12.10
wandb: Syncing run dashing-armadillo-5
wandb: ⭐️ View project at https://wandb.ai/aakram/CodaEmbeddingStudy
wandb: 🚀 View run at https://wandb.ai/aakram/CodaEmbeddingStudy/runs/24kbmduf
wandb: Run data is saved locally in /home/adeel/current/data_sets/exatrkx-test/lightning_models/lightning_checkpoints/wandb/run-20220211_205202-24kbmduf
wandb: Run `wandb offline` to turn off syncing.


  | Name    | Type       | Params
---------------------------------------
0 | network | Sequential | 3.2 M 
---------------------------------------
3.2 M     Trainable params
0         Non-trainable params
3.2 M     Total params
12.726    Total estimated model params size (MB)
Validation sanity check: 0it [00:00, ?it/s]/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/torch_geometric/deprecation.py:13: UserWarning: 'data.DataLoader' is deprecated, use 'loader.DataLoader' instead
  warnings.warn(out)
/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:132: UserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Validation sanity check:   0%|                                                                                                                                     | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/adeel/anaconda/envs/exatrkx-test/bin/traintrack", line 8, in <module>
    sys.exit(main())
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/traintrack/command_line_pipe.py", line 71, in main
    run_pipeline.start(run_args)
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/traintrack/run_pipeline.py", line 152, in start
    run_stage(**config)
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/traintrack/utils/data_utils.py", line 51, in wrapped
    return dFxn(*cp, **dp)
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/traintrack/run_pipeline.py", line 81, in run_stage
    train_stage(model, model_config)
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/traintrack/run_pipeline.py", line 96, in train_stage
    trainer.fit(model)
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1311, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1375, in _run_sanity_check
    self._evaluation_loop.run()
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
    output = self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 217, in _evaluation_step
    output = self.trainer.accelerator.validation_step(step_kwargs)
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 239, in validation_step
    return self.training_type_plugin.validation_step(*step_kwargs.values())
  File "/home/adeel/anaconda/envs/exatrkx-test/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 219, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/home/adeel/current/5_exatrkx/exatrkx-hsf/Pipelines/TrackML_Example/LightningModules/Embedding/embedding_base.py", line 301, in validation_step
    outputs = self.shared_evaluation(
  File "/home/adeel/current/5_exatrkx/exatrkx-hsf/Pipelines/TrackML_Example/LightningModules/Embedding/embedding_base.py", line 257, in shared_evaluation
    e_spatial = build_edges(spatial, spatial, indices=None, r_max=knn_radius, k_max=knn_num)
  File "/home/adeel/current/5_exatrkx/exatrkx-hsf/Pipelines/TrackML_Example/LightningModules/Embedding/utils.py", line 173, in build_edges
    import frnn
ModuleNotFoundError: No module named 'frnn'

wandb: Waiting for W&B process to finish, PID 21911... (failed 1). Press ctrl-c to abort syncing.
wandb:                                                                                
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Synced dashing-armadillo-5: https://wandb.ai/aakram/CodaEmbeddingStudy/runs/24kbmduf
wandb: Find logs at: /home/adeel/current/data_sets/exatrkx-test/lightning_models/lightning_checkpoints/wandb/run-20220211_205202-24kbmduf/logs/debug.log
wandb: 

I hope this helps to pinpoint the problem; in the meantime I will keep trying to find a solution myself.
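For reference, the device dispatch that seems intended here can be sketched as a small guard. The helper name `choose_knn_backend` and its flag arguments are hypothetical, not part of the repository's utils.py, but the actual build_edges() could wrap its `import frnn` in the same kind of check:

```python
def choose_knn_backend(use_cuda: bool,
                       frnn_available: bool,
                       faiss_available: bool) -> str:
    """Pick a nearest-neighbour backend: FRNN on GPU, FAISS otherwise.

    Hypothetical sketch of the dispatch that build_edges() presumably
    intends; the real code should only `import frnn` on the GPU path.
    """
    if use_cuda and frnn_available:
        return "frnn"   # GPU path: fixed-radius nearest neighbours
    if faiss_available:
        return "faiss"  # CPU fallback: FAISS radius/kNN search
    raise ModuleNotFoundError(
        "No neighbour-search backend found: install faiss-cpu for CPU "
        "runs, or frnn for GPU runs"
    )
```

With a guard like this, a CPU-only install that lacks frnn would fall through to FAISS instead of hitting the ModuleNotFoundError above.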
