[FEATURE REQUEST] Multiprocessing with for dataset.map, dataset.filter

It would be nice to be able to speed up `dataset.map` or `dataset.filter`. Perhaps this is as easy as sharding the dataset sending each shard to a process/thread/dask pool and using the new `nlp.concatenate_dataset()` function to join them all together?