Releases: ml-explore/mlx
v0.28.0
Highlights
- First version of the fused SDPA vector kernel for CUDA (see the sketch after this list)
- Convolutions in CUDA
- Speed improvements in CUDA normalization layers, softmax, and compiled kernels, plus reduced overheads and more
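The sketch below exercises the fused SDPA path from Python through `mx.fast.scaled_dot_product_attention`. The shapes (a single query token attending over a longer key/value cache, the typical "vector" case) and the scale are illustrative; kernel selection on the CUDA back-end happens internally, so nothing CUDA-specific appears in user code.

```python
import mlx.core as mx

# Illustrative decode-step shapes: batch=1, 8 heads, 1 query token,
# 512 cached keys/values, head_dim=64.
q = mx.random.normal((1, 8, 1, 64))
k = mx.random.normal((1, 8, 512, 64))
v = mx.random.normal((1, 8, 512, 64))

out = mx.fast.scaled_dot_product_attention(q, k, v, scale=64 ** -0.5)
mx.eval(out)
print(out.shape)  # (1, 8, 1, 64)
```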
What's Changed
- [CUDA] Fix segfault on exit by @awni in #2424
- [CUDA] No occupancy query for launch params by @awni in #2426
- [CUDA] More sizes for gemv by @awni in #2429
- Add more CUDA architectures for PyPi package by @awni in #2427
- Use ccache in CI by @zcbenz in #2414
- [CUDA] Use aligned vector in Layer Norm and RMS norm by @awni in #2433
- Cuda faster softmax by @awni in #2435
- Remove the kernel arg from get_launch_args by @zcbenz in #2437
- Move arange to its own file by @zcbenz in #2438
- Use load_vector in arg_reduce by @zcbenz in #2439
- Make CI faster by @zcbenz in #2440
- [CUDA] Quantized refactoring by @angeloskath in #2442
- fix circular reference by @awni in #2443
- [CUDA] Fix gemv regression by @awni in #2445
- Fix wrong graph key when using concurrent context by @zcbenz in #2447
- Fix custom metal extension by @awni in #2446
- Add tests for export including control flow models and quantized models by @junpeiz in #2430
- [CUDA] Backward convolution by @zcbenz in #2431
- [CUDA] Save primitive inputs faster by @zcbenz in #2449
- [CUDA] Vectorize generated kernels by @angeloskath in #2444
- [CUDA] Matmul utils initial commit by @angeloskath in #2441
- Fix arctan2 grads by @angeloskath in #2453
- Use LRU cache for cuda graph by @zcbenz in #2448
- Add missing algorithm header to jit_compiler.cpp for Linux builds by @zamderax in #2460
- Default install cuda on linux by @awni in #2462
- fix wraps compile by @awni in #2461
- Feat: add USE_SYSTEM_FMT CMake option by @GaetanLepage in #2219
- Use SmallVector for shapes and strides by @zcbenz in #2454
- Fix install tags by @awni in #2464
- Faster gather qmm sorted test by @awni in #2463
- Fix cublas on h100 by @awni in #2466
- revert default cuda install by @awni in #2465
- feat: support a destinations based in tree flatten/unflatten by @LVivona in #2450
- Fix typo in metal command encoder by @angeloskath in #2471
- Update CUDA sdpa by @jagrit06 in #2468
- version by @awni in #2470
New Contributors
- @junpeiz made their first contribution in #2430
- @zamderax made their first contribution in #2460
- @GaetanLepage made their first contribution in #2219
- @LVivona made their first contribution in #2450
Full Changelog: v0.27.1...v0.28.0
v0.27.1
Highlights
- Initial PyPI release of the CUDA back-end.
- The CUDA back-end works well with mlx-lm (see the sketch after this list):
  - Reasonably fast for LLM inference
  - Supports single-machine training and LoRA fine-tuning
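The sketch below shows the mlx-lm workflow the highlight refers to, assuming a CUDA-enabled install (for example `pip install mlx[cuda] mlx-lm`) and the `load`/`generate` helpers from `mlx_lm`; the model name is only an example and is not prescribed by this release.

```python
from mlx_lm import load, generate

# Example model from the mlx-community hub; substitute your own.
model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-4bit")

# Generation runs on the default GPU stream (the CUDA back-end on a CUDA build).
text = generate(model, tokenizer, prompt="Write a haiku about GPUs.", max_tokens=64)
print(text)
```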
What's Changed
- Avoid invoking allocator::malloc when creating CUDA event by @zcbenz in #2232
- Share more common code in Compiled by @zcbenz in #2240
- Avoid atomic updates across CPU/GPU in CUDA event by @zcbenz in #2231
- Perf regression fix by @angeloskath in #2243
- Add profiler annotations in common primitives for CUDA backend by @zcbenz in #2244
- Default strict mode for module `update` and `update_modules` by @awni in #2239
- Fix linux linking error by @awni in #2248
- Improve metal elementwise kernels by @awni in #2247
- CUDA backend: matmul by @zcbenz in #2241
- Change layernorms to two pass algorithm by @angeloskath in #2246
- Fix unintuitive metal kernel caching by @awni in #2242
- Refactor the lu test by @emmanuel-ferdman in #2250
- CUDA backend: unary ops by @zcbenz in #2158
- Fix export to work with gather/scatter axis by @awni in #2263
- CUDA backend: binary ops by @zcbenz in #2259
- Report number of missing parameters by @FL33TW00D in #2264
- CUDA backend: sort by @zcbenz in #2262
- CUDA backend: random by @zcbenz in #2261
- Fix conv export by @awni in #2265
- CUDA backend: copy ops by @zcbenz in #2260
- Fix building cpp benchmarks on Linux by @zcbenz in #2268
- Add load_safe to the general conv loaders by @angeloskath in #2258
- start cuda circle config by @awni in #2256
- CUDA backend: reduce by @zcbenz in #2269
- CUDA backend: argreduce by @zcbenz in #2270
- CUDA backend: softmax by @zcbenz in #2272
- CUDA backend: layernorm by @zcbenz in #2271
- Fix warnings from latest CUDA toolkit by @zcbenz in #2275
- Make sliceUpdate general by @awni in #2282
- CUDA backend: compile by @zcbenz in #2276
- [CUDA] RMSNorm and VJP by @awni in #2280
- [CUDA] Fix build by @awni in #2284
- [CUDA] ternary with select op by @awni in #2283
- CUDA backend: indexing ops by @zcbenz in #2277
- Collection of refactors by @jagrit06 in #2274
- Fix complex power and print by @awni in #2286
- fix cuda jit by @awni in #2287
- Fix cuda gemm for bf16 by @awni in #2288
- Fix cuda arg reduce by @awni in #2291
- RoPE for CUDA by @angeloskath in #2293
- Add python testing for cuda with ability to skip list of tests by @awni in #2295
- [CUDA] Fix back-end bugs and enable corresponding tests by @awni in #2296
- Cuda bug fixes 2 by @awni in #2298
- [CUDA] Divmod, Partition, and sort fixes by @awni in #2302
- [CUDA] synch properly waits for all tasks to finish and clear by @awni in #2303
- Make ptx cache settable by environment variable by @angeloskath in #2304
- Build CUDA release in Circle by @awni in #2306
- Cuda perf tuning by @awni in #2307
- Fix `update_modules()` when providing a subset by @angeloskath in #2308
- Compile float64 functions on CPU by @awni in #2311
- Fix get 2d grid dims by @angeloskath in #2316
- Split broadcast so it is always fused in compile by @angeloskath in #2318
- [CUDA] Fix reductions by @angeloskath in #2314
- Fix module update in strict mode by @awni in #2321
- MLX_SWITCH macros to templates by @angeloskath in #2320
- Use fp32 for testing, add more complex ops by @awni in #2322
- Patch bump by @awni in #2324
- Allow parameters to be deleted from a module by @awni in #2325
- Fix compilation error from integral_constant by @zcbenz in #2326
- [CUDA] Switch to CUDA graphs by @awni in #2317
- [CUDA] Fix graphs for older cuda by @awni in #2328
- [CUDA] Add MLX_CUDA_GRAPH_CACHE_SIZE env for setting graph cache size by @zcbenz in #2329
- Fix layernorm race condition by @angeloskath in #2340
- Build with all cpu cores by default by @zcbenz in #2336
- [CUDA] Do vectorized store/load in binary ops by @zcbenz in #2330
- Auto build linux release by @awni in #2341
- MoE backward improvements by @angeloskath in #2335
- Fix compilation with CUDA 11 by @zcbenz in #2331
- patch bump by @awni in #2343
- Align mlx::core::max op nan propagation with NumPy by @jhavukainen in #2339
- Add zero for argsort vjp by @awni in #2345
- [CUDA] Do vectorized store/load in contiguous elementwise ops by @zcbenz in #2342
- Align mlx::core::min op nan propagation with NumPy by @jhavukainen in #2346
- [CUDA] Set current device before cudaGraphLaunch by @zcbenz in #2351
- [CUDA] Put version in ptx cache dir path by @zcbenz in #2352
- Fix type promotion in Adam with bias correction by @angeloskath in #2350
- Fix edge check in QuantizedBlockLoader for qmm_n by @angeloskath in #2355
- [CUDA] Implement Scan kernel by @zcbenz in #2347
- [Metal] fix copy dispatch by @awni in #2360
- [CUDA] Bundle CCCL for JIT compilation by @zcbenz in #2357
- [CUDA] Do not put kernels in annoymous namespace by @zcbenz in #2362
- Fix imag() vjp by @angeloskath in #2367
- Add Primitive::name and remove Primitive::print by @zcbenz in #2365
- update linux build by @awni in #2370
- [CUDA] Affine quantize by @awni in #2354
- Fix flaky linux test by @awni in #2371
- Install linux with mlx[cuda] and mlx[cpu] by @awni in #2356
- [CUDA] Use cuda::std::complex in place of cuComplex by @zcbenz in #2372
- lower memory uniform sampling by @awni in #2361
- [CUDA] Fix complex reduce + nan propagation in min and max by @awni in #2377
- Rename the copy util in cpu/copy.h to copy_cpu by @zcbenz in #2378
- fix ring distributed test by @awni in #2380
- Test with CUDA 12.2 by @awni in #2375
- [CUDA] Add work per thread to compile by @angeloskath in #2368
- [CUDA] Fix resource leaks in matmul and graph by @awni in #2383
- [CUDA] Add more ways finding CCCL headers in JIT by @zcbenz in #2382
- Add contiguous_copy_gpu util for copying array by @zcbenz in #2379
- Adding support for the Muon Optimizer by @Goekdeniz-Guelmez in #1914
- Patch bump by @awni in #2386
- Fix release build + patch bump by @awni in #2387
- Fix cuda manylinux version to match others by @awni in #2388
- [CUDA] speedup handling scalars by @awni in #2389
- Remove thrust iterators by @zcbenz in https://g...
v0.26.5
v0.26.3
v0.26.2
v0.26.0
Highlights
- 5-bit quantization
- Significant progress on CUDA back-end by @zcbenz
Core
Features
- 5-bit quants (see the sketch after this list)
- Allow per-target Metal debug flags
- Add complex eigh
- reduce vjp for `mx.all` and `mx.any`
- `real` and `imag` properties
- Non-symmetric `mx.linalg.eig` and `mx.linalg.eigh`
- convolution vmap
- Add more complex unary ops (`sqrt`, `square`, ...)
- Complex scan
- Add `mx.broadcast_shapes`
- Added `output_padding` parameters in `conv_transpose`
- Add random normal distribution for complex numbers
- Add `mx.fft.fftshift` and `mx.fft.ifftshift` helpers
- Enable vjp for quantized scale and bias
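The sketch below rounds a weight matrix through the new 5-bit path with `mx.quantize`/`mx.dequantize`; the matrix shape and the group size of 64 are illustrative defaults, not values mandated by the release.

```python
import mlx.core as mx

# Round-trip a weight matrix through 5-bit quantization (illustrative shapes).
w = mx.random.normal((512, 512))
w_q, scales, biases = mx.quantize(w, group_size=64, bits=5)
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=5)
print(mx.abs(w - w_hat).max())  # small reconstruction error
```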
Performance
- Optimizing Complex Matrix Multiplication using Karatsuba's Algorithm
- Much faster 1D conv
CUDA
- Generalize gpu backend
- Use fallbacks in fast primitives when `eval_gpu` is not implemented (see the sketch after this list)
- Add memory cache to CUDA backend
- Do not check `event.is_signaled()` in `eval_impl`
- Build for compute capability 70 instead of 75 in CUDA backend
- CUDA backend: backbone
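To make the fallback item above concrete, here is a minimal sketch of running on the GPU stream; per the notes, fast primitives without an `eval_gpu` implementation fall back transparently, so user code does not change. The shapes are illustrative.

```python
import mlx.core as mx

# Select the GPU (the CUDA device on a CUDA build of MLX).
mx.set_default_device(mx.gpu)

x = mx.random.normal((1024, 1024))
y = (x @ x.T).sum()
mx.eval(y)  # ops with GPU kernels run there; the rest fall back as described above
print(mx.default_device())
```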
Bug Fixes
- Fix out-of-bounds default value in logsumexp/softmax
- include `mlx::core::version()` symbols in the mlx static library
- Fix Nearest upsample
- Fix large arg reduce
- fix conv grad
- Fix some complex vjps
- Fix typo in row_reduce_small
- Fix `put_along_axis` for empty arrays
- Close a couple edge case bugs: `hadamard` and `addmm` on empty inputs
- Fix fft for integer overflow with large batches
- fix: `conv_general` differences between gpu, cpu
- Fix batched vector sdpa
- GPU Hadamard for large N
- Improve bandwidth for elementwise ops
- Fix compile merging
- Fix shapeless export to throw on dim mismatch
- Fix `mx.linalg.pinv` for singular matrices
- Fixed shift operations
- Fix integer overflow in qmm
Contributors
Thanks to some awesome contributors!
@ivanfioravanti, @awni, @angeloskath, @zcbenz, @Jckwind, @iExalt, @thesuryash, @ParamThakkar123, @djphoenix, @ita9naiwa, @hdeng-apple, @Redempt1onzzZZ, @charan-003, @skyzh, @wisefool769, @barronalex, @aturker1
v0.25.2
v0.25.1
v0.25.0
Highlights
- Custom logsumexp for reduced memory in training (benchmark); see the sketch after this list
- Depthwise separable convolutions
  - Up to 4x faster than PyTorch
  - benchmark
- Batched Gather MM and Gather QMM for ~2x faster prompt processing for MoEs
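As a concrete use of the logsumexp highlight, the sketch below computes a log-softmax over a large vocabulary; the shapes are illustrative, and the memory savings come from the op's implementation rather than anything special in user code.

```python
import mlx.core as mx

# Log-softmax over a large vocabulary via logsumexp (illustrative shapes).
logits = mx.random.normal((8, 32000))
logprobs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
mx.eval(logprobs)
```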
Core
Performance
- Fused vector attention supports 256 dim
- Tune quantized matrix vector dispatch for small batches of vectors
Features
- Move the memory API to the top-level mlx.core and enable it for the CPU-only allocator
- Enable using MPI from all platforms and allow only OpenMPI
- Add a ring all gather for the ring distributed backend
- Enable gemm for complex numbers
- Fused attention supports literal "causal" mask
- Log for complex numbers
- Distributed `all_min` and `all_max` both for MPI and the ring backend
- Add `logcumsumexp` (see the sketch after this list)
- Add additive mask for fused vector attention
- Improve the usage of the residency set
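A quick sketch of the new `logcumsumexp` from the list above, assuming it follows the usual cumsum-style signature with an `axis` argument; the naive reference is included only to show what the op is numerically equivalent to.

```python
import mlx.core as mx

x = mx.random.normal((5,))
a = mx.logcumsumexp(x, axis=0)            # stable log(cumsum(exp(x)))
b = mx.log(mx.cumsum(mx.exp(x), axis=0))  # naive reference; can overflow for large x
print(mx.allclose(a, b))
```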
NN
- Add sharded layers for model/tensor parallelism
Bugfixes
- Fix possible allocator deadlock when using multiple streams
- Ring backend supports 32 bit platforms and FreeBSD
- Fix FFT bugs
- Fix attention mask type for fused attention kernel
- Fix fused attention numerical instability with masking
- Add a fallback for float16 gemm
- Fix simd sign for uint64
- Fix issues in docs