Releases: ml-explore/mlx
v0.28.0
Highlights
- First version of the fused SDPA vector kernel for CUDA (see the sketch after this list)
- Convolutions in CUDA
- Speed improvements in CUDA normalization layers, softmax, and compiled kernels, plus reduced overheads and more
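The sketch below exercises the fused SDPA path from Python through `mx.fast.scaled_dot_product_attention`. The shapes (a single query token attending over a longer key/value cache, the typical "vector" case) and the scale are illustrative; kernel selection on the CUDA back-end happens internally, so nothing CUDA-specific appears in user code.

```python
import mlx.core as mx

# Illustrative decode-step shapes: batch=1, 8 heads, 1 query token,
# 512 cached keys/values, head_dim=64.
q = mx.random.normal((1, 8, 1, 64))
k = mx.random.normal((1, 8, 512, 64))
v = mx.random.normal((1, 8, 512, 64))

out = mx.fast.scaled_dot_product_attention(q, k, v, scale=64 ** -0.5)
mx.eval(out)
print(out.shape)  # (1, 8, 1, 64)
```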
What's Changed
- [CUDA] Fix segfault on exit by @awni in #2424
- [CUDA] No occupancy query for launch params by @awni in #2426
- [CUDA] More sizes for gemv by @awni in #2429
- Add more CUDA architectures for PyPi package by @awni in #2427
- Use ccache in CI by @zcbenz in #2414
- [CUDA] Use aligned vector in Layer Norm and RMS norm by @awni in #2433
- Cuda faster softmax by @awni in #2435
- Remove the kernel arg from get_launch_args by @zcbenz in #2437
- Move arange to its own file by @zcbenz in #2438
- Use load_vector in arg_reduce by @zcbenz in #2439
- Make CI faster by @zcbenz in #2440
- [CUDA] Quantized refactoring by @angeloskath in #2442
- fix circular reference by @awni in #2443
- [CUDA] Fix gemv regression by @awni in #2445
- Fix wrong graph key when using concurrent context by @zcbenz in #2447
- Fix custom metal extension by @awni in #2446
- Add tests for export including control flow models and quantized models by @junpeiz in #2430
- [CUDA] Backward convolution by @zcbenz in #2431
- [CUDA] Save primitive inputs faster by @zcbenz in #2449
- [CUDA] Vectorize generated kernels by @angeloskath in #2444
- [CUDA] Matmul utils initial commit by @angeloskath in #2441
- Fix arctan2 grads by @angeloskath in #2453
- Use LRU cache for cuda graph by @zcbenz in #2448
- Add missing algorithm header to jit_compiler.cpp for Linux builds by @zamderax in #2460
- Default install cuda on linux by @awni in #2462
- fix wraps compile by @awni in #2461
- Feat: add USE_SYSTEM_FMT CMake option by @GaetanLepage in #2219
- Use SmallVector for shapes and strides by @zcbenz in #2454
- Fix install tags by @awni in #2464
- Faster gather qmm sorted test by @awni in #2463
- Fix cublas on h100 by @awni in #2466
- revert default cuda install by @awni in #2465
- feat: support a destinations based in tree flatten/unflatten by @LVivona in #2450
- Fix typo in metal command encoder by @angeloskath in #2471
- Update CUDA sdpa by @jagrit06 in #2468
- version by @awni in #2470
New Contributors
- @junpeiz made their first contribution in #2430
- @zamderax made their first contribution in #2460
- @GaetanLepage made their first contribution in #2219
- @LVivona made their first contribution in #2450
Full Changelog: v0.27.1...v0.28.0
v0.27.1
Highlights
- Initial PyPI release of the CUDA back-end.
- The CUDA back-end works well with mlx-lm (see the sketch after this list):
  - Reasonably fast for LLM inference
  - Supports single-machine training and LoRA fine-tuning
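The sketch below shows the mlx-lm workflow the highlight refers to, assuming a CUDA-enabled install (for example `pip install mlx[cuda] mlx-lm`) and the `load`/`generate` helpers from `mlx_lm`; the model name is only an example and is not prescribed by this release.

```python
from mlx_lm import load, generate

# Example model from the mlx-community hub; substitute your own.
model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-4bit")

# Generation runs on the default GPU stream (the CUDA back-end on a CUDA build).
text = generate(model, tokenizer, prompt="Write a haiku about GPUs.", max_tokens=64)
print(text)
```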
What's Changed
- Avoid invoking allocator::malloc when creating CUDA event by @zcbenz in #2232
- Share more common code in Compiled by @zcbenz in #2240
- Avoid atomic updates across CPU/GPU in CUDA event by @zcbenz in #2231
- Perf regression fix by @angeloskath in #2243
- Add profiler annotations in common primitives for CUDA backend by @zcbenz in #2244
- Default strict mode for module `update` and `update_modules` by @awni in #2239
- Fix linux linking error by @awni in #2248
- Improve metal elementwise kernels by @awni in #2247
- CUDA backend: matmul by @zcbenz in #2241
- Change layernorms to two pass algorithm by @angeloskath in #2246
- Fix unintuitive metal kernel caching by @awni in #2242
- Refactor the lu test by @emmanuel-ferdman in #2250
- CUDA backend: unary ops by @zcbenz in #2158
- Fix export to work with gather/scatter axis by @awni in #2263
- CUDA backend: binary ops by @zcbenz in #2259
- Report number of missing parameters by @FL33TW00D in #2264
- CUDA backend: sort by @zcbenz in #2262
- CUDA backend: random by @zcbenz in #2261
- Fix conv export by @awni in #2265
- CUDA backend: copy ops by @zcbenz in #2260
- Fix building cpp benchmarks on Linux by @zcbenz in #2268
- Add load_safe to the general conv loaders by @angeloskath in #2258
- start cuda circle config by @awni in #2256
- CUDA backend: reduce by @zcbenz in #2269
- CUDA backend: argreduce by @zcbenz in #2270
- CUDA backend: softmax by @zcbenz in #2272
- CUDA backend: layernorm by @zcbenz in #2271
- Fix warnings from latest CUDA toolkit by @zcbenz in #2275
- Make sliceUpdate general by @awni in #2282
- CUDA backend: compile by @zcbenz in #2276
- [CUDA] RMSNorm and VJP by @awni in #2280
- [CUDA] Fix build by @awni in #2284
- [CUDA] ternary with select op by @awni in #2283
- CUDA backend: indexing ops by @zcbenz in #2277
- Collection of refactors by @jagrit06 in #2274
- Fix complex power and print by @awni in #2286
- fix cuda jit by @awni in #2287
- Fix cuda gemm for bf16 by @awni in #2288
- Fix cuda arg reduce by @awni in #2291
- RoPE for CUDA by @angeloskath in #2293
- Add python testing for cuda with ability to skip list of tests by @awni in #2295
- [CUDA] Fix back-end bugs and enable corresponding tests by @awni in #2296
- Cuda bug fixes 2 by @awni in #2298
- [CUDA] Divmod, Partition, and sort fixes by @awni in #2302
- [CUDA] synch properly waits for all tasks to finish and clear by @awni in #2303
- Make ptx cache settable by environment variable by @angeloskath in #2304
- Build CUDA release in Circle by @awni in #2306
- Cuda perf tuning by @awni in #2307
- Fix `update_modules()` when providing a subset by @angeloskath in #2308
- Compile float64 functions on CPU by @awni in #2311
- Fix get 2d grid dims by @angeloskath in #2316
- Split broadcast so it is always fused in compile by @angeloskath in #2318
- [CUDA] Fix reductions by @angeloskath in #2314
- Fix module update in strict mode by @awni in #2321
- MLX_SWITCH macros to templates by @angeloskath in #2320
- Use fp32 for testing, add more complex ops by @awni in #2322
- Patch bump by @awni in #2324
- Allow parameters to be deleted from a module by @awni in #2325
- Fix compilation error from integral_constant by @zcbenz in #2326
- [CUDA] Switch to CUDA graphs by @awni in #2317
- [CUDA] Fix graphs for older cuda by @awni in #2328
- [CUDA] Add MLX_CUDA_GRAPH_CACHE_SIZE env for setting graph cache size by @zcbenz in #2329
- Fix layernorm race condition by @angeloskath in #2340
- Build with all cpu cores by default by @zcbenz in #2336
- [CUDA] Do vectorized store/load in binary ops by @zcbenz in #2330
- Auto build linux release by @awni in #2341
- MoE backward improvements by @angeloskath in #2335
- Fix compilation with CUDA 11 by @zcbenz in #2331
- patch bump by @awni in #2343
- Align mlx::core::max op nan propagation with NumPy by @jhavukainen in #2339
- Add zero for argsort vjp by @awni in #2345
- [CUDA] Do vectorized store/load in contiguous elementwise ops by @zcbenz in #2342
- Align mlx::core::min op nan propagation with NumPy by @jhavukainen in #2346
- [CUDA] Set current device before cudaGraphLaunch by @zcbenz in #2351
- [CUDA] Put version in ptx cache dir path by @zcbenz in #2352
- Fix type promotion in Adam with bias correction by @angeloskath in #2350
- Fix edge check in QuantizedBlockLoader for qmm_n by @angeloskath in #2355
- [CUDA] Implement Scan kernel by @zcbenz in #2347
- [Metal] fix copy dispatch by @awni in #2360
- [CUDA] Bundle CCCL for JIT compilation by @zcbenz in #2357
- [CUDA] Do not put kernels in annoymous namespace by @zcbenz in #2362
- Fix imag() vjp by @angeloskath in #2367
- Add Primitive::name and remove Primitive::print by @zcbenz in #2365
- update linux build by @awni in #2370
- [CUDA] Affine quantize by @awni in #2354
- Fix flaky linux test by @awni in #2371
- Install linux with mlx[cuda] and mlx[cpu] by @awni in #2356
- [CUDA] Use cuda::std::complex in place of cuComplex by @zcbenz in #2372
- lower memory uniform sampling by @awni in #2361
- [CUDA] Fix complex reduce + nan propagation in min and max by @awni in #2377
- Rename the copy util in cpu/copy.h to copy_cpu by @zcbenz in #2378
- fix ring distributed test by @awni in #2380
- Test with CUDA 12.2 by @awni in #2375
- [CUDA] Add work per thread to compile by @angeloskath in #2368
- [CUDA] Fix resource leaks in matmul and graph by @awni in #2383
- [CUDA] Add more ways finding CCCL headers in JIT by @zcbenz in #2382
- Add contiguous_copy_gpu util for copying array by @zcbenz in #2379
- Adding support for the Muon Optimizer by @Goekdeniz-Guelmez in #1914
- Patch bump by @awni in #2386
- Fix release build + patch bump by @awni in #2387
- Fix cuda manylinux version to match others by @awni in #2388
- [CUDA] speedup handling scalars by @awni in #2389
- Remove thrust iterators by @zcbenz in https://g...
v0.26.5
v0.26.3
v0.26.2
v0.26.0
Highlights
- 5-bit quantization
- Significant progress on CUDA back-end by @zcbenz
Core
Features
- 5-bit quants (see the sketch after this list)
- Allow per-target Metal debug flags
- Add complex eigh
- reduce vjp for `mx.all` and `mx.any`
- `real` and `imag` properties
- Non-symmetric `mx.linalg.eig` and `mx.linalg.eigh`
- convolution vmap
- Add more complex unary ops (`sqrt`, `square`, ...)
- Complex scan
- Add `mx.broadcast_shapes`
- Added `output_padding` parameters in `conv_transpose`
- Add random normal distribution for complex numbers
- Add `mx.fft.fftshift` and `mx.fft.ifftshift` helpers
- Enable vjp for quantized scale and bias
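The sketch below rounds a weight matrix through the new 5-bit path with `mx.quantize`/`mx.dequantize`; the matrix shape and the group size of 64 are illustrative defaults, not values mandated by the release.

```python
import mlx.core as mx

# Round-trip a weight matrix through 5-bit quantization (illustrative shapes).
w = mx.random.normal((512, 512))
w_q, scales, biases = mx.quantize(w, group_size=64, bits=5)
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=5)
print(mx.abs(w - w_hat).max())  # small reconstruction error
```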
Performance
- Optimizing Complex Matrix Multiplication using Karatsuba's Algorithm
- Much faster 1D conv
CUDA
- Generalize gpu backend
- Use fallbacks in fast primitives when `eval_gpu` is not implemented (see the sketch after this list)
- Add memory cache to CUDA backend
- Do not check `event.is_signaled()` in `eval_impl`
- Build for compute capability 70 instead of 75 in CUDA backend
- CUDA backend: backbone
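To make the fallback item above concrete, here is a minimal sketch of running on the GPU stream; per the notes, fast primitives without an `eval_gpu` implementation fall back transparently, so user code does not change. The shapes are illustrative.

```python
import mlx.core as mx

# Select the GPU (the CUDA device on a CUDA build of MLX).
mx.set_default_device(mx.gpu)

x = mx.random.normal((1024, 1024))
y = (x @ x.T).sum()
mx.eval(y)  # ops with GPU kernels run there; the rest fall back as described above
print(mx.default_device())
```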
Bug Fixes
- Fix out-of-bounds default value in logsumexp/softmax
- include `mlx::core::version()` symbols in the mlx static library
- Fix Nearest upsample
- Fix large arg reduce
- fix conv grad
- Fix some complex vjps
- Fix typo in row_reduce_small
- Fix `put_along_axis` for empty arrays
- Close a couple edge case bugs: `hadamard` and `addmm` on empty inputs
- Fix fft for integer overflow with large batches
- fix: `conv_general` differences between gpu, cpu
- Fix batched vector sdpa
- GPU Hadamard for large N
- Improve bandwidth for elementwise ops
- Fix compile merging
- Fix shapeless export to throw on dim mismatch
- Fix `mx.linalg.pinv` for singular matrices
- Fixed shift operations
- Fix integer overflow in qmm
Contributors
Thanks to some awesome contributors!
@ivanfioravanti, @awni, @angeloskath, @zcbenz, @Jckwind, @iExalt, @thesuryash, @ParamThakkar123, @djphoenix, @ita9naiwa, @hdeng-apple, @Redempt1onzzZZ, @charan-003, @skyzh, @wisefool769, @barronalex, @aturker1
v0.25.2
v0.25.1
v0.25.0
Highlights
- Custom logsumexp for reduced memory in training (benchmark); see the sketch after this list
- Depthwise separable convolutions
  - Up to 4x faster than PyTorch
  - benchmark
- Batched Gather MM and Gather QMM for ~2x faster prompt processing for MoEs
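As a concrete use of the logsumexp highlight, the sketch below computes a log-softmax over a large vocabulary; the shapes are illustrative, and the memory savings come from the op's implementation rather than anything special in user code.

```python
import mlx.core as mx

# Log-softmax over a large vocabulary via logsumexp (illustrative shapes).
logits = mx.random.normal((8, 32000))
logprobs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
mx.eval(logprobs)
```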
Core
Performance
- Fused vector attention supports 256 dim
- Tune quantized matrix vector dispatch for small batches of vectors
Features
- Move the memory API to the top-level mlx.core and enable it for the CPU-only allocator
- Enable using MPI from all platforms and allow only OpenMPI
- Add a ring all gather for the ring distributed backend
- Enable gemm for complex numbers
- Fused attention supports literal "causal" mask
- Log for complex numbers
- Distributed `all_min` and `all_max` both for MPI and the ring backend
- Add `logcumsumexp` (see the sketch after this list)
- Add additive mask for fused vector attention
- Improve the usage of the residency set
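A quick sketch of the new `logcumsumexp` from the list above, assuming it follows the usual cumsum-style signature with an `axis` argument; the naive reference is included only to show what the op is numerically equivalent to.

```python
import mlx.core as mx

x = mx.random.normal((5,))
a = mx.logcumsumexp(x, axis=0)            # stable log(cumsum(exp(x)))
b = mx.log(mx.cumsum(mx.exp(x), axis=0))  # naive reference; can overflow for large x
print(mx.allclose(a, b))
```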
NN
- Add sharded layers for model/tensor parallelism
Bugfixes
- Fix possible allocator deadlock when using multiple streams
- Ring backend supports 32 bit platforms and FreeBSD
- Fix FFT bugs
- Fix attention mask type for fused attention kernel
- Fix fused attention numerical instability with masking
- Add a fallback for float16 gemm
- Fix simd sign for uint64
- Fix issues in docs