HF Space?!?

#1
by asigalov61 - opened

@Chickaboo Great job!

Can you make an HF Space so that it's easier to try?

Alex

This model was more of a research experiment for its hybrid architecture than anything useful at this scale. It learns some basic musical rules and produces output with rhythm that's not too bad, but at the same time it's not super useful. I'm working on an updated model with an improved architecture that will also be much larger, with a target of 100M or more parameters. Once the new model releases, I'll try to make a Space, and hopefully you'll be able to use both through it.

@Chickaboo Thank you for your response and clarifications.

What do you plan to use? Can you share details for your architecture? Do you still plan to use Mamba?

Also, what is your downstream task goal? Music generation or music continuation?

Have you seen https://huggingface.co/spaces/skytnt/midi-composer ?

From my experience, Mamba is not a very good choice for music, even though it should be good in theory. I would recommend going with LLaMA for music generation and with full attention for continuation.

Also, consider looking into multi-token embeddings and custom tokenization techniques for good results, especially at small scale.

If you have any thoughts about any of this, I will be happy to chat some more.

Otherwise, good job! Your project was very interesting and peculiar to me.

Sincerely,

Alex

P.S. Check out my Orpheus music transformer and my Discover MIDI dataset. They may help you with your project.

Ohh, I forgot to mention... You definitely should not use BPE, as it significantly degrades performance. You might also want to look into hybrid datasets like my Godzilla Piano. Pure performance datasets like MAESTRO, GiantMIDI, ATEPP, and Aria do not produce good results unless diluted with non-performance compositions.

For continuations, the embedding size is what matters most. I.e., 2048 would be a minimum choice, IMHO.

Thank you so much for your feedback and kind words. I checked out your Orpheus Music Transformer and the Discover MIDI dataset; both are really impressive.

I'm now reconsidering the architecture for the next version. I think that three competing architectural components fighting for the parameter budget is less than optimal. I was originally planning Mamba + CfC + Attention, but I think you're right about Mamba, even though I added the attention layers to compensate for some of Mamba's weaknesses. It's too much to carry a component you need to compensate for, especially when its benefits aren't all that significant to begin with.

I'm considering a repeating block: “GatedDeltaNet → GatedDeltaNet → CfC → GQA.” The reasoning is that GatedDeltaNet addresses some of Mamba's weaknesses in in-context learning, and CfC provides continuous-time dynamics that I believe are worth trying, since they could be valuable for MIDI timing. Incorporating CfC, a modern evolution of liquid neural networks, into a novel architecture is the main motivation behind this model; I'm trying to understand how setups other than pure transformers perform. GQA gives efficient attention for precise retrieval without the full MHA cost. You asked whether my end goal is continuation or unique generation from scratch. For now, the focus is on continuation; at this scale, generating from scratch feels like a harder problem to tackle first.
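To make the block structure concrete, here is a minimal PyTorch sketch of the repeating pattern described above. The GDN/CfC/GQA sub-layers are injected placeholders (assumptions, not real implementations); only the GDN → GDN → CfC → GQA ordering and the pre-norm residual wiring come from the description.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Repeating block: GDN -> GDN -> CfC -> GQA, each sub-layer
    behind a pre-norm residual. The constructors are injected so this
    sketch stays agnostic to any particular GDN/CfC/GQA implementation."""
    def __init__(self, d_model, make_gdn, make_cfc, make_gqa):
        super().__init__()
        # Two linear-time GDN layers, one continuous-time CfC layer,
        # one grouped-query attention layer.
        self.layers = nn.ModuleList([make_gdn(), make_gdn(), make_cfc(), make_gqa()])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in self.layers])

    def forward(self, x):  # x: (batch, seq, d_model)
        for norm, layer in zip(self.norms, self.layers):
            x = x + layer(norm(x))  # residual around each sub-layer
        return x

# Smoke test with stand-in sub-layers (a real run would pass GDN/CfC/GQA):
stub = lambda: nn.Linear(64, 64)
block = HybridBlock(64, make_gdn=stub, make_cfc=stub, make_gqa=stub)
out = block(torch.randn(2, 16, 64))
```

A full model would stack several of these blocks and swap the `stub` constructors for real sequence-mixing layers.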

On tokenization, I'll take your word about BPE; I've been using REMI + BPE. What would you recommend instead? I'm particularly curious whether absolute timing tokens would pair well with continuous-time positional encoding, or whether that's redundant. I also saw your note about embedding size: 2048 minimum for continuation is noted, and I'll plan around it.

Regarding datasets: for the next model, after looking at yours, including Godzilla Piano, I plan to use those as a foundation. I was originally going with MAESTRO, ADL-Piano, and Aria Deduped, and I'll probably still blend some of those in, but Godzilla Piano as the core seems like the smarter starting point, especially considering how it turned out for your models.

Thank you again for your advice. I value your expertise.

Best, Lucas

You are very welcome! And thank you for sharing your thoughts and for being open to suggestions. I really appreciate it :)

I'm now reconsidering the architecture for the next version. I think that three competing architectural components fighting for the parameter budget is less than optimal. I was originally planning Mamba + CfC + Attention, but I think you're right about Mamba, even though I added the attention layers to compensate for some of Mamba's weaknesses. It's too much to carry a component you need to compensate for, especially when its benefits aren't all that significant to begin with.

Yes, I would highly recommend keeping things as simple as possible because, from my experience, complexity does not always equal quality in AI architectures, especially when it comes to symbolic music.

I'm considering a repeating block: “GatedDeltaNet → GatedDeltaNet → CfC → GQA.” The reasoning is that GatedDeltaNet addresses some of Mamba's weaknesses in in-context learning, and CfC provides continuous-time dynamics that I believe are worth trying, since they could be valuable for MIDI timing. Incorporating CfC, a modern evolution of liquid neural networks, into a novel architecture is the main motivation behind this model; I'm trying to understand how setups other than pure transformers perform. GQA gives efficient attention for precise retrieval without the full MHA cost.

I will be honest here: I've never tried GDN, CfC, or GQA, but your proposed architecture makes sense. Since you are doing the right thing (starting at a small scale), it is most certainly worth investigating. I am not sure about GDN, since it seems to still be a variant of Mamba, albeit with improvements. But I do know that the biggest issue with symbolic music is the temporal components (i.e., onsets and durations), so if I were you, I would emphasize improving the temporal abilities of your architecture. This is a very difficult and still unsolved task, so it may be hard, but there are ways to do it. E.g., MuseNet did not have durations, only onsets coupled with a separate onset embedding, which produced pretty good results. You can also look into time-series models (there are a few SOTA ones available now).

You asked whether my end goal is continuation or unique generation from scratch. For now, the focus is on continuation; at this scale, generating from scratch feels like a harder problem to tackle first.

I see. Actually, it's quite the opposite :) But the upside is that if your model can continue any arbitrary composition well, it will generate well too.

On tokenization, I'll take your word about BPE; I've been using REMI + BPE.

REMI is a decent encoding, but it's far from efficient, and it's also not very suitable for performance music, since the vanilla implementation has problems encoding performance timings well.

Now, in regard to BPE (and this is very important): unless you use separate BPEs for each element of your tokenized sequences, it will greatly degrade performance, because it will merge incompatible elements (e.g., onsets with pitches or velocities). This only confuses the model and increases overfitting, since the model gives up on trying to understand the data and simply starts memorizing it. So in general, BPE is definitely not worth it. Symbolic music sequences are not text or even images; they are much more complex and require a different approach.
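A toy illustration of the point above (not a real BPE implementation): naive pair-merging over an interleaved onset/pitch/duration stream happily picks the most frequent pair even when it crosses element types, which is exactly the kind of merge that confuses a model. The token names here are made up for the example.

```python
from collections import Counter

# Interleaved stream: delta onset, pitch, duration, repeated per note.
seq = ["on_0", "p_60", "d_4", "on_0", "p_64", "d_4", "on_4", "p_60", "d_4"]

# The first BPE merge step: count adjacent pairs, take the most frequent.
pairs = Counter(zip(seq, seq[1:]))
best = pairs.most_common(1)[0]
print(best)  # -> (('p_60', 'd_4'), 2): a pitch fused with a duration
```

The winning merge glues a pitch token to a duration token, so the "word" it creates mixes two musically unrelated dimensions; per-element-stream BPE would avoid this by only merging within a single type.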

What would you recommend instead?

I would recommend looking into efficient custom tokenization methods. MidiTok has a few, but I am not sure about the implementations (I do not know whether they are correct, since they are not official). Still, give it a try, especially PerTok; it has micro-timing support, which is supposed to be great for solo piano models.

Otherwise, I would highly recommend getting your hands dirty and creating your own tokenization technique, one suited to your custom architecture. From my experience, simplicity and efficiency are key.

For example (assuming solo piano): [delta onset, pitch+128, duration+256, pitch+128, duration+256, delta onset, ...]. This is what I used in my Godzilla Piano Transformer, with great results.

You can also add velocities if you want. I would go with 16 levels, or even 8, for velocities, since the full 128-value range is much harder for the model to learn.
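Here is a minimal sketch of a tokenizer following the scheme described above, with the optional quantized velocities added. The exact token layout is an assumption: delta onsets in 0-127, pitches at +128, durations at +256, and an 8-level velocity token at +384; only the [delta onset, pitch+128, duration+256, ...] shape comes from the post.

```python
def tokenize(notes):
    """notes: list of (onset, pitch, duration, velocity) tuples, with
    onset/duration already quantized to 0-127 time steps, pitch 0-127,
    velocity 0-127. Returns a flat token sequence."""
    tokens, prev_onset = [], 0
    for onset, pitch, duration, velocity in sorted(notes):
        delta = min(onset - prev_onset, 127)
        if delta > 0 or not tokens:
            tokens.append(delta)                 # delta-onset token (0-127)
        tokens.append(128 + pitch)               # pitch token (128-255)
        tokens.append(256 + min(duration, 127))  # duration token (256-383)
        tokens.append(384 + velocity // 16)      # 8-level velocity token (384-391)
        prev_onset = onset
    return tokens

# A chord (two notes at the same onset) shares a single delta-onset token:
print(tokenize([(0, 60, 4, 80), (0, 64, 4, 80), (4, 67, 8, 100)]))
# -> [0, 188, 260, 389, 192, 260, 389, 4, 195, 264, 390]
```

The resulting vocabulary is only ~392 tokens, which illustrates the simplicity-and-efficiency point: no BPE needed, and each position in the sequence has a single, unambiguous element type.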

I'm particularly curious whether absolute timing tokens would pair well with continuous-time positional encoding, or whether that's redundant.

Since you are toying with things like Mamba/GDN and CfC, you might want to include absolute-time bar tokens at the very minimum. This usually works well. You can also include micro-timings, like they did in PerTok.

I also saw your note about embedding size: 2048 minimum for continuation is noted, and I'll plan around it.

Yes, do not be scared of 2048. From my experience, a 2048-dim, 4-layer (shallow) model outperforms 1024 × 8, and most certainly outperforms 512 × 16, on continuation tasks. The number of layers matters for long-term structure if your seq_len is long. This is where your custom Mamba-style architecture with linear attention can be very useful.
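For scale, a back-of-envelope parameter count for the three configurations mentioned, under the rough assumption that a standard transformer block carries about 12 · d_model² parameters per layer (embeddings and norms ignored):

```python
def approx_params(d_model, n_layers):
    # Rough rule of thumb: ~12 * d_model^2 params per transformer layer
    # (attention projections + MLP), ignoring embeddings and norms.
    return 12 * d_model ** 2 * n_layers

for d, n in [(2048, 4), (1024, 8), (512, 16)]:
    print(f"{d} x {n}: ~{approx_params(d, n) / 1e6:.0f}M params")
```

Note that under this estimate the wide-shallow 2048 × 4 model carries roughly twice the parameters of 1024 × 8 (doubling width quadruples per-layer cost while halving depth only halves it), so part of the comparison is width buying raw capacity, not just a different shape.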

Regarding datasets: for the next model, after looking at yours, including Godzilla Piano, I plan to use those as a foundation. I was originally going with MAESTRO, ADL-Piano, and Aria Deduped, and I'll probably still blend some of those in, but Godzilla Piano as the core seems like the smarter starting point, especially considering how it turned out for your models.

Thank you. Yes, try Godzilla Piano. It's pre-tokenized and ready to go, and it produces very good results; I have already used it in many of my projects. Alternatively, you can dilute any of the datasets you used with scores/non-performance MIDIs, which should produce similar results. And of course, if you want your model to specialize in classical, you can always post-train or fine-tune after pretraining.

Thank you again for your advice. I value your expertise.

Best, Lucas

Again, you are very welcome, Lucas. I hope it's not too much :) I would be happy to elaborate on any of it if you have time. Otherwise, feel free to reach out at any time. I am always happy to help, or even do a small collaboration if needed.

Sincerely,

Alex.

Hey Alex,

I trained a new model, if you're interested in taking a look.

This time I used GDN, arriving at that decision through a series of architectural tests on small configurations, about 10 million parameters each. Across the variants I tested, including Attention + CfC, Pure Attention, GDN + CfC, and pure GDN, all at a similar size (~10M parameters, though there was some variation) and trained on fifteen thousand pieces from your Godzilla MIDI dataset, I found that GDN had the best loss when also considering training speed and feasibility. CfC was possibly beneficial when combined with GDN, but the results weren't conclusive due to poor optimization, which prevented me from doing much training at all. CfC can be optimized so that its speed impact is not as great; however, I decided to leave it out of this model, aiming at just getting something that sounds intentional.

I’d really appreciate any feedback if you have time. Your dataset was a major part of making this possible.

Best,

Lucas

@Chickaboo Hey Lucas!

Fantastic work! And I am very happy that my dataset was helpful :)

I would love to check it out so please feel free to send me a link and/or inference code! :)

So GDN performed best? How did it compare to CFC?

Also, do you have any generated samples?

Let me know.

Alex.

Thanks for the response!

Here's the link: https://huggingface.co/Chickaboo/Pulse88-E-40M-Alpha-Preview. There are some samples and a Colab notebook on the model card. The samples aren't perfect, but they sound obviously intentional; I think it went well overall considering the size and amount of data. Looking back at my notes, the overall best performers in my opinion were Pure Attention and GDN + Sparse Attention. Pure Attention trained faster than the GDN variant, but despite also being somewhat less accurate at the beginning, the GDN variant eventually surpassed Pure Attention on loss. Here are some of the logs if you're curious to look at them. I changed how the logging worked between variant E and variant C, which is why variant E logs every 20 steps; I trimmed it down a bit. As for CfC, I was never able to get past more than a few hundred steps; it was way slower than either of the others. The more testing I do, the more I realize that CfC, even if it has the potential to bring some quality (though that is still theoretical), is not feasible to train.

Architectural Variant E (GDN + Sparse Attention)

[17:52:34] [INFO] step=000020 train_loss=5.0427 lr=1.200000e-05
[17:53:03] [INFO] step=000040 train_loss=5.0332 lr=2.400000e-05
[17:53:31] [INFO] step=000060 train_loss=5.0126 lr=3.600000e-05
[17:53:59] [INFO] step=000080 train_loss=4.9929 lr=4.800000e-05
[17:54:27] [INFO] step=000100 train_loss=4.9758 lr=6.000000e-05
[17:54:55] [INFO] step=000120 train_loss=4.9564 lr=7.200000e-05
[17:55:23] [INFO] step=000140 train_loss=4.9346 lr=8.400000e-05
[17:55:52] [INFO] step=000160 train_loss=4.9076 lr=9.600000e-05
[17:56:20] [INFO] step=000180 train_loss=4.8770 lr=1.080000e-04
[17:56:48] [INFO] step=000200 train_loss=4.8397 lr=1.200000e-04
[17:57:16] [INFO] step=000220 train_loss=4.7895 lr=1.320000e-04
[17:57:44] [INFO] step=000240 train_loss=4.7371 lr=1.440000e-04
[17:58:12] [INFO] step=000260 train_loss=4.6725 lr=1.560000e-04
[17:58:40] [INFO] step=000280 train_loss=4.6072 lr=1.680000e-04
[17:59:09] [INFO] step=000300 train_loss=4.5364 lr=1.800000e-04
[17:59:37] [INFO] step=000320 train_loss=4.4621 lr=1.920000e-04
[18:00:05] [INFO] step=000340 train_loss=4.3944 lr=2.040000e-04
[18:00:33] [INFO] step=000360 train_loss=4.3188 lr=2.160000e-04
[18:01:01] [INFO] step=000380 train_loss=4.2458 lr=2.280000e-04
[18:03:17] [INFO] Epoch 001/100 | train_loss=4.7291 | val_loss=4.1362 | ppl=62.56 | time=1078.4s
[18:03:17] [INFO] Epoch 002 start | train_batches=795 | est_optimizer_steps=398 | log_every=20
[18:03:20] [INFO] step=000400 train_loss=4.1480 lr=2.400000e-04
[18:12:13] [INFO] step=000780 train_loss=3.3004 lr=2.999662e-04
[18:13:29] [INFO] Epoch 002/100 | train_loss=3.6342 | val_loss=3.2495 | ppl=25.78 | time=612.1s
[18:13:29] [INFO] Epoch 003 start | train_batches=795 | est_optimizer_steps=398 | log_every=20
[18:23:42] [INFO] Epoch 003/100 | train_loss=3.0409 | val_loss=2.8570 | ppl=17.41 | time=613.1s
[18:23:42] [INFO] Epoch 004 start | train_batches=795 | est_optimizer_steps=398 | log_every=20
[18:27:36] [INFO] step=001360 train_loss=2.7900 lr=2.996811e-04
[18:28:04] [INFO] step=001380 train_loss=2.7895 lr=2.996661e-04

Architectural variant C (Pure Attention)

Trainer DataParallel disabled for this run.
Train samples: 12709
Val samples: 1395
Batch size: 16
Workers: 0
Steps per epoch: 795
[03:53:20] [INFO] Epoch 001 start | train_batches=795 | est_optimizer_steps=398 | log_every=100
Forward pass OK - logits shape: torch.Size([16, 1024, 155])
Trainer live loss logs appear every 100 optimizer steps.
Checkpoint path: DataParallel active -> saving/loading unwrapped module state_dict (model.module).
Checkpoint save verification: DataParallel path -> unwrap==module: True
/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py:865: UserWarning: Memory Efficient attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:897.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[03:53:44] [INFO] step=000100 train_loss=4.9946 lr=6.000000e-05
[03:54:08] [INFO] step=000200 train_loss=4.8599 lr=1.200000e-04
[03:54:32] [INFO] step=000300 train_loss=4.5913 lr=1.800000e-04
[03:55:13] [INFO] Epoch 001/100 | train_loss=4.6792 | val_loss=4.0819 | ppl=59.26 | time=113.2s
[03:55:13] [INFO] Epoch 002 start | train_batches=795 | est_optimizer_steps=398 | log_every=100
[03:55:14] [INFO] step=000400 train_loss=4.0916 lr=2.400000e-04
[03:55:38] [INFO] step=000500 train_loss=3.9297 lr=3.000000e-04
[03:56:03] [INFO] step=000600 train_loss=3.6685 lr=2.999957e-04
[03:56:28] [INFO] step=000700 train_loss=3.4969 lr=2.999827e-04
[03:57:07] [INFO] Epoch 002/100 | train_loss=3.6225 | val_loss=3.3050 | ppl=27.25 | time=113.9s
[03:57:07] [INFO] Epoch 003 start | train_batches=795 | est_optimizer_steps=398 | log_every=100
[03:57:08] [INFO] step=000800 train_loss=3.3228 lr=2.999612e-04
[03:57:33] [INFO] step=000900 train_loss=3.2589 lr=2.999310e-04
[03:57:57] [INFO] step=001000 train_loss=3.1700 lr=2.998922e-04
[03:58:23] [INFO] step=001100 train_loss=3.0838 lr=2.998447e-04
[03:59:01] [INFO] Epoch 003/100 | train_loss=3.1343 | val_loss=2.9495 | ppl=19.10 | time=114.4s
[03:59:01] [INFO] Epoch 004 start | train_batches=795 | est_optimizer_steps=398 | log_every=100
[03:59:03] [INFO] step=001200 train_loss=2.9934 lr=2.997887e-04

Architectural variant A (GDN + CFC + Sparse Attention)

 [INFO] Epoch 001/100 | train_loss=4.7141 | val_loss=4.1087 | ppl=60.87 | time=6665.5s

If you compare that to the first epoch of the simpler GDN + Sparse Attention variant, the quality numbers look okay, but it also took roughly 6x longer to train.

[18:03:17] [INFO] Epoch 001/100 | train_loss=4.7291 | val_loss=4.1362 | ppl=62.56 | time=1078.4s

Lucas.

@Chickaboo Thank you for sharing this. It's very interesting and informative :)

Please give me some time to process all of this and also to evaluate your model/code. I will let you know soon what I think in detail.

But here is some feedback right away for you...

GDN + sparse is a good combo, I think; you are going in the right direction. However, your losses are way too high. With the Godzilla dataset you should expect a CE loss of ~0.5 and accuracy in the mid-80s (0.85). I am not sure why you are getting a loss of 2.5-3, assuming it's CE loss. One possible issue is the model size (too small), and another may be that you train with the full velocity range (velocities usually degrade performance). Also, check the sparse attention; it's much harder to train and fit properly, so maybe that is what's causing the high loss.

As a reference, look at my Godzilla Piano Transformer logs: https://huggingface.co/asigalov61/Godzilla-Piano-Transformer/tree/main/logs
And also plot the token embeddings. If the model is fitted well, they should show clear structure.

I also wanted to give you some reference MIDI samples which you can use in your work.
https://huggingface.co/datasets/asigalov61/misc-and-temp/blob/main/Reference%20Samples.zip

These are from MuseNet and my Orpheus, with MuseNet still showing the best results, so use the MuseNet samples as ground truth.

Last but not least, I listened to some of the samples that you posted in the model repo. Thanks for Blue Bird :) and the other ones.

Alex.

Alex,

Thank you so much for all of the feedback; this is also very interesting and extremely informative.

I'm going to look into everything you mentioned. Hopefully I'll have another model done sometime soon that will be much improved thanks to this advice. I looked at your MIDI samples; the Bluebird continuation by the Orpheus Music Transformer hits it out of the park.

Lucas.

@Chickaboo You are very welcome, Lucas! :)

Please continue your music dev work. It's very useful and valuable to someone like myself.

And I will try to give you more feedback soon. I am just a bit busy with a few projects today.

Alex
