Open in Colab

Drum Generation Transformer

Welcome to this partitura tutorial notebook! In this tutorial, we will learn how to train a small auto-regressive transformer model for simplified drum beat generation in MIDI. At the end of this tutorial, we will be able to create a drum beat of a bar’s length and investigate some useful techniques for encoding MIDI as input to a transformer.

Setup

Install packages

If you run this notebook locally, we assume that you have cloned the full tutorial repository from https://github.com/CPJKU/partitura_tutorial.git, that you are running the notebook from its parent path, and that partitura and the other dependencies are installed. Partitura is available on GitHub at https://github.com/CPJKU/partitura and can be installed with:

pip install partitura

If any other imports fail, please install the missing packages in your local environment. If you run this notebook in Colab, partitura and the other dependencies are installed automatically.

[1]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    !pip install partitura
    !git clone https://github.com/cpjku/partitura_tutorial
    import sys
    sys.path.insert(0, "./partitura_tutorial/notebooks/04_generation/")

Imports

Import the required packages.

[2]:
%matplotlib inline
import numpy as np
import pandas as pd
import random
import time
import glob
import matplotlib.pyplot as plt
import copy
import pickle
import warnings
warnings.simplefilter("ignore", UserWarning)

import torch
from torch import nn, optim
from torch.utils.data import Dataset, ConcatDataset, DataLoader
from torch.nn import TransformerEncoder, TransformerEncoderLayer
import os
if not IN_COLAB:
    os.environ['KMP_DUPLICATE_LIB_OK']='True'
import partitura as pt

External files

Whether you run locally or in Colab, the helper functions are imported from generation_helpers.py in the cloned tutorial repository (in Colab, the repository was cloned and added to sys.path in the setup cell above).

[3]:
from generation_helpers import (
    INV_PITCH_DICT_SIMPLE,
    tokens_2_notearray,
    save_notearray_2_midifile,
    generate_tokenized_data,
    batch_data,
    PositionalEncoding,
    Transformer,
    sample_from_logits,
    sample_loop
    )

Data Loading

In this tutorial we work with the Groove MIDI Dataset, which contains MIDI drum grooves from a multitude of genres. For simplicity, we load only the files in 4/4 time. For loading we use partitura (a short loading sketch follows the next cell). If you prefer to use a precomputed and preprocessed dataset, you can skip the next cells and continue at the last cell before section 4, where it is loaded from disk.

[4]:
if IN_COLAB:
    directory = os.path.join("./partitura_tutorial/notebooks/04_generation", "./groove-v1.0.0-midionly")
else:
    directory = os.path.join(os.getcwd(), "./groove-v1.0.0-midionly")

typ_44 = [os.path.basename(f) for f in list(
    glob.glob(directory + "/groove/*/*/*.mid"))
          if "4-4" in f]

beat_type = [g.split("_")[-2] for g  in typ_44]
tempo = [int(g.split("_")[-3]) for g  in typ_44]
vals, counts = np.unique(tempo, return_counts = True)
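Before going further, here is a minimal sketch (not part of the original pipeline) of how a single groove file can be parsed with partitura; the load_data function defined below does essentially this for every file:

# Sketch: parse one groove MIDI file with partitura and inspect its note array.
fn = sorted(glob.glob(directory + "/groove/*/*/*.mid"))[0]
performance = pt.load_performance_midi(fn)   # a Performance object
ppart = performance[0]                       # the single performed (drum) part
na = ppart.note_array()                      # structured numpy array of notes
print(ppart.ppq)
print(na[["onset_tick", "duration_tick", "pitch", "velocity"]][:5])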

Histogram of tempos marked in file names

Let’s take a closer look at the data by plotting the histogram of tempi from the target pieces.

[5]:
plt.plot(vals, counts, c = "r")
plt.hist(np.array(tempo), bins= 60)
[5]:
(array([  1.,   0.,  10.,  13.,  15.,  18.,  19.,  36.,  42.,  24., 104.,
        110.,  85.,  69.,   8.,  69.,  91., 102.,  82.,  39.,  51.,  16.,
         60.,  42.,  10.,   1.,   0.,   2.,   0.,   0.,   2.,   1.,   3.,
          3.,   0.,   1.,   0.,   1.,   0.,   0.,   0.,   7.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   1.]),
 array([ 50.,  54.,  58.,  62.,  66.,  70.,  74.,  78.,  82.,  86.,  90.,
         94.,  98., 102., 106., 110., 114., 118., 122., 126., 130., 134.,
        138., 142., 146., 150., 154., 158., 162., 166., 170., 174., 178.,
        182., 186., 190., 194., 198., 202., 206., 210., 214., 218., 222.,
        226., 230., 234., 238., 242., 246., 250., 254., 258., 262., 266.,
        270., 274., 278., 282., 286., 290.]),
 <BarContainer object of 60 artists>)
[Figure: histogram of tempi extracted from the file names]

The distribution is roughly Gaussian, with a mean around 100 beats per minute. Now let's define the load_data function.

[6]:
def load_data(
      directory = "./groove-v1.0.0-midionly",
      min_seq_length=10,
      time_sig="4-4",
      beat_type = None
              ):
    """
    loads groove dataset data from directory
    into a list of dictionaries containing the
    note array as well as note sequence metadata.

    Args:
        directory: dir of the dataset
        min_seq_length: minimal sequence length;
            all shorter performances are discarded
        time_sig: only performances of this time sig
            are loaded
        beat_type: type of beat to load ("beat",
            "fill", or None = both)

    Returns:
        a list of dicts containing note arrays.

    """
    # load data
    files = glob.glob(directory + "/groove/*/*/*.mid")
    files.sort()
    sequences = []
    if beat_type is None:
        beat_types = ["beat", "fill"]
    else:
        beat_types = [beat_type]
    for fn in files:
        bn = os.path.basename(fn)
        bns = bn.split("_")
        tempo = int(bns[-3])
        bt = bns[-2]
        ts = bns[-1].split('.')[-2]
        if ts == time_sig and bt in beat_types:
            seq = pt.load_performance_midi(fn)[0]
            if len(seq.notes) > min_seq_length:
                na = seq.note_array()
                namax = (na['onset_tick'] + na['duration_tick']).max()
                namin = (na['onset_tick']).min()
                dur_in_q = (namax - namin)/seq.ppq
                seq_object = {
                    "id": bn,
                    "na": na,
                    "ppq": seq.ppq,
                    "tempo": tempo,
                    "beat_type": bt,
                    "namax": namax,
                    "namin": namin,
                    "dur_in_q": dur_in_q
                }
                sequences.append(seq_object)
    return sequences

Call the loading function (might take a moment)

[7]:
seqs = load_data(directory = directory)

Histogram of performance durations

As another look at the loaded data, we plot the histogram of performance durations (measured in quarter notes).

[8]:
dur_in_quarters = [k["dur_in_q"] for k in seqs if k["beat_type"] == "beat"]
[9]:
plt.hist(dur_in_quarters, bins = 60)
liq = np.array(dur_in_quarters)
print((liq >= 4).sum(), (liq >= -4).sum())
470 484
[Figure: histogram of performance durations in quarter notes]

3. Data Preprocessing: Segmentation & Tokenization

Reducing and Encoding the MIDI Sound Profiles

In this reduced drum generation setting, we encode only 6 main sound families from a list of 22 standard percussion MIDI pitches.

Here is a table showing the typical sound-to-pitch association for drums in MIDI and the reduced sound family we map each pitch to (the same mapping is implemented in PITCH_DICT_SIMPLE below):

| MIDI pitch | Standard associated sound | Reduced sound |
|------------|---------------------------|---------------|
| 36         | Kick                      | Kick          |
| 38         | Snare (Head)              | Snare         |
| 40         | Snare (Rim)               | Snare         |
| 37         | Snare (X-Stick)           | Snare         |
| 48         | Tom 1                     | Tom (generic) |
| 50         | Tom 1 (Rim)               | Tom (generic) |
| 45         | Tom 2                     | Tom (generic) |
| 47         | Tom 2 (Rim)               | Tom (generic) |
| 43         | Tom 3                     | Tom (generic) |
| 58         | Tom 3 (Rim)               | Tom (generic) |
| 46         | Hi-Hat Open (Bow)         | Hi-Hat        |
| 26         | Hi-Hat Open (Edge)        | Hi-Hat        |
| 42         | Hi-Hat Closed (Bow)       | Hi-Hat        |
| 22         | Hi-Hat Closed (Edge)      | Hi-Hat        |
| 44         | Hi-Hat Pedal              | Hi-Hat        |
| 49         | Crash 1                   | Crash         |
| 55         | Crash 1                   | Crash         |
| 57         | Crash 2                   | Crash         |
| 52         | Crash 2                   | Crash         |
| 51         | Ride (Bow)                | Ride          |
| 59         | Ride (Edge)               | Ride          |
| 53         | Ride (Bell)               | Ride          |
| Any other  | Default                   | Default       |

In summary, we map every standard MIDI drum sound to its sound family, i.e. all snare-type sounds to a single snare sound, and so on.

Reducing the MIDI Velocity Encoding

For MIDI velocity values (0–127) we encode 8 classes of velocity by applying an integer (floor) division:

\[newVel = \lfloor oldVel / 2^4 \rfloor\]

Since \(newVel\) is rounded down to an integer, the 128 possible velocity values collapse into 8 classes (0–7).
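As a quick numerical check (just an illustration of the formula above):

# MIDI velocities are 0-127, so integer division by 2**4 yields classes 0-7.
for v in (1, 64, 100, 127):
    print(v, "->", v // 2 ** 4)   # 0, 4, 6, 7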

Reducing the Tempo Encoding

Similarly, we reduce the different tempi found in the training data into 8 classes.
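Concretely, the tempo_encoder defined below clips tempi to the range 60–179 BPM and assigns 15-BPM-wide bins; a quick check:

# Tempi outside 60-179 BPM are clipped, then binned into 8 classes of 15 BPM each.
for bpm in (50, 60, 100, 140, 200):
    print(bpm, "->", (np.clip(bpm, 60, 179) - 60) // 15)   # 0, 0, 2, 5, 7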

Beat or Fill Characterization

We also add a flag indicating whether each measure of the data is a regular beat or a fill/solo part.

Reducing the Time Encoding

For the time encoding we use a hierarchical tree division in base 2. We start by splitting the bar in two, i.e. into the first and second half note of the 4/4 bar. We continue with quarter notes, eighth notes, and so on, down to 128th notes, so every hierarchical level is encoded by a single binary value.

For example, the binary hierarchical representation of the second 16th note of the bar would be:

| Half note | Quarter note | 8th note | 16th note | 32nd note | 64th note | 128th note |
|-----------|--------------|----------|-----------|-----------|-----------|------------|
| 0         | 0            | 0        | 1         | 0         | 0         | 0          |
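To make the arithmetic concrete, here is a small sketch of the base-2 decomposition (mirroring the time_encoder defined in the next cell), assuming a resolution of 480 ticks per quarter note:

ppq = 480                     # ticks per quarter note (illustrative value)
onset = ppq // 4              # the second 16th note starts one 16th after the downbeat
bits = []
for i in range(7):            # half, quarter, 8th, 16th, 32nd, 64th, 128th
    level = ppq * 2 ** (1 - i)
    bits.append(int(onset // level))
    onset = onset % level
print(bits)                   # [0, 0, 0, 1, 0, 0, 0]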

[10]:
PITCH_DICT_SIMPLE = {
    36:0, #Kick
    38:1, #Snare (Head)
    40:1, #Snare (Rim)
    37:1, #Snare X-Stick
    48:2, #Tom 1
    50:2, #Tom 1 (Rim)
    45:2, #Tom 2
    47:2, #Tom 2 (Rim)
    43:2, #Tom 3
    58:2, #Tom 3 (Rim)
    46:3, #HH Open (Bow)
    26:3, #HH Open (Edge)
    42:3, #HH Closed (Bow)
    22:3, #HH Closed (Edge)
    44:3, #HH Pedal
    49:4, #Crash 1
    55:4, #Crash 1
    57:4, #Crash 2
    52:4, #Crash 2
    51:5, #Ride (Bow)
    59:5, #Ride (Edge)
    53:5, #Ride (Bell)
    'default':6
}

def pitch_encoder(pitch, pitch_dict = PITCH_DICT_SIMPLE):
    # 22 known drum pitches in the dataset; unknown pitches fall back to the default class
    a = pitch_dict['default']
    try:
        a = pitch_dict[pitch]
    except KeyError:
        print("unknown instrument")
    return a

def time_encoder(time_div, ppq):
    # base 2 encoding of time, starting at half note, ending at 128th
    power_two_classes = list()
    for i in range(7):
        power_two_classes.append(int(time_div // (ppq * (2 **(1-i)))))
        time_div = time_div % (ppq * 2 **(1-i))
    return power_two_classes

def velocity_encoder(vel):
    # 8 classes of velocity
    return vel // 2 ** 4

def tempo_encoder(tmp):
    # 8 classes of tempo between 60 and 180
    return (np.clip(tmp, 60, 179) - 60) // 15

def tokenizer(seq):
    tokens = list()
    na = seq["na"]
    fill_encoding = 0
    if seq["beat_type"] == "fill":
        fill_encoding = 1
    tempo_encoding = tempo_encoder(seq["tempo"])

    for note in na:
        te = time_encoder(note["onset_tick"]%(seq['ppq']*4), seq['ppq'])
        pe = pitch_encoder(note["pitch"])
        ve = velocity_encoder(note["velocity"])

        tokens.append(te + [pe, ve, tempo_encoding, fill_encoding])

    return tokens

def measure_segmentation(seq, beats = 4, minimal_notes = 1):
    mod = beats * seq['ppq']
    no_of_measures = int(seq["namax"] // mod + 1)

    segmented_seq = list()
    for measure_idx in range(no_of_measures):
        na = seq["na"]
        new_na = np.copy(na[na["onset_tick"] // mod == measure_idx])
        if len(new_na) >= minimal_notes:
            new_seq = copy.copy(seq)
            new_seq["na"] = new_na
            segmented_seq.append(new_seq)
        else:
            continue
    return segmented_seq
[11]:
# run the preprocessing functions on the first sequence to inspect their outputs.
t = tokenizer(seqs[0])
ss = measure_segmentation(seqs[0], minimal_notes = 18)
[12]:
print("First Tokenized Note of First sequence : {}".format(t[0]))
First Tokenized Note of First sequence : [0, 0, 1, 1, 1, 0, 1, 0, 3, 3, 0]
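Each token is a list of 11 integers: the 7-level time encoding, followed by the pitch class, velocity class, tempo class, and the beat/fill flag. A quick way to label the fields (the field names here are just for illustration):

fields = ["half", "quarter", "8th", "16th", "32nd", "64th", "128th",
          "pitch_class", "velocity_class", "tempo_class", "is_fill"]
print(dict(zip(fields, t[0])))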

Training set generation (might take a moment)

[13]:
train_data = generate_tokenized_data(seqs,
                                     measure_segmentation,
                                     tokenizer,
                                     minimal_notes = 20)
train_dataloader = batch_data(train_data,
                              batch_size=128)
76 batches of size 128
[14]:
## try uncommenting the next line to inspect the shape of a batch.
# train_dataloader[0].shape # batch, max sequence length, num_tokens

Saving and loading a preprocessed training set

[15]:
# with open('./dataset129.pyc', 'wb') as fh:
#     pickle.dump(train_dataloader, fh)
[16]:
# with open('./dataset129.pyc', 'rb') as fh:
#     train_dataloader = pickle.load(fh)

4. Model

[Figure: model architecture overview]

The embedding input and output dimensions of our model follow the reduced representations described above. The input consists of the 7-level binary hierarchical timing representation, the instrument (6 sound families plus a default class), the velocity (8 classes), the tempo (8 classes), and the binary beat/fill flag. For the output, every one of those fields additionally contains a START and an END token. Therefore, the binary representations grow from 2-dimensional to 4-dimensional one-hot vectors, where the third position is the START token and the fourth position is the END token.

For consistency we use the same field dimensions for the input embeddings of the model. Of course, these encoding choices are a matter of design and open to personal interpretation.

[17]:
class MultiEmbedding(nn.Module):
    '''One learnable embedding per token field; the embedded fields are concatenated.'''
    def __init__(self, tokens2dims):
        super().__init__()
        # self.em_time0 = nn.Embedding(num_tokens, dim_model)
        self.em_time0 = nn.Embedding(tokens2dims[0][0], tokens2dims[0][1])
        self.em_time1 = nn.Embedding(tokens2dims[1][0], tokens2dims[1][1])
        self.em_time2 = nn.Embedding(tokens2dims[2][0], tokens2dims[2][1])
        self.em_time3 = nn.Embedding(tokens2dims[3][0], tokens2dims[3][1])
        self.em_time4 = nn.Embedding(tokens2dims[4][0], tokens2dims[4][1])
        self.em_time5 = nn.Embedding(tokens2dims[5][0], tokens2dims[5][1])
        self.em_time6 = nn.Embedding(tokens2dims[6][0], tokens2dims[6][1])
        self.em_pitch = nn.Embedding(tokens2dims[7][0], tokens2dims[7][1])
        self.em_velocity = nn.Embedding(tokens2dims[8][0], tokens2dims[8][1])
        self.em_tempo = nn.Embedding(tokens2dims[9][0], tokens2dims[9][1])
        self.em_beat = nn.Embedding(tokens2dims[10][0], tokens2dims[10][1])
        # self.total_dim = np.array([[2,2,2,2,2,2,2, # time encoding
        #                  2,2,2,2]]).sum() # instrument, velocity, tempo, beat/fill
    def forward(self, x):
        '''Embed each of the 11 token fields and concatenate along the feature dimension.'''

        output = torch.cat((
        self.em_time0(x[:,:,0]),
        self.em_time1(x[:,:,1]),
        self.em_time2(x[:,:,2]),
        self.em_time3(x[:,:,3]),
        self.em_time4(x[:,:,4]),
        self.em_time5(x[:,:,5]),
        self.em_time6(x[:,:,6]),
        self.em_pitch(x[:,:,7]),
        self.em_velocity(x[:,:,8]),
        self.em_tempo(x[:,:,9]),
        self.em_beat(x[:,:,10])),
        dim=-1)
        return output
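As a quick sanity check (a sketch using the tokens2dims values defined in the next section), the dimension of the concatenated embedding is the sum of the per-field embedding dimensions:

# Embed a dummy batch of 2 sequences with 5 tokens each; 11 integer fields per token.
t2d = [(4, 8), (4, 8), (4, 8), (4, 4), (4, 4), (4, 4), (4, 4),
       (8, 16), (10, 12), (10, 12), (4, 4)]
emb = MultiEmbedding(t2d)
dummy = torch.zeros(2, 5, 11, dtype=torch.long)
print(emb(dummy).shape)   # torch.Size([2, 5, 84]); 84 = 3*8 + 4*4 + 16 + 12 + 12 + 4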

5. Training

[18]:
def train_loop(model, opt, loss_fn, dataloader, tokens2dims):
    """
    """
    t2d = np.array(tokens2dims)
    pred_dims = np.concatenate(([0],np.cumsum(t2d[:,0])))
    model.train()
    total_loss = 0

    for batch in dataloader:
        y = batch
        y = torch.tensor(y).to(device)

        # Now we shift the tgt by one so with the <SOS> we predict the token at pos 1
        y_input = y[:,:-1,:]
        y_expected = y[:,1:,:]

        # Get mask to mask out the next words
        sequence_length = y_input.size(1)
        tgt_mask = model.get_tgt_mask(sequence_length).to(device)

        # Standard training except we pass in y_input and tgt_mask
        pred = model(y_input, tgt_mask)

        # Permute pred to have batch size first again, that is batch / logits / sequence
        pred = pred.permute(1, 2, 0)
        loss = 0
        for k in range(11):
            # batch / logits / sequence ----- batch / sequence / tokens
            loss += loss_fn(pred[:,pred_dims[k]:pred_dims[k+1],:], y_expected[:,:,k])

        opt.zero_grad()
        loss.backward()
        opt.step()

        total_loss += loss.detach().item()

    return total_loss / len(dataloader)
[19]:
def fit(model, opt, loss_fn, train_dataloader, t2d, epochs):
    """
    """
    train_loss_list = []

    print("Training and validating model")
    for epoch in range(epochs):
        print("-"*25, f"Epoch {epoch + 1}","-"*25)

        train_loss = train_loop(model, opt, loss_fn, train_dataloader, t2d)
        train_loss_list += [train_loss]



        print(f"Training loss: {train_loss:.4f}")
        print()

    return train_loss_list

Initialize Model

The model figure above sets a clear direction for the input and output dimensions of the model, so we just need to initialize it accordingly. The core model we use is a basic Transformer that is already implemented in PyTorch. The Transformer architecture itself is not the focus of this tutorial, so we load it from a helper file.

[20]:
tokens2dims = [
            (4, 8),
            (4, 8),
            (4, 8),
            (4, 4),
            (4, 4),
            (4, 4),
            (4, 4),
            (8, 16),
            (10, 12),
            (10, 12),
            (4, 4)
        ]

model_spec = dict(
    tokens2dims=tokens2dims,
    num_heads=6,
    num_decoder_layers=6,
    dropout_p=0.1
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Transformer(
    tokens2dims = model_spec["tokens2dims"],
    MultiEmbedding = MultiEmbedding,
    num_heads = model_spec["num_heads"],
    num_decoder_layers = model_spec["num_decoder_layers"],
    dropout_p = model_spec["dropout_p"],
).to(device)

opt = torch.optim.Adam(model.parameters(), lr=0.002)
loss_fn = nn.CrossEntropyLoss()

pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Number of parameters in model: ", pytorch_total_params)
Number of parameters in model:  264700

6. LOOP (only run to retrain / fine-tune)

[21]:
"""
EPOCH = 1
PATH = "modelname.pt"
LOSS = train_loss_list[-1]
MODEL_SPEC = model_spec

torch.save({
            'epoch': EPOCH,
           'model_state_dict': model.state_dict(),
            'optimizer_state_dict': opt.state_dict(),
            'loss': LOSS,
            'model_spec': MODEL_SPEC,
            }, PATH)
"""
[22]:
if IN_COLAB:
    PATH = os.path.join("partitura_tutorial/notebooks/04_generation", "Drum_Transformer_Checkpoint.pt")
else:
    PATH = "Drum_Transformer_Checkpoint.pt"

checkpoint = torch.load(PATH, map_location=torch.device(device))
model.load_state_dict(checkpoint['model_state_dict'])
opt.load_state_dict(checkpoint['optimizer_state_dict'])
[23]:
train_loss_list = fit(model, opt, loss_fn, train_dataloader, tokens2dims, 1)
Training and validating model
------------------------- Epoch 1 -------------------------
Training loss: 3.6172

7. SAMPLING

To generate a drum beat we call sample_loop with the model. The sampling loop starts with a START token in all fields (timing, tempo, etc.) and autoregressively predicts one note at a time until a pre-specified number of notes has been generated.

[24]:
X = sample_loop(model, tokens2dims, device)
[25]:
df = pd.DataFrame(X.cpu().numpy()[0,1:,:],
                  columns=['Time Half Note',
                           'Time Quarter Note',
                           'Time 8th Note',
                           'Time 16th Note',
                           'Time 32nd Note',
                           'Time 64th Note',
                           'Time 128th Note',
                           'Pitch / Instrument',
                           'Velocity',
                           'Tempo',
                           'Beat type'])
df
[25]:
Time Half Note Time Quarter Note Time 8th Note Time 16th Note Time 32nd Note Time 64th Note Time 128th Note Pitch / Instrument Velocity Tempo Beat type
0 0 0 0 1 0 0 0 1 4 3 0
1 0 0 0 1 0 1 1 0 5 3 0
2 0 0 0 1 0 1 0 3 4 3 0
3 0 0 0 1 0 0 1 1 6 3 0
4 0 0 0 1 0 1 1 3 5 3 0
5 1 0 0 1 1 1 1 1 7 3 0
6 1 0 1 0 0 1 0 0 5 3 0
7 1 1 1 1 0 0 0 1 4 3 0
8 1 1 1 1 1 1 1 0 4 3 0
9 1 1 1 1 1 1 1 1 7 3 0
10 3 3 3 3 3 3 3 7 9 9 3
11 3 3 3 3 3 3 3 7 9 9 3
12 3 3 3 3 3 3 3 7 9 9 3
13 3 3 3 3 3 3 3 7 9 9 3
14 3 3 3 3 3 3 3 7 9 9 3
15 3 3 3 3 3 3 3 7 9 9 3
16 3 3 3 3 3 3 3 7 9 9 3
17 3 3 3 3 3 3 3 7 9 9 3
18 3 3 3 3 3 3 3 7 9 9 3
19 3 3 3 3 3 3 3 7 9 9 3
20 3 3 3 3 3 3 3 7 9 9 3
21 3 3 3 3 3 3 3 7 9 9 3
22 3 3 3 3 3 3 3 7 9 9 3
23 3 3 3 3 3 3 3 7 9 9 3
24 3 3 3 3 3 3 3 7 9 9 3
25 3 3 3 3 3 3 3 7 9 9 3
26 3 3 3 3 3 3 3 7 9 9 3
27 3 3 3 3 3 3 3 7 9 9 3
28 3 3 3 3 3 3 3 7 9 9 3
29 3 3 3 3 3 3 3 7 9 9 3
30 3 3 3 3 3 3 3 7 9 9 3

We can see that the sample loop predicts a fixed number of tokens (apart from the START token), but only the first rows correspond to actual MIDI notes; in this run, from row 10 onward we only get END tokens.
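If needed, the actual notes can be separated from the trailing END tokens by looking at the first field (as described in the model section, index 3 is the END token); a small sketch:

# Count the generated notes before the first END token (index 3 in the time fields).
XX = X.cpu().numpy()[0, 1:, :]
end_rows = XX[:, 0] == 3
n_notes = int(end_rows.argmax()) if end_rows.any() else len(XX)
print(n_notes, "notes generated")   # 10 for the sequence shown above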

Finally, we can save generated MIDI files from the sampling process.

[26]:
for k in range(16):
    XX = X.cpu().numpy()[k,1:,:]
    na = tokens_2_notearray(XX)
    save_notearray_2_midifile(na, k, fn = "PartituraTutorialBeats")

Conclusion

This is the end of the Tutorial for drum beat generation using partitura.

To play the MIDI files you can import them into your DAW of choice and load a drum VSTi, or create your own drum sampler.

In this tutorial, we learned how to generate MIDI drum beats using partitura and an autoregressive Transformer model. We investigated some possibilities for encoding and tokenizing timing, instrument, dynamics, and tempo.

Open in Colab

[ ]: