At the beginning of October, my partner Aida and I released a Twitter bot, LnH AI: The Band. This hobby project of ours is a music bot that composes music on demand, based on tweets that users send to it. It is powered by Deep Learning models that we developed over the past few months, and it can compose music in a few genres. For more details on how it works you can refer to the FAQ page (http://lnh-music.ymer.org/faq/), but this blog post is about how we created the machine learning dataset needed for the project.
The models that we wanted to create used MIDI files from a specific genre as a training set. A MIDI file (Musical Instrument Digital Interface file) is a set of descriptions and instructions for how a musical piece is produced by instruments and synthesizers. It does not contain the sounds themselves; instead it records the events that, if played on a musical instrument, would produce the music.
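To make the "instructions, not sound" idea concrete, here is a minimal sketch that reads only the header chunk of a Standard MIDI File using the Python standard library. The function name `read_midi_header` and the hand-built sample bytes are illustrative assumptions, not part of our project code; a real pipeline would more likely use a dedicated MIDI library.

```python
import struct


def read_midi_header(data: bytes) -> dict:
    """Parse the 14-byte header chunk at the start of a Standard MIDI File."""
    if len(data) < 14 or data[:4] != b"MThd":
        raise ValueError("not a Standard MIDI File")
    # Big-endian layout: chunk length (uint32), then format, track count
    # and time division (uint16 each).
    length, fmt, ntracks, division = struct.unpack(">IHHH", data[4:14])
    if length < 6:
        raise ValueError("malformed header chunk")
    return {"format": fmt, "tracks": ntracks, "division": division}


# Hand-built header for illustration: format 1, 2 tracks,
# 480 ticks per quarter note.
sample = b"MThd" + struct.pack(">IHHH", 6, 1, 2, 480)
print(read_midi_header(sample))
```

A check like this can also serve as a cheap sanity test on scraped files, since anything that does not start with `MThd` is not a MIDI file.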
There are many websites that host MIDI files for various artists and genres of music. The website that we obtained most of our training data from was FreeMidi (https://freemidi.org/). It has a separate page for each genre (e.g. https://freemidi.org/genre-jazz for Jazz). For a particular genre, all we wanted was to gather the MIDI files from every artist listed under it, so that we could use those files to train LnH AI.
Create a Machine Learning Dataset
Because of how the data is organized on the FreeMidi website, we had to build our machine learning dataset in two stages: first we gathered links to all the bands within a genre, and then gathered links for all the MIDI files from all those bands.
1. All the bands within a genre
Let’s start from the jazz genre page: https://freemidi.org/genre-jazz
I simply copied that URL and pasted it into the Import.io homepage.
Import.io automatically detected the list of all the jazz bands and pre-populated a dataset for me.
I was happy with the result and clicked on “Done” to create the Extractor. Then from the Import.io dashboard I added a few other genres to this Extractor that I was interested in.
Running the Extractor with a handful of genre URLs, I was able to get the URLs for about 753 bands from those genres. Stage one done 🙂
2. All the MIDI files for a band
Next I needed the URLs of the MIDI files for each of those bands, which meant extracting every item listed on a band page such as the one for “Black Sabbath”.
The actual download URL lives on another page, but all download URLs follow the same pattern, for example https://freemidi.org/getter-17459, where the trailing digits are the ID of the file. Conveniently, these IDs are all present on the band page, so I started by creating an extractor for a sample band.
This automatically gave me all the items available for a band. To get the IDs, I created a custom XPath-based column called “IDs” that captured the link for each item. All that remained was to extract the ID from each link and build the actual download URL, which I did with the Import.io regex functionality.
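Outside Import.io, the same extract-and-rebuild step could be sketched in Python roughly as follows. Only the `getter-<ID>` download pattern comes from the site as described above; the shape of the band-page link (`download-17459-some-song`) and the regex that matches it are hypothetical, so adjust them to whatever the XPath column actually returns.

```python
import re
from typing import Optional

# Download pattern quoted in this post.
GETTER_URL = "https://freemidi.org/getter-{file_id}"

# Assumption: the numeric ID appears in the link as a hyphen-delimited
# run of digits, e.g. "download-17459-some-song" (hypothetical format).
ID_PATTERN = re.compile(r"-(\d+)(?=-|$)")


def to_download_url(band_page_link: str) -> Optional[str]:
    """Return the download URL for a band-page link, or None if no ID is found."""
    match = ID_PATTERN.search(band_page_link)
    if match is None:
        return None
    return GETTER_URL.format(file_id=match.group(1))


print(to_download_url("download-17459-some-song"))
```

Returning `None` rather than raising keeps a batch run going when a page contains a link that does not fit the assumed pattern.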
The last column now contained the correct download links. I saved the extractor and went to the dashboard. Since I wanted to apply this extractor to all the bands identified in the first part, I set “Extract from multiple URLs” to “Use URLs from the previous extractor”.
Once this was done, I ran the extraction, which meant that Import.io’s backend did the following:
- crawled the genre pages I specified in the first part
- grabbed all the bands that were listed there
- for each of those ran the second extractor to create the final list of all download URLs.
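For readers who want to reproduce this chaining without Import.io, the three steps above amount to a nested loop. The sketch below is a generic outline, not what Import.io runs internally: `collect_download_urls` and its two callback parameters are hypothetical names, and the callbacks stand in for whatever scraping code lists bands on a genre page and files on a band page.

```python
from typing import Callable, Iterable, List


def collect_download_urls(
    genre_urls: Iterable[str],
    list_band_urls: Callable[[str], List[str]],
    list_file_urls: Callable[[str], List[str]],
) -> List[str]:
    """Chain the two stages: genre pages -> band pages -> download URLs."""
    seen = set()
    result = []
    for genre_url in genre_urls:
        for band_url in list_band_urls(genre_url):
            for file_url in list_file_urls(band_url):
                # A band can appear under several genres, so skip repeats.
                if file_url not in seen:
                    seen.add(file_url)
                    result.append(file_url)
    return result


# Stand-in data instead of real scraping, purely for illustration.
fake_genres = {"genre-a": ["band-1", "band-2"], "genre-b": ["band-2"]}
fake_bands = {"band-1": ["getter-1"], "band-2": ["getter-2", "getter-3"]}
urls = collect_download_urls(
    list(fake_genres), fake_genres.__getitem__, fake_bands.__getitem__
)
print(urls)
```

Passing the two page scrapers in as functions keeps the chaining logic testable without touching the network.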
The final result was a list of more than 11,000 MIDI file links.
Once this was done, I could use any batch download tool (curl in my case) to download all of these files. I now have a machine learning dataset that we can use to start training our models 🙂
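I used curl, but if you prefer to stay in Python, a batch download could be sketched like this with the standard library. The names `fetch_bytes` and `download_all`, the `.mid` file naming, and the 30-second timeout are all assumptions made for illustration; I have not verified that the site serves the file to a plain request without extra headers or cookies.

```python
import os
import urllib.request
from typing import Callable, Iterable


def fetch_bytes(url: str) -> bytes:
    """Download one URL and return its body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_all(
    urls: Iterable[str],
    dest_dir: str,
    fetch: Callable[[str], bytes] = fetch_bytes,
) -> int:
    """Save each URL into dest_dir and return how many new files were written."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = 0
    for url in urls:
        # Name the file after the last path segment, e.g. "getter-17459.mid".
        name = url.rstrip("/").rsplit("/", 1)[-1] + ".mid"
        path = os.path.join(dest_dir, name)
        if os.path.exists(path):
            continue  # already downloaded on an earlier run
        try:
            data = fetch(url)
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            print(f"skipped {url}: {exc}")
            continue
        with open(path, "wb") as handle:
            handle.write(data)
        saved += 1
    return saved
```

Skipping files that already exist makes the run resumable, which matters when the list holds more than 11,000 links.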