Yahoo! Labs
Datasets
Yahoo! Music offers a wealth of information and services related to many aspects of music. The released data represents a sampled snapshot of the Yahoo! Music community's preferences for various musical items. A distinctive feature of this dataset is that user ratings are given to entities of four different types: tracks, albums, artists, and genres. In addition, the items are tied together within a hierarchy. That is, for a track we know the identity of its album, performing artist, and associated genres. Similarly, we have artist and genre annotations for the albums. Both users and items (tracks, albums, artists, and genres) are represented as meaningless anonymous numbers so that no identifying information is revealed.
Noteworthy characteristics of the dataset include:
- It is larger in scale than other datasets in the field (over 300M ratings).
- It has a very large set of items (over 600K), much larger than any similar dataset, where usually only the number of users is large.
- There are four different categories of items, which are all linked together within a defined hierarchy. This is particularly important considering the large number of items.
- The timestamps provided allow session analysis of user activity.
The two tracks of the competition employ two different datasets, which we describe below.
Track 1
The dataset is split into three subsets:
- Train data: in the file trainIdx1.txt
- Validation data: in the file validationIdx1.txt
- Test data: in the file testIdx1.txt
For each subset, user rating data is grouped by user. The first line for a user is formatted as:
<UserId>|<#UserRatings>\n
Each of the next <#UserRatings> lines describes a single rating by <UserId>, sorted in chronological order. The rating line format is:
<ItemId>\t<Score>\t<Date>\t<Time>\n
The scores are integers between 0 and 100. All user IDs and item IDs are consecutive integers, both starting at zero. Dates are integers giving the number of days elapsed since an undisclosed date.
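The grouped layout above can be read one user block at a time. The following is a minimal sketch, not the organizers' tooling: the function name `iter_user_blocks` is hypothetical, the time field is kept as an opaque string because its exact format is not specified here, and the sample values are made up. It assumes rating lines that carry a score (train and validation); test lines, whose scores are withheld, would need a different split.

```python
import io
from typing import Iterable, Iterator, List, Tuple

# (item_id, score, date, time); the time field is kept as a raw string
Rating = Tuple[int, int, int, str]


def iter_user_blocks(lines: Iterable[str]) -> Iterator[Tuple[int, List[Rating]]]:
    """Yield (user_id, ratings) for each user block in the Track 1 layout."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between blocks
        user_id, n_ratings = header.split("|")
        ratings: List[Rating] = []
        for _ in range(int(n_ratings)):
            line = next(it, None)
            if line is None:
                raise ValueError(f"file ended inside the block of user {user_id}")
            item_id, score, date, time = line.rstrip("\n").split("\t")
            ratings.append((int(item_id), int(score), int(date), time))
        yield int(user_id), ratings


# Made-up sample: two users with two and one ratings respectively.
sample = (
    "0|2\n"
    "507\t90\t4119\t21:33:00\n"
    "1024\t0\t4120\t08:15:00\n"
    "1|1\n"
    "7\t50\t4200\t12:00:00\n"
)
blocks = list(iter_user_blocks(io.StringIO(sample)))
```

Streaming block by block avoids holding the whole file in memory, which matters at this scale.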
An item has at least 20 ratings in the total dataset (including train, validation, and test sets).
Each user has at least 10 ratings in the training data, all given earlier than that user's validation ratings. Each user then has exactly four ratings in the validation data, which come earlier in time than the same user's ratings in the test data. Finally, the test data holds the last 6 ratings of each user.
For the test subset, we withheld the scores. The contestants are asked to provide predictions for these test scores. The predictions should be arranged in the same order as the given test set. Each prediction is quantized to one of the 256 evenly spaced numbers between 0 and 100 (100*0/255, 100*1/255,...,100*254/255, 100*255/255). Thus, each predicted value occupies a single unsigned byte, and predicting the 6,005,940 ratings of the test set requires 6,005,940 bytes. (We provide C and Python programs for converting into well-formatted prediction files.)
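To illustrate the quantization, here is a small sketch that maps a prediction on the 0 to 100 scale to a byte value k, so that the stored number is 100*k/255. It is not the conversion program mentioned above: the names `quantize` and `dequantize` are hypothetical, and clipping out-of-range values and rounding to the nearest level are assumptions.

```python
from typing import Iterable, List


def quantize(predictions: Iterable[float]) -> bytes:
    """Map each prediction in [0, 100] to the nearest of 256 levels (one byte each)."""
    out = bytearray()
    for p in predictions:
        clipped = min(100.0, max(0.0, float(p)))  # assumption: clip to the valid range
        out.append(int(clipped * 255.0 / 100.0 + 0.5))  # round half up
    return bytes(out)


def dequantize(data: bytes) -> List[float]:
    """Recover the quantized rating 100*k/255 for each stored byte k."""
    return [100.0 * k / 255.0 for k in data]


encoded = quantize([0, 100, 50, -5, 120])
decoded = dequantize(encoded)
```

Under this scheme the output has exactly one byte per prediction, and the quantization error per rating is at most half a level, about 0.2 rating points.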
The evaluation criterion is the root mean squared error (RMSE) between predicted ratings and true ones.
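For N test ratings with predictions p_i and true scores r_i, the RMSE is sqrt((1/N) * sum_i (p_i - r_i)^2). A minimal sketch for local use, for example against the validation data, follows; the function name `rmse` is hypothetical and this is not the official scoring code.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error between two equally long rating sequences."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


# One prediction is off by 10 points, the other is exact.
example = rmse([90.0, 50.0], [100.0, 50.0])
```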
The test subset is further split into two equally sized subsets, called Test1 and Test2. Test1 is used to rank contestants on the public Leaderboard and Test2 is used for deciding the winners of the contest. The split between Test1 and Test2 is not disclosed.
The dataset statistics are as follows:
| #Users | #Items | #Ratings | #TrainRatings | #ValidationRatings | #TestRatings |
|---|---|---|---|---|---|
| 1,000,990 | 624,961 | 262,810,175 | 252,800,275 | 4,003,960 | 6,005,940 |
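The figures in this table are consistent with the split rules described earlier: four validation ratings and six test ratings per user. The short check below just redoes that arithmetic; the variable names are illustrative.

```python
# Track 1 statistics, copied from the table above.
n_users = 1_000_990
n_ratings = 262_810_175
n_train = 252_800_275
n_validation = 4_003_960
n_test = 6_005_940

# The three subsets partition the full set of ratings.
subsets_sum_to_total = (n_train + n_validation + n_test == n_ratings)

# Exactly four validation ratings and six test ratings per user.
four_validation_per_user = (n_validation == 4 * n_users)
six_test_per_user = (n_test == 6 * n_users)

# Average number of training ratings per user (minimum is 10).
mean_train_per_user = n_train / n_users
```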
Track 2
The dataset is split into two subsets:
- Train data: in the file trainIdx2.txt
- Test data: in the file testIdx2.txt
For each subset, user rating data is grouped by user. The first line for a user is formatted as:
<UserId>|<#UserRatings>\n
Each of the next <#UserRatings> lines describes a single rating by <UserId>. The rating line format is:
<ItemId>\t<Score>\n
The scores are integers lying between 0 and 100, and are withheld from the test set. All user id's and item id's are consecutive integers, both starting at zero.
An item has at least 20 ratings in the total dataset (including train and test sets), and each user has at least 17 ratings in the training data.
For each user participating in the test set, six items are listed. All these items must be tracks (not albums, artists, or genres). Three out of these six items have never been rated by the user, whereas the other three items were rated "highly" by the user, that is, scored 80 or higher.
The three items rated highly by the user were chosen randomly from the user's highly rated items, without considering rating time. The three test items not rated by the user are picked at random with probability proportional to their odds to receive "high" (80 or higher) ratings in the overall population.
Note that many users do not participate in this test set at all.
The goal of such a task would be differentiating high ratings from missing ones, which requires extending the generalization power of the learning algorithm to the truly missing entries, as required in real life scenarios.
Predictions should be arranged by their order in the test set. Only two values can be used: "1" for items predicted to be rated highly by the user and "0" for items predicted not to be rated by the user. For each user in the test set, exactly three "1"s and three "0"s must be given; otherwise the submission will be rejected. These binary values are represented by single bytes, the ASCII codes of '1' (0x31) and '0' (0x30) respectively. Hence, the 607,032 ratings in the test set are predicted using 607,032 bytes.
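One simple way to satisfy the three-and-three constraint is to compute some real-valued score for each of a user's six test items and label the three highest-scoring items "1". The sketch below assumes such scores already exist. The names `label_user` and `build_submission` are hypothetical, and how the scores are produced is outside its scope.

```python
from typing import Iterable, Sequence


def label_user(scores: Sequence[float]) -> bytes:
    """Return six ASCII bytes: '1' for the three highest scores, '0' for the rest."""
    if len(scores) != 6:
        raise ValueError("each test user must have exactly six items")
    # Indices of the three largest scores; ties keep their original order.
    top = set(sorted(range(6), key=lambda i: scores[i], reverse=True)[:3])
    return bytes(0x31 if i in top else 0x30 for i in range(6))


def build_submission(per_user_scores: Iterable[Sequence[float]]) -> bytes:
    """Concatenate the per-user labels in test-set order."""
    return b"".join(label_user(s) for s in per_user_scores)


submission = build_submission([
    [0.9, 0.1, 0.8, 0.2, 0.7, 0.3],
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
])
```

Ranking within each user, rather than thresholding scores globally, guarantees exactly three "1"s per user regardless of how the scores are calibrated.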
The evaluation criterion is the error rate, which is the fraction of wrong predictions.
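As a sketch of this metric, the error rate over byte-encoded labels can be computed as below. The function name `error_rate` is hypothetical, and true labels are only available locally if a contestant holds out part of the training data.

```python
def error_rate(predicted: bytes, truth: bytes) -> float:
    """Fraction of positions where the predicted label differs from the true one."""
    if len(predicted) != len(truth) or len(truth) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(1 for p, t in zip(predicted, truth) if p != t)
    return wrong / len(truth)


# Two of the six labels disagree.
example_rate = error_rate(b"101010", b"110010")
```

Because every user has exactly three "1"s in both the prediction and the truth, mistakes come in pairs: each wrongly placed "1" implies a wrongly placed "0" for the same user.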
As in Track1, the test subset is further split into two equally sized subsets, called Test1 and Test2. Test1 is used to rank contestants on the public Leaderboard and Test2 is used for deciding the winners of the contest. The split between Test1 and Test2 is not disclosed.
We made the Track2 dataset significantly smaller than the Track1 dataset in order to make it more accessible to contestants with lower-end computing machinery. The dataset statistics are as follows:
| #Users | #Items | #Ratings | #TrainRatings | #TestRatings |
|---|---|---|---|---|
| 249,012 | 296,111 | 62,551,438 | 61,944,406 | 607,032 |
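Since each participating user contributes exactly six test items, the table also implies how many users take part in the Track 2 test set. The arithmetic is sketched below with illustrative variable names.

```python
# Track 2 statistics, copied from the table above.
n_users = 249_012
n_ratings = 62_551_438
n_train = 61_944_406
n_test = 607_032

# Train and test partition the full set of ratings.
subsets_sum_to_total = (n_train + n_test == n_ratings)

# Six test items per participating user, so the count must divide evenly.
n_test_users, remainder = divmod(n_test, 6)

# Share of all Track 2 users that appear in the test set.
test_user_fraction = n_test_users / n_users
```

This gives 101,172 test users, well under half of the 249,012 users, which matches the note that many users do not participate in the test set.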
Please note that Track1 and Track2 use disjoint sets of users, and also use different encodings for item indexes. Hence, entities of Track1 need not be the same in Track2 even if they share the same name (e.g., item "5" in Track1 is not the same item as the item named "5" in Track2).
Item Taxonomy
A unique feature of the datasets is a taxonomy annotating known relations between the items. Such a taxonomy is expected to be particularly useful here, due to the large number of items and the sparseness of data per item (mostly attributed to "tracks" rather than to "artists").
Recall that item IDs can represent tracks, albums, artists, or genres. The type of each item, including a hierarchical structure linking tracks, albums, artists, and genres, is stored (separately for each of the two datasets) in the following four files:
- trackData.txt - Track information formatted as:
<TrackId>|<AlbumId>|<ArtistId>|<Optional GenreId_1>|...|<Optional GenreId_k>\n
- albumData.txt - Album information formatted as:
<AlbumId>|<ArtistId>|<Optional GenreId_1>|...|<Optional GenreId_k>\n
- artistData.txt - Artist listing formatted as:
<ArtistId>\n
- genreData.txt - Genre listing formatted as:
<GenreId>\n
Within these files, missing values are encoded as the string "None".
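The taxonomy files can be parsed with a few lines of code. The sketch below handles trackData.txt and albumData.txt lines and maps the string "None" to Python's `None`. The function names and the dictionary layout are illustrative choices, and dropping "None" entries from the genre list is an assumption about how missing genres should be treated.

```python
from typing import Dict, List, Optional


def parse_optional_id(field: str) -> Optional[int]:
    """Convert a field to int, treating the string 'None' as a missing value."""
    return None if field == "None" else int(field)


def parse_genres(fields: List[str]) -> List[int]:
    """Keep the genre IDs that are present; assumption: skip 'None' entries."""
    return [int(g) for g in fields if g != "None"]


def parse_track_line(line: str) -> Dict[str, object]:
    """Parse '<TrackId>|<AlbumId>|<ArtistId>|<GenreId_1>|...|<GenreId_k>'."""
    fields = line.rstrip("\n").split("|")
    return {
        "track": int(fields[0]),
        "album": parse_optional_id(fields[1]),
        "artist": parse_optional_id(fields[2]),
        "genres": parse_genres(fields[3:]),
    }


def parse_album_line(line: str) -> Dict[str, object]:
    """Parse '<AlbumId>|<ArtistId>|<GenreId_1>|...|<GenreId_k>'."""
    fields = line.rstrip("\n").split("|")
    return {
        "album": int(fields[0]),
        "artist": parse_optional_id(fields[1]),
        "genres": parse_genres(fields[2:]),
    }


# Made-up example lines.
track = parse_track_line("12|345|None|7|9\n")
album = parse_album_line("345|678\n")
```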