DataViz - Easy Visualization for Machine Learning
Nov 20, 2016
DataViz for Easy Plotting
Now that machine learning and big data have become extremely popular, it makes sense that we need better ways to look into what is going on behind the data and the model. There are a number of frameworks you can use to produce figures on demand. The most popular ones include matplotlib, seaborn, plot.ly, and bokeh, which we use in this project. Of course, there are many other choices.
Frankly, I switched to bokeh a while ago because it produces amazing figures and makes it easy to deploy them as a set of HTML files. I haven't yet tried other frameworks, but to my knowledge bokeh is one of the most promising ones. plot.ly is also great, but I dislike its concept of plotting through API calls. I won't discuss the pros and cons of the different plotting frameworks here. I chose bokeh because it supports both low-level and high-level plotting functions. The one exception is that it does not support contour plots, which I really regret, but you can find alternative solutions for that too.
First, here is a simple demonstration of what it should look like in the end. The three examples above are plotted for three kinds of purposes in data analytics. The main purpose of building these DataViz wrappers on top of bokeh is to facilitate my daily research. In machine learning, we are usually interested in a certain set of plots. For instance, when you build a model for a task, you probably want to evaluate it with varying parameters, so you plot a validation curve with the mean and standard deviation for the different parameter settings. You will also be interested in a learning curve showing how the model performs with respect to the training set size.
scikit-learn even provides two very convenient functions for creating these common curves. See the examples below.
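A minimal sketch of the two calls, assuming a recent scikit-learn in which both functions live in sklearn.model_selection. The toy dataset, the SVC estimator, and the gamma range are placeholders chosen for illustration, not the setup used in my experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.svm import SVC

# Toy data standing in for a real experiment (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

n_folds = 5
estimator = SVC(kernel="rbf", gamma=0.1)

# Learning curve: score as a function of the training set size.
train_sizes, lc_tr_scores, lc_tt_scores = learning_curve(
    estimator, X, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=n_folds
)

# Validation curve: score as a function of one hyperparameter.
param_range = np.logspace(-3, 1, 5)
vc_tr_scores, vc_tt_scores = validation_curve(
    estimator, X, y, param_name="gamma", param_range=param_range, cv=n_folds
)

# Each score array has one row per x value and one column per fold.
print(lc_tr_scores.shape, vc_tt_scores.shape)
```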
Using these two functions is quite straightforward, and I use them often in experiments. The estimator corresponds to the classifier you build, which must implement the fit and predict methods. If you do not pass the scoring argument, your estimator also needs to implement an estimator.score() function. Anyway, how to use these is out of scope for this post. If we look at the output, there are tr_scores and tt_scores, which have dimension x_sizes*n_folds. x_sizes corresponds to the x-axis, and n_folds corresponds to the number of folds in your cross-validation. At this point, it would be nice to have helper functions that generate plots directly from these arrays.
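The first step of such a helper can be sketched with NumPy alone: collapse the fold axis into a mean curve and a band of one standard deviation around it. The function name summarize_scores and the sample numbers are hypothetical, chosen only to illustrate the array shapes.

```python
import numpy as np


def summarize_scores(scores):
    """Collapse an (x_sizes, n_folds) score array along the fold axis.

    Returns the mean curve plus a lower and an upper bound at one
    standard deviation, each of length x_sizes.
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=1)
    std = scores.std(axis=1)
    return mean, mean - std, mean + std


# Two x values, three folds each (made-up numbers).
tr_scores = np.array([[0.90, 0.92, 0.94],
                      [0.95, 0.95, 0.95]])
mean, lower, upper = summarize_scores(tr_scores)
```

The three returned arrays are exactly what a curve with a shaded band needs: one line for the mean and two bounds for the fill.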
Frankly, bokeh still has a lot to improve. In particular, it does not provide contour/contourf plots, which I use a lot when plotting classification boundaries or density maps. But it does a really nice job of supporting modern browsers. It can directly generate HTML files that you can deploy on your own website or share with friends. It also integrates seamlessly with Jupyter notebooks, which lets you play around with plots interactively. Another very nice feature is widgets and controls: you can create a Button, Checklist, or Input Box very easily in bokeh and make them respond to your plots. Achieving this with matplotlib is a real pain, since it was not designed for that.
Concept of DataViz
Simply using bokeh's native API is fine for plotting your data, but I want to make it even simpler for my research. Every time I want to inspect a model, data, or experimental results, I want to present all the plots I am interested in on one simple web page. With bokeh, I can easily achieve this.
DataViz is simply a main class that contains several plotting methods.
The backbone code of the DataViz class is shown above. The basic idea is that one DataViz instance represents a figure, and each method of DataViz returns a particular plot attached to that figure. Each plotting method first gets a default plot configured with the default settings. DataViz also provides other utilities, e.g., sending the figure to a server for hosting or emailing the plots to someone.
So why bother with this, given that you can use bokeh directly to achieve your goals? Well, these are more or less helpers for those who only care about the plots that matter in the machine learning community. There is of course a tradeoff between usability and flexibility in any framework. The more you encapsulate, the easier it is to use, but you lose flexibility.
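To make the design concrete, here is a minimal sketch of what a class built on this idea could look like. It is not the actual DataViz code: the method names (_default_plot, simple_curve, to_html), the DEFAULTS dictionary, and its values are all hypothetical. Only the bokeh calls (figure, line, file_html) are real API.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN


class DataViz:
    """Hypothetical wrapper: one instance owns one bokeh figure."""

    # Default settings shared by every plotting method (illustrative values).
    DEFAULTS = {"tools": "pan,wheel_zoom,reset,save"}

    def __init__(self, title="", **settings):
        self.title = title
        # Caller-supplied settings override the defaults.
        self.settings = {**self.DEFAULTS, **settings}
        self.fig = None

    def _default_plot(self, x_label="", y_label=""):
        """Create the figure configured with the default settings."""
        self.fig = figure(title=self.title, x_axis_label=x_label,
                          y_axis_label=y_label, **self.settings)
        return self.fig

    def simple_curve(self, x, y, x_label="x", y_label="y", color="navy"):
        """Draw a single line and return the figure it is attached to."""
        fig = self._default_plot(x_label, y_label)
        fig.line(x, y, line_width=2, color=color)
        return fig

    def to_html(self):
        """Render the current figure as a standalone HTML page (a string)."""
        if self.fig is None:
            raise ValueError("nothing plotted yet")
        return file_html(self.fig, CDN, self.title)


viz = DataViz(title="Validation error")
viz.simple_curve([1, 2, 3], [0.30, 0.22, 0.18], x_label="epoch", y_label="error")
html = viz.to_html()
```

Returning an HTML string keeps the sketch free of side effects. Hosting or emailing the result, as mentioned above, would be built on top of that string.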
We currently support the following plots specifically for machine learning use cases, and I will update the list as my research requires. For the moment, I do not intend to publish the code, since it is just a set of helpers anyway. But if you really want to give them a shot, just email me and I will see whether I can send you the code.
Currently we support only a few plot types. Of course, the list will grow in the future.
This plot shows how the feature values are distributed. Along the x-axis are the indices of the feature columns, and along the y-axis are the values of the N data samples. Plotting it gives you an overview of how your data is distributed, so you can decide afterwards how to preprocess it. An example is shown below in Figure 1 (Left).
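The data preparation behind such a plot can be sketched in a few lines of NumPy: every value in the data matrix becomes one point whose x coordinate is its column index. The function name feature_scatter_coords is hypothetical, and this is only one plausible way to lay out the points.

```python
import numpy as np


def feature_scatter_coords(X):
    """Turn an (N, D) data matrix into point coordinates.

    x holds the feature column index of every value and y the value
    itself, so each column becomes a vertical strip of N points.
    """
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    # ravel() walks the matrix row by row, so the column indices repeat
    # 0..D-1 once per sample.
    x = np.tile(np.arange(n_features), n_samples)
    y = X.ravel()
    return x, y


# Three samples with two features on very different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
x, y = feature_scatter_coords(X)
```

Features on very different scales, like the two columns here, show up immediately as strips of different height, which is a hint that scaling may be needed.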
The name project2d is self-explanatory. We want to see how the data is distributed in a lower dimension. For example, if we have two categories of data and use different colors to annotate them, we can get insight into what is going on behind the observations. You can see one example in Figure 1 (Right). This is a very common task when you first get your data. Normally you get very high-dimensional data, which is transformed into 2-dimensional data using PCA or some other dimensionality reduction method. Some examples are Multidimensional Scaling (MDS), t-Distributed Stochastic Neighbor Embedding (t-SNE), and of course PCA/Kernel PCA. Those are all my favorites. Before you start to learn any model from the data, it is common practice to run them to see whether the data reveals any insight.
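As an illustration of the projection step, here is a minimal sketch of plain PCA to two dimensions using NumPy's SVD. The function name project2d_pca is hypothetical and the random data is a stand-in. MDS, t-SNE, or Kernel PCA would need a library such as scikit-learn instead.

```python
import numpy as np


def project2d_pca(X):
    """Project an (N, D) data matrix onto its first two principal components.

    Assumes N >= 2 and D >= 2. Returns an (N, 2) array.
    """
    X = np.asarray(X, dtype=float)
    # PCA works on mean-centered data.
    Xc = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:2].T


# Synthetic 5-dimensional data with very different variance per feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
Z = project2d_pca(X)
```

The two columns of Z are the coordinates to scatter, colored by category label.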
Another common plot you want to check is the correlation map among features and targets. After computing the pairwise correlations between features and targets, you can see which features are strongly correlated with each other and which are clearly more decisive for the targets. This is quite crucial as a first step, because you may want to do feature selection before feeding a huge number of irrelevant features to your model. It is also often the case that your observations are very sparse, and only a small subset of features is relevant. An example is shown in Figure 2.
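The matrix behind such a map can be sketched as follows, assuming Pearson correlation is the measure of interest. The function name correlation_map and the synthetic data are hypothetical.

```python
import numpy as np


def correlation_map(X, y):
    """Pearson correlation matrix over all feature columns and targets.

    X is (N, D); y is (N,) or (N, T). The result is (D + T, D + T),
    with the target columns placed last.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    data = np.hstack([X, y])
    # rowvar=False treats each column as one variable.
    return np.corrcoef(data, rowvar=False)


# Synthetic data: the target depends almost entirely on feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + 0.01 * rng.normal(size=100)
corr = correlation_map(X, y)
```

In this toy case the last row of corr singles out feature 0 as the decisive one, which is exactly the kind of hint you want before feature selection.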
Simple curves and fill_between curves
Simple curves and fill_between curves target the same type of plot, for instance the error curve or learning curve described earlier. The only difference is that fill_between adds an upper and a lower bound around the actual curve to present information such as error bars or the standard deviation over multiple repeated runs. See Figure 3 for an example.
Visualizing your data and model correctly is crucial while you develop your learning systems. I always put a lot of emphasis on demonstrating the model, not only for my own insight but also so that customers or supervisors can understand what you have achieved.
In the future, I plan to work out more types of plots that are relevant to the machine learning community, for example contour/contourf plots, classification boundaries, and visualizations of neural networks. I also want to add facilities such as better deployment of figures and widgets for better interaction on websites.
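One way to draw such a band is to build a closed polygon from the two bounds and hand it to a patch glyph. The sketch below shows only the polygon construction in NumPy. The function name band_polygon is hypothetical, and this is not necessarily how DataViz does it.

```python
import numpy as np


def band_polygon(x, lower, upper):
    """Build the outline of the area between two curves.

    Returns polygon coordinates (px, py) that walk left to right along
    the upper bound and then right to left along the lower bound.
    """
    x = np.asarray(x, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    px = np.concatenate([x, x[::-1]])
    py = np.concatenate([upper, lower[::-1]])
    return px, py


# Three x positions with a band that is widest in the middle.
px, py = band_polygon([0, 1, 2], lower=[0, 0, 0], upper=[1, 2, 1])
```

With bokeh, the result can be passed to a figure's patch(px, py) method with a low fill alpha, and the mean curve is then drawn on top with line.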