Using Machine Learning for Exploratory Data Analysis

Joshua M. Lewis, Virginia R. de Sa

2012

Joshua M. Lewis (josh@cogsci.ucsd.edu)
Department of Cognitive Science, University of California, San Diego

Virginia R. de Sa (desa@cogsci.ucsd.edu)
Department of Cognitive Science, University of California, San Diego

Abstract

This tutorial will introduce attendees to fundamental concepts in clustering and dimensionality reduction, two fields of unsupervised machine learning. Attendees will learn about the assumptions algorithms make and how those assumptions can cause the algorithms to be more or less suited to particular datasets. Hands-on interaction with machine learning algorithms on real and synthetic data is a central component of this tutorial. Students will use the software platform Divvy (freely available from the Mac App Store or divvy.ucsd.edu) to visualize and analyze data in real time while testing the concepts learned during formal instruction. We encourage attendees to bring their Mac laptops and their own datasets for the hands-on portion of the tutorial and, if possible, to email their datasets ahead of time to josh@cogsci.ucsd.edu.

Attendees will leave the tutorial with a much better understanding of basic concepts in unsupervised machine learning. Pragmatically, they will understand when to apply, e.g., k-means rather than single linkage clustering to a dataset. Attendees will also learn how to integrate Divvy into their existing research workflow so that they can quickly test and compare machine learning algorithms on their data.

Objectives and Scope

This tutorial will introduce attendees to fundamental concepts in clustering and dimensionality reduction, two fields of unsupervised machine learning. Attendees will learn about the assumptions algorithms make and how those assumptions can cause the algorithms to be more or less suited to particular datasets. Hands-on interaction with machine learning algorithms on real and synthetic data is a central component of this tutorial. Students will use the software platform Divvy to visualize and analyze data in real time while testing the concepts learned during formal instruction. We will encourage attendees to bring their own datasets for analysis in the hands-on portion of the tutorial.

Attendees will leave the tutorial with a much better understanding of basic concepts in unsupervised machine learning. Pragmatically, they will understand when to apply, e.g., k-means rather than single linkage clustering to a dataset. Attendees will also learn how to integrate Divvy into their existing research workflow so that they can quickly test and compare machine learning algorithms on their data.

Qualifications

Joshua Lewis recently completed his PhD thesis, Anthropocentric Data Analysis, on the topic of reintegrating humans into the data analysis process. He is the lead software architect behind Divvy and has done several studies on the relationship between human reasoning and machine learning. He is a postdoc in UCSD's Natural Computation Lab under the supervision of Virginia de Sa. He has attended CogSci and presented papers every year since 2009. Joshua will lead the tutorial.

Virginia de Sa is an associate professor at UCSD in the Cognitive Science department. She has done extensive re-

Topics

We will split the tutorial into two sections: a morning section focused on clustering and an afternoon section focused on dimensionality reduction. Both sections will start with a brief (approximately 1.5 hours) formal introduction to the mathematical and conceptual underpinnings of the topic, followed by a hands-on lab session applying the concepts learned directly before. The lab sessions will start with synthetic datasets designed to reinforce conceptual lessons, and then move to real datasets provided by ourselves and the attendees.

The clustering section will cover centroid-based methods (such as k-means), hierarchical methods (such as single linkage), spectral clustering, and probabilistic modeling (such as Gaussian mixture models). The dimensionality reduction section will cover linear methods (such as PCA and projection pursuit) and nonlinear methods (such as Isomap, t-SNE, and Kernel PCA).

Though we will provide formal mathematical characterizations, our focus will be on conceptual differences between techniques, specifically on choosing the correct technique based on known structure in a dataset. Additionally, we will emphasize that there is no single best clustering or embedding for any given dataset (in other words, there is no universally agreed upon objective function for clustering and dimensionality reduction). One's own analysis goals can play a significant role in, e.g., determining the number of clusters to search for. Finally, on the topic of evaluation, we will cover the visualization and interpretation of algorithmic output as well as formal quality measures such as Silhouette for clusterings and Trustworthiness for embeddings.

We will conduct our lab sessions in Divvy, a free and open-source software platform for performing unsupervised machine learning (see http://divvy.ucsd.edu, where there is a video of Divvy in action). Divvy will allow attendees to rapidly cluster, reduce, and visualize a wide variety of datasets without having to write any code. Divvy can concurrently visualize several perspectives on a dataset and can switch between datasets with one click, even when algorithms are computing in the background. Divvy integrates well with existing research workflows: it can import data from Matlab and R, and it exports data and visualizations in standard formats for further analysis.
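
As a rough illustration of how an algorithm's assumptions make it more or less suited to a dataset, the sketch below contrasts k-means with single linkage clustering on a synthetic dataset of two concentric rings. It is not part of the tutorial materials or of Divvy (which requires no code): the dataset, the parameter values, and the helper names `make_rings`, `kmeans`, `single_linkage`, and `agreement` are all assumptions made for this example, written in plain Python with only the standard library.

```python
import math
import random


def make_rings(n_per_ring=40, radii=(1.0, 4.0)):
    """Hypothetical toy data: evenly spaced points on concentric rings."""
    points, labels = [], []
    for label, radius in enumerate(radii):
        for i in range(n_per_ring):
            angle = 2 * math.pi * i / n_per_ring
            points.append((radius * math.cos(angle), radius * math.sin(angle)))
            labels.append(label)
    return points, labels


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def single_linkage(points, k):
    """Single linkage cut at k clusters: merge closest pairs (union-find)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    clusters = n
    for _, i, j in edges:
        if clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:  # merge the two clusters joined by the shortest edge
            parent[ri] = rj
            clusters -= 1
    roots = sorted({find(i) for i in range(n)})
    return [roots.index(find(i)) for i in range(n)]


def agreement(pred, truth):
    """Fraction of points matching the ring labels, up to swapping the two labels."""
    same = sum(p == t for p, t in zip(pred, truth))
    return max(same, len(truth) - same) / len(truth)


points, truth = make_rings()
km_score = agreement(kmeans(points, k=2), truth)
sl_score = agreement(single_linkage(points, k=2), truth)
print(f"k-means agreement with rings:        {km_score:.2f}")
print(f"single linkage agreement with rings: {sl_score:.2f}")
```

On this toy dataset single linkage recovers the two rings, because every gap between neighbors within a ring is smaller than the distance between the rings. k-means with two clusters divides the plane with a straight line, so it cannot separate one ring from the other and its agreement stays well below 1. The outcome can reverse on other data: single linkage is sensitive to noisy points that bridge clusters, while k-means handles compact, well-separated blobs well.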
