TAEM Interview with Dr. Kirk Borne of George Mason University

TAEM- The Arts and Entertainment Magazine’s publisher, Joseph J. O’Donnell, issued a challenge in the December 15th issue of our publication to start a ‘grass roots movement’ to support NASA. This support is spreading through the academic world, and it began with the faculty and student body of George Mason University. The challenge has centered not only on support of NASA, but also on giving the agency ideas for its future space exploration programs.

   Dr. Kirk Borne, of GMU, is a Data Scientist and Astrophysicist, and is one of the many professors from the school who have stepped forward to offer insights into what can be achieved. Professor Borne, please tell our readers about your educational training for your fields.

KB- My undergraduate B.S. degree was in Physics at Louisiana State University, with a lot of math and some astronomy.  My goal was to study astronomy in graduate school, so the math and physics coursework was essential.  I loved all of those topics, and astronomy gave me the opportunity to study them all. I went to graduate school at Caltech, receiving a PhD in astronomy in 1983. I studied under some of the great astronomers of that era. It was a fantastic experience. In the years since then, I worked on NASA’s Hubble Space Telescope project for 10 years and at NASA’s Astronomy Data Center within the Space Science Data Operations Office at the Goddard Space Flight Center for another 10 years, and I have now been at George Mason University since 2003.  All of my research and my work experiences at NASA always involved working with scientific data – this led me to the field of Data Science, which is the application of data methods and algorithms to the study of any discipline.

TAEM- You are also a member of your university’s SPACS program. Please tell us about it and the goals that it has set forth.

KB- SPACS is the School of Physics, Astronomy, and Computational Sciences. This is a unique program among universities in the US. Our faculty and students focus on a wide range of research problems involving physics, astronomy, computational science, and data science.

TAEM- What part do you play in its agenda?

KB- I have helped to develop the Data Science curriculum within the school. In that capacity, I am the undergraduate advisor for students in the Computational & Data Sciences B.S. degree program.  I also advise many graduate students within our CSI (Computational Science & Informatics) PhD program. I teach courses related to Data Science, including Scientific Databases, Computational Data Science, Data Mining, and Data Ethics.  In addition to advising and teaching in this program, I also carry out data science research – mostly in astronomy, but covering many other fields.

TAEM- Your specialties are listed as a Data Scientist, Astrophysicist, Big Data Science Consultant, and Public Speaker. Please describe your capacities in these venues and how they are connected.

KB- I have many years of experience working with data, databases, and data science methodology (including data mining, statistics, and visualization).  This experience includes teaching and research, but it also has led me to assist, advise, and consult with other organizations and federal agencies regarding their data activities.  This has gained me some notoriety, so I receive many (10 to 20) invitations to speak at conferences and universities worldwide each year on the topic of Data Science, specifically “Big Data”. My two most amazing experiences in this capacity are these: first, in 2001, I was asked to brief the US President on data mining; and second, in 2011, I was the conference keynote speaker at the Medicare and Medicaid Statistics Conference.  I never imagined such experiences when I was focusing only on my astronomy research years ago.

TAEM- Please tell us in detail about Transdisciplinary Data Science.

KB- Transdisciplinary is different from multi-disciplinary or interdisciplinary in that it refers to the fact that Data Science transcends traditional discipline boundaries.  I can work with financial experts, climate scientists, agriculture specialists, criminologists, drug safety organizations, and library staff on data issues without necessarily requiring me to learn their field or requiring them to learn astronomy. The language of Data Science (databases, data, metadata, statistics, visualization, data mining) transcends those discipline-specific concepts. Data Science enables productive, meaningful, and enlightening research experiences across discipline boundaries for everyone involved.  For me personally, I like to think of myself as a Transdisciplinary Data Scientist because my research on data mining algorithms, data structures, data management methods, and statistics is applicable to almost any discipline.

TAEM- How alike are Big Data and Large Databases, and how do they interact?

KB- Big Data is a concept that conveys many meanings and implications to different audiences. It includes large databases, but it includes tons of other things, including large data collections that are not in databases (such as Internet blogs, social network postings, news reports, online video content, images, audio, publications, articles, and anything else that is in digital or non-digital form).  Large databases are only a small subset of Big Data collections. The Big Data concept also conveys additional meanings, such as the challenges associated with discovery, access, mining, analysis, and interpretation of massive data collections.  These challenges include the large data volume, but also the complexity of the data (as indicated above, the large variety of data types), as well as the enormous rate at which data are being generated in the world today. The data production rate doubles every year, which means that the world will have at least 1000 times more data ten years from now, one million times more data 20 years from now, and so on.  One estimate of the total amount of data created in the world each and every day is three exabytes, which is three million terabytes or three billion gigabytes.  This is more than a thousand times all of the information in all of the books and journals in all of the libraries in the world that have ever been published in the history of humanity. This is the amount that we create each new day this year – we will create roughly this same amount of data every 90 seconds ten years from now, and roughly this same amount every 1/10 of a second 20 years from now! We must teach the next generation how to handle, deal with, cope with, and make use of this information flood.  Big Data therefore refers to the combination of all of these challenges and issues (both technological and human).
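
The growth figures in the answer above follow from simple doubling arithmetic. The short Python sketch below reproduces them, assuming the two numbers quoted in the answer (about three exabytes created per day, and a production rate that doubles every year) and decimal units (1 exabyte = 10^6 terabytes = 10^9 gigabytes). The function names are illustrative only.

```python
# Assumptions taken from the answer above: roughly 3 exabytes are created per
# day, and the production rate doubles every year. Decimal (SI) units are used.
SECONDS_PER_DAY = 24 * 60 * 60
EXABYTES_PER_DAY_NOW = 3
TERABYTES_PER_EXABYTE = 10**6
GIGABYTES_PER_EXABYTE = 10**9


def growth_factor(years: int) -> int:
    """Factor by which the daily production rate grows after `years` doublings."""
    return 2 ** years


def seconds_to_match_today(years: int) -> float:
    """Seconds needed, `years` from now, to create what is created in one day today."""
    return SECONDS_PER_DAY / growth_factor(years)


print("TB per day now:", EXABYTES_PER_DAY_NOW * TERABYTES_PER_EXABYTE)
print("GB per day now:", EXABYTES_PER_DAY_NOW * GIGABYTES_PER_EXABYTE)
for years in (10, 20):
    print(years, "years:", growth_factor(years), "x,",
          seconds_to_match_today(years), "seconds per 'day of today'")
```

Ten doublings give a factor of 1024 and twenty give 1,048,576, so one of today's days of data would take about 84 seconds ten years out and about 0.08 seconds twenty years out, consistent with the rounded "90 seconds" and "1/10 of a second" quoted above.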

TAEM- How do Data Mining and knowledge discovery figure into these?

KB- The collection of large data into archives and databases is useless unless you intend to use the data for something; and that “something” is discovery of patterns, trends, correlations, features, outliers, anomalies, and unexpected “knowledge” nuggets hidden within these enormous data sets (i.e., finding the proverbial needle in the haystack).  That is what we call data mining, which is also called Knowledge Discovery from Data.  This process is also referred to as “Learning from Data”.  When it is applied to making decisions about future behaviors of consumers, or systems, or physical processes, data mining is called “Predictive Analytics”.  Yes, people use data mining to predict the future.  There are numerous movies and TV shows that highlight these techniques – the techniques are real (even though the shows are fictional).  Finding new knowledge is what science is all about.  The amazing thing now is that businesses, agencies, sports teams, grocery stores, entertainers, social networking sites, and everyone else have realized that they too can discover new knowledge (e.g., predict behaviors or outcomes) from their data collections: ticket sales, purchases, buying patterns, behavior patterns, system logs, and more.

TAEM- We understand that you basically invented the idea of astro-informatics.  Please tell our readers about this and some of the Big Data issues that face science today.

KB- Informatics is the application of Data Science to a specific discipline, though we sometimes simply define Informatics as Data Science. Some science-specific informatics disciplines are well established, including Bioinformatics and Geoinformatics. It occurred to me about 10 years ago that astronomy’s big (and growing) data collections would require a similar methodological subdiscipline of the field of astronomy – I called this Astroinformatics. I used this word for many years without much uptake by the astronomy community. I published a journal paper and another paper for the National Academy of Sciences on Astroinformatics 3 years ago – now, it is a very commonly used term, there are Astroinformatics conferences every year, and there are Astroinformatics committees within the major astronomy professional societies (in the US and internationally). I am a member of all of these committees. It is very exhilarating to see how far the field has come in such a short time.  The real reasons for these informatics approaches to science are the same reasons mentioned earlier: scientists are generating huge amounts of data from their experiments, we want to explore these data collections as effectively and efficiently as possible, we want to discover all of the knowledge that is hidden in these data, and we want to unlock the mysteries of the world and Universe around us that our huge experiments can now reveal to us.

TAEM- We understand that your findings involving Large Database Astronomy have pinpointed Groups and Clusters of Galaxies. How do Massive Databases and Large Sky Surveys make this possible?

KB- My current research is focused on outlier detection, which I prefer to call Surprise Discovery – finding the unknown unknowns and the unexpected patterns in the data.  These discoveries may reveal data quality problems (i.e., problems with the experiment or data processing pipeline), but they may also reveal totally new astrophysical phenomena: new types of galaxies or stars or whatever. That discovery potential is huge within the huge data collections that are being generated from the large astronomical sky surveys that are taking place now and will take place in the coming decades. I haven’t yet found that one special class of objects or new type of astrophysical process that will win me a Nobel Prize, but you never know what platinum-plated needles may be hiding in those data haystacks.
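
To make the idea of outlier detection concrete, here is a minimal Python sketch. It is not Dr. Borne's method, which the interview does not describe; it uses one common textbook approach, the modified z-score based on the median absolute deviation (MAD), with the conventional scaling constant 0.6745 and cutoff 3.5. The `brightness` values are hypothetical data invented for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds `threshold`."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged
    # 0.6745 scales the MAD so the score is comparable to an ordinary
    # z-score when the data are roughly normally distributed.
    return [(i, v) for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]


# Hypothetical brightness measurements containing one surprising value.
brightness = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2, 10.1]
print(mad_outliers(brightness))  # prints [(5, 25.0)]
```

A flagged value such as the one at index 5 could be a data quality problem or a genuinely new kind of object; the code only points to where to look.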

TAEM- What advice and support can you give to NASA for its future programs of space exploration?

KB- In the context of data science, the most important lesson is that the data generated from all NASA missions should be made openly available to the research community in useful and self-explanatory ways, to facilitate new and interesting uses of the data for research and discovery.  It is also helpful if these data are annotated or tagged with rich metadata, which will further enable integration and fusion of data products from multiple missions, thus enabling far greater discovery potential. One of the primary functions of metadata is to provide a short-hand condensed representation of the data product. This helps to address some of the challenges associated with Big Data: making the data more manageable and conveniently usable. We are already familiar with such hierarchical data structures – anyone who has used Google or Bing Maps experiences this – the map of the world is initially presented to you in very low resolution mode, but the resolution gets higher and higher as you drill down to some specific location – you are finally able to view your backyard or some destination at the highest resolution available from some satellite image collection, but you are definitely not viewing the whole Earth at one time at that same high resolution. Exploratory data analysis makes use of such hierarchical data structures, which NASA missions should generate for their science data users.
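
The map analogy above can be sketched as a small multi-resolution "pyramid": each level is a condensed summary of the level below it. The Python sketch below is an illustration only, not a description of how any NASA archive or mapping service is built. It assumes a square grid whose side length is a power of two, and the `image` values are made up.

```python
def downsample(grid):
    """Halve the resolution of a square grid by averaging each 2x2 block."""
    n = len(grid) // 2
    return [[(grid[2 * r][2 * c] + grid[2 * r][2 * c + 1] +
              grid[2 * r + 1][2 * c] + grid[2 * r + 1][2 * c + 1]) / 4
             for c in range(n)]
            for r in range(n)]


def build_pyramid(grid):
    """Return all levels, from full resolution down to a single-cell summary."""
    levels = [grid]
    while len(levels[-1]) > 1:
        levels.append(downsample(levels[-1]))
    return levels


# Hypothetical 4x4 "image"; real data products would be far larger.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]

pyramid = build_pyramid(image)
for level in reversed(pyramid):  # coarsest view first, as in a map viewer
    print(len(level), "x", len(level), level)
```

A user would start from the coarsest level and request finer levels only for the region of interest, which is the "drill down" behavior described in the answer.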

TAEM- What programs are you working on that would pertain to this, and what research projects are you planning on for the future?

KB- I have been working on Citizen Science projects, such as Galaxy Zoo and Zooniverse.org, in which volunteer citizens tag and annotate our scientific data products (e.g., images, or time series, or model outputs, or whatever) with descriptive characteristics. This characterization of complex data products becomes part of the metadata associated with that data product, thus enabling linkages between different data products and discovery of new patterns, trends, correlations, and behaviors.  I envision extending this to other projects in the future, examining the role of non-experts in metadata generation (characterization) and conducting exploratory research into the “best” pattern recognition methods for discovering interesting, surprising, and informative features in large science data collections.

TAEM- What information can you give us so that the many students who read our publication can learn more about you and the school’s programs?

KB- I invite others to check out some of the research and academic programs within Mason’s SPACS school at http://spacs.gmu.edu and to check out my own research and teaching interests at http://classweb.gmu.edu/kborne . If you are a Twitter user, you can follow me there as I actively tweet about Big Data, Data Science, and Astronomy under the handle @KirkDBorne.

TAEM- Dr. Borne, it has been a sincere honor to be able to interview you for our magazine. We have discovered that, like yourself, the faculty at George Mason University is a veritable well of information and that your school has produced one of the best-trained student bodies in the nation. We want to thank you for your time and look forward to talking with you again in the very near future.