Big Data Analytics by a Logic Approach

Modern computer networking, the increased ability to collect molecular information in biology, and cheap low-power electronic sensors have made it easier than ever to collect more data than anyone knows what to do with. Some sense of the importance and scope of the field can be gained from the National Research Council's 2013 study of the issue.

The native territory of computability and model theory is infinite structures, so logicians have for years been dealing literally with what, for many in computing, statistics, and combinatorics, had always before been only a metaphor. The potential of logic to make significant contributions to the field is considerable.

One example worth considering is the problem of computing persistent homology. Algebraic topology has suggested a very reasonable tool for certain kinds of unsupervised machine learning problems, described at some length in Ghrist's recent survey. However, much remains unexplored about whether the necessary computations can be carried out on a practical presentation of the data. Some friends and I showed some years ago that, depending on how the data are given, it is not necessarily even possible to compute the fundamental group of a space. And even once we can compute the invariants, it remains to understand what they tell us about the data.
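
To make the computational question concrete, here is a minimal sketch of persistent homology in dimension zero, where the invariant reduces to tracking when connected components of a point cloud merge as the scale of the Vietoris-Rips filtration grows. The union-find implementation is an illustration only, not the method of any work cited above; practical computations use optimized libraries.

    import math
    from itertools import combinations

    def persistence_dim0(points):
        """0-dimensional persistence pairs (birth, death) for the
        Vietoris-Rips filtration of a finite point cloud.  Each point
        is born at scale 0; a component dies when it merges."""
        n = len(points)
        # All pairwise edges, sorted by length: the order in which
        # the filtration connects points.
        edges = sorted(
            (math.dist(points[i], points[j]), i, j)
            for i, j in combinations(range(n), 2)
        )
        parent = list(range(n))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        deaths = []
        for length, i, j in edges:
            ri, rj = find(i), find(j)
            if ri != rj:               # two components merge here
                parent[ri] = rj
                deaths.append(length)  # one component dies at this scale
        # n components are born at scale 0; one survives forever.
        return [(0.0, d) for d in deaths] + [(0.0, math.inf)]

    # Two well-separated clusters: one long finite bar (the gap between
    # clusters) stands out from the short bars of within-cluster noise.
    cloud = [(0, 0), (0.1, 0.2), (0.2, 0.1), (5, 5), (5.1, 5.2)]
    for birth, death in persistence_dim0(cloud):
        print(f"bar: [{birth}, {death})")

The long-lived bar is the feature a practitioner would read as "two clusters"; the computability questions above concern whether, and from what presentations of the data, such invariants can be extracted at all in higher dimensions.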

There are also more general problems of what is learnable and what is not. I recently characterized the difficulty of determining whether a concept class can be learned by a machine, in the sense that a computer program must examine some random data and approximate the true boundaries of the concept with error that is small in two specific senses: with high probability, the hypothesis is approximately correct. In the process, I formulated a setting in which it makes sense to ask about degrees of relative learnability. Are they dense, for instance? What other order-theoretic properties does this degree structure have?
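
For concreteness, the criterion just described is the familiar probably-approximately-correct one. The sketch below illustrates it for the simple concept class of intervals on [0,1]; the uniform sample distribution and the tightest-fit learner are assumptions of this illustration, not part of the characterization itself.

    import random

    def sample(target, m):
        """Draw m uniform points from [0,1], labeled by the hidden interval."""
        a, b = target
        xs = [random.random() for _ in range(m)]
        return [(x, a <= x <= b) for x in xs]

    def learn_interval(data):
        """Consistent learner: the tightest interval containing all positives."""
        pos = [x for x, label in data if label]
        if not pos:
            return (0.0, 0.0)   # empty hypothesis if no positive examples
        return (min(pos), max(pos))

    def error(target, hyp):
        """True error under the uniform distribution: the measure of the
        symmetric difference of the two intervals."""
        (a, b), (c, d) = target, hyp
        inter = max(0.0, min(b, d) - max(a, c))
        return (b - a) + (d - c) - 2 * inter

    target = (0.3, 0.7)
    for m in (10, 100, 1000):
        hyp = learn_interval(sample(target, m))
        print(m, round(error(target, hyp), 4))

Increasing the sample size drives the error down with high probability, which is exactly the two-part requirement: small error, with small probability of failure.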

It is common in the life sciences to use the entropy of a system to measure its complexity. Since biologists are often looking for the simplest (and thus, most likely) system that could produce a given effect, could Kolmogorov complexity give more scientifically meaningful information? This measure is commonly used in algorithmic randomness and arises from work on universal Turing machines. It is designed to measure the shortest description of a sequence: really, the length of the shortest program, on a fixed universal Turing machine, that could produce it.
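
Kolmogorov complexity itself is uncomputable, but compressed length gives a computable upper bound, so a toy comparison is possible. The sketch below uses zlib as the stand-in compressor, an arbitrary choice made purely for illustration.

    import math
    import random
    import zlib
    from collections import Counter

    def empirical_entropy(s):
        """Shannon entropy (bits per symbol) of the empirical distribution."""
        counts = Counter(s)
        n = len(s)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def compressed_length(s):
        """Length in bytes after zlib compression: a computable upper
        bound (up to an additive constant) on Kolmogorov complexity."""
        return len(zlib.compress(s.encode()))

    random.seed(0)
    structured = "ab" * 5000                                       # highly regular
    shuffled = "".join(random.choice("ab") for _ in range(10000))  # random

    for name, s in [("structured", structured), ("random", shuffled)]:
        print(name, round(empirical_entropy(s), 3), compressed_length(s))

Both strings have essentially the same per-symbol entropy, yet the compressor separates them sharply: the regular string compresses to a tiny fraction of the random one. That is precisely the kind of distinction, invisible to entropy alone, that the question above is after.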

Finally, there are practical problems arising in data-driven science that call for a mathematical solution, and on which logic might have a good approach. Consider a fixed graph G, and let H and J be two random subgraphs. What sampling distributions can we compute for the intersection of H and J? Model theory and computability tend to have quite a lot to say about random structures: precisely because they are random, such structures tend to exhibit high regularity of a kind that logic is very good at highlighting.
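
As one concrete instance of the question, suppose (purely as an assumption of this sketch) that H and J are formed by keeping each edge of G independently with probability p. Then the number of edges in the intersection is binomial with parameters |E(G)| and p^2, and a quick simulation confirms it.

    import random
    from math import comb
    from collections import Counter

    def random_edge_subgraph(edges, p):
        """Keep each edge independently with probability p (one simple
        model of a random subgraph; many others are possible)."""
        return {e for e in edges if random.random() < p}

    # A fixed graph G: the complete graph on 6 vertices (15 edges).
    G = [(i, j) for i in range(6) for j in range(i + 1, 6)]
    p, trials = 0.5, 20000

    counts = Counter()
    for _ in range(trials):
        H = random_edge_subgraph(G, p)
        J = random_edge_subgraph(G, p)
        counts[len(H & J)] += 1

    # Each edge of G survives in both H and J with probability p*p,
    # so |E(H intersect J)| should follow Binomial(15, 0.25).
    for k in sorted(counts):
        expected = comb(15, k) * 0.25**k * 0.75**(15 - k)
        print(k, round(counts[k] / trials, 4), round(expected, 4))

Under richer models of random subgraph, where edges are dependent or the subgraphs are constrained to be connected, such closed forms are not available, and that is where tools from logic might earn their keep.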

