In the modern world, large volumes of data are continuously being generated in many organisations, ranging from network traffic to prices in financial markets. These increasingly common “big data” situations pose a challenge for data analysts, who must come up with ever more sophisticated methods to process torrential data flows.
One of the most fundamental problems in data analysis is to identify relationships between variables. If a data set contains just a few variables, this is relatively simple to work out. The statistical procedures for calculating such “correlations” are routinely taught to undergraduate students.
For big data streams, however, finding correlations is an extremely difficult computational problem, even for modern computers.
Now, a breakthrough in solving this problem has been achieved by Assistant Professor Li Yi at Nanyang Technological University, Singapore (NTU Singapore), and Associate Professor David P. Woodruff and Taisuke Yasuda at Carnegie Mellon University.
In a paper presented at the Conference on Learning Theory, the most prestigious conference in theoretical machine learning, the computer scientists describe a novel mathematical method for efficiently computing correlations in big data sets. When the number of variables is large, the new algorithm provides an exponential improvement over the best algorithm previously known.
Finding Relations Between Two Variables
Suppose we are interested in data containing two variables x and y, each of which can take only the values 0 or 1. We receive a short data stream consisting of eight data points, which are pairs of (x,y) values:
(0,0), (1,1), (0,1), (1,0), (0,1), (0,1), (0,0), (0,1)
To analyse whether x and y are related, we can count up the number of occurrences of each possible (x,y) pair, and divide by eight. This yields the following table, called an “empirical joint density table”:
x/y | 0 | 1 |
0 | 2/8 | 4/8 |
1 | 1/8 | 1/8 |
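For readers who want to check this table themselves, here is a short Python sketch (our own illustration, not code from the paper) that counts the pairs in the stream above and divides by the stream length:

```python
from collections import Counter
from fractions import Fraction

# The eight (x, y) data points from the stream above
stream = [(0, 0), (1, 1), (0, 1), (1, 0), (0, 1), (0, 1), (0, 0), (0, 1)]

# Count each (x, y) pair and divide by the number of data points
n = len(stream)
joint = {pair: Fraction(count, n) for pair, count in Counter(stream).items()}

print(joint)
# {(0, 0): Fraction(1, 4), (1, 1): Fraction(1, 8),
#  (0, 1): Fraction(1, 2), (1, 0): Fraction(1, 8)}
# i.e. 2/8, 1/8, 4/8 and 1/8, matching the table above
```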
Next, we can check how often each variable takes on the values 0 or 1, a quantity called the “marginal density”. For x, the value 0 appears six times in the data stream, while 1 appears twice, as summarised in the following table:
value of x | 0 | 1 |
marginal density | 6/8 | 2/8 |
Likewise, for the variable y:
value of y | 0 | 1 |
marginal density | 3/8 | 5/8 |
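The marginal densities can be checked in the same way; again, this is only an illustrative sketch, with names of our own choosing:

```python
from collections import Counter
from fractions import Fraction

stream = [(0, 0), (1, 1), (0, 1), (1, 0), (0, 1), (0, 1), (0, 0), (0, 1)]
n = len(stream)

# Marginal density of x: how often each x-value appears, divided by n
marginal_x = {v: Fraction(c, n) for v, c in Counter(x for x, _ in stream).items()}
# Marginal density of y, computed the same way
marginal_y = {v: Fraction(c, n) for v, c in Counter(y for _, y in stream).items()}

print(marginal_x)  # {0: Fraction(3, 4), 1: Fraction(1, 4)}  i.e. 6/8 and 2/8
print(marginal_y)  # {0: Fraction(3, 8), 1: Fraction(5, 8)}
```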
If x and y are independent variables, their joint distribution should be the product of the corresponding marginal densities, resulting in the following joint density table:
x/y | 0 | 1 |
0 | 6/8 × 3/8 = 9/32 | 6/8 × 5/8 = 15/32 |
1 | 2/8 × 3/8 = 3/32 | 2/8 × 5/8 = 5/32 |
Evidently, these numbers differ from the contents of the empirical joint density table (i.e., the first table that we calculated, above). This allows us to conclude that the two variables are not independent.
But how closely related are the variables? One way to answer this is to sum the absolute values of the differences between the two joint density tables. In the above example, the result is 1/8. This is not very large, implying that the two variables are only weakly correlated.
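Putting the pieces together, the following sketch (our own, using the tables computed above) reproduces both the product table and the distance of 1/8:

```python
from fractions import Fraction

# Values copied from the tables above
joint = {(0, 0): Fraction(2, 8), (0, 1): Fraction(4, 8),
         (1, 0): Fraction(1, 8), (1, 1): Fraction(1, 8)}
marginal_x = {0: Fraction(6, 8), 1: Fraction(2, 8)}
marginal_y = {0: Fraction(3, 8), 1: Fraction(5, 8)}

# The joint density the stream would have if x and y were independent
product = {(x, y): marginal_x[x] * marginal_y[y]
           for x in marginal_x for y in marginal_y}

# Sum of absolute differences between the two joint density tables
distance = sum(abs(joint[cell] - product[cell]) for cell in joint)
print(distance)  # prints 1/8
```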
It Gets Complicated
Real-world data sets seldom contain only two binary variables. If the number of variables and the number of possible values are reasonably small, we can adapt the above method to test for correlations.
If there are q variables and each variable takes d distinct values, each data point has d^q possibilities. Thus, the joint density table must have d^q cells. (This is consistent with the above example: for q = 2 and d = 2, the table size is 2^2 = 4.)
When dealing with big data, the standard procedure becomes difficult to carry out if the size of the joint density table, d^q, becomes prohibitively large.
In 2010, a partial solution to this problem was devised by Vladimir Braverman and Rafail Ostrovsky, two computer scientists at the University of California, Los Angeles. They invented an algorithm that allows correlations to be computed using a table of size (log d)^M, where M = q^q.
When the number of possible values d is extremely large, but the number of variables q is small, the Braverman-Ostrovsky algorithm is a significant improvement. For example, if d = 100,000 and q = 2, the required table size is around 10^4, compared to 10^10 for the standard algorithm.
However, when the number of variables is large, the Braverman-Ostrovsky algorithm is not helpful and may even perform worse than the standard algorithm.
The new algorithm by Li, Woodruff, and Yasuda requires a table size of approximately (log d)^P, where P = q^2. For data sets with many variables, this is a significant improvement over the Braverman-Ostrovsky algorithm.
For example, if d = 100,000 and q = 4, the Li-Woodruff-Yasuda table size is around 10^16, compared to 10^20 for the standard algorithm or 10^271 for the Braverman-Ostrovsky algorithm.
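These figures can be reproduced, up to rounding, the choice of logarithm (natural logarithm assumed here) and constant factors hidden in the "approximately", with a few lines of Python; the names below are our own:

```python
import math

d, q = 100_000, 4

# Table sizes (in cells), ignoring constant factors
standard   = d ** q                    # d^q
braverman  = math.log(d) ** (q ** q)   # (log d)^(q^q)
li_et_al   = math.log(d) ** (q ** 2)   # (log d)^(q^2)

for name, size in [("standard", standard),
                   ("Braverman-Ostrovsky", braverman),
                   ("Li-Woodruff-Yasuda", li_et_al)]:
    print(f"{name}: roughly 10^{math.floor(math.log10(size))} cells")
# standard: roughly 10^20 cells
# Braverman-Ostrovsky: roughly 10^271 cells
# Li-Woodruff-Yasuda: roughly 10^16 cells
```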
“We believe that our solution for this mathematical problem is nearly optimal. It may be possible to improve the exponent P = q^2 to P = q, but nothing substantially better can be done – we think q has to be in the exponent”, says Assistant Professor Li, who is a faculty member at the School of Physical and Mathematical Sciences in NTU Singapore.
One shortcoming of the new algorithm is that it focuses on minimising the memory usage of the joint density table, without worrying about how long it takes to calculate or process the table. In practical big data applications, calculating the whole joint density table may still take an extremely long time.
Nonetheless, the work of Li, Woodruff, and Yasuda may find uses in certain data management processes. For example, the algorithms used to merge digital databases – so-called “join operations” – typically assume the database columns are independent, and therefore require d^q blocks of memory space. Using insights from the Li-Woodruff-Yasuda algorithm, it may be possible to pre-process the databases being merged, and thereby reduce these steep memory requirements.
“By combining mathematical insights with advanced computational methods, such as machine learning, we can greatly improve our ability to use and analyse the huge volumes of data available in the modern world,” says Assistant Professor Li. “The management of data streams is an important frontier in algorithm design, which I am very excited to be exploring.”