Hashing is a fundamental operation in most online databases, such as a library catalog or an e-commerce website. A hash function generates codes that replace input data. Since these codes are shorter than the actual data, and usually of a fixed length, they make it easier to find and retrieve the original information.
However, since traditional hash functions generate codes randomly, two pieces of data can sometimes be hashed to the same value. This causes collisions — when a search for one item points the user to multiple pieces of data with the same hash value. It takes longer to find the right one, resulting in slower searches and reduced performance.
Certain types of hash functions, known as perfect hash functions, are designed to place data in a way that avoids collisions entirely. But they must be specially constructed for each dataset and take more time to compute than traditional hash functions.
Because hashing is used in so many applications, from database indexing to data compression to cryptography, a fast and efficient hash function is critical. So, researchers from MIT and elsewhere set out to see if they could use machine learning to build better hash functions.
They found that, in certain situations, using learned models instead of traditional hash functions could result in half as many collisions. Learned models are those created by running a machine-learning algorithm on a dataset. Their experiments also show that learned models are often computationally more efficient than perfect hash functions.
“What we found in this work is that in some situations we can have a better tradeoff between computing the hash function and the collisions we’ll face. We can increase the computing time for the hash function a little, but at the same time we can reduce collisions dramatically in some situations,” said Ibrahim Sabek, a postdoc in the MIT Data Systems Group of the Computer Science and Artificial Intelligence Laboratory (CSAIL).
Their research, to be presented at the International Conference on Very Large Databases, shows how a hash function can be designed to significantly speed up searches in a large database. For example, their method could speed up the computational systems scientists use to store and analyze DNA, amino acid sequences, or other biological information.
Sabek is co-lead author of the paper along with electrical engineering and computer science (EECS) graduate student Kapil Vaidya. They are joined by co-authors Dominik Horn, a graduate student at the Technical University of Munich; Andreas Kipf, an MIT postdoc; Michael Mitzenmacher, professor of computer science at the Harvard John A. Paulson School of Engineering and Applied Sciences; and senior author Tim Kraska, associate professor of EECS at MIT and co-director of the Data Systems and AI Lab.
Hashing it out
Given an input of data, or key, a traditional hash function generates a random number, or code, that corresponds to the slot in which to store that key. To use a simple example, if there are 10 keys to be inserted into 10 slots, the function will generate a random integer between 1 and 10 for each input. There is a high probability that two keys will end up in the same slot, causing collisions.
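To make that concrete, here is a minimal Python sketch of the idea (the key list and ten-slot setup are illustrative, not drawn from the paper): a traditional hash function scatters keys across slots essentially at random, so some slots inevitably receive more than one key.

```python
import hashlib

NUM_SLOTS = 10

def traditional_hash(key: str) -> int:
    """Map a key to one of NUM_SLOTS slots, effectively at random."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SLOTS

keys = ["apple", "banana", "cherry", "date", "elderberry",
        "fig", "grape", "honeydew", "kiwi", "lemon"]

slots: dict[int, list[str]] = {}
for key in keys:
    slots.setdefault(traditional_hash(key), []).append(key)

# Any slot holding more than one key is a collision.
for slot, stored in sorted(slots.items()):
    print(slot, stored)
```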
Perfect hash functions provide a collision-free alternative. Researchers give the function some additional information, such as the number of slots in which the data will be placed. It can then perform extra computations to figure out where to place each key so collisions never occur. However, these added computations make the function harder to build and less efficient.
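As a simplified illustration of that tradeoff (real perfect hash constructions are far more sophisticated; this rank-based scheme is a hypothetical stand-in), knowing every key in advance lets us assign each one a unique slot, at the cost of an up-front computation and extra storage:

```python
def build_perfect_hash(keys: list[str]) -> dict[str, int]:
    """Assign each known key a unique slot: its rank in sorted order.

    Collision-free by construction, but it requires knowing all keys
    up front and paying for a sort plus a key-to-slot table.
    """
    return {key: rank for rank, key in enumerate(sorted(keys))}

slot_of = build_perfect_hash(["apple", "banana", "cherry"])
print(slot_of["banana"])  # each key gets its own slot; no collisions
```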
“We wondered, if we know more about the data — that it’s going to come from a certain distribution — can we use learned models to build a hash function that will actually reduce collisions?” said Vaidya.
A data distribution shows all possible values in a dataset and how often each value occurs. The distribution can be used to calculate the probability that a particular value is present in a sample of data.
The researchers take a small sample from a dataset and use machine learning to approximate the shape of the data distribution, or how the data are spread out. The learned model then uses this approximation to predict the location of a key in the dataset.
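As a rough sketch of this idea (simplified, and not the authors' implementation), a learned hash function can estimate the empirical cumulative distribution function (CDF) from a sorted sample and scale it to the number of slots, so that keys drawn from that distribution spread out evenly:

```python
import bisect

def train_cdf_model(sample: list[float]) -> list[float]:
    """'Training' here just means keeping a sorted sample, which serves
    as an empirical estimate of the key distribution."""
    return sorted(sample)

def learned_hash(key: float, model: list[float], num_slots: int) -> int:
    """Map a key to a slot via its estimated CDF position."""
    cdf = bisect.bisect_left(model, key) / len(model)  # fraction of sample <= key
    return min(int(cdf * num_slots), num_slots - 1)

sample = [3, 7, 12, 20, 31, 45, 58, 66, 74, 90]  # small training sample
model = train_cdf_model(sample)
print(learned_hash(40, model, num_slots=100))
```

If incoming keys really do follow the sampled distribution, the CDF spreads them nearly uniformly over the slots, which is exactly what keeps collisions low.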
They found that learned models are easier to build and faster to run than perfect hash functions, and that they lead to fewer collisions than traditional hash functions if the data is distributed in a predictable way. But if the data is not predictably distributed because gaps between data points vary too widely, using learned models can cause more collisions.
“We may have a large number of data inputs, and each one has a different interval between it and the next, so learning is quite difficult,” Sabek explained.
Fewer collisions, faster results
When data is predictably distributed, learned models can reduce the ratio of colliding keys in a dataset from 30 percent to 15 percent, compared to traditional hash functions. They also achieve better throughput than perfect hash functions. In the best cases, the learned models reduced the runtime by about 30 percent.
As they explored the use of learned models for hashing, the researchers also found that throughput was influenced most by the number of sub-models. Each learned model is composed of smaller linear models that approximate the data distribution. With more sub-models, the learned model produces a more accurate approximation, but it takes more time.
“After a certain number of sub-models, you get enough information to build the approximation you need for the hash function. But beyond that, it doesn’t lead to much improvement in collision reduction,” Sabek said.
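One way to picture the sub-model structure (an illustrative sketch assuming a simple two-level design; the paper's actual models may differ) is a set of small linear models, each fit to one region of the sorted key space, with a routing step that picks the right region:

```python
import numpy as np

def fit_submodels(keys, num_submodels):
    """Fit one linear model (slope, intercept) per contiguous region of
    the sorted key space, each approximating the CDF over its region."""
    keys = np.sort(np.asarray(keys, dtype=float))
    positions = np.arange(len(keys)) / len(keys)  # empirical CDF values
    models, boundaries = [], []
    for idx in np.array_split(np.arange(len(keys)), num_submodels):
        slope, intercept = np.polyfit(keys[idx], positions[idx], 1)
        models.append((slope, intercept))
        boundaries.append(keys[idx[-1]])  # last key covered by this region
    return models, boundaries

def predict_slot(key, models, boundaries, num_slots):
    """Route the key to its region's linear model, then scale to a slot."""
    i = min(np.searchsorted(boundaries, key), len(models) - 1)
    slope, intercept = models[i]
    cdf = float(np.clip(slope * key + intercept, 0.0, 1.0))
    return min(int(cdf * num_slots), num_slots - 1)

rng = np.random.default_rng(0)
train_keys = rng.normal(50, 10, size=1_000)
models, bounds = fit_submodels(train_keys, num_submodels=8)
print(predict_slot(55.0, models, bounds, num_slots=500))
```

More sub-models means each line fits its region more tightly, giving a better approximation, but also more fitting and routing work — the diminishing-returns tradeoff Sabek describes.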
Building on this analysis, the researchers plan to use learned models to design hash functions for other types of data. They also plan to explore learned hashing for databases in which data can be inserted or deleted. When data is updated in this way, the model has to change accordingly, but changing the model while maintaining accuracy is a difficult problem.
“We want to encourage the community to use machine learning within more fundamental data structures and operations. Any kind of core data structure gives us the opportunity to use machine learning to capture data properties and get better performance. We still have a lot to explore,” said Sabek.
This work was supported, in part, by Google, Intel, Microsoft, the National Science Foundation, the United States Air Force Research Laboratory, and the United States Air Force Artificial Intelligence Accelerator.