1


[1910.06169v1] The PGMindex: a multicriteria, compressed and learned approach to data indexing
We have introduced the PGMindex, a learned data structure for the fully indexable dictionary problem which improves the query performance and the space occupancy of both traditional and modern learned indexes up to several orders of magnitude
*ABSTRACT The recent introduction of learned indexes has shaken the foundations of the decadesold field of indexing data structures. Combining, or even replacing, classic design elements such as Btree nodes with machine learning models has proven to give outstanding improvements in the space footprint and time efficiency of data systems. However, these novel approaches are based on heuristics, thus they lack any guarantees both in their time and space requirements. We propose the Piecewise Geometric Model index (shortly, PGMindex), which achieves guaranteed I/Ooptimality in query operations, learns an optimal number of linear models, and its peculiar recursive construction makes it a purely learned data structure, rather than a hybrid of traditional and learned indexes (such as RMI and FITingtree). We show experimentally that the PGMindex improves the space of the best known learned index, i.e. FITingtree, by 63.3% and of the Btree by more than four orders of magnitude, while achieving their same or even better query time efficiency. We complement this result by proposing three variants of the PGMindex which address some key issues occurring in the design of modern big data systems. First, we design a compressed PGMindex that further reduces its succinct space footprint by exploiting the repetitiveness at the level of the learned linear models it is composed of. Second, we design a PGMindex that adapts itself to the distribution of the query operations, thus resulting in the first known distributionaware learned index to date. Finally, given its flexibility in the offered spacetime tradeoffs, we propose the multicriteria PGMindex whose speciality is to efficiently autotune itself in a few seconds over hundreds of millions of keys to the possibly evolving spacetime constraints imposed by the application of use.
‹Figure 1: Linear approximation of a multiset of integer keys within the range [27, 46]. The encoding of the (dashed) segment fs takes only two floats, and thus its storage is independent of the number of “encoded” keys. The key k = 37 is repeated three times in the multiset S, starting from position 5 in A, but fs errs by ε = 2 in predicting the position of its first occurrence. (Introduction)Figure 2: Each segment in a PGMindex is “responsible” for routing the queried key to one of the segments of the level below. In the picture, the dotted lines show that the root segment s10 routes the queried key to one of the segments among {s7, s8, s9} (which together cover the same range of keys as s10), whereas s8 routes the queried key to either s3 or s4 (which together cover the same range of keys as s8). Segments at the last level (i.e. levels[2]) are εapproximate segments for the subrange of keys in A depicted by wavy lines of which they are responsible for. (The PGMindex)Figure 3: The PGMindex saved from 37.7% to 63.3% space with respect to a FITingtree over the three realworld datasets. The construction time complexity of the two approaches is the same in theory (i.e. linear in the number of processed keys) and in practice (a couple of seconds, up to hundreds of millions of keys). (Space occupancy of the PGMindex)Figure 4: A loglog plot with the ratio between the number of segments m, stored in the last level of a PGMindex, and the size n of the realworld datasets as a function of ε. For comparison, the plot shows with a dashed line the function 1/(2ε) which is the fraction of the number of keys stored in the level above the input data of B+ tree with B = 2ε (see text). Note that m is 2–5 orders of magnitude less than n. (Space occupancy of the PGMindex)Figure 5: The query performance of several configurations of the PGMindex. The Recursive PGMindex, depicted as a pentagon, had better space and time performance than all the other configurations. (Query performance of the PGMindex)Figure 6: The Recursive PGMindex improved uniformly Recursive Model Index (RMI) with different secondstage sizes and traditional indexes with different page sizes over all possible spacetime tradeoffs. Results of traditional indexes for smaller page sizes are not shown because too far from the plot range. For example, the fastest CacheSensitive Search tree (CSStree) occupied 341 MiB and was matched in performance by a PGMindex of only 4 MiB (82.7× less space); the fastest B+ tree occupied 874 MiB and was matched in performance by a PGMindex which occupied only 87 KiB (four orders of magnitude less space). (Query performance of the PGMindex)Figure 7: The slope compression algorithm of ?? reduces the number of distinct slopes by up to 99.9%. (The Compressed PGMindex)Figure 8: Slope compression reduces the space taken by the slopes up to 81.2%. Longitude is the only dataset on which compression does not help for ε ≥ 29 because of its special features. In this case compression would not be adopted. (The Compressed PGMindex)›



Related: TFIDF
[1801.10207] ATree: A Bounded Approximate Index Structure[1905.08898] ALEX: An Updatable Adaptive Learned Index[1712.01208v3] The Case for Learned Index Structures[1712.01208] The Case for Learned Index Structures[1709.03949] Skyline Queries in O(1) time?1808.02066[1907.00236] Streaming Quantiles Algorithms with Small Space and Update Time[1509.08608] Probabilistic Threshold Indexing for Uncertain Strings[1910.04728] LISA: Towards Learned DNA Sequence Search[1707.01414] Efficient Approximate Query Answering over Sensor Data with Deterministic Error Guarantees
