Pocket OLAP in JavaScript, and IndexedDB Performance

Hello, Habr!

Recently I decided to test JavaScript performance by building a simple web application that constructs pivot tables from poorly structured data. Replicating all the functionality of Excel or grown-up OLAP systems was not the goal; I wanted to benchmark JavaScript in general, and IndexedDB in particular, across various desktop and mobile browsers. Looking ahead: after completing the first stage of the work - building a pivot table with a single-pass algorithm over the stored facts (indexing frequently used dimensions and caching computed aggregates are postponed for the future) - I was disappointed by the read performance of IndexedDB, surprised that mobile browsers barely lagged behind desktop ones, and puzzled by the epic failure of my beloved Firefox in one of the tests. Two operations were measured:

  • building the pivot table, where the core of the algorithm is a single loop over an IndexedDB cursor, plus work with Object, Array, Set and Map objects (key extraction, insertion, iteration), string concatenation and simple arithmetic;
  • decoding (drillthrough) of pivot table rows with output to the DOM, where the core of the algorithm is fetching single records from IndexedDB by key many times in a loop, and then writing the results into an html table in groups of 100 rows via insertAdjacentHTML('beforeEnd', html).
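The batched DOM output in the second operation can be sketched roughly as follows (a minimal illustration, not the application's actual code; the names buildBatches, BATCH and the row format are assumptions):

```javascript
// Illustrative sketch: emitting drillthrough rows in batches of 100, so the
// browser re-renders the table once per batch instead of once per row.
const BATCH = 100;

function buildBatches(rows) {
	// rows: array of already-escaped cell arrays, e.g. [['milk-2.5', 15.5], ...]
	const batches = [];
	for (let i = 0; i < rows.length; i += BATCH) {
		const html = rows.slice(i, i + BATCH)
			.map(cells => '<tr><td>' + cells.join('</td><td>') + '</td></tr>')
			.join('');
		batches.push(html);
	}
	return batches;
}

// In the browser, each batch would then be appended to the table body with:
//   tbody.insertAdjacentHTML('beforeEnd', batch);
```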

Testing was carried out on a JSON file containing 20 thousand facts, of which 9 records formed a product reference book and the rest represented purchase/sale operations. The table with the test results on a netbook and a phone (times in seconds), as well as implementation details and conclusions, are below the cut.

                                            Firefox_linux 64.0    Chromium_linux        Falkon_linux
                                                                  71.0.3578.80          QtWebEngine 5.9.5
1  Full calculation of 20 000 facts:
   filter, row grouping, aggregates         12.31, 10.21, 10.69   4.76, 4.38, 4.43      5.08, 4.76, 4.73
2  No filter, computed attributes or
   aggregates: row grouping only            8.08, 8.14, 8.15      3.4, 3.5, 3.48        3.39, 3.37, 3.42
3  No filter, grouping or calculations:
   "bare" loop over the IndexedDB cursor    7.83, 7.72, 7.71      3.36, 3.38, 3.44      3.23, 3.11, 3.17
4  Drillthrough with output of 20 000
   rows to an html table                    108                   90.5                  100
5  Drillthrough without DOM output:
   "bare" loop over an array of keys,
   fetching records one at a time           11.57, 14.71, 11.52   18.62, 18.91, 18.27   18.01, 18.09, 18.11

                                            Firefox_android        Chrome_android        Edge_android
                                            63.0.2                 71.0.3578.98          42.0.0.2804
   Full calculation of 20 000 facts
   on the phone                             13.58, 13.15, 13.67    5.89, 5.84, 5.91      6.48, 6.45, 6.51

Initial data


Since our database is NoSQL, the data schema and the relationships between entities are not declared in advance. The input can be any JSON file containing an array of objects, in which reference book entries, transactions, document lines, etc. may be mixed together. Relationships are established at the moment rows are grouped (or formulas are computed), based on matching attribute values, so one could say that human-readable text keys are used. For example, product information and buy/sell transactions are represented in the fact database by the following records:

[
	{"product": "milk-2.5", "EAN-code": "4770148229477-3"},
	{"product": "petrol-95", "EAN-code": "74820123490016-3"},
	{"product": "fish-pollock", "EAN-code": "4640014465660-3"},
	{"operation": "purchase", "partner": "nalchik-moloko", "product": "milk-2.5", "price": 15.5, "quantity": 50},
	{"operation": "purchase", "partner": "lukoil", "product": "petrol-95", "price": 35, "quantity": 200},
	{"operation": "purchase", "partner": "china-fish", "product": "fish-pollock", "price": 90, "quantity": 100},
	{"operation": "sale", "partner": "ivanov-petr", "product": "milk-2.5", "price": 20.30, "quantity": 3.5},
	{"operation": "sale", "partner": "smith-john", "product": "petrol-95", "price": 42, "quantity": 30},
	{"operation": "sale", "partner": "ivanov-petr", "product": "fish-pollock", "price": 120, "quantity": 2}
]

Formula Syntax


Plain JavaScript syntax is used, plus three additional functions:

  • fact(columnname) - returns the value of an attribute (including a computed one) of the current fact;
  • row(columnname) - returns the current (intermediate) value in the pivot table row that this fact falls into;
  • selectlast(columnname, where) - returns the attribute value of the last fact in the set of facts satisfying the where condition.

Example of a formula for the filter:

['purchase', 'sale'].includes(fact('operation'))

Example of a calculated fact attribute:

"amount": "round(fact('price') * fact('quantity'), 2)"

Examples of formulas for pivot table columns (aggregates, post-aggregate calculations, queries to reference books):

{
	"quantity-sum": "row('quantity-sum') + fact('quantity')",
	"amount-sum": "round(row('amount-sum') + fact('amount'), 2)",
	"quantity-avg": "round(row('quantity-sum') / row('count'), 2)",
	"price-max": "fact('price') > row('price-max') ? fact('price') : row('price-max')",
	"price-min": "row('price-min') == null || fact('price') < row('price-min') ? fact('price') : row('price-min')",
	"price-sum": "row('price-sum') + fact('price')",
	"price-avg": "round(row('price-sum') / row('count'), 2)",
	"price-avg-weight": "round(row('amount-sum') / row('quantity-sum'), 2)",
	"EAN-code": "selectlast('EAN-code', 'fact(\"product\") == row(\"product\")')"
}

Algorithm features


Since the algorithm is single-pass, at any moment there is one current fact and one pivot table row that this fact falls into. Thus, using the fact() and row() functions, it is possible to compute the simplest aggregates such as sum, min/max, average, etc. More complex statistical functions, which require storing the entire array of values, are not implemented yet.
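The single-pass accumulation can be sketched like this (a simplified stand-in for the real engine: here the facts come from an array rather than an IndexedDB cursor, and the function and column names are illustrative):

```javascript
// Minimal sketch of single-pass aggregation: one loop over facts, updating
// the accumulator row each fact falls into - what row()/fact() do inside
// the compiled formulas.
function pivot(facts, groupKey) {
	const rows = new Map(); // group value -> accumulator row
	for (const fact of facts) {
		const key = fact[groupKey];
		let row = rows.get(key);
		if (!row) {
			row = { count: 0, 'quantity-sum': 0, 'price-min': null, 'price-max': -Infinity };
			rows.set(key, row);
		}
		row.count += 1;
		row['quantity-sum'] += fact.quantity;
		row['price-max'] = fact.price > row['price-max'] ? fact.price : row['price-max'];
		row['price-min'] = row['price-min'] == null || fact.price < row['price-min']
			? fact.price : row['price-min'];
	}
	return rows;
}
```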

It was harder to implement a left join (the selectlast() function) - pulling additional attributes from reference books (essentially from other facts) - within a single pass over the stored facts. I assumed that the number of reference book entries will always be orders of magnitude smaller than the amount of operational data, and solved the problem as follows: simultaneously with grouping rows and computing aggregates, all reference book facts (that is, facts carrying the required attribute) are copied into a separate sandbox. After the pivot table rows have been formed, a second pass over the sandbox fills in the missing attributes in the corresponding columns of the pivot table. Thus there is only one heavy loop.
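The second, cheap pass over the sandbox might look something like this (a hedged sketch of the approach just described; fillFromSandbox and its parameters are illustrative names, not the application's API):

```javascript
// Sketch of the sandbox left join: reference facts collected during the main
// cursor loop are indexed once, then pivot rows get their missing attribute.
function fillFromSandbox(pivotRows, sandbox, joinKey, attr) {
	const lookup = new Map();
	for (const ref of sandbox) {
		lookup.set(ref[joinKey], ref[attr]); // last entry wins, like selectlast()
	}
	for (const row of pivotRows) {
		if (lookup.has(row[joinKey])) row[attr] = lookup.get(row[joinKey]);
	}
	return pivotRows;
}
```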

The last optimization was compiling all formulas into JS functions at startup, to avoid calling eval() inside the loop.
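One way to do such compilation is the Function constructor, which turns each formula string into a real function exactly once (a sketch under assumptions: the real engine may pass its helpers differently; compileFormula and the accessor signatures are illustrative):

```javascript
// Compile a formula string once at startup; fact()/row()/round are supplied
// by the engine as arguments on every call, so no eval() runs in the loop.
function compileFormula(src) {
	return new Function('fact', 'row', 'round', 'return (' + src + ');');
}

// Helper used by the example formulas in the article.
const round = (x, n) => Math.round(x * 10 ** n) / 10 ** n;

const amount = compileFormula("round(fact('price') * fact('quantity'), 2)");
```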

Performance testing


To my surprise, inserting 20 thousand facts into a non-indexed ObjectStore took about a second, but extracting the data turned out to be the real problem.

The poor result in row 4 is understandable: output to the DOM, and into a table at that, is a predictably expensive operation, and rather pointless in production anyway, since nobody can work normally with such a table.

Of most interest are rows 3 and 5, which represent the "bare" read performance of IndexedDB. In these tests I commented out the entire algorithm, leaving only the database operations: in row 3 a single large cursor is opened and iterated over; in row 5, conversely, a (previously prepared) array of keys is iterated over and records are fetched one at a time. The key is an integer with auto-increment.

Here are the code snippets for tests 3 and 5, respectively:

// single loop over the facts
db.transaction('facts', 'readonly').objectStore('facts').openCursor(undefined, 'prev').onsuccess = (ev) => {
	let cur = ev.target.result;
	if (cur) {
		cfact = cur.value;
		// ... fact processing, commented out for the test ...
		// next fact
		cur.continue();
	}
}

rowout(0);
function rowout(i) {
	if (i < ids.length) {
		db.transaction('facts', 'readonly').objectStore('facts').get(ids[i]).onsuccess = (ev) => {
			cfact = ev.target.result;
			// ... row output, commented out for the test ...
			rowout(i+1);
		}
	}
}

Since the IndexedDB interface is asynchronous, in both cases we effectively get tail recursion through callbacks, which the engine must handle without growing the stack (otherwise it would have crashed with a stack overflow); but the reason for such epic slowness is beyond me. Curiously, Firefox, the loser in all other tests, won the single-request test by a clear margin, which gives hope that it will cope well with index-based queries.

Conclusion


I felt sad, and I am not sure I will continue the experiments. Of course, one can build indexes and abandon the full scan, but the aggregates still have to be summed in a loop, and, as we can see, reading from a cursor is the most expensive operation. Perhaps it is worth abandoning IndexedDB altogether, storing the data on disk as JSON (after all, parsing takes mere seconds), and rewriting the algorithm in wasm. Or waiting for Rust to land in browsers (just kidding).

The application is available here, the test data file here.

I would be grateful for advice and simply for smart thoughts, since JavaScript is not my first language.

PS
More senior colleagues have explained what the problem is. All IndexedDB requests are executed in a separate thread (a worker), and even when walking a cursor there is a cross-thread call for every record, which is very costly: on top of the callback itself there is the cost of creating a message and copying all the data (JavaScript cannot pass complex objects across threads by reference). Thus, high performance is incompatible with an asynchronous cross-thread interface. Hopefully browsers will eventually solve this problem.
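The "block storage" fix mentioned in the follow-up below can be sketched as packing many facts into each ObjectStore record, so a full scan costs a handful of cross-thread callbacks instead of one per fact (a hypothetical illustration: packIntoBlocks and the block size are my assumptions, not the actual implementation):

```javascript
// Pack facts into blocks of BLOCK_SIZE; each block is stored as a single
// IndexedDB record, so scanning 20 000 facts needs ~20 callbacks, not 20 000.
const BLOCK_SIZE = 1000;

function packIntoBlocks(facts) {
	const blocks = [];
	for (let i = 0; i < facts.length; i += BLOCK_SIZE) {
		blocks.push({ id: blocks.length, facts: facts.slice(i, i + BLOCK_SIZE) });
	}
	return blocks; // each element becomes one ObjectStore record
}
```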

PPS
Follow-up 1: the performance problem is solved by organizing block data storage.
Follow-up 2: further speed-up via multi-threaded computation.
