The fastest SAX parser for Python

I suddenly felt the urge to count all the XML tags in 240 thousand XML files with a total size of 180 GB. In Python, and as fast as possible.

Task


Actually, I wanted to figure out how realistic it would be to convert the Library-Which-Must-Not-Be-Named from FB2 to DocBook. Given the "peculiarities" of FB2, I first had to find out which tags can simply be skipped because they are rare, i.e. just count the number of occurrences of each tag across all the files.
Along the way I planned to compare different SAX parsers. Unfortunately, that comparison fell through: both xml.sax and lxml broke on the very first FB2 file. So xml.parsers.expat it was.
Oh, and one more thing: the *.fb2 files are packed into zip archives.

Initial data


The input data is a snapshot of the Library as of 2013-02-01, pulled in its entirety off the torrents: 242 525 *.fb2 files with a total size of 183 909 288 096 bytes, packed into 56 zip archives with a total size of 82 540 008 bytes.
Platform: Asus X5DIJ (Pentium Dual-Core T4500, 2×2.30 GHz; 2 GB RAM); Fedora 18, Python 2.7.

The code


Written in haste, with some pretense of generality:
#!/bin/env python
# -*- coding: utf-8 -*-
'''
'''
import sys, os, zipfile, hashlib, pprint
import xml.parsers.expat, magic
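# file(1)/libmagic bindings: used below to tell zip archives from bare XML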
mime = magic.open(magic.MIME_TYPE)
mime.load()
tags = dict()
files = 0
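# Python 2 hack: force UTF-8 as the default string encoding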
reload(sys)
sys.setdefaultencoding('utf-8')
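# expat callback: count every opening tag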
def start_element(name, attrs):
	tags[name] = tags[name] + 1 if name in tags else 1
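# Walk a directory in sorted order and feed every entry to parse_file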
def	parse_dir(fn):
	dirlist = os.listdir(fn)
	dirlist.sort()
	for i in dirlist:
		parse_file(os.path.join(fn, i))
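# Dispatch on MIME type: unpack zip archives, parse XML directly, complain about the rest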
def	parse_file(fn):
	m = mime.file(fn)
	if (m == 'application/zip'):
		parse_zip(fn)
	elif (m == 'application/xml'):
		parse_fb2(fn)
	else:
		print >> sys.stderr, 'Unknown mime type (%s) of file %s' % (m, fn)
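# Iterate over the members of a zip archive; files that fail to parse are logged and skipped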
def	parse_zip(fn):
	print >> sys.stderr, 'Zip:', os.path.basename(fn)
	z = zipfile.ZipFile(fn, 'r')
	filelist = z.namelist()
	filelist.sort()
	for n in filelist:
		try:
			parse_fb2(z.open(n))
			print >> sys.stderr, n
		except:
			print >> sys.stderr, 'X:', n
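# Parse a single FB2 document (path or file-like object) with expat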
def	parse_fb2(fn):
	global files
	if isinstance(fn, str):
		fn = open(fn)
	parser = xml.parsers.expat.ParserCreate()
	parser.StartElementHandler = start_element
	parser.Parse(fn.read(), True)
	files += 1
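# Dump the tag counters to result.txt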
def	print_result():
	out = open('result.txt', 'w')
	for k, v in tags.iteritems():
		out.write(u'%s\t%d\n' % (k, v))
	print 'Files:', files
if (__name__ == '__main__'):
	if len(sys.argv) != 2:
	print >> sys.stderr, 'Usage: %s <dir|file>' % sys.argv[0]
		sys.exit(1)
	src = sys.argv[1]
	if (os.path.isdir(src)):
		parse_dir(src)
	else:
		parse_file(src)
	print_result()

Results


Let's fire it up:
time nice ./thisfile.py ~/Torrent/....ec > out.txt 2>err.txt

We get:
* runtime: 74 min 15–45 s (a bit of other work was being done in parallel, and music was playing, naturally);
* the processing speed came out at ~40 MB/s, i.e. roughly 58 CPU cycles per byte (183.9 GB in about 74.5 minutes is ~41 MB/s; at 2.30 GHz that is 2.3·10⁹ / 4·10⁷ ≈ 58 cycles per byte);
* 2584 *.fb2 files (~1%) were rejected (expat is a non-validating parser, true, but not to that extent...);
* result.txt contains all sorts of things...;
* and, finally, what it was all started for: of the 65 FB2 tags only one (output-document-class) is never used; a couple more (output, part, stylesheet) could be skipped as rare; the rest each occur at least 10 thousand times;
* by a rough estimate, reading the files (including unpacking) takes 52% of the time, parsing 40%, and the start_element handler 8% (one way such a breakdown could be measured is sketched below).
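The article doesn't say how the 52/40/8 split was measured; here is a minimal sketch, assuming a single archive and cProfile, of how a similar read/parse/handler breakdown could be obtained. The archive name and helper names are illustrative, not part of the script above.
# Hypothetical profiling sketch: reading shows up under ZipFile.read,
# parsing under Parse, handler cost under start_element.
import cProfile, pstats, zipfile
import xml.parsers.expat

def count_tags(data, tags):
	def start_element(name, attrs):          # handler cost accumulates here
		tags[name] = tags.get(name, 0) + 1
	parser = xml.parsers.expat.ParserCreate()
	parser.StartElementHandler = start_element
	parser.Parse(data, True)                 # parsing cost accumulates here

def run(zip_path):
	tags = {}
	z = zipfile.ZipFile(zip_path)
	for name in z.namelist():
		data = z.read(name)                  # reading + unpacking cost accumulates here
		try:
			count_tags(data, tags)
		except xml.parsers.expat.ExpatError:
			pass                             # broken files are simply skipped

cProfile.run('run("some-archive.zip")', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(15)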

Can it be done any faster? In Python?
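One obvious direction, sketched below under the assumption that the 56 archives are independent and their per-archive counters can simply be merged, is to spread the zips over worker processes. The helper name and the '*.zip' glob are illustrative, not the author's code.
# Hedged sketch: fan the independent zip archives out to a process pool
# and merge the per-archive tag counters at the end.
import glob, zipfile, multiprocessing
import xml.parsers.expat

def count_tags_in_zip(zip_path):
	tags = {}
	def start_element(name, attrs):
		tags[name] = tags.get(name, 0) + 1
	z = zipfile.ZipFile(zip_path)
	for member in z.namelist():
		parser = xml.parsers.expat.ParserCreate()
		parser.StartElementHandler = start_element
		try:
			parser.Parse(z.read(member), True)
		except xml.parsers.expat.ExpatError:
			pass                              # skip broken files, as the original does
	return tags

if __name__ == '__main__':
	pool = multiprocessing.Pool()             # defaults to one worker per core
	total = {}
	for partial in pool.map(count_tags_in_zip, sorted(glob.glob('*.zip'))):
		for tag, n in partial.items():
			total[tag] = total.get(tag, 0) + n
	print(sorted(total.items()))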
