Converting the lib.ru library to EPUB format using Java
Good day, everyone. I recently got an e-reader, a Kobo Touch, and the question of where to get books came up. The notorious Flibusta is certainly a good thing, and I take many books from there, but I was still drawn to lib.ru. So, out of curiosity, I decided to write a converter. If you hate bad code, think twice before reading on, because the code here is truly brutal.
After analyzing the library catalog, it immediately became clear that most books have the same scheme, namely:
[ Author ] . [ Title ]
[ Technical data ]
[ # Chapter ]
[ Text of the chapter ]
[ # Chapter ]
[ Text of the chapter ]
And so on. I came across other layouts as well, but did not bother supporting them. Note that [ Author ] [ Title ] and [ # Chapter ] sit between certain tags - "" and "" (these are two different characters, ASCII format).
All that remains is to write a simple parser for the page. I used Java. The first question was which encoding to read the data in, because as far as I could tell the encoding varies from page to page. For this I turned to a third-party library, juniversalchardet. I detect the encoding and store its name in a string.
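import java.io.InputStream;
import java.net.URLConnection;
import org.mozilla.universalchardet.UniversalDetector;

URLConnection con = url.openConnection();
con.connect();
InputStream urlfs = con.getInputStream();
byte[] buf = new byte[4096];
UniversalDetector detector = new UniversalDetector(null);
int nread;
// Feed the detector until it is confident or the stream ends.
while ((nread = urlfs.read(buf)) > 0 && !detector.isDone()) {
    detector.handleData(buf, 0, nread);
}
urlfs.close();
detector.dataEnd();
// May be null if the encoding could not be detected.
String encoding = detector.getDetectedCharset();
detector.reset();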
Next, I read the page using BufferedReader.
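BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), encoding));
StringBuilder page = new StringBuilder();
String str;
// readLine() strips line terminators, so the lines are joined without separators.
while ((str = in.readLine()) != null) {
    page.append(str);
}
in.close();
string = string + page;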
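The two snippets above open the URL twice: once for detection and once for reading. As a rough sketch of an alternative that needs only the standard library, the page can be read into memory once and then decoded with whatever charset name the detector reported. The class `PageDecoder`, its method names, and the UTF-8 fallback are my own assumptions for illustration; they are not part of the original converter.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original converter.
final class PageDecoder {

    // Reads the whole stream into memory so the same bytes can be
    // passed to the detector and decoded without a second download.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Decodes the bytes with the detected charset name. Falls back to
    // UTF-8 (an assumption) when the name is null, malformed or unsupported.
    static String decode(byte[] data, String detectedName) {
        Charset charset = StandardCharsets.UTF_8;
        try {
            if (detectedName != null && Charset.isSupported(detectedName)) {
                charset = Charset.forName(detectedName);
            }
        } catch (IllegalArgumentException e) {
            // Malformed charset name: keep the fallback.
        }
        return new String(data, charset);
    }
}
```

With this, the bytes returned by `readAll` could be handed to the detector in one `handleData` call and then to `decode`, so the page is fetched only once.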
To make parsing easier, I replace one marker character with the other and append one more marker at the end of the file.
string = string.replace("", "");
/* The page has no marker characters at the end, so I append them to make parsing easier. */
string = string + " ";
Finally, I count how many chapters the book has (as mentioned above, chapter titles sit between the two markers). I subtract one, because I added one marker myself.
int count = 0;
for (char c : string.toCharArray()) {
    if (c == '') {
        count++;
    }
}
loop = (count - 1);
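To illustrate the normalization and counting steps above in a form that compiles: the sketch below maps two marker characters onto a single delimiter, appends one closing delimiter, and counts the occurrences. The actual marker characters are not legible in this text, so `OPEN` and `CLOSE` are placeholders chosen purely for illustration, and `MarkerCounter` is a hypothetical name.

```java
// Hypothetical helper; OPEN and CLOSE are placeholder markers,
// not the real characters used on lib.ru pages.
final class MarkerCounter {
    static final char OPEN = '\u0001';
    static final char CLOSE = '\u0002';

    // Replaces every OPEN marker with CLOSE, so that one delimiter
    // remains, and appends a final CLOSE at the end of the page.
    static String normalize(String page) {
        return page.replace(OPEN, CLOSE) + CLOSE;
    }

    // Counts how many times the delimiter occurs in the text.
    static int countDelimiters(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == CLOSE) {
                count++;
            }
        }
        return count;
    }
}
```

Subtracting one from the result of `countDelimiters` then compensates for the delimiter that `normalize` appended.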
We are nearly done. What remains is to split the content into chapters, the description, and the author with the title.
/* The third array cell holds a string of the form "Author. Title". Split it at the first dot. */
String[] authorandtitle = parsedstring[2].split("\\.", 2);
AUTHOR = authorandtitle[0].trim();
TITLE = authorandtitle[1].trim();
/* Now collect the chapters: titles sit at even array indices and text at odd ones. An eyesore, I agree. */
for (int i = 4; i <= loop; i++) {
    if ((i % 2) == 0) {
        CHAPTER[i] = parsedstring[i];
        HEADER[i] = parsedstring[i];
    } else {
        PARAG[i] = parsedstring[i];
    }
}
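The snippet above relies on a `parsedstring` array whose construction is not shown. A minimal sketch of how such a split could work, assuming the page has already been reduced to a single delimiter and that each chapter title is followed directly by its text: the segments are paired into small `Chapter` objects instead of parallel arrays indexed by parity. `ChapterSplitter`, `Chapter`, and the `DELIMITER` value are hypothetical names and placeholders, not taken from the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; DELIMITER is a placeholder, not the real lib.ru marker.
final class ChapterSplitter {
    static final String DELIMITER = "\u0002";

    // One chapter: a title and the text that follows it.
    static final class Chapter {
        final String title;
        final String text;

        Chapter(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    // Splits "Author. Title" at the first dot only, so dots inside
    // the title survive. Returns {author, title}.
    static String[] splitAuthorAndTitle(String s) {
        String[] parts = s.split("\\.", 2);
        String title = parts.length > 1 ? parts[1].trim() : "";
        return new String[] { parts[0].trim(), title };
    }

    // Splits the page on the delimiter and pairs segments as
    // (title, text), starting at firstTitleIndex.
    static List<Chapter> chapters(String page, int firstTitleIndex) {
        // Limit -1 keeps trailing empty segments, so indices stay predictable.
        String[] segments = page.split(Pattern.quote(DELIMITER), -1);
        List<Chapter> result = new ArrayList<>();
        for (int i = firstTitleIndex; i + 1 < segments.length; i += 2) {
            result.add(new Chapter(segments[i].trim(), segments[i + 1]));
        }
        return result;
    }
}
```

For the layout used in the article, `firstTitleIndex` would be 4, matching the loop that starts at `i = 4`.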
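Of course, all of this could have been pulled out with regular expressions, but I wrote it the way it first came to mind. Now everything is ready for creating the document. I used the EPUBGen library to build the final file. Fortunately, its examples are very informative, so this took just a couple of minutes. First, create the publication and fill in the metadata.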
Publication epub = new Publication();
epub.addDCMetadata("title", TITLE);
epub.addDCMetadata("creator", AUTHOR);
epub.addDCMetadata("language", "ru-RU");
Next, the cover image has to be stored in the OPS/images directory and referenced from the cover.xhtml document.
DataSource dataSource = new FileDataSource(new File(cover));
BitmapImageResource imageResource = epub.createBitmapImageResource(
"OPS/images/cover.jpg", "image/jpeg", dataSource);
DataSource coverdata = new StringDataSource("\n\n\nCover \n\n\n\n\n\n\n\n");
Resource coverres = epub.createResource("OPS/cover.xhtml", "xhtml", coverdata);
epub.addToSpine(coverres);
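The markup passed to `StringDataSource` above is not legible in this text. As a rough idea of what a minimal cover page could look like, here is a sketch that builds an XHTML string with a single image reference. The exact markup is my assumption and not the author's original, and `CoverPage` is a hypothetical name. Since cover.xhtml lives in OPS/ and the image in OPS/images/, the relative link would be `images/cover.jpg`.

```java
// Hypothetical helper; the markup is an assumed minimal cover page,
// not the one used in the original converter.
final class CoverPage {

    // Escapes the characters that are unsafe inside an XML attribute value.
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;")
                    .replace("\"", "&quot;")
                    .replace("<", "&lt;");
    }

    // Builds a minimal XHTML document that shows one image.
    static String xhtml(String imageHref) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
             + "<head>\n<title>Cover</title>\n</head>\n"
             + "<body>\n"
             + "<div><img src=\"" + escapeAttribute(imageHref) + "\" alt=\"Cover\"/></div>\n"
             + "</body>\n"
             + "</html>\n";
    }
}
```

The result of `CoverPage.xhtml("images/cover.jpg")` could then be wrapped in a `StringDataSource` in place of the literal string.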
The last step is to get the table of contents and then, in a loop, add the content itself.
NCXResource toc = epub.getTOC();
TOCEntry rootTOCEntry = toc.getRootTOCEntry();
for (int i = 4; i <= loop; i++) {
    if ((i % 2) == 0) {
        /* Create the chapter. */
        OPSResource main = epub.createOPSResource("OPS/" + i + ".html");
        epub.addToSpine(main);
        /* Open the chapter document. */
        mainDoc = main.getDocument();
        /* Add the chapter to the table of contents. */
        TOCEntry mainTOCEntry = toc.createTOCEntry(CHAPTER[i], mainDoc.getRootXRef());
        rootTOCEntry.add(mainTOCEntry);
        body = mainDoc.getBody();
        /* Add the heading. */
        Element h1 = mainDoc.createElement("h1");
        h1.add(HEADER[i]);
        body.add(h1);
    } else {
        /* Add the main text. */
        Element paragraph = mainDoc.createElement("p");
        paragraph.add(PARAG[i]);
        body.add(paragraph);
    }
}
Finally, I save the resulting document.
OCFContainerWriter writer = new OCFContainerWriter(
new FileOutputStream(output));
epub.serialize(writer);
The end result came out as follows:
The interface is plain Swing, cheap and cheerful.
Since I did not put much effort into understanding the whole library, it only works with books in the old layout (the simple text layout), such as this one.
If you survived this much awful code, you are welcome to grab the binary from Bitbucket here.