Friday, July 20, 2012

Altering ebooks -- adding pages to an existing epub


I've been working with ebooks (specifically epub files) lately, particularly with modifying them.  There are some great tools out there to help with this work.

Calibre is fantastic at converting between formats, and better yet it has a command-line interface, so it can be part of an automated script.  It can also do some simple editing of metadata, allowing you to update the author, title, etc. of a book.  But there's no functionality for editing the content of an ebook.

Sigil is another awesome tool.  This is the go-to program for editing the content of an ebook.  It has one major drawback for me though -- it's not scriptable.  There's no way to do something like adding an informational page into an existing ebook without doing it by hand.

I did some research but couldn't find anything that would let me add a page to an ebook non-interactively, so I wrote it myself.  I thought this might be useful to other people looking to modify epub files, so I've included it below.

A couple of notes to keep in mind:

  • I wrote this in PHP because I needed to interface with a large existing PHP codebase.  This would be even easier to do in Python, but the logic here is pretty straightforward and should be easily adapted.
  • The epub format is really simple.  At heart it's some XML, some HTML (or XHTML), maybe a few images, all wrapped up in a zip container.  That's fortunate in that there are a lot of libraries out there to work on exactly these formats.
  • That said, there are some tricky bits to how the epub zip has to be structured.  You can't just throw everything into a zip and rename it, and I couldn't find a way to get the built-in PHP zip library to produce the required structure.
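
To give a feel for how little is needed on the XML side, here is a rough sketch of reading a container file with SimpleXML.  The sample XML and its full-path value are made up for illustration, and find_package_path is a hypothetical helper name, not part of any library.

```php
<?php
// Sample container file, shaped like META-INF/container.xml inside an epub.
// The full-path value is invented for this example.
$container_xml = <<<XML
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
XML;

// Hypothetical helper: return the path (inside the epub) of the package file
// that the container points at.
function find_package_path(string $container_xml): string
{
    $container = new SimpleXMLElement($container_xml);
    // Take the first rootfile entry and read its full-path attribute
    $path = (string) $container->rootfiles->rootfile["full-path"];
    if ($path === "") {
        throw new RuntimeException("No rootfile entry found in container XML.");
    }
    return $path;
}

$package_path = find_package_path($container_xml);
```

The returned path is relative to the root of the unzipped epub, which is why the script joins it onto the temporary directory.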
The code is here, and a quick text overview follows after.

<?php
    /*** Helper function from php.net ***/
    // This allows the delete of a directory and its contents
    function rrmdir($dir)
    { 
        if (is_dir($dir))
        { 
            $objects = scandir($dir); 
            foreach ($objects as $object)
            { 
                if ($object != "." && $object != "..")
                     if (filetype($dir."/".$object) == "dir")
                         rrmdir($dir."/".$object);
                     else
                         unlink($dir."/".$object);  
            } 
        reset($objects); 
        rmdir($dir); 
       }
     }


    /*** Setup ***/
    // Since this is an example, we can hard-code some things...
    $loc = '/home/fader/Projects/Libboo/epub-test'; // Where epubs live
    $epub = 'test.epub'; // The epub we will modify (relative to the current directory)
    $newepub = 'new.epub'; // The new epub we will generate
    $added_page = 'newpage.xhtml'; // The new page we're going to insert (also relative to the current directory)


    // Allocate a directory to work in
    $temp_path = sys_get_temp_dir() . "/" . uniqid("epub-");
    mkdir($temp_path) or die("Couldn't create temporary path.");


    /*** Let's do this thing! ***/
    // Open the epub archive
    $zip = new ZipArchive;
    $res = $zip->open($epub);
    if ($res !== TRUE)
        die("Couldn't open epub as a zip.");


    // Unzip the epub into a temporary location
    if (!$zip->extractTo($temp_path))
        die("Couldn't extract the epub.");
    $zip->close();


    // *** Dig into the ebook container
    // The path is defined by the epub spec, so as long as this is a compliant
    // epub file, we should be able to find it at this location
    $container_path = $temp_path . "/META-INF/container.xml";
    $container_xml = file_get_contents($container_path);
    if ($container_xml === FALSE)
        die("Couldn't open container XML file.");


    // Look in the container to find the package file, which holds both the
    // list of files ("item"s) and the spine ("itemref"s)
    $container = new SimpleXMLElement($container_xml);
    $spine_path = $temp_path . "/" . $container->rootfiles[0]->rootfile["full-path"];


    // Pull up the spine
    $spine_xml = file_get_contents($spine_path);
    if ($spine_xml === FALSE)
        die("Couldn't open the package file containing the spine.");


    // Copy the new page into the correct location
    if (!copy($added_page, dirname($spine_path) . "/" . basename($added_page)))
        die("Unable to copy new page into temporary location.");


    // *** Decide where to insert a node
    // For this example, we'll just plug it in as the third element
    // Unfortunately, SimpleXML is too... simple to let us insert a node into
    // an arbitrary position, so we use the DOM object
    $dom = new DOMDocument;
    $dom->loadXML($spine_xml);
    // Fortunately the structure for an epub spine is pretty simple.  So we can
    // just get the list of pages ("item"s) and run down the tree a bit.
    $items = $dom->getElementsByTagName("item");
    $itemrefs = $dom->getElementsByTagName("itemref");


    // Let's grab the third element
    // (NB: Pretty much any epub should have at least 3 items.
    // (ncx, css, title, pages...)  But boundary checks are always a Good Thing.)
    // The list of itemrefs is checked too, since it can be shorter than the
    // list of items.
    if ($items->length < 3 || $itemrefs->length < 3)
        die("Book is ridiculously short.");


    // *** Create and insert the new nodes
    // We'll need two nodes here -- one for the "item" and one for the "itemref".
    // Both need to be present for the new page to be found by the reader.
    // Node lists are zero-indexed, so inserting before item(2) makes the new
    // node the third element.
    $newitem = $dom->createElement("item");
    $newitem->setAttribute("id", "newpageid0");
    $newitem->setAttribute("href", basename($added_page));
    $newitem->setAttribute("media-type", "application/xhtml+xml");
    $insert_point_item = $items->item(2);
    $insert_point_item->parentNode->insertBefore($newitem, $insert_point_item);


    $newitemref = $dom->createElement("itemref");
    $newitemref->setAttribute("idref", "newpageid0");
    $newitemref->setAttribute("linear", "yes");
    $insert_point_itemref = $itemrefs->item(2);
    $insert_point_itemref->parentNode->insertBefore($newitemref, $insert_point_itemref);


    // *** Write it out
    $newxml = $dom->saveXML();
    $result = file_put_contents($spine_path, $newxml);
    if ($result === FALSE)
        die("Unable to write new XML file.");


    // *** Zip everything back up again
    // The mimetype needs to be the first entry and stored, not compressed.
    // Unfortunately I have not seen a way to do this with the PHP ZipArchive object.
    // This is the quick, dirty, nonportable, ugly way to do it: call the system
    // zip binary.  The first zip call adds mimetype uncompressed (-0) without
    // extra file attributes (-X); the second adds everything else recursively.
    // Since we're already calling the system zip binary, this is about 30 lines smaller
    // than using the PHP zip object to accomplish the same thing:
    $zipfile = escapeshellarg($newepub);
    system("cd " . escapeshellarg($temp_path) . " && zip -q0Xj $zipfile mimetype && zip -qXr $zipfile * -x mimetype");


    /*** Clean up after ourselves ***/
    // Move the new epub file to the working directory
    if (!rename($temp_path . "/" . $newepub, $loc . "/" . $newepub))
        die("Unable to move new epub file to $loc.");
    // Delete the temporary path
    rrmdir($temp_path);
?>

In short, here's what the above does:

  • Sets up a convenience function for cleaning up later
  • Extracts the contents of the epub file into a temporary location
  • Reads the container XML file (specified by the epub spec) to find the index of files (which could be in an arbitrary location inside the epub)
  • Copies in the new page to be added
  • Creates two XML nodes
    • One is the location of the file containing the new page
    • The other is a reference indicating where in the book that page should fall
  • Adds these nodes to the index
  • Zips everything back up
  • Moves the new epub to a specified location
  • Cleans up the temporary files created
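
The node-creation steps in the list above can also be wrapped in a small function.  This is only a sketch under a few assumptions: insert_spine_page is a hypothetical helper of my own naming, it appends the "item" to the end of the manifest (as far as I know only the order of the "itemref"s affects reading order), and it creates the new nodes in the same namespace as the existing manifest.

```php
<?php
// Hypothetical helper: register a page in a package document and place it at
// a zero-based position in the reading order.
function insert_spine_page(DOMDocument $opf, string $id, string $href, int $position): void
{
    $manifest = $opf->getElementsByTagName("manifest")->item(0);
    $spine    = $opf->getElementsByTagName("spine")->item(0);
    if ($manifest === null || $spine === null) {
        throw new RuntimeException("Document has no manifest or no spine.");
    }

    // Reuse the namespace of the existing elements for the new ones
    $ns = $manifest->namespaceURI;

    // The "item" says which file holds the page
    $item = $opf->createElementNS($ns, "item");
    $item->setAttribute("id", $id);
    $item->setAttribute("href", $href);
    $item->setAttribute("media-type", "application/xhtml+xml");
    $manifest->appendChild($item);

    // The "itemref" says where in the book the page falls
    $itemref = $opf->createElementNS($ns, "itemref");
    $itemref->setAttribute("idref", $id);

    // Insert before the itemref currently at $position; a null reference
    // node makes insertBefore() append instead
    $refs   = $spine->getElementsByTagName("itemref");
    $before = $position < $refs->length ? $refs->item($position) : null;
    $spine->insertBefore($itemref, $before);
}
```

With the variables from the script, a call along the lines of insert_spine_page($dom, "newpageid0", basename($added_page), 2) would put the new page third in the reading order.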
It's pretty straightforward, all told.  The tricky bit is in zipping the files up -- epub requires that the mimetype file (specifying that it is an epub) be the first file in the archive and be stored rather than compressed.  This bit's tricky in PHP, so I copped out and just called the native system binary.
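
For what it's worth, PHP 7.0 and later give ZipArchive a setCompressionName() method, so the zip step may be doable without the system binary.  The following is a minimal sketch under that assumption, not a tested replacement: build_epub is a hypothetical helper name, and I haven't run the result through an epub validator.

```php
<?php
// Hypothetical helper: zip an unpacked epub directory using only ZipArchive.
// Assumes PHP 7.0+ for ZipArchive::setCompressionName().
function build_epub(string $source_dir, string $epub_path): void
{
    $source_dir = rtrim($source_dir, "/\\");

    $zip = new ZipArchive;
    if ($zip->open($epub_path, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        throw new RuntimeException("Couldn't create $epub_path.");
    }

    // The mimetype entry goes in first and is stored, not compressed
    $zip->addFromString("mimetype", "application/epub+zip");
    $zip->setCompressionName("mimetype", ZipArchive::CM_STORE);

    // Everything else is added afterwards with the default compression
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($source_dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        // Path inside the archive, relative to the epub root, with "/" separators
        $relative = substr($file->getPathname(), strlen($source_dir) + 1);
        $relative = str_replace(DIRECTORY_SEPARATOR, "/", $relative);
        if ($relative !== "mimetype") {
            $zip->addFile($file->getPathname(), $relative);
        }
    }

    if (!$zip->close()) {
        throw new RuntimeException("Couldn't write $epub_path.");
    }
}
```

The system zip call in the script is still the shorter option; this version only trades the external dependency for a few more lines of PHP.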

If anyone has any questions I'm happy to discuss this... it's a fun toy problem!



5 comments:

  1. Hi Ronald.
    Is it possible to insert a page including a php variable? It's just a short text with the variable in.
    I have tried to send the parameter in the url (to the xhtml file) and use javascript in the xhtml file to get the parameter and insert it in the text, but it doesn't work. Do you have any idea?

    1. I got an idea. I just create the xhtml file with php first :-) That solved my problem. Thank you for sharing. Your script was just what I needed! :-) Have a great day!

  2. This comment has been removed by the author.

  3. I get this error: Warning: copy(newpage.xhtml) [function.copy]: failed to open stream: No such file or directory in /home/xxxx/public_html/domain.ro/dve/EPUB/epub.php on line 69
    Unable to copy new page into temporary location.

    How can I solve this?

  4. This comment has been removed by a blog administrator.
