I started working from version 0.4.2 of the EGE program, but version 0.4.3 of EGE was released while OxGarage was being developed. I have tried to merge the changes from that version into the current version of OxGarage. Essentially all of them were implemented, with only small deviations from the original 0.4.3 version. The rest of this document briefly describes what I have done to the Java source code during my internship here at OUCS.
Afterwards I started adding new conversions to OxGarage. First I added the conversion to ePub. It was added to TEIConverter, where most of the conversion logic is performed. However, I also needed to change the framework slightly, because ePub requires a special zipping procedure. This procedure was therefore added to the framework utilities that take care of zipping.
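To illustrate why ePub needs its own zipping step: the ePub container format requires the first archive entry to be a file named mimetype, stored without compression and containing application/epub+zip. The following is only a minimal sketch of that rule; the class name EpubZipSketch and its zip method are hypothetical and are not the actual OxGarage utilities.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper, not the OxGarage zipping utility. */
final class EpubZipSketch {

    static final String MIMETYPE = "application/epub+zip";

    /** Packs the given files (archive path to content) into an ePub container. */
    static byte[] zip(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            // The mimetype entry comes first and is STORED (no compression).
            // STORED entries need their size and CRC set before they are written.
            byte[] mime = MIMETYPE.getBytes(StandardCharsets.US_ASCII);
            CRC32 crc = new CRC32();
            crc.update(mime);
            ZipEntry first = new ZipEntry("mimetype");
            first.setMethod(ZipEntry.STORED);
            first.setSize(mime.length);
            first.setCompressedSize(mime.length);
            first.setCrc(crc.getValue());
            zip.putNextEntry(first);
            zip.write(mime);
            zip.closeEntry();

            // All remaining entries use the default (deflated) method.
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue());
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

Passing a LinkedHashMap keeps the remaining entries in a predictable order; only the position and storage method of the mimetype entry are fixed by the format.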
Next, I added support for converting documents that contain images. To do this, I needed to change the web service and TEIConverter, and I added a completely new class, ImageFetcher, which is responsible for most of the operations on images. To enable users to upload images, I also had to change the web client. To keep conversion requests for documents without images unchanged, I decided to handle image uploads by uploading the document first and then all the image files. The web service therefore now treats the first file it receives from the form as the document to convert and all subsequent files as images. The type of the uploaded files is never checked during the conversion. The converter simply takes all the <graphic> tags from the TEI document and copies the images named in the url attribute of these tags. All other uploaded files are discarded. If you are converting from ODT or DOCX format, the directories containing images are simply copied and zipped together with the converted document.
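The image lookup can be pictured roughly as follows. This is a sketch only, assuming the TEI graphic element in the standard TEI namespace with its url attribute; the class GraphicUrlSketch is hypothetical and does not reproduce ImageFetcher.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper, not the ImageFetcher class. */
final class GraphicUrlSketch {

    static final String TEI_NS = "http://www.tei-c.org/ns/1.0";

    /** Returns the url attribute of every graphic element, in document order. */
    static List<String> graphicUrls(String teiXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(teiXml.getBytes(StandardCharsets.UTF_8)));
            NodeList graphics = doc.getElementsByTagNameNS(TEI_NS, "graphic");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < graphics.getLength(); i++) {
                String url = ((Element) graphics.item(i)).getAttribute("url");
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
            return urls;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed TEI document", e);
        }
    }
}
```

Only uploaded files whose names appear in the returned list would then be copied; anything else can be ignored, which matches the behaviour described above.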
Some conversions should not be displayed as a possible input type, but they should still be taken into account when conversion paths are created. To implement this, I needed to change the API and the framework slightly: I added a boolean field to the ConversionActionArguments class and made the framework offer only those conversions that are defined as visible.
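The idea of the visibility flag can be sketched as below. The Conversion class here is a hypothetical stand-in, not the real ConversionActionArguments, and the field and method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical illustration of a visibility flag on conversions. */
final class VisibilitySketch {

    static final class Conversion {
        final String input;
        final String output;
        final boolean visible; // assumed meaning: may be offered to the user

        Conversion(String input, String output, boolean visible) {
            this.input = input;
            this.output = output;
            this.visible = visible;
        }
    }

    /** Input formats shown to the user: hidden conversions are skipped. */
    static List<String> offeredInputs(List<Conversion> all) {
        TreeSet<String> inputs = new TreeSet<>();
        for (Conversion conversion : all) {
            if (conversion.visible) {
                inputs.add(conversion.input);
            }
        }
        return new ArrayList<>(inputs);
    }

    /** Conversions used for building paths: every conversion counts. */
    static List<Conversion> pathEdges(List<Conversion> all) {
        return new ArrayList<>(all);
    }
}
```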
Because the conversions to and from odt and docx were so similar, I decided to rewrite the DocXConverter class completely when implementing the odt conversions. I took almost everything these conversions have in common and put it into an abstract class called ComplexConverter. This class is now responsible for most of the conversion logic, and the DocXConverter and OdtConverter classes act more like configuration classes for these conversions.
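The split between shared logic and per-format configuration follows the template-method pattern. The sketch below shows the general shape only. The class names, the two configuration methods and the output name tei.xml are assumptions; the archive paths are the standard locations of the main document and images inside DOCX and ODT packages.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical base class: shared logic lives here, subclasses only configure it. */
abstract class ComplexConverterSketch {

    /** Path of the main document inside the unpacked archive. */
    protected abstract String mainDocumentPath();

    /** Directory inside the archive that holds embedded images. */
    protected abstract String imageDirectory();

    /** Transforms the main document, keeps the images and drops everything else. */
    final Map<String, byte[]> convert(Map<String, byte[]> archive, UnaryOperator<byte[]> transform) {
        Map<String, byte[]> result = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> entry : archive.entrySet()) {
            String path = entry.getKey();
            if (path.equals(mainDocumentPath())) {
                result.put("tei.xml", transform.apply(entry.getValue())); // assumed output name
            } else if (path.startsWith(imageDirectory())) {
                result.put(path, entry.getValue());
            }
        }
        return result;
    }
}

final class DocxSketch extends ComplexConverterSketch {
    @Override protected String mainDocumentPath() { return "word/document.xml"; }
    @Override protected String imageDirectory() { return "word/media/"; }
}

final class OdtSketch extends ComplexConverterSketch {
    @Override protected String mainDocumentPath() { return "content.xml"; }
    @Override protected String imageDirectory() { return "Pictures/"; }
}
```

Supporting another package-based format would then only require a new small subclass, which is the benefit of keeping the concrete converters as configuration.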
After some Googling I found a library called JODConverter, which can run a headless OpenOffice.org process and make calls to it to convert files. Using this library I created the OOConverter. I added the possibility of running several processes at the same time and of waiting for a free process when all of them are in use. The processes are terminated after the conversions, so that there are not too many OpenOffice instances running on the server all the time. The maximum number of running processes and the port numbers on which they can run can easily be changed in the OOConverter class.
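Waiting for a free process can be modelled with a blocking queue of port numbers. The sketch below shows only that waiting mechanism and deliberately contains no JODConverter calls; the class OfficePoolSketch, its methods and the example port numbers are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.IntFunction;

/** Hypothetical pool of ports, one per allowed office process. */
final class OfficePoolSketch {

    private final BlockingQueue<Integer> freePorts;

    /** At least one port is required; the number of ports is the process limit. */
    OfficePoolSketch(int... ports) {
        freePorts = new ArrayBlockingQueue<>(ports.length);
        for (int port : ports) {
            freePorts.add(port);
        }
    }

    /**
     * Blocks until a port is free, runs the job on it and returns the port
     * to the pool afterwards, even if the job fails. A real job would start
     * the office process on that port, convert, and terminate the process.
     */
    <T> T withPort(IntFunction<T> job) {
        int port;
        try {
            port = freePorts.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a free process", e);
        }
        try {
            return job.apply(port);
        } finally {
            freePorts.add(port);
        }
    }

    int freeCount() {
        return freePorts.size();
    }
}
```

With new OfficePoolSketch(8100, 8101), at most two conversions run at once and a third caller waits inside withPort until one of them finishes.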
Adding the OOConverter caused quite a lot of new conversions to appear on the list, and also quite a lot of ways to get from input to output. The user interface was therefore starting to get a bit chaotic again, and as more and more conversions were added, the existing way of calculating all possible paths was taking too long. I needed some mechanism to decide which conversion paths should be used and which should not. After some discussions, I decided to assign a cost to each conversion. These costs are very subjective and simply represent how good I thought the conversions were. By good I mostly mean how close the resulting document was to the original. The lower the cost, the better the conversion seemed to me. To accommodate this cost mechanism I had to change several parts of the API and the framework, and also each of the converters.
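One way per-conversion costs can be used to pick a path is a cheapest-path search. The sketch below uses Dijkstra's algorithm, which is not necessarily what the framework does internally; the class names and the cost values used in any example are invented for illustration, and costs are assumed to be non-negative.

```java
import java.util.AbstractMap;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical cheapest-path search over conversions with costs. */
final class CostPathSketch {

    static final class Edge {
        final String from;
        final String to;
        final int cost; // lower means a more faithful conversion

        Edge(String from, String to, int cost) {
            this.from = from;
            this.to = to;
            this.cost = cost;
        }
    }

    /** Returns the formats on the cheapest path, or an empty list if none exists. */
    static List<String> cheapestPath(List<Edge> edges, String source, String target) {
        Map<String, Integer> best = new HashMap<>();
        Map<String, String> previous = new HashMap<>();
        PriorityQueue<Map.Entry<String, Integer>> queue =
                new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
        best.put(source, 0);
        queue.add(new AbstractMap.SimpleEntry<>(source, 0));
        while (!queue.isEmpty()) {
            Map.Entry<String, Integer> head = queue.poll();
            String format = head.getKey();
            if (head.getValue() > best.get(format)) {
                continue; // stale queue entry, a cheaper route was found later
            }
            if (format.equals(target)) {
                break;
            }
            for (Edge edge : edges) {
                if (!edge.from.equals(format)) {
                    continue;
                }
                int candidate = head.getValue() + edge.cost;
                if (candidate < best.getOrDefault(edge.to, Integer.MAX_VALUE)) {
                    best.put(edge.to, candidate);
                    previous.put(edge.to, format);
                    queue.add(new AbstractMap.SimpleEntry<>(edge.to, candidate));
                }
            }
        }
        if (!best.containsKey(target)) {
            return Collections.emptyList();
        }
        LinkedList<String> path = new LinkedList<>();
        for (String format = target; format != null; format = previous.get(format)) {
            path.addFirst(format);
        }
        return path;
    }
}
```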
New extensions can now easily be added to XslConverter by providing a plugin.xml file together with the style-sheets. To implement this I needed to change parts of the EGEConfigurationManager class. More precisely, I added code that looks through the style-sheet directories to find all plugin.xml files and then adds the conversions described by these files as extensions to the XslConverter plugin, using the Java Plugin Framework, which was already being used in EGE.
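The directory scan itself can be sketched with the standard java.nio.file API. This covers only the lookup of plugin.xml files; registering them with the Java Plugin Framework is left out, and the class PluginLookupSketch is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical lookup of plugin descriptors below a style-sheet directory. */
final class PluginLookupSketch {

    /** Returns every regular file named plugin.xml below the given root, sorted by path. */
    static List<Path> findPluginDescriptors(Path stylesheetsRoot) {
        try (Stream<Path> tree = Files.walk(stylesheetsRoot)) {
            return tree.filter(Files::isRegularFile)
                    .filter(path -> path.getFileName().toString().equals("plugin.xml"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```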
Adding the new csv conversion was quite easy using the automatic extension lookup mentioned above. However, there were some other issues to solve. I needed to make some minor changes to XslConverter so that it can also convert csv documents. The provided input stream is now written to a file, and this file then serves as a new input stream. The converter tries to convert it as if it were an XML document, and when it realises it is not, it throws an exception. When this happens, the converter tries to convert the document as a non-XML document.
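The try-XML-first, fall-back-on-failure flow can be sketched as follows. It assumes a standard DOM parser stands in for the XML attempt and that the two branches are passed in as functions; the class XmlFallbackSketch is hypothetical and is not the XslConverter code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.SAXException;
```

The buffering step matters because an input stream can only be read once: writing it to a file first makes the second attempt possible after the XML attempt has failed.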
The last change to the code was made on my last day at OUCS. I added automatic generation of cover images for ePub documents. This takes an image template and adds to it the name of the author and the title of the publication, which are extracted from the <teiHeader> tag. To do this, I use Java's ImageIO and Graphics classes from the javax and awt libraries. Currently it tries to fit the largest possible font into the image (but no larger than the maximum set by the maxAuthorFontSize and maxTitleFontSize variables). It allows 1 line for the author and 2 lines for the title. Once the title and author fit into the image's width, it tries to center them both horizontally and vertically. For vertical alignment, you need to define the vertical size of the region inside the code, where the marginY variable is calculated.
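The font fitting can be sketched as a search downwards from the maximum size until the text fits the available width. In the sketch below the width measurement is passed in as a function so that the fitting logic is separate from AWT; the class CoverTextSketch and its method names are hypothetical, and the handling of two title lines is left out.

```java
import java.awt.Font;
import java.awt.Graphics2D;
import java.util.function.IntUnaryOperator;

/** Hypothetical helpers for fitting and centering text on a cover image. */
final class CoverTextSketch {

    /**
     * Returns the largest font size not above maxSize whose measured width
     * fits into maxWidth, or 1 if no size fits.
     */
    static int fitFontSize(IntUnaryOperator widthAtSize, int maxSize, int maxWidth) {
        for (int size = maxSize; size > 1; size--) {
            if (widthAtSize.applyAsInt(size) <= maxWidth) {
                return size;
            }
        }
        return 1;
    }

    /** Measures the pixel width of the text for a given font size using AWT font metrics. */
    static IntUnaryOperator widthOf(Graphics2D graphics, String fontName, String text) {
        return size -> graphics.getFontMetrics(new Font(fontName, Font.BOLD, size)).stringWidth(text);
    }

    /** Left x coordinate that centers text of the given width horizontally. */
    static int centeredX(int imageWidth, int textWidth) {
        return (imageWidth - textWidth) / 2;
    }
}
```

With a Graphics2D obtained from BufferedImage.createGraphics(), a call such as fitFontSize(widthOf(graphics, "Serif", title), maxTitleFontSize, imageWidth) would give the size to draw the title with.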
Apart from the 10 main changes described above, I also had to fix some bugs. One bug allowed conversions like TEI → DOCX → TEI → EPUB to take place. This is undesirable, as the unnecessary conversion to DOCX is a potential source of errors. I therefore changed the way the framework determines whether something is a cycle in the EGEimpl class. I also changed several other things in this class to fix more bugs or to improve performance. Another bug in this class was caused by wrong numbering of edges when constructing the graph. I got rid of the original numbering and now label the edges 1 to n, where n is the number of edges. Another change in this class is that paths which are already longer than the shortest path seen so far for the same input and output are simply discarded and not used in further calculations of possible paths. This is because the graph of conversions is now so large that calculating all the possibilities takes too much time. I also changed the document formats to be written out in alphabetical order. To do this I had to change several HashMaps into TreeMaps or LinkedHashMaps. I also had to make DataType and ConversionsPaths implement the Comparable interface, which enabled sorting them. Please note that in the ConversionsPaths class the result of a.equals(b) is not the same as a.compareTo(b)==0, as I had to implement different rules for equals and for compareTo.
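The note about equals and compareTo has a practical consequence: sorted collections such as TreeSet and TreeMap decide whether two elements are the same by compareTo alone. The sketch below demonstrates this with a hypothetical class PathKeySketch; its fields and comparison rules are invented for illustration and are not the rules used in ConversionsPaths.

```java
import java.util.Objects;

/** Hypothetical class whose compareTo is deliberately inconsistent with equals. */
final class PathKeySketch implements Comparable<PathKeySketch> {

    final String input;
    final String output;
    final int steps;

    PathKeySketch(String input, String output, int steps) {
        this.input = input;
        this.output = output;
        this.steps = steps;
    }

    /** Equality takes every field into account. */
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof PathKeySketch)) {
            return false;
        }
        PathKeySketch that = (PathKeySketch) other;
        return input.equals(that.input) && output.equals(that.output) && steps == that.steps;
    }

    @Override
    public int hashCode() {
        return Objects.hash(input, output, steps);
    }

    /** Ordering looks only at the format names, for alphabetical listing. */
    @Override
    public int compareTo(PathKeySketch that) {
        int byInput = input.compareTo(that.input);
        return byInput != 0 ? byInput : output.compareTo(that.output);
    }
}
```

Two such objects with the same formats but a different number of steps are both kept by a HashSet, but a TreeSet keeps only the first one added, so care is needed when objects like this are used as TreeMap keys or TreeSet elements.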
As far as I can remember right now, these are all the important changes I have implemented. Most of them were ideas of my supervisor, who was very helpful throughout my internship, as were all the other people working on this project.