AANRO to Fez
- Peter Sefton, Tim McCallum & Bron Dye
- This document describes the steps to ingest a set of AANRO records into Fez.
- AANRO repository implementers, anyone else looking for sample code to add data to Fez
- AANRO data
- Python 2.4 installed on the system to run the harvest, with the Cheetah templating system installed.
- Fedora 2.2 installed and running.
- Optionally, an instance of Fez for ingest.
- Fez wiki:
- Documentation on the FOXML (Fedora Object XML) specification:
- Official Python website:
- Official py.test tool and library website:
- Official Subversion website:
- The scripts have been developed on OS X and Linux-based systems. Python is a cross-platform programming language, so the scripts should also run under Microsoft Windows operating systems, but we have not tried this.
- Installing the Python programming language, Fedora and Cheetah is outside the scope of this technical report.
A component of the work undertaken at RUBRIC-Central is the development of various data migration strategies. These strategies are designed to assist RUBRIC Project Partners to migrate data into, and out of, various systems. The data migrations specifically target the three institutional repository solutions under consideration as part of the project: DSpace, Fez and VITAL.
This document covers a script that processes a proprietary database dump of the AANRO data. It is of little direct use to others, but may serve as an example.
This data migration differs from most of the others we have published at RUBRIC so far, in that it is not modular, does not use the DSpace archive format as an intermediate stage, and makes no use of XSLT. For performance reasons we wrote this as a stand-alone script that goes straight from the data files to FOXML records for ingest into Fedora.
There are a few reasons for the change, all to do with performance:
Linux only allows 32,000 files in one directory, so our simple DSpace archive class would have needed to be rewritten.
The libxslt library for Python has memory leaks, so the script fails on large data sets.
A multi-step process is ideal for small data sets (up to ten thousand or so records) but is unmanageable when dealing with two hundred thousand.
We wanted to try using Cheetah templates as the basis for data migration as they seemed to offer an easy-to-use alternative to XSLT.
The data migration uses a single script to process the text-based data dump into FOXML records you can ingest into Fedora.
- aanro_dump_to_foxml.py
- The main script. This takes .dmp files from the AANRO database.
- This script is accompanied by a minimalist py.test test-script: we recommend adding new test cases before you change the script.
- The result is an output directory containing FOXML records. To cope with large numbers of files, the output is split into numbered directories (0..n) with up to 10,000 records in each.
- NOTE: The script can currently deal with the AANRO publications data, the publications archive, and the research archive, but not the research data, which uses a different format and will require more development.
- aanro_foxml_template.tmpl, aanroFoxmlTemplate_research.tmpl
- Cheetah templates to transform AANRO data to FOXML, including both MODS and Dublin Core versions of the data. They deal with publications and research projects respectively.
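The flow described above (one record in, one FOXML file out, at most 10,000 files per numbered directory) can be sketched as follows. This is a minimal illustration rather than the actual script: it is written for current Python 3, not the Python 2.4 the original targets, and it uses the standard library's `string.Template` as a stand-in for the Cheetah templates. The record fields (`pid`, `title`), the cut-down and not schema-valid FOXML skeleton, the file naming, and the function names are all assumptions made for the example.

```python
import os
from string import Template
from xml.sax.saxutils import escape, quoteattr

RECORDS_PER_DIR = 10000  # per-directory limit quoted in the text above

# Stand-in for a Cheetah template (assumption): a heavily cut-down skeleton,
# not schema-valid FOXML.
FOXML_SKELETON = Template(
    '<foxml:digitalObject PID=$pid_attr '
    'xmlns:foxml="info:fedora/fedora-system:def/foxml#">\n'
    '  <!-- MODS and Dublin Core datastreams would go here -->\n'
    '  <title>$title</title>\n'
    '</foxml:digitalObject>\n'
)


def shard_dir(output_dir, index, per_dir=RECORDS_PER_DIR):
    """Return the numbered sub-directory (0..n) for the record at `index`."""
    return os.path.join(output_dir, str(index // per_dir))


def write_records(records, output_dir, per_dir=RECORDS_PER_DIR):
    """Render each record and write it into its numbered sub-directory."""
    paths = []
    for index, record in enumerate(records):
        target = shard_dir(output_dir, index, per_dir)
        os.makedirs(target, exist_ok=True)  # create 0, 1, 2, ... on demand
        xml = FOXML_SKELETON.substitute(
            pid_attr=quoteattr(record["pid"]),  # quoted, escaped attribute
            title=escape(record["title"]),      # escape &, <, > in text
        )
        path = os.path.join(target, "%d.xml" % index)  # hypothetical naming
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(xml)
        paths.append(path)
    return paths
```

Processing one record at a time and writing it out immediately keeps memory use flat, which is the property that matters when dealing with a couple of hundred thousand records.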
All of the data migration scripts, and associated code libraries, modules and files, are made available via a publicly accessible website.
If you have the Subversion client installed, you can download the Python scripts, test files, and other files used during development. The URL to check out is as follows:
svn co https://rubric-central.usq.edu.au/svn/Public/code/migration_toolkit
Change into the AANRO directory
Compile the templates using Cheetah
(do this again after any changes you make to the .tmpl files)
cheetah compile *.tmpl
- Input (-i)
- Full path to the AANRO data file (.dmp) to be transformed
- Output (-o)
- Name of directory that transformed files will be sent to (This directory will be created if it does not exist)
- Template (-t)
- Name of template file to be used in transformation
python aanro_dump_to_foxml.py -i Input -o Output -t Template
python aanro_dump_to_foxml.py -i research.dmp -o output_directory -t aanroFoxmlTemplate_research
python aanro_dump_to_foxml.py -i publication.dmp -o output_directory -t aanroFoxmlTemplate
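For readers adapting the script, the three options above could be parsed along the following lines. This is a hedged sketch using `argparse` from current Python (the original script targets Python 2.4, which predates that module). The function name `parse_args` and the long option names are assumptions for illustration and are not taken from aanro_dump_to_foxml.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the -i, -o and -t options described above."""
    parser = argparse.ArgumentParser(
        description="Transform an AANRO .dmp file into FOXML records.")
    parser.add_argument("-i", "--input", required=True,
                        help="full path to the AANRO .dmp file")
    parser.add_argument("-o", "--output", required=True,
                        help="output directory (created if it does not exist)")
    parser.add_argument("-t", "--template", required=True,
                        help="name of the template, given without .tmpl "
                             "as in the example commands")
    return parser.parse_args(argv)


# Passing an explicit list mirrors the research example command above.
options = parse_args(["-i", "research.dmp", "-o", "output_directory",
                      "-t", "aanroFoxmlTemplate_research"])
```

Marking all three options as required makes a missing argument fail with a usage message before any data is read.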
sudo /usr/local/fedora-2.2/client/bin/fedora-ingest.sh d [output_dir] foxml1.0 O localhost:8080 fedoraAdmin fedoraAdmin http ""
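Because the output is split into numbered directories, it can be convenient to ingest one directory at a time, for example so that a failed run can be resumed part-way through. The sketch below only builds the command lines, reusing the arguments from the command above. The helper name `ingest_commands` is hypothetical, and whether per-directory ingest is needed for your Fedora installation is an assumption to check.

```python
import os

# Path taken from the command above; adjust for your installation.
FEDORA_INGEST = "/usr/local/fedora-2.2/client/bin/fedora-ingest.sh"


def ingest_commands(output_dir, host="localhost:8080",
                    user="fedoraAdmin", password="fedoraAdmin"):
    """Return one fedora-ingest argument list per numbered sub-directory."""
    numbered = sorted(
        (name for name in os.listdir(output_dir)
         if name.isdigit() and os.path.isdir(os.path.join(output_dir, name))),
        key=int)  # numeric order: 0, 1, 2, ..., 10 rather than 0, 1, 10, 2
    return [
        [FEDORA_INGEST, "d", os.path.join(output_dir, name),
         "foxml1.0", "O", host, user, password, "http", ""]
        for name in numbered
    ]
```

Each returned list can be handed to `subprocess.call`, prefixed with `sudo` if your setup requires it as in the command above.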
Follow the latest instructions for Fez to index the resulting data into the Fez indexes.