C3PO and Digital Curation

C3PO
I recently took some time to learn more about content profiling. Keeping my digital skill set sharp is very important to me, so I enrolled in a WebEx virtual training out of the United Kingdom. The training was on May 31 at 13:00 BST (UK time)/14:00 CET, which meant I needed to be logged in by 5:30am (US/NM time). The training started at 6:00am in my time zone and ended at 7:00am. It was a Friday, so after several cups of coffee and with a brain loaded with information, I made my way into work. I didn’t want to miss taking a closer look at C3PO, a contemporary tool being used for digital curation (collection, management and preservation).

My Computers

My true loves… My computers.
Call this cross-platform and ready for training!

This digital tool is supported by the Information and Software Engineering Group (IFS), Institute of Software Technology and Interactive Systems (ISIS), and the Vienna University of Technology in Austria. C3PO was developed by Petar Petrov of Creative Pragmatics. Petrov also delivered the virtual training titled C3PO: An Introduction to Content Profiling. Petrov studied Business Informatics and Software Engineering. The system he has developed addresses content profiling in three steps: the gathering of metadata; data processing and aggregation; and metadata analysis.
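
To picture those three steps for myself, I wrote a tiny Python sketch. It is not C3PO code and does not use any C3PO API; the records, the field names and the `build_profile` helper are all made up for illustration. In real life the metadata would come from a characterization tool rather than a hand-typed list.

```python
from collections import Counter

# Step 1: gather metadata. These records are invented for illustration;
# in practice they would come from a characterization tool.
records = [
    {"name": "a.pdf", "format": "PDF", "version": "1.2", "size": 120_000, "valid": True},
    {"name": "b.pdf", "format": "PDF", "version": "1.2", "size": 4_500_000, "valid": False},
    {"name": "c.pdf", "format": "PDF", "version": "1.4", "size": 130_000, "valid": True},
    {"name": "d.tif", "format": "TIFF", "version": "6.0", "size": 9_000_000, "valid": True},
]


def build_profile(records):
    """Step 2: aggregate per-object metadata into collection-level counts."""
    return {
        "object_count": len(records),
        "total_size": sum(r["size"] for r in records),
        "formats": Counter(f'{r["format"]} {r["version"]}' for r in records),
        "validity": Counter("valid" if r["valid"] else "invalid" for r in records),
    }


profile = build_profile(records)

# Step 3: analyze the aggregated profile.
print(profile["formats"].most_common())
print(profile["validity"])
```

The point of the sketch is only the shape of the process: many small per-object records go in, and one collection-level summary comes out that you can actually reason about.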

Content Profiling

One of Petar Petrov’s presentation slides.

  • C3PO is for content and planning. Content means personal content (documents, etc.), cultural heritage (libraries, museums, archives, etc.), scientific data, and government documents.
  • He says that the “future growth” of what happens in an “internet minute” is “staggering.” The question is “what do we preserve because we can’t preserve all of it. We need to evaluate what we can preserve.”
  • Preservation planning~ identifying risks to digital objects and developing a preservation plan. The plan should describe the content and how you will go about preserving it using a particular digital repository or software.
  • Plato~ supports the Preservation Planning Workflow. It was developed by the Vienna University of Technology. This program guides you through the process and enables the development of a real preservation plan.
  • Discussed how objects, technology, usage criteria, policies, and actions affect preservation environments.
  • Scout~ monitors interesting aspects of the world, notifies you when an important event happens, and helps you know when you should reevaluate your preservation plan (format migration, current tools, a trend watch on preservation and migration tools, notifications for a redeploy of software).
  • Content profiling~ Property (format), FileA (PDF 1.2), FileB (PDF 1.2), FileC (PDF 1.4). The example asks which files are similar. Not necessarily A and B, even though they share a format version; a closer look at the metadata is required. Only after looking at page count, encryption, file size, and whether the files are valid and well-formed can it be determined which files are most similar. In preservation planning, this is a necessary evaluation. It is difficult to do this both on a large scale and in detail. We need to take more data out of the digital repository and secure a data characterization process to aggregate the data.
  • Heterogeneity: “one size does not fit all”
  • He discussed how to perform a sample selection based on the metadata to experiment on.
  • QA and limitations~ “how do you know if your content profile is good or no good?” You need to understand the tools which provide the metadata so that your characterization data is clean and your content profiles are good.
  • C3PO~ how does it work? What does it do? This software merges several command line tools into one, so a single tool does what would otherwise take several. Call it the “Swiss Army Knife of tools.”
  • v0.3.0~ it’s a command line application and a web application. It uses MongoDB, an open source document database that stores data differently than a traditional relational database. It is built on Java technology and processes FITS (and Tika) files, stores them in the document store, and offers XML profiling and CSV export. It can process close to 1 million (945,699) objects in about 2 hours (1hr48m) and can define profiles in 12m. The web application provides an overview of the metadata and lets you browse, filter, sample and export it.
  • Hands-on C3PO demonstration~ very useful for data sampling. He opened a terminal window to use the command line version of C3PO, demonstrating on about 2,000 objects. Using commands, he ingested data into the metadata database. He was then able to export the metadata to Excel for further sorting options and to detect problems with files and metadata. The web-based app provides statistics using bar graphs of mime types, format versions, object validity, whether the objects are well-formed, and sizes.
  • Objects can be filtered by format and other properties to see detailed, regenerated views for analyzing particular sets of objects. For example, you can see all “invalid PDFs.” You can then export any generated profiles to Plato to help you develop a preservation plan. He said “content profiling will never be completely solved, we can only improve the process.”
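
To make the FileA/FileB/FileC example and the “invalid PDFs” filter concrete, here is another small Python sketch of my own. Only the three format versions come from the talk; the page counts, encryption flags and validity values are invented, and the scoring rule in `shared_properties` (including the 10-page tolerance) is just one simple assumption about what “similar” could mean, not how C3PO decides it.

```python
from itertools import combinations

# Hypothetical property values; the talk gave only the format versions.
files = {
    "FileA": {"format": "PDF 1.2", "pages": 2, "encrypted": False,
              "valid": True, "well_formed": True},
    "FileB": {"format": "PDF 1.2", "pages": 340, "encrypted": True,
              "valid": False, "well_formed": True},
    "FileC": {"format": "PDF 1.4", "pages": 3, "encrypted": False,
              "valid": True, "well_formed": True},
}


def shared_properties(a, b):
    """Count the properties on which two objects agree.

    Page counts are treated as agreeing when they differ by at most 10
    (an arbitrary tolerance chosen for this illustration).
    """
    score = 0
    for key in a:
        if key == "pages":
            score += abs(a[key] - b[key]) <= 10
        else:
            score += a[key] == b[key]
    return score


# Score every pair of files and pick the pair that agrees on the most properties.
scores = {
    pair: shared_properties(files[pair[0]], files[pair[1]])
    for pair in combinations(files, 2)
}
most_similar = max(scores, key=scores.get)

# The kind of filter shown in the web app: all PDFs that failed validation.
invalid_pdfs = [
    name for name, props in files.items()
    if props["format"].startswith("PDF") and not props["valid"]
]

print(most_similar)
print(invalid_pdfs)
```

With these made-up values, FileA and FileB share a format version but little else, which is the trap the slide was warning about: format alone is too coarse a property to group objects for preservation planning.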