David Cross shows us how to use Perl for "munging" data--"...storing information in databases, extracting it from files, reorganizing rows and columns, converting to and from bizarre formats, summarizing documents, tracking data in real time, creating statistics, doing back-up and recovery, merging and splitting data streams, logging and checkpointing computations." His book is full of techniques for transforming data from dumps into databases.
The book is written for programmers or analysts who transform data as a regular part of their jobs. It assumes a beginning knowledge of Perl programming, as one might gain from reading Learning Perl.
Part I introduces data munging as a recurring necessary evil and points out aspects of Perl that recommend it for this task. Part II surveys different types of unstructured and semi-structured data formats and suggests Perl-based strategies for working with them. Part III examines the limitations of simple data formats and discusses parsing strategies and specific techniques for working with HTML, XML and other hierarchical data structures. Part IV extracts some useful lessons from the previous chapters and suggests sources for additional study. The organization is logical and easy to follow.
Cross has written a well-designed book with helpful examples and insights. The accompanying book web site and author web site provide downloadable code and other resources. The book is, of course, most useful to those working in Perl, but many of its general concepts and strategies have transferred well to data munging tasks I have done in TextPipe.
One of Perl's mottos is: "There's more than one way to do it." A variety of ways are illustrated and explained in this book. Note that the book is over ten years old and does not cover the latest developments in the Perl language.