pygrametl (pronounced py-gram-e-t-l) is a Python framework that offers commonly used functionality for the development of Extract-Transform-Load (ETL) processes. It is open source and the newest version is released under a BSD license.
pygrametl was first made publicly available in 2009. Since then, we have made various improvements and added new features. Among other things, there is now much better support for parallel processing. Further, we have changed the license to a BSD license and added much more support for Jython, so that you can also use existing Java code and JDBC drivers in your ETL program.
When using pygrametl, the developer codes the ETL process in Python. This turns out to be very efficient, even when compared to drawing the process in a graphical user interface (GUI).
Concretely, the developer creates an object for each dimension and fact table.
(S)he can then easily add a new member by calling dimension.insert(row),
where row is a dict holding the values to insert. This is a very simple example,
but pygrametl also supports much more complicated scenarios. For example, it is possible
to create a single object for a snowflaked dimension. It is then still possible to
add a new dimension member with a single method call as in snowflake.insert(row).
This will automatically do the necessary lookups and insertions in the tables
participating in the snowflake.
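To make the idea concrete, here is a minimal sketch of what such objects do behind a single insert(row) call. It does not use pygrametl itself: TinyDimension and TinySnowflake are hypothetical stand-ins written for this illustration, the city/country tables are made up, and SQLite from the Python standard library plays the role of the data warehouse. pygrametl's own classes offer considerably more than this.

```python
import sqlite3


class TinyDimension:
    """Hypothetical stand-in for a dimension object (not pygrametl's class)."""

    def __init__(self, conn, name, key, attributes):
        self.conn, self.name, self.key = conn, name, key
        self.attributes = list(attributes)

    def lookup(self, row):
        """Return the key of the member matching row, or None if absent."""
        where = " AND ".join(f"{a} = ?" for a in self.attributes)
        found = self.conn.execute(
            f"SELECT {self.key} FROM {self.name} WHERE {where}",
            [row[a] for a in self.attributes],
        ).fetchone()
        return found[0] if found else None

    def insert(self, row):
        """Insert row's dimension attributes and return the generated key."""
        cols = ", ".join(self.attributes)
        marks = ", ".join("?" for _ in self.attributes)
        cur = self.conn.execute(
            f"INSERT INTO {self.name} ({cols}) VALUES ({marks})",
            [row[a] for a in self.attributes],
        )
        return cur.lastrowid


class TinySnowflake:
    """Hypothetical two-level snowflake: child rows reference a parent table."""

    def __init__(self, child, parent):
        self.child, self.parent = child, parent

    def insert(self, row):
        """Look up or insert the parent member, then insert the child member."""
        row = dict(row)  # work on a copy so the caller's dict is not modified
        parent_key = self.parent.lookup(row)
        if parent_key is None:
            parent_key = self.parent.insert(row)
        row[self.parent.key] = parent_key  # foreign key stored in the child table
        return self.child.insert(row)


# Made-up example schema: city is snowflaked into city -> country.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (countryid INTEGER PRIMARY KEY, country TEXT)")
conn.execute(
    "CREATE TABLE city (cityid INTEGER PRIMARY KEY, city TEXT, countryid INTEGER)"
)

country = TinyDimension(conn, "country", "countryid", ["country"])
city = TinyDimension(conn, "city", "cityid", ["city", "countryid"])
snowflake = TinySnowflake(child=city, parent=country)

# One call per member; the parent row for Denmark is only created once.
keys = [
    snowflake.insert({"city": "Aalborg", "country": "Denmark"}),
    snowflake.insert({"city": "Aarhus", "country": "Denmark"}),
]
country_count = conn.execute("SELECT COUNT(*) FROM country").fetchone()[0]
city_rows = conn.execute("SELECT city, countryid FROM city ORDER BY cityid").fetchall()
```

The point of the sketch is that the caller only supplies a flat dict. The object decides which tables to touch and wires up the foreign key, which is the kind of work the snowflake object takes off the programmer's hands.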
pygrametl also supports slowly changing dimensions. Again, the programmer only has to
invoke a single method: scdim.scdensure(row). This will perform the needed updates
of both type 1 (i.e., overwrites) and type 2 (i.e., addition of new versions).
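The difference between the two kinds of updates can be sketched in a few lines of plain Python. This is not pygrametl's implementation: the function scd_ensure, its parameters, and the shop data are hypothetical and invented for this illustration, the "dimension table" is just a list of dicts, and validity dates are left out to keep the sketch short.

```python
def scd_ensure(versions, row, lookup_att, type1_atts, type2_atts):
    """Hypothetical sketch of slowly changing dimension handling.

    versions is a list of dicts acting as the dimension table.
    Returns the version number that represents row after the call.
    """
    same_member = [v for v in versions if v[lookup_att] == row[lookup_att]]
    if not same_member:
        # Unknown member: store it as version 1.
        versions.append({**row, "version": 1})
        return 1

    # Type 1: overwrite the attribute in the existing versions of the member.
    for v in same_member:
        for att in type1_atts:
            v[att] = row[att]

    newest = max(same_member, key=lambda v: v["version"])
    # Type 2: a change in a tracked attribute adds a new version.
    if any(newest[att] != row[att] for att in type2_atts):
        new_version = {**row, "version": newest["version"] + 1}
        versions.append(new_version)
        return new_version["version"]
    return newest["version"]


versions = []
settings = dict(lookup_att="name", type1_atts=["phone"], type2_atts=["city"])

# First sighting of the shop: version 1 is created.
v1 = scd_ensure(versions, {"name": "Shop A", "city": "Aalborg", "phone": "111"}, **settings)
# Phone changes: type 1, so the stored value is overwritten in place.
v2 = scd_ensure(versions, {"name": "Shop A", "city": "Aalborg", "phone": "222"}, **settings)
# City changes: type 2, so a second version is added and the old one is kept.
v3 = scd_ensure(versions, {"name": "Shop A", "city": "Aarhus", "phone": "222"}, **settings)
```

After the three calls the list holds two versions of the shop: the Aalborg version with the corrected phone number, and a new Aarhus version. In both cases the caller makes the same single call and does not need to know which kind of change occurred.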
You can download pygrametl and/or browse the source. (Note: We have recently moved the entire code base to Google Code.) You can also browse the pydoc documentation online.
There is also a Technical Report that introduces pygrametl and its use (but not the new features for parallel processing; for these, please refer to the documentation or this paper).
Example programs from "Easy and Effective Parallel Programmable ETL" (DOLAP'11) are available here. The data generator and the example ETL program used in "pygrametl: A Powerful Programming Framework for Extract-Transform-Load Programmers" (DOLAP'09) are also available.
Questions? Comments? Bug reports? Suggestions? Please tell us what you think! Send an email to chr AT cs DOT aau DOT dk. We would also like (but don't require!) to know if you choose to use/not to use pygrametl.
Here you can see a complete example of a pygrametl program. The ETL program
extracts data from two CSV files and joins their content before it is loaded into a
data warehouse with the following schema.