I have been working on a data warehouse project for almost five months now. We didn't use any of the existing ETL tools on the market; instead, we developed our own.
After studying the source code of Rhino ETL, I came to believe that we could write our own ETL process. I didn't use Rhino ETL itself; instead, my team developed our own framework that suits our needs. The key thing I learned from Rhino ETL is that I can yield results from a database connection, which keeps memory usage down. Instead of loading a whole list of data, I yield only the rows that need to be processed. Of course, I still need to load reference data during the transformation process.
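To illustrate the idea, here is a minimal sketch (not our actual framework code; the connection string and query are placeholders) of streaming rows out of a database connection with yield return instead of materializing the whole result set:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;

public static class SourceReader
{
    // Stream rows one at a time; memory stays flat no matter how
    // many rows the query returns.
    public static IEnumerable<Dictionary<string, object>> ReadRows(
        string connectionString, string sql)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Copy the current row out of the reader so the caller
                    // can safely hold on to it after the reader advances.
                    var row = new Dictionary<string, object>(reader.FieldCount);
                    for (int i = 0; i < reader.FieldCount; i++)
                        row[reader.GetName(i)] = reader.GetValue(i);
                    yield return row;
                }
            }
        }
    }
}
```

Because the enumerator is lazy, nothing is read from the database until the caller starts iterating, and only one row is copied into memory at a time.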
Anyway, our application has over 150 tables, and the data warehouse has about 100 tables so far; more warehouse tables will probably be added in the future. We are at the final phase of the first data warehouse release. My team has been doing performance and manual testing while I develop an integration test framework. We have unit tests that cover the data transformation and other parts of the ETL process, but we don't have integration tests, which I believe are necessary if we want a robust product.
Long story short, I finally finished coding the basic integration testing framework. I am using NDbUnit as part of my framework.
Here are the basic attributes of the ETL process:
1. It uses Parallel LINQ (PLINQ) to process the transformation
2. It chunks the process by date range to reduce memory usage (see the sketch after this list)
3. It cleans up old data from the live system once the ETL is complete
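Roughly, attributes 1 and 2 combine like this. This is a simplified sketch, not our production code; ExtractChunk, Transform, and LoadRecord are hypothetical stand-ins for the real extract, transform, and load steps, and the one-week chunk size is just an example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EtlRunner
{
    public static void Run(DateTime from, DateTime to)
    {
        var step = TimeSpan.FromDays(7); // example chunk size

        // Walk the overall window one date range at a time, so only one
        // chunk's worth of source rows is in memory at any moment.
        for (var start = from; start < to; start += step)
        {
            var end = start + step < to ? start + step : to;

            // PLINQ spreads the transformation of this chunk across cores.
            var transformed = ExtractChunk(start, end)
                .AsParallel()
                .Select(Transform);

            foreach (var record in transformed)
                LoadRecord(record);
        }
    }

    static IEnumerable<object> ExtractChunk(DateTime start, DateTime end)
    {
        yield break; // stand-in: stream source rows dated within [start, end)
    }

    static object Transform(object sourceRow)
    {
        return sourceRow; // stand-in: map a source row to a warehouse record
    }

    static void LoadRecord(object record)
    {
        // stand-in: insert the record into the warehouse
    }
}
```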
Here are the basic attributes of the integration test:
1. It loads the entire live database into a DataSet before the ETL process
2. It loads the entire live database into a DataSet after the ETL process
3. It loads the entire data warehouse database into a DataSet after the ETL process
The data can come either from an XML file or from an existing database.
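If I strip away the framework plumbing, the NDbUnit part looks roughly like the sketch below. The connection strings and file names are placeholders, and this is my shorthand for the API rather than the framework's exact code:

```csharp
using System.Data;
using NDbUnit.Core;
using NDbUnit.Core.SqlClient;

public class WarehouseSnapshots
{
    // Pull every table described by the XSD schema into an in-memory DataSet.
    // This is what runs before and after the ETL process.
    public DataSet SnapshotDatabase(string connectionString, string xsdSchemaFile)
    {
        INDbUnitTest database = new SqlDbUnitTest(connectionString);
        database.ReadXmlSchema(xsdSchemaFile);
        return database.GetDataSetFromDb();
    }

    // Alternatively, seed the test database from an XML file instead of
    // using an existing database, clean-inserting it before the ETL run.
    public void SeedFromXml(string connectionString, string xsdSchemaFile,
                            string xmlDataFile)
    {
        INDbUnitTest database = new SqlDbUnitTest(connectionString);
        database.ReadXmlSchema(xsdSchemaFile);
        database.ReadXml(xmlDataFile);
        database.PerformDbOperation(DbOperationFlag.CleanInsert);
    }
}
```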
Other than the fact that the Visual Studio DataSet designer nearly died under the number of tables I have, the testing was fine until I used a customer database for testing, and that is when things went wrong.
Good thing I was running 64-bit Windows 7, which lets my applications allocate more memory than my fellow developers' 32-bit OS. Even so, the test ended up taking an hour to run, and it failed to perform cleanup because of a "SQL connection was disposed" error.
The problem was that I had been too optimistic about loading all the data into memory for comparison. To resolve the issue, I now explicitly load and unload the data, one table at a time, while comparing.
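The revised approach looks something like this sketch (plain ADO.NET, with hypothetical names; the actual column-by-column comparison is elided). Instead of holding three full-database DataSets, it loads one table at a time, compares it, and releases it before moving on:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

public static class TableComparer
{
    public static void CompareTable(string expectedConnectionString,
                                    string actualConnectionString,
                                    string tableName)
    {
        // Only two copies of a single table are in memory at once,
        // instead of entire databases.
        DataTable expected = LoadTable(expectedConnectionString, tableName);
        DataTable actual = LoadTable(actualConnectionString, tableName);

        if (expected.Rows.Count != actual.Rows.Count)
            throw new Exception(string.Format(
                "Row count mismatch in {0}: expected {1}, got {2}",
                tableName, expected.Rows.Count, actual.Rows.Count));

        // ... column-by-column comparison elided ...

        // Explicitly unload before the next table is loaded.
        expected.Dispose();
        actual.Dispose();
    }

    static DataTable LoadTable(string connectionString, string tableName)
    {
        using (var adapter = new SqlDataAdapter(
            "SELECT * FROM [" + tableName + "]", connectionString))
        {
            var table = new DataTable(tableName);
            adapter.Fill(table);
            return table;
        }
    }
}
```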
In conclusion, memory and performance issues don't just exist in your application; they exist in your tests too.