We present a
framework
for
validation
.
In our case
very focused
to
validate that
ETL
load
has
worked well.
The purpose and use (desired ) for this STValidations (free download for previous link), is to automate the validation routine and sometimes we fail to do just that. For routine .
Its use is quite simple and are in a version 0.1 , so it can evolve much and that we have few use cases how to make sure their robustness and to cover all needs. Hope this code from Stratebi helps you.
ETL is a process that reads from a table a list of queries to be executed and the expected result. After that , run the query and compares it with the expected result if the result is correct is a log of executions correct and if the result is not expected is a log of botched executions . At the end sends an email with the two records to the user to decide so you can review the implementation .
Here you can see the aspect of work in general :
The purpose and use (desired ) for this STValidations (free download for previous link), is to automate the validation routine and sometimes we fail to do just that. For routine .
Its use is quite simple and are in a version 0.1 , so it can evolve much and that we have few use cases how to make sure their robustness and to cover all needs. Hope this code from Stratebi helps you.
ETL is a process that reads from a table a list of queries to be executed and the expected result. After that , run the query and compares it with the expected result if the result is correct is a log of executions correct and if the result is not expected is a log of botched executions . At the end sends an email with the two records to the user to decide so you can review the implementation .
Here you can see the aspect of work in general :
And the detail
of the validation
:
Simple
...
powerful and
versatile.
From
simple queries
like "
select count
(*) from
table"
to validate that
you have
records to
complex queries
for comparison
of values in
different tables.
In fact, if
what I
see
is data
in a table
so I think
that most
of the validations
can be performed using
this method.
You can follow instructions in this video tutorial:
You can check more videos in our Youtube Channel
The process, in step settings , read the configuration data from configuration.properties , check that there is a validation table and if not creates and inserts a validation sample . Then, the process reads all existing queries in the table , runs and compare the result with the expected result.
The lookup table looks like and filled manually (we are at version 0.1 , remember .....) :
You can follow instructions in this video tutorial:
You can check more videos in our Youtube Channel
The process, in step settings , read the configuration data from configuration.properties , check that there is a validation table and if not creates and inserts a validation sample . Then, the process reads all existing queries in the table , runs and compare the result with the expected result.
The lookup table looks like and filled manually (we are at version 0.1 , remember .....) :
We did
that
because we had some
problems of inconsistency
in some projects
that made
clear that
the loads
must be validated
and
is a tedious
and repetitive work.
- We must always validate charges . We have detected errors due to inconsistent or unexpected data input formats . And that can not be controlled unless the charges are validated .
- Loads must be validated especially after making a change in the ETL and it is pretty boring , perform the same query validation.
- We must make a series of routine consultations to ensure that the data is equal to the source .
- If you can automate ... I'd rather be doing other things.
Strengths and weaknesses :
- It is a version 0.1 we have tested in a couple of clients and so far so good. But we know that, for example , still can not compare data from different sources faith .
- Queries can become heavy and this is a potentially VERY expensive in terms of resource consumption. That's why ETL is a process that can be run independently attached to the ETL process or in another moment of time with little load. We do it well . The process does not run right after the ETL but at 6:00 a.m. how a separate process.
- Do not rely on the BI server. Simple and useful.
Well , I think it can be very useful in projects and we hope you find it useful too!!
By the way ... this is the error log to send :