Data Quality | Tutorial | Data Quality Management | Data Quality Assessment | Data Quality Issues - YouTube

Channel: unknown

[1]
Hello, and welcome to this data quality tutorial. First, let us understand. What is data quality?
[11]
Now, basically data is of high quality if it does not suffer from data issues.
[17]
There are many possible issues with data. We will see multiple examples shortly.
[26]
Now, data is used for many functions such as operations, customer management, marketing analysis and decision making.
[33]
if the data is of poor quality, there are several problems like wastage of time and money due to incorrect reports, poor decisions and
[41]
frustration.
[42]
Now, let us see examples of data issues.
[46]
First issue is incompleteness. Here is the Customer table
[50]
with columns FirstName, LastName, BillingAddres, Shipping Address and Email.
[56]
For example, Dave has a billing address of 111 Main City street, and the email is [email protected]
[66]
Now, here LastName is missing, Shipping Address is missing and the Email is truncated.
[77]
Robert has the default BillingAddress and
[81]
his ShippingAddress is missing.
[83]
It is also possible that the Customer table misses records that exist in the source data, or the customer table is not updated.
[94]
Second data quality issues is accuracy. Here is the employee table with columns, Name, Position, Department and Email.
[104]
Now, Anna's Position and Department is interchanged.
[110]
Berry's Position has a typo in Software.
[116]
Charles' Department has a junk value.
[122]
The third data quality issue is inconsistency. Here is the Order table with columns, OrderNumber, CustomerId,
[128]
ShipDate and Total.
[132]
Now, the totals are in different currency. This one is in dollars, and this one is in euros.
[141]
The currency should be consistent.
[146]
Fourth data quality issue is Invalidity.
[149]
Here is the Product table with columns, ProductCode, Name, UnitsAvailable and RetireDate.
[159]
Now, UnitsAvailable is negative in this row. Which is impossible.
[166]
Also Widget2 is already retired, but still present in this table.
[173]
The fifth data quality issue is Redundancy. Here is a Customer table with columns, FirstName, LastName and Email.
[183]
Now, the second row is a duplicate of the first row because the email addresses are the same.
[193]
The sixth data quality issue is Non-standard. Here is the Order table with columns, OrderNumber, CustomerID and Total.
[205]
In the
[206]
second row, the total has four places after the decimal point, so it is in non-standard format.
[215]
Now, let us see the process of data quality assurance, which can be used by the data quality analyst.
[224]
It has three main activities.
[227]
First is data profiling.
[230]
It involves exploring the data
[233]
to understand and analyze it.
[236]
Then identify
[238]
the data quality issues and then summarize the data in Table Row counts, blank values, duplicates and so on.
[246]
If the data quality issue is widespread, the data quality is poorer.
[251]
For example, if the Customer table has 2% rows missing, the data quality is higher than if the Customer table has 20% rows missing.
[263]
Data quality can be expressed in data quality metrics such as % Complete, % Accurate, % Conformance.
[271]
(that means conformance to business rules
[274]
like value >= the lower range).
[279]
Now, manual data profiling with SQL queries is effort intensive, so only a small subset of data can be profiled.
[288]
However tools like SQL Server Data Quality Services do not need writing sql queries and
[296]
can run checks on the entire data set quickly.
[302]
The next activity in data QA is Data Cleansing. It involves fixing data issues.
[309]
Simple updates fix data quality issues like misspellings, typos, incorrect formats and missing foreign keys.
[316]
In case of duplicates,
[318]
one of the rows is selected as the Master.
[321]
Data in the duplicate rows is used to enrich the Master or just discarded.
[328]
Selective imports bring in the missing rows.
[331]
Obsolete data is deleted.
[334]
The important point is that due to the risk associated with data changes, such changes are NOT applied to the database directly.
[342]
Each proposed change is listed and
[346]
reviewed by a domain expert also known as the Data Steward.
[350]
The Data Steward approves or rejects proposed changes. Only the approved changes are applied by the Data Quality Analyst.
[362]
The next activity in data QA is data monitoring. The data should be maintained in a clean state.
[369]
Users entering new data check business rules first.
[373]
Users who consume data, submit data quality issues for review.
[377]
The Data Quality Analyst
[380]
runs checkers like parsers, spell-checkers, business rules checkers, duplicate checkers etc. to find out
[388]
data quality issues and submit them to the Data Steward.
[394]
If the Data Seaward lacks in domain knowledge, he or she can use
[399]
pre-built knowledge bases with business rules from multiple domains like address, zip code etc.
[408]
That's all in this video. Thank you and see you in my other videos on Database normalization.