Mindy Becker, Senior Data Governance Solutions Consultant
Many organizations are data-rich. So rich, in fact, that they don’t know where to begin to find, access, organize or use this data to improve their processes, procedures and costs, or to identify new business opportunities.
They know the data’s there, but where?
Most likely, it’s hiding among the separate spreadsheets, documents, databases and devices available to many associates (too many?!) or, even worse, to ill-intentioned hackers.
To the rescue comes CLAIRE, an out-of-the-box AI solution embedded in Informatica Enterprise Data Catalog (EDC) that can remedy many related issues. Although a powerful tool for data discovery, CLAIRE, like any “associate,” requires training to turn data into reporting, insights, strategy, content or conversions. While highly beneficial, CLAIRE often requires more-than-anticipated training time and realistic initial expectations to be fully useful.
Consider the needs of an insurance company, which can accumulate many types of personalized data in a single policy or claim, such as the agent, policyholder, payer, payee, beneficiary, payable amounts, incident or expiration dates and much more. For the company to do its work in selling, collecting, processing claims etc., it needs an efficient way to assess its ongoing business and map the way forward.
With proper and patient training, CLAIRE can help data analysts or scientists who face staggering amounts of data, spanning anywhere from a single day to many years, in these ways:
- Discover where specific data resides. If you’re seeking items like dates (usually 6 or 8 digits) or Social Security numbers (always 9 digits) – and allowing for format variances (123456789 or 123-45-6789 or 123/45/6789) – CLAIRE can be trained to find these in different files, databases or devices. Likewise, a given data point can be excluded from consideration.
- Identify data you have or don’t have. By training CLAIRE to recognize common data formats for a specific term, or by knowing what is sought and assigning an affiliated domain, you can learn whether or not a desired element exists in the data. Alternatively, the identification process can tell you if you’ll need to find or obtain the element elsewhere.
- Integrate the available data. Since desired data can exist in different formats or reside in different locations, CLAIRE can be trained to seek out the variations and place them in separate columns for a common purpose.
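To make the discovery idea concrete, here is a minimal Python sketch of pattern-based matching for the Social Security number formats mentioned above. It is not CLAIRE's implementation or any EDC API. The pattern, the function names (`match_ratio`, `find_candidate_columns`), the 80% cutoff and the sample table are all assumptions made up for illustration.

```python
import re

# Hypothetical pattern: nine digits, optionally split 3-2-4 by "-" or "/".
# The backreference \1 requires the same separator in both positions.
SSN_PATTERN = re.compile(r"^\d{3}([-/]?)\d{2}\1\d{4}$")


def match_ratio(values, pattern=SSN_PATTERN):
    """Return the fraction of non-empty values that match the pattern."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return 0.0
    hits = sum(1 for v in cleaned if pattern.match(v))
    return hits / len(cleaned)


def find_candidate_columns(tables, min_ratio=0.8):
    """Scan {table: {column: [values]}} and list columns that look like SSNs."""
    found = []
    for table, columns in tables.items():
        for column, values in columns.items():
            ratio = match_ratio(values)
            if ratio >= min_ratio:
                found.append((table, column, ratio))
    return found


# Made-up sample data: one column of SSN-like values, one of 6- or 8-digit dates.
sample = {
    "claims": {
        "payee_tax_id": ["123456789", "123-45-6789", "123/45/6789"],
        "incident_date": ["20190412", "20200101", "041219"],
    },
}

print(find_candidate_columns(sample))
```

An empty result from a scan like this is also informative: it suggests the element may not exist in the scanned sources and might need to be obtained elsewhere.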
For these three initiatives, which only scratch the surface of CLAIRE’s capabilities, here’s how to train this tool for your benefit:
EDC has several predetermined, out-of-the-box (OOB) domains representing Personally Identifiable Information (PII), Protected Health Information (PHI), and Payment Card Industry (PCI) data. The rules behind these domains are written to detect the data’s patterns and common naming conventions.
- Before you ever start scanning for information, you can set CLAIRE and your team on a path for success by reviewing all of the provided OOB domains and domain groups to decide what kinds of information you will scan for. As an international product, EDC offers several country-specific domains, and some might not apply to your data. There’s no need to scan for information you know you don’t have, and then continually train CLAIRE to not assign that domain to anything.
- You can also sample your known information to see what is stored together. Domain parameters (known proximity) can be assigned to help increase the certainty of a successful domain assignment. Items like address, city, state and zip code are often stored in the same table, so an assignment is more likely to be correct if CLAIRE finds all of those domains close together.
- Lastly, you can change the preference for auto-acceptance of domains. For OOB assignments, the “likely” minimum threshold is 40%, and the maximum is 80%.
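The proximity and threshold ideas above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not EDC's scoring logic. It assumes one possible reading of the thresholds (below 40% is ignored, 40% up to 80% is a "likely" match that needs review, 80% or more is accepted automatically), and the bonus value, domain names and scores are invented for illustration.

```python
# Assumed reading of the thresholds in the text, expressed as fractions.
LIKELY_MIN = 0.40
AUTO_ACCEPT_MIN = 0.80

# Hypothetical proximity rule: these domains tend to be stored together.
ADDRESS_GROUP = {"address", "city", "state", "zip_code"}
PROXIMITY_BONUS = 0.10  # illustrative value, not a product setting


def adjust_for_proximity(scores, group=ADDRESS_GROUP, bonus=PROXIMITY_BONUS):
    """Raise scores for group members when the whole group is found in one table."""
    likely = {domain for domain, score in scores.items() if score >= LIKELY_MIN}
    if group <= likely:  # every member of the group is at least "likely"
        return {
            domain: min(1.0, score + bonus) if domain in group else score
            for domain, score in scores.items()
        }
    return dict(scores)


def classify(score):
    """Map a match score to a curation decision."""
    if score >= AUTO_ACCEPT_MIN:
        return "auto-accept"
    if score >= LIKELY_MIN:
        return "review"
    return "ignore"


# Made-up scores for the columns of a single table.
table_scores = {
    "address": 0.75,
    "city": 0.72,
    "state": 0.85,
    "zip_code": 0.78,
    "date": 0.30,
}
adjusted = adjust_for_proximity(table_scores)
decisions = {domain: classify(score) for domain, score in adjusted.items()}
print(decisions)
```

In this sketch, an address score of 0.75 would land in the review queue on its own, but it crosses the auto-accept line once city, state and zip code are found alongside it.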
The work begins after the OOB scans have run. CLAIRE has made its best guesses, but it is up to Data Curators and Stewards to evaluate CLAIRE’s success. An operator must individually review each asset to both remove incorrect domains and approve correctly assigned domains. Though CLAIRE’s prospects are exciting, the early results can be disappointing. Some operators have unfortunately seen SSNs, dates, and account IDs all assigned to the same column – ugh!
Given what’s required to train CLAIRE, and the potential “confusion” among data items like SSNs and dates, it can help to start with sensitive PII and proceed to less sensitive data. After initiating assignments, you may need to remove the unwanted domain assignments, such as dates, before CLAIRE works its “magic” by auto-correcting domain assignments. While assigning multiple domains can be redundant and tedious early on, it helps to recognize that CLAIRE is machine learning, and not a trained machine.
Whether by a person or machine, time is needed to understand any data asset. With each acceptance or denial of an assigned domain, CLAIRE is learning and will continue learning with each action taken by a Data Steward. The speed at which CLAIRE learns varies widely from company to company due to many variables, including the type, quality, and quantity of data to be scanned. If told where and how, CLAIRE, once trained, can recognize lineage, synonyms, glossary definitions, data types and patterns, column names and more.
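As a rough illustration of how steward feedback can accumulate into a usable signal, the Python sketch below keeps a simple tally of accept and reject decisions. It says nothing about how CLAIRE learns internally. The `FeedbackLog` class, its methods and the sample decisions are hypothetical.

```python
from collections import defaultdict


class FeedbackLog:
    """Toy record of steward decisions, keyed by (domain, column name)."""

    def __init__(self):
        self.accepted = defaultdict(int)
        self.rejected = defaultdict(int)

    def record(self, domain, column, accepted):
        """Store one accept or reject decision for a domain on a column."""
        key = (domain, column.lower())
        if accepted:
            self.accepted[key] += 1
        else:
            self.rejected[key] += 1

    def acceptance_rate(self, domain, column):
        """Return accepted / total decisions, or None when nothing is recorded."""
        key = (domain, column.lower())
        total = self.accepted[key] + self.rejected[key]
        if total == 0:
            return None
        return self.accepted[key] / total


# Made-up decisions: the SSN domain is approved, the date domain is removed twice.
log = FeedbackLog()
log.record("ssn", "PAYEE_TAX_ID", accepted=True)
log.record("date", "PAYEE_TAX_ID", accepted=False)
log.record("date", "PAYEE_TAX_ID", accepted=False)

print(log.acceptance_rate("ssn", "payee_tax_id"))
print(log.acceptance_rate("date", "payee_tax_id"))
```

Even a tally this simple shows why early curation is tedious but worthwhile: each decision sharpens the picture of which domain belongs on which kind of column.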
CLAIRE can move huge volumes of data from unknown or unidentified to useful. Once trained, CLAIRE can bring efficiency and expediency to often laborious tasks – from data management to reporting – leaving your organization with more time to solve everyday business problems and challenges.