Getting your arms around the data
Next, we need to prepare the data for analysis, using ETL (extract, transform, load).
Extract: Pull some or all data from a multitude of sources that have information about identities, accounts, rights, activities, and resources. Expect the data to exist in various repositories, with different storage formats and data representations, each with its own security challenges. Anticipate needing to employ different techniques and technologies to connect to and extract the needed data. Most systems have data available that helps answer the who, what, where, when, why, and how. For example, an HCIS (health care information system) typically has information about the following:
- Accounts for workers, clinicians, researchers, affiliates (Dr. Smith)
- Rights assigned to the accounts (Dr. Smith can schedule appointments, dispense medication)
- Resources accessible via the assigned rights (schedules for Dr. Smith's team of clinicians and records for Dr. Smith's patients)
- Activity done within the HCIS (Dr. Smith logged in and viewed the records of patient X)
The extraction phase may be performed in a batch/bulk manner, or it may be conducted in real time, where data is extracted as it changes.
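As a minimal sketch of the batch/bulk case, the snippet below pulls account-and-rights records from a hypothetical HCIS flat-file export. The column names and sample rows are invented for illustration; a real deployment would need a connector per source system (database, directory, log store), each with its own access controls.

```python
import csv
import io

# Hypothetical CSV export from an HCIS; real sources vary in format.
RAW_EXPORT = """account_id,identity,right
a-100,Dr. Smith,schedule_appointments
a-100,Dr. Smith,dispense_medication
"""

def extract_accounts(export_text):
    """Bulk-extract account/right records from a CSV export into dicts."""
    return list(csv.DictReader(io.StringIO(export_text)))

records = extract_accounts(RAW_EXPORT)
print(len(records))         # 2
print(records[0]["right"])  # schedule_appointments
```

A real-time variant would run the same parsing logic per change event rather than per file.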
Transform: Next, the data must be converted and normalized to get it into an understandable format. A simple example is date and time: Data may show 9:00 a.m., but what time zone? Is it Daylight Saving Time or Standard Time? Does all the data conform to the same level of granularity in minutes, seconds, or microseconds? Typically you resort to transforming all data to Greenwich Mean Time (GMT). The time stamp format for logon events may vary with each system extract and needs to be converted to a consistent format for analysis.
Many other data transformations may be done to prepare the extracted data for storage and analysis. The data may need to be augmented from another repository, split into new data elements, validated against other repositories, or changed to a new value. Here's how a ZIP code might be transformed:
- Accept ZIP codes in either the five-digit format or ZIP+4
- Split ZIP+4 records into two fields
- Discard data that is missing a ZIP code or that contains alpha characters
- Verify that the five-digit ZIP code is valid; verify the +4 extension if present
- Look up the ZIP code to populate and correct city and state information
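The ZIP code rules above can be sketched as a single transform function. The city/state lookup is omitted here because it requires a reference table; everything else is format validation.

```python
import re

def transform_zip(raw):
    """Validate a ZIP code, split ZIP+4 into two fields, discard bad values."""
    raw = (raw or "").strip()
    # Five digits, optionally followed by -NNNN (the ZIP+4 extension)
    match = re.fullmatch(r"(\d{5})(?:-(\d{4}))?", raw)
    if not match:  # missing, wrong length, or alpha characters: discard
        return None
    zip5, plus4 = match.groups()
    return {"zip5": zip5, "plus4": plus4}

print(transform_zip("12345-6789"))  # {'zip5': '12345', 'plus4': '6789'}
print(transform_zip("1234A"))       # None
```

The final step, a lookup that populates and corrects city and state, would join `zip5` against an authoritative postal reference table.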
Load: The last step is to store the transformed data in a repository for analysis and to determine what data is overwritten and what data is kept. For example, is the incoming "load" data authoritative, or is the data already present in the repository authoritative? Expect to collect a large amount of data and then, depending on your data retention policy, add it to an already large data set. When activity data is collected, not only is it likely to be large, but it may also arrive quickly (in real time).
Furthermore, the need to do forensics often drives the need to retain detailed records, resulting in larger data sets. Expect that your disk storage needs may increase based on the size of your organization and your forensics needs.
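One way to settle the "which copy is authoritative" question at load time is a per-source precedence rule. The source names and ordering below are hypothetical; the point is only that the load step needs an explicit overwrite policy.

```python
# Hypothetical precedence: earlier in the list wins on conflict.
PRECEDENCE = ["hr_system", "hcis", "spreadsheet"]

def load_record(repository, record):
    """Keep whichever copy of a record comes from the more authoritative source."""
    key = record["account_id"]
    existing = repository.get(key)
    if existing is None or (
        PRECEDENCE.index(record["source"]) < PRECEDENCE.index(existing["source"])
    ):
        repository[key] = record  # incoming data wins: overwrite
    return repository

repo = {}
load_record(repo, {"account_id": "a-100", "source": "spreadsheet", "title": "MD"})
load_record(repo, {"account_id": "a-100", "source": "hr_system", "title": "Attending MD"})
print(repo["a-100"]["title"])  # Attending MD (hr_system outranks spreadsheet)
```

A retention policy would layer on top of this: instead of discarding the losing copy, a forensics-oriented load might archive every version with its source and time stamp.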
To get answers: Analyze, relate, infer, and visualize
With the data normalized and loaded, it's ready to analyze. The analysis itself generates new data in the form of facts, relationships, indicators, trends, and inferences.
Multidimensional analysis reorganizes the data and provides new ways to pivot, view, and analyze it. IAI (identity and access intelligence) analytics solutions are specifically tailored to provide analysis and visualization for IAM, making the connections among identities, assigned accounts, and permissions to determine, ultimately, the resulting access that a person has to a given resource.
Relationships between objects, such as inheritance and hierarchy, add to the complexity of understanding the access environment. They also help us answer whether Bob can really approve large budget items, and they let us assess the risk related to the given access rights. For example, assume that we "know" (from collecting identity and access data) that Bob can approve budget items over $100,000. From that fact we can infer that Bob is a power user in the application where he approves those items.
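The Bob example can be sketched as rights resolution over a role hierarchy. The role names, the parent/child links, and the dollar-threshold right are all invented for illustration; the technique is simply walking the inheritance chain and flagging anyone who ends up with a high-risk right.

```python
# Hypothetical role hierarchy: child role -> parent role it inherits from.
ROLE_PARENTS = {"budget_approver_l2": "budget_approver_l1"}
ROLE_RIGHTS = {
    "budget_approver_l1": {"approve_budget_up_to_10000"},
    "budget_approver_l2": {"approve_budget_over_100000"},
}
POWER_USER_RIGHTS = {"approve_budget_over_100000"}

def effective_rights(role):
    """Collect a role's own rights plus everything inherited from its parents."""
    rights = set()
    while role:
        rights |= ROLE_RIGHTS.get(role, set())
        role = ROLE_PARENTS.get(role)
    return rights

bob = effective_rights("budget_approver_l2")
print(bool(bob & POWER_USER_RIGHTS))  # True: infer that Bob is a power user
```

This is the kind of inference an IAI solution automates at scale: resolving effective access through inheritance, then surfacing the identities whose resulting access is risky.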