Assistance, Not Resistance: 11 Keys to Big Data Analytics Success – Part Two
As introduced in Assistance, Not Resistance: 11 Keys to Big Data Analytics Success – Part One, large organizations continuously seek ways to leverage existing data in new architectures to improve the business, whether through cost-reduction initiatives such as consolidating systems, construction of predictive machine learning models for analytics, or tried-and-true data quality, reporting, and other types of analysis. Once these core capabilities are in place, extracting valuable insights that improve the bottom line becomes much easier.
To make construction as seamless as possible, address these six additional considerations for successful big data analytics solutions:
6. Build a prototype environment to test out complex functions
Especially in the brave new world of Apache™ Hadoop®, functionality that worked seamlessly in the past may suddenly require effort. For example, indexing is still critically important, but constructing and maintaining indexes might not be as straightforward as developers expect. Even simple tasks like updating records can’t be taken for granted.
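To see why even a record update is worth prototyping, here is a minimal sketch of one way an upsert might be implemented as a read-merge-rewrite job with PySpark when the storage layer does not support in-place updates. The paths and the record_id / updated_at columns are illustrative assumptions, not a prescribed design.

```python
# Hypothetical upsert prototype: merge incoming rows with existing rows and
# keep only the latest version of each record, since in-place updates may not
# be available in the storage layer. Paths and column names are assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("upsert-prototype").getOrCreate()

existing = spark.read.parquet("/data/customers/current")
incoming = spark.read.parquet("/data/customers/incoming")

combined = existing.unionByName(incoming)
latest = (combined
          .withColumn("rn", F.row_number().over(
              Window.partitionBy("record_id").orderBy(F.desc("updated_at"))))
          .filter(F.col("rn") == 1)
          .drop("rn"))

# Write to a staging path; swapping it into place is left to job orchestration,
# because overwriting a path that is still being read will fail.
latest.write.mode("overwrite").parquet("/data/customers/staging")
```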
Another area ripe for prototype testing is enablement of granular security. If analytics users must be granted access to some data but denied access to other data, and if command-line query access is an option, then application-level permissions cannot be relied on to implement security. Prototype these options and develop proven templates for use in future projects.
7. Test datasets
This is not testing to verify data, although that is an important task. This is about performing development and integration testing using subsets of the full dataset. Why? Pushing millions of rows of data through transformation code and storage-layer pipelines consumes time and computing resources, and there is no reason to run the full dataset until the pipeline works for a representative sample. Keep in mind that test datasets require referential and temporal integrity, so they must be carefully chosen and defined. As a best practice, build code that verifies input/output counts, sample checks, valid values, date/time formats, and time zones.
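As one possible shape for that verification code, the pandas sketch below checks input/output counts, valid values, and timestamp parsing on a representative sample. The column names (status, order_ts) and the allowed status list are assumptions for illustration.

```python
# Minimal verification sketch for a representative sample, assuming pandas and
# illustrative column names; real checks would be driven by the data contract.
import pandas as pd

ALLOWED_STATUS = {"NEW", "SHIPPED", "CANCELLED"}

def verify_sample(input_df: pd.DataFrame, output_df: pd.DataFrame) -> list[str]:
    problems = []
    # Input/output counts: no silent drops or duplicates in the pipeline.
    if len(input_df) != len(output_df):
        problems.append(f"row count changed: {len(input_df)} -> {len(output_df)}")
    # Valid values: every status must come from the agreed list.
    unexpected = set(output_df["status"].dropna().unique()) - ALLOWED_STATUS
    if unexpected:
        problems.append(f"unexpected status values: {sorted(unexpected)}")
    # Date/time formats and time zones: timestamps must parse as UTC-aware.
    ts = pd.to_datetime(output_df["order_ts"], errors="coerce", utc=True)
    if ts.isna().any():
        problems.append(f"{int(ts.isna().sum())} unparseable timestamps")
    return problems
```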
8. Manage historical records
Some systems keep historical versions of records while others only maintain the current version. First, decide if handling of history is required. If it is necessary, identify those scenarios early in the project. Be sure to coordinate with records management to ensure that company policies are followed. This is particularly important when removing information after it has expired. Consider construction of views that show only the current records for situations where both historical changes and current versions are needed.
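As a sketch of that idea, the Spark SQL statement below exposes only the current rows of a versioned history table. The customer_history name and the convention that a NULL valid_to marks the current version are assumptions, not a required design.

```python
# Hypothetical "current records only" view over a history table that keeps one
# row per record version; table and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("current-records-view").getOrCreate()

spark.sql("""
    CREATE OR REPLACE TEMP VIEW customer_current AS
    SELECT *
    FROM customer_history
    WHERE valid_to IS NULL   -- an open-ended validity period marks the current version
""")

# Analysts who need only the latest state query the view; anyone who needs
# change history still queries customer_history directly.
spark.sql("SELECT COUNT(*) AS current_rows FROM customer_current").show()
```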
9. Consider requirements for harmonization
Many analytics projects will require reformatting data from various input structures and models into a preferred or standard structure. This can include harmonization of entity structure names, such as tables or objects; entity attributes and properties, including complex and array types; and data values. Harmonization of values can include conversion to standard units, standard time zones and standard lists of valid values.
Decide when the source system names and values should be provided by reference, either in metadata or supplemental definitions, and when the source information should be included in the analytics data itself. Provide decoded values when the source system used enumerations or other types of encoding to minimize the size of values. Space is usually not a significant concern, so include enough information to support all kinds of analysis, including original value distinctions such as abbreviations and typos (they might be helpful for some future use!). Harmonizing data usually reduces the list of possible values, which means that some distinctions between values will be lost, unless the original values are also provided.
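A small pandas sketch of this kind of value harmonization follows: it maps source codes to a standard list, converts units, and normalizes time zones while keeping the original values alongside. The column names, code map, and source time zone are assumptions for illustration.

```python
# Hypothetical harmonization step: standardize codes, units, and time zones
# while retaining the source values; names and mappings are assumptions.
import pandas as pd

STATUS_MAP = {"A": "ACTIVE", "ACT": "ACTIVE", "I": "INACTIVE", "INACT": "INACTIVE"}

def harmonize(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Keep the source value so original distinctions are not lost.
    out["status_source"] = out["status"]
    out["status"] = out["status"].map(
        lambda v: STATUS_MAP.get(str(v).strip().upper(), "UNKNOWN"))
    # Standard units: the source reports Fahrenheit, add a Celsius column.
    out["temp_c"] = (out["temp_f"] - 32) * 5.0 / 9.0
    # Standard time zone: localize naive local timestamps and convert to UTC.
    out["event_ts_utc"] = (pd.to_datetime(out["event_ts"])
                           .dt.tz_localize("America/New_York")
                           .dt.tz_convert("UTC"))
    return out
```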
10. Determine user access methods
Since the analytics platform will need to support a wide range of data consumers and users, develop requirements and prototypes around access methods. Typical high-level access methods to consider and prioritize in project development cycles are listed below.
- Services – including FTP/batch files, web services, or messaging/queueing.
- Publish/subscribe
- Snapshots of all current and historical records
- Incremental updates, new records, and deletions (if applicable)
- Query capability (for applications without a synchronized copy of the data)
- Command Line – using SQL or other query languages.
- Views – providing restricted or compiled views of the underlying data.
- User Interface – including the ability to browse or search the entire dataset, view maps or other capabilities.
- Dimensional Models – supporting traditional business intelligence or data warehouse front ends
- Sandboxes – developing special-purpose datasets using Python, R and other tools.
11. Plan for smooth operations
There will be issues after the analytics platform is deployed. Build in capabilities for support and troubleshooting from the beginning by specifying components that control and monitor the status of all environments, tools for restarting processes or restoring from checkpoints, informative and accessible error logging, recovery procedures, and other supporting functionality.
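As an illustration of the checkpoint-and-restart idea, the sketch below records each completed unit of work in a checkpoint file so that a rerun skips finished work and logs failures where operators can find them. The paths and the partition-by-partition model are assumptions.

```python
# Hypothetical restartable batch loop with a simple file checkpoint and
# accessible logging; paths and the notion of "partition" are assumptions.
import json
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("nightly-load")
CHECKPOINT = Path("/var/run/analytics/nightly_load_checkpoint.json")

def load_partition(partition: str) -> None:
    """Placeholder for the real load step for one partition."""
    log.info("loading partition %s", partition)

def run(partitions: list[str]) -> None:
    done = set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()
    for partition in partitions:
        if partition in done:
            log.info("skipping %s (already loaded)", partition)
            continue
        try:
            load_partition(partition)
        except Exception:
            # Log with a stack trace and stop; after the fix, a rerun resumes here.
            log.exception("partition %s failed", partition)
            raise
        done.add(partition)
        CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
        CHECKPOINT.write_text(json.dumps(sorted(done)))
```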
As with other large, enterprise-wide initiatives, analytics projects tend to face all sorts of unforeseen difficulties in accessing the data. Plan to spend a large amount of time discovering what exists, who understands it, and what the best source is for each topic. Start small, gain experience making incremental changes, and you'll go far.
Check out our website to discover how Xtensible Solutions makes analytics implementation more efficient and effective.