Modern data is continuous, diverse, and ever accelerating
How companies such as athenahealth can transform legacy data into insights.
Application development in the big data space has changed rapidly over the past decade, driven by three factors: the need for continuous availability, the richness of diverse data processing pipelines, and the pressure for accelerated development. Here we look at each of these factors and how they combine to require new environments for data processing. We also look at how one company, athenahealth, is adjusting its legacy systems for billing, scheduling, and treatment, which serve health care providers, to accommodate these trends using the Mesosphere DC/OS platform.
Continuous availability: Data is always on
We now live in a world where people may be working on or visiting your site at three in the morning, and if it’s unavailable or slow, you’ll hear complaints. Failure recovery and scaling are both critical capabilities. These used to be handled separately: failure recovery revolved around heartbeats and redundant resources, whereas scaling was an element of long-term planning by IT management. But the two capabilities are now handled through the same kinds of operations. This makes sense, because both require monitoring and resource management.
Continuous availability is now supported by virtualization, whether on-premises or in a third-party vendor’s cloud, and whether the underlying infrastructure is IaaS or PaaS. New instances of an application must be spun up quickly whenever old instances fail (recovery) or usage spikes require more servers (scaling). Changes to applications (which may occur frequently, as we’ll see) require all these instances to be kept up to date. Containers and microservices are popular mechanisms for this flexibility.
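To make that convergence concrete, the sketch below shows, in deliberately toy Python, the reconciliation loop at the heart of orchestrators such as Marathon on DC/OS: the same loop that replaces failed instances also moves the fleet toward a desired count, which is why recovery and scaling are now one operation. The instance names, the simulated health check, and the DESIRED_INSTANCES value are all hypothetical.

```python
import random
import time

# Hypothetical illustration: one reconciliation loop that handles both
# failure recovery and scaling. Real orchestrators apply the same idea
# at much larger scale.

DESIRED_INSTANCES = 3

def is_healthy(instance):
    """Stand-in for a real health check (e.g., an HTTP heartbeat)."""
    return random.random() > 0.1  # ~10% simulated failure rate

def reconcile(instances, desired):
    """Replace failed instances and converge on the desired count."""
    alive = [i for i in instances if is_healthy(i)]
    while len(alive) < desired:           # recovery and scale-up are one path
        new_id = f"instance-{random.randrange(10_000)}"
        print(f"spinning up {new_id}")
        alive.append(new_id)
    return alive[:desired]                # scale down if load drops

if __name__ == "__main__":
    instances = []
    for _ in range(5):                    # a few reconciliation cycles
        instances = reconcile(instances, DESIRED_INSTANCES)
        print("running:", instances)
        time.sleep(0.1)
```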
As part of its transformation with Mesosphere, athenahealth added cloud support on the Microsoft Azure platform, enabling them to run multiple services in a scalable, fault-tolerant manner.
Diverse data processing pipelines: Flexibility is key
While system operations coalesce around the general goal of availability, data processing has splintered across a wide variety of tools: Hadoop, Spark, Kafka, Flink, and others. (The most popular tools cluster around Apache’s open source Hadoop project, but because Hadoop was created when data size mattered more than speed, new tools for fast data are emerging.) This diversity and fast-paced evolution require great flexibility in data handling and in the workflows organizations use.
The database industry has accommodated diverse processing for decades. Traditionally, a relational database would handle transactions while the organization set up an extract, transform, and load (ETL) workflow to support analytics. The field is being significantly disrupted by the newer tools, often in support of soft real-time handling for streaming data. The workflows for data processing get set up and torn down quickly and repeatedly, so the containers and virtual systems described in the previous section fit well with data processing requirements.
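As a rough illustration of how streaming reshapes the classic ETL pattern, here is a minimal PySpark Structured Streaming sketch. The broker address, topic name, and per-minute aggregation are assumptions for the example, and running it requires Spark’s Kafka connector package; it is not athenahealth’s actual pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

# Hypothetical sketch: a streaming pipeline standing in for a nightly ETL job.
spark = (SparkSession.builder
         .appName("streaming-etl-sketch")
         .getOrCreate())

# Extract: read a live Kafka stream instead of a nightly batch dump.
# (Broker address and topic name are assumptions.)
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Transform: Kafka delivers raw bytes; cast them and aggregate per minute.
counts = (events
          .selectExpr("CAST(value AS STRING) AS value", "timestamp")
          .groupBy(window(col("timestamp"), "1 minute"))
          .count())

# Load: write results continuously instead of waiting for a batch window.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```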
Performance on the old athenahealth platform could not keep up with growth or new types of applications requiring rapid data access. Retrieving and filtering a list of doctors, for instance, could take several minutes. They upgraded the platform to efficiently expose data to popular analytic tools such as Spark and TensorFlow, which help them support the hundreds of thousands of messages exchanged among doctors each day.
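To give a feel for what exposing data to a tool such as Spark looks like, here is a minimal PySpark sketch of that kind of doctor lookup. The dataset path and column names (`specialty`, `name`, `practice_id`) are invented for illustration, not athenahealth’s schema.

```python
from pyspark.sql import SparkSession

# Hypothetical sketch: the kind of lookup that took minutes on the
# legacy platform. Path, schema, and column names are assumptions.
spark = SparkSession.builder.appName("doctor-filter-sketch").getOrCreate()

# Read a columnar dataset once it has been exposed to analytic tools.
doctors = spark.read.parquet("/data/doctors.parquet")

# Filter and project: with columnar storage and a distributed engine,
# this becomes an interactive query rather than a multi-minute scan.
cardiologists = (doctors
                 .filter(doctors.specialty == "cardiology")
                 .select("name", "practice_id"))

cardiologists.show(10)
```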
Accelerated development: Modularity means scalability
The markets and other environments in which businesses operate have been changing so quickly that application development has been rethought over the past decade as well. Buzzwords such as Agile development, microservices (mentioned earlier in connection with continuous availability), and DevOps show that development is breaking out of traditional organizational hierarchies.
Fast development requires rigorous modularization, so a small team can focus on one piece of the application and update it with minimal risk of breaking other pieces. Before their transformation, athenahealth had a relatively monolithic architecture that made it difficult to identify bugs before they reached production. Modularizing their legacy system improved testing and sped up maintenance.
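As a toy illustration of that testing payoff, the Python sketch below shows a small module with a narrow interface and an isolated unit test; the reminder rule is invented for the example, not an actual athenahealth feature.

```python
# Hypothetical sketch: a small, self-contained module with a narrow
# interface that one team can own and test in isolation.

def eligible_for_reminder(last_visit_days: int, opted_in: bool) -> bool:
    """Invented business rule: remind opted-in patients not seen in 180+ days."""
    return opted_in and last_visit_days >= 180

# A unit test catches regressions before production; in a monolith, the
# same rule might be buried where no isolated test can reach it.
def test_eligible_for_reminder():
    assert eligible_for_reminder(200, True)
    assert not eligible_for_reminder(200, False)
    assert not eligible_for_reminder(30, True)

if __name__ == "__main__":
    test_eligible_for_reminder()
    print("all checks passed")
```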
Coordination headaches: The solutions are out there
Continuous availability, diverse data processing pipelines, and accelerated development are powerful capabilities that make it possible for organizations to keep up with the frantic pace of business environments. But they impose daunting burdens on IT staff and organizations.
Organizations not only need to set up workflows to support these capabilities, but they also have to automate them and extensively test them to make sure they work at three in the morning. Luckily, many vendors recognize the needs of modern development and are simplifying the tools and processes required to make it all work.
This post is a collaboration between O’Reilly and Mesosphere. See our statement of editorial independence.