[et_pb_section fb_built="1" _builder_version="3.22"][et_pb_row make_equal="on" _builder_version="3.25"][et_pb_column type="4_4" module_class="ds-vertical-align" _builder_version="3.25" custom_padding="|||" custom_padding__hover="|||"][et_pb_text admin_label="The Challenge" _builder_version="3.27.4"]

[text-blocks id="requirements-2" align="right"]

The Challenge

An established asset manager approached us with a research platform challenge. Their application brought together market data and their own research and opinions into a single view used to create and evaluate trading strategies.

Their legacy implementation ran on several servers as a monolithic application, on complex, hand-managed infrastructure. Overall performance was poor, which is unacceptable in a trading research environment where timeliness, speed and accuracy are paramount. In addition, the ongoing cost of the setup was hard to justify.

 

Key Considerations

  1. The solution had to process several data feeds reliably and merge them with the in-house research data.
  2. Security was critical, given the highly sensitive trading research information held within the system.
  3. The application had to be responsive and dynamic, with auto-scaling and auto-healing capabilities.
[/et_pb_text][et_pb_text admin_label="Solutions 3" _builder_version="3.27.4"]

[text-blocks id="technologies-used-2"]

The Solution

After reviewing the challenges, Hentsū created an elegant serverless solution in Azure, leveraging a scale of compute that is simply not possible in a private cloud. Where possible we used cloud native services for their low cost, breadth of features, and ease of management.

Hentsū took the existing legacy application and broke it up into microservices. The code was rewritten from the ground up on a modern cloud native stack, separating real-time services from batch processing so that each could be handled appropriately.

The platform has built-in multi-region resiliency across the Azure cloud and auto-scales with workload and user demand. Within the application there are health checks and integrated alerts, with automated recovery and restarts if any service fails.

Utilising such a cloud native serverless solution gives us huge processing power when needed, while paying only for what is actually used.

Technical Details

The entire platform has been built as a serverless solution. The batch work now uses Azure Batch; scripts have been turned into Docker containers which are scheduled onto temporary machines that exist only for as long as they are needed. This allows for simplified management, huge parallelism, and low cost.
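
To illustrate the pattern (this is a minimal sketch, not the client's actual code), the snippet below submits one containerised script as an Azure Batch task using the azure-batch Python SDK. The account, job, image and key are placeholders, the pool is assumed to already exist with container support enabled, and exact parameter names can differ slightly between SDK versions.

```python
# Minimal sketch: submit one containerised script as an Azure Batch task.
# All names and keys are placeholders; the pool behind the job is assumed to
# already exist with Docker container support enabled.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("examplebatchaccount", "<account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://examplebatchaccount.westeurope.batch.azure.com"
)

task = batchmodels.TaskAddParameter(
    id="nightly-load-001",
    command_line="python /app/run_batch_job.py",  # entrypoint inside the container
    container_settings=batchmodels.TaskContainerSettings(
        image_name="exampleregistry.azurecr.io/research/batch-worker:latest"
    ),
)
client.task.add(job_id="nightly-load", task=task)
```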

The application servers are now hosted on Azure App Service to leverage its reliability and extensive management tooling. Hentsū built out the data pipelines to move data from the large number of feeds into Azure Cosmos DB. A managed database removes the headaches of running a database on a machine, such as patching or disk management, while providing features such as geo-redundant automatic backups and one-button scaling.
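
As a simple illustration of the feed-to-Cosmos step, a pipeline worker might upsert normalised records with the azure-cosmos SDK. The account URL, database, container and document shape below are invented for the example and are not the client's schema.

```python
# Illustrative only: upsert one normalised feed record into Cosmos DB.
# URL, key, database/container names and the document shape are placeholders.
from azure.cosmos import CosmosClient

client = CosmosClient(
    "https://example-research.documents.azure.com:443/", credential="<account-key>"
)
container = client.get_database_client("research").get_container_client("market_data")

container.upsert_item({
    "id": "AAPL-2021-06-01",   # unique per document
    "symbol": "AAPL",          # assumed partition key: /symbol
    "close": 124.28,
    "source": "feed-a",
})
```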

Hentsū also developed the API server that gives individual services access to the data, as well as a web app that the client interacts with to perform their data analysis.
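
The source does not name the client's web framework, so purely as a sketch of the API layer's role, here is a hypothetical endpoint (FastAPI, with invented names and query) that hands stored data to the web app:

```python
# Hypothetical API endpoint exposing Cosmos DB data to the web app.
# Framework choice, endpoint path and query are illustrative assumptions.
from azure.cosmos import CosmosClient
from fastapi import FastAPI

app = FastAPI()
container = (
    CosmosClient("https://example-research.documents.azure.com:443/", credential="<account-key>")
    .get_database_client("research")
    .get_container_client("market_data")
)

@app.get("/prices/{symbol}")
def prices(symbol: str) -> list[dict]:
    """Return every stored record for one symbol so the web app can chart it."""
    return list(container.query_items(
        "SELECT * FROM c WHERE c.symbol = @symbol",
        parameters=[{"name": "@symbol", "value": symbol}],
        partition_key=symbol,
    ))
```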

Bitbucket and Bamboo provide the CI/CD pipelines, which test and build the application and batch Docker containers and then push them to the container registry, all automatically whenever the code is updated. This can be adapted to other repository and build server products, such as Azure DevOps, if needed.

Security is crucial, with all access to the application handled natively within Azure and integrated with Azure AD for a seamless authentication experience with world-class security.

The Benefits of Azure App Service

Everything you need to secure and manage a web app is contained in Azure App Service:

  • The servers run as Docker containers, making them easy to build, version, and deploy
  • Deployment tooling such as blue/green deployments, traffic splitting, and slot-specific environment variables allows releases with minimal interruption
  • All authentication is handled through Azure Active Directory, leveraging Azure's world-class security while simplifying platform development, as authentication did not need to be built into the app itself
  • One-button upgrading/downgrading of the underlying hardware, with a change taking around two minutes
  • Custom domain handling and TLS/SSL set up in a few clicks
  • Network security, such as restricting access by IP
  • Log collection and log streaming
  • Encrypted connection strings
[/et_pb_text][et_pb_text admin_label="Impact" _builder_version="3.27.4"]

Impact

The client was very pleased with the final product, particularly with the much lower cost and ease of administration. Additional iterations of the platform are already being deployed seamlessly, with the underlying infrastructure able to scale with additional data and end-user load.


[et_pb_section fb_built="1" _builder_version="3.22" custom_padding="0px||0px"][et_pb_row _builder_version="3.25" background_size="initial" background_position="top_left" background_repeat="repeat"][et_pb_column type="4_4" _builder_version="3.25" custom_padding="|||" custom_padding__hover="|||"][et_pb_text admin_label="The Challenge" _builder_version="3.27.4"]

The Challenge

[/et_pb_text][/et_pb_column][/et_pb_row][et_pb_row _builder_version="3.25" custom_margin="0px||" custom_padding="7px||7px"][et_pb_column type="4_4" module_class="ds-vertical-align" _builder_version="3.25" custom_padding="|||" custom_padding__hover="|||"][et_pb_text _builder_version="3.27.4"]

[text-blocks id="requirements"]A client recently approached us with a data science challenge regarding one of their data sets. The data was provided to the client in an AWS environment in a Redshift data warehouse. While this was fast they found it to be very expensive, in AWS the data and compute costs are coupled together. As such, a large data set necessitates a high spend on computing costs, even if this level of speed is not necessary for their analysts.

However, the data was also available in CSV format in an S3 storage bucket, which could be the starting point of a new approach. The client already had all their infrastructure deployed and managed by Hentsū in Azure, so they wanted to consolidate into the existing infrastructure. 

After reviewing the challenges, we were able to create an elegant solution leveraging the huge power and scale of the cloud, which is simply not possible in traditional infrastructure.

[/et_pb_text][/et_pb_column][/et_pb_row][et_pb_row _builder_version="3.25" background_size="initial" background_position="top_left" background_repeat="repeat"][et_pb_column type="4_4" _builder_version="3.25" custom_padding="|||" custom_padding__hover="|||"][et_pb_text admin_label="Key Considerations" _builder_version="3.27.4"]

Key Considerations

  • The solution had to process a large data set of over 11,000 files, with a total compressed size of ~2 TB and additional files arriving every day.
  • Raw files had to be stored for any future needs, whilst also being ingested into a database.
  • Ingestion had to be both parallelisable and rate controlled, so that the number of database connections stays managed and the load proceeds in an orderly way (see the sketch after this list).
  • This was not only a one-time load of historical data; newly created files also needed to be downloaded and ingested automatically.
  • Every file had to be accounted for to ensure all the data moved correctly, so tracking each file's status was important. Things happen: connections break and processes stop working, so a system had to be in place for when they do.
  • Ongoing maintenance had to be low effort, cost-efficient and automated, with as much of it as possible taken away from end users.
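
As a rough sketch of the parallel-but-rate-controlled requirement above (the file names and ingest step are placeholders, and the real pipelines were built in Azure Data Factory rather than hand-rolled Python), a bounded worker pool caps concurrent database connections while still loading many files at once:

```python
# Sketch only: bounded parallelism keeps the number of simultaneous database
# connections under control while the backlog of files is worked through.
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_CONCURRENT_LOADS = 8  # upper bound on simultaneous DB connections

def ingest(file_name: str) -> str:
    # Placeholder: download the compressed CSV and bulk-insert it,
    # opening at most one database connection per call.
    return file_name

files = [f"prices_part_{i:05d}.csv.gz" for i in range(11_000)]
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_LOADS) as pool:
    futures = {pool.submit(ingest, name): name for name in files}
    for done in as_completed(futures):
        try:
            print("ingested", done.result())
        except Exception as exc:  # broken connections are reported, not fatal
            print("failed", futures[done], exc)
```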
[/et_pb_text][/et_pb_column][/et_pb_row][et_pb_row _builder_version="3.25" background_size="initial" background_position="top_left" background_repeat="repeat"][et_pb_column type="4_4" _builder_version="3.25" custom_padding="|||" custom_padding__hover="|||"][et_pb_text admin_label="The Solution" _builder_version="3.27.4"]

The Solution

[/et_pb_text][/et_pb_column][/et_pb_row][et_pb_row _builder_version="3.25" custom_margin="0px||" custom_padding="7px||7px"][et_pb_column type="4_4" _builder_version="3.25" custom_padding="|||" custom_padding__hover="|||"][et_pb_text _builder_version="3.27.4"]

[text-blocks id="technologies-used" align="right"]Hentsū  recommended a solution built on Azure Data Factory (ADF), Microsoft's Extract-Transform-Load (ETL) solution for Azure. While there are many ETL solutions that can run on any infrastructure, this is very much a native Azure service and easily ties into the other services Microsoft offers.

The key functionality is the ability to define the data-movement pipelines in a web user interface and set triggers, which can be either event based (such as the creation of a new file) or on a time schedule; Azure then handles executing the pipelines to process the data. Pipeline creation requires relatively little coding experience, which makes it easy to delegate to less technical staff.

 

[/et_pb_text][/et_pb_column][/et_pb_row][et_pb_row _builder_version="3.25" background_size="initial" background_position="top_left" background_repeat="repeat"][et_pb_column type="4_4" _builder_version="3.25" custom_padding="|||" custom_padding__hover="|||"][et_pb_text admin_label="Technical Details" _builder_version="3.27.4"]

Technical Details

Hentsū built out the data pipelines to move the data from AWS into Azure. The initial load was triggered manually, but then the update schedules were set to check for new files at regular intervals.
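
The schedules themselves were configured within ADF; purely as a hedged sketch of the same pattern via the azure-mgmt-datafactory Python SDK (resource group, factory, pipeline and trigger names below are placeholders), a recurring trigger that checks for new files might look like this:

```python
# Sketch of a recurring ADF trigger; all names are placeholders, and the call to
# start the trigger (triggers.start vs triggers.begin_start) varies by SDK version.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = ScheduleTrigger(
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="CopyNewS3Files"))],
    recurrence=ScheduleTriggerRecurrence(
        frequency="Hour", interval=1,  # look for new files every hour
        start_time=datetime.now(timezone.utc) + timedelta(minutes=5),
        time_zone="UTC",
    ),
)
adf.triggers.create_or_update(
    "rg-data-ingest", "adf-ingest", "hourly-new-files",
    TriggerResource(properties=trigger),
)
```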

Hentsū created status tables to track each file. These record the state of the data as it passes through the pipelines and support a decoupled structure, so troubleshooting or manual intervention can happen at any stage of the process without creating dependencies. Individual files and steps can be fixed in isolation while the rest of the pipelines continue uninterrupted, and errors on a particular step are easily identified and flagged to users for investigation.
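
To make the idea concrete, each pipeline step can append the file's new state to a status table, which is what later error reporting reads from. The table, column names and connection string below are hypothetical, not the client's schema; the sketch assumes an Azure SQL endpoint reachable via pyodbc.

```python
# Hypothetical status bookkeeping: every pipeline step records what happened
# to each file. Table/column names and the connection string are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:example.database.windows.net,1433;Database=ingest;"
    "Uid=loader;Pwd=<secret>;Encrypt=yes;"
)

def mark_stage(file_name: str, stage: str, state: str, detail: str = "") -> None:
    """Append one row recording that a file reached a stage with a given outcome."""
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO file_status (file_name, stage, state, detail, updated_at) "
        "VALUES (?, ?, ?, ?, SYSUTCDATETIME())",
        file_name, stage, state, detail,
    )
    conn.commit()

mark_stage("prices_2021_06_01.csv.gz", "download", "succeeded")
mark_stage("prices_2021_06_01.csv.gz", "ingest", "failed", "connection reset")
```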

All the data was then mapped back to these tables, so it could be traced if further processing or cleaning of the final tables was ever needed. The data was also transformed with additional schema changes to match the client's end use and to map it onto the traditional trading data.

The pipelines were deliberately abstracted so that adding new data sources in the future takes the least possible work. The goal was to make it easy for the client's end users to do this themselves as and when required.

[/et_pb_text][/et_pb_column][/et_pb_row][et_pb_row _builder_version="3.25"][et_pb_column type="4_4" _builder_version="3.25" custom_padding="|||" custom_padding__hover="|||"][et_pb_text admin_label="Benefits & Caveats" _builder_version="3.27.4"]

The Benefits of Azure Data Factory

ADF runs completely within Azure as a native serverless solution. This means there is no need to worry about where the pipelines run, which instance types to choose up front, managing servers or operating systems, configuring networking, and so on. The definitions and schedules are simply set up and Azure handles the execution.

Running as a serverless solution means true "utility computing", which is the entire premise of cloud platforms such as Azure, AWS, and Google Cloud. The client only pays for what is used, there are no idle servers costing money while producing nothing, and the solution can scale up as needed.

ADF also allows parallelism while keeping costs to only what is used. This scaling was a huge benefit for the client when time is of the essence: one server for 100 hours costs the same as 100 servers for one hour, but the latter finishes in 1/100th of the time. Hentsū tuned the solution so that the speed of the initial load was restricted only by the power of the database, allowing the client to balance the trade-off between speed and cost.
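
A quick back-of-the-envelope calculation makes the point; the hourly rate and workload below are invented for illustration, not the client's figures:

```python
# Same total cost, very different elapsed time: cost = workers * hours * rate,
# and elapsed hours shrink in proportion as workers grow.
RATE_PER_WORKER_HOUR = 0.50   # illustrative price per worker-hour
TOTAL_WORKER_HOURS = 100      # illustrative size of the initial load

for workers in (1, 10, 100):
    elapsed = TOTAL_WORKER_HOURS / workers
    cost = workers * elapsed * RATE_PER_WORKER_HOUR
    print(f"{workers:>3} workers: {elapsed:>6.1f} h elapsed, cost {cost:.2f}")
```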

ADF has some programming functionality, such as loops, waits, and parameters for the whole pipeline. Although it does not offer as much flexibility as a full language (Python, for example), it gave Hentsū significant freedom in designing the workflows.

Caveats

ADF supports a limited set of sources and sinks (i.e. inputs and outputs); the full list is available in the Microsoft documentation. Microsoft's goal with ADF is to get data into Azure products, so moving data into another cloud provider requires a different solution.

The pipelines are written in ADF's own proprietary "language", which means the pipeline code does not integrate well with anything else, as it would if it were written in a language like Python, which many other ETL tools offer. This is also the key reason we have developed our own ETL platform, built on Docker and more portable Python code, for more complex solutions.

There were some usability issues when creating the pipelines, with confusing UI or vague errors on occasion; however, these were not showstoppers. Our advice when using the ADF UI is to make small changes and save often. We can see that Microsoft is already aggressively addressing some of the issues we encountered.

Impact

The client was very pleased with the ADF and Azure SQL Data Warehouse solution. It automatically scales compute power as the data changes week by week, scaling up when there is more data and down when there is less. Overall the solution costs a fraction of what it did previously, whilst keeping everything within the client's Azure environment.

[/et_pb_text][et_pb_cta title="Reach Out To Find Out How We Can Support Your Data Science Needs" button_url="https://hentsuprod.wpengine.com/contact" button_text="Contact Us" _builder_version="3.17.6"] [/et_pb_cta][/et_pb_column][/et_pb_row][/et_pb_section]
