October 15, 2018

Top Challenges of Scaling your Data Science Infrastructure

The challenges involved in scaling your data science research infrastructure include infrastructure consolidation, resource management, consistent runtime environment, and data management.


Most enterprises today are exploring data science business opportunities. Some are taking their first steps in the field, working out which direction to follow and how Machine Learning (ML) can give their portfolio a new edge, while others have already progressed and face the next question: how should they scale their research operations?


Without diving too deep into details, it is fair to say that the world of Machine Learning is inherently different from the traditional software development most enterprises are familiar with. As a result, many companies end up with a siloed, small-scale environment, typically a small team with GPU-powered personal compute units, either cloud-based or desktop. This works fine, at least until the preliminary research is complete.


At this point, management is keen on investing and delegates the responsibility for scaling the ML infrastructure to IT.


Enterprise IT objectives are the exact opposite of those of the ML seed team: while the data science team is focused on its research, IT cares about standardization, efficiency, resource utilization, governance, data management and backup, security, and so on.


Onboarding ML operations into your IT operation is not a simple process.


It is clear that what worked so far will not be adequate; it calls for a different approach – and everyone should be on board.


If done properly, everyone can benefit from the change:

  1. NoOps, consistent runtime environment: At the end of the day, operating wide-scale compute resources is a job no data scientist wants to do or should spend time on. When researchers run the infrastructure themselves, they often get inconsistent research results, which lead to time-consuming root-cause analysis. Any research team's dream, by contrast, is to get GPU-powered infrastructure as a service: submit a training job and receive a link to its results on completion.
  2. Infrastructure Consolidation: The benefits of infrastructure consolidation are clear for the IT organization. It provides higher resource utilization, simpler and more cost-effective operation, predictable expenditure, data security and more. No CFO/CIO/CISO wants to get a wake-up call about unexpected spend, data loss or a security breach on their watch.



Our experience shows these are the top challenges such a change presents:


  1. ML-as-Code
    Once things get past the experimentation phase in an enterprise organization, ML research should be treated as code: controlled in a central repository and deployable automatically. This enables IT to guarantee a consistent runtime environment, which in turn yields consistent results. Job execution should be as automated as possible, leaving as little room as possible for operational mistakes.
  2. Choosing the infrastructure building blocks
    Efficiency is the key to this decision, both in shortening the total run time of each training job and in keeping the scale of the infrastructure under control. Most enterprises will look into investing in a hybrid operation: an on-premises solution with cloud-bursting capability.
  3. Infrastructure resource control
    In a world in which there is no longer a 1:1 scientist:compute ratio, everyone worries they will not get their fair share of the pie, and resource efficiency means jobs will have to wait in a queue for their turn. Such a transition requires advanced queueing, monitoring, priority and control mechanisms. In organizations where multiple business units share the infrastructure, multi-tenancy should be supported as well, to guarantee resources and keep cost centers separate.
  4. Data Management
    Working as a team on a shared infrastructure also means sharing your data across that infrastructure. Existing storage and backup practices might not serve that purpose well, and ML datasets differ in their access and performance requirements, which may call for different data protection tools and policies.
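
To make the resource control challenge more concrete, here is a minimal sketch of a fair-share job queue with per-tenant guarantees. It is an illustration only, not a description of any particular product: the `Job` and `FairShareQueue` names, the share fractions and the ranking rule (lowest used-to-guaranteed ratio first, then priority, then submission order) are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Job:
    name: str
    tenant: str        # business unit / cost center (hypothetical field)
    gpus: int          # GPUs the job asks for
    priority: int = 0  # higher value runs first among equally served tenants


class FairShareQueue:
    """Toy scheduler: the tenant furthest below its guaranteed share goes next."""

    def __init__(self, shares):
        # shares: tenant -> guaranteed fraction of the cluster (assumed policy input)
        self.shares = dict(shares)
        # GPUs dispatched so far per tenant; a real system would also release them
        self.used = {tenant: 0 for tenant in self.shares}
        self.pending = []
        self._seq = count()  # submission order, used as the final tie-breaker

    def submit(self, job):
        if job.tenant not in self.shares:
            raise ValueError(f"unknown tenant: {job.tenant}")
        self.pending.append((next(self._seq), job))

    def next_job(self):
        """Pop and return the next job to run, or None if the queue is empty."""
        if not self.pending:
            return None

        def rank(item):
            seq, job = item
            served = self.used[job.tenant] / self.shares[job.tenant]
            return (served, -job.priority, seq)

        item = min(self.pending, key=rank)
        self.pending.remove(item)
        job = item[1]
        self.used[job.tenant] += job.gpus  # charge usage to the tenant's cost center
        return job


if __name__ == "__main__":
    queue = FairShareQueue({"research": 0.75, "analytics": 0.25})
    queue.submit(Job("r1", "research", gpus=4))
    queue.submit(Job("r2", "research", gpus=4, priority=5))
    queue.submit(Job("a1", "analytics", gpus=2))
    while (job := queue.next_job()) is not None:
        print(job.name, job.tenant)
```

In this sketch the high-priority research job runs first, the analytics job is served next because its tenant has not yet used any of its share, and the remaining research job follows. The per-tenant `used` counter doubles as a simple basis for keeping cost centers separate.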


Our professional services team has repeatedly addressed these challenges and others, leading to the development of dedicated tools.


