June 12, 2023

Ray: Distributed Data Processing with Python

Ray is a project started by RISELab in 2017 that conducts research on real-time data processing systems and artificial intelligence. Developed as an open-source library with a focus on parallel and distributed computing, Ray has recently become a frequently used tool in data analysis, artificial intelligence, and machine learning projects by data scientists and Python programmers. This article provides an overview of Ray, its advantages, and its layers.

Ray enables us to work faster and more effectively with a large number of data sets. It also has APIs that support multiple languages and provide users with many features.

Figure-1: Ray
Figure-1: Ray

What Are the Advantages of Ray?

  • Easy to use: Ray provides a simple and user-friendly API that simplifies parallelization and accelerates the development process.
  • High performance: Ray improves code performance through parallel processing and distributed computing.
  • Scalability: Ray can dynamically manage computing resources and support large-scale computations.
  • Fault tolerance: Ray provides automatic fault tolerance and maintains system stability in the event of crashes. It backs up and restarts tasks and data.

Ray Layers

 

1. Ray Core

Ray Core is an API that provides the basic functions required for parallel and distributed computing. With Ray Core, you can run your own Python code in parallel and distributed fashion.

Figure-2: Ray Core
Figure-2: Ray Core
  • Ray Task: Ray Task is used to define parallel functions. These functions can be run on multiple computers or processor cores at the same time.
  • Ray Actor: Ray provides an actor-based programming model. The actor model is a message-based programming model. Actors work by messaging each other. Ray actors facilitate data processing and coordination in scalable and distributed systems.
  • Ray Object: Enables data to be shared between different functions and actors.
  • Placement Groups: Ray configurations that allow operations or actors to be run in a specific location.
  • Environment Dependencies: Python packages and system libraries used by Ray for environment configurations.

2. Ray Cluster

Ray Cluster is a structure that allows multiple computers to work as a single computer. With Ray Cluster, we can make computational operations faster and more scalable. In other words, it enables the functions and classes we defined with Ray Core to work in parallel and distributed. Ray is available on many platforms, including AWS, Google Cloud, Microsoft Azure, Docker, Kubernetes, Slurm, macOS, Windows, Linux, and many more. These platforms increase Ray’s scalability and can be configured according to users’ needs. Ray’s flexible structure is a great advantage for users because it can be run on different platforms.

Figure-3: Ray Clusters
Figure-3: Ray Clusters

 

  • Head Node: A node that serves as the manager of an application running on Ray. The head node coordinates other nodes and ensures the application runs.
  • Worker Node: A node where an application running on Ray performs its workloads. Worker nodes carry the main computational load of the application and execute tasks in parallel.
  • Autoscaling: Ray automatically increases the number of nodes when the workload increases. This feature allows application performance and scalability to be increased. Automatic scaling enables users to make their applications more efficient and powerful.
  • Ray Jobs: Ray is used to plan tasks to be performed. Jobs can be a call to a Python function or a class method. Ray automatically parallelizes and runs jobs on different nodes. This can increase the performance of applications and balance the computational load.

3. Ray AIR

Ray Air is a Ray API customized for machine learning and artificial intelligence applications. Ray Air is designed to provide efficiency and scalability from model training to model deployment and application.

 

Figure-3: Ray AIR
Figure-3: Ray AIR

The main concepts in Ray AIR are as follows:

  • Datasets: Collections where data used in learning algorithms are collected and prepared. This data usually comes from different sources and can be in different formats. Preparing datasets involves processes such as data cleaning, completing missing data, data transformation, and feature selection.
  • Preprocessors: Functions used to perform data preprocessing steps that organize, standardize, and determine attributes of data. Preprocessing steps may vary depending on the characteristics of the dataset and are generally aimed at improving the quality of the dataset.
  • Trainers: Functions used to train learning algorithms and create a model. These functions create a model using examples in the dataset and use a test set to measure the model’s performance.
  • Tuner: A function that automatically adjusts the hyperparameters of the learning algorithm. This function selects the hyperparameters that provide the best performance by trying many different hyperparameter combinations.
  • Checkpoints: Stages where the model is saved during the training process and can be reloaded. This function is used to resume training from where it left off in the event of an unexpected situation during the training process or if the training process is interrupted.
  • Batch Predictor: A function used to predict model results on multiple data points. This function processes all examples in the dataset at the same time and saves prediction results to an output file.
  • Deployments: Functions and configurations used to make the model ready for use and to predict real data. These functions are customized for the platforms where the model is deployed and made suitable for real-time data flow.

This article introduces Ray, a tool that leading companies such as Anyscale, Hugging Face, Intel, and NVIDIA use to improve efficiency when working with large datasets. Anyscale uses Ray to develop scalable artificial intelligence applications, Hugging Face prefers Ray for training and deploying natural language processing models, and Intel and NVIDIA benefit from Ray’s advantages in high-performance computing and deep learning. It is expected that the use of Ray will continue to increase, and we are excited to follow its developments. Thank you for reading!

References:

Related articles