The integration of Machine Learning (ML) into the DevOps culture, forming MLOps, is revolutionizing how teams develop, deploy, and maintain ML systems. This chapter delves into an MLOps architecture designed to streamline the lifecycle of ML projects. I focus on a comprehensive setup that includes GitHub for version control, automated workflows for data updates, a Virtual Private Server (VPS) with Dockerized containers for various services, and a GPU server for handling embeddings. This architecture supports a wide range of applications, from web scraping to model training and inference, emphasizing automation, scalability, and reproducibility.
Version Control with GitHub
GitHub stands at the core of our MLOps architecture, serving as the central repository for all codebases. It hosts the source code for data extraction scripts, ML models, frontend and backend applications, and infrastructure as code (IaC) for service deployment. GitHub's role extends beyond version control: it facilitates collaboration and code review, and it integrates with CI/CD pipelines for automated testing and deployment.
Automated Workflows
Leveraging GitHub Actions, I automate repetitive tasks such as data updates, testing, and deployment. For instance, scheduled (cron-triggered) workflows can run data scraping scripts or refresh datasets periodically. This automation ensures that our ML models are trained on the most current data, improving their accuracy and relevance. A minimal example of such a workflow is sketched below.
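As a rough illustration, a scheduled workflow along these lines could trigger a daily data refresh. The file name, cron schedule, and script path below are assumptions for the sketch, not the project's actual configuration:

```yaml
# .github/workflows/update-data.yml -- hypothetical file name and paths
name: Scheduled data update
on:
  schedule:
    - cron: "0 3 * * *"   # every day at 03:00 UTC
  workflow_dispatch:       # also allow manual runs from the UI
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python scripts/update_dataset.py  # assumed entry point
```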
Infrastructure on a VPS Server
Our MLOps architecture uses a VPS to host Dockerized containers, each serving a distinct role in the ML lifecycle:
- Web Scraping Containers: Use Scrapy for general web pages and Playwright for JavaScript-heavy sites. These containers are responsible for collecting raw data (a minimal spider sketch follows this list).
- Database Containers: MongoDB stores unstructured data collected from web scraping, while PostgreSQL, extended with pgvector, manages structured data and embeddings. This setup provides a robust storage layer that supports complex queries and efficient similarity search over vectors (see the query sketch after this list).
- Amazon S3: An external managed service rather than a container on the VPS; it serves as the backbone for backups and for storing large datasets, model artifacts, and embeddings. Its reliability and scalability make it an ideal choice for ML projects.
- Fuzzy Search Entities Container: Extracts and identifies entities using fuzzy string matching. This container improves data quality and enriches datasets by resolving variant spellings and near-duplicates to canonical entities (a matching sketch follows this list).
- Python Services Containers: Host multiple Python services for data manipulation, process automation, and ML model training with libraries like PyTorch.
- Frontend and Backend Containers: The frontend container hosts a React application, providing a user interface for interacting with the ML system. The backend container, running a Node.js app, connects to MongoDB and PostgreSQL, serving data to the frontend and handling business logic.
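To make the web-scraping container concrete, here is a minimal Scrapy spider sketch; the spider name, start URL, and CSS selectors are hypothetical placeholders, not taken from the actual project:

```python
# Minimal Scrapy spider sketch -- name, domain, and selectors are
# illustrative placeholders only.
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["https://example.com/articles"]

    def parse(self, response):
        # Extract one item per article card on the listing page.
        for card in response.css("article"):
            yield {
                "title": card.css("h2::text").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }
        # Follow pagination links, if any.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```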
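For the PostgreSQL + pgvector pairing, a similarity search over stored embeddings might look like the following sketch; the table name, column name, vector dimension, and connection settings are all assumptions:

```python
# Sketch of a pgvector similarity search, assuming a `documents` table
# with an `embedding vector(384)` column already populated.
import psycopg2

conn = psycopg2.connect("dbname=mlops user=app host=localhost")

# Stand-in for a vector produced by the embeddings service.
query_embedding = [0.1] * 384
vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, title
        FROM documents
        ORDER BY embedding <-> %s::vector  -- <-> is pgvector's L2 distance
        LIMIT 5;
        """,
        (vec_literal,),
    )
    for row in cur.fetchall():
        print(row)
```

Depending on how the embeddings were trained, pgvector's cosine-distance (`<=>`) or negative-inner-product (`<#>`) operators could be swapped in for `<->`.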
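The fuzzy search container could be built on a library such as RapidFuzz (one option among several); the canonical entity list and score threshold below are illustrative assumptions:

```python
# Sketch of fuzzy entity matching with RapidFuzz.
from rapidfuzz import fuzz, process, utils

CANONICAL_ENTITIES = ["Apple Inc.", "Microsoft Corporation", "Alphabet Inc."]

def match_entity(raw_name: str, threshold: float = 85.0) -> str | None:
    """Return the best canonical match for a scraped name, or None if no
    candidate scores above the threshold."""
    result = process.extractOne(
        raw_name,
        CANONICAL_ENTITIES,
        scorer=fuzz.token_sort_ratio,
        processor=utils.default_process,  # lowercase, strip punctuation
        score_cutoff=threshold,           # returns None below this score
    )
    return result[0] if result is not None else None

print(match_entity("apple inc"))        # -> "Apple Inc."
print(match_entity("Some Unknown Co"))  # -> None
```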
GPU Server for Embeddings
A dedicated GPU server hosts an embeddings microservice, which generates vector embeddings from text inputs. This service leverages PyTorch and is crucial for tasks requiring semantic understanding, such as recommendation systems or natural language processing (NLP) applications. The GPU server ensures high-performance computation, enabling real-time embedding generation and accelerating model training. A minimal sketch of such a service follows.
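As one way such a microservice could be written, the following sketch uses FastAPI with the sentence-transformers library (which runs on PyTorch); the model name and the `/embed` route are assumptions, not the project's actual choices:

```python
# Minimal sketch of an embeddings microservice using FastAPI and
# sentence-transformers. Model name and route are illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
# Load the model once at startup; it moves to the GPU automatically
# when CUDA is available.
model = SentenceTransformer("all-MiniLM-L6-v2")

class EmbedRequest(BaseModel):
    texts: list[str]

@app.post("/embed")
def embed(req: EmbedRequest) -> dict:
    # encode() returns a NumPy array; convert to plain lists for JSON.
    vectors = model.encode(req.texts)
    return {"embeddings": [v.tolist() for v in vectors]}
```

Served with, for example, `uvicorn app:app --host 0.0.0.0 --port 8000`, the model loads once per process, so request latency is dominated by the forward pass rather than model initialization.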