Scaling Appends, Updates, Deletes, and GDPR-Compliant Purging


Introduction

Under the GDPR, users have the "right to be forgotten," which lets them request deletion of their personal data when it is no longer necessary for its original purpose, when they withdraw consent, or when it has been unlawfully processed. They also have the right to have inaccurate or incomplete personal information corrected so that their data stays accurate and up to date. Together, these rights give individuals control over their personal information. This case study describes how we extended our Machine Learning data platform to handle appends, updates, deletions, and compliant purging efficiently at scale while improving the user experience.
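
For concreteness, the sketch below shows the shape of a right-to-be-forgotten purge against a sharded store: a hard delete of every row tied to a user, plus an audit record of what was removed. All names here (`purge_user_rows`, the in-memory `shards` mapping, the audit-entry fields) are hypothetical illustrations, not the platform's actual API.

```python
from datetime import datetime, timezone

def purge_user_rows(shards: dict, user_id: str, audit_log: list) -> int:
    """Hard-delete every row belonging to a user across all shards and
    record each purge for compliance auditing. `shards` is assumed to
    map shard_id -> {row_id: row_dict}; this is an illustration, not
    the platform's actual storage API."""
    purged = 0
    for shard_id, rows in shards.items():
        doomed = [rid for rid, row in rows.items()
                  if row.get("user_id") == user_id]
        for rid in doomed:
            del rows[rid]  # physical delete, not a soft-delete flag
        if doomed:
            purged += len(doomed)
            audit_log.append({
                "shard": shard_id,
                "user_id": user_id,
                "rows_purged": len(doomed),
                "purged_at": datetime.now(timezone.utc).isoformat(),
            })
    return purged
```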

Situation

Our Machine Learning data platform was experiencing performance issues during peak usage hours. The platform stored large datasets with annotations as documents and supported versioning only by requiring users to download, modify, and re-upload entire datasets. This approach degraded performance and increased latency, particularly as annotation updates and row deletions grew. Support for performant updates and deletions was being pilot-tested by a few key partners whose datasets pushed the platform's capacity.

Task

I was tasked with improving the platform’s performance and scalability to handle the growing volume of data and operations. Our goal was to efficiently manage high volumes of read and write operations during peak hours while maintaining dataset integrity and accessibility. Budget constraints on compute power and the need for a sustainable chargeback strategy added to the challenge.

Action

Analyze User Interaction Patterns:

  1. Identify Issues: High volumes of concurrent write operations were causing timeouts and slower response times.

  2. Gather Feedback: Users preferred storage options like S3 due to difficulties in handling large datasets within the platform, especially when updating annotations, appending new data, or deleting rows.

Formulate Hypotheses:

  1. Horizontal Sharding: Distribute rows of data across multiple shards to manage high volumes of read and write operations. This allows operations to run in parallel, improving performance and reducing latency (first sketch after this list).

  2. Vertical Sharding: Split the dataset into different tables or shards, separating metadata (e.g., annotations, versioning information) from the main data (e.g., raw data rows). This improves performance for update-heavy operations without touching main data storage (second sketch below).

  3. Partitioned S3 Data: Match or exceed the performance teams were already getting from their own partitioned S3 layouts. Partitions scale independently, which suits the access patterns of our data (third sketch below).
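
First, horizontal sharding. The sketch below routes each row to a shard by hashing a stable row key, so writes to different rows land on different shards and can proceed in parallel. The shard count and function name are illustrative assumptions, not the platform's real configuration.

```python
import hashlib

NUM_SHARDS = 8  # illustrative; the real count would be tuned to load

def shard_for(row_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a row key to a shard with a stable hash so the same row
    always routes to the same shard and keys spread evenly."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Routing is deterministic, so a read finds the shard a write used.
assert shard_for("row-123") == shard_for("row-123")
```

A plain modulo scheme like this forces most rows to move whenever the shard count changes; consistent hashing is the usual refinement when resharding needs to be cheap.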
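
Second, vertical sharding. Conceptually, the same row ID keys two stores: a write-once store for raw payloads and a frequently updated store for annotations and version information. A toy in-memory version follows, with all class and function names assumed for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class DataRow:
    """Main store: raw payload, written once at ingestion."""
    row_id: str
    payload: bytes

@dataclass
class RowMeta:
    """Metadata store: annotations and versioning, updated often."""
    row_id: str
    annotations: dict = field(default_factory=dict)
    version: int = 0

def update_annotation(meta_store: dict, row_id: str,
                      key: str, value: str) -> None:
    """Apply an annotation edit to the metadata store only; the raw
    payload in the main store is never rewritten, so update-heavy
    traffic does not contend with bulk reads of the main data."""
    meta = meta_store.setdefault(row_id, RowMeta(row_id))
    meta.annotations[key] = value
    meta.version += 1
```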
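
Finally, the S3 baseline. "Partitioned" here means Hive-style key prefixes that encode partition columns in the object path, so a reader lists and fetches only the partitions it needs. The bucket and partition-column names below are placeholders, and the sketch assumes `boto3` is available:

```python
import boto3  # assumed available; any S3 client works the same way

BUCKET = "ml-datasets-example"  # placeholder bucket name

def partition_prefix(dataset: str, ingest_date: str) -> str:
    """Build a Hive-style partition prefix, e.g.
    datasets/dataset=images/ingest_date=2024-01-15/"""
    return f"datasets/dataset={dataset}/ingest_date={ingest_date}/"

def list_partition(dataset: str, ingest_date: str) -> list:
    """List only the objects under one partition prefix; unrelated
    partitions are never scanned, which is what keeps reads fast."""
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(
        Bucket=BUCKET, Prefix=partition_prefix(dataset, ingest_date)
    )
    return [obj["Key"] for obj in resp.get("Contents", [])]
```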

Validate Hypotheses:

  1. Pilot Testing: Collaborated with 12 teams whose datasets pushed the platform's limits. Engineers from both sides observed ingestion attempts, monitored metrics, and met weekly to fine-tune implementations.

  2. Collaborative Communication: Partner teams shared their experiences during a monthly product open house. I created and shared a case study highlighting our collaborative efforts and impact.

Result

  1. Improved Performance: The platform maintained low latency and high performance for both read and write operations, even during peak usage times.

  2. Scalability: The system easily scaled out by adding more shards as data volume and user demand grew, ensuring continued reliability and efficiency.

  3. Operational Efficiency: Clear segregation of data types and optimized query handling reduced contention and improved overall data management.

  4. User Growth: New dataset submissions increased after partner teams shared their success stories and our collaborative results.

Key Takeaways

  1. Horizontal and Vertical Sharding: Effective sharding strategies can significantly improve platform performance and scalability.

  2. User-Centric Feedback: Continuous user feedback helps identify pain points and preferences, driving meaningful improvements.

  3. Collaborative Testing: Partnering with key stakeholders for pilot testing ensures practical insights and effective solutions.

  4. Sustainable Scalability: Efficient data management practices enable sustainable platform growth and reliability.

Implementing these strategies has been transformative for our data platform, ensuring it remains performant and scalable while meeting advanced data governance needs. This case study illustrates how thoughtful data management and collaborative efforts can drive innovation and maintain platform competitiveness.
