Data Platform for ML Legal Review Pipeline Integration and Automation
Introduction
For a Product Manager, streamlining workflows and improving efficiency are core responsibilities. In this case study, I describe how, as an Engineering Product Manager at Apple, I tackled the challenge of automating the legal review process for third-party datasets used in machine learning (ML) research. This initiative not only alleviated the workload on the legal team but also improved transparency and efficiency for research teams.
Situation
Apple’s small product legal team dedicated to ML data was inundated with requests for third-party dataset reviews. These reviews, critical for ensuring that only vetted, well-understood data entered ML research, were conducted manually and were often avoided by teams because of the delays they caused. The datasets in question were often purchased, highly regulated, and came with strict terms of use.
Key challenges included:
Lack of visibility into the timeline of dataset reviews.
Inability to determine if a dataset was previously reviewed and tiered.
Lack of clarity on reasons for dataset rejections.
Repetitive reviews of the same datasets by the legal team without a standardized process for minor changes.
Integration issues with the data platform for ML, which aimed to be Apple’s central repository for research and off-device model training data.
Task
As the Engineering Product Manager, my task was to streamline and automate the dataset review process to:
Alleviate the legal team's workload.
Improve transparency and efficiency for research teams.
Integrate the process within the existing data platform, solidifying its position as the go-to repository for research and model training data at Apple.
Action
To address these issues, I initiated the following actions:
Requirements Gathering and Stakeholder Engagement:
Conducted meetings with the product legal team to understand their workflow, pain points, and requirements for dataset reviews.
Gathered feedback from research teams to identify their needs for transparency and efficiency.
Collaborated with engineering teams to assess the technical feasibility of automating the review process within the data platform.
Designing an Automated Review System:
Automated Workflow: Developed an automated workflow that initiated legal reviews automatically upon dataset upload, with notifications sent to the legal team.
Review Tracking: Implemented a tracking system providing status updates on dataset reviews, including estimated completion times based on dataset complexity.
Version Control and Tiering: Integrated version control to track changes to datasets and a tiering system to categorize datasets based on review status (e.g., pending, approved, changes required).
Reuse of Reviews: Enabled a mechanism to flag and reuse reviews for previously reviewed datasets, reducing duplicate efforts.
Change Detection: Established guidelines for when datasets should be re-reviewed. This was not automated but provided clear standards.
Decision Transparency: Created a standardized list of reasons for dataset declines, so teams could understand exactly why a dataset was rejected.
Educational Guides: Developed guides on appropriate data use and opportunities for teams to self-educate on Apple’s data tiering schema and guidelines.
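The design above can be illustrated with a minimal sketch. This is not Apple's actual implementation; the class names, tier numbers, and decline reasons are all hypothetical. It shows the core mechanics described: an upload triggers a pending review and notifies the legal queue, identical content reuses a prior review via a content hash (eliminating duplicate reviews), approvals assign a tier, and declines must cite a reason from a fixed vocabulary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import hashlib

class ReviewStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    CHANGES_REQUIRED = "changes required"
    DECLINED = "declined"

# Illustrative fixed vocabulary of decline reasons (not Apple's actual list).
DECLINE_REASONS = {
    "license_restriction",
    "unverified_provenance",
    "terms_of_use_conflict",
}

@dataclass
class ReviewRecord:
    dataset_name: str
    content_hash: str
    status: ReviewStatus = ReviewStatus.PENDING
    tier: Optional[int] = None          # assigned by legal on approval
    decline_reason: Optional[str] = None

class ReviewPipeline:
    """Registers uploads, reuses prior reviews for identical content,
    and keeps a queryable status for each dataset."""

    def __init__(self) -> None:
        self._by_hash: dict = {}       # content hash -> ReviewRecord
        self.legal_queue: list = []    # stands in for notifying the legal team

    def submit(self, dataset_name: str, content: bytes) -> ReviewRecord:
        digest = hashlib.sha256(content).hexdigest()
        prior = self._by_hash.get(digest)
        if prior is not None:
            return prior               # reuse the earlier review: no duplicate legal work
        record = ReviewRecord(dataset_name, digest)
        self._by_hash[digest] = record
        self.legal_queue.append(record)  # "notify" legal by queueing the new request
        return record

    def approve(self, record: ReviewRecord, tier: int) -> None:
        record.status = ReviewStatus.APPROVED
        record.tier = tier

    def decline(self, record: ReviewRecord, reason: str) -> None:
        if reason not in DECLINE_REASONS:
            raise ValueError(f"unknown decline reason: {reason}")
        record.status = ReviewStatus.DECLINED
        record.decline_reason = reason
```

In use, a second upload of byte-identical content returns the original record, carrying over its approval and tier; the change-detection guidelines from the process above would govern when a modified dataset should instead be submitted as new content.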
Implementation and Testing:
Conducted pilot testing with a subset of datasets to ensure the system worked as intended and gathered feedback for improvements.
Trained the legal and research teams on using the new system, providing documentation and support.
Held "lunch and learn" sessions for teams to ask questions directly to the legal team.
Discussed potential algorithms for detecting sensitive data and explored their inclusion in our roadmap.
Launch and Monitoring:
Rolled out the automated review system to all teams using the data platform.
Set up monitoring tools to track the performance and usage of the system, ensuring it met stakeholder needs.
Established a feedback loop to continuously gather input and make iterative improvements.
Result
The integration and automation of the legal review pipeline yielded significant improvements:
Efficiency: Reduced the workload on the legal team by automating the initial review process and eliminating redundant reviews of previously assessed datasets.
Transparency: Provided research teams with real-time updates on the status of their dataset reviews and estimated completion times, reducing uncertainty and delays.
Standardization: Established a clear and consistent process for dataset reviews, including criteria for when minor changes required new reviews.
Adoption: Increased the adoption of the data platform as the central repository for research and off-device model training data, aligning with the platform's strategic goals.
Satisfaction: Improved satisfaction among both the legal and research teams, as the streamlined process allowed them to focus on higher-value tasks and reduced administrative burdens.
Key Takeaways
Stakeholder Engagement is Crucial: Engage with all relevant stakeholders early to understand their pain points and requirements. This ensures the solution addresses the real issues faced by users.
Automate to Alleviate Workload: Automating repetitive tasks can significantly reduce the workload on teams, allowing them to focus on higher-value activities.
Transparency and Communication: Providing real-time updates and clear communication on the status of tasks can greatly improve user satisfaction and reduce uncertainty.
Standardization: Establishing clear and consistent processes ensures that all stakeholders have a shared understanding of workflows, reducing confusion and inefficiencies.
Continuous Improvement: Implementing a feedback loop for continuous improvement allows for iterative enhancements, ensuring the solution evolves to meet changing needs.
By managing the integration and automation of the dataset review process, I enabled Apple’s legal team to handle the growing volume of review requests efficiently, while giving research teams the transparency and support they needed to continue their innovative work without unnecessary delays. This project not only streamlined operations but also reinforced the data platform's role as a vital resource for Apple's ML research endeavors.