Automated tools to produce and test AI models

The previous post on model drift highlighted the need to rebuild models as often as the data demands. How would this be done in practice? The most obvious path is to have the data scientist rebuild the model by hand: generating the training dataset, fitting the model, testing, and comparing. Once the new model is complete, the next step would be to swap it for the old model in the production environment.

Now imagine the model needs to be refreshed weekly. Does that mean one data scientist is needed to maintain each model? After several rounds of refreshing and deploying, she would probably want to automate the process. Here's where automated tools come in handy.

Breaking down the steps

The first step is to divide the model refresh into discrete steps. A simple version could look like figure 1.


Figure 1 - Four steps in model refresh

The first step is usually generating the dataset needed for rebuilding. Putting aside the possibility of new data for the moment, automating dataset generation would likely involve running some query scripts on a compute engine. The output should be stored along with the metadata associated with the dataset.
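To illustrate, a minimal dataset generation job might run a query and write the result alongside its metadata. This is a sketch under assumptions: it uses a SQLite connection for concreteness, and the function name and metadata fields (query hash, row count, timestamp) are illustrative rather than a prescribed interface.

```python
import csv
import hashlib
import json
from datetime import datetime, timezone

def generate_dataset(conn, query, out_path):
    """Run a training-data query, write the rows to CSV, and store
    the dataset's metadata in a sidecar JSON file."""
    cur = conn.execute(query)
    cols = [d[0] for d in cur.description]
    rows = cur.fetchall()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(cols)
        writer.writerows(rows)
    # Metadata lets later steps audit exactly what was generated and when.
    meta = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "query_sha256": hashlib.sha256(query.encode()).hexdigest(),
        "row_count": len(rows),
    }
    with open(out_path + ".meta.json", "w") as f:
        json.dump(meta, f, indent=2)
    return meta
```

Storing the query hash alongside the data makes it possible to tell later whether two datasets were produced by the same logic.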

The second step is to rebuild the model. Most modern models use ensemble methods, so several algorithms are fitted and combined. All the usual steps such as feature engineering and hyper-parameter tuning occur here. As with dataset generation, the modeling output should be stored along with its associated metadata.
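The same store-with-metadata pattern applies to the build step. In this sketch a trivial mean-predictor stands in for a real ensemble model, and `rebuild_model` and its metadata fields are hypothetical names used purely for illustration.

```python
import json
import pickle
import statistics
from datetime import datetime, timezone

def rebuild_model(train_rows, out_path):
    """Fit a model on the training rows and persist it next to its
    build metadata. A mean baseline stands in for a real ensemble."""
    target = [r["y"] for r in train_rows]
    model = {"type": "mean_baseline", "prediction": statistics.mean(target)}
    with open(out_path, "wb") as f:
        pickle.dump(model, f)
    # Build metadata supports later auditing of which data produced which model.
    meta = {
        "built_at": datetime.now(timezone.utc).isoformat(),
        "n_training_rows": len(train_rows),
        "model_type": model["type"],
    }
    with open(out_path + ".meta.json", "w") as f:
        json.dump(meta, f, indent=2)
    return meta
```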

The third step is model comparison. Several model fitness criteria may be used to determine the "best" model for the application. As described in the previous post, this may be a regression between predicted and actual values, or broader measures such as accuracy and precision. The output of this step is a determination of which model to use for production scoring.
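The comparison step can be sketched as a pure function that keeps the incumbent champion unless a challenger beats it by some margin. The dictionary shape and the `min_improvement` threshold are illustrative assumptions, not a fixed interface.

```python
def select_champion(current, challengers, metric="accuracy", min_improvement=0.0):
    """Return the model record to use for production scoring.

    The incumbent is kept unless the best challenger beats it on the
    chosen metric by at least min_improvement; ties keep the incumbent."""
    if not challengers:
        return current
    best = max(challengers, key=lambda m: m["metrics"][metric])
    if best["metrics"][metric] >= current["metrics"][metric] + min_improvement:
        return best
    return current
```

Requiring a minimum improvement avoids churning the production model over noise-level metric differences.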

The final step is propagating the champion model to production. Once the champion model has been selected, it must be used to generate production scores. If the deployment is a real-time scorer, the real-time scoring code should be generated and tested in a standard CI framework. A batch process still requires a testing step, where a batch of scores is produced with both the current and the new champion model and the results compared.
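For the batch case, a minimal pre-promotion check might score the same batch with both models and flag a suspiciously large shift in the mean score. The function name and the threshold are illustrative assumptions; a real deployment would compare richer score distributions.

```python
import statistics

def batch_parity_check(current_scores, new_scores, max_mean_shift=0.05):
    """Compare scores produced by the current and new champion model on
    the same batch, and flag a large shift in the mean score."""
    if len(current_scores) != len(new_scores):
        raise ValueError("both models must score the same batch")
    shift = abs(statistics.mean(new_scores) - statistics.mean(current_scores))
    return {"mean_shift": shift, "passed": shift <= max_mean_shift}
```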

The almighty scheduler and its pitfalls

The key to automation is a fault-tolerant job scheduler that can manage each of these steps. If dataset generation can be scheduled to run on a daily or ad-hoc basis, that alleviates the manual work of dataset generation. The same logic applies to the other jobs. Most commonly, teams start with cron jobs. We recommend against this approach, for reasons detailed later in this post.

Another common approach is to use the schedulers built into individual tools. For example, Google builds various scheduling functions right into BigQuery. Other modeling tools like RStudio Connect have their own scheduling capabilities. We also recommend against this approach, for reasons detailed later as well.

Sonamine's experience has highlighted these key scheduler capabilities for modeling automation:

  • DAG-based job chaining. A dataset generation job could be followed by a model building job, which in turn could be followed by a model comparison job. Common condition-based triggers should be supported. For example, if the current model is "kept" in the model comparison step, there would be no need to propagate the newly created model. Using a tool-specific scheduler makes it difficult to manage DAG jobs that span different tools. The same reasoning applies to cron.
  • Ad hoc job initiation with custom parameters. Quite often a modeler may need to reinitiate a job for various reasons. You may need to add additional dates to the modeling period, for example. Custom job parameters like "data start date" that can be altered via a web UI are preferable to having to edit a parameters file.
  • Logging and auditing for job runs are critical for debugging. Especially in the age of explainability and transparency, job logs should be stored together with the job output, including datasets, models, and champion models.
  • External job triggering. Sometimes a model rebuild or dataset generation job may be triggered due to external conditions. For example, if the production scoring data appears to be deviating from the training dataset, an external process could trigger the job scheduler to start a model rebuild.  This is another space where cron and tool specific schedulers may fall short.
  • Job creation, monitoring, logging, initiation, stopping and cancellation should all be available via a web interface. While this may seem to be a luxury feature, having a web interface allows ops or support teams to manage modeling jobs. Such a division of labor allows modelers to focus on their specialty instead of data management.  
  • Configurable job retries and alerts on failures. Things go wrong, and an automated modeling environment should be resilient to failures. Dataset queries may time out while model build jobs run out of memory. Needless to say, each job should be configured to leave data and models in a consistent state regardless of final status.
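Production schedulers such as Apache Airflow express this kind of chaining as a DAG with branch operators. Stripped of any particular tool, the condition-based short-circuiting described above can be sketched in plain Python; the sentinel value and job signatures are illustrative assumptions.

```python
def run_pipeline(jobs):
    """Run (name, fn) jobs in order, passing accumulated results to each.

    A job returning the sentinel "keep_current" stops the chain,
    mirroring the branch where the comparison step keeps the old model
    and the propagation job never runs."""
    results = {}
    for name, fn in jobs:
        results[name] = fn(results)
        if results[name] == "keep_current":
            break
    return results
```

A real scheduler adds what this sketch omits: persistence, retries, logging, and jobs that span multiple tools.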

Automation separates the winners from the rest, according to a McKinsey survey of AI in 2020. Automation is a core foundational principle behind the Sonamine predictive scoring service, allowing our customers to use AI in CRM without a large initial investment. What's your automation strategy? Contact us to compare notes!