Basic Serving of Models in PySpark
This topic describes how PySpark models are handled inside the Docker containers deployed on Seldon Core.
Preprocessing
All incoming data is first converted into a Pandas DataFrame with column names attached, then converted into a Spark DataFrame, and finally processed by the model. If the user passes in a list of names, those names are applied to the columns of the DataFrame. For PySpark models specifically, the list of names is required.
df = pd.DataFrame(X, columns=names)
data = spark_session.createDataFrame(df)
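The naming step can be sketched on its own, without a Spark session. This is a minimal illustration, not the wrapper's actual code: the helper `to_named_frame`, its validation rules, and the sample column names are all hypothetical. The early failure on missing names is an assumption, motivated by the fact that Spark ML stages look up their input columns by name.

```python
import numpy as np
import pandas as pd


def to_named_frame(X, names):
    """Attach column names to the raw payload (hypothetical helper)."""
    X = np.asarray(X)
    if not names:
        # Assumption: fail early, because Spark ML stages reference
        # their input columns by name.
        raise ValueError("PySpark models require a list of column names")
    if X.ndim != 2 or X.shape[1] != len(names):
        raise ValueError("expected a 2D payload with one name per column")
    return pd.DataFrame(X, columns=list(names))


# Illustrative payload and names, not taken from the original text.
df = to_named_frame([[5.1, 3.5], [4.9, 3.0]], ["sepal_length", "sepal_width"])

# With an active Spark session, the conversion from the text would follow:
# data = spark_session.createDataFrame(df)
```

Validating the names before the Spark conversion keeps the error close to the request rather than surfacing it later inside the model pipeline.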

Whether the model returns the final predicted labels or the probability of each class is determined by the meta passed in with the request. Specifically, the meta dictionary is checked for the method key:
If the value of method is predict, the final predicted labels are returned.
If the value of method is predict_proba, the probability of each class is returned.
The default method is always predict.
method = self.default_method  # The default is "predict"
if meta and isinstance(meta.get("method", 0), str):
    method = meta["method"]
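The same lookup can be exercised in isolation to see how different meta values behave. The following is a small sketch that wraps the logic above in a standalone function; the name `resolve_method` is hypothetical and does not come from the wrapper itself.

```python
def resolve_method(meta, default_method="predict"):
    """Return the method named in meta, or the default (hypothetical helper)."""
    method = default_method
    # Only a string value under the "method" key overrides the default.
    if meta and isinstance(meta.get("method", 0), str):
        method = meta["method"]
    return method


print(resolve_method(None))                         # predict
print(resolve_method({}))                           # predict
print(resolve_method({"method": "predict_proba"}))  # predict_proba
print(resolve_method({"method": 1}))                # predict
```

Note that this check accepts any string. Given the branching shown in the Predictions section, a string other than predict_proba ends up returning the predicted labels.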

Predictions
Depending on the method chosen by the user, either the prediction column or the probability column of the transformed DataFrame is collected. In either case, the final result is reshaped and returned as a 2D NumPy array.
preds = self.model.transform(data)
if method == "predict_proba":
    res = [x.probability for x in preds.select("probability").collect()]
else:
    res = [x.prediction for x in preds.select("prediction").collect()]
res_np = np.asarray(res)
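The snippet above stops at `np.asarray(res)` and does not show the reshape into two dimensions. One plausible way to do it is sketched below. The helper `to_2d` is hypothetical, and the choice of a single-column layout for labels is an assumption, since the text does not state the target shape. Plain Python lists stand in for the collected Spark values.

```python
import numpy as np


def to_2d(res):
    """Coerce collected results to a 2D array (hypothetical helper)."""
    arr = np.asarray(res)
    if arr.ndim == 1:
        # Assumption: one row per sample, with the label in a single column.
        arr = arr.reshape(-1, 1)
    return arr


# Stand-ins for values collected from the "prediction" column.
labels = to_2d([0.0, 1.0, 1.0])
# Stand-ins for values collected from the "probability" column.
probs = to_2d([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])

print(labels.shape)  # (3, 1)
print(probs.shape)   # (3, 2)
```

With this layout, both methods return one row per input sample, so callers can handle the two result types uniformly.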
