In Spark 2.0, IPYTHON and IPYTHON_OPTS are removed, and pyspark fails to launch if either option is set in the user's environment. Instead, users should set PYSPARK_DRIVER_PYTHON to use IPython and set PYSPARK_DRIVER_PYTHON_OPTS to pass options when starting the Python driver (e.g. PYSPARK_DRIVER_PYTHON_OPTS='notebook'). This supports full customization of the driver and executor Python executables.
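For example, to launch the PySpark shell inside an IPython notebook, the environment might be configured along these lines (a minimal sketch; the executable name and the path to the pyspark launcher depend on your installation):

```bash
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
./bin/pyspark
```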
```python
import numpy as np

from pyspark.mllib.stat import Statistics

mat = sc.parallelize(
    [np.array([1.0, 20.0, 300.0]), np.array([3.0, 10.0, 200.0]), np.array([2.0, 30.0, 100.0])], 10
)  # an RDD of Vectors

# Compute column summary statistics.
summary = Statistics.colStats(mat)
print(summary.mean())  # a dense vector containing the mean value for each column
print(summary.variance())  # column-wise variance
print(summary.numNonzeros())  # number of nonzeros in each column
```
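As a sanity check, the same column statistics can be reproduced locally with plain NumPy on the three sample rows above (this is an illustrative cross-check, not part of the MLlib API; ddof=1 matches the unbiased sample variance that colStats reports):

```python
import numpy as np

# Cross-check of the colStats results using the same three rows.
rows = np.array([[1.0, 20.0, 300.0],
                 [3.0, 10.0, 200.0],
                 [2.0, 30.0, 100.0]])
print(rows.mean(axis=0))               # column means: 2.0, 20.0, 200.0
print(rows.var(axis=0, ddof=1))        # column variances: 1.0, 100.0, 10000.0
print(np.count_nonzero(rows, axis=0))  # nonzeros per column: 3, 3, 3
```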