Nice!
I was able to run your script:
+--------+------------------+
|database|COUNT_OF_NEW_PAGES|
+--------+------------------+
|  enwiki|              6696|
+--------+------------------+
A couple of minor things:
- `spark.run` should be changed to `spark.sql` (`run` is not a `SparkSession` method).
- `pandas_df` is actually a Spark DataFrame. Conceptually it's the same kind of object as a Pandas DataFrame, but it has different semantics. You can convert it to Pandas using the `toPandas` method (`pandas_df = pandas_df.toPandas()`), but our best practices discourage this kind of collect-on-the-driver pattern.
- `pandas_df` does not have a `rename` method. To rename columns of an existing Spark DataFrame, use `pandas_df = pandas_df.withColumnRenamed('domain', 'DOMAIN').withColumnRenamed('count(1)', 'COUNT_OF_NEW_PAGES')` instead.
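Putting the points above together, the corrected flow could look roughly like the sketch below. It is only an illustration: the `new_pages` view, its sample rows, the query text, and the `rename_columns` helper are hypothetical placeholders, not taken from your script.

```python
from functools import reduce


def rename_columns(df, mapping):
    """Rename several columns of a Spark DataFrame.

    Hypothetical helper: it only chains DataFrame.withColumnRenamed,
    once per (old, new) pair in `mapping`.
    """
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        mapping.items(),
        df,
    )


def main():
    # Imported here so the helper above stays importable without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("new-pages-example").getOrCreate()

    # Placeholder data standing in for the real source table (assumption).
    spark.createDataFrame(
        [("enwiki",), ("enwiki",), ("dewiki",)], ["domain"]
    ).createOrReplaceTempView("new_pages")

    # spark.sql (not spark.run) executes the query and returns a Spark DataFrame.
    df = spark.sql("SELECT domain, count(1) FROM new_pages GROUP BY domain")

    # Rename on the Spark side; no toPandas(), so nothing is collected on the driver.
    df = rename_columns(
        df, {"domain": "DOMAIN", "count(1)": "COUNT_OF_NEW_PAGES"}
    )
    df.show()

    spark.stop()
```

Calling `main()` runs the example end to end. Keeping the result as a Spark DataFrame means the rename stays distributed; `toPandas()` would only be needed if a later step really requires Pandas on the driver.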