Data Exploration Using Spark
Interactive Analysis

Let's now use Spark to do some order statistics on the data set. First, launch the Spark shell (for Scala) or PySpark (for Python):

/root/spark/bin/spark-shell
/root/spark/bin/pyspark

The prompt should appear within a few seconds. (Note: you may need to hit [Enter] once for it to show up.)

Warm up by creating an RDD (Resilient Distributed Dataset) named pagecounts from the input files. In the shell, the SparkContext is already created for you as the variable sc:

>>> sc
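Aside: if you were running this as a standalone script rather than in the interactive shell, you would have to construct the SparkContext yourself. Below is a minimal sketch of that, not part of the original exercise; the local master URL and the application name are illustrative placeholders.

from pyspark import SparkContext

# In the interactive shells this object is pre-built as `sc`;
# a standalone script creates it explicitly.
# "local[2]" and the app name are placeholder values for illustration.
sc = SparkContext("local[2]", "PagecountsExploration")

pagecounts = sc.textFile("/wiki/pagecounts")
print(pagecounts.take(10))

Back in the shell, sc is already available, so we can go straight to loading the data: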
>>> pagecounts = sc.textFile("/wiki/pagecounts")
13/02/01 05:30:43 INFO mapred.FileInputFormat: Total input paths to process : 74
>>> pagecounts

Let's take a peek at the data. You can use the take operation of an RDD to get the first K records. Here, K = 10 (this example is from the Scala shell, which echoes the result as an Array[String]; the output below is truncated):

scala> pagecounts.take(10)
...
res: Array[String] = Array(20090505-000000 aa.b ?71G4Bo1cAdWyg 1 14463, 20090505-000000 aa.b Special:Statistics 1 840, 20090505-000000 aa.b Special:Whatlinkshere/MediaWiki: ...
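Judging from the sample above, each record appears to be a whitespace-separated line of the form <date-time> <project code> <page title> <number of requests> <size in bytes>. As a hedged illustration of where the exploration could go next (not part of the original write-up), one could split each line into fields and, for example, count the records belonging to the English Wikipedia project; the parse helper and the "en" project code are assumptions for this sketch.

# Split each line into its fields; the 5-field layout is inferred from the
# sample output above, and titles are assumed not to contain spaces.
def parse(line):
    fields = line.split(" ")
    # (date_time, project, title, requests, size)
    return (fields[0], fields[1], fields[2], int(fields[3]), int(fields[4]))

parsed = pagecounts.map(parse)

# Keep only records whose project code is "en" and count them.
en_count = parsed.filter(lambda r: r[1] == "en").count()
print(en_count)

The map and filter operations build new RDDs lazily; nothing is actually computed until an action such as count() or take() is called.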