Wednesday, January 27, 2016

Introduction to Apache Spark with Examples and Use Cases



I first heard of Spark in late 2013 when I became interested in Scala, the language in which Spark is written. Some time later, I did a fun data science project trying to predict survival on the Titanic. This turned out to be a great way to get further introduced to Spark concepts and programming. I highly recommend it for any aspiring Spark developers looking for a place to get started.


Today, Spark is being adopted by major players like Amazon, eBay, and Yahoo! Many organizations run Spark on clusters with thousands of nodes. According to the Spark FAQ, the largest known cluster has over 8000 nodes. Indeed, Spark is a technology well worth taking note of and learning about.

This article provides an introduction to Spark including use cases and examples. It contains information from the Apache Spark website as well as the book Learning Spark - Lightning-Fast Big Data Analysis.

What is Apache Spark? An Introduction


Spark is an Apache project advertised as “lightning fast cluster computing”. It has a thriving open-source community and is the most active Apache project at the moment.


Spark provides a faster and more general data processing platform than Hadoop MapReduce. It lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. In 2014, Spark overtook Hadoop by completing the 100 TB Daytona GraySort contest 3x faster on one tenth the number of machines, and it also became the fastest open source engine for sorting a petabyte.


Spark also makes it possible to write code more quickly, as you have over 80 high-level operators at your disposal. To demonstrate this, let’s have a look at the “Hello World!” of Big Data: the Word Count example. Written in Java for MapReduce, it has around 50 lines of code, whereas in Spark (and Scala) you can do it as simply as this:

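// sparkContext is an existing SparkContext.
sparkContext.textFile("hdfs://...")
            .flatMap(line => line.split(" "))  // split each line into words
            .map(word => (word, 1))            // pair each word with a count of 1
            .reduceByKey(_ + _)                // sum the counts for each word
            .saveAsTextFile("hdfs://...")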
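
To see what each step produces without a cluster, the same pipeline can be sketched on a local Scala collection. This is only an illustration built on the Scala standard library, not the Spark API: LocalWordCount and wordCount are hypothetical names, and groupBy followed by a sum stands in for Spark's reduceByKey.

```scala
object LocalWordCount {
  // Counts word occurrences in an in-memory collection of lines.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))      // split each line into words
      .map(word => (word, 1))                // pair each word with a count of 1
      .groupBy { case (word, _) => word }    // local stand-in for reduceByKey
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
}
```

For the two lines "to be or" and "not to be", wordCount returns a map with to -> 2, be -> 2, or -> 1, and not -> 1.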

Another important aspect when learning how to use Apache Spark is the interactive shell (REPL), which it provides out of the box. Using the REPL, you can test the outcome of each line of code without first needing to write and execute the entire job. The path to working code is thus much shorter, and ad-hoc data analysis is made possible.


Additional key features of Spark include:
  1. Currently provides APIs in Scala, Java, and Python, with support for other languages (such as R) on the way 
  2. Integrates well with the Hadoop ecosystem and data sources (HDFS, Amazon S3, Hive, HBase, Cassandra, etc.) 
  3. Can run on clusters managed by Hadoop YARN or Apache Mesos, and can also run standalone 


The Spark core is complemented by a set of powerful, higher-level libraries which can be seamlessly used in the same application. These libraries currently include SparkSQL, Spark Streaming, MLlib (for machine learning), and GraphX, each of which is further detailed in this article. Additional Spark libraries and extensions are currently under development as well.

[Figure: Spark libraries and extensions]

Spark Core



Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:
  • memory management and fault recovery
  • scheduling, distributing and monitoring jobs on a cluster
  • interacting with storage systems


Spark introduces the concept of an RDD (Resilient Distributed Dataset), an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel. An RDD can contain any type of object and is created by loading an external dataset or distributing a collection from the driver program.



RDDs support two types of operations:
  • Transformations are operations (such as map, filter, join, union, and so on) that are performed on an RDD and which yield a new RDD containing the result. 
  • Actions are operations (such as reduce, count, first, and so on) that return a value after running a computation on an RDD. 


Transformations in Spark are “lazy”, meaning that they do not compute their results right away. Instead, they just “remember” the operation to be performed and the dataset (e.g., file) on which it is to be performed. The transformations are only actually computed when an action is called, and the result is returned to the driver program. This design enables Spark to run more efficiently. For example, if a big file were transformed in various ways and then passed to the first action, Spark would only process and return the result for the first line, rather than do the work for the entire file.
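
The effect of lazy evaluation can be sketched with a plain Scala Iterator, whose map is also lazy. This is an analogy built on the Scala standard library, not Spark code: LazyDemo and firstUpper are hypothetical names, and pulling a single element with next() plays the role that the first action plays on an RDD.

```scala
object LazyDemo {
  // Returns the first transformed line together with the number of lines
  // that were actually transformed to produce it.
  def firstUpper(lines: Seq[String]): (String, Int) = {
    var processed = 0
    // Like a lazy transformation: calling map on an Iterator runs nothing yet.
    val transformed = lines.iterator.map { line =>
      processed += 1
      line.toUpperCase
    }
    // Like the first action: pulling one element triggers just enough work.
    val first = transformed.next()
    (first, processed)
  }

  def main(args: Array[String]): Unit =
    println(firstUpper(Seq("first line", "second line", "third line")))
}
```

With three input lines, firstUpper returns the upper-cased first line and a processed count of 1: the other two lines are never touched.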


By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist or cache method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
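
The difference between recomputing and persisting can be illustrated with a small local sketch. It is an analogy in plain Scala rather than Spark code: CacheDemo, countCalls, and expensive are hypothetical names, a freshly built Iterator stands in for an unpersisted RDD, and materializing into a Vector stands in for persist or cache.

```scala
object CacheDemo {
  // Runs two "actions" over the data, first without and then with caching,
  // and returns how many times the expensive function was called in each case.
  def countCalls(data: Seq[Int]): (Int, Int) = {
    var calls = 0
    def expensive(x: Int): Int = { calls += 1; x * x }

    // Rebuilt on every use, so each action repeats the whole computation.
    def transformed: Iterator[Int] = data.iterator.map(expensive)
    transformed.sum
    transformed.sum
    val withoutCache = calls

    calls = 0
    // Computed once and kept in memory, so later actions reuse the results.
    val cached = transformed.toVector
    cached.sum
    cached.sum
    val withCache = calls

    (withoutCache, withCache)
  }
}
```

For a three-element input, the uncached variant calls expensive six times (three per action), while the cached variant calls it only three times in total.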

SparkSQL


SparkSQL is a Spark component that supports querying data either via SQL or via the Hive Query Language. It originated as the Apache Hive port to run on top of Spark (in place of MapReduce) and is now integrated with the Spark stack. In addition to providing support for various data sources, it makes it possible to weave SQL queries with code transformations, which results in a very powerful tool. Below is an example of a Hive-compatible query:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

Spark Streaming



Spark Streaming supports real-time processing of streaming data, such as production web server log files (e.g., from Apache Flume and HDFS/S3), social media like Twitter, and various messaging queues like Kafka. Under the hood, Spark Streaming receives the input data streams and divides the data into batches. The Spark engine then processes these batches and generates the final stream of results, also in batches, as depicted below.

[Figure: Spark Streaming]


The Spark Streaming API closely matches that of the Spark Core, making it easy for programmers to work in the worlds of both batch and streaming data.
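
The batching model described in this section can be sketched in plain Scala. This is a simplified analogy, not the Spark Streaming API: MicroBatchDemo and errorsPerBatch are hypothetical names, the log lines are made up, and batches are cut here by element count, whereas Spark Streaming cuts them by time interval.

```scala
object MicroBatchDemo {
  // Divides an incoming stream of log lines into fixed-size batches and
  // processes each batch independently, yielding a stream of per-batch results.
  def errorsPerBatch(events: Iterator[String], batchSize: Int): Iterator[Int] =
    events
      .grouped(batchSize)                              // divide the stream into batches
      .map(batch => batch.count(_.contains("ERROR")))  // ordinary batch-style computation
}
```

Because each batch is handled with the same kind of code you would write for a static collection, the step from batch to streaming logic stays small.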

MLlib



MLlib is a machine learning library that provides various algorithms designed to scale out on a cluster for classification, regression, clustering, collaborative filtering, and so on (check out Toptal’s article on machine learning for more information on that topic). Some of these algorithms also work with streaming data, such as linear regression using ordinary least squares or k-means clustering (with more on the way). Apache Mahout (a machine learning library for Hadoop) has already turned away from MapReduce and joined forces with Spark MLlib.


Post by Radek Ostrowski

NB: This article was first featured on the Toptal Engineering Blog.
