Prepare for Spark Databricks exam in 24 hours

Yes, it is possible to crack the Databricks Spark 3 (PySpark) certification exam with 24 hours of study time, provided you meet one of these two conditions.

  1. You are familiar with the pandas package
  2. You have some hands-on experience with Spark

If you are not coming from a pandas background, you will need at least basic Python knowledge to keep the learning curve manageable. If you already have some Spark hands-on experience, you are better off starting your preparation with the official practice test: identify your weak areas and tackle them first.

Per the official exam guide, there are 60 multiple-choice questions on the exam.

  • Apache Spark Architecture Concepts – 17% (10/60)
  • Apache Spark Architecture Applications – 11% (7/60)
  • Apache Spark DataFrame API Applications – 72% (43/60)

This really means that knowing the DataFrame API thoroughly can, by itself, get you a passing grade. That's not a recommended approach, but I can't stress enough the importance of knowing all the API definitions. I like this part of the exam: they really expect you to know the documentation very well, which is not something developers are generally known for.

Reading the entire documentation sounds pretty tough, but it need not be. The Apache Spark API documentation is available as a test aid, along with a scratch pad. But there is a catch: you cannot use the search option on the test-aid documentation. Don't underestimate the impact of not having search; scrolling and wading through the PDF documentation under time pressure can be a nerve-wracking experience. Personally, I had to double-check about 25 questions against the documentation, and I would have lost out on at least 10 questions if not for it.

Preparation Plan

Here is the 24-hour plan you came looking for.

Planned window and what to focus on:

  • 2 hours: Spark fundamentals and Spark architecture. What a cluster, node, driver, executor, cluster manager, and edge node are; execution modes (client, cluster, local); cluster managers (Standalone, YARN, Mesos, Kubernetes); what a DataFrame is; what Spark SQL is; Adaptive Query Execution (AQE); Dynamic Partition Pruning.
  • 2.5 hours: Parallelization, partitions, actions, transformations, lazy evaluation; DAGs, jobs, stages, slots/threads/cores; cache, persist, and storage levels; coalesce and repartition.
  • 1.5 hours: Apply the Spark DataFrame API for selecting, renaming, and manipulating columns.
  • 2 hours: Apply the Spark DataFrame API for filtering, dropping, sorting, and aggregating rows.
  • 1.5 hours: Apply the Spark DataFrame API for joining DataFrames and Spark SQL.
  • 2 hours: Apache Spark API documentation for reading, writing, and partitioning DataFrames.
  • 1.5 hours: Apache Spark API documentation for date formatting and date conversions.
  • 2 hours: Apply the Spark DataFrame API for working with UDFs and Spark SQL functions.
  • 2.5 hours: Spark 3.0 documentation. Go over it multiple times and have a clear idea of where each API definition sits in the PDF.
  • 4.5 hours: Practice tests (3 full tests are more than enough). Exam readiness: 75% on each of them. Look at the tips on taking practice tests below, especially the last one. https://www.udemy.com/course/databricks-certified-developer-for-apache-spark-30-practice-exams/
  • 2 hours: Topics from the special-attention list below.
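
If you want to try the cache/persist, storage level, and coalesce/repartition topics hands-on, here is a minimal sketch on a local PySpark session. The app name, toy data, and printed partition counts are my own illustration, not part of the exam material.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("storage-level-drill")   # illustrative name
         .getOrCreate())

df = spark.range(1_000_000)

# cache() stores the DataFrame at the default storage level
# (MEMORY_AND_DISK for DataFrames); an action is needed to materialize it.
df.cache()
df.count()

# persist() lets you choose an explicit storage level.
doubled = df.selectExpr("id", "id * 2 AS doubled")
doubled.persist(StorageLevel.DISK_ONLY)
doubled.count()

# coalesce() only reduces the partition count and avoids a full shuffle;
# repartition() can increase or decrease it and always shuffles.
print(doubled.rdd.getNumPartitions())
print(doubled.coalesce(2).rdd.getNumPartitions())
print(doubled.repartition(8).rdd.getNumPartitions())

df.unpersist()
doubled.unpersist()
```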
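Similarly, for the "UDFs and Spark SQL functions" block, here is a short sketch of registering a Python UDF for both the DataFrame API and Spark SQL. Again, the function names and data are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .master("local[*]")
         .appName("udf-drill")   # illustrative name
         .getOrCreate())

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# A Python UDF for use through the DataFrame API.
shout = F.udf(lambda s: s.upper() + "!", StringType())
df.select(shout("name").alias("shouted")).show()

# The same logic registered for use from Spark SQL.
spark.udf.register("shout_sql", lambda s: s.upper() + "!", StringType())
df.createOrReplaceTempView("people")
spark.sql("SELECT shout_sql(name) AS shouted FROM people").show()

# Prefer a built-in Spark SQL function when one exists; it avoids the
# overhead of shipping rows to Python.
df.select(F.upper("name").alias("upper_name")).show()
```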

Preparation Tips

  • Architecture-related questions can be tricky if you don't have a complete grasp of all the elements involved. Refer to the first chapter of Spark in Action (Manning) for good explanations.
  • Go over the API documentation thoroughly. Maybe write out the headings and individual API calls on a piece of paper as an outline, at least for the important topics.
  • Passively reading the material or going over a course cannot be a substitute for actual hands-on practice.
  • Use a local PySpark setup or Google Colab to try out all the DataFrame operations as you learn them. I have uploaded a few of my preparation notebooks to my GitHub account for reference. Let me know in the comments if you are looking for anything specific.
  • There were several straightforward questions on some of the topics. Pay special attention to the following APIs (a hands-on sketch covering several of them follows this list):
    • Adding a new column and renaming a column
    • Broadcast variables
    • Coalesce vs repartition
    • Cache, Persist along with their storage levels
    • Distinct vs dropDuplicates
    • union and unionByName
    • date and timestamp conversions and formatting
    • String manipulation operations such as split
    • Explode
    • Sample
    • dropna and fillna
  • Ambiguity about the type of parameter an API accepts is a common thread the exam tries to capitalize on. For example: will the Spark 3 sort API accept both string and Column datatypes? Does it need to be chained to sort on multiple columns? (The second sketch below touches on this.)
  • Spark has multiple ways to read and write files through the spark.read and DataFrame.write interfaces. Having several ways to do the same operation is certainly good question material; the second sketch below shows equivalent read and write forms.
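
To make the special-attention list concrete, here is a first, minimal sketch that exercises several of those APIs on a toy DataFrame. The column names and sample data are entirely made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .master("local[*]")
         .appName("special-attention-drill")   # illustrative name
         .getOrCreate())

df = spark.createDataFrame(
    [("alice", "2021-01-15", "a,b"),
     ("bob", None, "b,c"),
     ("alice", "2021-01-15", "a,b")],
    ["name", "signup", "tags"],
)

# Adding a new column and renaming a column.
df = (df.withColumn("name_upper", F.upper("name"))
        .withColumnRenamed("signup", "signup_date"))

# distinct() drops fully duplicate rows; dropDuplicates() can take a column subset.
df.distinct().show()
df.dropDuplicates(["name"]).show()

# dropna() and fillna().
df.dropna(subset=["signup_date"]).show()
df.fillna({"signup_date": "1970-01-01"}).show()

# Date conversion and formatting.
dated = (df.withColumn("signup_dt", F.to_date("signup_date", "yyyy-MM-dd"))
           .withColumn("signup_fmt", F.date_format("signup_dt", "dd/MM/yyyy")))

# String split and explode.
exploded = dated.withColumn("tag", F.explode(F.split("tags", ",")))

# union() pairs columns strictly by position; unionByName() pairs them by name.
reordered = exploded.select(*reversed(exploded.columns))
exploded.unionByName(reordered).show()

# sample() draws a random fraction of rows.
exploded.sample(fraction=0.5, seed=42).show()
```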
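And a second sketch, again my own illustration, of the parameter-type ambiguity and "multiple ways to do the same thing" angles mentioned in the last two bullets. The /tmp output paths are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .master("local[*]")
         .appName("param-ambiguity-drill")   # illustrative name
         .getOrCreate())

df = spark.createDataFrame([("alice", 3), ("bob", 1)], ["name", "score"])

# sort()/orderBy() accept both column-name strings and Column objects,
# and take multiple columns in a single call; no chaining is needed.
df.sort("name", "score").show()
df.sort(F.col("name"), F.col("score").desc()).show()
df.orderBy(F.desc("score")).show()

# Writing: the shorthand writer method and the generic format/save form are equivalent.
df.write.csv("/tmp/people_csv", header=True, mode="overwrite")
(df.write.format("csv")
   .option("header", "true")
   .mode("overwrite")
   .save("/tmp/people_csv"))

# Reading: the same duality exists on spark.read.
people = spark.read.csv("/tmp/people_csv", header=True, inferSchema=True)
people = (spark.read.format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/tmp/people_csv"))

# Partitioned output with an explicit save mode.
people.write.parquet("/tmp/people_parquet", mode="overwrite", partitionBy=["name"])
```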

Using practice tests effectively

  • Simulate the actual exam environment as much as possible. Take all 60 questions in one stretch to test your patience and mental stamina. Solving question after question can become so monotonous that your mind begins to wander and ends up thinking about your next tasty meal.
  • Use the API documentation during practice tests without the search option; this is a great way to simulate the exam.
  • The majority of the questions expect you to find errors in a DataFrame code block, or to fill in blanks in the right order to build a DataFrame code block for a specific scenario. Practice using the notepad during practice tests; the code blocks can get a little long and hairy occasionally.
  • This is a critical tip and can ease your exam experience considerably: use the review option effectively. Don't open the API documentation until you finish all sixty questions; when in doubt, mark the question for later review and note, on the notepad aid, the question number along with the API you are not sure about.
  • Why take this roundabout approach? Once you finish all sixty questions, you will have the complete list of APIs you want to verify. Even before opening the documentation, you can group the doubtful ones together. Since many related API definitions sit close together in the PDF, verifying them as a batch means far less scrolling and head scratching than looking each one up as it comes.

Thank you for reading

This is not a tough exam by any means. If you follow the tips, guidelines, and references above, it should be a breezy experience. All the best for your exam. If you want just one takeaway from this whole blog, it should be: “Don’t trust your memory on API definitions.”

Please do let me know if this post helped you in any way, and whether anything else can be added to make it more effective.

