Introduction

Warning

This book is under construction.

This Government Analysis Function guidance is produced by the Data Access Platform capability team in the Methods and Quality directorate of the Office for National Statistics.

This book has been written to fill the gap between Spark textbooks, official documentation and online examples as the go-to guide for analysts, data scientists and data engineers using PySpark and sparklyr.

The material is a living document, so it will be updated over time. We are keen to hear your thoughts and feedback on the book; you can get in touch via GitHub Issues. This book is also open to contributions from anyone across UK Government and beyond.

ONS staff should also look at our GitLab pages for organisation and platform-specific information.

How to use the book

The materials in this book assume the reader is familiar with Python or R and is looking to process big data using PySpark or sparklyr. There are Python and R code cells throughout the book to help explain the topics.

The Spark Overview section introduces some basic concepts like distributed computing, when to use Spark, creating Spark sessions and simple interaction with data storage. The Introduction to PySpark and Introduction to sparklyr sections then cover the basics.

The next part of your Spark journey is to learn more functions so you can perform more interesting analysis. These are covered in the Spark Functions section. As you build more experience you will begin to think more like a Spark developer, choosing strategies that suit how Spark processes data.

The most advanced section of the book is Understanding and Optimising Spark. It begins with a short summary page before diving into more complex optimisation topics.

The final two sections, Testing and Debugging and Ancillary Topics, are included for completeness.

Further resources

Here are some examples of other free resources we recommend:

Accessibility statement

This accessibility statement applies to the Spark at the ONS book. Please note that it does not cover third-party content that is referenced from this guidance.

The website is managed by the Quality and Improvement division of the Office for National Statistics. We would like this guidance to be accessible for as many people as possible. This means that you should be able to:

  • zoom in up to 300% without the text spilling off the screen

  • navigate most of the website using just a keyboard

  • navigate most of the website using speech recognition software

  • listen to most of the website using a screen reader (including the most recent versions of JAWS, NVDA and VoiceOver)

The book may be viewed in your browser or in fullscreen mode by clicking on the fullscreen icon in the top left menu (beside the Contents navigation).

For keyboard navigation, the Up Arrow and Down Arrow keys can be used to scroll up and down on the current page. The Left Arrow and Right Arrow keys can be used to move forwards and backwards through the pages of the book. Tabbed content (including code examples) can be focused using the Tab key. The Left Arrow and Right Arrow keys are then used to focus the required tab option, and Enter can be used to select that option and display the associated content.

To zoom in or out, hold Ctrl and use the mouse scroll wheel, or hold Ctrl and press + to zoom in and - to zoom out. To enlarge diagrams, click on them and apply the same zoom commands. Return to the main content via the back button on your browser.

To hide the main menu (left side menu) click the arrow in the top left of the screen. To reopen it, click the 3 bars.

This book can also be downloaded in either Markdown or PDF format. To do this, use the down arrow located in the top left menu (beside the Contents navigation). From here, select the format you wish to download the book in.

Feedback and reporting accessibility problems

We are always looking to improve the accessibility of our guidance. If you find any problems not listed on this page or think that we are not meeting accessibility requirements, please contact us using GitHub Issues. Please also get in touch if you are unable to access any part of this guidance, or require the content in a different format.

We will consider your request and will get back to you as soon as we can.

Enforcement procedure

The Equality and Human Rights Commission (EHRC) is responsible for enforcing the Public Sector Bodies (Websites and Mobile Applications) (No. 2) Accessibility Regulations 2018 (the ‘accessibility regulations’). If you’re not happy with how we respond to your complaint, you should contact the Equality Advisory and Support Service (EASS).