{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Pydoop: HDFS to pandas\n", "\n", "The usual way to interact with data stored in the Hadoop Distributed File System (HDFS) is to use Spark.\n", "\n", "Some datasets are small enough that they can be easily handled with pandas. One method is to start a Spark session, read in the data as PySpark DataFrame with [`spark.read.csv()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html), then convert to a pandas DataFrame with [`.toPandas()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toPandas.html).\n", "\n", "The [Pydoop](https://crs4.github.io/pydoop/)\n", " package allows you to bypass Spark and read in the data directly to a pandas DataFrame. Remember that your data will have to be able to fit into the driver memory, so do not use this for big datasets. Guidance on when to use Spark and when to consider alternatives is in the [When To Use Spark](../spark-overview/when-to-use-spark) article.\n", "\n", "### Pydoop Setup\n", "\n", "Pydoop can be installed in the same way as any other package, e.g. with `pip install pydoop`. If using CDSW you need to use `pip3 install` to ensure that Python 3 is being used.\n", "\n", "Then import `hdfs` from Pydoop, as well as pandas; note that PySpark is not being imported:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pydoop.hdfs as hdfs\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading files\n", "\n", "This example will use a CSV stored in the ONS training area on HDFS. You can read in other file types that are supported by pandas, e.g. [json](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html) or [Excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html).\n", "\n", "Reading in the data is then a two stage process; first open the file with `hdfs.open()`, then read in as a pandas DataFrame with [`pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). If a `with` statement is used you do not need to explicitly close the file with `f.close()`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "file_path = \"/training/animal_rescue.csv\"\n", "with hdfs.open(file_path, \"r\") as f:\n", " pandas_df = pd.read_csv(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas_df` is now a pandas DataFrame loaded in the driver memory and all the usual methods will work.\n", "\n", "e.g. we can preview the first five rows and columns of the DataFrame with [`.iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) and [`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | IncidentNumber | \n", "DateTimeOfCall | \n", "CalYear | \n", "FinYear | \n", "TypeOfIncident | \n", "
---|---|---|---|---|---|
0 | \n", "139091 | \n", "01/01/2009 03:01 | \n", "2009 | \n", "2008/09 | \n", "Special Service | \n", "
1 | \n", "275091 | \n", "01/01/2009 08:51 | \n", "2009 | \n", "2008/09 | \n", "Special Service | \n", "
2 | \n", "2075091 | \n", "04/01/2009 10:07 | \n", "2009 | \n", "2008/09 | \n", "Special Service | \n", "
3 | \n", "2872091 | \n", "05/01/2009 12:27 | \n", "2009 | \n", "2008/09 | \n", "Special Service | \n", "
4 | \n", "3553091 | \n", "06/01/2009 15:23 | \n", "2009 | \n", "2008/09 | \n", "Special Service | \n", "