Big Data Analysis

Data analysis with Pyspark

Big data
Pyspark
Python
matplotlib
Author

Kunal Khurana

Published

March 16, 2024

Installation

!pip install pyspark
Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 2.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: py4j==0.10.9.7 in /usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=5a72176bdbb100fe9611d828be35fe034a7c320431f0d36ecee1247bc4eb9a7e
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1

Installing libraries

from pyspark.sql import SparkSession
from pyspark.sql.functions import count, desc, col, max, struct
import matplotlib.pyplot as plt
spark = SparkSession.builder.appName('spark_app').getOrCreate()
traffic_collision_data = '/content/traffic-collision-data-from-2010-to-present.csv'
# prompt: load traffic_collision_data with spark

df = spark.read.csv(traffic_collision_data, header=True, inferSchema=True)
df.show()
+---------+-------------------+--------------------+-------------+-------+-----------+------------------+----------+----------------------+--------------------+----------+----------+--------------+------------+-------------------+--------------------+--------------------+--------------------+---------------+----------------+-------------------+--------------------+-----------------+---------------------------------+
|DR Number|      Date Reported|       Date Occurred|Time Occurred|Area ID|  Area Name|Reporting District|Crime Code|Crime Code Description|            MO Codes|Victim Age|Victim Sex|Victim Descent|Premise Code|Premise Description|             Address|        Cross Street|            Location|      Zip Codes|   Census Tracts|Precinct Boundaries|   LA Specific Plans|Council Districts|Neighborhood Councils (Certified)|
+---------+-------------------+--------------------+-------------+-------+-----------+------------------+----------+----------------------+--------------------+----------+----------+--------------+------------+-------------------+--------------------+--------------------+--------------------+---------------+----------------+-------------------+--------------------+-----------------+---------------------------------+
|191323054|2019-11-30 00:00:00|2019-11-30T00:00:...|         0130|     13|     Newton|              1385|       997|     TRAFFIC COLLISION|                NULL|      NULL|         F|             H|         101|             STREET|SAN PEDRO        ...|SLAUSON          ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '33....|            22352|                              786|
|192020666|2019-11-30 00:00:00|2019-11-30T00:00:...|         0015|     20|    Olympic|              2054|       997|     TRAFFIC COLLISION|                NULL|        40|         M|             W|         101|             STREET|OLYMPIC          ...|KINGSLEY         ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            22723|                              617|
|191616992|2019-11-30 00:00:00|2019-11-30T00:00:...|         0230|     16|   Foothill|              1669|       997|     TRAFFIC COLLISION|                NULL|        18|         M|             W|         101|             STREET|TUJUNGA CANYON   ...|LA TUNA CANYON   ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|             3222|                                5|
|191824082|2019-11-30 00:00:00|2019-11-30T00:00:...|         0730|     18|  Southeast|              1802|       997|     TRAFFIC COLLISION|                0605|        23|         M|             H|         101|             STREET|88TH             ...|MAIN             ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '33....|            22352|                              800|
|191616980|2019-11-30 00:00:00|2019-11-30T00:00:...|         0720|     16|   Foothill|              1689|       997|     TRAFFIC COLLISION|                NULL|      NULL|         M|             H|         101|             STREET|            CROCKETT|             SUNLAND|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            19335|                              166|
|191824078|2019-11-30 00:00:00|2019-11-30T00:00:...|         1050|     18|  Southeast|              1836|       997|     TRAFFIC COLLISION|4025 3037 3004 30...|        54|         F|             B|         101|             STREET|COMPTON          ...|109TH            ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '33....|            24354|                              817|
|190417458|2019-11-30 00:00:00|2019-11-30T00:00:...|         0130|     04| Hollenbeck|              0411|       997|     TRAFFIC COLLISION|                0605|        33|         F|             B|         101|             STREET|            BROADWAY|PASADENA         ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            23448|                              492|
|191616985|2019-11-30 00:00:00|2019-11-30T00:00:...|         0700|     16|   Foothill|              1687|       997|     TRAFFIC COLLISION|                NULL|        35|         M|             H|         101|             STREET|TUJUNGA          ...|PENROSE          ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            19335|                              162|
|191718751|2019-11-30 00:00:00|2019-11-30T00:00:...|         1230|     17| Devonshire|              1775|       997|     TRAFFIC COLLISION|                NULL|        51|         M|             O|         101|             STREET|                WISH|           VINCENNES|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            18514|                              104|
|191718743|2019-11-30 00:00:00|2019-11-30T00:00:...|         0010|     17| Devonshire|              1737|       997|     TRAFFIC COLLISION|                NULL|        23|         F|             H|         101|             STREET|         HAYVENHURST|SAN FERNANDO MISSION|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            19329|                               78|
|191824080|2019-11-30 00:00:00|2019-11-30T00:00:...|         0945|     18|  Southeast|              1842|       997|     TRAFFIC COLLISION|4025 0101 3028 30...|        26|         M|             H|         101|             STREET|111TH            ...|MAIN             ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '33....|            24355|                              812|
|190720157|2019-11-29 00:00:00|2019-11-29T00:00:...|         1115|     07|   Wilshire|              0766|       997|     TRAFFIC COLLISION|                NULL|        17|         M|             H|         101|             STREET|VENICE           ...|LA FAYETTE       ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            23080|                              651|
|190518783|2019-11-29 00:00:00|2019-11-29T00:00:...|         0650|     05|     Harbor|              0515|       997|     TRAFFIC COLLISION|4025 3036 3004 30...|        44|         F|             H|         101|             STREET|AVALON           ...|R                ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '33....|             3350|                              952|
|192119165|2019-11-29 00:00:00|2019-11-29T00:00:...|         1005|     21|    Topanga|              2143|       997|     TRAFFIC COLLISION|                NULL|        58|         M|             A|         101|             STREET|22800    VICTORY ...|                NULL|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            19347|                              300|
|191018309|2019-11-29 00:00:00|2019-11-29T00:00:...|         0710|     10|West Valley|              1035|       997|     TRAFFIC COLLISION|                NULL|        99|         X|             X|         101|             STREET|VICTORY          ...|WHITE OAK        ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|             4286|                             NULL|
|192119188|2019-11-29 00:00:00|2019-11-29T00:00:...|         1800|     21|    Topanga|              2126|       997|     TRAFFIC COLLISION|                NULL|        22|         M|             H|         101|             STREET|DE SOTO          ...|STRATHERN        ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|             4282|                              276|
|191018335|2019-11-29 00:00:00|2019-11-29T00:00:...|         1600|     10|West Valley|              1023|       997|     TRAFFIC COLLISION|                NULL|        29|         M|             H|         101|             STREET|6600    TAMPA    ...|                NULL|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            18909|                              269|
|191616964|2019-11-29 00:00:00|2019-11-29T00:00:...|         0645|     16|   Foothill|              1602|       997|     TRAFFIC COLLISION|                NULL|        28|         M|             H|         101|             STREET|FILMORE          ...|DRONFIELD        ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            18907|                               22|
|190222525|2019-11-29 00:00:00|2019-11-29T00:00:...|         0545|     02|    Rampart|              0237|       997|     TRAFFIC COLLISION|                NULL|        38|         F|             H|         101|             STREET|3RD              ...|WITMER           ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            23444|                              550|
|190518807|2019-11-30 00:00:00|2019-11-29T00:00:...|         2220|     05|     Harbor|              0515|       997|     TRAFFIC COLLISION|                NULL|        44|         F|             H|         101|             STREET|               FRIES|PACIFIC COAST    ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '33....|             3350|                              954|
+---------+-------------------+--------------------+-------------+-------+-----------+------------------+----------+----------------------+--------------------+----------+----------+--------------+------------+-------------------+--------------------+--------------------+--------------------+---------------+----------------+-------------------+--------------------+-----------------+---------------------------------+
only showing top 20 rows

Preprocessing

# prompt: drop the DR number column

df = df.drop('DR Number')
df.show()
+-------------------+--------------------+-------------+-------+-----------+------------------+----------+----------------------+--------------------+----------+----------+--------------+------------+-------------------+--------------------+--------------------+--------------------+---------------+----------------+-------------------+--------------------+-----------------+---------------------------------+
|      Date Reported|       Date Occurred|Time Occurred|Area ID|  Area Name|Reporting District|Crime Code|Crime Code Description|            MO Codes|Victim Age|Victim Sex|Victim Descent|Premise Code|Premise Description|             Address|        Cross Street|            Location|      Zip Codes|   Census Tracts|Precinct Boundaries|   LA Specific Plans|Council Districts|Neighborhood Councils (Certified)|
+-------------------+--------------------+-------------+-------+-----------+------------------+----------+----------------------+--------------------+----------+----------+--------------+------------+-------------------+--------------------+--------------------+--------------------+---------------+----------------+-------------------+--------------------+-----------------+---------------------------------+
|2019-11-30 00:00:00|2019-11-30T00:00:...|         0130|     13|     Newton|              1385|       997|     TRAFFIC COLLISION|                NULL|      NULL|         F|             H|         101|             STREET|SAN PEDRO        ...|SLAUSON          ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '33....|            22352|                              786|
|2019-11-30 00:00:00|2019-11-30T00:00:...|         0015|     20|    Olympic|              2054|       997|     TRAFFIC COLLISION|                NULL|        40|         M|             W|         101|             STREET|OLYMPIC          ...|KINGSLEY         ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            22723|                              617|
|2019-11-30 00:00:00|2019-11-30T00:00:...|         0230|     16|   Foothill|              1669|       997|     TRAFFIC COLLISION|                NULL|        18|         M|             W|         101|             STREET|TUJUNGA CANYON   ...|LA TUNA CANYON   ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|             3222|                                5|
|2019-11-30 00:00:00|2019-11-30T00:00:...|         0730|     18|  Southeast|              1802|       997|     TRAFFIC COLLISION|                0605|        23|         M|             H|         101|             STREET|88TH             ...|MAIN             ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '33....|            22352|                              800|
|2019-11-30 00:00:00|2019-11-30T00:00:...|         0720|     16|   Foothill|              1689|       997|     TRAFFIC COLLISION|                NULL|      NULL|         M|             H|         101|             STREET|            CROCKETT|             SUNLAND|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            19335|                              166|
|2019-11-30 00:00:00|2019-11-30T00:00:...|         1050|     18|  Southeast|              1836|       997|     TRAFFIC COLLISION|4025 3037 3004 30...|        54|         F|             B|         101|             STREET|COMPTON          ...|109TH            ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '33....|            24354|                              817|
|2019-11-30 00:00:00|2019-11-30T00:00:...|         0130|     04| Hollenbeck|              0411|       997|     TRAFFIC COLLISION|                0605|        33|         F|             B|         101|             STREET|            BROADWAY|PASADENA         ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            23448|                              492|
|2019-11-30 00:00:00|2019-11-30T00:00:...|         0700|     16|   Foothill|              1687|       997|     TRAFFIC COLLISION|                NULL|        35|         M|             H|         101|             STREET|TUJUNGA          ...|PENROSE          ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            19335|                              162|
|2019-11-30 00:00:00|2019-11-30T00:00:...|         1230|     17| Devonshire|              1775|       997|     TRAFFIC COLLISION|                NULL|        51|         M|             O|         101|             STREET|                WISH|           VINCENNES|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            18514|                              104|
|2019-11-30 00:00:00|2019-11-30T00:00:...|         0010|     17| Devonshire|              1737|       997|     TRAFFIC COLLISION|                NULL|        23|         F|             H|         101|             STREET|         HAYVENHURST|SAN FERNANDO MISSION|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            19329|                               78|
|2019-11-30 00:00:00|2019-11-30T00:00:...|         0945|     18|  Southeast|              1842|       997|     TRAFFIC COLLISION|4025 0101 3028 30...|        26|         M|             H|         101|             STREET|111TH            ...|MAIN             ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '33....|            24355|                              812|
|2019-11-29 00:00:00|2019-11-29T00:00:...|         1115|     07|   Wilshire|              0766|       997|     TRAFFIC COLLISION|                NULL|        17|         M|             H|         101|             STREET|VENICE           ...|LA FAYETTE       ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            23080|                              651|
|2019-11-29 00:00:00|2019-11-29T00:00:...|         0650|     05|     Harbor|              0515|       997|     TRAFFIC COLLISION|4025 3036 3004 30...|        44|         F|             H|         101|             STREET|AVALON           ...|R                ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '33....|             3350|                              952|
|2019-11-29 00:00:00|2019-11-29T00:00:...|         1005|     21|    Topanga|              2143|       997|     TRAFFIC COLLISION|                NULL|        58|         M|             A|         101|             STREET|22800    VICTORY ...|                NULL|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            19347|                              300|
|2019-11-29 00:00:00|2019-11-29T00:00:...|         0710|     10|West Valley|              1035|       997|     TRAFFIC COLLISION|                NULL|        99|         X|             X|         101|             STREET|VICTORY          ...|WHITE OAK        ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|             4286|                             NULL|
|2019-11-29 00:00:00|2019-11-29T00:00:...|         1800|     21|    Topanga|              2126|       997|     TRAFFIC COLLISION|                NULL|        22|         M|             H|         101|             STREET|DE SOTO          ...|STRATHERN        ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|             4282|                              276|
|2019-11-29 00:00:00|2019-11-29T00:00:...|         1600|     10|West Valley|              1023|       997|     TRAFFIC COLLISION|                NULL|        29|         M|             H|         101|             STREET|6600    TAMPA    ...|                NULL|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            18909|                              269|
|2019-11-29 00:00:00|2019-11-29T00:00:...|         0645|     16|   Foothill|              1602|       997|     TRAFFIC COLLISION|                NULL|        28|         M|             H|         101|             STREET|FILMORE          ...|DRONFIELD        ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            18907|                               22|
|2019-11-29 00:00:00|2019-11-29T00:00:...|         0545|     02|    Rampart|              0237|       997|     TRAFFIC COLLISION|                NULL|        38|         F|             H|         101|             STREET|3RD              ...|WITMER           ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            23444|                              550|
|2019-11-30 00:00:00|2019-11-29T00:00:...|         2220|     05|     Harbor|              0515|       997|     TRAFFIC COLLISION|                NULL|        44|         F|             H|         101|             STREET|               FRIES|PACIFIC COAST    ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '33....|             3350|                              954|
+-------------------+--------------------+-------------+-------+-----------+------------------+----------+----------------------+--------------------+----------+----------+--------------+------------+-------------------+--------------------+--------------------+--------------------+---------------+----------------+-------------------+--------------------+-----------------+---------------------------------+
only showing top 20 rows
df.summary
pyspark.sql.dataframe.DataFrame.summary
def summary(*statistics: str) -> 'DataFrame'
Computes specified statistics for numeric and string columns. Available statistics are:
- count
- mean
- stddev
- min
- max
- arbitrary approximate percentiles specified as a percentage (e.g., 75%)

If no statistics are given, this function computes count, mean, stddev, min,
approximate quartiles (percentiles at 25%, 50%, and 75%), and max.

.. versionadded:: 2.3.0

.. versionchanged:: 3.4.0
    Supports Spark Connect.

Parameters
----------
statistics : str, optional
     Column names to calculate statistics by (default All columns).

Returns
-------
:class:`DataFrame`
    A new DataFrame that provides statistics for the given DataFrame.

Notes
-----
This function is meant for exploratory data analysis, as we make no
guarantee about the backward compatibility of the schema of the resulting
:class:`DataFrame`.

Examples
--------
>>> df = spark.createDataFrame(
...     [("Bob", 13, 40.3, 150.5), ("Alice", 12, 37.8, 142.3), ("Tom", 11, 44.1, 142.2)],
...     ["name", "age", "weight", "height"],
... )
>>> df.select("age", "weight", "height").summary().show()
+-------+----+------------------+-----------------+
|summary| age|            weight|           height|
+-------+----+------------------+-----------------+
|  count|   3|                 3|                3|
|   mean|12.0| 40.73333333333333|            145.0|
| stddev| 1.0|3.1722757341273704|4.763402145525822|
|    min|  11|              37.8|            142.2|
|    25%|  11|              37.8|            142.2|
|    50%|  12|              40.3|            142.3|
|    75%|  13|              44.1|            150.5|
|    max|  13|              44.1|            150.5|
+-------+----+------------------+-----------------+

>>> df.select("age", "weight", "height").summary("count", "min", "25%", "75%", "max").show()
+-------+---+------+------+
|summary|age|weight|height|
+-------+---+------+------+
|  count|  3|     3|     3|
|    min| 11|  37.8| 142.2|
|    25%| 11|  37.8| 142.2|
|    75%| 13|  44.1| 150.5|
|    max| 13|  44.1| 150.5|
+-------+---+------+------+

See Also
--------
DataFrame.display
# prompt: print Schema

df.printSchema()
root
 |-- Date Reported: timestamp (nullable = true)
 |-- Date Occurred: string (nullable = true)
 |-- Time Occurred: string (nullable = true)
 |-- Area ID: string (nullable = true)
 |-- Area Name: string (nullable = true)
 |-- Reporting District: string (nullable = true)
 |-- Crime Code: integer (nullable = true)
 |-- Crime Code Description: string (nullable = true)
 |-- MO Codes: string (nullable = true)
 |-- Victim Age: string (nullable = true)
 |-- Victim Sex: string (nullable = true)
 |-- Victim Descent: string (nullable = true)
 |-- Premise Code: string (nullable = true)
 |-- Premise Description: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Cross Street: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Zip Codes: string (nullable = true)
 |-- Census Tracts: string (nullable = true)
 |-- Precinct Boundaries: string (nullable = true)
 |-- LA Specific Plans: string (nullable = true)
 |-- Council Districts: string (nullable = true)
 |-- Neighborhood Councils (Certified): string (nullable = true)
# prompt: see the shape of a dataframe

print(f"Number of rows: {df.count()}")
print(f"Number of columns: {len(df.columns)}")
Number of rows: 294684
Number of columns: 23
# prompt: select two columns area id and victim age

df.select('Area ID', 'Victim Age').show()
+-------+----------+
|Area ID|Victim Age|
+-------+----------+
|     13|      NULL|
|     20|        40|
|     16|        18|
|     18|        23|
|     16|      NULL|
|     18|        54|
|     04|        33|
|     16|        35|
|     17|        51|
|     17|        23|
|     18|        26|
|     07|        17|
|     05|        44|
|     21|        58|
|     10|        99|
|     21|        22|
|     10|        29|
|     16|        28|
|     02|        38|
|     05|        44|
+-------+----------+
only showing top 20 rows
# prompt: select those records where the victin age  is 17

df.select('Area ID', 'Victim Age').where(df['Victim Age'] == 17).show()
+-------+----------+
|Area ID|Victim Age|
+-------+----------+
|     07|        17|
|     09|        17|
|     09|        17|
|     03|        17|
|     19|        17|
|     20|        17|
|     15|        17|
|     03|        17|
|     17|        17|
|     21|        17|
|     03|        17|
|     17|        17|
|     08|        17|
|     13|        17|
|     21|        17|
|     13|        17|
|     09|        17|
|     07|        17|
|     15|        17|
|     06|        17|
+-------+----------+
only showing top 20 rows

df.select('*').where(df['Victim Age'] == 17).show()
+-------------------+--------------------+-------------+-------+-----------+------------------+----------+----------------------+--------------------+----------+----------+--------------+------------+-------------------+--------------------+--------------------+--------------------+---------------+----------------+-------------------+--------------------+-----------------+---------------------------------+
|      Date Reported|       Date Occurred|Time Occurred|Area ID|  Area Name|Reporting District|Crime Code|Crime Code Description|            MO Codes|Victim Age|Victim Sex|Victim Descent|Premise Code|Premise Description|             Address|        Cross Street|            Location|      Zip Codes|   Census Tracts|Precinct Boundaries|   LA Specific Plans|Council Districts|Neighborhood Councils (Certified)|
+-------------------+--------------------+-------------+-------+-----------+------------------+----------+----------------------+--------------------+----------+----------+--------------+------------+-------------------+--------------------+--------------------+--------------------+---------------+----------------+-------------------+--------------------+-----------------+---------------------------------+
|2019-11-29 00:00:00|2019-11-29T00:00:...|         1115|     07|   Wilshire|              0766|       997|     TRAFFIC COLLISION|                NULL|        17|         M|             H|         101|             STREET|VENICE           ...|LA FAYETTE       ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            23080|                              651|
|2019-11-29 00:00:00|2019-11-28T00:00:...|         2341|     09|   Van Nuys|              0935|       997|     TRAFFIC COLLISION|                0605|        17|         M|             O|         101|             STREET|            VAN NUYS|VICTORY          ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            19729|                              236|
|2019-11-20 00:00:00|2019-11-20T00:00:...|         1640|     09|   Van Nuys|              0929|       997|     TRAFFIC COLLISION|                NULL|        17|         F|             W|         101|             STREET|VICTORY          ...|FULTON           ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            19729|                              184|
|2019-11-18 00:00:00|2019-11-18T00:00:...|         1625|     03|  Southwest|              0333|       997|     TRAFFIC COLLISION|4025 3036 3004 30...|        17|         F|             B|         101|             STREET|CRENSHAW         ...|30TH             ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            23079|                              668|
|2019-11-15 00:00:00|2019-11-15T00:00:...|         1820|     19|    Mission|              1956|       997|     TRAFFIC COLLISION|                NULL|        17|         M|             W|         101|             STREET|             WOODMAN|          DEVONSHIRE|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            19330|                               72|
|2019-11-14 00:00:00|2019-11-14T00:00:...|         2222|     20|    Olympic|              2064|       997|     TRAFFIC COLLISION|3003 3026 3029 30...|        17|         F|             H|         101|             STREET|HOOVER           ...|8TH              ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            22723|                              618|
|2019-11-14 00:00:00|2019-11-14T00:00:...|         0935|     15|N Hollywood|              1566|       997|     TRAFFIC COLLISION|                NULL|        17|         M|             B|         101|             STREET|VINELAND         ...|RIVERSIDE        ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|             8489|                              343|
|2019-11-11 00:00:00|2019-11-11T00:00:...|         1500|     03|  Southwest|              0311|       997|     TRAFFIC COLLISION|4025 3037 3004 30...|        17|         M|             B|         101|             STREET|              HAUSER|                GEER|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            23077|                              666|
|2019-05-09 00:00:00|2019-05-09T00:00:...|         2030|     17| Devonshire|              1701|       997|     TRAFFIC COLLISION|3004 3030 3028 31...|        17|         M|             W|         101|             STREET|PORTER RANCH     ...|RINALDI          ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            18905|                               66|
|2019-05-09 00:00:00|2019-05-09T00:00:...|         2000|     21|    Topanga|              2113|       997|     TRAFFIC COLLISION|3004 3037 3029 30...|        17|         F|             W|         101|             STREET|SATICOY          ...|TOPANGA CANYON   ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|             4279|                              277|
|2019-05-05 00:00:00|2019-05-05T00:00:...|         1240|     03|  Southwest|              0338|       997|     TRAFFIC COLLISION|4003 3036 3008 30...|        17|         M|             H|         101|             STREET|WALTON           ...|30TH             ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            22724|                              685|
|2019-04-28 00:00:00|2019-04-28T00:00:...|         2040|     17| Devonshire|              1752|       997|     TRAFFIC COLLISION|3036 3101 3003 30...|        17|         M|             H|         101|             STREET|CANOGA           ...|LASSEN           ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|             4284|                             NULL|
|2019-04-26 00:00:00|2019-04-26T00:00:...|         2030|     08|    West LA|              0801|       997|     TRAFFIC COLLISION|3004 3028 3030 30...|        17|         M|             A|         101|             STREET|CASTELLAMMARE    ...|TRAMONTO         ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            25066|                             2332|
|2019-11-06 00:00:00|2019-11-06T00:00:...|         0700|     13|     Newton|              1344|       997|     TRAFFIC COLLISION|3101 3401 3701 30...|        17|         F|             H|         101|             STREET|41ST             ...|MCKINLEY         ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            22727|                              713|
|2019-11-07 00:00:00|2019-11-06T00:00:...|         1740|     21|    Topanga|              2136|       997|     TRAFFIC COLLISION|3003 3036 3029 30...|        17|         M|             H|         101|             STREET|SATICOY          ...|DEERING          ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|             4278|                              276|
|2019-11-04 00:00:00|2019-11-04T00:00:...|         1220|     13|     Newton|              1343|       997|     TRAFFIC COLLISION|3401 3101 3701 30...|        17|         M|             H|         101|             STREET|JEFFERSON        ...|WADSWORTH        ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            22727|                              712|
|2019-11-01 00:00:00|2019-11-01T00:00:...|         1645|     09|   Van Nuys|              0943|       997|     TRAFFIC COLLISION|                NULL|        17|         F|             O|         101|             STREET|              OXNARD|           HAZELTINE|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            19729|                              243|
|2019-10-29 00:00:00|2019-10-29T00:00:...|         0810|     07|   Wilshire|              0749|       997|     TRAFFIC COLLISION|3003 3025 3029 30...|        17|         F|             H|         101|             STREET|CRENSHAW         ...|COUNTRY CLUB     ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            23080|                              612|
|2019-10-27 00:00:00|2019-10-27T00:00:...|         0945|     15|N Hollywood|              1581|       997|     TRAFFIC COLLISION|                0605|        17|         F|             W|         101|             STREET|COLDWATER CANYON ...|WOODBRIDGE       ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|             8492|                              346|
|2019-10-27 00:00:00|2019-10-27T00:00:...|         1305|     06|  Hollywood|              0639|       997|     TRAFFIC COLLISION|3004 3025 3035 30...|        17|         M|             W|         101|             STREET|HOLLYWOOD        ...|HOBART           ...|"{'longitude': '-...| ""city"": """"| ""state"": """"|    ""zip"": """"}'| 'latitude': '34....|            23445|                              419|
+-------------------+--------------------+-------------+-------+-----------+------------------+----------+----------------------+--------------------+----------+----------+--------------+------------+-------------------+--------------------+--------------------+--------------------+---------------+----------------+-------------------+--------------------+-----------------+---------------------------------+
only showing top 20 rows
# prompt: find out which area is most prone to crimes

most_crime_prone_area = df.groupBy('Area ID').agg(count('*').alias('total_crimes')).orderBy(desc('total_crimes')).first().asDict()['Area ID']
print(f"Most crime prone area: {most_crime_prone_area}")
Most crime prone area: 12
# prompt: find out top 10 crime codes

top_10_crime_codes = df.groupBy('Crime Code Description').agg(count('*').alias('total_crimes')).orderBy(desc('total_crimes')).limit(10).toPandas()['Crime Code Description'].tolist()

print("Top 10 crime codes:")
for code in top_10_crime_codes:
    print(f"\t- {code}")
Top 10 crime codes:
    - TRAFFIC COLLISION
    - 180
# prompt: find out top 10 victim age

top_10_victim_ages = df.groupBy('Victim Age').agg(count('*').alias('total_victims')).orderBy(desc('total_victims')).limit(10).toPandas()['Victim Age'].tolist()

print("Top 10 victim ages:")
for age in top_10_victim_ages:
    print(f"\t- {age}")
Top 10 victim ages:
    - None
    - 30
    - 25
    - 27
    - 24
    - 28
    - 26
    - 23
    - 35
    - 29

Visualization

# prompt: visualize the result of top 10 victim age in the form of a pie chart

# Get the top 10 victim ages and their counts
top_10_victim_ages = df.groupBy('Victim Age').agg(count('*').alias('total_victims')).orderBy(desc('total_victims')).limit(10).toPandas()

# Create a pie chart of the top 10 victim ages
plt.figure(figsize=(12, 6))
plt.pie(top_10_victim_ages['total_victims'], labels=top_10_victim_ages['Victim Age'], autopct='%1.1f%%')
plt.title('Top 10 Victim Ages')
plt.show()