{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Capstone Assignment 2: Approach Deliverable\n",
        "\n",
        "## Analyzing a Neo4j Graph Database via Cypher + GDS\n",
        "\n",
        "**Assignment:** FAERS (FDA Adverse Event Reporting System) Graph Database Analysis\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 1. Introduction and Context\n",
        "\n",
        "This assignment involves analyzing a healthcare graph database built from the FDA's Adverse Event Reporting System (FAERS) data. The FAERS database tracks adverse events and medication errors reported for approved drugs and therapeutic biologics, serving as a critical tool for post-market safety surveillance.\n",
        "\n",
        "The graph contains eight node types:\n",
        "- **Drug**: Medications that potentially contributed to adverse events\n",
        "- **Case**: Patient demographic information\n",
        "- **Reaction**: Specific adverse reactions experienced by patients\n",
        "- **ReportSource**: Origin of the adverse event report\n",
        "- **Outcome**: Long-term outcomes (e.g., Death, Hospitalization, Disability)\n",
        "- **Therapy**: Description of prescribed therapy\n",
        "- **Manufacturer**: Pharmaceutical company names\n",
        "- **AgeGroup**: Patient age demographics\n",
        "\n",
        "The analysis will progress through three phases:\n",
        "1. **Exploratory Data Analysis (EDA)** using Cypher queries\n",
        "2. **Deeper Analytical Questions** using advanced Cypher\n",
        "3. **Graph Data Science (GDS)** analysis using Neo4j's GDS library"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Neo4j connection\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "import pandas as pd\n",
        "from neo4j import GraphDatabase\n",
        "\n",
        "# Set env\n",
        "URI = os.environ.get(\"NEO4J_URI\")\n",
        "USER = os.environ.get(\"NEO4J_USER\")\n",
        "PASSWORD = os.environ.get(\"NEO4J_PASSWORD\")\n",
        "\n",
        "driver = GraphDatabase.driver(URI, auth=(USER, PASSWORD))\n",
        "\n",
        "def run_query(query, parameters=None):\n",
        "    \"\"\"Run Cypher and return a pandas DataFrame. Use for MATCH, CALL gds.*, etc.\"\"\"\n",
        "    with driver.session() as session:\n",
        "        result = session.run(query, parameters or {})\n",
        "        return pd.DataFrame([dict(record) for record in result])\n",
        "\n",
        "# Show full content in DataFrames \n",
        "pd.set_option(\"display.max_colwidth\", None)\n",
        "pd.set_option(\"display.max_rows\", 100)\n",
        "pd.set_option(\"display.max_columns\", None)\n",
        "\n",
        "# hide warnings\n",
        "import warnings\n",
        "warnings.filterwarnings('ignore')\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 2. Overall Methodology\n",
        "\n",
        "### 2.1 Iterative Discovery Approach\n",
        "\n",
        "Given that the complete graph schema is not fully documented, I will adopt an **iterative discovery methodology**:\n",
        "\n",
        "1. **Start Broad, Then Narrow**: Begin with high-level queries to understand graph structure, then drill down into specific relationships and properties\n",
        "2. **Validate Assumptions**: Each query will inform the next, allowing me to validate my understanding of node properties and relationship patterns\n",
        "3. **Document as I Go**: Maintain clear documentation of discovered schema elements, property names, and relationship types to inform subsequent analyses\n",
        "\n",
        "### 2.2 Query Development Strategy\n",
        "\n",
        "- **Incremental Complexity**: Start with simple `MATCH` and `RETURN` statements, then progressively add `WHERE` clauses, aggregations, and ordering\n",
        "- **Performance Awareness**: For large result sets, use `LIMIT` clauses during exploration, then remove them for final results\n",
        "- **Result Validation**: Cross-reference query results with domain knowledge about healthcare data to ensure logical consistency"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## EDA: Graph Database Queries\n",
        "\n",
        "The following exploratory queries establish the structure and content of the Healthcare (FAERS) graph."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 1) Total number of nodes\n",
        "\n",
        "**Query logic:** I match every node in the graph with `MATCH (n)` and return the count. This gives the total number of entities (nodes) in the Healthcare graph.\n",
        "\n",
        "**Explanation:** Knowing the total node count establishes the scale of the database and provides a baseline for understanding how many drugs, cases, reactions, outcomes, and other entity types are represented overall.\n",
        "\n",
        "**Interpretation of results:** The graph contains **11,381 nodes**, indicating a moderate-sized FAERS dataset suitable for exploratory and GDS analysis. This scale supports both Cypher-based queries and in-memory graph projections."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 27,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>total_nodes</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>11381</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "   total_nodes\n",
              "0        11381"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df = run_query(\"\"\"\n",
        "MATCH (n)\n",
        "RETURN count(n) AS total_nodes;\n",
        "\"\"\")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 2) All distinct graph labels\n",
        "\n",
        "**Query logic:** The procedure `db.labels()` returns every node label (type) present in the database.\n",
        "\n",
        "**Explanation:** Labels define the kinds of entities in the graph (e.g., Drug, Case, Reaction). Listing them confirms we have the expected eight FAERS entity types and reveals the exact label names to use in later queries.\n",
        "\n",
        "**Interpretation of results:** The graph has **eight distinct node types** (AgeGroup, Case, Drug, Manufacturer, Outcome, Reaction, ReportSource, Therapy), matching the expected FAERS schema. These labels are used throughout the analytical and GDS sections."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>label</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>AgeGroup</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Case</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Drug</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Manufacturer</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Outcome</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>Reaction</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>ReportSource</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>Therapy</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "          label\n",
              "0      AgeGroup\n",
              "1          Case\n",
              "2          Drug\n",
              "3  Manufacturer\n",
              "4       Outcome\n",
              "5      Reaction\n",
              "6  ReportSource\n",
              "7       Therapy"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df = run_query(\"\"\"\n",
        "CALL db.labels() YIELD label\n",
        "RETURN label\n",
        "ORDER BY label;\n",
        "\"\"\")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 3) Total number of relationships\n",
        "\n",
        "**Query logic:** I match any pattern of the form source node to relationship to target node with `MATCH ()-[r]->()` and return the count of relationships. Each directed edge is counted once.\n",
        "\n",
        "**Explanation:** The relationship count shows how densely the graph is connected. Together with the node count, it helps assess data volume and the ratio of links to entities, which is useful for understanding connectivity and planning analytical queries.\n",
        "\n",
        "**Interpretation of results:** With **61,453 relationships**, the graph has roughly 5.4 edges per node on average, indicating moderate connectivity that supports path-based and GDS analyses."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 29,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>total_relationships</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>61453</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "   total_relationships\n",
              "0                61453"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df = run_query(\"\"\"\n",
        "MATCH ()-[r]->()\n",
        "RETURN count(r) AS total_relationships;\n",
        "\"\"\")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 4) All distinct relationship types\n",
        "\n",
        "**Query logic:** I traverse all relationships with `MATCH ()-[r]->()`, use `DISTINCT type(r)` to get unique relationship type names, and order them alphabetically.\n",
        "\n",
        "**Explanation:** Relationship types define how nodes are linked (e.g., Case-to-Reaction, Drug-to-Case). Listing them is necessary to write correct patterns in later Cypher queries and to understand how cases, drugs, reactions, outcomes, and other entities are connected in the FAERS model.\n",
        "\n",
        "**Interpretation of results:** The **11 relationship types** include HAS_REACTION (Case→Reaction), IS_PRIMARY_SUSPECT, IS_SECONDARY_SUSPECT, IS_CONCOMITANT, IS_INTERACTING (Case→Drug), RESULTED_IN (Case→Outcome), and REGISTERED (Manufacturer→Case). These are central to the analytical and GDS queries."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 30,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>relationship_type</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>FALLS_UNDER</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>HAS_REACTION</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>IS_CONCOMITANT</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>IS_INTERACTING</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>IS_PRIMARY_SUSPECT</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>IS_SECONDARY_SUSPECT</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>PRESCRIBED</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>RECEIVED</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>REGISTERED</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>REPORTED_BY</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>10</th>\n",
              "      <td>RESULTED_IN</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "       relationship_type\n",
              "0            FALLS_UNDER\n",
              "1           HAS_REACTION\n",
              "2         IS_CONCOMITANT\n",
              "3         IS_INTERACTING\n",
              "4     IS_PRIMARY_SUSPECT\n",
              "5   IS_SECONDARY_SUSPECT\n",
              "6             PRESCRIBED\n",
              "7               RECEIVED\n",
              "8             REGISTERED\n",
              "9            REPORTED_BY\n",
              "10           RESULTED_IN"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df = run_query(\"\"\"\n",
        "MATCH ()-[r]->()\n",
        "RETURN DISTINCT type(r) AS relationship_type\n",
        "ORDER BY relationship_type;\n",
        "\"\"\")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 5) Unique properties for each node type (label)\n",
        "\n",
        "**Query logic:** The procedure `db.schema.nodeTypeProperties()` returns each node label together with the property names defined on that label. I aggregate by label with `collect(DISTINCT propertyName)` so that all properties for a given node type appear in one row.\n",
        "\n",
        "**Explanation:** Each label can have different properties (e.g., Case might have `gender`, `age`; Reaction might have `description`). Knowing the exact property names per label is required to write correct filters and return values in the analytical and GDS sections that follow.\n",
        "\n",
        "**Interpretation of results:** Case nodes include **primaryid**, **age**, **gender**, **reporterOccupation**, and **reportDate**, needed for demographic and community-based analysis. Reaction uses **description** for adverse-event text; Outcome uses **outcome**; Drug uses **name**. These property names are used in all subsequent queries."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 63,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Received notification from DBMS server: <GqlStatusObject gql_status='01N62', status_description='warn: procedure or function execution warning. Execution of the procedure db.schema.nodeTypeProperties() generated the warning The field `propertyTypes` will change output format in the next major version.', position=<SummaryInputPosition line=2, column=1, offset=1>, raw_classification='GENERIC', classification=<NotificationClassification.GENERIC: 'GENERIC'>, raw_severity='WARNING', severity=<NotificationSeverity.WARNING: 'WARNING'>, diagnostic_record={'_classification': 'GENERIC', '_severity': 'WARNING', '_position': {'offset': 1, 'line': 2, 'column': 1}, 'OPERATION': '', 'OPERATION_CODE': '0', 'CURRENT_SCHEMA': '/'}> for query: '\\nCALL db.schema.nodeTypeProperties()\\nYIELD nodeLabels, propertyName\\nRETURN nodeLabels, collect(DISTINCT propertyName) AS properties\\nORDER BY nodeLabels;\\n'\n"
          ]
        },
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>nodeLabels</th>\n",
              "      <th>properties</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>[AgeGroup]</td>\n",
              "      <td>[ageGroup]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>[Case]</td>\n",
              "      <td>[primaryid, age, ageUnit, gender, eventDate, reportDate, reporterOccupation]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>[Drug]</td>\n",
              "      <td>[name, primarySubstabce]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>[Manufacturer]</td>\n",
              "      <td>[manufacturerName]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>[Outcome]</td>\n",
              "      <td>[code, outcome]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>[Reaction]</td>\n",
              "      <td>[description]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>[ReportSource]</td>\n",
              "      <td>[name, code]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>[Therapy]</td>\n",
              "      <td>[primaryid]</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "       nodeLabels  \\\n",
              "0      [AgeGroup]   \n",
              "1          [Case]   \n",
              "2          [Drug]   \n",
              "3  [Manufacturer]   \n",
              "4       [Outcome]   \n",
              "5      [Reaction]   \n",
              "6  [ReportSource]   \n",
              "7       [Therapy]   \n",
              "\n",
              "                                                                     properties  \n",
              "0                                                                    [ageGroup]  \n",
              "1  [primaryid, age, ageUnit, gender, eventDate, reportDate, reporterOccupation]  \n",
              "2                                                      [name, primarySubstabce]  \n",
              "3                                                            [manufacturerName]  \n",
              "4                                                               [code, outcome]  \n",
              "5                                                                 [description]  \n",
              "6                                                                  [name, code]  \n",
              "7                                                                   [primaryid]  "
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df=run_query(\"\"\"\n",
        "CALL db.schema.nodeTypeProperties()\n",
        "YIELD nodeLabels, propertyName\n",
        "RETURN nodeLabels, collect(DISTINCT propertyName) AS properties\n",
        "ORDER BY nodeLabels;\n",
        "\"\"\")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 6) Distinct genders in Case node 'gender' property\n",
        "\n",
        "**Query logic:** I match nodes with the `Case` label, filter to those where `gender` is not null, and return distinct values of `c.gender` ordered alphabetically.\n",
        "\n",
        "**Explanation:** Case nodes represent patients; the `gender` property is a demographic used in later analysis (e.g., community detection or risk by subgroup). Listing distinct values shows the coding scheme (e.g., Male/Female, M/F) and reveals any data-quality issues or unexpected values.\n",
        "\n",
        "**Interpretation of results:** The three gender values (**F**, **M**, **U**) use standard FAERS coding. Gender is populated and ready for demographic subgroup analysis and community detection."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 32,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>gender</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>F</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>M</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>U</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "  gender\n",
              "0      F\n",
              "1      M\n",
              "2      U"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df = run_query(\"\"\"\n",
        "MATCH (c:Case)\n",
        "WHERE c.gender IS NOT NULL\n",
        "RETURN DISTINCT c.gender AS gender\n",
        "ORDER BY gender;\n",
        "\"\"\")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 7) Number of distinct reactions\n",
        "\n",
        "**Query logic:** I match all `Reaction` nodes, restrict to those with a non-null `description` (the property that holds the reaction text in this graph), and return the count of distinct `description` values.\n",
        "\n",
        "**Explanation:** Reactions are the adverse events (e.g., \"Pain\", \"Nausea\") reported for cases. Counting distinct reaction descriptions gives the size of the \"vocabulary\" of adverse events in the data and supports later analyses such as most frequent reactions or patient similarity based on reaction profiles.\n",
        "\n",
        "**Interpretation of results:** The **2,701 distinct reaction descriptions** indicate a rich adverse-event vocabulary, supporting frequency analysis and patient-similarity comparisons in the GDS section."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 33,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>distinct_reactions</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>2701</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "   distinct_reactions\n",
              "0                2701"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df = run_query(\"\"\"\n",
        "MATCH (r:Reaction)\n",
        "WHERE r.description IS NOT NULL\n",
        "RETURN count(DISTINCT r.description) AS distinct_reactions;\n",
        "\"\"\")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 8) All distinct Outcome description values\n",
        "\n",
        "**Query logic:** I match all `Outcome` nodes, filter to those with a non-null `outcome` property, and return distinct values of `o.outcome` in alphabetical order.\n",
        "\n",
        "**Explanation:** Outcomes describe the long-term result of the adverse event (e.g., \"Death\", \"Hospitalization\", \"Disability\"). Listing distinct outcome values shows the exact strings used in the database, which is needed to correctly filter for the four most severe outcomes (Death, Life-Threatening, Disability, Hospitalization) in the deeper analytical questions.\n",
        "\n",
        "**Interpretation of results:** The six outcome types include the four severe outcomes (**Death**, **Life-Threatening**, **Disability**, **Hospitalization - Initial or Prolonged**) required for the drug-severity analysis. The exact string for hospitalization must be used in the analytical queries."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 34,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>outcome_description</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Congenital Anomaly</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Death</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Disability</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Hospitalization - Initial or Prolonged</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Life-Threatening</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>Other Serious (Important Medical Event)</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                       outcome_description\n",
              "0                       Congenital Anomaly\n",
              "1                                    Death\n",
              "2                               Disability\n",
              "3   Hospitalization - Initial or Prolonged\n",
              "4                         Life-Threatening\n",
              "5  Other Serious (Important Medical Event)"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df = run_query(\"\"\"\n",
        "MATCH (o:Outcome)\n",
        "WHERE o.outcome IS NOT NULL\n",
        "RETURN DISTINCT o.outcome AS outcome_description\n",
        "ORDER BY outcome_description;\n",
        "\"\"\")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## Deeper Analytical Questions\n",
        "\n",
        "The following queries use the graph structure discovered in EDA: node labels; relationship types (`HAS_REACTION`, `RESULTED_IN`, `REGISTERED`, and suspect types from Case to Drug); and property names (`Reaction.description`, `Outcome.outcome`, `Drug.name`, `Manufacturer.manufacturerName`). Note that Manufacturer links to Case via `REGISTERED`, not to Drug."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 1) Top 20 most frequently occurring reactions\n",
        "\n",
        "**Query logic:** I use the EDA-discovered pattern: cases link to reactions with `(c:Case)-[:HAS_REACTION]->(r:Reaction)`, and the reaction text is in `r.description`. I match all such case–reaction pairs, restrict to non-null `description`, and count how many times each reaction appears (one count per pair). I return `r.description` as `reaction` and `count(*)` as `frequency`, order by frequency descending, and take the top 20. Thus \"frequency\" is the number of (Case, Reaction) pairs, so reactions reported in more cases rank higher.\n",
        "\n",
        "**Explanation:** This ranks adverse events by how often they are reported in the database. The result supports post-market surveillance (which reactions to monitor), labeling discussions, and comparison with other data sources. The top reactions reflect both true prevalence and reporting patterns.\n",
        "\n",
        "**Interpretation of results:** **Fatigue** (303), **Product dose omission issue** (285), **Headache** (272), **Nausea** (256), and **Pain** (253) lead the list. These are common, often non-specific adverse events that may reflect both true drug effects and reporting bias. COVID-19 (140) appears in the top 20, consistent with pandemic-era FAERS data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 35,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>reaction</th>\n",
              "      <th>frequency</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Fatigue</td>\n",
              "      <td>303</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Product dose omission issue</td>\n",
              "      <td>285</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Headache</td>\n",
              "      <td>272</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Nausea</td>\n",
              "      <td>256</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Pain</td>\n",
              "      <td>253</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>Dyspnoea</td>\n",
              "      <td>245</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>Pneumonia</td>\n",
              "      <td>229</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>Diarrhoea</td>\n",
              "      <td>219</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>Fall</td>\n",
              "      <td>198</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>Off label use</td>\n",
              "      <td>196</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>10</th>\n",
              "      <td>Pyrexia</td>\n",
              "      <td>184</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>11</th>\n",
              "      <td>Dizziness</td>\n",
              "      <td>178</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>12</th>\n",
              "      <td>Sinusitis</td>\n",
              "      <td>166</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>13</th>\n",
              "      <td>Urinary tract infection</td>\n",
              "      <td>158</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>14</th>\n",
              "      <td>Asthenia</td>\n",
              "      <td>150</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>15</th>\n",
              "      <td>Vomiting</td>\n",
              "      <td>146</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>16</th>\n",
              "      <td>Drug ineffective</td>\n",
              "      <td>144</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>17</th>\n",
              "      <td>Malaise</td>\n",
              "      <td>143</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>18</th>\n",
              "      <td>COVID-19</td>\n",
              "      <td>140</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>19</th>\n",
              "      <td>Weight decreased</td>\n",
              "      <td>129</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                       reaction  frequency\n",
              "0                       Fatigue        303\n",
              "1   Product dose omission issue        285\n",
              "2                      Headache        272\n",
              "3                        Nausea        256\n",
              "4                          Pain        253\n",
              "5                      Dyspnoea        245\n",
              "6                     Pneumonia        229\n",
              "7                     Diarrhoea        219\n",
              "8                          Fall        198\n",
              "9                 Off label use        196\n",
              "10                      Pyrexia        184\n",
              "11                    Dizziness        178\n",
              "12                    Sinusitis        166\n",
              "13      Urinary tract infection        158\n",
              "14                     Asthenia        150\n",
              "15                     Vomiting        146\n",
              "16             Drug ineffective        144\n",
              "17                      Malaise        143\n",
              "18                     COVID-19        140\n",
              "19             Weight decreased        129"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df = run_query(\"\"\"\n",
        "MATCH (c:Case)-[:HAS_REACTION]->(r:Reaction)\n",
        "WHERE r.description IS NOT NULL\n",
        "RETURN r.description AS reaction, count(*) AS frequency\n",
        "ORDER BY frequency DESC\n",
        "LIMIT 20;\n",
        "\"\"\")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 2) Top 10 drugs associated with severe outcomes\n",
        "\n",
        "**Query logic:** The graph models suspect roles as **Case pointing to Drug** via `IS_PRIMARY_SUSPECT` or `IS_SECONDARY_SUSPECT` relationships (e.g. `(c:Case)-[:IS_PRIMARY_SUSPECT|IS_SECONDARY_SUSPECT]->(d:Drug)`), and case outcome as `(c:Case)-[:RESULTED_IN]->(o:Outcome)` with the outcome label in `o.outcome` (EDA 8). I match drugs that are linked to cases as either primary or secondary suspects, where those cases have one of the four severe outcomes: `Death`, `Life-Threatening`, `Disability`, and `Hospitalization - Initial or Prolonged` (exact strings as in the database). I count **distinct cases** per drug so each patient is counted once even if multiple suspect relationships exist, then order by that count descending and limit to 10. Drug name is returned from `d.name`.\n",
        "\n",
        "**Explanation:** This ranks drugs by how often they appear in cases with the most serious outcomes. It supports signal detection for regulatory review, labeling, and risk management. The count reflects reported associations, not proven causation; follow-up analysis would account for confounders and exposure.\n",
        "\n",
        "**Interpretation of results:** **REVLIMID** (218 severe cases) leads by a wide margin, consistent with its use in high-risk oncology (multiple myeloma treatment). **NIVOLUMAB** (82 cases) and **ATEZOLIZUMAB** (77 cases) are checkpoint inhibitor immunotherapies used in oncology, which are known to have serious immune-related adverse events. **HUMAN NORMAL IMMUNOGLOBULIN; LIQUID** (66 cases) and **CUVITRU** (61 cases) are immunoglobulin replacement therapies, likely used in immunocompromised patients who are at higher risk for severe outcomes. **POMALYST** (65 cases) is another oncology drug (multiple myeloma), similar to REVLIMID. **DEXAMETHASONE** (65 cases) is a corticosteroid used in serious conditions including oncology and inflammatory diseases. **CYCLOPHOSPHAMIDE** (64 cases) is a chemotherapy agent. **REMODULIN** (57 cases) is used for pulmonary arterial hypertension, a serious cardiovascular condition. **Teduglutide** (53 cases) is used for short bowel syndrome. The top 10 is dominated by oncology and immunology drugs, reflecting the high-risk patient populations these medications treat. These associations warrant further investigation, not causal inference."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 49,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>drug_name</th>\n",
              "      <th>severe_cases</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>REVLIMID</td>\n",
              "      <td>218</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>NIVOLUMAB</td>\n",
              "      <td>82</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>ATEZOLIZUMAB</td>\n",
              "      <td>77</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>HUMAN NORMAL IMMUNOGLOBULIN; LIQUID</td>\n",
              "      <td>66</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>POMALYST</td>\n",
              "      <td>65</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>DEXAMETHASONE</td>\n",
              "      <td>65</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>CYCLOPHOSPHAMIDE</td>\n",
              "      <td>64</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>CUVITRU</td>\n",
              "      <td>61</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>REMODULIN</td>\n",
              "      <td>57</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>Teduglutide</td>\n",
              "      <td>53</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                             drug_name  severe_cases\n",
              "0                             REVLIMID           218\n",
              "1                            NIVOLUMAB            82\n",
              "2                         ATEZOLIZUMAB            77\n",
              "3  HUMAN NORMAL IMMUNOGLOBULIN; LIQUID            66\n",
              "4                             POMALYST            65\n",
              "5                        DEXAMETHASONE            65\n",
              "6                     CYCLOPHOSPHAMIDE            64\n",
              "7                              CUVITRU            61\n",
              "8                            REMODULIN            57\n",
              "9                          Teduglutide            53"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df = run_query(\"\"\"\n",
        "MATCH (d:Drug)<-[:IS_PRIMARY_SUSPECT|IS_SECONDARY_SUSPECT]-(c:Case)\n",
        "MATCH (c)-[:RESULTED_IN]->(o:Outcome)\n",
        "WHERE o.outcome IN ['Death', 'Life-Threatening', 'Disability', 'Hospitalization - Initial or Prolonged']\n",
        "RETURN d.name AS drug_name, count(DISTINCT c) AS severe_cases\n",
        "ORDER BY severe_cases DESC\n",
        "LIMIT 10;\n",
        "\n",
        "\"\"\")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 3) Top 10 manufacturers by number of drugs with reported side effects\n",
        "\n",
        "**Query logic:** Discovery showed that **Manufacturer** connects to **Case** via `(m:Manufacturer)-[:REGISTERED]->(c:Case)` (not to Drug); the manufacturer identifier is `m.manufacturerName`. I restrict to cases that have at least one reported reaction (`(c)-[:HAS_REACTION]->(r:Reaction)`), then for each such case we collect the drugs linked as any suspect type (`(d:Drug)<-[:IS_PRIMARY_SUSPECT|IS_SECONDARY_SUSPECT|IS_CONCOMITANT|IS_INTERACTING]-(c)`). Per manufacturer we return `m.manufacturerName` and the count of **distinct drugs** across all their registered cases that have a reaction. Order by that count descending and limit to 10.\n",
        "\n",
        "**Explanation:** This ranks manufacturers by the diversity of drugs (distinct product names) that appear in their registered cases with at least one adverse reaction. It supports comparative oversight—which firms have the broadest set of products with reported side effects in this dataset—and can inform prioritization of inspections or requests for additional safety data. The metric reflects report volume and product mix, not necessarily severity.\n",
        "\n",
        "**Interpretation of results:** **PFIZER** (617 drugs) and **ROCHE** (595) lead, reflecting large portfolios and high report volume. **CELGENE** (452) and **NOVARTIS** (386) follow. These manufacturers have the widest product diversity in cases with reported reactions, which may warrant prioritization for safety surveillance."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 37,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>manufacturer</th>\n",
              "      <th>drugs_with_side_effects</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>PFIZER</td>\n",
              "      <td>617</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>ROCHE</td>\n",
              "      <td>595</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>CELGENE</td>\n",
              "      <td>452</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>NOVARTIS</td>\n",
              "      <td>386</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>TAKEDA</td>\n",
              "      <td>356</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>ABBVIE</td>\n",
              "      <td>352</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>BRISTOL MYERS SQUIBB</td>\n",
              "      <td>307</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>JOHNSON AND JOHNSON</td>\n",
              "      <td>218</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>GLAXOSMITHKLINE</td>\n",
              "      <td>200</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>AMGEN</td>\n",
              "      <td>191</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "           manufacturer  drugs_with_side_effects\n",
              "0                PFIZER                      617\n",
              "1                 ROCHE                      595\n",
              "2               CELGENE                      452\n",
              "3              NOVARTIS                      386\n",
              "4                TAKEDA                      356\n",
              "5                ABBVIE                      352\n",
              "6  BRISTOL MYERS SQUIBB                      307\n",
              "7   JOHNSON AND JOHNSON                      218\n",
              "8       GLAXOSMITHKLINE                      200\n",
              "9                 AMGEN                      191"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df = run_query(\"\"\"\n",
        "MATCH (m:Manufacturer)-[:REGISTERED]->(c:Case)-[:HAS_REACTION]->(r:Reaction)\n",
        "MATCH (d:Drug)<-[:IS_PRIMARY_SUSPECT|IS_SECONDARY_SUSPECT|IS_CONCOMITANT|IS_INTERACTING]-(c)\n",
        "RETURN m.manufacturerName AS manufacturer, count(DISTINCT d) AS drugs_with_side_effects\n",
        "ORDER BY drugs_with_side_effects DESC\n",
        "LIMIT 10;\n",
        "\"\"\")\n",
        "display(df)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### 4) Top 5 drugs and their distinct side effects for the leading manufacturer (PFIZER)\n",
        "\n",
        "**Query logic:** I use the top manufacturer from Question 3 (**PFIZER**, `manufacturerName = 'PFIZER'`). The path is the same as in Q3: `(m:Manufacturer)-[:REGISTERED]->(c:Case)-[:HAS_REACTION]->(r:Reaction)` and `(d:Drug)<-[:IS_PRIMARY_SUSPECT|IS_SECONDARY_SUSPECT|IS_CONCOMITANT|IS_INTERACTING]-(c)`. I filter to non-null `r.description`, then for each drug return `d.name`, `collect(DISTINCT r.description)` as `side_effects`, and `count(DISTINCT c)` as `case_count`. I order by case count descending and limit to 5, so we get the five drugs (in PFIZER’s registered cases with reactions) that appear in the most cases, each with its full list of distinct reported side effects.\n",
        "\n",
        "**Explanation:** This drills down from Q3’s leading manufacturer to show which of its drugs have the most case volume and what adverse reactions were reported for each. It supports manufacturer-level review and product-specific safety profiles. The `side_effects` list can be long; it is the full set of distinct reaction descriptions for that drug in this subset of cases.\n",
        "\n",
        "**Interpretation of results:**\n",
        "\n",
        "- **LYRICA** (120 cases) and **GENOTROPIN** (77) have the highest case counts for PFIZER. \n",
        "- LYRICA's side-effect profile includes pain, dizziness, and CNS effects; GENOTROPIN's includes device-related issues and metabolic effects. \n",
        "- **IBRANCE**, **XELJANZ XR**, and **XELJANZ** show oncology/immunology reaction patterns (neutropenia, infections, thrombosis). \n",
        "\n",
        "The long side-effect lists reflect diverse reported adverse events per drug."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 59,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>drug_name</th>\n",
              "      <th>side_effects</th>\n",
              "      <th>case_count</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>LYRICA</td>\n",
              "      <td>[Ageusia, Pain, Hypoacusis, Temporomandibular joint syndrome, Depressed level of consciousness, Intentional product use issue, Gingival disorder, Malaise, Anosmia, Viral infection, Deafness, Dry mouth, Confusional state, Memory impairment, Toothache, Panic attack, Anxiety, Hallucination, Dyspnoea, Withdrawal syndrome, Drug ineffective, Feeling abnormal, Feeling of despair, Illness, Condition aggravated, Chest pain, Irritability, Muscle spasms, Headache, Blood pressure increased, Abdominal pain upper, Intentional product misuse, Upper limb fracture, Fall, Onychoclasis, Insomnia, Pollakiuria, Product dispensing error, Nail growth abnormal, Paralysis, Impaired driving ability, Tinnitus, Dysphagia, Sedation, Back pain, Intervertebral disc protrusion, Visual impairment, Pain in extremity, Joint swelling, Arthralgia, Depression, Hypertension, Nasal congestion, Nasopharyngitis, Sinusitis, Product dose omission issue, Nervousness, Reading disorder, Cataract, Eye disorder, Fungal infection, Hot flush, Granuloma annulare, Somnolence, Pain in jaw, Arthritis, Angina pectoris, General physical health deterioration, Chills, Discomfort, Weight decreased, Musculoskeletal stiffness, Body temperature abnormal, Asthenia, Hypersomnia, Influenza like illness, Gait disturbance, Feeling cold, Eating disorder, Heart rate increased, Pyrexia, Nausea, Bone disorder, Movement disorder, Joint stiffness, Thinking abnormal, Migraine, Off label use, Feeling of body temperature change, Dizziness, Sepsis, Osteomyelitis, Prescribed overdose, Gait inability, Death, Sitting disability, Neuralgia, Tremor, Spinal cord compression, Panic reaction, ...]</td>\n",
              "      <td>120</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>GENOTROPIN</td>\n",
              "      <td>[Device breakage, Drug dose omission by device, Product prescribing error, Device use error, Device issue, Device physical property issue, Product physical issue, Wrong technique in device usage process, Device mechanical issue, Device use issue, Device leakage, Device power source issue, Blindness, Device information output issue, Incorrect dose administered by device, Poor quality device used, Device malfunction, Incorrect dose administered, Product dispensing error, Product dose omission issue, Injection site pain, Expired device used, Product quality issue, Product leakage, Needle issue, Increased appetite, Injection site haemorrhage, Insulin-like growth factor increased, Product storage error, Product colour issue, Poor quality product administered, Device delivery system issue, Off label use, Emotional disorder, Crying, Bone density increased, Device maintenance issue, Intentional device misuse, Bone density decreased, Mood altered, Product dose omission in error, Asthenia, Loss of personal independence in daily activities, Hip fracture, Fatigue, Device difficult to use, Device defective, Suicide attempt, Injection site mass, Weight increased, Headache, Eye disorder, Depression, Product odour abnormal, Disturbance in attention, Withdrawal syndrome, Judgement impaired, Hypoglycaemia, Product prescribing issue, Circumstance or information capable of leading to medication error, Drug level below therapeutic, Tinnitus, Feeling abnormal, Dizziness, Mood swings, Hypotension, Impaired driving ability, Malaise, Confusional state, Dyspnoea, Anxiety, Adrenal insufficiency, Memory impairment, Hepatic steatosis, Accidental overdose, Vitamin D deficiency, Throat tightness, Dysphonia, Amnesia, Chest discomfort, Exposure via skin contact, Senile osteoporosis, Oedema, Peripheral swelling, Fear of injection, Psychological trauma, Visual impairment, Condition aggravated, Nausea, Vomiting, Blood thyroid stimulating hormone increased, Globulins decreased, Protein total decreased, Aspartate aminotransferase increased, Cardiac failure, Pain in extremity, Gait disturbance, Nipple disorder, Device failure]</td>\n",
              "      <td>77</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>IBRANCE</td>\n",
              "      <td>[Product dose omission issue, Product prescribing error, Neoplasm progression, Neutropenia, Anaemia, Off label use, Thrombocytopenia, White blood cell count decreased, Foot fracture, Product complaint, Poor quality product administered, Diarrhoea, Blood count abnormal, Neutrophil count decreased, Dyspnoea exertional, Pulmonary embolism, Illness, Malaise, Amnesia, Cerebrovascular accident, Fatigue, Headache, Dizziness, Bone pain, Myalgia, Alopecia, Tooth fracture, Depression, Mobility decreased, Gait disturbance, Pulmonary mucormycosis, Pneumonia, Weight decreased, Swelling, Lymphoedema, Blood glucose increased, Febrile neutropenia, COVID-19, Neuropathy peripheral, Vision blurred, Peripheral swelling, Blood chloride decreased, Blood sodium decreased, Thrombosis, Amino acid level decreased, Renal disorder, Blood albumin decreased, Arthralgia, Nasopharyngitis, Vertigo, Oropharyngeal pain, Dysstasia, Feeling abnormal, Sleep disorder, Asthenia, Dyspnoea, Hordeolum, Energy increased, Increased appetite, Gastroenteritis viral, Platelet count decreased, Muscle strain, Constipation, Flatulence, Drug hypersensitivity, Cough, Dysphagia, Foreign body in respiratory tract, Myocardial infarction, Tinnitus, Joint swelling, Weight increased, Nail disorder, Nail growth abnormal, Paraesthesia, Product formulation issue, Choking, Foreign body in throat, Body height decreased, Stress, Sinusitis, Rheumatoid arthritis, Blood pressure increased, Hernia, Arthritis, Lymphadenopathy, Hot flush, Stomatitis, Immune system disorder, Ear discomfort, Vomiting, Back pain, Chills, Heart rate increased, Pyrexia, Palpitations, Nausea, Upper respiratory tract infection, Joint injury, Product use issue, ...]</td>\n",
              "      <td>48</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>XELJANZ XR</td>\n",
              "      <td>[Arthralgia, Feeling abnormal, Rheumatoid arthritis, Pain, Intentional product use issue, Peripheral swelling, COVID-19, Gout, Rotator cuff syndrome, Aphonia, Fall, Contusion, Muscle rupture, Hair growth abnormal, Musculoskeletal stiffness, Impaired healing, Confusional state, Angle closure glaucoma, Therapeutic response unexpected, Illness, Condition aggravated, Scar, Lumbar vertebral fracture, Perioral dermatitis, Infection, Rash, Haematochezia, Influenza, Fatigue, Diarrhoea, Pyrexia, Irritable bowel syndrome, Herpes zoster, Gastrointestinal haemorrhage, Oropharyngeal pain, Blister, Gastric disorder, Cerebrovascular accident, Crepitations, Femur fracture, Memory impairment, Nasal discomfort, Ocular discomfort, Hypoaesthesia, Transient ischaemic attack, Pneumonia, Back pain, Cough, Delirium, Asthenia, Chest pain, Loss of personal independence in daily activities, Deafness, Weight increased, Product prescribing error, Foot fracture, Renal disorder, Wound infection, Ear pain, Ear swelling, Dizziness, Drug ineffective, Post procedural discharge, Rhinovirus infection, Vomiting, Gastroenteritis viral, Intentional dose omission, Product dose omission issue, Pain in extremity, Arthritis, Movement disorder, Skin cancer, Joint swelling, Blood cholesterol increased, Blindness unilateral, Thrombosis, Sensitivity to weather change, Joint range of motion decreased, Gait disturbance, Headache, Feeling hot, Therapeutic product effect incomplete, Hypertension, Pain of skin, Dyspnoea, Cellulitis, Off label use, Urinary tract infection, Foot deformity, Insomnia, Anxiety, Malaise, Nasal congestion, Glycosylated haemoglobin increased, Disability, Epistaxis, Blood glucose increased, Red blood cell count decreased, Intentional product misuse, Decreased immune responsiveness, ...]</td>\n",
              "      <td>47</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>XELJANZ</td>\n",
              "      <td>[Rheumatoid arthritis, Lymphocyte count decreased, White blood cell count decreased, Pain, Off label use, Bronchitis, Hypokinesia, Autoimmune disorder, Therapeutic response unexpected, Malaise, Spinal fracture, Illness, Pain in extremity, Product dose omission issue, Arthralgia, Joint swelling, Pruritus, Nasopharyngitis, Influenza, Localised infection, Cough, Dyspnoea, Urinary tract infection, Rhinorrhoea, Weight increased, Drug hypersensitivity, Sinus disorder, Prescribed overdose, Insomnia, Sensitivity to weather change, Intentional product misuse, Spinal disorder, Thrombosis, Visual impairment, Loss of personal independence in daily activities, Asthenia, Deafness, Death, Drug ineffective, Condition aggravated, Infection, Skin cancer, Deep vein thrombosis, Pulmonary embolism, Tremor, Confusional state, Herpes zoster, Respiratory tract infection, Neoplasm malignant, Dizziness, Feeling abnormal, Peripheral swelling, Joint injury, Muscle strain, Immune system disorder, Diplopia, Barrett's oesophagus, Sunburn, Eye disorder, Fall, Body height decreased, Bronchitis chronic, Epistaxis, Laboratory test abnormal, Limb injury, Vertigo, Chills, Balance disorder, Weight decreased, Tendonitis, Photosensitivity reaction, Conjunctivitis, Sinusitis, Vomiting, Cystitis, Diverticulitis, Gait disturbance, Neuropathy peripheral, Viral infection, Pneumonia, Muscle spasms, Arthropathy, Rash, Diarrhoea, Blood cholesterol increased, Intentional product use issue, Diabetes mellitus, Memory impairment, Cerebrovascular accident, Mental impairment, Aphasia, Cellulitis, Antipsychotic drug level increased, Mental disorder, Hordeolum, Product use issue, Staphylococcal infection, Swelling, Vitamin D increased, Loss of consciousness, ...]</td>\n",
              "      <td>32</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "    drug_name  \\\n",
              "0      LYRICA   \n",
              "1  GENOTROPIN   \n",
              "2     IBRANCE   \n",
              "3  XELJANZ XR   \n",
              "4     XELJANZ   \n",
              "\n",
              "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    side_effects  \\\n",
              "0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       [Ageusia, Pain, Hypoacusis, Temporomandibular joint syndrome, Depressed level of consciousness, Intentional product use issue, Gingival disorder, Malaise, Anosmia, Viral infection, Deafness, Dry mouth, Confusional state, Memory impairment, Toothache, Panic attack, Anxiety, Hallucination, Dyspnoea, Withdrawal syndrome, Drug ineffective, Feeling abnormal, Feeling of despair, Illness, Condition aggravated, Chest pain, Irritability, Muscle spasms, Headache, Blood pressure increased, Abdominal pain upper, Intentional product misuse, Upper limb fracture, Fall, Onychoclasis, Insomnia, Pollakiuria, Product dispensing error, Nail growth abnormal, Paralysis, Impaired driving ability, Tinnitus, Dysphagia, Sedation, Back pain, Intervertebral disc protrusion, Visual impairment, Pain in extremity, Joint swelling, Arthralgia, Depression, Hypertension, Nasal congestion, Nasopharyngitis, Sinusitis, Product dose omission issue, Nervousness, Reading disorder, Cataract, Eye disorder, Fungal infection, Hot flush, Granuloma annulare, Somnolence, Pain in jaw, Arthritis, Angina pectoris, General physical health deterioration, Chills, Discomfort, Weight decreased, Musculoskeletal stiffness, Body temperature abnormal, Asthenia, Hypersomnia, Influenza like illness, Gait disturbance, Feeling cold, Eating disorder, Heart rate increased, Pyrexia, Nausea, Bone disorder, Movement disorder, Joint stiffness, Thinking abnormal, Migraine, Off label use, Feeling of body temperature change, Dizziness, Sepsis, Osteomyelitis, Prescribed overdose, Gait inability, Death, Sitting disability, Neuralgia, Tremor, Spinal cord compression, Panic reaction, ...]   \n",
              "1  [Device breakage, Drug dose omission by device, Product prescribing error, Device use error, Device issue, Device physical property issue, Product physical issue, Wrong technique in device usage process, Device mechanical issue, Device use issue, Device leakage, Device power source issue, Blindness, Device information output issue, Incorrect dose administered by device, Poor quality device used, Device malfunction, Incorrect dose administered, Product dispensing error, Product dose omission issue, Injection site pain, Expired device used, Product quality issue, Product leakage, Needle issue, Increased appetite, Injection site haemorrhage, Insulin-like growth factor increased, Product storage error, Product colour issue, Poor quality product administered, Device delivery system issue, Off label use, Emotional disorder, Crying, Bone density increased, Device maintenance issue, Intentional device misuse, Bone density decreased, Mood altered, Product dose omission in error, Asthenia, Loss of personal independence in daily activities, Hip fracture, Fatigue, Device difficult to use, Device defective, Suicide attempt, Injection site mass, Weight increased, Headache, Eye disorder, Depression, Product odour abnormal, Disturbance in attention, Withdrawal syndrome, Judgement impaired, Hypoglycaemia, Product prescribing issue, Circumstance or information capable of leading to medication error, Drug level below therapeutic, Tinnitus, Feeling abnormal, Dizziness, Mood swings, Hypotension, Impaired driving ability, Malaise, Confusional state, Dyspnoea, Anxiety, Adrenal insufficiency, Memory impairment, Hepatic steatosis, Accidental overdose, Vitamin D deficiency, Throat tightness, Dysphonia, Amnesia, Chest discomfort, Exposure via skin contact, Senile osteoporosis, Oedema, Peripheral swelling, Fear of injection, Psychological trauma, Visual impairment, Condition aggravated, Nausea, Vomiting, Blood thyroid stimulating hormone increased, Globulins decreased, Protein total decreased, Aspartate aminotransferase increased, Cardiac failure, Pain in extremity, Gait disturbance, Nipple disorder, Device failure]   \n",
              "2                                                                                                                                                                                                                                                                                                                                                                                                                                           [Product dose omission issue, Product prescribing error, Neoplasm progression, Neutropenia, Anaemia, Off label use, Thrombocytopenia, White blood cell count decreased, Foot fracture, Product complaint, Poor quality product administered, Diarrhoea, Blood count abnormal, Neutrophil count decreased, Dyspnoea exertional, Pulmonary embolism, Illness, Malaise, Amnesia, Cerebrovascular accident, Fatigue, Headache, Dizziness, Bone pain, Myalgia, Alopecia, Tooth fracture, Depression, Mobility decreased, Gait disturbance, Pulmonary mucormycosis, Pneumonia, Weight decreased, Swelling, Lymphoedema, Blood glucose increased, Febrile neutropenia, COVID-19, Neuropathy peripheral, Vision blurred, Peripheral swelling, Blood chloride decreased, Blood sodium decreased, Thrombosis, Amino acid level decreased, Renal disorder, Blood albumin decreased, Arthralgia, Nasopharyngitis, Vertigo, Oropharyngeal pain, Dysstasia, Feeling abnormal, Sleep disorder, Asthenia, Dyspnoea, Hordeolum, Energy increased, Increased appetite, Gastroenteritis viral, Platelet count decreased, Muscle strain, Constipation, Flatulence, Drug hypersensitivity, Cough, Dysphagia, Foreign body in respiratory tract, Myocardial infarction, Tinnitus, Joint swelling, Weight increased, Nail disorder, Nail growth abnormal, Paraesthesia, Product formulation issue, Choking, Foreign body in throat, Body height decreased, Stress, Sinusitis, Rheumatoid arthritis, Blood pressure increased, Hernia, Arthritis, Lymphadenopathy, Hot flush, Stomatitis, Immune system disorder, Ear discomfort, Vomiting, Back pain, Chills, Heart rate increased, Pyrexia, Palpitations, Nausea, Upper respiratory tract infection, Joint injury, Product use issue, ...]   \n",
              "3                                                                                                                                                                                                                                                                                                                                                  [Arthralgia, Feeling abnormal, Rheumatoid arthritis, Pain, Intentional product use issue, Peripheral swelling, COVID-19, Gout, Rotator cuff syndrome, Aphonia, Fall, Contusion, Muscle rupture, Hair growth abnormal, Musculoskeletal stiffness, Impaired healing, Confusional state, Angle closure glaucoma, Therapeutic response unexpected, Illness, Condition aggravated, Scar, Lumbar vertebral fracture, Perioral dermatitis, Infection, Rash, Haematochezia, Influenza, Fatigue, Diarrhoea, Pyrexia, Irritable bowel syndrome, Herpes zoster, Gastrointestinal haemorrhage, Oropharyngeal pain, Blister, Gastric disorder, Cerebrovascular accident, Crepitations, Femur fracture, Memory impairment, Nasal discomfort, Ocular discomfort, Hypoaesthesia, Transient ischaemic attack, Pneumonia, Back pain, Cough, Delirium, Asthenia, Chest pain, Loss of personal independence in daily activities, Deafness, Weight increased, Product prescribing error, Foot fracture, Renal disorder, Wound infection, Ear pain, Ear swelling, Dizziness, Drug ineffective, Post procedural discharge, Rhinovirus infection, Vomiting, Gastroenteritis viral, Intentional dose omission, Product dose omission issue, Pain in extremity, Arthritis, Movement disorder, Skin cancer, Joint swelling, Blood cholesterol increased, Blindness unilateral, Thrombosis, Sensitivity to weather change, Joint range of motion decreased, Gait disturbance, Headache, Feeling hot, Therapeutic product effect incomplete, Hypertension, Pain of skin, Dyspnoea, Cellulitis, Off label use, Urinary tract infection, Foot deformity, Insomnia, Anxiety, Malaise, Nasal congestion, Glycosylated haemoglobin increased, Disability, Epistaxis, Blood glucose increased, Red blood cell count decreased, Intentional product misuse, Decreased immune responsiveness, ...]   \n",
              "4                                                                                                                                                                                                                                                                                                                                                                                                      [Rheumatoid arthritis, Lymphocyte count decreased, White blood cell count decreased, Pain, Off label use, Bronchitis, Hypokinesia, Autoimmune disorder, Therapeutic response unexpected, Malaise, Spinal fracture, Illness, Pain in extremity, Product dose omission issue, Arthralgia, Joint swelling, Pruritus, Nasopharyngitis, Influenza, Localised infection, Cough, Dyspnoea, Urinary tract infection, Rhinorrhoea, Weight increased, Drug hypersensitivity, Sinus disorder, Prescribed overdose, Insomnia, Sensitivity to weather change, Intentional product misuse, Spinal disorder, Thrombosis, Visual impairment, Loss of personal independence in daily activities, Asthenia, Deafness, Death, Drug ineffective, Condition aggravated, Infection, Skin cancer, Deep vein thrombosis, Pulmonary embolism, Tremor, Confusional state, Herpes zoster, Respiratory tract infection, Neoplasm malignant, Dizziness, Feeling abnormal, Peripheral swelling, Joint injury, Muscle strain, Immune system disorder, Diplopia, Barrett's oesophagus, Sunburn, Eye disorder, Fall, Body height decreased, Bronchitis chronic, Epistaxis, Laboratory test abnormal, Limb injury, Vertigo, Chills, Balance disorder, Weight decreased, Tendonitis, Photosensitivity reaction, Conjunctivitis, Sinusitis, Vomiting, Cystitis, Diverticulitis, Gait disturbance, Neuropathy peripheral, Viral infection, Pneumonia, Muscle spasms, Arthropathy, Rash, Diarrhoea, Blood cholesterol increased, Intentional product use issue, Diabetes mellitus, Memory impairment, Cerebrovascular accident, Mental impairment, Aphasia, Cellulitis, Antipsychotic drug level increased, Mental disorder, Hordeolum, Product use issue, Staphylococcal infection, Swelling, Vitamin D increased, Loss of consciousness, ...]   \n",
              "\n",
              "   case_count  \n",
              "0         120  \n",
              "1          77  \n",
              "2          48  \n",
              "3          47  \n",
              "4          32  "
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df = run_query(\"\"\"\n",
        "MATCH (m:Manufacturer {manufacturerName: 'PFIZER'})-[:REGISTERED]->(c:Case)-[:HAS_REACTION]->(r:Reaction)\n",
        "MATCH (d:Drug)<-[:IS_PRIMARY_SUSPECT|IS_SECONDARY_SUSPECT|IS_CONCOMITANT|IS_INTERACTING]-(c)\n",
        "WHERE r.description IS NOT NULL\n",
        "RETURN d.name AS drug_name,\n",
        "       collect(DISTINCT r.description) AS side_effects,\n",
        "       count(DISTINCT c) AS case_count\n",
        "ORDER BY case_count DESC\n",
        "LIMIT 5;\n",
        "\"\"\")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "## GDS: Graph Data Science Analysis\n",
        "\n",
        "The following analyses use Neo4j's Graph Data Science (GDS) library to identify similar patient journeys and patient clusters based on drug and reaction patterns discovered in EDA."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### GDS 1) Node Similarity — Similar Patient Journeys\n",
        "\n",
        "**Objective:** Identify similar patient journeys where patients experienced identical or highly overlapping sequences of reactions after taking the same drugs.\n",
        "\n",
        "**Query logic:** I create an in-memory graph projection with Case, Drug, and Reaction nodes, and relationships: Case–Drug (IS_PRIMARY_SUSPECT, IS_SECONDARY_SUSPECT, IS_CONCOMITANT) and Case–Reaction (HAS_REACTION), all undirected. A Case's neighbors are thus its drugs and reactions. I run Node Similarity with Jaccard metric; Jaccard = 1 when two patients have identical neighbor sets (same drugs + same reactions). I filter to Case-to-Case pairs only, then enrich with shared reactions and shared drugs via MATCH. Case identifiers use `primaryid`.\n",
        "\n",
        "**Explanation:** Jaccard similarity captures both drug overlap and reaction overlap. Pairs with score 1.0 have identical drug–reaction profiles; lower scores indicate partial overlap. This supports finding patients with similar adverse-event journeys for signal detection and cohort analysis.\n",
        "\n",
        "**Interpretation of results:** \n",
        "\n",
        "Pairs with similarity 1.0 share identical drug and reaction sets. Many top pairs involve **REVLIMID** with reactions such as **Pneumonia** or **Neutropenia**, consistent with oncology use and known adverse-event profiles. Pairs with multiple shared drugs (e.g., REVLIMID + DEXAMETHASONE) and shared reactions (e.g., Atrial fibrillation) represent cohorts with similar treatment and adverse-event journeys, useful for signal detection and cohort studies."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 39,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>nodeProjection</th>\n",
              "      <th>relationshipProjection</th>\n",
              "      <th>graphName</th>\n",
              "      <th>nodeCount</th>\n",
              "      <th>relationshipCount</th>\n",
              "      <th>projectMillis</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>{'Drug': {'properties': {}, 'label': 'Drug'}, ...</td>\n",
              "      <td>{'IS_CONCOMITANT': {'orientation': 'UNDIRECTED...</td>\n",
              "      <td>patient-journey</td>\n",
              "      <td>9508</td>\n",
              "      <td>73816</td>\n",
              "      <td>908</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                      nodeProjection  \\\n",
              "0  {'Drug': {'properties': {}, 'label': 'Drug'}, ...   \n",
              "\n",
              "                              relationshipProjection        graphName  \\\n",
              "0  {'IS_CONCOMITANT': {'orientation': 'UNDIRECTED...  patient-journey   \n",
              "\n",
              "   nodeCount  relationshipCount  projectMillis  \n",
              "0       9508              73816            908  "
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df=run_query(\n",
        "\"\"\"\n",
        "CALL gds.graph.project(\n",
        "  'patient-journey',\n",
        "  ['Case','Drug','Reaction'],\n",
        "  {\n",
        "    IS_PRIMARY_SUSPECT: { orientation: 'UNDIRECTED' },\n",
        "    IS_SECONDARY_SUSPECT: { orientation: 'UNDIRECTED' },\n",
        "    IS_CONCOMITANT: { orientation: 'UNDIRECTED' },\n",
        "    HAS_REACTION: { orientation: 'UNDIRECTED' }\n",
        "  }\n",
        ");\n",
        "\"\"\"\n",
        ")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 40,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>case1</th>\n",
              "      <th>case2</th>\n",
              "      <th>similarity</th>\n",
              "      <th>num_shared_reactions</th>\n",
              "      <th>sample_reactions</th>\n",
              "      <th>shared_drugs</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>111530912</td>\n",
              "      <td>158574932</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>[Pneumonia]</td>\n",
              "      <td>[REVLIMID]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>111530912</td>\n",
              "      <td>124902822</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>[Pneumonia]</td>\n",
              "      <td>[REVLIMID]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>111140142</td>\n",
              "      <td>147242912</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>[Neutropenia]</td>\n",
              "      <td>[REVLIMID]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>111140142</td>\n",
              "      <td>194926201</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>[Neutropenia]</td>\n",
              "      <td>[REVLIMID]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>111530912</td>\n",
              "      <td>164981372</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>[Pneumonia]</td>\n",
              "      <td>[REVLIMID]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>111530912</td>\n",
              "      <td>124977452</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>[Pneumonia]</td>\n",
              "      <td>[REVLIMID]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>109837324</td>\n",
              "      <td>125979602</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>[Atrial fibrillation]</td>\n",
              "      <td>[REVLIMID, DEXAMETHASONE]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>111140142</td>\n",
              "      <td>147065782</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>[Neutropenia]</td>\n",
              "      <td>[REVLIMID]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>111140142</td>\n",
              "      <td>194955072</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>[Neutropenia]</td>\n",
              "      <td>[REVLIMID]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>112550422</td>\n",
              "      <td>126241282</td>\n",
              "      <td>1.0</td>\n",
              "      <td>1</td>\n",
              "      <td>[Septic shock]</td>\n",
              "      <td>[Carfilzomib, DEXAMETHASONE, REVLIMID]</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "       case1      case2  similarity  num_shared_reactions  \\\n",
              "0  111530912  158574932         1.0                     1   \n",
              "1  111530912  124902822         1.0                     1   \n",
              "2  111140142  147242912         1.0                     1   \n",
              "3  111140142  194926201         1.0                     1   \n",
              "4  111530912  164981372         1.0                     1   \n",
              "5  111530912  124977452         1.0                     1   \n",
              "6  109837324  125979602         1.0                     1   \n",
              "7  111140142  147065782         1.0                     1   \n",
              "8  111140142  194955072         1.0                     1   \n",
              "9  112550422  126241282         1.0                     1   \n",
              "\n",
              "        sample_reactions                            shared_drugs  \n",
              "0            [Pneumonia]                              [REVLIMID]  \n",
              "1            [Pneumonia]                              [REVLIMID]  \n",
              "2          [Neutropenia]                              [REVLIMID]  \n",
              "3          [Neutropenia]                              [REVLIMID]  \n",
              "4            [Pneumonia]                              [REVLIMID]  \n",
              "5            [Pneumonia]                              [REVLIMID]  \n",
              "6  [Atrial fibrillation]               [REVLIMID, DEXAMETHASONE]  \n",
              "7          [Neutropenia]                              [REVLIMID]  \n",
              "8          [Neutropenia]                              [REVLIMID]  \n",
              "9         [Septic shock]  [Carfilzomib, DEXAMETHASONE, REVLIMID]  "
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df = run_query(\n",
        "\"\"\"\n",
        "CALL gds.nodeSimilarity.stream('patient-journey', {\n",
        "  similarityMetric: 'JACCARD',\n",
        "  similarityCutoff: 0.2\n",
        "})\n",
        "YIELD node1, node2, similarity\n",
        "WITH gds.util.asNode(node1) AS c1, gds.util.asNode(node2) AS c2, similarity\n",
        "WHERE c1:Case AND c2:Case AND elementId(c1) < elementId(c2)\n",
        "WITH c1, c2, similarity\n",
        "ORDER BY similarity DESC\n",
        "LIMIT 10\n",
        "MATCH (c1)-[:HAS_REACTION]->(r:Reaction)<-[:HAS_REACTION]-(c2)\n",
        "WITH c1, c2, similarity, collect(DISTINCT r.description) AS shared_reactions\n",
        "OPTIONAL MATCH (c1)-[:IS_PRIMARY_SUSPECT|IS_SECONDARY_SUSPECT|IS_CONCOMITANT]->(d:Drug)<-[:IS_PRIMARY_SUSPECT|IS_SECONDARY_SUSPECT|IS_CONCOMITANT]-(c2)\n",
        "WITH c1, c2, similarity, shared_reactions, collect(DISTINCT d.name) AS shared_drugs\n",
        "RETURN \n",
        "  c1.primaryid AS case1,\n",
        "  c2.primaryid AS case2,\n",
        "  similarity,\n",
        "  size(shared_reactions) AS num_shared_reactions,\n",
        "  shared_reactions[0..5] AS sample_reactions,\n",
        "  shared_drugs AS shared_drugs\n",
        "ORDER BY similarity DESC;\n",
        "\"\"\")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### GDS 2) Community detection — patient sub-phenotypes\n",
        "\n",
        "**Assignment question:** Use a community detection algorithm to cluster patients based on shared demographic characteristics and shared adverse reactions, with the objective of identifying sub-phenotypes of patients who may be at higher risk for specific drug–reaction combinations—i.e., do patients sharing certain demographic characteristics appear to be more likely to suffer from certain adverse reactions?\n",
        "\n",
        "**In-memory design (no database writes).**\n",
        "\n",
        "I created the demographic structure—**Gender** virtual nodes and actual **AgeGroup** and **Outcome** nodes—only within a GDS in-memory graph projection. The stored Neo4j database was not modified: no new nodes or relationships were written. **Gender** nodes are virtual (since Gender is a property on Case nodes, not a separate node type), while **AgeGroup** and **Outcome** nodes are actual nodes from the database connected via `FALLS_UNDER` and `RESULTED_IN` relationships respectively. The projection also included **Reaction** and **Drug** so that community structure reflected both demographics and adverse events.\n",
        "\n",
        "**Weights.** Relationship weights were applied only in the projection: primary suspect (Case–Drug) = 2, secondary suspect = 1.5, and concomitant/interacting = 1. This emphasized primary-suspect drug involvement when clustering. Demographic (Gender, AgeGroup), Case–Reaction, and Case–Outcome edges used weight 1 (or default).\n",
        "\n",
        "**Algorithm choice: Leiden.**\n",
        "\n",
        "I used the **Leiden** algorithm because:\n",
        "\n",
        "- It directly supported the goal of finding patient clusters by shared demographics and reactions, by grouping nodes (Cases, demographic nodes, reactions, drugs) that were densely connected in the projected graph.\n",
        "- It was well-suited to weighted, multi-type graphs and typically yielded better connected, more interpretable communities than Louvain.\n",
        "- It required no prior labels or seed communities, which fit the exploratory aim of discovering sub-phenotypes.\n",
        "\n",
        "I used the following steps: I created the in-memory projection, ran Leiden in stream mode (no write-back), and then described each community by demographics and reactions to answer the assignment question."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "Empty DataFrame\n",
              "Columns: []\n",
              "Index: []"
            ]
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "run_query(\"CALL gds.graph.drop('patient-communities', false) YIELD graphName RETURN graphName\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>graphName</th>\n",
              "      <th>nodeCount</th>\n",
              "      <th>relationshipCount</th>\n",
              "      <th>projectMillis</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>patient-communities</td>\n",
              "      <td>9522</td>\n",
              "      <td>99968</td>\n",
              "      <td>1644</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "             graphName  nodeCount  relationshipCount  projectMillis\n",
              "0  patient-communities       9522              99968           1644"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "df = run_query(\"\"\"\n",
        "CALL () {\n",
        "  MATCH (c:Case)-[r]->(d:Drug)\n",
        "  WHERE type(r) IN ['IS_PRIMARY_SUSPECT','IS_SECONDARY_SUSPECT','IS_CONCOMITANT','IS_INTERACTING']\n",
        "  WITH c, d, r ORDER BY elementId(c), elementId(d), type(r)\n",
        "  RETURN c AS source, d AS target, labels(c) AS sl, labels(d) AS tl, type(r) AS rt,\n",
        "    CASE type(r) WHEN 'IS_PRIMARY_SUSPECT' THEN 2.0 WHEN 'IS_SECONDARY_SUSPECT' THEN 1.5 ELSE 1.0 END AS w\n",
        "  UNION ALL\n",
        "  MATCH (c:Case)-[r:HAS_REACTION]->(rxn:Reaction)\n",
        "  WITH c, rxn ORDER BY elementId(c), elementId(rxn)\n",
        "  RETURN c AS source, rxn AS target, labels(c) AS sl, labels(rxn) AS tl, 'HAS_REACTION' AS rt, 1.0 AS w\n",
        "  UNION ALL\n",
        "  MATCH (c:Case) WHERE c.gender IS NOT NULL\n",
        "  WITH DISTINCT c.gender AS g ORDER BY g\n",
        "  WITH collect(g) AS genders\n",
        "  UNWIND range(0, size(genders)-1) AS i\n",
        "  WITH i, genders[i] AS g, 1000000 + i AS vid\n",
        "  MATCH (c:Case) WHERE c.gender = g\n",
        "  WITH c, vid ORDER BY elementId(c)\n",
        "  RETURN c AS source, vid AS target, labels(c) AS sl, ['Gender'] AS tl, 'HAS_GENDER' AS rt, 1.0 AS w\n",
        "  UNION ALL\n",
        "  MATCH (c:Case)-[r:FALLS_UNDER]->(ag:AgeGroup)\n",
        "  WITH c, ag ORDER BY elementId(c), elementId(ag)\n",
        "  RETURN c AS source, ag AS target, labels(c) AS sl, labels(ag) AS tl, 'FALLS_UNDER' AS rt, 1.0 AS w\n",
        "  UNION ALL\n",
        "  MATCH (c:Case)-[r:RESULTED_IN]->(o:Outcome)\n",
        "  WITH c, o ORDER BY elementId(c), elementId(o)\n",
        "  RETURN c AS source, o AS target, labels(c) AS sl, labels(o) AS tl, 'RESULTED_IN' AS rt, 1.0 AS w\n",
        "}\n",
        "WITH gds.graph.project(\n",
        "  'patient-communities',\n",
        "  source,\n",
        "  target,\n",
        "  {\n",
        "    sourceNodeLabels: sl,\n",
        "    targetNodeLabels: tl,\n",
        "    relationshipType: rt,\n",
        "    relationshipProperties: { weight: w }\n",
        "  },\n",
        "  { undirectedRelationshipTypes: ['*'], consecutiveIds: true }\n",
        ") AS g\n",
        "RETURN g.graphName AS graphName, g.nodeCount AS nodeCount, g.relationshipCount AS relationshipCount, g.projectMillis AS projectMillis\n",
        "\"\"\")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Describe sub-phenotypes by community\n",
        "\n",
        "I map the streamed Leiden results back to **Case** nodes, then for each community compute: **case count**, **top genders**, **top age groups**, **top outcomes**, and **top adverse reactions**. This answers whether certain demographics associate with specific adverse reactions and outcomes across sub-phenotypes."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>communityId</th>\n",
              "      <th>case_count</th>\n",
              "      <th>top_genders</th>\n",
              "      <th>top_age_groups</th>\n",
              "      <th>top_reactions</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>0</td>\n",
              "      <td>680</td>\n",
              "      <td>[M, F]</td>\n",
              "      <td>[Elderly, Adult, Adolescent]</td>\n",
              "      <td>[Pneumonia, Diarrhoea, Fatigue, Dyspnoea, Rash]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>9</td>\n",
              "      <td>576</td>\n",
              "      <td>[F, M]</td>\n",
              "      <td>[Adult, Elderly, Adolescent]</td>\n",
              "      <td>[Sinusitis, Headache, Fatigue, Product dose omission issue, Pneumonia]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>4</td>\n",
              "      <td>407</td>\n",
              "      <td>[F, M, U]</td>\n",
              "      <td>[Adult, Elderly, Adolescent]</td>\n",
              "      <td>[Pain, Fall, Drug ineffective, Dizziness, Feeling abnormal]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>10</td>\n",
              "      <td>404</td>\n",
              "      <td>[F, M]</td>\n",
              "      <td>[Adult, Elderly, Adolescent]</td>\n",
              "      <td>[Alopecia, Prostate cancer, Pain, Breast cancer, Psychological trauma]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>20</td>\n",
              "      <td>374</td>\n",
              "      <td>[M, F, U]</td>\n",
              "      <td>[Elderly, Adult, Adolescent]</td>\n",
              "      <td>[Febrile neutropenia, Sepsis, Pneumonia, Pyrexia, Cytokine release syndrome]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>22</td>\n",
              "      <td>353</td>\n",
              "      <td>[F, M, U]</td>\n",
              "      <td>[Adult, Elderly, Adolescent]</td>\n",
              "      <td>[Product dose omission issue, Drug dose omission by device, Device issue, Off label use, Wrong technique in product usage process]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>2</td>\n",
              "      <td>266</td>\n",
              "      <td>[F, M]</td>\n",
              "      <td>[Adult, Elderly, Adolescent]</td>\n",
              "      <td>[No adverse event, Product quality issue, Complication of device insertion, Complication of device removal, Device breakage]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>7</td>\n",
              "      <td>225</td>\n",
              "      <td>[F, M]</td>\n",
              "      <td>[Elderly, Adult, Child]</td>\n",
              "      <td>[Death, Dyspnoea, Hypotension, Pyrexia, Hypoxia]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>17</td>\n",
              "      <td>210</td>\n",
              "      <td>[F, M]</td>\n",
              "      <td>[Elderly, Adult, Child]</td>\n",
              "      <td>[Nausea, Fatigue, Diarrhoea, Vomiting, Dizziness]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>13</td>\n",
              "      <td>204</td>\n",
              "      <td>[F, M, U]</td>\n",
              "      <td>[Elderly, Adult, Adolescent]</td>\n",
              "      <td>[Chronic kidney disease, Acute kidney injury, Renal failure, Gastrooesophageal reflux disease, End stage renal disease]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>10</th>\n",
              "      <td>24</td>\n",
              "      <td>130</td>\n",
              "      <td>[F, M, U]</td>\n",
              "      <td>[Adult, Elderly, Adolescent]</td>\n",
              "      <td>[Dyspnoea, Headache, Syncope, Fluid retention, Vascular device infection]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>11</th>\n",
              "      <td>5</td>\n",
              "      <td>92</td>\n",
              "      <td>[F, M]</td>\n",
              "      <td>[Adult, Elderly, Infant]</td>\n",
              "      <td>[COVID-19 pneumonia, Headache, Fatigue, Multiple sclerosis, COVID-19]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>12</th>\n",
              "      <td>8</td>\n",
              "      <td>71</td>\n",
              "      <td>[F, M]</td>\n",
              "      <td>[Adult, Elderly, Child]</td>\n",
              "      <td>[Diarrhoea, Product dose omission issue, Dehydration, Weight decreased, Hospitalisation]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>13</th>\n",
              "      <td>25</td>\n",
              "      <td>62</td>\n",
              "      <td>[F, M]</td>\n",
              "      <td>[Elderly, Adult, Adolescent]</td>\n",
              "      <td>[Crohn's disease, Hypertension, Diarrhoea, Fall, Constipation]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>14</th>\n",
              "      <td>6</td>\n",
              "      <td>57</td>\n",
              "      <td>[F, M]</td>\n",
              "      <td>[Adult, Infant, Elderly]</td>\n",
              "      <td>[Alanine aminotransferase increased, Aspartate aminotransferase increased, Pyrexia, Dyspnoea, Fatigue]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>15</th>\n",
              "      <td>12</td>\n",
              "      <td>56</td>\n",
              "      <td>[F, M]</td>\n",
              "      <td>[Adult, Elderly, Adolescent]</td>\n",
              "      <td>[Hereditary angioedema, Inappropriate schedule of product administration, Fatigue, Nausea, Diarrhoea]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>16</th>\n",
              "      <td>28</td>\n",
              "      <td>38</td>\n",
              "      <td>[F, M]</td>\n",
              "      <td>[Adult, Elderly, Child]</td>\n",
              "      <td>[Embolism, Blindness, Uveitis, Eye irritation, Cholecystitis acute]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>17</th>\n",
              "      <td>3</td>\n",
              "      <td>37</td>\n",
              "      <td>[M, F]</td>\n",
              "      <td>[Adolescent, Adult, Child]</td>\n",
              "      <td>[Febrile neutropenia, Pyrexia, Vomiting, Dehydration, Hypoglycaemia]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>18</th>\n",
              "      <td>21</td>\n",
              "      <td>33</td>\n",
              "      <td>[F, M]</td>\n",
              "      <td>[Adult, Elderly]</td>\n",
              "      <td>[Product solubility abnormal, Headache, Product substitution issue, Bone density decreased, Drug ineffective]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>19</th>\n",
              "      <td>30</td>\n",
              "      <td>32</td>\n",
              "      <td>[F, M]</td>\n",
              "      <td>[Adult, Elderly, Adolescent]</td>\n",
              "      <td>[Off label use, Product dose omission issue, Pain, Fall, White blood cell count decreased]</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "    communityId  case_count top_genders                top_age_groups  \\\n",
              "0             0         680      [M, F]  [Elderly, Adult, Adolescent]   \n",
              "1             9         576      [F, M]  [Adult, Elderly, Adolescent]   \n",
              "2             4         407   [F, M, U]  [Adult, Elderly, Adolescent]   \n",
              "3            10         404      [F, M]  [Adult, Elderly, Adolescent]   \n",
              "4            20         374   [M, F, U]  [Elderly, Adult, Adolescent]   \n",
              "5            22         353   [F, M, U]  [Adult, Elderly, Adolescent]   \n",
              "6             2         266      [F, M]  [Adult, Elderly, Adolescent]   \n",
              "7             7         225      [F, M]       [Elderly, Adult, Child]   \n",
              "8            17         210      [F, M]       [Elderly, Adult, Child]   \n",
              "9            13         204   [F, M, U]  [Elderly, Adult, Adolescent]   \n",
              "10           24         130   [F, M, U]  [Adult, Elderly, Adolescent]   \n",
              "11            5          92      [F, M]      [Adult, Elderly, Infant]   \n",
              "12            8          71      [F, M]       [Adult, Elderly, Child]   \n",
              "13           25          62      [F, M]  [Elderly, Adult, Adolescent]   \n",
              "14            6          57      [F, M]      [Adult, Infant, Elderly]   \n",
              "15           12          56      [F, M]  [Adult, Elderly, Adolescent]   \n",
              "16           28          38      [F, M]       [Adult, Elderly, Child]   \n",
              "17            3          37      [M, F]    [Adolescent, Adult, Child]   \n",
              "18           21          33      [F, M]              [Adult, Elderly]   \n",
              "19           30          32      [F, M]  [Adult, Elderly, Adolescent]   \n",
              "\n",
              "                                                                                                                         top_reactions  \n",
              "0                                                                                      [Pneumonia, Diarrhoea, Fatigue, Dyspnoea, Rash]  \n",
              "1                                                               [Sinusitis, Headache, Fatigue, Product dose omission issue, Pneumonia]  \n",
              "2                                                                          [Pain, Fall, Drug ineffective, Dizziness, Feeling abnormal]  \n",
              "3                                                               [Alopecia, Prostate cancer, Pain, Breast cancer, Psychological trauma]  \n",
              "4                                                         [Febrile neutropenia, Sepsis, Pneumonia, Pyrexia, Cytokine release syndrome]  \n",
              "5   [Product dose omission issue, Drug dose omission by device, Device issue, Off label use, Wrong technique in product usage process]  \n",
              "6         [No adverse event, Product quality issue, Complication of device insertion, Complication of device removal, Device breakage]  \n",
              "7                                                                                     [Death, Dyspnoea, Hypotension, Pyrexia, Hypoxia]  \n",
              "8                                                                                    [Nausea, Fatigue, Diarrhoea, Vomiting, Dizziness]  \n",
              "9              [Chronic kidney disease, Acute kidney injury, Renal failure, Gastrooesophageal reflux disease, End stage renal disease]  \n",
              "10                                                           [Dyspnoea, Headache, Syncope, Fluid retention, Vascular device infection]  \n",
              "11                                                               [COVID-19 pneumonia, Headache, Fatigue, Multiple sclerosis, COVID-19]  \n",
              "12                                            [Diarrhoea, Product dose omission issue, Dehydration, Weight decreased, Hospitalisation]  \n",
              "13                                                                      [Crohn's disease, Hypertension, Diarrhoea, Fall, Constipation]  \n",
              "14                              [Alanine aminotransferase increased, Aspartate aminotransferase increased, Pyrexia, Dyspnoea, Fatigue]  \n",
              "15                               [Hereditary angioedema, Inappropriate schedule of product administration, Fatigue, Nausea, Diarrhoea]  \n",
              "16                                                                 [Embolism, Blindness, Uveitis, Eye irritation, Cholecystitis acute]  \n",
              "17                                                                [Febrile neutropenia, Pyrexia, Vomiting, Dehydration, Hypoglycaemia]  \n",
              "18                       [Product solubility abnormal, Headache, Product substitution issue, Bone density decreased, Drug ineffective]  \n",
              "19                                          [Off label use, Product dose omission issue, Pain, Fall, White blood cell count decreased]  "
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "# Map Leiden results to Case nodes and describe each community.\n",
        "\n",
        "df_communities = run_query(\"\"\"\n",
        "CALL gds.leiden.stream(\n",
        "  'patient-communities',\n",
        "  { relationshipWeightProperty: 'weight', randomSeed: 9 ,concurrency: 1}\n",
        ")\n",
        "YIELD nodeId, communityId\n",
        "WITH gds.util.asNode(nodeId) AS n, communityId\n",
        "WHERE n:Case\n",
        "WITH communityId, collect(n) AS cases\n",
        "WITH communityId, size(cases) AS case_count, cases\n",
        "\n",
        "// Store node IDs for re-collecting cases after aggregations\n",
        "WITH communityId, case_count, cases, [c IN cases | elementId(c)] AS case_ids\n",
        "\n",
        "// Top Reactions\n",
        "UNWIND cases AS c\n",
        "MATCH (c)-[:HAS_REACTION]->(r:Reaction)\n",
        "WHERE r.description IS NOT NULL\n",
        "WITH communityId, case_count, case_ids, r.description AS reaction\n",
        "WITH communityId, case_count, case_ids, reaction, count(*) AS rcnt\n",
        "ORDER BY communityId, rcnt DESC\n",
        "WITH communityId, case_count, case_ids, collect(reaction)[0..5] AS top_reactions\n",
        "\n",
        "// Re-collect cases using stored node IDs\n",
        "WITH communityId, case_count, top_reactions, case_ids\n",
        "UNWIND case_ids AS cid\n",
        "MATCH (c) WHERE elementId(c) = cid\n",
        "WITH communityId, case_count, top_reactions, case_ids, collect(c) AS cases\n",
        "\n",
        "// Top Genders\n",
        "UNWIND cases AS c\n",
        "WITH communityId, case_count, top_reactions, case_ids, c\n",
        "WHERE c.gender IS NOT NULL\n",
        "WITH communityId, case_count, top_reactions, case_ids, c.gender AS gender\n",
        "WITH communityId, case_count, top_reactions, case_ids, gender, count(*) AS gcnt\n",
        "ORDER BY communityId, gcnt DESC\n",
        "WITH communityId, case_count, top_reactions, case_ids, collect(gender)[0..3] AS top_genders\n",
        "\n",
        "// Re-collect cases using stored node IDs\n",
        "WITH communityId, case_count, top_reactions, top_genders, case_ids\n",
        "UNWIND case_ids AS cid\n",
        "MATCH (c) WHERE elementId(c) = cid\n",
        "WITH communityId, case_count, top_reactions, top_genders, case_ids, collect(c) AS cases\n",
        "\n",
        "// Top Age Groups\n",
        "UNWIND cases AS c\n",
        "MATCH (c)-[:FALLS_UNDER]->(ag:AgeGroup)\n",
        "WITH communityId, case_count, top_reactions, top_genders, case_ids, ag.ageGroup AS age_group\n",
        "WITH communityId, case_count, top_reactions, top_genders, case_ids, age_group, count(*) AS agcnt\n",
        "ORDER BY communityId, agcnt DESC\n",
        "WITH communityId, case_count, top_reactions, top_genders, case_ids, collect(age_group)[0..3] AS top_age_groups\n",
        "\n",
        "// Re-collect cases using stored node IDs\n",
        "WITH communityId, case_count, top_reactions, top_genders, top_age_groups, case_ids\n",
        "UNWIND case_ids AS cid\n",
        "MATCH (c) WHERE elementId(c) = cid\n",
        "WITH communityId, case_count, top_reactions, top_genders, top_age_groups, case_ids, collect(c) AS cases\n",
        "\n",
        "\n",
        "RETURN communityId,\n",
        "       case_count,\n",
        "       top_genders,\n",
        "       top_age_groups,\n",
        "       top_reactions\n",
        "ORDER BY case_count DESC;\n",
        "\n",
        "\"\"\")\n",
        "df_communities = df_communities.sort_values('case_count', ascending=False).reset_index(drop=True)\n",
        "display(df_communities)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "**Commentary:** \n",
        "\n",
        "The table above summarizes each Leiden community by: \n",
        "- number of cases, \n",
        "- most frequent genders, \n",
        "- most frequent age groups , and \n",
        "- most frequent adverse reactions. \n",
        "\n",
        "Rows are ordered by *case_count* descending so the largest communities appear first. Communities with a strong over-representation of one gender or age group and a small set of dominant reactions can be interpreted as sub-phenotypes (e.g., a cluster of predominantly elderly cases with similar reaction profiles). Comparing *top_genders*, *top_age_groups*, and *top_reactions* across communities helps answer whether patients sharing certain demographic characteristics appear more likely to suffer from specific adverse reactions; further analysis can drill into drug–reaction combinations for high-risk communities. \n",
        "\n",
        "\n",
        "\n",
        " **Interpretation of results:** \n",
        "\n",
        "The Leiden algorithm identified 20 distinct patient sub-phenotypes based on shared demographics and adverse reactions:\n",
        "\n",
        "- **Community 0** (680 cases) is the largest community: predominantly **Elderly, Adult and Adolescent** patients with mixed gender (**M, F**) showing reactions **Pneumonia, Diarrhoea, Fatigue, Dyspnoea, Rash**—consistent with respiratory and gastrointestinal adverse events in older adults, potentially related to immunosuppressive or oncology treatments.\n",
        "\n",
        "- **Community 9** (576 cases) shows **Adult,Elderly and Adolescent** patients (**F, M**) with **Sinusitis, Headache, Fatigue, Product dose omission issue, Pneumonia**—likely a respiratory/infection cluster with medication adherence issues, possibly related to chronic conditions requiring regular medication.\n",
        "\n",
        "- **Community 4** (407 cases) includes **Adult,Elderly and Adolescent** patients across all genders (**F, M, U**) with **Pain, Fall, Drug ineffective, Dizziness, Feeling abnormal**—suggests a general adverse event cluster with non-specific symptoms, possibly related to polypharmacy or medication interactions.\n",
        "\n",
        "- **Community 10** (404 cases) is **Adult,Elderly and Adolescent** (**F, M**) with **Alopecia, Prostate cancer, Pain, Breast cancer, Psychological trauma**—clearly an oncology sub-phenotype with cancer-specific reactions and treatment-related side effects (alopecia from chemotherapy).\n",
        "\n",
        "- **Community 20** (374 cases) shows **Elderly, Adult and Adolescent** patients (**M, F, U**) with **Febrile neutropenia, Sepsis, Pneumonia, Pyrexia, Cytokine release syndrome**—a severe infection/immunocompromised sub-phenotype, likely related to oncology or immunosuppressive therapy where patients are at high risk for serious infections.\n",
        "\n",
        "- **Community 22** (353 cases) includes **Adult,Elderly and Adolescent** (**F, M, U**) with device and medication adherence issues: **Product dose omission issue, Drug dose omission by device, Device issue, Off label use, Wrong technique in product usage process**—a device/medication management sub-phenotype, possibly related to complex medication delivery systems or patient education needs.\n",
        "\n",
        "- **Community 2** (266 cases) shows **Adult,Elderly and Adolescent** (**F, M**) with **No adverse event, Product quality issue, Complication of device insertion, Complication of device removal, Device breakage**—a device safety reporting cluster, focusing on device-related complications rather than drug reactions.\n",
        "\n",
        "- **Community 7** (225 cases) includes **Elderly, Adult, and Child** patients (**F, M**) with severe reactions: **Death, Dyspnoea, Hypotension, Pyrexia, Hypoxia**—a critical care sub-phenotype with life-threatening adverse events, warranting urgent safety surveillance.\n",
        "\n",
        "- **Community 17** (210 cases) shows **Elderly, Adult, and Child** (**F, M**) with gastrointestinal symptoms: **Nausea, Fatigue, Diarrhoea, Vomiting, Dizziness**—a common adverse event cluster, likely related to medications with well-known GI side effects.\n",
        "\n",
        "- **Community 13** (204 cases) includes **Elderly, Adult, and Adolescent** patients (**F, M, U**) with renal complications: **Chronic kidney disease, Acute kidney injury, Renal failure, Gastrooesophageal reflux disease, End stage renal disease**—a renal sub-phenotype, potentially related to nephrotoxic medications or pre-existing renal conditions.\n",
        "\n",
        "- **Smaller communities** (24, 5, 8, 25, 6, 12, 28, 3, 21, 30) range from 32–130 cases and show specialized patterns:\n",
        "  - **Community 5** (92 cases): **COVID-19 pneumonia, Headache, Fatigue, Multiple sclerosis, COVID-19**—pandemic-era cluster with COVID-related adverse events.\n",
        "  - **Community 6** (57 cases): **Alanine aminotransferase increased, Aspartate aminotransferase increased**—hepatotoxicity sub-phenotype.\n",
        "  - **Community 12** (56 cases): **Hereditary angioedema**—rare disease cluster.\n",
        "  - **Community 28** (38 cases): **Embolism, Blindness, Uveitis**—vascular and ocular complications.\n",
        "  - **Community 3** (37 cases): **Febrile neutropenia, Pyrexia, Vomiting, Dehydration, Hypoglycaemia**—pediatric/adolescent oncology cluster.\n",
        "\n",
        "**Key findings:**\n",
        "- **Age-group patterns** more clearly differentiate risk profiles than gender (which is generally mixed across communities).\n",
        "- **Geriatric sub-phenotypes** (Elderly-dominated communities) show higher rates of respiratory infections (Pneumonia), sepsis, and severe outcomes.\n",
        "- **Oncology-related communities** (Communities 10, 20) show characteristic reactions like alopecia, febrile neutropenia, and cytokine release syndrome.\n",
        "- **Device-related communities** (Communities 22, 2) highlight medication delivery and device safety issues.\n",
        "- The algorithm successfully identified clinically meaningful sub-phenotypes that could inform targeted safety surveillance and patient risk stratification."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 7. Conclusion\n",
        "\n",
        "This approach leverages an iterative discovery methodology, starting with foundational EDA to understand the graph structure, progressing through analytical queries that provide healthcare insights, and culminating in advanced GDS analyses that identify patient risk patterns. By anticipating challenges around schema discovery, query performance, and algorithm selection, I can proactively address potential issues and ensure a thorough, well-documented analysis.\n",
        "\n",
        "### Use of AI tools and LLMs\n",
        "\n",
        "For this assignment, I used large language models (LLMs) as a reference to compare available similarity and community detection algorithms in Neo4j’s GDS library. I used their summaries of pros and cons to help select Jaccard‑based Node Similarity and the Leiden community detection algorithm, and then validated these choices against the FAERS use case and official Neo4j documentation.\n",
        "\n",
        "The transcript to ChatGPT is https://chatgpt.com/share/69a4a840-46b4-8007-958c-f2b19b50679d and the transcript from Cursor is below."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---\n",
        "\n",
        "**User**\n",
        "\n",
        "can you compare the available node similarity algorithms in neo4j\n",
        "\n",
        "---\n",
        "\n",
        "**Cursor**\n",
        "\n",
        "### High-level comparison\n",
        "\n",
        "Neo4j’s Graph Data Science (GDS) library doesn’t have a huge list of totally separate “node similarity algorithms”; instead it has:\n",
        "\n",
        "- a **Node Similarity** algorithm (neighbors-based)\n",
        "- a **K-Nearest Neighbors (KNN)** algorithm (property / vector-based)\n",
        "- several **similarity functions** (Jaccard, Overlap, Cosine, Pearson, etc.) that these algorithms can use under the hood\n",
        "\n",
        "Below is a concise comparison focused on how you’d choose between them.\n",
        "\n",
        "---\n",
        "\n",
        "### 1. Node Similarity (neighbors-based)\n",
        "\n",
        "- **What it does**: Compares nodes based on **their neighbors** (e.g. two users are similar if they interact with many of the same items / people).\n",
        "- **Typical use-case**: “Users who viewed similar products”, “authors with similar co-authors”, “papers citing similar papers”.\n",
        "- **How it works**:  \n",
        "  - Builds neighbor sets (optionally weighted) from relationships.  \n",
        "  - Computes pairwise similarity, typically sparsified (top-k or threshold).\n",
        "- **Available similarity metrics**:\n",
        "  - **Jaccard** (set-based): intersection / union of neighbor sets.\n",
        "  - **Overlap coefficient**: intersection / min(|A|, |B|); favors nodes with smaller degree but strong shared neighbors.\n",
        "  - **Cosine similarity**: treats neighbor weights as vectors; best when relationships have **weights** (e.g. rating counts, interaction frequency).\n",
        "- **When to use**:\n",
        "  - Your notion of similarity is “share many neighbors”.\n",
        "  - You already have a bipartite or general interaction graph.\n",
        "  - You want relationships like `(:User)-[:SIMILAR_TO]->(:User)` based on shared items.\n",
        "\n",
        "---\n",
        "\n",
        "### 2. K-Nearest Neighbors (KNN) (property/vector-based)\n",
        "\n",
        "- **What it does**: Compares nodes by **numeric properties or embedding vectors** (e.g. feature vectors, FastRP / Node2Vec / GraphSAGE embeddings).\n",
        "- **Typical use-case**: “Find similar products by embedding vector”, “similar documents by text embedding”, “similar users by feature vector”.\n",
        "- **How it works**:\n",
        "  - Reads one or more numeric properties (often an embedding array).\n",
        "  - Uses a distance/similarity function to find each node’s **k nearest neighbors**.\n",
        "  - Creates relationships like `(:Item)-[:SIMILAR_TO {score}]->(:Item)`.\n",
        "- **Available similarity/distance measures** (via similarity functions):\n",
        "  - **Cosine similarity** (most common for embeddings).\n",
        "  - **Euclidean distance / Euclidean similarity**.\n",
        "  - **Pearson** (for correlation-style similarity on numeric vectors).\n",
        "- **When to use**:\n",
        "  - You already have or plan to compute **embeddings** (FastRP, Node2Vec, GraphSAGE, external ML).\n",
        "  - Similarity should reflect **latent structure** or rich features, not just shared neighbors.\n",
        "  - You need a fixed **k** similar nodes per node, e.g. for recommendation or ANN-style queries.\n",
        "\n",
        "---\n",
        "\n",
        "### 3. Standalone similarity functions (arrays / properties)\n",
        "\n",
        "These are lower-level **functions** that can be used directly or inside custom pipelines:\n",
        "\n",
        "- **Categorical / set-based**:\n",
        "  - **Jaccard**: for overlap between sets (IDs, categories, tags).\n",
        "  - **Overlap coefficient**: emphasizes strong overlap when one set is much smaller.\n",
        "- **Numeric / vector-based**:\n",
        "  - **Cosine**: angle between vectors; ideal for high-dimensional embeddings.\n",
        "  - **Pearson**: correlation; good when you care about shape/trend rather than absolute values.\n",
        "  - **Euclidean distance / similarity**: direct geometric distance.\n",
        "\n",
        "They’re the building blocks behind `gds.nodeSimilarity` and `gds.knn`.\n",
        "\n",
        "---\n",
        "\n",
        "### Choosing between them\n",
        "\n",
        "- **You have a graph of interactions / co-occurrences** (e.g. users–items, authors–papers):\n",
        "  - **Use `gds.nodeSimilarity`** with:\n",
        "    - **Jaccard / Overlap** for unweighted edges and pure “co-occurrence” style similarity.\n",
        "    - **Cosine** if relationships are **weighted**.\n",
        "\n",
        "- **You have feature vectors or embeddings on nodes**:\n",
        "  - **Use `gds.knn`** on a vector property, with:\n",
        "    - **Cosine** for embeddings (most standard).\n",
        "    - **Euclidean** when your feature space is naturally Euclidean.\n",
        "\n",
        "- **You’re building a custom scoring logic in Cypher or a GDS pipeline**:\n",
        "  - Use **similarity functions** (Jaccard, Cosine, Pearson, etc.) directly on arrays or vectors.\n",
        "\n",
        "---\n",
        "\n",
        "If you tell me your specific use-case (e.g. “I have students and courses and I want similar students” or “I have article embeddings”), I can recommend a precise algorithm + config and sketch the Cypher / GDS calls for it.\n",
        "\n",
        "---\n",
        "\n",
        "**User**\n",
        "\n",
        "great work, lets also compare community detection algorithms\n",
        "\n",
        "---\n",
        "\n",
        "**Cursor**\n",
        "\n",
        "### Overview\n",
        "\n",
        "Neo4j GDS has several core **community detection** algorithms, each with different trade-offs in speed, quality, and assumptions about communities.\n",
        "\n",
        "---\n",
        "\n",
        "### 1. Leiden\n",
        "\n",
        "- **Goal**: High-quality, well-connected communities via **modularity optimization**, improving on Louvain.\n",
        "- **Strengths**:\n",
        "  - **Better community quality** and fewer “broken” communities than Louvain.\n",
        "  - Handles large graphs; hierarchical (can give communities at multiple resolutions).\n",
        "- **Weaknesses**:\n",
        "  - More complex than Label Propagation; a bit heavier than LPA.\n",
        "- **Use when**:  \n",
        "  - You care about **community quality** and **well-connectedness**.\n",
        "  - You’re okay with a bit more compute for better structure.\n",
        "\n",
        "---\n",
        "\n",
        "### 2. Louvain\n",
        "\n",
        "- **Goal**: Detect communities by **maximizing modularity**.\n",
        "- **Strengths**:\n",
        "  - Well-known, widely used; good quality vs performance.\n",
        "  - Hierarchical: gives multiple levels of communities.\n",
        "- **Weaknesses**:\n",
        "  - Can produce communities that are **not well connected internally** (one reason Leiden was introduced).\n",
        "- **Use when**:\n",
        "  - You want modularity-based clustering and don’t need Leiden’s extra guarantees.\n",
        "  - You want a good default and performance is more important than the last bit of quality.\n",
        "\n",
        "---\n",
        "\n",
        "### 3. Label Propagation (LPA)\n",
        "\n",
        "- **Goal**: Very fast, approximate communities by **propagating labels** among neighbors.\n",
        "- **Strengths**:\n",
        "  - **Very fast and scalable**, simple to run.\n",
        "  - No need to set an objective like modularity.\n",
        "- **Weaknesses**:\n",
        "  - More **unstable** (results can vary run-to-run).\n",
        "  - Community quality generally lower than Louvain/Leiden; can give very coarse or noisy partitions.\n",
        "- **Use when**:\n",
        "  - You need a **quick, cheap** community detection.\n",
        "  - You want a first-pass structure, then maybe refine with another method.\n",
        "\n",
        "---\n",
        "\n",
        "### 4. Connected Components (WCC / SCC)\n",
        "\n",
        "- **Goal**: Find **connected subgraphs**, not “communities” in the modularity sense.\n",
        "  - **WCC**: Weakly Connected Components (ignores edge direction).\n",
        "  - **SCC**: Strongly Connected Components (direction-respecting, each node reachable from each other).\n",
        "- **Strengths**:\n",
        "  - Extremely **fast and simple**.\n",
        "  - Gives a hard partition based solely on connectivity.\n",
        "- **Weaknesses**:\n",
        "  - Doesn’t consider density or modularity; large components may still contain many sub-communities.\n",
        "- **Use when**:\n",
        "  - You need the **basic connectivity structure**: which nodes are in the same “island”.\n",
        "  - As a **preprocessing** step (e.g., run Louvain/Leiden/LPA inside each WCC).\n",
        "\n",
        "---\n",
        "\n",
        "### How to choose (rule of thumb)\n",
        "\n",
        "- **Best overall quality**: **Leiden**  \n",
        "- **Balanced & classic**: **Louvain**\n",
        "- **Fastest approximate grouping**: **Label Propagation**\n",
        "- **Pure connectivity partitioning**: **Connected Components (WCC/SCC)**\n",
        "\n",
        "If you describe your graph (directed/undirected, size, and what “community” means in your assignment), I can suggest a specific algorithm and show the exact GDS calls you’d use.\n",
        "\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": ".venv",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.13.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}