Commit 134f6446 authored by Helene Coullon

TP3 correction and new TP4

parent 370b0823
FROM quay.io/jupyter/datascience-notebook
USER root
ARG openjdk_version="17"
# Install a headless JRE (needed by PySpark) and the build dependencies of mysqlclient
RUN apt-get update --yes && \
    apt-get install --yes --no-install-recommends \
    "openjdk-${openjdk_version}-jre-headless" \
    ca-certificates-java \
    default-libmysqlclient-dev \
    build-essential \
    pkg-config && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
RUN pip install --upgrade pip
# Install the Python dependencies of the lab, then restore the permissions expected by the Jupyter base image
COPY requirements.txt /home/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /home/requirements.txt && \
    fix-permissions "${CONDA_DIR}" && \
    fix-permissions "/home/${NB_USER}"
# Drop back to the unprivileged notebook user
USER ${NB_UID}
\ No newline at end of file
version: '3.1'
services:
  spark:
    image: docker.io/bitnami/spark:3.5
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - '8080:8080'
      - '7077:7077'
  spark-worker:
    image: docker.io/bitnami/spark:3.5
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G # can be changed
      - SPARK_WORKER_CORES=1 # can be changed
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
  minio:
    image: minio/minio
    container_name: minio
    environment:
      MINIO_ROOT_USER: root
      MINIO_ROOT_PASSWORD: password
    command: server /data --console-address ":9001"
    ports:
      - "19000:9000"
      - "19001:9001"
  notebook:
    image: grosinosky/bigdata_fila3_jupyter
    build:
      context: .
      dockerfile: Dockerfile
    container_name: jupyter
    ports:
      - "8888:8888"
      - "4040:4040"
    environment:
      JUPYTER_ENABLE_LAB: "yes"
    command: start-notebook.py --NotebookApp.token=''
\ No newline at end of file
#pandas==2.2.3 # add locally
mysqlclient==2.2.4
jupysql==0.10.14
#seaborn==0.13.2 # add locally
grpcio==1.59.0
pymongo==4.10.1
pyspark==3.5.3
hdfs==2.7.3
minio==7.2.10
docker==7.1.0
kafka-python==2.0.2; python_version < '3.12'
kafka-python @ git+https://github.com/dpkp/kafka-python.git ; python_version >= '3.12'
\ No newline at end of file
# MapReduce and Spark

You will need to shut down the Docker services from the previous lab sessions (TPs). You can do so with the `docker compose down` command in the relevant directories, or with the Docker plugin of VS Code.

In this lab we first use Python's native map/reduce facilities, in the notebook [tp-python](tp-python.ipynb). The code of the functions presented in the lecture is also available there.

The second part of the lab is about Spark, in the notebook [tp-spark](tp-spark.ipynb) (to be done after the corresponding part of the lecture!). Besides the PageRank illustration from the lecture, it contains a few exercises to get started with Spark using the [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) library.
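Before diving into the notebooks, here is a minimal sketch (not taken from the lab material; the application name and the toy computation are purely illustrative) of how a Spark session can be opened from the notebook container against the master defined in the Docker Compose file shown above:

```python
from pyspark.sql import SparkSession

# "spark://spark:7077" matches SPARK_MASTER_URL in the compose file; the hostname
# "spark" resolves because the notebook runs on the same compose network.
spark = (SparkSession.builder
         .master("spark://spark:7077")
         .appName("tp-spark-check")   # illustrative name
         .getOrCreate())

# Tiny sanity check: distribute a range of integers, square them, sum the results.
rdd = spark.sparkContext.parallelize(range(10))
print(rdd.map(lambda x: x * x).reduce(lambda a, b: a + b))

spark.stop()
```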
%% Cell type:markdown id: tags:
# MapReduce
The code below is the one presented in the lecture. Of course, in this setting we are limited by the computing power and memory of the machine being used.
%% Cell type:code id: tags:
``` python
from itertools import chain
from functools import reduce
```
%% Cell type:markdown id: tags:
## Map
%% Cell type:code id: tags:
``` python
# Define flat_map using map and itertools.chain
def flat_map(func, iterable):
    return list(chain.from_iterable(map(func, iterable)))
# Initial list of words
words = ["The", "Dark", "Knight", "Rises"]
# Map to get the length of each word
lengths = list(map(len, words))
print("Lengths:", lengths)
# Map to get each word as a list of characters
list_of_chars = list(map(list, words))
print("List of characters:", list_of_chars)
# Map to get the ASCII value of the first character of each word
list_of_asciis = list(map(lambda word: ord(word[0]), words))
print("List of ASCII values:", list_of_asciis)
# FlatMap to flatten all characters
chars = flat_map(list, words)
print("Flattened characters:", chars)
# Map to increment each word length by 1
incs = list(map(lambda length: length + 1, lengths))
print("Incremented lengths:", incs)
```
%% Cell type:markdown id: tags:
## Reduce
%% Cell type:code id: tags:
``` python
words = ["The", "Dark", "Knight", "Rises"]
lengths = list(map(len, words)) # Creates a list of lengths: [3, 4, 6, 5]
# Concatenates all elements in `words` into a single string.
res1 = reduce(lambda x, y: x + y, words) if words else None
print("res1:", res1)
# Concatenates all elements in `words`, then adds "AndFalls" at the end.
res2 = reduce(lambda x, y: x + y, words + ["AndFalls"])
print("res2:", res2)
# Concatenates "NaNa" at the beginning, then adds all elements in `words`.
res3 = reduce(lambda x, y: x + y, ["NaNa"] + words)
print("res3:", res3)
# Takes the first letter of each word in `words` and concatenates them.
res4 = reduce(lambda x, y: x + y, map(lambda word: word[0], words))
print("res4:", res4)
# Sums up all the elements in `lengths`, which represents the total length of all words.
res5 = reduce(lambda x, y: x + y, lengths)
print("res5:", res5)
```
%% Cell type:markdown id: tags:
## Exercise
%% Cell type:markdown id: tags:
Write a program that computes the total distance along a series of 2D points connected sequentially, as a route.
You are given a list of 2D points, each point being represented by a tuple (x, y).
Use map to compute the distance between each pair of consecutive points.
Then use reduce to compute the total distance of the route connecting all the points; a sketch of one possible solution is given after the example data below.
```python
points = [(0, 0), (3, 4), (7, 1), (10, 10)]
```
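One possible sketch, assuming the straight-line (Euclidean) distance between consecutive points; the intermediate names `pairs` and `distances` are only illustrative:
``` python
from functools import reduce
from math import sqrt

points = [(0, 0), (3, 4), (7, 1), (10, 10)]

# map step: turn each pair of consecutive points into the distance between them
pairs = zip(points, points[1:])
distances = map(lambda p: sqrt((p[1][0] - p[0][0]) ** 2 + (p[1][1] - p[0][1]) ** 2), pairs)

# reduce step: sum the segment lengths to obtain the total length of the route
total = reduce(lambda x, y: x + y, distances)
print("Total distance:", total)
```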