# # PPCES - Task: Regression with California Housing Dataset
# In this task, participants apply preprocessing and regression techniques to the real-world California housing dataset from scikit-learn, which lists houses, their attributes, and their prices. The task involves:
# - Loading the dataset
# - Applying preprocessing techniques such as feature standardization
# - Training and evaluating the regression model
# - Visualizing results and scores
# %% [markdown]
# ### Step 1: Load desired Python modules
# %%
import time
import matplotlib.pyplot as plt
from sklearn import datasets, preprocessing, metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
# %% [markdown]
# ### Step 2: Loading the dataset
# Load the dataset and print out the following information
# - Description of the dataset
# - Array shape of the feature data that is used for training the model
# - Array shape of the label / target data
# - List of the feature names
# - List of the target names
# %%
# load the dataset
dataset = datasets.fetch_california_housing()
# print desired information
print(f"Data set description:\n{dataset.DESCR}")
print(f"Shape of feature / training data: {dataset.data.shape}")
print(f"Shape of label data: {dataset.target.shape}")
print(f"Feature names: {dataset.feature_names}")
print(f"Target names: {dataset.target_names}")
# %% [markdown]
# ### Step 3: Data preprocessing
# Run the following:
# - Determine the min, max and avg values per feature
#   - You should see that the value ranges are quite different per feature
#   - Some models work better if all features have a similar order of magnitude
# - Split the data into train and test splits
# - Apply standardization to the training data split
#   - Note: of course you also need to apply the same standardization to the test split later!
# %%
# determine the min, max and avg values per feature
print(f"Min per feature: {dataset.data.min(axis=0)}")
print(f"Max per feature: {dataset.data.max(axis=0)}")
print(f"Avg per feature: {dataset.data.mean(axis=0)}")
# split the data into train and test splits
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2, random_state=42)
# fit the scaler on the training split only, then standardize the training data
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
print(f"Scaler mean values (based on training set):\n{scaler.mean_}")
print(f"Scaler scale values (based on training set):\n{scaler.scale_}")
# don't forget to apply the same scaler to the test data split
X_test = scaler.transform(X_test)
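As a quick standalone sanity check (a sketch on toy data, separate from the task pipeline; the variable names here are our own), a `StandardScaler` fitted on training data transforms that data so each feature has mean ≈ 0 and standard deviation ≈ 1:

```python
import numpy as np
from sklearn import preprocessing

# toy training data with very different value ranges per feature
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

toy_scaler = preprocessing.StandardScaler().fit(X_toy)
X_scaled = toy_scaler.transform(X_toy)

# after standardization, each feature column has mean ~0 and std ~1
print(X_scaled.mean(axis=0))  # ~ [0. 0.]
print(X_scaled.std(axis=0))   # ~ [1. 1.]
```

This is why the scaler must be fit on the training split only: its `mean_` and `scale_` are statistics of the training data, which are then reused unchanged on the test split.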
# %% [markdown]
# ### Step 4: Train a Support Vector Regression (SVR) or RandomForest Regression model
# For SVR, use the following parameters:
# - `C=1.0` (regularization parameter)
# - `epsilon=0.2` (specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value)
#
# For RandomForest, use the following parameters:
# - `n_estimators=10` (number of trees in the forest)
# - `random_state=42`
# - `max_depth=None` (maximum depth of the trees; `None` grows each tree until its leaves are pure)
# - You can also play around with number of trees and depth
# %%
# create and initialize the model
model = SVR(C=1.0, epsilon=0.2)
# model = RandomForestRegressor(n_estimators=10, max_depth=None, random_state=42)
# train / fit the model
elapsed_time = time.time()
model = model.fit(X_train, y_train)
elapsed_time = time.time() - elapsed_time
print(f"Elapsed time for training: {elapsed_time} sec")
# %% [markdown]
# ### Step 5: Evaluate the model using typical scoring functions
# To evaluate the model performance, do the following:
# - Predict the housing prices for the test split with the trained model (applying on unseen data)
# - Plot both original (ground truth) and predicted values
# - Determine `r2_score`, `mean_absolute_error` and `mean_absolute_percentage_error` for the prediction
# - Note: There are multiple scoring functions for different purposes. You can find more information here: https://scikit-learn.org/stable/modules/model_evaluation.html
# %%
# predict housing prices for test split
y_pred = model.predict(X_test)
# Plot both original (ground truth) and predicted values. Limited to the first 100 values to see differences
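One way to finish this step is sketched below. The helper name `plot_and_score` is our own choice (not part of the task template); it assumes the `y_test` and `y_pred` arrays produced in the cells above:

```python
import matplotlib.pyplot as plt
from sklearn import metrics

def plot_and_score(y_true, y_pred, n_show=100):
    """Plot ground truth vs. predicted values and return common regression scores."""
    # plot only the first n_show values so individual differences stay visible
    plt.figure(figsize=(10, 4))
    plt.plot(y_true[:n_show], label="ground truth")
    plt.plot(y_pred[:n_show], label="prediction")
    plt.xlabel("test sample index")
    plt.ylabel("median house value")
    plt.legend()
    plt.show()
    # compute the scores requested in the task description
    return {
        "r2_score": metrics.r2_score(y_true, y_pred),
        "mean_absolute_error": metrics.mean_absolute_error(y_true, y_pred),
        "mean_absolute_percentage_error": metrics.mean_absolute_percentage_error(y_true, y_pred),
    }
```

Usage in the notebook would then be `scores = plot_and_score(y_test, y_pred)` followed by printing `scores`; a perfect model would reach `r2_score` of 1.0 and both error metrics at 0.0.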