-
Notifications
You must be signed in to change notification settings - Fork 328
Samples Revamp: ML - Batch Sentiment Analysis #320
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 66 commits
c07f36d
4813b69
ace0991
7a76d5c
b4dc557
eee2219
e45fb2a
118f660
e951fa7
b8219f3
3b281cb
052c536
ea34fcd
308bdbb
b76d1c7
05226f6
3aba245
46403c0
6110d61
3e23889
8ff093a
77d7a76
99252e8
d79d3c7
b540a41
bf53a34
0a16ba4
e0f8cc5
70e649c
e7674e0
5b33662
efff9bd
2c21224
798804d
dd6a227
6a4583a
36683a3
7998fda
db654bc
eacf199
3f361fc
25a31a0
0d79f89
d6c1045
5541fd8
25f0745
26d8365
a0ea3f1
a42384d
535b071
491fa26
8e91c8f
fef02e3
9417459
4112ac4
d923e3d
674d614
f033d1e
13f4b12
50d65fc
d1a54a3
d6c97e5
1df4a7d
ef8bce5
cfb25f8
2e1735f
91b1f5b
f74b373
878db57
817f6fe
3becbda
0a3b8c5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
// Licensed to the .NET Foundation under one or more agreements. | ||
// The .NET Foundation licenses this file to you under the MIT license. | ||
// See the LICENSE file in the project root for more information. | ||
|
||
using System; | ||
using System.Collections.Generic; | ||
using Microsoft.ML; | ||
using Microsoft.ML.Data; | ||
using Microsoft.Spark.Sql; | ||
|
||
namespace Microsoft.Spark.Examples.MachineLearning.Sentiment | ||
{ | ||
/// <summary> | ||
/// Example of using ML.NET + .NET for Apache Spark | ||
/// for sentiment analysis. | ||
/// </summary> | ||
//public class Program | ||
internal sealed class Program : IExample | ||
{ | ||
public void Run(string[] args) | ||
{ | ||
if (args.Length != 2) | ||
{ | ||
Console.Error.WriteLine( | ||
"Usage: <path to yelptest.csv> <path to MLModel.zip>"); | ||
Environment.Exit(1); | ||
} | ||
|
||
SparkSession spark = SparkSession | ||
.Builder() | ||
.AppName(".NET for Apache Spark Sentiment Analysis") | ||
.GetOrCreate(); | ||
|
||
// Read in and display Yelp reviews | ||
DataFrame df = spark | ||
.Read() | ||
.Option("header", true) | ||
.Option("inferSchema", true) | ||
.Csv(args[0]); | ||
df.Show(); | ||
|
||
// Use ML.NET in a UDF to evaluate each review | ||
spark.Udf().Register<string, bool>( | ||
"MLudf", | ||
(text) => Sentiment(text, args[1])); | ||
|
||
// Use Spark SQL to call ML.NET UDF | ||
// Display results of sentiment analysis on reviews | ||
df.CreateOrReplaceTempView("Reviews"); | ||
DataFrame sqlDf = spark.Sql("SELECT ReviewText, MLudf(ReviewText) FROM Reviews"); | ||
sqlDf.Show(); | ||
|
||
// Print out first 20 rows of data | ||
// Prevent data getting cut off by setting truncate = 0 | ||
sqlDf.Show(20, 0, false); | ||
|
||
spark.Stop(); | ||
} | ||
|
||
// Method to call ML.NET code for sentiment analysis | ||
// Code primarily comes from ML.NET Model Builder | ||
public static bool Sentiment(string text, string modelPath) | ||
{ | ||
MLContext mlContext = new MLContext(); | ||
bamurtaugh marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
||
ITransformer mlModel = mlContext | ||
.Model | ||
.Load(modelPath, out var modelInputSchema); | ||
|
||
PredictionEngine<Review, ReviewPrediction> predEngine = mlContext | ||
.Model | ||
bamurtaugh marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
.CreatePredictionEngine<Review, ReviewPrediction>(mlModel); | ||
|
||
ReviewPrediction result = predEngine.Predict( | ||
new Review { ReviewText = text }); | ||
|
||
// Returns true for positive review, false for negative | ||
return result.Prediction; | ||
} | ||
|
||
// Class to represent each review | ||
public class Review | ||
{ | ||
// Column name must match input file | ||
[LoadColumn(0)] | ||
public string ReviewText; | ||
} | ||
|
||
// Class resulting from ML.NET code including predictions about review | ||
public class ReviewPrediction : Review | ||
{ | ||
[ColumnName("PredictedLabel")] | ||
public bool Prediction { get; set; } | ||
|
||
public float Probability { get; set; } | ||
|
||
public float Score { get; set; } | ||
} | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,204 @@ | ||
# Sentiment Analysis with Big Data | ||
|
||
In this sample, you'll see how to use [.NET for Apache Spark](https://dotnet.microsoft.com/apps/data/spark) | ||
and [ML.NET](https://dotnet.microsoft.com/apps/machinelearning-ai/ml-dotnet) to determine if | ||
statements are positive or negative, a task known as **sentiment analysis**. | ||
|
||
## Problem | ||
|
||
Our goal here is to determine if online reviews are positive or negative. We'll be using .NET for Apache Spark to read in a dataset of reviews and ML.NET to perform **binary classification** since categorizing reviews involves choosing one of two groups: positive or negative. You can read more about the problem through the [ML.NET documentation](https://docs.microsoft.com/en-us/dotnet/machine-learning/tutorials/sentiment-analysis). | ||
|
||
## Dataset | ||
|
||
We'll be using a set of **Yelp reviews** as the input data for this example. We've divided the set of reviews into two smaller datasets: [yelptrain.csv](./Resources/yelptrain.csv) for training the sentiment analysis model, and [yelptest.csv](./Resources/yelptest.csv) for testing in our Spark + ML app. | ||
|
||
For the specific ML training/predictions in this app (i.e. when using Model Builder), it helps to have a header for the data, and we thus also introduced headers into the Yelp training and testing datasets. **ReviewText** holds the review itself, and **Sentiment** holds either a 0 to indicate negative sentiment or a 1 to indicate positive sentiment. | ||
|
||
You can [download the original Yelp data](https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip) from the [UCI Sentiment Labeled Sentences Dataset]( https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences). | ||
|
||
## Solution | ||
|
||
We'll first train an ML model using ML.NET, and then we'll create a new application that uses both .NET for Apache Spark and ML.NET. | ||
|
||
> **Note:** All of the necessary files (trained ML model, .NET for Spark + ML.NET application code, training and testing datasets) have been included in this project. You can follow the steps below to understand how the project/files were developed, recreate them yourself, and then adapt the steps to future applications. | ||
|
||
## ML.NET | ||
|
||
[ML.NET](dot.net/ml) is an open source and cross-platform machine learning framework that allows .NET developers to easily integrate ML into .NET apps without any prior ML experience. | ||
|
||
### 1. Download Model Builder | ||
|
||
We'll use ML.NET to build and train a model through [Model Builder](https://dotnet.microsoft.com/apps/machinelearning-ai/ml-dotnet/model-builder), a Visual Studio extension that provides an easy to understand visual interface to build, train, and deploy machine learning models. Model Builder can be downloaded [here](https://marketplace.visualstudio.com/items?itemName=MLNET.07). | ||
|
||
 | ||
|
||
> **Note:** You can also develop a model without Model Builder. Model Builder just provides an easier way to develop a model that doesn't require prior ML experience. | ||
|
||
### 2. Build and Train Your Model | ||
|
||
Follow the [Model Builder Getting Started Guide](https://dotnet.microsoft.com/learn/machinelearning-ai/ml-dotnet-get-started-tutorial/intro) to train your model using the sentiment analysis scenario. | ||
|
||
Follow the steps to: | ||
|
||
* Create a new C# Console App | ||
* Pick the **Sentiment Analysis** scenario | ||
* Train using the **yelptrain.csv** dataset | ||
|
||
 | ||
|
||
### 3. Generate Model and Code | ||
|
||
In the last step of using Model Builder, you'll produce a zip file containing the trained ML.NET model. In the image below, you can see it's contained in **MLModel.zip.** | ||
|
||
 | ||
|
||
You'll also generate C# code you can use to consume your model in other .NET apps, like the Spark app we'll be creating. | ||
|
||
 | ||
|
||
### 4. Add ML.NET to .NET for Apache Spark App | ||
|
||
You have a few options to start creating a .NET for Apache Spark app that uses this ML.NET code and trained model. Make sure that in any app you develop, you've added the [Microsoft.ML NuGet Package](https://www.nuget.org/packages/Microsoft.ML). | ||
|
||
Depending upon the algorithm Model Builder chooses for your ML model, you may need to add an additional nuget reference in [Microsoft.Spark.CSharp.Examples.csproj](../../Microsoft.Spark.CSharp.Examples.csproj). For instance, if you get an error message that Microsoft.ML.FastTree cannot be found when running your Spark app, you need to add that nuget to your csproj file: | ||
|
||
`<PackageReference Include="Microsoft.ML.FastTree" Version="1.3.1" />` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this seems different from what we are referencing in our csproj? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Steve and I learned that depending on what algorithm is used in the model, you may have to add a different reference in your .csproj. Originally we needed |
||
|
||
#### Option 1: Add Projects | ||
|
||
One option is to use Model Builder's *Add Projects* feature, which will result in 3 projects: | ||
|
||
* Your original app (**myMLApp**) | ||
* A console app that allows you to build/train/test the model (**myMLAppML.ConsoleApp**) | ||
* A .NET Standard class library that contains model input/output and your trained model in a zip file (**myMLAppML.Model**) | ||
|
||
You would begin writing your .NET for Apache Spark code (and paste in the code generated from Model Builder, shown in step 3 above) in the original app **myMLApp.** | ||
|
||
 | ||
|
||
#### Option 2: Create a new console app (shown in this repo) | ||
|
||
Rather than working with the projects/files produced by Model Builder's Add Projects, you can create a new, separate C# console app. You just need to copy over your model's zip file to a directory your new console app can access. In this repo, a trained model **MLModel.zip** has already been included for you in the [Resources](./Resources) folder. | ||
|
||
As we create the logic for our Spark app, we'll paste in the code generated from Model Builder and include some other class definitions. | ||
|
||
## .NET for Spark | ||
|
||
Now that we've trained an ML.NET model for sentiment analysis, we can begin writing the .NET for Spark code that will read in our Yelp test data, pass each review to our ML.NET model, and predict whether reviews are positive or negative. | ||
|
||
### 1. Create a Spark Session | ||
|
||
In any Spark application, we need to establish a new SparkSession, which is the entry point to programming Spark with the Dataset and | ||
DataFrame API. | ||
|
||
```CSharp | ||
SparkSession spark = SparkSession | ||
.Builder() | ||
.AppName(".NET for Apache Spark Sentiment Analysis") | ||
.GetOrCreate(); | ||
``` | ||
|
||
### 2. Read Input File into a DataFrame | ||
|
||
We trained our model with the yelptrain.csv data, so let's test how well the model performs by testing it with the yelptest.csv dataset. | ||
|
||
```CSharp | ||
DataFrame df = spark.Read().Csv(<Path to yelp testing data set>); | ||
``` | ||
|
||
If we want to specify some other aspects of our data, such as whether it has a header and how we want to deal with its schema, we can set some other options when reading in our data: | ||
|
||
```CSharp | ||
DataFrame df = spark | ||
.Read() | ||
.Option("header", true) | ||
.Option("inferSchema", true) | ||
.Csv(<Path to yelp testing data set>); | ||
``` | ||
|
||
### 3. Use a UDF to Access ML.NET | ||
|
||
We create a User Defined Function (UDF) that calls the *Sentiment* method on each Yelp review. | ||
|
||
```CSharp | ||
spark.Udf().Register<string, bool>("MLudf", (text) => Sentiment(text)); | ||
``` | ||
|
||
*Sentiment* is where we'll call our ML.NET code that was generated from the final step of Model Builder. The initial code in *Sentiment* sets up the necessary context for ML.NET to perform its prediction: | ||
|
||
```CSharp | ||
MLContext mlContext = new MLContext(); | ||
|
||
ITransformer mlModel = mlContext | ||
.Model | ||
.Load(modelPath, out var modelInputSchema); | ||
|
||
PredictionEngine<Review, ReviewPrediction> predEngine = mlContext | ||
.Model | ||
.CreatePredictionEngine<Review, ReviewPrediction>(mlModel); | ||
``` | ||
bamurtaugh marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
You may notice the use of *Review* and *ReviewPrediction.* These are classes we define in our project to represent the review data we're evaluating: | ||
|
||
```CSharp | ||
public class Review | ||
{ | ||
// Represents the input review text | ||
[LoadColumn(0)] | ||
public string ReviewText; | ||
} | ||
``` | ||
|
||
```CSharp | ||
public class ReviewPrediction : Review | ||
{ | ||
[ColumnName("PredictedLabel")] | ||
public bool Prediction { get; set; } | ||
|
||
public float Probability { get; set; } | ||
|
||
public float Score { get; set; } | ||
} | ||
``` | ||
|
||
The latter part of *Sentiment* passes the review from **yelptest.csv** to the ML model and returns a prediction (either *true* for positive sentiment or *false* for negative): | ||
|
||
```CSharp | ||
ReviewPrediction result = predEngine.Predict(new Review { ReviewText = text }); | ||
return result.Prediction; | ||
``` | ||
|
||
### 4. Spark SQL and Running Your Code | ||
|
||
Now that you've read in your data and incorporated ML, use Spark SQL to call the UDF that will run sentiment analysis on each row of your DataFrame: | ||
|
||
```CSharp | ||
DataFrame sqlDf = spark.Sql("SELECT ReviewText, MLudf(ReviewText) FROM Reviews"); | ||
``` | ||
|
||
Once you run your code, you'll be performing sentiment analysis with ML.NET and .NET for Spark! | ||
|
||
## Running Your App | ||
|
||
There are a few steps you'll need to follow to build and run your app: | ||
|
||
* Move to your app's root folder (i.e. *Sentiment*) | ||
* Clean and publish your app | ||
* Move to your app's `publish` folder | ||
* `spark-submit` your app from within the `publish` folder | ||
|
||
#### Windows Example: | ||
|
||
```powershell | ||
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local /path/to/microsoft-spark-<version>.jar Microsoft.Spark.CSharp.Examples.exe MachineLearning.Sentiment.Program /path/to/Microsoft.Spark.CSharp.Examples/MachineLearning/Sentiment/Resources | ||
``` | ||
|
||
> **Note:** Be sure to update the above command with the actual paths to your Microsoft Spark jar file and the Resources folder containing your MLModel.zip and yelptest.csv. | ||
|
||
## Next Steps | ||
|
||
Check out the [full coding example](./Program.cs). You can also view a live video explanation of this app and combining ML.NET + .NET for Spark in the [.NET for Apache Spark 101 video series](https://www.youtube.com/watch?v=i1AaZXzZsFY&list=PLdo4fOcmZ0oXklB5hhg1G1ZwOJTEjcQ5z&index=6&t=2s). | ||
|
||
Rather than performing batch processing (analyzing data that's already been stored), we can adapt our Spark + ML.NET app to instead perform real-time processing with structured streaming. | ||
|
||
Check out SentimentAnalysisStream in the [MachineLearning folder](../) to see the adapted version of the sentiment analysis program that will determine the sentiment of text live as it's typed into a terminal. |
Uh oh!
There was an error while loading. Please reload this page.