Samples Revamp: ML - Batch Sentiment Analysis #320
Merged
72 commits
c07f36d Merge pull request #1 from dotnet/master (bamurtaugh)
4813b69 Merge pull request #2 from dotnet/master (bamurtaugh)
ace0991 Add readmes (bamurtaugh)
7a76d5c Add sentiment analysis project and ML.NET ref (bamurtaugh)
b4dc557 Add readme (bamurtaugh)
eee2219 Create general samples readme (bamurtaugh)
e45fb2a Moving general readmes to other branch (bamurtaugh)
118f660 Move general readmes to other branch (bamurtaugh)
e951fa7 Merge branch 'master' into newsamples-mlbatch (bamurtaugh)
b8219f3 Merge branch 'master' into newsamples-mlbatch (bamurtaugh)
3b281cb Moved to general readmes PR (bamurtaugh)
052c536 Add separate project (bamurtaugh)
ea34fcd Improve separate project (bamurtaugh)
308bdbb Rearrange readme folders (bamurtaugh)
b76d1c7 Delete duplicate folder (bamurtaugh)
05226f6 Remove unnecessary comments/code (bamurtaugh)
3aba245 Improve comments to explain code (bamurtaugh)
46403c0 Improve printing 20 rows (bamurtaugh)
6110d61 Improve printing 20 rows (bamurtaugh)
3e23889 Merge branch 'master' into newsamples-mlbatch (bamurtaugh)
8ff093a Update namespace (bamurtaugh)
77d7a76 Update file name (bamurtaugh)
99252e8 Add dataset section (bamurtaugh)
d79d3c7 Fix link (bamurtaugh)
b540a41 Add amazon data (bamurtaugh)
bf53a34 Update dataset explanation and links (bamurtaugh)
0a16ba4 Remove invalid link (bamurtaugh)
e0f8cc5 Improve model builder explanation (bamurtaugh)
70e649c Add more images model builder (bamurtaugh)
e7674e0 Revamp ML.NET explanation (bamurtaugh)
5b33662 Add disclaimer note (bamurtaugh)
efff9bd Improve ML.NET explanation (bamurtaugh)
2c21224 Improve Spark.NET explanation (bamurtaugh)
798804d Update spark code and commands (bamurtaugh)
dd6a227 Fix link (bamurtaugh)
6a4583a Add sentence explaining data contents (bamurtaugh)
36683a3 Add link to 101 video (bamurtaugh)
7998fda Merge branch 'master' into newsamples-mlbatch (bamurtaugh)
db654bc Return to IExample (bamurtaugh)
eacf199 Adjust class name (bamurtaugh)
3f361fc Remove sentiment csproj and solution (bamurtaugh)
25a31a0 Grammar (bamurtaugh)
0d79f89 Add model zip file (bamurtaugh)
d6c1045 Remove unnecessary nuget ref (bamurtaugh)
5541fd8 Remove outdated code comments (bamurtaugh)
25f0745 Display non-truncated data differently (bamurtaugh)
26d8365 Merge branch 'master' into newsamples-mlbatch (bamurtaugh)
a0ea3f1 Update spark-submit (bamurtaugh)
a42384d Replaces vars with concrete types (bamurtaugh)
535b071 Switch to better column names (bamurtaugh)
491fa26 Adjust file name and nuget usage (bamurtaugh)
8e91c8f Update Resources link in readme (bamurtaugh)
fef02e3 readme spacing (bamurtaugh)
9417459 readme spacing (bamurtaugh)
4112ac4 Remove extra space (bamurtaugh)
d923e3d Update spark sql code in readme (bamurtaugh)
674d614 Update/improve ML.NET code and explanation (bamurtaugh)
f033d1e Explain nuget (bamurtaugh)
13f4b12 Merge branch 'master' into newsamples-mlbatch (bamurtaugh)
50d65fc Update training/testing data (bamurtaugh)
d1a54a3 Update readme to just use yelp dataset (bamurtaugh)
d6c97e5 Update note format (bamurtaugh)
1df4a7d Update link (bamurtaugh)
ef8bce5 Fix submit command (bamurtaugh)
cfb25f8 Update readme code (bamurtaugh)
2e1735f Update paths (bamurtaugh)
91b1f5b Update spark-submit to reflect args change (bamurtaugh)
f74b373 Remove old commented code (bamurtaugh)
878db57 Fix indentation (bamurtaugh)
817f6fe Use var type (bamurtaugh)
3becbda Merge branch 'master' into newsamples-mlbatch (bamurtaugh)
0a3b8c5 Merge branch 'master' into newsamples-mlbatch (imback82)
99 changes: 99 additions & 0 deletions
examples/Microsoft.Spark.CSharp.Examples/MachineLearning/Sentiment/Program.cs
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,99 @@ | ||
// Licensed to the .NET Foundation under one or more agreements. | ||
// The .NET Foundation licenses this file to you under the MIT license. | ||
// See the LICENSE file in the project root for more information. | ||
|
||
using System; | ||
using System.Collections.Generic; | ||
using Microsoft.ML; | ||
using Microsoft.ML.Data; | ||
using Microsoft.Spark.Sql; | ||
|
||
namespace Microsoft.Spark.Examples.MachineLearning.Sentiment | ||
{ | ||
/// <summary> | ||
/// Example of using ML.NET + .NET for Apache Spark | ||
/// for sentiment analysis. | ||
/// </summary> | ||
internal sealed class Program : IExample | ||
{ | ||
public void Run(string[] args) | ||
{ | ||
if (args.Length != 2) | ||
{ | ||
Console.Error.WriteLine( | ||
"Usage: <path to yelptest.csv> <path to MLModel.zip>"); | ||
Environment.Exit(1); | ||
} | ||
|
||
SparkSession spark = SparkSession | ||
.Builder() | ||
.AppName(".NET for Apache Spark Sentiment Analysis") | ||
.GetOrCreate(); | ||
|
||
// Read in and display Yelp reviews | ||
DataFrame df = spark | ||
.Read() | ||
.Option("header", true) | ||
.Option("inferSchema", true) | ||
.Csv(args[0]); | ||
df.Show(); | ||
|
||
// Use ML.NET in a UDF to evaluate each review | ||
spark.Udf().Register<string, bool>( | ||
"MLudf", | ||
(text) => Sentiment(text, args[1])); | ||
|
||
// Use Spark SQL to call ML.NET UDF | ||
// Display results of sentiment analysis on reviews | ||
df.CreateOrReplaceTempView("Reviews"); | ||
DataFrame sqlDf = spark.Sql("SELECT ReviewText, MLudf(ReviewText) FROM Reviews"); | ||
sqlDf.Show(); | ||
|
||
// Print out first 20 rows of data | ||
// Prevent data getting cut off by setting truncate = 0 | ||
sqlDf.Show(20, 0, false); | ||
|
||
spark.Stop(); | ||
} | ||
|
||
// Method to call ML.NET code for sentiment analysis | ||
// Code primarily comes from ML.NET Model Builder | ||
public static bool Sentiment(string text, string modelPath) | ||
{ | ||
var mlContext = new MLContext(); | ||
|
||
ITransformer mlModel = mlContext | ||
.Model | ||
.Load(modelPath, out var modelInputSchema); | ||
|
||
PredictionEngine<Review, ReviewPrediction> predEngine = mlContext | ||
.Model | ||
.CreatePredictionEngine<Review, ReviewPrediction>(mlModel); | ||
|
||
ReviewPrediction result = predEngine.Predict( | ||
new Review { ReviewText = text }); | ||
|
||
// Returns true for positive review, false for negative | ||
return result.Prediction; | ||
} | ||
|
||
// Class to represent each review | ||
public class Review | ||
{ | ||
// Column name must match input file | ||
[LoadColumn(0)] | ||
public string ReviewText; | ||
} | ||
|
||
// Class resulting from ML.NET code including predictions about review | ||
public class ReviewPrediction : Review | ||
{ | ||
[ColumnName("PredictedLabel")] | ||
public bool Prediction { get; set; } | ||
|
||
public float Probability { get; set; } | ||
|
||
public float Score { get; set; } | ||
} | ||
} | ||
} |
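Review note on `Sentiment` above: as written, it loads MLModel.zip and builds a new `PredictionEngine` on every call, and the UDF calls it once per review. A minimal sketch of one way to reuse the engine is shown here. `CachedSentiment` and `Predict` are hypothetical names that are not part of this PR, and the per-thread cache is an assumption about how the UDF is invoked, not a measured improvement.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical helper (not part of this PR): caches the loaded model's
// prediction engine so it is not rebuilt for every review.
public static class CachedSentiment
{
    // PredictionEngine is not thread-safe, so each thread keeps its own copy.
    [ThreadStatic] private static PredictionEngine<Review, ReviewPrediction> t_engine;
    [ThreadStatic] private static string t_modelPath;

    public static bool Predict(string text, string modelPath)
    {
        // Load the model only on first use on this thread, or if the path changes.
        if (t_engine == null || t_modelPath != modelPath)
        {
            var mlContext = new MLContext();
            ITransformer model = mlContext.Model.Load(modelPath, out DataViewSchema _);
            t_engine = mlContext.Model
                .CreatePredictionEngine<Review, ReviewPrediction>(model);
            t_modelPath = modelPath;
        }

        // Returns true for a positive review, false for a negative one.
        return t_engine.Predict(new Review { ReviewText = text }).Prediction;
    }

    // Same shape as the Review class in Program.cs.
    public class Review
    {
        [LoadColumn(0)]
        public string ReviewText;
    }

    // Same shape as the ReviewPrediction class in Program.cs.
    public class ReviewPrediction : Review
    {
        [ColumnName("PredictedLabel")]
        public bool Prediction { get; set; }

        public float Probability { get; set; }

        public float Score { get; set; }
    }
}
```

If this approach were adopted, the UDF registration would call `CachedSentiment.Predict(text, args[1])` in place of `Sentiment(text, args[1])`; the rest of `Run` would stay the same.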
204 changes: 204 additions & 0 deletions
examples/Microsoft.Spark.CSharp.Examples/MachineLearning/Sentiment/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,204 @@ | ||
# Sentiment Analysis with Big Data | ||
|
||
In this sample, you'll see how to use [.NET for Apache Spark](https://dotnet.microsoft.com/apps/data/spark) | ||
and [ML.NET](https://dotnet.microsoft.com/apps/machinelearning-ai/ml-dotnet) to determine if | ||
statements are positive or negative, a task known as **sentiment analysis**. | ||
|
||
## Problem | ||
|
||
Our goal here is to determine if online reviews are positive or negative. We'll be using .NET for Apache Spark to read in a dataset of reviews and ML.NET to perform **binary classification** since categorizing reviews involves choosing one of two groups: positive or negative. You can read more about the problem through the [ML.NET documentation](https://docs.microsoft.com/en-us/dotnet/machine-learning/tutorials/sentiment-analysis). | ||
|
||
## Dataset | ||
|
||
We'll be using a set of **Yelp reviews** as the input data for this example. We've divided the set of reviews into two smaller datasets: [yelptrain.csv](./Resources/yelptrain.csv) for training the sentiment analysis model, and [yelptest.csv](./Resources/yelptest.csv) for testing in our Spark + ML app. | ||
|
||
For the specific ML training/predictions in this app (i.e. when using Model Builder), it helps to have a header row for the data, so we added headers to the Yelp training and testing datasets. **ReviewText** holds the review itself, and **Sentiment** holds either a 0 to indicate negative sentiment or a 1 to indicate positive sentiment. | ||
|
||
You can [download the original Yelp data](https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip) from the [UCI Sentiment Labelled Sentences Dataset](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences). | ||
|
||
## Solution | ||
|
||
We'll first train an ML model using ML.NET, and then we'll create a new application that uses both .NET for Apache Spark and ML.NET. | ||
|
||
> **Note:** All of the necessary files (trained ML model, .NET for Spark + ML.NET application code, training and testing datasets) have been included in this project. You can follow the steps below to understand how the project/files were developed, recreate them yourself, and then adapt the steps to future applications. | ||
|
||
## ML.NET | ||
|
||
[ML.NET](https://dot.net/ml) is an open source and cross-platform machine learning framework that allows .NET developers to easily integrate ML into .NET apps without any prior ML experience. | ||
|
||
### 1. Download Model Builder | ||
|
||
We'll use ML.NET to build and train a model through [Model Builder](https://dotnet.microsoft.com/apps/machinelearning-ai/ml-dotnet/model-builder), a Visual Studio extension that provides an easy-to-understand visual interface to build, train, and deploy machine learning models. Model Builder can be downloaded [here](https://marketplace.visualstudio.com/items?itemName=MLNET.07). | ||
|
||
 | ||
|
||
> **Note:** You can also develop a model without Model Builder. Model Builder just provides an easier way to develop a model that doesn't require prior ML experience. | ||
|
||
### 2. Build and Train Your Model | ||
|
||
Follow the [Model Builder Getting Started Guide](https://dotnet.microsoft.com/learn/machinelearning-ai/ml-dotnet-get-started-tutorial/intro) to train your model using the sentiment analysis scenario. | ||
|
||
Follow the steps to: | ||
|
||
* Create a new C# Console App | ||
* Pick the **Sentiment Analysis** scenario | ||
* Train using the **yelptrain.csv** dataset | ||
|
||
 | ||
|
||
### 3. Generate Model and Code | ||
|
||
In the last step of using Model Builder, you'll produce a zip file containing the trained ML.NET model. In the image below, you can see it's contained in **MLModel.zip.** | ||
|
||
 | ||
|
||
You'll also generate C# code you can use to consume your model in other .NET apps, like the Spark app we'll be creating. | ||
|
||
 | ||
|
||
### 4. Add ML.NET to .NET for Apache Spark App | ||
|
||
You have a few options to start creating a .NET for Apache Spark app that uses this ML.NET code and trained model. Make sure that in any app you develop, you've added the [Microsoft.ML NuGet Package](https://www.nuget.org/packages/Microsoft.ML). | ||
|
||
Depending upon the algorithm Model Builder chooses for your ML model, you may need to add an additional NuGet reference in [Microsoft.Spark.CSharp.Examples.csproj](../../Microsoft.Spark.CSharp.Examples.csproj). For instance, if you get an error message that Microsoft.ML.FastTree cannot be found when running your Spark app, you need to add that NuGet package to your csproj file: | ||
|
||
`<PackageReference Include="Microsoft.ML.FastTree" Version="1.3.1" />` | ||
|
||
#### Option 1: Add Projects | ||
|
||
One option is to use Model Builder's *Add Projects* feature, which will result in 3 projects: | ||
|
||
* Your original app (**myMLApp**) | ||
* A console app that allows you to build/train/test the model (**myMLAppML.ConsoleApp**) | ||
* A .NET Standard class library that contains model input/output and your trained model in a zip file (**myMLAppML.Model**) | ||
|
||
You would begin writing your .NET for Apache Spark code (and paste in the code generated from Model Builder, shown in step 3 above) in the original app **myMLApp.** | ||
|
||
 | ||
|
||
#### Option 2: Create a new console app (shown in this repo) | ||
|
||
Rather than working with the projects/files produced by Model Builder's Add Projects, you can create a new, separate C# console app. You just need to copy over your model's zip file to a directory your new console app can access. In this repo, a trained model **MLModel.zip** has already been included for you in the [Resources](./Resources) folder. | ||
|
||
As we create the logic for our Spark app, we'll paste in the code generated from Model Builder and include some other class definitions. | ||
|
||
## .NET for Spark | ||
|
||
Now that we've trained an ML.NET model for sentiment analysis, we can begin writing the .NET for Spark code that will read in our Yelp test data, pass each review to our ML.NET model, and predict whether reviews are positive or negative. | ||
|
||
### 1. Create a Spark Session | ||
|
||
In any Spark application, we need to establish a new SparkSession, which is the entry point to programming Spark with the Dataset and | ||
DataFrame API. | ||
|
||
```CSharp | ||
SparkSession spark = SparkSession | ||
.Builder() | ||
.AppName(".NET for Apache Spark Sentiment Analysis") | ||
.GetOrCreate(); | ||
``` | ||
|
||
### 2. Read Input File into a DataFrame | ||
|
||
We trained our model with the yelptrain.csv data, so let's see how well it performs on the yelptest.csv dataset. | ||
|
||
```CSharp | ||
DataFrame df = spark.Read().Csv(<Path to yelp testing data set>); | ||
``` | ||
|
||
If we want to specify some other aspects of our data, such as whether it has a header and how we want to deal with its schema, we can set some other options when reading in our data: | ||
|
||
```CSharp | ||
DataFrame df = spark | ||
.Read() | ||
.Option("header", true) | ||
.Option("inferSchema", true) | ||
.Csv(<Path to yelp testing data set>); | ||
``` | ||
|
||
### 3. Use a UDF to Access ML.NET | ||
|
||
We create a User Defined Function (UDF) that calls the *Sentiment* method on each Yelp review. | ||
|
||
```CSharp | ||
spark.Udf().Register<string, bool>("MLudf", (text) => Sentiment(text, <Path to MLModel.zip>)); | ||
``` | ||
|
||
*Sentiment* is where we'll call our ML.NET code that was generated from the final step of Model Builder. The initial code in *Sentiment* sets up the necessary context for ML.NET to perform its prediction: | ||
|
||
```CSharp | ||
MLContext mlContext = new MLContext(); | ||
|
||
ITransformer mlModel = mlContext | ||
.Model | ||
.Load(modelPath, out var modelInputSchema); | ||
|
||
PredictionEngine<Review, ReviewPrediction> predEngine = mlContext | ||
.Model | ||
.CreatePredictionEngine<Review, ReviewPrediction>(mlModel); | ||
``` | ||
|
||
|
||
You may notice the use of *Review* and *ReviewPrediction.* These are classes we define in our project to represent the review data we're evaluating: | ||
|
||
```CSharp | ||
public class Review | ||
{ | ||
// Represents the input review text | ||
[LoadColumn(0)] | ||
public string ReviewText; | ||
} | ||
``` | ||
|
||
```CSharp | ||
public class ReviewPrediction : Review | ||
{ | ||
[ColumnName("PredictedLabel")] | ||
public bool Prediction { get; set; } | ||
|
||
public float Probability { get; set; } | ||
|
||
public float Score { get; set; } | ||
} | ||
``` | ||
|
||
The latter part of *Sentiment* passes the review from **yelptest.csv** to the ML model and returns a prediction (either *true* for positive sentiment or *false* for negative): | ||
|
||
```CSharp | ||
ReviewPrediction result = predEngine.Predict(new Review { ReviewText = text }); | ||
return result.Prediction; | ||
``` | ||
|
||
### 4. Spark SQL and Running Your Code | ||
|
||
Now that you've read in your data and incorporated ML, use Spark SQL to call the UDF that will run sentiment analysis on each row of your DataFrame: | ||
|
||
```CSharp | ||
df.CreateOrReplaceTempView("Reviews"); | ||
DataFrame sqlDf = spark.Sql("SELECT ReviewText, MLudf(ReviewText) FROM Reviews"); | ||
``` | ||
|
||
Once you run your code, you'll be performing sentiment analysis with ML.NET and .NET for Spark! | ||
|
||
## Running Your App | ||
|
||
There are a few steps you'll need to follow to build and run your app: | ||
|
||
* Move to your app's root folder (i.e. *Sentiment*) | ||
* Clean and publish your app | ||
* Move to your app's `publish` folder | ||
* `spark-submit` your app from within the `publish` folder | ||
|
||
#### Windows Example: | ||
|
||
```powershell | ||
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local /path/to/microsoft-spark-<version>.jar Microsoft.Spark.CSharp.Examples.exe MachineLearning.Sentiment.Program /path/to/yelptest.csv /path/to/MLModel.zip | ||
``` | ||
|
||
> **Note:** Be sure to update the above command with the actual paths to your Microsoft Spark jar file, yelptest.csv, and MLModel.zip. yelptest.csv and MLModel.zip are included in this sample app in the [Resources folder](./Resources). | ||
|
||
## Next Steps | ||
|
||
Check out the [full coding example](./Program.cs). You can also view a live video explanation of this app and combining ML.NET + .NET for Spark in the [.NET for Apache Spark 101 video series](https://www.youtube.com/watch?v=i1AaZXzZsFY&list=PLdo4fOcmZ0oXklB5hhg1G1ZwOJTEjcQ5z&index=6&t=2s). | ||
|
||
Rather than performing batch processing (analyzing data that's already been stored), we can adapt our Spark + ML.NET app to instead perform real-time processing with structured streaming. | ||
|
||
Check out SentimentAnalysisStream in the [MachineLearning folder](../) to see the adapted version of the sentiment analysis program that will determine the sentiment of text live as it's typed into a terminal. |
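Review note on the Next Steps section: a rough sketch of what the batch-to-streaming change could look like is given here. It is not the repository's SentimentAnalysisStream code. `StreamingSketch`, its `Run` parameters, and the `Lines` view name are hypothetical, and the sketch assumes it is compiled into the same project and namespace as Program.cs so that `Program.Sentiment` is reachable, and that text arrives on a socket (for example from `nc -lk <port>`).

```csharp
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Streaming;

namespace Microsoft.Spark.Examples.MachineLearning.Sentiment
{
    // Hypothetical sketch; not the repository's SentimentAnalysisStream.
    internal static class StreamingSketch
    {
        public static void Run(string host, string port, string modelPath)
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("Streaming Sentiment Sketch")
                .GetOrCreate();

            // The socket source produces one row per line of text received,
            // with a single string column named "value".
            DataFrame lines = spark
                .ReadStream()
                .Format("socket")
                .Option("host", host)
                .Option("port", port)
                .Load();

            // Reuse the batch sample's ML.NET method as the UDF body.
            spark.Udf().Register<string, bool>(
                "MLudf",
                text => Program.Sentiment(text, modelPath));

            // Score each incoming line with Spark SQL, as in the batch app.
            lines.CreateOrReplaceTempView("Lines");
            DataFrame scored = spark.Sql("SELECT value, MLudf(value) FROM Lines");

            // Print each micro-batch of results to the console.
            StreamingQuery query = scored
                .WriteStream()
                .Format("console")
                .Start();

            query.AwaitTermination();
        }
    }
}
```

The main differences from the batch app are the `ReadStream()`/`WriteStream()` pair in place of `Read()`/`Show()` and the blocking `AwaitTermination()` call in place of `spark.Stop()`.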
Binary file added
BIN
+96.5 KB
examples/Microsoft.Spark.CSharp.Examples/MachineLearning/Sentiment/Resources/MLModel.zip
Binary file not shown.
this seems different from what we are referencing in our csproj?
Steve and I learned that depending on what algorithm is used in the model, you may have to add a different reference in your .csproj. Originally we needed `FastTree`, but once I changed to a new model that used a different algorithm, it no longer used `FastTree` and thus we could just reference `Microsoft.ML`. In case users train and use their own model and then run into issues, we wanted to add this explanation.