Commit 1bac111

Merge branch 'develop'
2 parents: 70a0dba + 7fa979c

File tree: 278 files changed (+20897 additions, -1146 deletions)


.gitlab-ci.yml
Lines changed: 0 additions & 21 deletions

```diff
@@ -60,16 +60,6 @@ build-default:
       - flowman-dist/target/flowman-dist-*-bin.tar.gz
     expire_in: 5 days
 
-# List additional build variants (some of them will be built on pushes)
-build-hadoop2.6-spark2.3:
-  stage: build
-  script: 'mvn ${MAVEN_CLI_OPTS} clean package -Phadoop-2.6 -Pspark-2.3 -Ddockerfile.skip'
-  artifacts:
-    name: "flowman-dist-hadoop2.6-spark2.3"
-    paths:
-      - flowman-dist/target/flowman-dist-*-bin.tar.gz
-    expire_in: 5 days
-
 build-hadoop2.6-spark2.4:
   stage: build
   script: 'mvn ${MAVEN_CLI_OPTS} clean package -Phadoop-2.6 -Pspark-2.4 -Ddockerfile.skip'
@@ -133,17 +123,6 @@ build-hadoop3.2-spark3.1:
       - flowman-dist/target/flowman-dist-*-bin.tar.gz
     expire_in: 5 days
 
-build-cdh5.15:
-  stage: build
-  except:
-    - pushes
-  script: 'mvn ${MAVEN_CLI_OPTS} clean package -PCDH-5.15 -Ddockerfile.skip'
-  artifacts:
-    name: "flowman-dist-cdh5.15"
-    paths:
-      - flowman-dist/target/flowman-dist-*-bin.tar.gz
-    expire_in: 5 days
-
 build-cdh6.3:
   stage: build
   script: 'mvn ${MAVEN_CLI_OPTS} clean package -PCDH-6.3 -Ddockerfile.skip'
```

.travis.yml
Lines changed: 0 additions & 12 deletions

```diff
@@ -19,14 +19,6 @@ jobs:
       jdk: openjdk8
       script: mvn clean install
 
-    - name: Hadoop 2.6 with Spark 2.3
-      jdk: openjdk8
-      script: mvn clean install -Phadoop-2.6 -Pspark-2.3 -Ddockerfile.skip
-
-    - name: Hadoop 2.7 with Spark 2.3
-      jdk: openjdk8
-      script: mvn clean install -Phadoop-2.7 -Pspark-2.3 -Ddockerfile.skip
-
     - name: Hadoop 2.6 with Spark 2.4
       jdk: openjdk8
       script: mvn clean install -Phadoop-2.6 -Pspark-2.4
@@ -51,10 +43,6 @@ jobs:
       jdk: openjdk8
       script: mvn clean install -Phadoop-3.2 -Pspark-3.1
 
-    - name: CDH 5.15
-      jdk: openjdk8
-      script: mvn clean install -PCDH-5.15 -Ddockerfile.skip
-
     - name: CDH 6.3
       jdk: openjdk8
       script: mvn clean install -PCDH-6.3 -Ddockerfile.skip
```

BUILDING.md
Lines changed: 21 additions & 40 deletions

````diff
@@ -3,7 +3,18 @@
 The whole project is built using Maven. The build also includes a Docker image, which requires that Docker
 is installed on the build machine.
 
-## Build with Maven
+## Prerequisites
+
+You need the following tools installed on your machine:
+* JDK 1.8 or later. If you build a variant with Scala 2.11, you have to use JDK 1.8 (and not anything newer like
+  Java 11). This mainly affects builds with Spark 2.x
+* Apache Maven (install via package manager or download from https://maven.apache.org/download.cgi)
+* npm (install via package manager or download from https://www.npmjs.com/get-npm)
+* Windows users also need Hadoop winutils installed. Those can be retrieved from https://github.com/cdarlint/winutils.
+  See some additional details for building on Windows below.
+
+
+# Build with Maven
 
 Building Flowman with the default settings (i.e. Hadoop and Spark version) is as easy as
 
@@ -22,9 +33,11 @@ in a complex environment with Kerberos. You can find the `tar.gz` file in the di
 
 ## Build on Windows
 
-Although you can normally build Flowman on Windows, you will need the Hadoop WinUtils installed. You can download
-the binaries from https://github.com/steveloughran/winutils and install an appropriate version somewhere onto your
-machine. Do not forget to set the HADOOP_HOME environment variable to the installation directory of these utils!
+Although you can normally build Flowman on Windows, it is recommended to use Linux instead. Nevertheless, Windows
+is still supported to some extent, but requires some extra care. You will need the Hadoop WinUtils installed. You can
+download the binaries from https://github.com/cdarlint/winutils and install an appropriate version somewhere onto
+your machine. Do not forget to set the HADOOP_HOME or PATH environment variable to the installation directory of these
+utils!
 
 You should also configure git such that all files are checked out using "LF" endings instead of "CRLF", otherwise
 some unittests may fail and Docker images might not be useable. This can be done by setting the git configuration
@@ -46,24 +59,23 @@ the `master` branch really builds clean with all unittests passing on Linux.
 
 ## Build for Custom Spark / Hadoop Version
 
-Per default, Flowman will be built for fairly recent versions of Spark (2.4.5 as of this writing) and Hadoop (2.8.5).
+Per default, Flowman will be built for fairly recent versions of Spark (3.0.2 as of this writing) and Hadoop (3.2.0).
 But of course you can also build for a different version by either using a profile
 
 ```shell
-mvn install -Pspark2.3 -Phadoop2.7 -DskipTests
+mvn install -Pspark2.4 -Phadoop2.7 -DskipTests
 ```
 
 This will always select the latest bugfix version within the minor version. You can also specify versions explicitly
 as follows:
 
 ```shell
-mvn install -Dspark.version=2.2.1 -Dhadoop.version=2.7.3
+mvn install -Dspark.version=2.4.3 -Dhadoop.version=2.7.3
 ```
 
 Note that using profiles is the preferred way, as this guarantees that also dependencies are selected
 using the correct version. The following profiles are available:
 
-* spark-2.3
 * spark-2.4
 * spark-3.0
 * spark-3.1
@@ -73,37 +85,12 @@ using the correct version. The following profiles are available:
 * hadoop-2.9
 * hadoop-3.1
 * hadoop-3.2
-* CDH-5.15
 * CDH-6.3
 
 With these profiles it is easy to build Flowman to match your environment.
 
 ## Building for Open Source Hadoop and Spark
 
-### Spark 2.3 and Hadoop 2.6:
-
-```shell
-mvn clean install -Pspark-2.3 -Phadoop-2.6
-```
-
-### Spark 2.3 and Hadoop 2.7:
-
-```shell
-mvn clean install -Pspark-2.3 -Phadoop-2.7
-```
-
-### Spark 2.3 and Hadoop 2.8:
-
-```shell
-mvn clean install -Pspark-2.3 -Phadoop-2.8
-```
-
-### Spark 2.3 and Hadoop 2.9:
-
-```shell
-mvn clean install -Pspark-2.3 -Phadoop-2.9
-```
-
 ### Spark 2.4 and Hadoop 2.6:
 
 ```shell
@@ -148,13 +135,7 @@ mvn clean install -Pspark-3.1 -Phadoop-3.2
 
 ## Building for Cloudera
 
-The Maven project also contains preconfigured profiles for Cloudera.
-
-```shell
-mvn clean install -Pspark-2.3 -PCDH-5.15 -DskipTests
-```
-
-Or for Cloudera 6.3
+The Maven project also contains preconfigured profiles for Cloudera CDH 6.3.
 
 ```shell
 mvn clean install -Pspark-2.4 -PCDH-6.3 -DskipTests
````
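
The LF checkout requirement for Windows described above can be enforced per clone. The following is one way to do it (an illustration of the relevant `git config` keys, not necessarily the project's exact recommended settings); it sets up a scratch repository so the commands are self-contained:

```shell
# Illustration: configure a repository so text files are checked out
# with LF endings, avoiding the unittest/Docker issues mentioned above.
repo=$(mktemp -d)
git -C "$repo" init -q

# Convert CRLF to LF on commit, never convert LF to CRLF on checkout:
git -C "$repo" config core.autocrlf input
# Check out text files with LF line endings:
git -C "$repo" config core.eol lf

git -C "$repo" config core.eol   # prints: lf
```

In a real Flowman clone you would run the two `git config` commands inside the repository directory (or use `--global` to apply them to all repositories on the machine).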

CHANGELOG.md
Lines changed: 13 additions & 0 deletions

```diff
@@ -1,3 +1,16 @@
+# Version 0.17.0 - 2021-06-02
+
+* New Flowman Kernel and Flowman Studio application prototypes
+* New ParallelExecutor
+* Fix before/after dependencies in `count` target
+* Default build is now Spark 3.1 + Hadoop 3.2
+* Remove build profiles for Spark 2.3 and CDH 5.15
+* Add MS SQL Server plugin containing JDBC driver
+* Speed up file listing for `file` relations
+* Use Spark JobGroups
+* Better support running Flowman on Windows with appropriate batch scripts
+
+
 # Version 0.16.0 - 2021-04-23
 
 * Add logo to Flowman Shell
```

NOTICE
Lines changed: 6 additions & 0 deletions

```diff
@@ -66,6 +66,12 @@ MariaDB Java Client
 * HOMEPAGE:
   * https://mariadb.com
 
+MSSQL JDBC Client
+* LICENSE
+  * license/LICENSE-mssql-jdbc.txt
+* HOMEPAGE:
+  * https://github.com/Microsoft/mssql-jdbc
+
 Apache Derby
 * LICENSE
   * license/LICENSE-derby.txt (Apache 2.0 License)
```

build-release.sh
Lines changed: 2 additions & 7 deletions

```diff
@@ -15,15 +15,10 @@ build_profile() {
 
 build_profile hadoop-2.6 spark-2.3
 build_profile hadoop-2.6 spark-2.4
-build_profile hadoop-2.7 spark-2.3
 build_profile hadoop-2.7 spark-2.4
-build_profile hadoop-2.8 spark-2.3
-build_profile hadoop-2.8 spark-2.4
-build_profile hadoop-2.9 spark-2.3
-build_profile hadoop-2.9 spark-2.4
-build_profile hadoop-2.9 spark-3.0
-build_profile hadoop-3.1 spark-3.0
+build_profile hadoop-2.7 spark-3.0
 build_profile hadoop-3.2 spark-3.0
+build_profile hadoop-2.7 spark-3.1
 build_profile hadoop-3.2 spark-3.1
 build_profile CDH-5.15
 build_profile CDH-6.3
```
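
The `build_profile` helper itself is defined earlier in `build-release.sh` and is not part of this diff. As a rough sketch of how such a helper might turn its arguments into Maven profile flags (hypothetical, for illustration; the real function presumably invokes `mvn` and collects the resulting tarballs rather than echoing):

```shell
#!/usr/bin/env sh
# Hypothetical sketch of a build_profile helper: each argument becomes
# a Maven -P profile flag, and the assembled command is printed (dry run).
build_profile() {
    profiles=""
    for p in "$@"; do
        profiles="$profiles -P$p"
    done
    echo "mvn clean install$profiles -DskipTests"
}

build_profile hadoop-3.2 spark-3.1
```

Called as above, the sketch prints `mvn clean install -Phadoop-3.2 -Pspark-3.1 -DskipTests`, mirroring the profile combinations listed in the script.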

docker/pom.xml
Lines changed: 2 additions & 2 deletions

```diff
@@ -10,8 +10,8 @@
   <parent>
     <groupId>com.dimajix.flowman</groupId>
     <artifactId>flowman-root</artifactId>
-    <version>0.16.0</version>
-    <relativePath>..</relativePath>
+    <version>0.17.0</version>
+    <relativePath>../pom.xml</relativePath>
   </parent>
 
   <properties>
```

docs/building.md
Lines changed: 4 additions & 26 deletions

```diff
@@ -60,20 +60,19 @@ You might also want to skip unittests (the HBase plugin is currently failing und
 
 ### Build for Custom Spark / Hadoop Version
 
-Per default, Flowman will be built for fairly recent versions of Spark (2.4.5 as of this writing) and Hadoop (2.8.5).
+Per default, Flowman will be built for fairly recent versions of Spark (3.0.2 as of this writing) and Hadoop (3.2.0).
 But of course you can also build for a different version by either using a profile
 
-    mvn install -Pspark2.2 -Phadoop2.7 -DskipTests
+    mvn install -Pspark2.4 -Phadoop2.7 -DskipTests
 
 This will always select the latest bugfix version within the minor version. You can also specify versions explicitly
 as follows:
 
-    mvn install -Dspark.version=2.2.1 -Dhadoop.version=2.7.3
+    mvn install -Dspark.version=2.4.1 -Dhadoop.version=2.7.3
 
 Note that using profiles is the preferred way, as this guarantees that also dependencies are selected
 using the correct version. The following profiles are available:
 
-* spark-2.3
 * spark-2.4
 * spark-3.0
 * spark-3.1
@@ -83,29 +82,12 @@ using the correct version. The following profiles are available:
 * hadoop-2.9
 * hadoop-3.1
 * hadoop-3.2
-* CDH-5.15
 * CDH-6.3
 
 With these profiles it is easy to build Flowman to match your environment.
 
 ### Building for Open Source Hadoop and Spark
 
-Spark 2.3 and Hadoop 2.6:
-
-    mvn clean install -Pspark-2.3 -Phadoop-2.6
-
-Spark 2.3 and Hadoop 2.7:
-
-    mvn clean install -Pspark-2.3 -Phadoop-2.7
-
-Spark 2.3 and Hadoop 2.8:
-
-    mvn clean install -Pspark-2.3 -Phadoop-2.8
-
-Spark 2.3 and Hadoop 2.9:
-
-    mvn clean install -Pspark-2.3 -Phadoop-2.9
-
 Spark 2.4 and Hadoop 2.6:
 
     mvn clean install -Pspark-2.4 -Phadoop-2.6
@@ -137,11 +119,7 @@ Spark 3.1 and Hadoop 3.2
 
 ### Building for Cloudera
 
-The Maven project also contains preconfigured profiles for Cloudera.
-
-    mvn clean install -Pspark-2.3 -PCDH-5.15 -DskipTests
-
-Or for Cloudera 6.3
+The Maven project also contains preconfigured profiles for Cloudera CDH 6.3.
 
     mvn clean install -Pspark-2.4 -PCDH-6.3 -DskipTests
 
```
docs/config.md
Lines changed: 5 additions & 1 deletion

```diff
@@ -31,7 +31,11 @@ the existence of targets to decide if a rebuild is required.
 
 - `flowman.execution.executor.class` *(type: class)* *(default: `com.dimajix.flowman.execution.SimpleExecutor`)*
   Configure the executor to use. The default `SimpleExecutor` will process all targets in the correct order
-  sequentially.
+  sequentially. The alternative implementation `com.dimajix.flowman.execution.ParallelExecutor` will run multiple
+  targets in parallel (if they do not depend on each other).
+
+- `flowman.execution.executor.parallelism` *(type: int)* *(default: 4)*
+  The number of targets to be executed in parallel, when the `ParallelExecutor` is used.
 
 - `flowman.execution.scheduler.class` *(type: class)* *(default: `com.dimajix.flowman.execution.SimpleScheduler`)*
   Configure the scheduler to use. The default `SimpleScheduler` will sort all targets according to their dependency.
```
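
The two new executor settings documented above would typically be set together. As a sketch (assuming the usual Flowman convention of `key=value` entries in a `config:` section, e.g. in a namespace file such as `default-namespace.yml`; the parallelism value of 8 is an arbitrary example):

```yaml
config:
  - flowman.execution.executor.class=com.dimajix.flowman.execution.ParallelExecutor
  - flowman.execution.executor.parallelism=8
```

Targets that depend on each other are still executed in dependency order; only independent targets run concurrently.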

docs/spec/mapping/mock.md
Lines changed: 20 additions & 2 deletions

````diff
@@ -15,14 +15,32 @@ mappings:
 
 ```yaml
 mappings:
-  empty_mapping:
+  some_other_mapping:
     kind: mock
     mapping: some_mapping
     records:
      - [1,2,"some_string",""]
      - [2,null,"cat","black"]
 ```
 
+```yaml
+mappings:
+  some_mapping:
+    kind: mock
+    mapping: some_mapping
+    records:
+      - Campaign ID: DIR_36919
+        LineItemID ID: DIR_260390
+        SiteID ID: 23374
+        CreativeID ID: 292668
+        PlacementID ID: 108460
+      - Campaign ID: DIR_36919
+        LineItemID ID: DIR_260390
+        SiteID ID: 23374
+        CreativeID ID: 292668
+        PlacementID ID: 108460
+```
+
 ## Fields
 * `kind` **(mandatory)** *(type: string)*: `mock`
 
@@ -39,7 +57,7 @@ mappings:
 * `MEMORY_AND_DISK_SER`
 
 * `mapping` **(optional)** *(type: string)*:
-  Specifies the name of the mapping to be mocked. If no name is given, the a mapping with the same name will be
+  Specifies the name of the mapping to be mocked. If no name is given, then a mapping with the same name will be
   mocked. Note that this will only work when used as an override mapping in test cases, otherwise an infinite loop
   would be created by referencing to itself.
````
