Contributor

@artur-chopikian artur-chopikian commented Mar 6, 2025

PR Details

Memory allocations

Description

xml.NewEncoder uses bufio.NewWriter, which allocates 4096 bytes on every call (that is, for every cell with text in the xlsx, so you can imagine how much that adds up to).

const (
	defaultBufSize = 4096
)

func NewWriter(w io.Writer) *Writer {
	return NewWriterSize(w, defaultBufSize)
}

And xml.EscapeText renders new lines properly in the xlsx file.
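
To get a feel for the per-call cost described above, here is a minimal sketch that measures the average heap bytes allocated per call for both approaches. The helper names (escapeWithEncoder, escapeWithEscapeText, bytesPerCall) are hypothetical and not part of excelize, and the exact figures depend on the Go version, so none are quoted here.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"runtime"
)

// escapeWithEncoder mirrors the xml.NewEncoder approach: a fresh encoder,
// and with it a fresh bufio.Writer, is created on every call.
func escapeWithEncoder(s string) string {
	var buf bytes.Buffer
	enc := xml.NewEncoder(&buf)
	_ = enc.EncodeToken(xml.CharData(s))
	_ = enc.Flush()
	return buf.String()
}

// escapeWithEscapeText writes the escaped text directly into the buffer.
func escapeWithEscapeText(s string) string {
	var buf bytes.Buffer
	_ = xml.EscapeText(&buf, []byte(s))
	return buf.String()
}

// bytesPerCall returns the average number of heap bytes allocated per call
// of f, based on the cumulative TotalAlloc counter of the runtime.
func bytesPerCall(n int, f func()) uint64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	for i := 0; i < n; i++ {
		f()
	}
	runtime.ReadMemStats(&after)
	return (after.TotalAlloc - before.TotalAlloc) / uint64(n)
}

func main() {
	const n = 10000
	enc := bytesPerCall(n, func() { _ = escapeWithEncoder("hello") })
	esc := bytesPerCall(n, func() { _ = escapeWithEscapeText("hello") })
	fmt.Printf("xml.NewEncoder: ~%d B/call\n", enc)
	fmt.Printf("xml.EscapeText: ~%d B/call\n", esc)
}
```

The encoder figure should come out at no less than the 4096-byte default buffer size, since that buffer is allocated anew on each call.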

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@artur-chopikian
Contributor Author

artur-chopikian commented Mar 6, 2025

@xuri, please take a look at this. I hope we can roll back this change

The commit where this change was added: 9999221

codecov bot commented Mar 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.20%. Comparing base (aef20e2) to head (271c282).
Report is 5 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2100   +/-   ##
=======================================
  Coverage   99.20%   99.20%           
=======================================
  Files          32       32           
  Lines       30096    30102    +6     
=======================================
+ Hits        29858    29864    +6     
  Misses        158      158           
  Partials       80       80           
Flag Coverage Δ
unittests 99.20% <100.00%> (+<0.01%) ⬆️

@xuri xuri added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 7, 2025
Member

@xuri xuri left a comment

This change will cause the xml:space="preserve" attribute of the t element to go missing, so the \n new line will not work.

Before:

<c r="A1" s="1" t="inlineStr">
    <is>
        <t xml:space="preserve">text
</t>
    </is>
</c>

After this PR change:

<c r="A1" s="1" t="inlineStr">
    <is>
        <t>text&#xA;</t>
    </is>
</c>

For example:

package main

import (
	"fmt"

	"github.com/xuri/excelize/v2"
)

func main() {
	f := excelize.NewFile()
	defer func() {
		if err := f.Close(); err != nil {
			fmt.Println(err)
		}
	}()
	sw, err := f.NewStreamWriter("Sheet1")
	if err != nil {
		fmt.Println(err)
		return
	}
	styleID, err := f.NewStyle(&excelize.Style{
		Alignment: &excelize.Alignment{WrapText: true},
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	if err := sw.SetRow("A1", []interface{}{excelize.Cell{Value: "text\n", StyleID: styleID}}); err != nil {
		fmt.Println(err)
		return
	}
	if err := sw.Flush(); err != nil {
		fmt.Println(err)
		return
	}
	if err = f.SaveAs("Book1.xlsx"); err != nil {
		fmt.Println(err)
	}
}

This change results in no new line after the A1 cell value text:

text

After this PR change:

text

So, I don't think we need to roll back the change 9999221.
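
The difference between the two outputs above can be reproduced outside excelize with a small sketch. The function names are hypothetical; the only library calls used are xml.NewEncoder, EncodeToken, Flush and xml.EscapeText from encoding/xml.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// charDataViaEncoder encodes s as character data through an xml.Encoder.
func charDataViaEncoder(s string) string {
	var buf bytes.Buffer
	enc := xml.NewEncoder(&buf)
	_ = enc.EncodeToken(xml.CharData(s))
	_ = enc.Flush()
	return buf.String()
}

// charDataViaEscapeText escapes s with xml.EscapeText.
func charDataViaEscapeText(s string) string {
	var buf bytes.Buffer
	_ = xml.EscapeText(&buf, []byte(s))
	return buf.String()
}

func main() {
	fmt.Printf("%q\n", charDataViaEncoder("text\n"))    // "text\n"
	fmt.Printf("%q\n", charDataViaEscapeText("text\n")) // "text&#xA;"
}
```

The encoder path leaves the line feed as a literal character, while xml.EscapeText turns it into the &#xA; character reference.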

@artur-chopikian
Contributor Author

@xuri Thanks, I got it! Then I don't see any other way than to copy this small method and make it work as we expect.

@artur-chopikian
Contributor Author

artur-chopikian commented Mar 7, 2025

@xuri Or what if we do the whitespace check before escaping? Can you think of any problem this could cause?

// trimCellValue provides a function to set string type to cell.
func trimCellValue(value string, escape bool) (v string, ns xml.Attr) {
	if utf8.RuneCountInString(value) > TotalCellChars {
		value = string([]rune(value)[:TotalCellChars])
	}
	if value != "" {
		prefix, suffix := value[0], value[len(value)-1]
		for _, ascii := range []byte{9, 10, 13, 32} {
			if prefix == ascii || suffix == ascii {
				ns = xml.Attr{
					Name:  xml.Name{Space: NameSpaceXML, Local: "space"},
					Value: "preserve",
				}
				break
			}
		}

		if escape {
			var buf bytes.Buffer
			_ = xml.EscapeText(&buf, []byte(value))
			value = buf.String()
		}
	}
	v = bstrMarshal(value)
	return
}

And then we get this output:

<c r="A1" s="1" t="inlineStr">
    <is>
        <t xml:space="preserve">text&#xA;</t>
    </is>
</c>
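
Why the order matters can be shown with a minimal sketch of the whitespace check on its own. The helper needsSpacePreserve is hypothetical (it only mirrors the loop in trimCellValue above): run on the raw value it detects the trailing line feed, but run on the already escaped value it sees a trailing ';' and does not fire.

```go
package main

import "fmt"

// needsSpacePreserve reports whether value starts or ends with a tab,
// line feed, carriage return or space, i.e. the cases in which the
// xml:space="preserve" attribute would be set.
func needsSpacePreserve(value string) bool {
	if value == "" {
		return false
	}
	first, last := value[0], value[len(value)-1]
	for _, c := range []byte{'\t', '\n', '\r', ' '} {
		if first == c || last == c {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsSpacePreserve("text\n"))    // true: raw value ends with LF
	fmt.Println(needsSpacePreserve("text&#xA;")) // false: escaped value ends with ';'
	fmt.Println(needsSpacePreserve("text"))      // false
}
```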

@artur-chopikian artur-chopikian requested a review from xuri March 7, 2025 14:47
Member

@xuri xuri left a comment

Yeah, your latest change escapes \n in a different way:

Before:

<c r="A1" s="1" t="inlineStr">
    <is>
        <t xml:space="preserve">text
</t>
    </is>
</c>

After this PR change:

<c r="A1" s="1" t="inlineStr">
    <is>
        <t xml:space="preserve">text&#xA;</t>
    </is>
</c>

This change results in no new line after the A1 cell value text in Office 2007 on Windows, but it works in Office 2010 on Windows and in Excel for Mac.

@artur-chopikian
Contributor Author

artur-chopikian commented Mar 7, 2025

What about the other characters? I think we also have a problem with these symbols, because they will be replaced as follows:

\t -> &#x9;
\r -> &#xD;

@xuri
Member

xuri commented Mar 8, 2025

xml.EscapeText will not transform \t to &#x9;, so it should work in all versions of Excel.

The \r symbol cannot be used to add a new line in the cell, so it may not function correctly in all versions of Excel.

Therefore, I suggest maintaining the current trimCellValue code for better compatibility.

@artur-chopikian artur-chopikian requested a review from xuri March 10, 2025 21:10
Member

@xuri xuri left a comment

Thanks for your PR. Any benchmark data on the performance impact of using xml.EscapeText instead of xml.NewEncoder? Specifically, how much memory is saved, and what percentage of speed improvement can be expected? I don't recommend copying code from the standard library; if necessary, it would be better to submit a patch to improve the Go standard library directly.

@xuri xuri added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 14, 2025
@artur-chopikian
Contributor Author

artur-chopikian commented Mar 21, 2025

Hi! The file contains around 80k lines

github.com/xuri/excelize/v2 v2.8.1

         .          .    502:func trimCellValue(value string, escape bool) (v string, ns xml.Attr) {
         .          .    503:   if utf8.RuneCountInString(value) > TotalCellChars {
         .          .    504:           value = string([]rune(value)[:TotalCellChars])
         .          .    505:   }
         .          .    506:   if escape {
   56.50MB    56.50MB    507:           var buf bytes.Buffer
      14MB       49MB    508:           _ = xml.EscapeText(&buf, []byte(value))
         .     7.50MB    509:           value = buf.String()
         .          .    510:   }
         .          .    511:   if len(value) > 0 {
         .          .    512:           prefix, suffix := value[0], value[len(value)-1]
         .          .    513:           for _, ascii := range []byte{9, 10, 13, 32} {
         .          .    514:                   if prefix == ascii || suffix == ascii {

github.com/xuri/excelize/v2 v2.9.0

         .          .    509:func trimCellValue(value string, escape bool) (v string, ns xml.Attr) {
         .          .    510:   if utf8.RuneCountInString(value) > TotalCellChars {
         .          .    511:           value = string([]rune(value)[:TotalCellChars])
         .          .    512:   }
         .          .    513:   if escape {
      69MB       69MB    514:           var buf bytes.Buffer
         .     6.57GB    515:           enc := xml.NewEncoder(&buf)
      11MB       11MB    516:           _ = enc.EncodeToken(xml.CharData(value))
         .       55MB    517:           enc.Flush()
         .       15MB    518:           value = buf.String()
         .          .    519:   }
         .          .    520:   if len(value) > 0 {
         .          .    521:           prefix, suffix := value[0], value[len(value)-1]
         .          .    522:           for _, ascii := range []byte{9, 10, 13, 32} {
         .          .    523:                   if prefix == ascii || suffix == ascii {
         .          .    524:                           ns = xml.Attr{
         .          .    525:                                   Name:  xml.Name{Space: NameSpaceXML, Local: "space"},
         .          .    526:                                   Value: "preserve",
         .          .    527:                           }
         .          .    528:                           break
         .          .    529:                   }
         .          .    530:           }
         .          .    531:   }
         .     9.27MB    532:   v = bstrMarshal(value)
         .          .    533:   return
         .          .    534:}
         .          .    535:
         .          .    536:// setCellValue set cell data type and value for (inline) rich string cell or
         .          .    537:// formula cell.

Speed is the same because both use the same function under the hood, but xml.NewEncoder creates a 4096-byte buffer on every call, while xml.EscapeText writes into the initially empty buffer we pass in.

That is around 119 times more than before. I would say that is too much :)

I think we need to find the best solution here. Do you see extending the standard library as the best solution?

@artur-chopikian
Contributor Author

artur-chopikian commented Mar 23, 2025

As an alternative, it could be something like the following, but it needs a global variable. If that is OK, we can go with this as well. If we need concurrency safety, we can add a mutex.

package excelize

import (
	"bytes"
	"encoding/xml"
)

var xmlEncoder = newEncoder()

type encoder struct {
	*xml.Encoder

	buf bytes.Buffer
}

func newEncoder() *encoder {
	e := new(encoder)
	e.Encoder = xml.NewEncoder(&e.buf)
	return e
}

func (x *encoder) encode(str string) string {
	if str == "" {
		return ""
	}

	_ = x.EncodeToken(xml.CharData(str))
	_ = x.Flush()

	defer x.buf.Reset()

	return x.buf.String()
}
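
For the concurrency case mentioned above, one possible variant (a sketch only, swapping the single global plus mutex for a sync.Pool) keeps one reusable encoder per concurrent caller. The names pooledEncoder, encoderPool and escapeCharData are hypothetical and not part of excelize.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"sync"
)

// pooledEncoder pairs an xml.Encoder with the buffer it writes to, so the
// encoder's internal 4096-byte bufio buffer is reused across calls.
type pooledEncoder struct {
	enc *xml.Encoder
	buf bytes.Buffer
}

var encoderPool = sync.Pool{
	New: func() interface{} {
		p := new(pooledEncoder)
		p.enc = xml.NewEncoder(&p.buf)
		return p
	},
}

// escapeCharData escapes s as XML character data and is safe to call from
// multiple goroutines, because each call takes its own encoder from the pool.
func escapeCharData(s string) (string, error) {
	if s == "" {
		return "", nil
	}
	p := encoderPool.Get().(*pooledEncoder)
	p.buf.Reset()
	if err := p.enc.EncodeToken(xml.CharData(s)); err != nil {
		return "", err // a failed encoder is dropped, not returned to the pool
	}
	if err := p.enc.Flush(); err != nil {
		return "", err
	}
	out := p.buf.String() // String copies, so reusing the buffer later is safe
	encoderPool.Put(p)
	return out, nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			out, err := escapeCharData("a<b\n")
			fmt.Printf("%q %v\n", out, err)
		}()
	}
	wg.Wait()
}
```

Compared with a mutex around one shared encoder, the pool avoids serialising callers, at the cost of holding several 4096-byte buffers alive while they are in use.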

@xuri
Member

xuri commented Mar 24, 2025

Hi @artur-chopikian, I think the following change would be a better approach:

-var buf bytes.Buffer
-enc := xml.NewEncoder(&buf)
-_ = enc.EncodeToken(xml.CharData(value))
-enc.Flush()
-value = buf.String()
+var buf strings.Builder
+_ = xml.EscapeText(&buf, []byte(value))
+value = strings.ReplaceAll(buf.String(), "&#xA;", "\n")
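
Pulled out into a standalone function, the suggested change can be tried in isolation. This is a sketch under the assumption that only the escaping step is of interest; escapeCellText is a hypothetical name, and the rest of trimCellValue is left out.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// escapeCellText escapes value with xml.EscapeText and then restores literal
// line feeds, so that "\n" is written as a real newline instead of "&#xA;".
func escapeCellText(value string) string {
	var buf strings.Builder
	_ = xml.EscapeText(&buf, []byte(value))
	return strings.ReplaceAll(buf.String(), "&#xA;", "\n")
}

func main() {
	fmt.Printf("%q\n", escapeCellText("a<b & c\n"))
	// "a&lt;b &amp; c\n"
	fmt.Printf("%q\n", escapeCellText("literal &#xA; stays escaped"))
	// "literal &amp;#xA; stays escaped"
}
```

A literal "&#xA;" typed into a cell is not affected by the replacement, because its ampersand has already been escaped to &amp; by the time strings.ReplaceAll runs.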

@artur-chopikian artur-chopikian requested a review from xuri March 24, 2025 08:40
Member

@xuri xuri left a comment

Hi @artur-chopikian, please use strings.Builder instead of bytes.Buffer to get better performance.

@xuri xuri added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 24, 2025
@artur-chopikian artur-chopikian requested a review from xuri March 24, 2025 17:41
Member

@xuri xuri left a comment

LGTM, thanks for your contribution.

@xuri xuri merged commit 91d36cc into qax-os:master Mar 25, 2025
17 checks passed