A Watermark Using Linguistic Features

What I Have Been Doing These Days

Before I begin, I know that there are some awkward points in my English. I usually use a language model to generate or refine my English, but I decided to write this whole page by myself. I am avoiding the LLM this time because what I am saying on this page rests on the idea that human writers’ authentic writing patterns should be protected from artificial ones. We need a plan to prevent someone from pretending to be a real author. To this end, I designed a watermark that uses linguistic patterns, encoded in binary slots, to guarantee that a specific author really did write a specific text.

To clarify this idea, imagine a situation where Alice sends Bob some messages. The messages consist of language expressions. As is common online these days, many hackers and automatic language models can ruin her messages. The simplest way to ruin the text is to subtly paraphrase all of it, so that Alice’s messages lose their authorial authority. In the end, the receiver, Bob, has to determine whether the text he received has lost its purity or not.

Something should help Bob identify the author. For me, ‘a fragile watermark using linguistic features’ came to mind. With this watermarking system, Alice and Bob can be protected from almost everyone who tries to ruin the messages, I guess.

Of course, I am just a dreamer who is not good at cryptology, so I know that this idea is only a kind of ‘dream.’ Although I know this limit, the only way to break through it is to write down what I am thinking. This text comes from my hesitation, my consideration, and finally my courage, so please give me constructive criticism, questions, or whatever else you have.

The Whole Idea. What is it?

For now, watermarking. As I said above, this is a (fragile) watermark using linguistic features. I made up one sentence, which is very weird, to examine whether a single sentence can contain one byte. I know that this sentence is not natural, but I first had to settle on an arbitrary sentence that might contain at least 8 bits. In the end, I failed to hide 8 bits in it: this sentence contains only 7 bits based on my linguistic features. Fortunately, my friend and I later found some sentences from a high school English test set by a Korean educational institution, and we succeeded in giving them 30 bits. Before looking at those thirty-bit sentences, we should first look at sentence (A). (A) contains only 7 bits, but it is important for explaining the whole idea.

(A) A man who is taught English spoken by her thinks that it is uncommon.

I first chose four features and assigned the value 0 or 1 to each option of a feature. Each feature thus becomes a ‘slot,’ and its two options yield 0 and 1, respectively. Here are the slot examples.

The bit stream has 4 slots, and each slot holds bit values in a binary system. The order of the slots can differ, so this order itself serves as a personal key. Inside each slot, however, the bits follow the order of the sentence. Here is an example:

‘A man who is taught English’ and ‘English spoken by her’ are both relative clauses. In sentence order, ‘A man who is taught English’ comes first and ‘English spoken by her’ comes next. The first relative clause, ‘A man who is taught English,’ has a relative pronoun, whereas the next one, ‘English spoken by her,’ does not. The two clauses therefore get the values 0 and 1, respectively. So the bit stream in this slot is ‘01,’ not ‘10,’ because the order within a slot must follow the order of the sentence.
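The rule for this one slot can be sketched in a few lines of Python. This is only an illustration: the clauses and their relative-pronoun flags are annotated by hand (no parser is involved), and the names `RELATIVE_CLAUSES` and `relative_clause_bits` are hypothetical.

```python
# Hand-annotated relative clauses of sentence (A), listed in sentence order.
# Each entry: (clause text, whether it has an overt relative pronoun).
RELATIVE_CLAUSES = [
    ("A man who is taught English", True),
    ("English spoken by her", False),
]


def relative_clause_bits(clauses):
    """Return one bit per clause, keeping the sentence order.

    A clause with a relative pronoun gives 0; a clause without one gives 1.
    """
    return "".join("0" if has_pronoun else "1" for _, has_pronoun in clauses)


print(relative_clause_bits(RELATIVE_CLAUSES))  # 01
```

Reversing the list would give ‘10,’ which is why the order inside a slot has to be fixed by the sentence itself.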

Likewise, for the ‘Polarity’ slot, we consider every tense phrase: if it has negation, we give it the value 1, and if not, the value 0. In this slot, ‘01’ is determined by the sentence order.

Therefore, the bit stream is ‘1101010,’ based on the personal key, that is, the slot order ‘Voice-Polarity-Relative Clause-Complementizer.’
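As a rough sketch, the stream can be assembled by concatenating the bits of each slot in key order. The Polarity and Relative Clause values (‘01’ each) come from the text. The Voice (‘11’) and Complementizer (‘0’) values are not stated; they are inferred here as the only split of ‘1101010’ that fits this key, so please treat them as assumptions. The names `SLOT_BITS`, `build_stream`, and `is_intact` are hypothetical.

```python
# Bits per slot for sentence (A), each already in sentence order.
SLOT_BITS = {
    "Voice": "11",            # assumption: inferred from the stream 1101010
    "Polarity": "01",         # stated in the text
    "Relative Clause": "01",  # stated in the text
    "Complementizer": "0",    # assumption: inferred from the stream 1101010
}

# The personal key is the order in which the slots are read.
KEY = ["Voice", "Polarity", "Relative Clause", "Complementizer"]


def build_stream(slot_bits, key):
    """Concatenate the bits of each slot, following the key order."""
    return "".join(slot_bits[name] for name in key)


def is_intact(expected_stream, received_slot_bits, key):
    """Check whether the bits read from a received text match the expected stream."""
    return build_stream(received_slot_bits, key) == expected_stream


stream = build_stream(SLOT_BITS, KEY)
print(stream)  # 1101010

# A paraphrase such as "English that is spoken by her" adds a relative
# pronoun to the second clause, so that bit changes from 1 to 0.
paraphrased = dict(SLOT_BITS)
paraphrased["Relative Clause"] = "00"
print(is_intact(stream, paraphrased, KEY))  # False
```

The last check shows the ‘fragile’ part of the idea: a small paraphrase changes one bit, and the stream no longer matches.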

The Future Scenarios

Then when would we use this watermark? These days we can mimic another person’s language patterns by using a large language model. Authorial authority is no longer protected. Until now, communication focused more on ‘what’ the message is, but now ‘who’ wrote the message matters more. If someone decides to cheat others by pretending to be a specific author, they would use a language model, and the result would look similar, as we would predict. Then who and what protects the genuine author? We need a whole new kind of watermarking. If we hide a bit stream behind the passage, a language model cannot reproduce the exact bit stream and mimic natural language at the same time, because the model would have to consider all 2^n cases and spontaneity together. Even if it tries, that is not a big problem, because the slot order, the personal key, can differ every time.
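To make the counting concrete, here is a small sketch using the 7-bit sentence (A). It assumes the per-slot values Voice ‘11,’ Polarity ‘01,’ Relative Clause ‘01,’ and Complementizer ‘0’ (the Voice and Complementizer values are inferred from the stream ‘1101010,’ not stated). It only counts possibilities; it does not measure how hard the guessing really is for a language model.

```python
from itertools import permutations

# Bits per slot for sentence (A); Voice and Complementizer are assumptions.
SLOT_BITS = {
    "Voice": "11",
    "Polarity": "01",
    "Relative Clause": "01",
    "Complementizer": "0",
}

# n bits in total give 2^n candidate bit streams.
n_bits = sum(len(bits) for bits in SLOT_BITS.values())
bit_patterns = 2 ** n_bits

# Every possible slot order is a possible personal key.
key_orders = list(permutations(SLOT_BITS))

# The same sentence read under each key gives these streams.
streams = {"".join(SLOT_BITS[name] for name in order) for order in key_orders}

print(n_bits, bit_patterns)  # 7 128
print(len(key_orders))       # 24
print(len(streams))          # 12
```

In this small example two slots happen to carry the same bits (‘01’), so swapping them does not change the stream. That is why 24 key orders give only 12 different streams.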

What My Friend and I Did

We succeeded in giving 30 bits to two sentences. I will write about this next time.

Contact: seogyeong@sogang.ac.kr
