• XSS.stack #1 – the first literary journal from forum users

Help please

I need to figure out how to minify a 13GB txt file and transform it into JSON.

Any thoughts?
I don't think a txt file can be minified without altering its content. You would probably remove irrelevant spaces, tabs, and newlines, or just trim each line and split the file into an array on newlines, then encode that array to JSON. But you have to watch memory usage, and you have to write the code.
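The trim-and-explode-to-array idea above can be sketched roughly like this. It streams line by line, so the 13GB file never has to fit in memory; the paths and the choice to skip blank lines are my assumptions, not from the thread:

```python
import json

def txt_to_json_array(input_path, output_path):
    # Trim each line and stream it out as one element of a JSON array,
    # so the whole file is never held in memory at once.
    with open(input_path, 'r', encoding='utf-8') as infile, \
         open(output_path, 'w', encoding='utf-8') as outfile:
        outfile.write('[')
        first = True
        for line in infile:
            trimmed = line.strip()
            if not trimmed:
                continue  # assumption: blank lines are irrelevant
            if not first:
                outfile.write(',')
            outfile.write(json.dumps(trimmed))  # json handles all escaping
            first = False
        outfile.write(']')
```

Using `json.dumps` per element keeps the escaping correct without building the array in memory.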
 
If you only want to remove the spaces, tabs (\t), newlines (\n) and so on,
you can use this:


Python:
import json

input_path = '13gb.txt'
output_path = 'output.json'

def clean_chunk(chunk):
    # Collapse newlines, tabs and runs of spaces into single spaces
    return ' '.join(chunk.replace('\n', ' ').replace('\t', ' ').split())

with open(input_path, 'r', encoding='utf-8') as infile, \
     open(output_path, 'w', encoding='utf-8') as outfile:
    outfile.write('{"data":"')  # start the JSON object
    first = True
    for line in infile:
        minified = clean_chunk(line)
        if not minified:
            continue  # skip blank lines
        # json.dumps handles quotes, backslashes and control characters;
        # strip its surrounding quotes to keep just the escaped text
        safe = json.dumps(minified)[1:-1]
        if not first:
            outfile.write(' ')  # separator so adjacent lines don't merge
        outfile.write(safe)
        first = False
    outfile.write('"}')  # end the JSON object
 
Since this reads line by line, it streams and memory usage stays low — as long as the file actually contains newlines. If it doesn't, a single "line" can be the whole 13GB, so you still have to optimize for that case.
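To guard against the no-newlines case, one option is to read fixed-size chunks instead of lines; here is a hedged sketch (the buffer size and the partial-token carry-over are my own choices, not from the thread):

```python
import json

def minify_stream(infile, outfile, chunk_size=1 << 20):
    # Whitespace-collapse `infile` into one JSON string in `outfile`,
    # reading fixed-size chunks so memory stays bounded even when the
    # file contains no newlines at all.
    outfile.write('{"data":"')
    tail = ''   # carries a token cut in half at a chunk boundary
    first = True
    for chunk in iter(lambda: infile.read(chunk_size), ''):
        tokens = (tail + chunk).split()
        # if the chunk didn't end on whitespace, hold back the last token
        tail = tokens.pop() if not chunk[-1].isspace() and tokens else ''
        for tok in tokens:
            if not first:
                outfile.write(' ')
            outfile.write(json.dumps(tok)[1:-1])  # JSON-escape the token
            first = False
    if tail:
        if not first:
            outfile.write(' ')
        outfile.write(json.dumps(tail)[1:-1])
    outfile.write('"}')
```

A tiny chunk size is fine for testing; in practice something like 1 MB keeps syscall overhead low while bounding memory.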
 
You could split the TXT file into multiple subfiles and then process them with pandas in batches; I think that's probably the easiest solution currently. You'll need to handle some edge cases depending on the file.
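A rough sketch of the batch idea, assuming the txt is actually delimited (the tab separator, paths and JSON Lines output are all assumptions): pandas can stream the file in chunks via `chunksize`, so the 13GB never loads at once.

```python
import pandas as pd

def txt_to_jsonl(input_path, output_path, sep='\t', chunk_size=100_000):
    # read_csv with chunksize yields batches of rows instead of
    # loading the whole file; assumes a `sep`-delimited layout.
    with pd.read_csv(input_path, sep=sep, chunksize=chunk_size) as reader, \
         open(output_path, 'w', encoding='utf-8') as outfile:
        for chunk in reader:
            # one JSON object per row (JSON Lines)
            text = chunk.to_json(orient='records', lines=True)
            outfile.write(text if text.endswith('\n') else text + '\n')
```

JSON Lines sidesteps the need to build one giant array, which is why I chose it over a single JSON document here.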
 

